Generates BED file from random regions that best matches the CG contents of a reference region

This program generates "control" regions with similar CG contents. It also outputs the CG enrichement profile for a bunch of equal length genomic intervals (aka bed file), without actually generating the control regions.

The CG-enrichment is defined in windows (option -w) as

(#C + #G )  / (#C + #G + #A + #T + #N) - CG-background

where #N is used in the case the base is not assigned in the genome. The CG-background depends on the genome (mouse is 0.42).

Given a reference bed file with intervals of the same length and another bed file with intervals of the same length, it generates an output bed file from regions from the "control" bed file such that the CG content profile along the sequences enoded in the reference bed files match with those of the control. The output bed file has the same size as the reference one. Obviously the length of the "other" bed file should be larger.

The program randomly samples from the "other" bed file and proposes a CG-profile. It uses a MonteCarlo move to accept or reject the regions from the "other" bed file. It calculates the differences with the reference profile, and if their root mean square difference is below a threshold it outputs the candidate random regions and stops. If not, it continues until the maximum number of iterations is exausted.

It requires a genome file, i.e. a single fasta file that is compatible with the bed files. Issue cg_contents -h to get a help message, with all commands available.

Assuming the cg_content is in the path, a typical execution to simply generate a CG enrichment profile would be

cg_content -i genome.fa -b genomic_regions.bed -pf -cgref cg_enrichment_profile.dat

or to generate the random control regions with the same CG profile,

cg_content -i genome.fa -b reference.bed -br random_regions.bed -max_iter 5000000 -cutoff 0.00010 -bo rand_regions_fit_CG_.bed -v

How to compile

Clone this repository, and move to the root directory. Then simply issue

cmake . 
make
# or  make -j 4 to speed up

and you should get the compiled program. You would need a modern version of CMake (3.5 or above), and Seqan library (2.2 or above). If cmake can not find the dependencies by itself, either fix the CMakeLists.txt file or give it a little help, e.g.

cmake . -DCMAKE_C_COMPILER=/opt/local/bin/clang-mp-3.9 -DCMAKE_CXX_COMPILER=/opt/local/bin/clang++-mp-3.9 -DSEQAN_INCLUDE_PATH=/YOUR_PATH/seqan-library-2.2.0/include

Requirements

Compiler supporting at least C++14 (required to compile Seqan anyway)
CMake 3.5.0 or above
Seqan 2.2.0 (other versions untested, might work as well)

Name		Name	Last commit message	Last commit date
Latest commit History 6 Commits
.clang_complete		.clang_complete
.gitignore		.gitignore
CMakeLists.txt		CMakeLists.txt
README.md		README.md
cg_content.cpp		cg_content.cpp
commandline_parse.hpp		commandline_parse.hpp
compute_contents.hpp		compute_contents.hpp
utils_common.hpp		utils_common.hpp

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Generates BED file from random regions that best matches the CG contents of a reference region

How to compile

Requirements

About

Uh oh!

Releases

Packages

Languages

gportella/fit_randseq_cg_contents

Folders and files

Latest commit

History

Repository files navigation

Generates BED file from random regions that best matches the CG contents of a reference region

How to compile

Requirements

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages