gobics.de

Göttingen
Bioinformatics

rasbhari:

optimizing spaced seeds for database searching, read mapping and alignment-free sequence comparison

Introduction

Many algorithms for sequence analysis use patterns or spaced seeds consisting of match and don’t-care positions, such that only characters at the match positions are considered when sub-words of the sequences are counted or compared. The performance of these approaches depends on the underlying patterns.
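As a minimal illustration of the idea above, the following Python sketch extracts the spaced words of a sequence under a binary pattern, where '1' is a match position and '0' is a don't-care position. The function name spaced_words is hypothetical and is not part of rasbhari.

```python
def spaced_words(sequence, pattern):
    """Return the spaced words of `sequence` under a binary `pattern`.

    '1' marks a match position, '0' marks a don't-care position.
    Only characters at match positions are kept in each word.
    """
    match_positions = [i for i, c in enumerate(pattern) if c == "1"]
    span = len(pattern)
    words = []
    # Slide a window of the pattern's length along the sequence.
    for start in range(len(sequence) - span + 1):
        window = sequence[start:start + span]
        words.append("".join(window[i] for i in match_positions))
    return words


print(spaced_words("ACGTACGA", "1101"))
# ['ACT', 'CGA', 'GTC', 'TAG', 'ACA']
```

Two windows that differ only at a don't-care position yield the same spaced word: under the pattern 1101, both ACGT and ACTT give ACT.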
rasbhari is a novel tool which generates optimized sets of patterns for database searching, read mapping and alignment-free sequence comparison. rasbhari uses an improved hill-climbing algorithm which produces patterns with slightly higher sensitivity than seeds calculated with other tools.
rasbhari is described in the paper listed under References (arXiv:1511.04001).
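This page states only that rasbhari uses an improved hill-climbing algorithm; it gives no details. The following Python sketch therefore shows plain hill climbing on a pattern set, not rasbhari's actual procedure. It assumes that the overlap complexity of a pattern set is the sum, over all pattern pairs (including each pattern with itself) and all relative shifts, of 2 raised to the number of coinciding match positions. This definition is common in the spaced-seed literature but should be treated as an assumption here. All function names are hypothetical.

```python
import random
from itertools import combinations_with_replacement


def overlap(p, q, shift):
    """Count match positions of p and q that coincide when q is shifted by `shift`."""
    return sum(
        1
        for i, c in enumerate(p)
        if c == "1" and 0 <= i - shift < len(q) and q[i - shift] == "1"
    )


def overlap_complexity(patterns):
    """Assumed definition: sum of 2**overlap over all pattern pairs and all shifts."""
    total = 0
    for p, q in combinations_with_replacement(patterns, 2):
        for shift in range(-(len(q) - 1), len(p)):
            total += 2 ** overlap(p, q, shift)
    return total


def mutate(pattern, rng):
    """Swap one interior match position with one interior don't-care position."""
    interior = range(1, len(pattern) - 1)
    ones = [i for i in interior if pattern[i] == "1"]
    zeros = [i for i in interior if pattern[i] == "0"]
    if not ones or not zeros:
        return pattern  # nothing to swap
    i, j = rng.choice(ones), rng.choice(zeros)
    chars = list(pattern)
    chars[i], chars[j] = "0", "1"
    return "".join(chars)


def hill_climb(patterns, steps, seed=0):
    """Repeatedly modify one pattern; keep the change only if the score improves."""
    rng = random.Random(seed)
    best = list(patterns)
    best_score = overlap_complexity(best)
    for _ in range(steps):
        candidate = list(best)
        k = rng.randrange(len(candidate))
        candidate[k] = mutate(candidate[k], rng)
        score = overlap_complexity(candidate)
        if score < best_score:
            best, best_score = candidate, score
    return best, best_score


start = ["11110001", "11110001", "11110001"]
best, score = hill_climb(start, steps=200, seed=1)
print(overlap_complexity(start), "->", score)
print(best)
```

Because a mutation only swaps a match position with a don't-care position, the length and weight of every pattern stay fixed, and the score can never get worse than that of the starting set.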

Related Approaches

Spaced-Words (Leimeister and Morgenstern, 2014) is an approach to fast alignment-free sequence comparison. The Spaced-Words approach calculates distances between pairs of sequences based on spaced-word frequencies. rasbhari is integrated in the Spaced-Words software to generate sets of patterns which reduce the variance of the number of word matches.



rasbhari can be downloaded here

Usage

To compile, change to the directory containing the archive and extract it.
Then compile with the following commands:
cd rasbhari
make

After compiling, the program can be run with:
./rasbhari [options]

Options (excerpt)

Some of the available options are listed below. For more information, see the README.

Options for the algorithms
--variance: Calculate the variance instead of Overlap Complexity.
--permut [int]: Number of iterations; in each one, a pattern is selected and modified in an attempt to improve the pattern set.
--sens: Activates the sensitivity calculation.

Options for the pattern
-m [int]: Number of patterns (default: m=10).
-d [int]: Number of don't-care positions.
-w [int]: Number of match positions, the pattern weight.
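To make the relation between these parameters concrete: a pattern with w match positions and d don't-care positions has length w + d. The Python sketch below builds a random set of such patterns. It assumes that the first and last positions are always match positions, which is consistent with the example output on this page but is not stated here as a rule. The helper random_pattern is hypothetical and is not part of rasbhari.

```python
import random


def random_pattern(weight, dont_care, rng):
    """Build a random pattern with `weight` ones and `dont_care` zeros.

    Assumption for this sketch: first and last positions are match positions.
    """
    if weight < 2:
        raise ValueError("weight must be at least 2 to fix both end positions")
    interior = ["1"] * (weight - 2) + ["0"] * dont_care
    rng.shuffle(interior)
    return "1" + "".join(interior) + "1"


rng = random.Random(42)
# m = 10 patterns, each with w = 4 match and d = 4 don't-care positions.
pattern_set = [random_pattern(4, 4, rng) for _ in range(10)]
for pattern in pattern_set:
    print(pattern, "length:", len(pattern), "weight:", pattern.count("1"))
```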

Options for the variance
-S [int]: Sequence length of a theoretical dataset.
-p [double]: Background probability for all 4 nucleotides (A,C,G,T)
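As background on how a sequence length and nucleotide probabilities enter such a model, the Python sketch below computes the expected number of spaced-word matches between two unrelated random sequences of equal length, counting all pairs of positions. This is only the expectation under a simple independence model; it is not the variance that rasbhari computes, which this page does not spell out. The function name and the dictionary of base probabilities are assumptions made for illustration.

```python
def expected_background_matches(seq_len, pattern, base_probs):
    """Expected number of spaced-word matches between two unrelated random sequences.

    Assumes independent, identically distributed characters in both sequences.
    """
    # Probability that two independent random characters are equal.
    q = sum(p * p for p in base_probs.values())
    weight = pattern.count("1")
    positions = seq_len - len(pattern) + 1  # word start positions per sequence
    if positions <= 0:
        return 0.0
    # Every pair of start positions matches with probability q**weight.
    return positions * positions * q ** weight


uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
print(expected_background_matches(11, "1101", uniform))  # 1.0
```

With uniform probabilities the per-position match probability is 0.25, so a pattern of weight 3 matches at a given pair of positions with probability 0.25^3; with 8 start positions in each sequence this gives 64 * 0.015625 = 1.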

An example invocation:
./rasbhari -m 10 -w 8 -d 6-15 --permut 25000

Warning!

The sensitivity calculation for a pattern set can require a very large amount of memory (> 40 GB RAM)!
Therefore, please save your important files first and use the sensitivity calculation with care.

In such cases, an exception of type
std::bad_array_new_length
can be thrown!

Output

The output of the program is a set of patterns, e.g.:
10110001
10100101
11001001
10011001
10000111
11100001
10101001
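A pattern set in this form is easy to post-process. The Python sketch below reads such output and reports the length and weight of each pattern. It assumes one binary pattern per line, as in the example above; the real program output may contain additional lines that this sketch does not handle. The function name summarize_patterns is hypothetical.

```python
def summarize_patterns(text):
    """Return (pattern, length, weight) for each non-empty line of `text`."""
    summary = []
    for line in text.splitlines():
        pattern = line.strip()
        if not pattern:
            continue
        if set(pattern) - {"0", "1"}:
            raise ValueError(f"not a binary pattern: {pattern!r}")
        summary.append((pattern, len(pattern), pattern.count("1")))
    return summary


example_output = """\
10110001
10100101
11001001
10011001
10000111
11100001
10101001
"""

for pattern, length, weight in summarize_patterns(example_output):
    print(pattern, "length:", length, "weight:", weight)
```

For the seven patterns listed above, every entry has length 8 and weight 4.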

Contact

For comments, or if you encounter any technical issues, please send an email to: lhahn(at)biologie.uni-goettingen.de

References

Scientific publications using rasbhari should cite:

L. Hahn, C.-A. Leimeister, R. Ounit, S. Lonardi, B. Morgenstern (2016)
rasbhari: optimizing spaced seeds for database searching, read mapping and alignment-free sequence comparison
arXiv:1511.04001v2 [q-bio.GN]