gobics.de

Göttingen
Bioinformatics

rasbhari:

optimizing spaced seeds for database searching, read mapping and alignment-free sequence comparison

Introduction

Many algorithms for sequence analysis use patterns or spaced seeds consisting of match and don’t-care positions, such that only characters at the match positions are considered when sub-words of the sequences are counted or compared. The performance of these approaches depends on the underlying patterns.
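To make the notion of match and don't-care positions concrete, the following minimal sketch extracts the spaced words of a sequence under one binary pattern, where '1' marks a match position and '0' a don't-care position. The function name spaced_words and the example sequence are illustrative assumptions, not part of rasbhari.

```python
def spaced_words(sequence, pattern):
    """Return the spaced words of `sequence` under a binary `pattern`.

    '1' marks a match position, '0' a don't-care position.
    (Hypothetical helper for illustration; not part of rasbhari.)
    """
    match_positions = [i for i, c in enumerate(pattern) if c == "1"]
    words = []
    # Slide the pattern over every window of the same length.
    for start in range(len(sequence) - len(pattern) + 1):
        window = sequence[start:start + len(pattern)]
        # Keep only the characters at the match positions.
        words.append("".join(window[i] for i in match_positions))
    return words


print(spaced_words("ACGTACGT", "1101"))
# ['ACT', 'CGA', 'GTC', 'TAG', 'ACT']
```

The pattern 1101 has weight 3 and one don't-care position, so every window of length 4 contributes a word of length 3.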
rasbhari is a novel tool which generates optimized sets of patterns for database searching, read mapping and alignment-free sequence comparison. rasbhari uses an improved hill-climbing algorithm which produces patterns with slightly higher sensitivity than seeds calculated with other tools.
rasbhari is described in this paper:
  • Hahn L., Leimeister C.-A., Ounit R., Lonardi S., Morgenstern B. (2016)
    rasbhari: Optimizing Spaced Seeds for Database Searching, Read Mapping and Alignment-Free Sequence Comparison.
    PLoS Comput Biol 12(10): e1005107
    doi:10.1371/journal.pcbi.1005107

Related Approaches

Spaced-Words (Leimeister et al., 2014) is an approach to fast alignment-free sequence comparison. It calculates distances between pairs of sequences based on spaced-word frequencies. rasbhari is integrated into the Spaced-Words software to generate sets of patterns which reduce the variance of the number of word matches.
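As a rough illustration of a distance based on spaced-word frequencies, the sketch below builds relative spaced-word frequencies for two sequences and compares them. The Euclidean distance and the averaging over several patterns are assumptions made for this example; they are not taken from the Spaced-Words source code, and all function names are hypothetical.

```python
from collections import Counter
from math import sqrt


def spaced_word_frequencies(sequence, pattern):
    """Relative frequencies of the spaced words of `sequence` under `pattern`."""
    match_positions = [i for i, c in enumerate(pattern) if c == "1"]
    counts = Counter()
    for start in range(len(sequence) - len(pattern) + 1):
        counts["".join(sequence[start + i] for i in match_positions)] += 1
    total = sum(counts.values())
    if total == 0:  # sequence shorter than the pattern
        return {}
    return {word: n / total for word, n in counts.items()}


def euclidean_distance(freq_a, freq_b):
    """Euclidean distance between two sparse frequency vectors."""
    words = set(freq_a) | set(freq_b)
    return sqrt(sum((freq_a.get(w, 0.0) - freq_b.get(w, 0.0)) ** 2 for w in words))


def spaced_word_distance(seq_a, seq_b, patterns):
    """Average the per-pattern distances (the averaging is an assumption)."""
    distances = [
        euclidean_distance(
            spaced_word_frequencies(seq_a, p), spaced_word_frequencies(seq_b, p)
        )
        for p in patterns
    ]
    return sum(distances) / len(distances)


patterns = ["1101", "1011"]
print(spaced_word_distance("ACGTACGTAC", "ACGTTCGTAC", patterns))
```

Identical sequences give a distance of 0, and the value grows as the spaced-word composition of the two sequences diverges.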



rasbhari can be downloaded here

Usage

To compile rasbhari, change to the directory containing the archive and extract it.
Then compile with the following commands:
cd rasbhari-[version]
make

After compiling, the program can be run with:
./rasbhari [options]

Options (Excerpt)

Some of the available options are listed below. For the complete list, please see the README.

Options for the algorithms
--variance: Calculate the variance instead of the overlap complexity.
--permut [int]: Select a pattern [int] times and try to modify it in order to improve the pattern set.

Options for the pattern
-m [int]: Number of patterns (default: m=10).
-d [int]: Number of don't-care positions.
-w [int]: Number of match positions, i.e. the pattern weight.

Options for the variance
-S [int]: Sequence length of a theoretical dataset.
-p [double]: Background probability for all 4 nucleotides (A, C, G, T).

An example invocation:
./rasbhari -m 10 -w 8 -d 6-15 --permut 25000

Output

The output of the program is a set of patterns, e.g.:
10110001
10100101
11001001
10011001
10000111
11100001
10101001
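Each output line is one pattern, with '1' for a match position and '0' for a don't-care position. The following small sketch computes the length, weight and number of don't-care positions of each pattern listed above; the helper name describe is an assumption made for illustration. All seven example patterns have length 8 and weight 4.

```python
def describe(pattern):
    """Return length, weight and don't-care count of a binary pattern string."""
    weight = pattern.count("1")
    return {
        "length": len(pattern),
        "weight": weight,
        "dont_care": len(pattern) - weight,
    }


# The example patterns from the output above, one per line.
output = """
10110001
10100101
11001001
10011001
10000111
11100001
10101001
"""

patterns = output.split()
for p in patterns:
    print(p, describe(p))
```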

Contact

For comments, or if you encounter any technical issues, please send an email to: lhahn(at)biologie.uni-goettingen.de

References

Scientific publications using rasbhari should cite:

  • Hahn L., Leimeister C.-A., Ounit R., Lonardi S., Morgenstern B. (2016)
    rasbhari: Optimizing Spaced Seeds for Database Searching, Read Mapping and Alignment-Free Sequence Comparison.
    PLoS Comput Biol 12(10): e1005107
    doi:10.1371/journal.pcbi.1005107