Abstract

The Rhodes University Department of Biochemistry, Microbiology and Biotechnology has found need to identify the distinct species of bacteria given a large set of DNA sequences. However, without existing tools to solve this specific problem, they have found it takes an inordinate amount of time. Based on assumptions that they are able to make given, their specific problem, an approximate string matching algorithm can be applied to speed up this process and aid them in their research. By implementing this algorithm we will show that significant speedup was indeed attained.

 

Further Information

At the beginning of this year the bioinformaticians approached the computer science department with a research idea on the computer science side of bioinformatics, specifically to create a tool to aid them in their research of aquatic DNA samples. Their problem was that with their ever increasing sample size processing of their data using existing tools became incredible slow. I undertook the challenge to do further research in the area and create a tool to allow them to quickly solve their problem.

In their research they wished know the number of different species of bacteria and the number of each of the species present in a sample. With their existing tools however they had to do a complete, and expensive, sequence alignment calculation for the comparison of each pair of sequences, this adds up considerably when you realise that comparing 10,000 sequences with each other will require about 50 million comparisons, in fact it took nearly ten days to process a data sample. However, the entire alignment is not necessary in this case as all they are interested in is if the sequences are significantly similar or not. Thus a specific tool made for this purpose could take advantage of different algorithmes and branch pruning among other efficiencies.

 

 

Rhodes University Computer Science Department

Aquired this template form: