Big Idea/Goal/What is this?/Why did we do this?
Nitrogenase Finder is a genetics tool designed to help find nitrogenase genes within a metagenome; both the gene and the metagenome should be in FASTA format. This project was born out of a suggestion by Jean Huang, a microbial biology professor at Olin College. We pursued it so that we could learn about the inner workings of tools such as BLAST and Geneious that we had used for bioinformatics analysis.
Background Information
Getting Started
- Clone this branch
- Run nitrogenase_finder.py
- Run visualization.py
- If you would like to:
  - run this code for metagenomes other than those provided by Jean Huang, edit the load_metagenome function in load.py
  - run this code to look for genes other than nitrogenase, edit the load_nitrogenase function in load.py to take in a .txt file that contains the proper gene sequence.*
*If the format of your nitrogenase and/or metagenome file is different from ours, either change the code to match your formatting or change your formatting to match ours.
For more information, take a look at our README!
Project evolution/narrative
We started with code from the GeneFinder project from earlier this semester; that program accurately determined regions of the Salmonella bacterium's DNA that code for proteins. Building on functions in that code, we were able to search DNA sequences for a nitrogen-fixing gene. From there, we implemented a memoized Levenshtein algorithm that could identify genes that don't exactly match the nifH nitrogenase sequence we are searching against but that likely still fix nitrogen. Our nitrogenase finder code writes the important information about each candidate to a pickle file: the start and end index of the gene in the open reading frame, the length of the gene, the percentage match with nitrogenase, and whether or not the open reading frame is a reverse complement. That file can then be read by our data visualization.
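The hand-off through the pickle file can be sketched as follows. This is a minimal illustration and not the project's actual code: the OrfHit record, its field names, and the save_hits and load_hits helpers are hypothetical, chosen only to mirror the fields listed above.

```python
import pickle
from typing import List, NamedTuple


class OrfHit(NamedTuple):
    """Hypothetical record for one candidate ORF (field names are assumptions)."""
    start: int                    # start index of the gene in the open reading frame
    end: int                      # end index of the gene in the open reading frame
    length: int                   # length of the gene
    percent_match: float          # percentage match with nitrogenase
    is_reverse_complement: bool   # True if the ORF is a reverse complement


def save_hits(hits: List[OrfHit], path: str) -> None:
    """Write the hits to a pickle file as plain tuples."""
    with open(path, "wb") as f:
        # Plain tuples keep the file readable even if OrfHit is renamed or moved.
        pickle.dump([tuple(hit) for hit in hits], f)


def load_hits(path: str) -> List[OrfHit]:
    """Read the hits back, as a visualization script might."""
    with open(path, "rb") as f:
        return [OrfHit(*row) for row in pickle.load(f)]
```

Storing plain tuples rather than class instances is a design choice made for this sketch: the finder and the visualization are separate scripts, and this way the reader does not need to import the writer's class to unpickle the file.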
Implementation information
This code takes a list of contigs from a microbial community and looks for genes that can fix nitrogen. Since a gene that fixes nitrogen doesn't have to be an exact match for the nitrogenase sequence, we used a Levenshtein algorithm to compute a percent match against nitrogenase for each candidate open reading frame (ORF). That percent match is passed, along with the other information about every possible nitrogenase ORF, into a pickle file that our data visualization code reads to produce a visualization.
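To make the percent match concrete, here is a minimal sketch of a memoized Levenshtein distance turned into a percentage. The function names are hypothetical, and normalizing by the length of the longer sequence is an assumption; the project may define the percentage differently.

```python
from functools import lru_cache


def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b, computed with memoized recursion."""

    @lru_cache(maxsize=None)
    def dist(i: int, j: int) -> int:
        # dist(i, j) is the edit distance between the suffixes a[i:] and b[j:].
        if i == len(a):
            return len(b) - j          # insert the rest of b
        if j == len(b):
            return len(a) - i          # delete the rest of a
        if a[i] == b[j]:
            return dist(i + 1, j + 1)  # characters agree, no edit needed
        return 1 + min(
            dist(i + 1, j),            # deletion
            dist(i, j + 1),            # insertion
            dist(i + 1, j + 1),        # substitution
        )

    return dist(0, 0)


def percent_match(candidate: str, reference: str) -> float:
    """Percent similarity, normalized by the longer sequence (an assumption)."""
    longest = max(len(candidate), len(reference))
    if longest == 0:
        return 100.0
    return 100.0 * (1 - levenshtein(candidate, reference) / longest)


print(percent_match("ATGC", "ATGG"))  # one substitution over four bases
```

Note that this recursive form goes roughly len(a) + len(b) calls deep, so gene-length inputs would exceed Python's default recursion limit; a bottom-up table computes the same distances without recursion.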
Next Steps
As of right now, nitrogenase_finder.py can take at least 6 hours to finish running on an entire list of contigs from a microbial community. Even when we use PyPy, it takes about 15 minutes to produce results. Most of this time is spent searching in strings. In the future, we may implement a variation of the Boyer-Moore algorithm to perform string matching in less time.
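As an illustration of what such a variation might look like, here is a minimal sketch of the Boyer-Moore-Horspool simplification, which keeps only the bad-character shift table. It is not part of the current code, and the function name horspool_find_all is hypothetical.

```python
from typing import List


def horspool_find_all(text: str, pattern: str) -> List[int]:
    """Return every index where pattern occurs in text (overlaps included)."""
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return []

    # For each character in the pattern except the last, record how far the
    # window can slide so that character lines up with its last occurrence.
    shift = {ch: m - 1 - i for i, ch in enumerate(pattern[:-1])}

    matches = []
    pos = 0
    while pos <= n - m:
        if text[pos:pos + m] == pattern:
            matches.append(pos)
        # Slide based on the text character under the window's last position;
        # characters absent from the table allow a full-length jump.
        pos += shift.get(text[pos + m - 1], m)
    return matches


print(horspool_find_all("ATGATGCATG", "ATG"))  # [0, 3, 7]
```

With only four letters in the DNA alphabet, full-length jumps are rare for short patterns, so any speedup would need to be measured on real contigs before committing to this approach.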