Nitrogenase Finder by rebeccagettys

Visualization — Our Nitrogenase Finder visualization in action!

Big Idea/Goal/What is this?/Why did we do this?

Nitrogenase Finder is a genetics tool designed to help find nitrogenase genes within a metagenome; both the gene and metagenome should be in FASTA format. This project grew out of a suggestion by Jean Huang, a microbial biology professor at Olin College. We pursued this project so that we could learn about the internal workings of tools such as BLAST and Geneious that we used for bioinformatics analysis.

Background Information

A gene is a sequence of nitrogenous bases that can be transcribed and then translated into a protein that performs a function in the organism. Read more about genes here

The nitrogenase (nitrogen-fixing) gene we are locating is the nifH gene. Although we know the DNA sequence of nitrogenase, a gene's DNA sequence does not have to match it exactly for the gene to be able to fix nitrogen. Read more about nif genes here

A contig is a set of overlapping DNA segments. Read more about contigs here

Getting Started

Clone this branch
Run nitrogenase_finder.py
Run visualization.py
If you would like to:

run this code for metagenomes other than those provided by Jean Huang, edit the load_metagenome function in load.py*
run this code to look for genes other than nitrogenase, edit the load_nitrogenase function in load.py to take in a .txt file that contains the proper gene sequence.*

*If the format of your nitrogenase and/or metagenome file is different from ours, either change the code to match your formatting or change your formatting to match ours.
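If you need to adapt the loaders to your own files, the sketch below shows one way to read FASTA-formatted text into (header, sequence) pairs. It is a minimal illustration, not the code in load.py: the function name parse_fasta and the example contig names are assumptions made for this example.

```python
def parse_fasta(lines):
    """Parse FASTA-formatted lines into a list of (header, sequence) pairs.

    Hypothetical helper for illustration; load.py may be organized differently.
    """
    records = []
    header = None
    chunks = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:]
            chunks = []
        elif header is not None:
            # Sequence data may be wrapped across several lines.
            chunks.append(line.upper())
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Made-up example input, not real metagenome data.
example = """>contig_1
ATGCGT
ACGT
>contig_2
ttgaca
""".splitlines()

contigs = parse_fasta(example)
```

To read from disk, pass an open file object instead of the list of lines, since iterating over a file also yields one line at a time.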

For more information, take a look at our README!

Project evolution/narrative

We started with code from the GeneFinder project from earlier this semester; that program accurately determined regions of the Salmonella bacterium's DNA that code for proteins. Building off functions in that code, we were able to search DNA sequences for a nitrogen-fixing gene. From there, we implemented a memoized Levenshtein algorithm that could identify genes that don't exactly match the nifH nitrogenase we are searching against but that likely still fix nitrogen. For each candidate, our nitrogenase finder code writes important information to a pickle file: the start and end index of the gene in the open reading frame, the length of the gene, its percentage match with nitrogenase, and whether or not the open reading frame is a reverse complement. That file can then be accessed by our data visualization.
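To make the idea of a memoized Levenshtein distance concrete, here is a minimal sketch. It is not the project's actual implementation: the function names and the percent-match formula (distance divided by the longer sequence's length) are assumptions chosen for illustration.

```python
from functools import lru_cache


def levenshtein(a, b):
    """Edit distance between strings a and b, using memoized recursion."""

    @lru_cache(maxsize=None)
    def dist(i, j):
        # dist(i, j) is the edit distance between a[:i] and b[:j].
        if i == 0:
            return j  # insert the remaining j characters
        if j == 0:
            return i  # delete the remaining i characters
        cost = 0 if a[i - 1] == b[j - 1] else 1
        return min(
            dist(i - 1, j) + 1,         # deletion
            dist(i, j - 1) + 1,         # insertion
            dist(i - 1, j - 1) + cost,  # match or substitution
        )

    return dist(len(a), len(b))


def percent_match(candidate, reference):
    """Assumed scoring: 100 * (1 - distance / length of the longer string)."""
    longest = max(len(candidate), len(reference))
    if longest == 0:
        return 100.0
    return 100.0 * (1 - levenshtein(candidate, reference) / longest)


score = percent_match("ATGA", "ATGC")  # one substitution over four bases
```

The memoization means each (i, j) pair is computed once, so the work grows with the product of the two lengths rather than exponentially. Because this version is recursive, gene-length inputs can exceed Python's default recursion limit; an iterative table would avoid that.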

Project Proposal

Implementation information

This code takes a list of contigs from a microbial community and looks for genes that can fix nitrogen. Since a gene that fixes nitrogen doesn't have to be an exact match for the nitrogenase sequence, we used a Levenshtein algorithm to compute a percent match with nitrogenase. That percent match is written, along with the other information for every possible nitrogenase ORF, to a pickle file that our data visualization code reads to produce a visualization.
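The hand-off between the finder and the visualization can be sketched with the standard pickle module. The record layout, field names, file name, and values below are placeholders invented for this example; the real code may store its results differently.

```python
import os
import pickle
import tempfile


def save_matches(matches, path):
    """Write the list of candidate-ORF records to a pickle file."""
    with open(path, "wb") as handle:
        pickle.dump(matches, handle)


def load_matches(path):
    """Read the records back, as the visualization side would."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


# Placeholder records with hypothetical field names and made-up values.
matches = [
    {"start": 12, "end": 912, "length": 900,
     "percent_match": 87.5, "reverse_complement": False},
    {"start": 40, "end": 910, "length": 870,
     "percent_match": 61.0, "reverse_complement": True},
]

# Round-trip through a temporary file so the example leaves nothing behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    pickle_path = os.path.join(tmp_dir, "matches.pickle")
    save_matches(matches, pickle_path)
    restored = load_matches(pickle_path)
```

Pickle files should only be loaded from sources you trust, since unpickling can execute arbitrary code.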

Next Steps

Right now, nitrogenase_finder.py takes at least 6 hours to process an entire list of contigs from a microbial community. Even when we use PyPy, it takes about 15 minutes to produce results. Most of this time is spent searching in strings. In the future, we may implement a variation of the Boyer-Moore algorithm to perform string matching in less time.
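As one possible starting point, the sketch below implements Boyer-Moore-Horspool, a simplified variant of Boyer-Moore that uses only a bad-character shift table. It is an illustration of the technique rather than code from this project; the function name is hypothetical, and we have not measured whether it would actually speed up nitrogenase_finder.py.

```python
def horspool_find_all(text, pattern):
    """Return every start index where pattern occurs in text.

    Boyer-Moore-Horspool: after each alignment, shift the pattern based on
    the text character under the pattern's last position.
    """
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return []

    # For each character in pattern[:-1], the distance from its rightmost
    # occurrence to the end of the pattern. Later indices overwrite earlier
    # ones, so the rightmost occurrence wins.
    shift = {ch: m - 1 - i for i, ch in enumerate(pattern[:-1])}

    hits = []
    pos = 0
    while pos <= n - m:
        if text[pos:pos + m] == pattern:
            hits.append(pos)
        # Characters absent from the table allow a full-length shift.
        pos += shift.get(text[pos + m - 1], m)
    return hits


positions = horspool_find_all("GATTACAGATTACA", "ATTA")
```

With only four letters in the DNA alphabet, full-length shifts are uncommon, so any gain over a plain scan would need to be checked on real contigs.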

Authors and Contributors

Rebecca Gettys, Olin College Class of 2018 (@rebeccagettys)

Liv Kelley, Olin College Class of 2019 (@livkelley)

Erica Lee, Olin College Class of 2019 (@ericasaywhat)

Resources We Used

Recursive Levenshtein Distance Example - https://programmingpraxis.com/2014/09/12/levenshtein-distance/

Softdes professors and NINJAs

Jean Huang

Starter code for GeneFinder softdes project

Modules

Pygame - http://pygame.org

pickle and sys - built into Python.

PyPy - See more about PyPy. Download here