Visualization
Our Nitrogenase Finder visualization in action!

Big Idea/Goal/What is this?/Why did we do this?

Nitrogenase Finder is a genetics tool designed to help find nitrogenase genes within a metagenome. Both the gene and the metagenome should be in FASTA format. This project was born out of a suggestion by Jean Huang, a microbial biology professor at Olin College. We pursued it so that we could learn about the internal workings of tools such as BLAST and Geneious that we had used for bioinformatics analysis.

Background Information

  • A gene is made out of nitrogenous bases that can be transcribed and then translated into proteins that perform bodily functions. Read more about genes here
  • The nitrogenase, or nitrogen-fixing, gene we are locating is the nifH gene. Although we know the DNA sequence of nitrogenase, the DNA sequence of a gene does not have to exactly match that of nitrogenase for the gene to be able to fix nitrogen. Read more about nif genes here
  • A contig is a set of overlapping DNA segments. Read more about contigs here
Getting Started

    1. Clone this branch
    2. Run nitrogenase_finder.py
    3. Run visualization.py
    4. If you would like to:

    *If the format of your nitrogenase and/or metagenome file differs from ours, either change the code to match your formatting or reformat your files.
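When checking whether your files match, it can help to see what a plain FASTA file looks like once it is read in. The following is a minimal sketch in Python 3, assuming standard FASTA (a `>` header line followed by one or more sequence lines). The function name `read_fasta` and the example headers are hypothetical and are not taken from nitrogenase_finder.py, which may parse its input differently.

```python
def read_fasta(lines):
    """Parse FASTA-formatted lines into a dict of {header: sequence}."""
    records = {}
    header = None
    chunks = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new record starts; store the one we were building, if any.
            if header is not None:
                records[header] = "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            # Sequence lines may be wrapped, so collect them piece by piece.
            chunks.append(line.upper())
    if header is not None:
        records[header] = "".join(chunks)
    return records


# Made-up example input, only to show the expected layout.
example = """>contig_1
ATGCGT
acgtaa
>contig_2
TTGACA
""".splitlines()

contigs = read_fasta(example)
```

To read a real file, the same function could be called as `read_fasta(open(path))`, since it only needs an iterable of lines.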

    For more information, take a look at our README!

    Project evolution/narrative

    We started with code from the GeneFinder project from earlier this semester; that program accurately determined regions of the Salmonella bacterium's DNA that code for proteins. Building off functions in that code, we were able to search DNA sequences for a nitrogen-fixing gene. From there, we implemented a memoized Levenshtein algorithm that could identify genes that don't exactly match the nifH nitrogenase we are searching against but that likely still fix nitrogen. Our nitrogenase finder code writes the important information to a pickle file: the start and end index of the gene in the open reading frame, the length of the gene, the percentage match with nitrogenase, and whether or not the open reading frame is a reverse complement. That file can then be accessed by our data visualization.
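The hand-off through a pickle file can be sketched roughly as follows in Python 3. This is an illustration only: the dictionary keys, the helper names `save_hits` and `load_hits`, the file name, and the numbers are all hypothetical placeholders, not the actual structure or results produced by nitrogenase_finder.py.

```python
import os
import pickle
import tempfile


def save_hits(hits, path):
    """Write a list of candidate-gene records to a pickle file."""
    with open(path, "wb") as f:
        pickle.dump(hits, f)


def load_hits(path):
    """Read the records back, as a visualization script might."""
    with open(path, "rb") as f:
        return pickle.load(f)


# Placeholder record with the kinds of fields described above.
hits = [
    {
        "start": 120,                 # start index within the ORF
        "end": 1011,                  # end index within the ORF
        "length": 891,                # gene length in bases
        "percent_match": 87.5,        # similarity to the reference gene
        "reverse_complement": False,  # whether the ORF is on the reverse strand
    },
]

path = os.path.join(tempfile.gettempdir(), "example_hits.pickle")
save_hits(hits, path)
restored = load_hits(path)
```

Pickle files should only be loaded from sources you trust, since unpickling can execute arbitrary code.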

    Project Proposal

    Implementation information

    This code takes a list of contigs from a microbial community and looks for genes that can fix nitrogen. Since a gene that fixes nitrogen doesn't have to be an exact match for the nitrogenase sequence, we used a Levenshtein algorithm to compute a percent match with nitrogenase. That percent match is written, along with the other information for every possible nitrogenase ORF, into a pickle file, which our data visualization code reads to produce a visualization.
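A memoized, recursive Levenshtein distance and a percent match derived from it could look roughly like this in Python 3. The function names are hypothetical, and the percent-match formula (one minus the distance divided by the longer length) is an assumption made for illustration; the project's own formula may differ.

```python
from functools import lru_cache


def levenshtein(a, b):
    """Edit distance between strings a and b, using memoized recursion."""

    @lru_cache(maxsize=None)
    def dist(i, j):
        # dist(i, j) is the distance between a[:i] and b[:j].
        if i == 0:
            return j  # insert the remaining j characters
        if j == 0:
            return i  # delete the remaining i characters
        cost = 0 if a[i - 1] == b[j - 1] else 1
        return min(
            dist(i - 1, j) + 1,         # deletion
            dist(i, j - 1) + 1,         # insertion
            dist(i - 1, j - 1) + cost,  # match or substitution
        )

    return dist(len(a), len(b))


def percent_match(candidate, reference):
    """Similarity as a percentage (assumed formula, see lead-in)."""
    longest = max(len(candidate), len(reference))
    if longest == 0:
        return 100.0  # two empty sequences are identical
    return 100.0 * (1 - levenshtein(candidate, reference) / longest)


score = percent_match("ATGC", "ATGA")
```

Memoization keeps the recursion from recomputing the same prefix pairs, but the recursion depth still grows with sequence length, so full-length genes may hit Python's recursion limit. An iterative table-based version avoids that.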

    Code Diagram

    Next Steps

    As of right now, nitrogenase_finder.py can take at least 6 hours to finish running on an entire list of contigs from a microbial community. Even when we use PyPy, it takes about 15 minutes to produce results. Most of this time is spent searching in strings. In the future, we may implement a variation of the Boyer-Moore algorithm to perform string matching in less time.
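One well-known variation of Boyer-Moore is the Boyer-Moore-Horspool algorithm, which uses only a bad-character shift table. The sketch below, in Python 3, is not part of the project; the function name `horspool_find_all` is hypothetical, and whether this variant would actually speed up the search on DNA's four-letter alphabet is untested here.

```python
def horspool_find_all(text, pattern):
    """Return the start index of every occurrence of pattern in text."""
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return []

    # For each character in the pattern except the last, record how far
    # its last occurrence is from the end of the pattern.
    shift = {ch: m - 1 - i for i, ch in enumerate(pattern[:-1])}

    matches = []
    pos = 0
    while pos <= n - m:
        if text[pos:pos + m] == pattern:
            matches.append(pos)
        # Slide the window based on the text character aligned with the
        # pattern's last position; unseen characters allow a full jump.
        pos += shift.get(text[pos + m - 1], m)
    return matches


positions = horspool_find_all("ATGAAATGATG", "ATG")
```

The shift table lets the search skip ahead by more than one position when the character under the end of the window cannot align with the pattern, which is where the savings over a naive scan come from.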

    Authors and Contributors

  • Rebecca Gettys, Olin College Class of 2018 (@rebeccagettys)
  • Liv Kelley, Olin College Class of 2019 (@livkelley)
  • Erica Lee, Olin College Class of 2019 (@ericasaywhat)
    Resources We Used

  • Recursive Levenshtein Distance Example - https://programmingpraxis.com/2014/09/12/levenshtein-distance/
  • Softdes professors and NINJAs
  • Jean Huang
  • Starter code for GeneFinder softdes project

    Modules

  • Pygame - http://pygame.org
  • Pickle and Sys - built into Python.
  • PyPy - See more about PyPy. Download here