After a bit of a hiatus, I returned to the hackathon world this past weekend in what turned out to be a very interesting first trip to New York City. Freddie Vargus (a CS friend of mine from BU) and I went down for the competition, thinking that his background in quantitative finance and my background in computational biology (plus both of our backgrounds in data science) might break us out of the clichéd projects you usually find at these competitions. We were both right and wrong.
As we left the Empire State Building, we began to talk seriously about the idea, and surfed for data in the taxi. Luckily, we stumbled across some interesting data collected from cancer patients down in Texas. They had authorized the publication of their genomes, and doctors had sequenced both their cancerous cells and their healthy cells. We set out with a mission: build a neural net from genomic data that could predict pancreatic cancer based solely on genomic pattern recognition.
Given our quantitative backgrounds, and my background in bioinformatics, we dove in. We powered through a whole night to get the 11 GB of healthy and cancerous genomic data preprocessed. From there, with our programmatically isolated DNA sequences, we used Biopython (a really strong framework I'll be aiming to get more familiar with, and will be writing about soon) to do a pairwise comparison of the nucleotides involved (the 'ACTG' stuff from bio class). Since the genomes are millions of nucleotides long, we kept memory usage in check by processing sections of 102 nucleotides at a time and grouping them in this fashion. This let us look for identifiable patterns in the cancerous genome versus the normal genome, which would then let us train a neural net.
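Biopython handled the actual pairwise work for us, but the core windowing-and-counting idea is simple enough to sketch in plain Python. This is a rough illustration, not our actual code; the function name and details are my own:

```python
WINDOW = 102  # fixed window size we used to keep memory bounded

def window_matches(healthy, tumor, window=WINDOW):
    """Slide over both sequences in fixed-size windows and count,
    per window, how many nucleotide positions are identical.
    Trailing bases that don't fill a full window are dropped."""
    n = min(len(healthy), len(tumor))
    counts = []
    for start in range(0, n - n % window, window):
        h = healthy[start:start + window]
        t = tumor[start:start + window]
        counts.append(sum(a == b for a, b in zip(h, t)))
    return counts
```

A window score of 102 means that stretch is identical between the two genomes; lower scores flag regions where the cancerous sequence has diverged.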
This brought us to the final night. With a pattern recognized, we attempted to use Google's TensorFlow API to abstract away the nasty details of the machine learning implementation. However, as we pressed on, we were approached by a mentor from NYU who told us it would be completely impossible to do this in one night. Sure enough, we took a break, searched a bit more, and found a paper from Stanford describing a similar analysis, but conducted over a long period of time with 35,000 samples of DNA…needless to say, we were not going to pull this off in one night. We took a break for an hour, then came back to finish out the night by completing our data contextualization, and produced a really interesting figure to display our findings about genomic differences from cancer:
Here we can see a model of the DNA differences (the x-axis represents the nth 102-nucleotide sequence, and the y-axis represents the number of nucleotides that had not changed). This is a rudimentary prototype of a tool that could help researchers draw conclusions about cancer's effects on the genome, and how this change in turn propagates the disease.
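A figure like this is straightforward to reproduce with matplotlib once you have per-window match counts. Below is a hedged sketch (the function name, filename, and labels are my own, not from our actual code):

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.pyplot as plt

def plot_window_similarity(match_counts, window=102, out="similarity.png"):
    """Plot per-window similarity: x is the index of each fixed-length
    window, y is the count of nucleotides unchanged between the
    healthy and cancerous sequence in that window."""
    fig, ax = plt.subplots()
    ax.plot(range(len(match_counts)), match_counts)
    ax.set_xlabel(f"window index ({window} nt per window)")
    ax.set_ylabel("unchanged nucleotides")
    ax.set_title("Healthy vs. cancerous genome similarity by window")
    fig.savefig(out)
    plt.close(fig)
    return out
```

Dips in the resulting line mark windows where the cancerous genome diverges most from the healthy one.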
Even though the project wasn’t a complete success, we still broke through the hackathon mold to build something truly unique. All the while, it was a great opportunity to learn more about Python frameworks, as well as machine learning theory and its application in health. Regardless of the stigma, this is why I always recommend hackathons.
If anyone is interested in learning about TensorFlow, Biopython, or genomics/bioinformatics, I will be posting some links below which are awesome reads to give you some inspiration for your next side projects. I will also be posting a link to my GitHub where you can check out the source code. Looking forward to the next big update!
Google TensorFlow Documentation: