Bagels and Bioinformatics: NCBI Codeathon at CMU Libraries

Happy Friday, Datascapers! Now that we are a couple of weeks into the semester, how is everyone feeling? Are you staying hydrated? If you are like me, it's a struggle to remember to drink water throughout the busy day, and by the end of the day, you have a significant dehydration headache and feel like a dried-out sponge. Or maybe that's just me. Regardless, I love to use phone apps to help me remember to drink water. I personally use WaterMinder, but I recommend browsing your 'local' app store and finding one that works best for you!

This week on Tartan Datascapes, rather than featuring a single researcher, I'm ecstatic to be highlighting a group of researchers who came together at CMU Libraries from January 8th - January 10th to take part in a codeathon organized by Dr. Ben Busby, Genomics Outreach Coordinator at the National Center for Biotechnology Information (NCBI)! Over the course of three days, participants were assigned to teams, and each team was given a specific topic to explore in order to push the envelope of bioinformatics: infectious disease, cancer graphs, virus graphs, and clinical RNAseq pipelines.

These participants came from a variety of institutions, including Columbia University, the University of Pittsburgh, Johns Hopkins University, various companies in industry, and our very own CMU, and included master's students, Ph.D. students, postdocs, Ph.D.s, and one incredibly brave and ambitious undergraduate student! With computational support from the Pittsburgh Supercomputing Center, and with my colleagues Emma Slayton and Huajin Wang as co-hosts, on Wednesday morning the participants buckled in for three days of coding, creating workflows, and eating copious amounts of bagels.

Some readers may be familiar with codeathons where teams compete against each other while trying to solve a certain computational issue or challenge. That was not the case here! While each team was assigned a topic, information flowed freely across teams, which led to a collaborative and intellectually fruitful environment for all. Ben, with the help of all the participants, facilitated an environment where folks were encouraged to embrace the diversity of expertise in the room and seek to learn new things from each other.

Now, I can safely say that I have no background in or familiarity with bioinformatics and computational biology, so in many ways, I was there mainly for moral support and writing notes of encouragement on the dry-erase boards in IDeATe Studio A. However, I was happy to be able to facilitate a brief presentation on data management principles and library resources for the participants, which made my little Research Data Management Consultant heart happy. Thanks to Ben, researchers in the codeathon were encouraged to think about data management from day one - including the use of README files, GitHub, and collaborative communication tools like Slack to help organize their workflow. As a novice to bioinformatics, I was particularly excited to see README files used to convey a brief summary of each project for folks like me who may not be able to grasp its full complexity.

Stephen Price, CMU undergraduate student

Remember the really ambitious and brave undergraduate student I mentioned earlier? I am thrilled that I was able to chat with him more about his experience at the codeathon. Stephen Price, a junior studying Computational Biology here at CMU, attended the NCBI codeathon and found the experience incredibly rewarding. As President of CMU's Undergraduate Computational Biology Society, he's hoping that he can share this experience with the rest of the student body by developing and running a similar, NCBI-styled codeathon specifically for undergraduates this coming fall (email Stephen for more information!). Most importantly, he hopes the experience will "encourage other CMU students with an interest in the computational and biological sciences to gain an appreciation for the amazing opportunities and tools that exist in the world of computational biology."

I had a chance to speak in more depth with Stephen about his experience at this particular codeathon, and hearing it made me beam with pride at having been a part of this great experience:

When I entered the codeathon, I thought I'd be way out of my depth as I'd never developed a bioinformatics tool expected to undergo peer-reviewed publication before. After I joined my team, I was provided incredible mentorship, and learned about an incredible number of bioinformatics resources in a short period of time. My group was filled with graduate students and professionals passionate and knowledgeable about bioinformatics, and I was able to both contribute to our goal and learn from my teammates. After the codeathon, I'm excited about the prospect of building more bioinformatics tools, and can't wait to have the opportunity to participate in an NCBI codeathon again.

Our project was to build a command-line tool to allow researchers the ability to quickly and easily test the genetic differences of their case vs. control population (e.g. diabetics vs. non-diabetics) against a database of the genetic differences in humans with and without a viral infection. This allows researchers to test whether a particular condition may have shared characteristics to a previously thought unrelated viral infection. This tool lies on the cutting edge of computational immunology as there's a growing body of evidence suggesting links between viral infections and the chronic diseases of old age.
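To make Stephen's description a little more concrete for fellow bioinformatics novices, here is a minimal Python sketch of the general idea: score how much a case-vs-control gene list overlaps with each entry in a database of infection-associated gene lists. To be clear, this is not the team's actual tool. The function names, the use of Jaccard overlap as the score, and the placeholder gene and virus names are all my own assumptions for illustration, and a real analysis would call for proper statistics.

```python
def jaccard(a: set[str], b: set[str]) -> float:
    """Overlap between two gene sets: size of intersection / size of union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rank_viral_signatures(
    query: set[str], database: dict[str, set[str]]
) -> list[tuple[str, float]]:
    """Score the query gene set against every database entry, best match first.

    Hypothetical helper: Jaccard overlap stands in for whatever
    comparison a real tool would use.
    """
    scores = [(virus, jaccard(query, genes)) for virus, genes in database.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Placeholder data, invented for this example only (not real genes or viruses).
viral_db = {
    "virus_A": {"GENE1", "GENE2", "GENE3"},
    "virus_B": {"GENE4", "GENE5"},
}
case_vs_control = {"GENE2", "GENE3", "GENE6"}

ranking = rank_viral_signatures(case_vs_control, viral_db)
for virus, score in ranking:
    print(f"{virus}: {score:.2f}")
```

In this toy example, the case-vs-control list shares two of four total genes with virus_A and none with virus_B, so virus_A ranks first.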

We are so incredibly lucky to be here at CMU among vibrant researchers such as Stephen, and speaking for myself, being in a room with the brilliant, creative, and positive folks participating in the codeathon (as well as Ben) was incredibly inspiring. We will be striving to hold another codeathon in the future, so keep your eyes peeled for more information! And, if you'd like to dig a bit deeper into the work completed by each team during the codeathon, feel free to check out their public GitHub repositories (all accompanied by excellent README files!):

RNAseq Reporting Designed for Clinical Deliverability:

Patient 'omics data in OMOP format:

Solidifying the VirusGraphs Infrastructure for Deployment: 

Finding NeoEpitomes in Transcriptomes:

Genomic Responses to Infectious Disease: 

What are three takeaways from this researcher highlight?

1. README files can be incredibly useful for collaborative projects, especially when sharing the projects on a public platform such as GitHub. Those who may not have familiarity with the research area can still learn important, digestible information from README files, and they also help researchers within your subject area understand how to engage with the data associated with your project. Need help writing a good README for your project? Send me an email and let's chat!

2. Communicating across and within teams? Consider using Slack as a landing page for your messages. I was a "fly on the wall" in the Slack workspace used for the codeathon, and it was exciting to see how quickly Slack allowed folks to ask questions within and across teams, and receive help for computational issues as they arose. Many of us within CMU Libraries use Slack to help clean up our inboxes and organize our virtual communication. I fully recommend giving it a try within your collaborative environments!

3. Codeathons are an excellent way to hone your skills in problem solving, collaboration, and discipline-specific data analysis, and to broaden your data science capabilities. This NCBI codeathon was rooted in an environment of support, camaraderie, and information exchange, with the goal of both pushing the envelope of bioinformatics and helping participants expand their computational skillsets!

Important Happenings in Research Data Management at CMU Libraries:

We have a great lineup of workshops coming up at CMU Libraries (click here to see our full list of workshops for the semester), many of which can help you learn new tips and tricks for data collection, analysis, and management. Here are a few that have a particular Tartan Datascapes flavor:

Writing an Effective Data Management Plan (taught by yours truly!), Monday, January 27th from 6:00 pm - 7:00 pm in the Sorrells Library Den (click here to register!)

Data Visualization Basics, Thursday, February 6th from 12:30 pm - 1:30 pm in the Sorrells Library Den (click here to register!)

dSHARP Gerrymandering Series: Network Analysis, Monday, February 10th from 12:00 pm - 1:00 pm in the Sorrells Library Den (click here to register!)

Data Management for Social Sciences, Monday, February 10th from 6:00 pm - 7:00 pm in the Sorrells Library Den (click here to register!)

And of course, please email me if you'd like some help on your journey as a researcher/scholar/awesome human being here at CMU. Remember, we all use data, regardless of our discipline. If you think something might be data, you are likely correct, and I can help you develop good habits for managing it! If you'd like to have your research data featured on Tartan Datascapes, please fill out this Google Form to get in touch!