Libraries Hackathon Puts Reproducibility to the Test

On Friday, March 22, the Libraries hosted “Reproducibility Hackathon 2024: A Day in Digital Humanities,” a day-long hackathon focused on replicating and augmenting published research. The hackathon was collaborative rather than competitive — each team worked with data and code produced from the same publication, exploring new ideas and analyses.

Science and Engineering Librarian Chasz Griego, who organized the event, has been interested in the reproducibility of research data and code since he first started at the Libraries as an open science postdoctoral associate two years ago, and has worked to develop opportunities to engage with CMU on the topic. Last summer, he piloted a course on using open science tools to prepare practice data for dissemination, and he is teaching it again this summer.

In a hackathon setting, actual data from research happening on campus made the outputs even more relevant.

“I thought it would be interesting to simulate a setting where a researcher puts work up for scrutiny, and I’ve thought of multiple ways to approach it,” he explained. “The hackathon allowed us to dive into real research produced at Carnegie Mellon, assessing it for reproducibility. It’s a new way for us to provide support to researchers, inviting them to put their work to the test and helping them prepare outputs in improved ways.”

Data and code for the hackathon were provided by Professor of English and Associate Department Head Christopher Warren, whose published article focused on the “Oxford Dictionary of National Biography,” a collection of over 60,000 biographies of important figures in British history. After accessing metadata from the resource, Warren used digital humanities methods to scrutinize, in a new way, the people, places, and professions across the dictionary’s 72 million words.

“Scholars like the historian Jo Guldi have started calling my article an ‘audit’ of the ODNB, and I actually really like that. The thought is that hugely influential research platforms encode all kinds of biases and assumptions, but those biases and assumptions are often hidden away in the bowels of the data infrastructure,” Warren explained. “The ODNB is fabulous, and enormous, but that doesn't mean we should simply be grateful that it exists at all. Big data doesn't mean beyond reproach. That's why we need periodic audits, as Guldi says, so researchers know what they're dealing with at any given time — what's changed and what still needs to change.”

Warren first crossed paths with Griego during an earlier Libraries workshop focused on Code Ocean, a cloud-based platform for creating, organizing, and sharing reproducible computational research environments. Code Ocean aligned with Warren’s goals for reproducible research, and the connection led to the partnership that brought the hackathon to life.

“It wasn't until Chasz invited me to participate in the Reproducibility Hackathon that I realized how critical it was that periodic audits have code that could survive the vicissitudes of dependencies and versioning,” Warren continued. “If the code won't run, you can't run the same audit in five or 10 years to assess what's changed. Reproducibility’s about holding institutions to account. Reproducibility is accountability.”

Participants ranged in discipline from computer science to modern languages, with varying degrees of knowledge about Python and data science. They were asked to download Warren’s data, run the code, and reproduce the figures to test how easily it could be done. Then, they considered improvements that could be made and provided reflections.

For Elizabeth Terveen, a first-year student in the School of Computer Science, the hackathon was a chance to work with like-minded researchers and learn more about digital humanities. “It was a great opportunity to connect with people who have a background in computer science, but are also interested in real-world, humanities-oriented approaches,” she said.

By participating, Terveen expanded her knowledge about the open science movement and reproducibility, becoming familiar with several tools that support open research. “This is valuable because it serves as a learning tool, fostering growth for younger researchers,” she added. “It also makes it easier for interdisciplinary work to get off the ground.”

In the coming months, findings and research outputs from the hackathon will be published collectively in an article on the F1000 hackathon channel. Outputs like data, code, and visualizations have already been shared to a public repository on Open Science Framework.

Griego aims to turn reproducibility hackathons into an annual event for the Libraries, pulling in more participants from across campus and raising awareness about the ways the Libraries can serve as an advocate and ally in a researcher’s reproducibility efforts.

“It will be great to incorporate other areas of research that are at the heart of CMU, like robotics, machine learning, and AI,” Griego said. “The potential of an interdisciplinary team led by the Libraries to magnify the findings of original research is incredible.”

For help exploring the reproducibility of your research, you can get in touch with specialists at the Libraries like Griego, Open Science Program Director and Librarian Melanie Gainey, and Research Data Services Librarian Alfredo González-Espinoza. To learn more about a variety of open science offerings, subscribe to the Open Science Newsletter.

by Sarah Bender, Communications Coordinator