KiltHub Hosts Groundbreaking BOLD5000 Dataset
A groundbreaking new dataset of functional MRI brain scans that will significantly impact researchers’ ability to apply machine learning techniques to understanding how the brain processes information can be found in the University Libraries’ KiltHub repository. Collected over multiple sessions, the project gathered over 20 hours of MRI data from each of four subjects. Named BOLD5000, for the 5000 images that participants viewed over the course of their sessions, the unprecedented collection is one of the most downloaded items in KiltHub, with over 2,700 downloads.
KiltHub is one of two repositories that host the dataset and support downloads from the BOLD5000 website; the other is OpenNeuro, a discipline-specific repository. When it came to uploading the dataset to the cloud, the large and complex files required special attention. Ana Van Gulick, Librarian and Program Director, Open Science, advised the interdisciplinary project team on how to make the dataset publicly available and facilitated support for the data deposit between the researchers, the KiltHub team, and figshare, the platform that powers KiltHub.
“Since the dataset is intended for reuse by both neuroscientists and computer scientists, it was important to provide the documentation in ways that would be useful to both communities,” Van Gulick said. “This meant putting the dataset in multiple repositories and in multiple formats, including raw data and pre-processed data. I provided recommendations on issues such as data structure, versioning, and licensing.”
Hosting BOLD5000 in the Libraries’ repository demonstrates the collaboration and support that the Libraries’ research liaisons and data services team can bring to CMU researchers who adopt open science practices. KiltHub, which includes dataset citations with a DOI and metrics on views and downloads, among other features, is available free of charge to CMU scholars for disseminating the products of their research, making it a critical element of the university’s research infrastructure. And CMU’s strategic development partnership with figshare, via its parent company Digital Science, means that Libraries faculty like Van Gulick have a direct line to developers who can implement new features and functionality to support the evolving needs of the CMU research community.
“The BOLD5000 dataset was an interesting use case for the growth of the figshare platform to support large and complex datasets within institutional repositories,” Van Gulick said. “As science becomes more collaborative and computational and large datasets such as BOLD5000 become the norm, the Libraries and figshare are committed to providing the infrastructure to support data discoverability and reuse.”