README Files: Front Matter for Your Data

For all you bibliophiles out there (people who love and/or collect books), I'm certain that you'll enjoy this blog post, where I draw an irresistible connection between the front matter of books and README files. For the casual book user, I hope today's Tartan Datascapes not only increases your interest in README files, but also sparks a passion for all the lovely things that comprise a book.

I'll be honest: I was not familiar with the term 'front matter' up until a couple of years ago, and once I learned what it meant, it was a massive lightbulb moment where so many things became clear! The front matter of a book is the material that comes before the actual text begins. This can include the title page, the table of contents, the foreword, and the introduction. If someone gave you a book that didn't have any front matter, you might be a little confused about what the book is about. You'd have to really dig into the text to understand the goals and sections of the book, and even then, it might take a couple of chapters to really understand its purpose. For some folks, maybe this would be an exciting adventure, but if you had to write a book report, you might be a little frustrated that you couldn't see that important contextual information about the book itself. For me, the front matter is essential for understanding what to expect when I read a book, and it even helps me decide if I want to read it! Front matter has allowed me to pick and choose which Stephen King novels I'd like to read and risk the nightmares that are all but sure to follow (growing up in Maine, we are exposed to his books at a very young age, for better or for worse! I'm still particularly scarred from reading IT at too young an age; perhaps I should have paid closer attention to the front matter).

README files are like front matter for your data, providing helpful information on how your data is organized and what someone would need to know to understand it. Without a README file, it can be hard to know how to understand and even reuse a dataset! README files are generally plain-text documents (with a .txt file extension) or, in the case of GitHub, Markdown documents (with a .md file extension) that live alongside your data wherever you are storing it. Some common elements in README files include:

- Who created the data, and their contact information

- Time range of data collection

- Number of variables, number of observations, and the specific name of each variable, along with the allowable values for each variable

- Any missing data to note, and how this missing data is represented

- Any special software or dependencies needed to use the data

- File formats and file directories (especially important if you have more than one data file!)

This is not an exhaustive list, and it's important to note that the contents of a README may shift based on the type of data or software being described. For the KiltHub institutional repository at CMU, we require each data deposit to have a README file in order to support reproducibility of the data. We provide a sample README for folks to check out as they are creating their own README for their KiltHub deposit, and are always happy to provide feedback on README files whether you are submitting to KiltHub or not!
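If you like to see things in code, here is a minimal Python sketch of how the common elements listed above could be assembled into a plain-text README. To be clear, this is not KiltHub's sample README or an official template. The `DatasetInfo` and `Variable` classes, the `build_readme` function, and every value in the example (the tree survey, the contact address, the file names) are hypothetical names I made up for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Variable:
    """One column in the dataset (hypothetical structure for illustration)."""
    name: str
    description: str
    allowed_values: str


@dataclass
class DatasetInfo:
    """The README elements from the list above, gathered in one place."""
    title: str
    creator: str
    contact: str
    collection_start: str
    collection_end: str
    n_observations: int
    missing_value_code: str
    variables: list[Variable] = field(default_factory=list)
    software: list[str] = field(default_factory=list)
    files: dict[str, str] = field(default_factory=dict)


def build_readme(info: DatasetInfo) -> str:
    """Return the README contents as a single plain-text string."""
    lines = [
        info.title,
        "",
        f"Creator: {info.creator}",
        f"Contact: {info.contact}",
        f"Data collected: {info.collection_start} to {info.collection_end}",
        f"Number of variables: {len(info.variables)}",
        f"Number of observations: {info.n_observations}",
        f"Missing data are represented as: {info.missing_value_code}",
        "",
        "Variables:",
    ]
    for var in info.variables:
        lines.append(
            f"- {var.name}: {var.description} "
            f"(allowable values: {var.allowed_values})"
        )
    lines += ["", "Software and dependencies:"]
    lines += [f"- {item}" for item in info.software]
    lines += ["", "Files:"]
    lines += [f"- {path}: {desc}" for path, desc in info.files.items()]
    return "\n".join(lines) + "\n"


# Entirely made-up example values, only here to show the shape of the output.
example = DatasetInfo(
    title="Example campus tree survey",
    creator="Jane Researcher",
    contact="jane@example.edu",
    collection_start="2019-06-01",
    collection_end="2019-08-31",
    n_observations=250,
    missing_value_code="NA",
    variables=[
        Variable("tree_id", "Unique identifier for each tree", "integers from 1 to 250"),
        Variable("height_m", "Tree height in meters", "positive decimal numbers"),
    ],
    software=["Any spreadsheet program or text editor that opens CSV files"],
    files={"data/trees.csv": "One row per tree surveyed"},
)

if __name__ == "__main__":
    print(build_readme(example))
```

From there, saving the result as `README.txt` next to the data is one more line, for example `Path("README.txt").write_text(build_readme(example))` using `pathlib`. Keeping the details in one structure like this is just one design choice; writing the README by hand in a text editor works every bit as well.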

Just like a book without front matter, a dataset without a README is hard to understand, not only for others who may be unfamiliar with the research, but even for the researchers themselves once some time has passed since the initial data collection! We have dozens (hundreds?!) of things going on at any one time, and it's natural that we may forget all the specifics of what went into our data collection and analysis. README files are not only good for others who want to use your data, but they are good for you, too. And, just like the front matter of Stephen King novels, which kept me from reading books that would be a bit too intense for me, README files can save other researchers (and yourself) from some data nightmares when it comes to reusing data!