Special Installment: dSHARP and Fair Use Analyses of Text Data Sources

What?! Tartan Datascapes on a Wednesday?! That's right! I'm happy to share this special edition of Tartan Datascapes, co-written with my colleague Matthew Lincoln, Research Software Engineer for CMU Libraries, alongside our programming efforts around Fair Use/Fair Dealing Week 2020 (follow this link for a complete list of the ways we're engaging with this week, including trivia and a digital exhibit: https://library.cmu.edu/about/publications/news/fair-use-week-2020). Fair Use/Fair Dealing Week is an opportunity for institutions across North America to provide education on the principles of Fair Use in the United States and Fair Dealing in Canada.

I would argue that an important core tenet of research data management is understanding where your data comes from, and how this impacts your engagement with the data in an ethical and appropriate manner. Because it is Fair Use/Fair Dealing Week, this week's installment of Tartan Datascapes will focus on data from copyrighted materials which fall under fair use. So, what is fair use and what does it have to do with data? Grab a snack, sit back, and let's learn this together! 

Section 107 of the Copyright Act defines fair use as the following: 

[T]he fair use of a copyrighted work, including such use by reproduction in copies or phonorecords or by any other means specified by that section, for purposes such as criticism, comment, news reporting, teaching (including multiple copies for classroom use), scholarship, or research, is not an infringement of copyright. In determining whether the use made of a work in any particular case is a fair use the factors to be considered shall include --

the purpose and character of the use, including whether such use is of a commercial nature or is for nonprofit educational purposes;

the nature of the copyrighted work;

the amount and substantiality of the portion used in relation to the copyrighted work as a whole;

and the effect of the use upon the potential market for or value of the copyrighted work.

In short, fair use of copyrighted materials (particularly data harvested from those materials) can be claimed in a variety of situations, including when the use is for educational, non-commercial purposes. As a humanities researcher who works primarily on popular culture, I have run into questions about fair use many times in my career, particularly since I engage with a lot of copyrighted materials, including images, song lyrics, and audio clips. Upon joining CMU Libraries last summer, I found a home within dSHARP (digital Sciences, Humanities, and Arts Research & Publishing), a team of faculty and staff at CMU Libraries dedicated to advancing research and teaching involving digital tools, methods, and sources. When Andy Prisbylla, our Library Programming and Engagement Coordinator, first asked me to write this special feature of Tartan Datascapes, I knew it would be a great opportunity to highlight the interesting data engagement taking place in dSHARP, where conversations around fair use take place on a regular basis.

Recently at dSHARP office hours (Wednesdays from 1-4pm in the Sorrells Library Den on the 4th floor of Wean Hall), I sat down with Matt Lincoln, who has a B.A., M.A., and Ph.D. in Art History, to chat about how fair use intersects with the world of art history and digital humanities. In his role with CMU Libraries, Matt collaborates with scholars to plan and implement computational approaches to humanities research. Given the wide range of data sources that we use in humanities research (many of which include artwork, textual sources, photographs, and music), the 'fair use' status of these data guides how we can visualize and communicate our research.

Much of the research Matt comes into contact with involves text analysis of works which may still be under copyright. In the United States, copyright generally lasts for 70 years after the death of the author who created the work, at which point the work enters the public domain (that is, it is no longer protected by copyright and may be used freely and openly). But what if Matt wants to undertake text analysis on a work that still remains within this 70-year window? There are absolutely still ways to use these works as data sources! Matt highlighted two text mining techniques, topic modeling and term frequency, that allow him to engage in text analysis on these copyrighted works because they only need the counts of words in those works, not the full text itself. Topic modeling involves finding groups of words from a corpus of texts that tend to appear together in the same documents - in other words, terms that could be said to represent 'topics' of those documents. Term frequency measures how many times a word appears in a document relative to the total number of words in that document, providing context on the themes present in the text.
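To make the term frequency idea a little more concrete, here is a minimal Python sketch. It is only an illustration, not the code Matt uses: the function name `term_frequencies`, the sample sentence, and the simple rule for splitting text into words are all assumptions made for this example.

```python
import re
from collections import Counter


def term_frequencies(text):
    """Return each word's count divided by the total number of words.

    Assumption for this sketch: a "word" is any run of letters or
    apostrophes, compared case-insensitively.
    """
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    total = len(words)
    # An empty text yields an empty dictionary, so no division by zero occurs.
    return {word: count / total for word, count in counts.items()}


# Hypothetical sample text, invented for this example.
sample = "the drum beats and the drum calls"
tf = term_frequencies(sample)
print(tf["drum"])  # 2 occurrences out of 7 words
```

Notice that once the counts exist, the original sentence is no longer needed to compute the frequencies.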

Topic modeling and term frequencies are just two of a number of techniques in text mining that deal with word counts, also known as 'bag-of-words' models. Think of this quite literally: if we took a novel, ripped out the pages (sorry to all bibliophiles here!), cut each word out onto its own slip of paper, and threw these into a brown paper bag, we would have a bag of words. This bag of words would disregard the original word order in the text, simply recording which words appear in the publication and how often. Since these word counts do not replace the original function of the work, as they can't let us 'read' the publication, they fall under fair use. Some providers of textual data, such as JSTOR and HathiTrust, allow bulk access to the counts of words in a document, article, or book that is still under copyright.
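The paper-bag picture can also be sketched in a few lines of Python. This is a toy illustration under stated assumptions: the helper name `bag_of_words` and the two sample sentences are invented here, and words are split on whitespace only. It shows that two texts with different word order produce the same bag, which is why the bag alone cannot be used to 'read' the original.

```python
from collections import Counter


def bag_of_words(text):
    """Count each word in the text, ignoring the order the words appear in."""
    return Counter(text.lower().split())


# Two hypothetical sentences that use the same words in a different order.
original = "the dog chased the cat"
shuffled = "the cat chased the dog"

bag = bag_of_words(original)
same_bag = bag == bag_of_words(shuffled)

print(bag["the"])  # 2
print(same_bag)    # True: the word order is gone, only the counts remain
```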

Without giving too many spoilers, I'm happy to share that I'll be featuring the 'Print & Probability: A Statistical Approach to Clandestine Publication' project in a Tartan Datascapes feature later this semester, which will provide an in-depth view into how several collaborators across campus (including from dSHARP at CMU Libraries, Dr. Christopher Warren in the Department of English, and Dr. Max G'Sell from the Department of Statistics and Data Science!) are engaging with text and image data sources through fair use. Stay tuned!

What are some general takeaways from this special feature? Here are a few:

1. dSHARP is a fantastic resource on campus for digital humanities research and general support and collaboration on digital techniques. Contact dSHARP at dhcmu@andrew.cmu.edu to learn more!

2. Fair use is a topic of incredible importance in many situations, but particularly in humanities research engaging with text data. Not sure how to navigate these data? Email our Research Data Services team at UL-DataServices@andrew.cmu.edu for help!

3. CMU Libraries offers programming around Fair Use/Fair Dealing Week each year! We're lucky to have Andy Prisbylla leading these programming efforts. Check out the digital exhibit he created for a remix poem by Terrence Chiusano, Technical Services Specialist for CMU Libraries. Terrence constructed this poem entirely from the titles in Walt Whitman's 1865 Civil War poem collection 'Drum-Taps', and it is an excellent example of fair use in action: https://pertainstome.com.