Many open sources of text and data are available on the web. The list below is a selection of sources that have come to our attention and/or may not already be included in online directories such as Open Access Directory's data repositories or Bigdata Made Simple's list.
The corpora at this site were created by Mark Davies, Professor of Linguistics at Brigham Young University. These are probably the most widely used corpora currently available (read more).
MSU Libraries Humanities Data includes but is not limited to digitized and born digital text, audio, images, moving images, and the metadata that describes them. Current collection strengths reside in text and audio data. Their collections have been prepared with an eye toward enabling computational analysis at the micro and macro scale.
The Internet Archive and Open Library offer over 8,000,000 fully accessible texts. Please be sure to read the bulk-download instructions.
The JSTOR Data for Research (DfR) service, freely available to the public, provides text-and-data-mining tools for selecting and interacting with the content in JSTOR. The tools include faceted searching, topic modeling, and data visualization. Researchers can contact JSTOR directly at firstname.lastname@example.org to obtain, view and bulk download document-level datasets, including word frequencies, citations, key terms and ngrams. For more information, see the Data for Research FAQ.
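For readers unfamiliar with the terms, word frequencies and ngrams are simply counts of single words and of runs of adjacent words in a document. The following is a minimal sketch of how such counts can be computed from plain text. It does not reproduce JSTOR's own processing or file formats; the `tokenize` and `ngram_counts` helpers and the sample sentence are hypothetical and exist only for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens (illustrative only)."""
    return re.findall(r"[a-z']+", text.lower())


def ngram_counts(tokens, n):
    """Count every run of n adjacent tokens; n=1 gives plain word frequencies."""
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(" ".join(gram) for gram in grams)


# Made-up sample sentence, not drawn from any real dataset.
sample = "the cat sat on the mat and the cat slept"
tokens = tokenize(sample)

word_freq = ngram_counts(tokens, 1)  # single-word counts
bigrams = ngram_counts(tokens, 2)    # counts of adjacent word pairs

print(word_freq.most_common(2))
print(bigrams.most_common(1))
```

Datasets delivered by a service like DfR already contain counts of this kind per document, so a sketch like this is mainly useful for checking your understanding or for processing texts from other sources in a comparable way.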
The New York Times now offers API access to its newspaper content, which can be searched as a whole or by section (see the available APIs).
- The PLOS Search API enables developers to query the content of PLOS journals and integrate the data into applications for the web, desktop or mobile devices. For more information, see the Search API FAQ.
- The PLOS Article-Level Metrics (ALM) API gives developers access to data collected by the PLOS Article-Level Metrics application for every article published in a PLOS journal, including usage statistics (e.g., page views, downloads), citation counts, mentions in Wikipedia, activity on social networks and blog coverage. For more information, see the ALM API FAQ. The PLOS API Display Policy specifies how data extracted using the PLOS APIs may be displayed.
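As a rough illustration of what querying the PLOS Search API can look like, the sketch below only builds a request URL and does not contact the service. The base URL and the `q`, `fl`, `wt` and `rows` parameters are assumptions based on the API's Solr-style query interface, and `build_plos_query` is a hypothetical helper; check the Search API FAQ for the current endpoint, parameters and rate limits before relying on any of them.

```python
from urllib.parse import urlencode

# Assumed base URL for the PLOS Search API; confirm it against the Search API FAQ.
PLOS_SEARCH_URL = "https://api.plos.org/search"


def build_plos_query(term, fields=("id", "title"), rows=10):
    """Build a search URL for `term` (hypothetical helper, not part of any PLOS library)."""
    params = {
        "q": term,                # Solr-style query string (assumed)
        "fl": ",".join(fields),   # fields to return (assumed parameter name)
        "wt": "json",             # response format (assumed parameter name)
        "rows": rows,             # maximum number of results (assumed parameter name)
    }
    return f"{PLOS_SEARCH_URL}?{urlencode(params)}"


url = build_plos_query('title:"text mining"')
print(url)
```

The resulting URL can be passed to any HTTP client. Keeping URL construction separate from the network call makes it easy to inspect and adjust a query before sending it.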
PubMed Central offers access to its texts via various freely available mining tools with a focus on the automatic extraction of biological entities (genes, diseases, chemicals, mutations, species) and their relations from free text. In addition, there are "large-scale" literature indexing and text simplification tools and several biomedical corpora with manual annotation (e.g. NCBI Disease Corpus).
The Online Books Page is a website that facilitates access to books that are freely readable over the Internet. It also aims to encourage the development of such online books, for the benefit and edification of all. Its listings draw on various repositories, including non-English collections (read more).