Embedded Semantic Markup, schema.org, the Common Crawl, and Web Data Commons: What Big Web Data Means for Libraries and Archives

Updated, more in-depth version of this presentation given at the DLF Fall Forum 2013. The research focuses on NC State peer institutions. (2013-11-04)

HTML slides: press “n” for speaker notes (repo)
PDF slides
Single document HTML slides with speaker notes
Web Data Commons with .edu: All of the Web Data Commons N-Quads with “.edu” somewhere in the statement.
Web Data Commons N-Quads with “.edu” converted to CSV: Data extracted from each of the “.edu” N-Quads as CSV, which was then imported into Solr.
https://bitbucket.org/jronallo/austin: A library and documentation for reproducing these data sets.
https://bitbucket.org/jronallo/austin-blacklight: An instance of Blacklight for exploring the CSV data as facets.
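
To give a rough sense of what the two data sets above involve, here is a minimal Python sketch of the filter-and-convert step. It is not the code from the austin repository; it only assumes that each N-Quads statement sits on one line, that “.edu” is matched as a plain substring anywhere in the statement (as described above), and that a simplified regular expression is good enough to split a statement into its four terms. The function names and the CSV column names are hypothetical.

```python
import csv
import io
import re
import sys

# Simplified pattern for one N-Quads term: an IRI, a blank node, or a literal
# with an optional language tag or datatype. This is an assumption made for
# illustration and does not cover the full N-Quads grammar.
TERM = r'(<[^>]*>|_:\S+|"(?:[^"\\]|\\.)*"(?:@[A-Za-z0-9-]+|\^\^<[^>]*>)?)'
QUAD = re.compile(
    r'^\s*' + TERM + r'\s+' + TERM + r'\s+' + TERM + r'\s+' + TERM + r'\s*\.\s*$'
)


def edu_quads(lines):
    """Yield (subject, predicate, object, graph) for statements containing '.edu'."""
    for line in lines:
        if ".edu" not in line:
            continue  # keep only statements with ".edu" somewhere in them
        match = QUAD.match(line)
        if match:  # silently skip lines the simplified pattern cannot parse
            yield match.groups()


def quads_to_csv(lines, out):
    """Write the matching quads to `out` as CSV and return how many were written."""
    writer = csv.writer(out)
    writer.writerow(["subject", "predicate", "object", "graph"])  # hypothetical headers
    count = 0
    for quad in edu_quads(lines):
        writer.writerow(quad)
        count += 1
    return count


if __name__ == "__main__":
    # Made-up sample statements; real input would be a Web Data Commons dump.
    sample = [
        '_:b0 <http://schema.org/name> "Special Collections"@en <http://lib.example.edu/page> .',
        '_:b1 <http://schema.org/name> "Shop" <http://example.com/> .',
    ]
    buffer = io.StringIO()
    quads_to_csv(sample, buffer)
    sys.stdout.write(buffer.getvalue())
```

A CSV file in this shape could then be loaded into a search index for faceted browsing; how the original data was actually loaded into Solr is not shown here.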

Abstract:

Search engines are reaching the limits of natural language processing while wanting to provide more exact answers, not just results, especially for the mobile context. This shift is part of what has spurred progress in how data can be published and consumed on the Web. Broad and simple vocabularies and simplified embedded semantic markup are leading to wider adoption of publishing data in HTML. Libraries and archives can take advantage of new opportunities to make their services and collections more discoverable on the open Web. This presentation will show some examples of what libraries and archives are currently doing and point to future possibilities.

At the same time as this new data is being made available, only a few organizations have the resources to crawl the Web and extract the data. The Common Crawl is helping to make a large repository of Web crawl data available for public use, and Web Data Commons is extracting the data embedded in the Common Crawl and making the resulting linked data available for download. This presentation will share data from original research on how libraries currently fare in this new environment of big Web data. Are libraries and archives represented in the corpus? With this democratization of Web crawl data and the lower cost of consuming it, what are the opportunities for new library services and collections?

