
The Web Data Commons Structured Data Extraction

Primpeli, Anna ; Meusel, Robert ; Bizer, Christian ; Stuckenschmidt, Heiner

PDF, English - main document (460kB)
License: Creative Commons Attribution 3.0 Germany

More and more websites annotate their content using a variety of markup formats. These annotations cover a wide range of topics, such as persons, events, products, hotels, organizations and cities. Embedding structured data in HTML pages makes the content of those pages understandable to web applications, which greatly facilitates retrieving and integrating data from different web pages. This poster gives an overview of the Web Data Commons structured data project for the year 2016. The Web Data Commons project extracts structured data from the web corpus provided by Common Crawl, the largest public web corpus, and offers the extracted data for public download. To process these large amounts of data, Web Data Commons builds on its Extraction Framework and Amazon Web Services.
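
To illustrate what "structured data embedded in an HTML page" can look like, the following is a minimal sketch that pulls JSON-LD annotations out of a single page using only the Python standard library. It is not the Web Data Commons Extraction Framework and makes no claim about how that framework works; the class name `JsonLdExtractor`, the helper `extract_jsonld` and the sample page are hypothetical, and JSON-LD is used here only as one example of a markup format.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collects the contents of <script type="application/ld+json"> blocks.

    Illustrative only: a hypothetical helper, not part of any real framework.
    """

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.items = []

    def handle_starttag(self, tag, attrs):
        # Start buffering when a JSON-LD script block opens.
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        # When the block closes, parse the buffered text as JSON.
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                self.items.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # skip malformed annotations instead of failing


def extract_jsonld(html):
    """Return a list of parsed JSON-LD objects found in an HTML string."""
    parser = JsonLdExtractor()
    parser.feed(html)
    parser.close()
    return parser.items


# Hypothetical sample page annotating a hotel with schema.org vocabulary.
SAMPLE_PAGE = """
<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Hotel", "name": "Example Hotel",
 "address": {"@type": "PostalAddress", "addressLocality": "Heidelberg"}}
</script>
</head><body><h1>Example Hotel</h1></body></html>
"""

items = extract_jsonld(SAMPLE_PAGE)
for item in items:
    print(item["@type"], "-", item["name"])
```

A web application can read the entity type and its properties directly from the parsed objects without having to interpret the visible page text, which is the benefit the abstract describes. Extracting other formats, or processing a corpus the size of Common Crawl, would require considerably more than this sketch.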

Document type: Conference Item
Date Deposited: 27 Apr 2017 07:23
Date: 17 March 2017
Number of Pages: 1
Event Dates: 16-17 Mar 2017
Event Location: Heidelberg University
Event Title: E-Science-Tage 2017: Forschungsdaten managen
Faculties / Institutes: Service facilities > Computing Centre
DDC-classification: 004 Data processing Computer science; 020 Library and information sciences
Controlled Keywords: Markup Language, Structured data
Collection: E-Science-Tage 2017