eprintid: 22891
rev_number: 9
eprint_status: archive
userid: 2808
dir: disk0/00/02/28/91
datestamp: 2017-04-27 07:23:36
lastmod: 2017-05-04 08:43:37
status_changed: 2017-04-27 07:23:36
type: conferenceObject
metadata_visibility: show
creators_name: Primpeli, Anna
creators_name: Meusel, Robert
creators_name: Bizer, Christian
creators_name: Stuckenschmidt, Heiner
title: The Web Data Commons Structured Data Extraction
subjects: ddc-004
subjects: ddc-020
divisions: i-704000
pres_type: poster
cterms_swd: Markup Language
cterms_swd: Structured data
abstract: More and more websites annotate their content using different markup formats. These annotations cover a wide range of topics, such as persons, events, products, hotels, organizations and cities. The purpose of embedding structured data in HTML pages is to make the content of those pages understandable to web applications, which greatly facilitates the retrieval and integration of data from different web pages. The presented poster gives an overview of the Web Data Commons structured data project for the year 2016. The Web Data Commons project extracts structured data from the web corpus provided by Common Crawl, the largest public web corpus, and offers the extracted data for public download. To process these large amounts of data, Web Data Commons builds upon its Extraction Framework and Amazon Web Services.
date: 2017-03-17
id_scheme: DOI
id_number: 10.11588/heidok.00022891
collection: c-50
ppn_swb: 1657841049
own_urn: urn:nbn:de:bsz:16-heidok-228911
language: eng
bibsort: PRIMPELIANTHEWEBDATA20170317
full_text_status: public
pages: 1
event_title: E-Science-Tage 2017: Forschungsdaten managen
event_location: Heidelberg University
event_dates: 16-17 Mar 2017
citation: Primpeli, Anna ; Meusel, Robert ; Bizer, Christian ; Stuckenschmidt, Heiner (2017) The Web Data Commons Structured Data Extraction. [Conference Item]
document_url: https://archiv.ub.uni-heidelberg.de/volltextserver/22891/1/est_poster_vice-uc_17-03-2017.pdf