Implicit Entity Networks: A Versatile Document Model

Spitz, Andreas

PDF, English - Main document (3MB)
License: Creative Commons Attribution-NonCommercial 4.0

DOI: 10.11588/heidok.00026328
URN: urn:nbn:de:bsz:16-heidok-263287

Abstract

The time in which we live is often referred to as the Information Age. However, it can also aptly be characterized as an age of constant information overload. Nowhere is this more present than on the Web, which serves as an endless source of news articles, blog posts, and social media messages. Of course, this overload is even greater in professions that handle the creation or extraction of information and knowledge, such as journalism, law, research, clerical work, or medicine. The volume of available documents and the interconnectedness of their contents are both a blessing and a curse for the contemporary information consumer. On the one hand, they provide near-limitless information, but on the other hand, their consumption and comprehension require an amount of time that many of us cannot spare. As a result, automated extraction, aggregation, and summarization techniques have risen in popularity, even though they are a long way from being comprehensive. When we, as humans, are faced with an overload of information, we tend to look for patterns that bring order into the chaos. In news, we might identify familiar political figures or celebrities, whereas we might look for expressive symptoms in medicine, or precedential cases in law. In other words, we look for known entities as reference points, and then explore the content along the lines of their relations to other entities. Unfortunately, this approach is not reflected in current document models, which do not provide a similar focus on entities. As a direct result, the retrieval of entity-centric knowledge and relations from a flood of textual information becomes more difficult than it has to be, and the inclusion of external knowledge sources is impeded.

In this thesis, we introduce implicit entity networks as a comprehensive document model that addresses this shortcoming and provides a holistic representation of document collections and document streams. Based on the premise of modelling the cooccurrence relations between terms and entities as first-class citizens, we investigate how the resulting network structure facilitates efficient and effective entity-centric search, and demonstrate the extraction of complex entity relations, as well as their summarization. We show that the implicit network model is fully compatible with dynamic streams of documents. Furthermore, we introduce document aggregation methods that are sensitive to the context of entity mentions, and can be used to distinguish between different entity relations. Beyond the relations of individual entities, we introduce network topics as a novel and scalable method for the extraction of topics from collections and streams of documents. Finally, we combine the insights gained from these applications in a versatile hypergraph document model that bridges the gap between unstructured text and structured knowledge sources.

Abstract (translated from the German version)

Our time is often called the Information Age, although characterizing it as an age of constant information overload would be equally apt. Nowhere is information as present as on the Internet, which is an inexhaustible source of news articles and social media posts. In fields of work such as journalism or medicine, which deal with the management or procurement of information and knowledge, this information load is often even more pronounced. For the reader, the volume of available documents is frequently both a curse and a blessing. On the one hand, it offers access to almost unlimited information, but on the other hand, reading and comprehension require an amount of time that can hardly be justified. Because of this problem, the use of automated extraction, aggregation, and summarization methods has increased sharply, but these methods reach their limits. In the face of such information overload, the reader often turns to searching for patterns to bring order into the chaos. These can be well-known people in news articles, precedents in law, or symptoms in medicine. In other words, we look for entities known to us as reference points and then work our way along their relations to other entities in order to understand the content of the documents. On the technical side, however, precisely this approach is not supported by existing document models, since they do not take entity relations into account. As a result, extracting information from unstructured texts with existing document models becomes more difficult than necessary, and the integration of external knowledge sources is prevented.

In this thesis, we therefore introduce implicit entity networks, which enable a more complete representation of document collections. Based on modelling cooccurrence relations between entities and words as the primary component of the model, we investigate efficient and effective methods for entity-based search in documents as well as for the extraction of entity relations. We further show that implicit entity networks can also be used to model dynamic streams of documents. Based on the context of entity relations, we extract topics from networks and contrast them with topic models. Finally, we generalize the model of implicit entity networks to a document model based on hypergraphs, which enables the direct combination of unstructured texts and structured knowledge bases.

Document type:	Dissertation
First examiner:	Gertz, Prof. Dr. Michael
Place of publication:	Heidelberg
Date of examination:	10 April 2019
Date created:	29 Apr. 2019 15:18
Year of publication:	2019
Institutes/Facilities:	Fakultät für Mathematik und Informatik > Institut für Informatik
DDC subject group:	004 Computer science
Controlled keywords:	Information Retrieval
Uncontrolled keywords:	Document Exploration, Implicit Network, Cooccurrence Network, Network Topic, Hypergraph Document Model, Natural Language Processing