
Neural Patent Classification beyond Title and Abstract: Leveraging Patent Text and Metadata

Pujari, Subhash Chandra



Abstract

Intellectual property violations incur substantial litigation and licensing costs, which makes patent search essential. Over the years, patent corpora have amassed millions of patents, making manual searches impractical. Patent classification techniques help domain experts search and analyze patents. On submission to an examination office, a patent application is assigned labels from pre-defined patent taxonomies, e.g., Cooperative Patent Classification (CPC) and International Patent Classification (IPC). CPC/IPC classification helps route patent applications to the correct department and assists in prior art searches. In addition to CPC/IPC classification, we address the classification task associated with the Patent Landscape Study (PLS), a process that allows organizations to search patents, categorize them into custom labels, and analyze them to derive crucial insights. This thesis contributes to the improvement of patent classification systems by addressing the key challenges described below.

Most of the existing CPC/IPC classification datasets provide only limited texts of the included patents and are, therefore, insufficient for our experiments. In response to this issue, we release a CPC classification dataset that includes the full texts of patents. Further, the unavailability of open-source datasets is a major bottleneck for the automation of PLS. To address this challenge, we curate, enrich, and release three open-source datasets from two diverse domains.

Despite CPC/IPC classification being a hierarchical multi-label classification task, most prior neural models have not considered the hierarchical taxonomy when designing model architectures and have often predicted labels only for a single level. We make a major contribution with our memory-efficient model architecture, which shares a single transformer-based language model across multiple classification heads, one for each label in the taxonomy, and leverages hierarchical links in the model architecture. We demonstrate that the proposed technique consistently outperforms baselines, particularly for infrequent labels.
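
The idea of one shared document representation feeding one head per taxonomy label can be sketched as follows. This is a minimal illustration, not the architecture from the thesis: the toy taxonomy `PARENT`, the plain-list document vector standing in for the language model output, and the rule of multiplying a child's probability by its parent's probability are all assumptions made for this example.

```python
import math

# Hypothetical toy taxonomy: child label -> parent label (None for roots).
PARENT = {"A": None, "A01": "A", "A01B": "A01", "H": None, "H04": "H"}


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def predict(shared_vec, heads, parent=PARENT):
    """Score every label from one shared document vector.

    shared_vec: list of floats, a stand-in for the shared encoder output.
    heads: label -> (weights, bias), one small linear head per label.
    A child's probability is multiplied by its parent's probability,
    which is one simple (assumed) way to use the hierarchical links.
    """
    probs = {}

    def score(label):
        if label in probs:
            return probs[label]
        weights, bias = heads[label]
        p = sigmoid(sum(w * x for w, x in zip(weights, shared_vec)) + bias)
        if parent[label] is not None:
            p *= score(parent[label])  # condition on the parent label
        probs[label] = p
        return p

    for label in heads:
        score(label)
    return probs


# The same document vector is reused by every head; only the heads differ.
doc_vec = [0.5, -1.0, 2.0]
heads = {label: ([0.1 * (i + 1), 0.0, 0.2], 0.0) for i, label in enumerate(PARENT)}
probs = predict(doc_vec, heads)
```

Because the encoder output is computed once and only the small per-label heads differ, memory grows with the number of labels rather than with the number of encoders. Under the assumed multiplication rule, a child label can never score higher than its parent.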

Our analysis shows that the sentences and abstracts of patents are often duplicated, which underlines the value of full patent texts for classification. However, transformer-based language models that take 512 or 4,096 tokens as input are insufficient for patents, which contain 12.5k tokens on average. Motivated by these factors, we make a major contribution with our document representation technique, which combines truncated section text embeddings using vector summation and performs better than baselines. In addition, we propose a sentence ranker and demonstrate that extractive summarization techniques are effective in selecting informative sentences for neural representation in the context of patent classification.
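
Combining truncated section embeddings by vector summation can be sketched as follows. The `embed` function is a toy hashed bag-of-words stand-in for a transformer encoder, and the section names, `MAX_TOKENS`, and `DIM` are assumptions chosen for this illustration rather than values taken from the thesis.

```python
import zlib

MAX_TOKENS = 512  # assumed per-section budget, mirroring a common encoder limit
DIM = 8           # toy embedding size


def embed(tokens, dim=DIM):
    """Toy stand-in for a transformer encoder: hashed bag-of-words counts."""
    vec = [0.0] * dim
    for tok in tokens:
        vec[zlib.crc32(tok.encode("utf-8")) % dim] += 1.0
    return vec


def represent(sections, max_tokens=MAX_TOKENS):
    """Truncate each section, embed it separately, and sum the vectors."""
    total = [0.0] * DIM
    for text in sections.values():
        tokens = text.split()[:max_tokens]  # truncate per section
        total = [a + b for a, b in zip(total, embed(tokens))]
    return total


# Hypothetical patent with one section far longer than the budget.
patent = {
    "abstract": "a method for classifying patents",
    "claims": "1. a method comprising " + "encoding " * 1000,
    "description": "the invention relates to text classification",
}
doc_vec = represent(patent)
```

Each section gets its own input budget, so a long claims or description section does not crowd out the others, and the summed vector keeps a fixed size regardless of how many sections a patent has.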

Unlike CPC/IPC classification, in the case of PLS, the CPC/IPC labels are known during inference. As a major contribution, we enrich the document representation by combining CPC/IPC labels with patent text to predict PLS-oriented categories, often representing concepts different from CPC/IPC labels. To demonstrate the broader applicability of the proposed technique, we apply it to a similar task: classifying research publications into target categories using text and author-provided keywords as input.
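
One simple way to feed known CPC/IPC labels to a text classifier is to prepend them to the patent text as extra tokens. The sketch below assumes this concatenation scheme, a `[SEP]` separator, and illustrative CPC codes; the thesis may combine the two inputs differently.

```python
def build_input(text, cpc_labels, sep="[SEP]"):
    """Prepend de-duplicated, sorted CPC codes to the text (assumed scheme)."""
    labels = " ".join(sorted(set(cpc_labels)))
    return f"{labels} {sep} {text}" if labels else text


# Illustrative codes only; the duplicate is dropped before concatenation.
example = build_input(
    "A battery cell with a solid electrolyte.",
    ["H01M10/0562", "H01M10/052", "H01M10/0562"],
)
```

The same helper would apply to the research-publication task by passing author-provided keywords in place of CPC codes.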

Document type: Dissertation
Supervisor: Gertz, Prof. Dr. Michael
Place of Publication: Heidelberg
Date of thesis defense: 30 July 2024
Date Deposited: 06 Aug 2024 06:14
Date: 2024
Faculties / Institutes: The Faculty of Mathematics and Computer Science > Department of Computer Science; The Faculty of Mathematics and Computer Science > Institut für Mathematik
DDC-classification: 004 Data processing Computer science; 020 Library and information sciences; 600 Technology (Applied sciences)
Controlled Keywords: Natural Language Processing, Patent Classification, Hierarchical Multi-label Classification, Long Text Representation