title: Neural Patent Classification beyond Title and Abstract: Leveraging Patent Text and Metadata
creator: Pujari, Subhash Chandra
subject: ddc-004
subject: 004 Data processing Computer science
subject: ddc-020
subject: 020 Library and information sciences
subject: ddc-600
subject: 600 Technology (Applied sciences)
description: Intellectual property violations involve substantial litigation and license costs, which makes patent search critically important. Over the years, patent corpora have amassed millions of patents, making manual searches impractical. Patent classification techniques help domain experts to search and analyze patents. On submission to an examination office, a patent application is assigned labels from pre-defined patent taxonomies, e.g., Cooperative Patent Classification (CPC) and International Patent Classification (IPC). CPC/IPC classification helps to route patent applications to the correct department and assists in performing prior art searches. In addition to CPC/IPC classification, we address the classification task associated with the Patent Landscape Study (PLS), a process that allows organizations to search patents, categorize them into custom labels, and analyze them to derive crucial insights. This thesis contributes to the improvement of patent classification systems by addressing the key challenges described below. Most of the existing CPC/IPC classification datasets provide only limited texts of the included patents and are, therefore, insufficient for our experiments. In response to this issue, we release a CPC classification dataset that includes the full texts of patents. Further, the unavailability of open-source datasets is a major bottleneck for the automation of PLS. To address this challenge, we curate, enrich, and release three open-source datasets from two diverse domains. Despite CPC/IPC classification being a hierarchical multi-label classification task, most prior neural models have not considered the hierarchical taxonomy when designing model architectures and have often predicted labels only for a single level.
We make a major contribution with our memory-efficient model architecture, which shares a single transformer-based language model across multiple classification heads, one for each label in the taxonomy, and leverages hierarchical links in the model architecture. We demonstrate that the proposed technique consistently outperforms baselines, particularly for infrequent labels. Our analysis shows that the sentences and abstracts of patents are often duplicated, illustrating the relevance of the full texts of patents for classification. However, transformer-based language models that take 512 or 4,096 tokens as input are insufficient for patents, which contain 12.5k tokens on average. Motivated by these factors, we make a major contribution with our document representation technique, which combines truncated section text embeddings using vector summation and performs better than baselines. In addition, we propose a sentence ranker and demonstrate that extractive summarization techniques are effective in selecting informative sentences for neural representation in the context of patent classification. Unlike CPC/IPC classification, in the case of PLS, the CPC/IPC labels are known during inference. As a major contribution, we enrich the document representation by combining CPC/IPC labels with patent text to predict PLS-oriented categories, which often represent concepts different from CPC/IPC labels. To demonstrate the broader applicability of the proposed technique, we apply it to a similar task: classifying research publications into target categories using text and author-provided keywords as input.
date: 2024
type: Dissertation
type: info:eu-repo/semantics/doctoralThesis
type: NonPeerReviewed
format: application/pdf
identifier: https://archiv.ub.uni-heidelberg.de/volltextserver/35223/1/thesis_Subhash_Chandra_Pujari_2024-07-30.pdf
identifier: DOI:10.11588/heidok.00035223
identifier: urn:nbn:de:bsz:16-heidok-352238
identifier: Pujari, Subhash Chandra (2024) Neural Patent Classification beyond Title and Abstract: Leveraging Patent Text and Metadata. [Dissertation]
relation: https://archiv.ub.uni-heidelberg.de/volltextserver/35223/
rights: info:eu-repo/semantics/openAccess
rights: http://archiv.ub.uni-heidelberg.de/volltextserver/help/license_urhg.html
language: eng