title: Neural Patent Classification beyond Title and Abstract: Leveraging Patent Text and Metadata
creator: Pujari, Subhash Chandra
subject: ddc-004
subject: 004 Data processing Computer science
subject: ddc-020
subject: 020 Library and information sciences
subject: ddc-600
subject: 600 Technology (Applied sciences)
description: Intellectual property violations incur substantial litigation and licensing costs, which makes patent search critically important. Over the years, patent corpora have amassed millions of patents, making manual searches impractical. Patent classification techniques help domain experts search and analyze patents. On submission to an examination office, a patent application is assigned labels from pre-defined patent taxonomies, e.g., Cooperative Patent Classification (CPC) and International Patent Classification (IPC). CPC/IPC classification helps to route patent applications to the correct department and assists in performing prior art searches. In addition to CPC/IPC classification, we address the classification task associated with the Patent Landscape Study (PLS), a process that allows organizations to search patents, categorize them into custom labels, and analyze them to derive crucial insights. This thesis significantly contributes to the improvement of patent classification systems by addressing the key challenges described below. Most of the existing CPC/IPC classification datasets provide only limited texts of the included patents and are, therefore, insufficient for our experiments. In response to this issue, we release a CPC classification dataset that includes the full texts of patents. Further, the unavailability of open-source datasets is a major bottleneck for the automation of PLS. To address this challenge, we curate, enrich, and release three open-source datasets from two diverse domains.
Despite CPC/IPC classification being a hierarchical multi-label classification task, most prior neural models have not considered the hierarchical taxonomy when designing model architectures and have often predicted labels only for a single level. We make a major contribution with our memory-efficient model architecture, which shares a single transformer-based language model across multiple classification heads, one for each label in the taxonomy, and leverages hierarchical links in the model architecture. We demonstrate that the proposed technique consistently outperforms baselines, particularly for infrequent labels. Our analysis shows that the sentences and abstracts of patents are often duplicated, illustrating the relevance of the full texts of patents for classification. However, transformer-based language models that take 512 or 4,096 tokens as input are insufficient for patents, which contain 12.5k tokens on average. Motivated by these factors, we make a major contribution with our document representation technique, which combines truncated section text embeddings using vector summation and performs better than baselines. In addition, we propose a sentence ranker and demonstrate that extractive summarization techniques are effective in selecting informative sentences for neural representation in the context of patent classification. Unlike CPC/IPC classification, in the case of PLS, the CPC/IPC labels are known during inference. As a major contribution, we enrich the document representation by combining CPC/IPC labels with patent text to predict PLS-oriented categories, which often represent concepts different from CPC/IPC labels. To demonstrate the broader applicability of the proposed technique, we apply it to a similar task: classifying research publications into target categories using text and author-provided keywords as input.
date: 2024
type: Dissertation
type: info:eu-repo/semantics/doctoralThesis
type: NonPeerReviewed
format: application/pdf
identifier: https://archiv.ub.uni-heidelberg.de/volltextserver/35223/1/thesis_Subhash_Chandra_Pujari_2024-07-30.pdf
identifier: DOI:10.11588/heidok.00035223
identifier: urn:nbn:de:bsz:16-heidok-352238
identifier: Pujari, Subhash Chandra (2024) Neural Patent Classification beyond Title and Abstract: Leveraging Patent Text and Metadata. [Dissertation]
relation: https://archiv.ub.uni-heidelberg.de/volltextserver/35223/
rights: info:eu-repo/semantics/openAccess
rights: http://archiv.ub.uni-heidelberg.de/volltextserver/help/license_urhg.html
language: eng
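
The description mentions a document representation that combines truncated section text embeddings using vector summation. The following is a minimal sketch of that idea only, not the thesis implementation. The encoder `toy_embed` is a hypothetical hash-based stand-in for a transformer-based language model, whitespace splitting stands in for subword tokenization, and the section names, `EMBED_DIM`, and the example texts are all assumptions made for illustration.

```python
import hashlib
import math

EMBED_DIM = 8      # toy dimensionality (assumption; real encoders use hundreds of dimensions)
MAX_TOKENS = 512   # per-section truncation limit, mirroring the 512-token input size in the description


def toy_embed(tokens):
    """Hypothetical stand-in for a transformer encoder: hash tokens into a unit-length vector."""
    vec = [0.0] * EMBED_DIM
    for tok in tokens:
        digest = hashlib.sha256(tok.encode("utf-8")).digest()
        for i in range(EMBED_DIM):
            vec[i] += digest[i] / 255.0 - 0.5
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0  # avoid dividing by zero for empty input
    return [v / norm for v in vec]


def embed_section(text, max_tokens=MAX_TOKENS):
    """Truncate one section to max_tokens whitespace tokens, then embed it."""
    tokens = text.lower().split()[:max_tokens]
    return toy_embed(tokens)


def embed_document(sections, max_tokens=MAX_TOKENS):
    """Combine the per-section embeddings by element-wise vector summation."""
    doc_vec = [0.0] * EMBED_DIM
    for text in sections.values():
        sec_vec = embed_section(text, max_tokens)
        doc_vec = [d + s for d, s in zip(doc_vec, sec_vec)]
    return doc_vec


# Illustrative patent with assumed section names and made-up text.
patent = {
    "title": "Battery cooling plate",
    "abstract": "A cooling plate for a traction battery with coolant channels",
    "claims": "1. A cooling plate comprising a base body and coolant channels",
    "description": "The invention relates to thermal management of battery modules",
}
doc_vector = embed_document(patent)
```

Because each section is truncated and encoded separately, the document vector keeps the encoder's output size regardless of how many sections are present, so a downstream classifier sees a fixed-length input even for patents far longer than the encoder's token limit.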