Identification of Software Features in Issue Tracking System Data

Merten, Thorsten

PDF, English

Citation of documents: Please do not cite the URL displayed in your browser's address bar. Instead, use the DOI, URN, or persistent URL below, as we can guarantee their long-term accessibility.

DOI: 10.11588/heidok.00022655
URN: urn:nbn:de:bsz:16-heidok-226553
URL: http://www.ub.uni-heidelberg.de/archiv/22655

Abstract

Knowledge of Software Features (SFs) is vital for software developers and requirements specialists during all software engineering phases: to understand and derive software requirements, to plan and prioritize implementation tasks, to update documentation, or to test whether the final product correctly implements the requested SF. In most software projects, SFs are managed with the aid of Issue Tracking Systems (ITSs), in conjunction with other information such as bug reports, programming tasks, or refactoring tasks. Hence, ITSs contain a variety of information that is only partly related to SFs. In practice, however, using ITSs to store SFs comes with two major problems: (1) ITSs are neither designed nor used as documentation systems. Therefore, the data inside an ITS is often uncategorized, and SF descriptions are concealed in rather lengthy descriptions or comments. (2) Although an SF is often requested in a single sentence, related information can be scattered among many issues. For example, implementation tasks related to an SF are often reported in additional issues. Hence, the detection of SFs in ITSs is complicated: a manual search for SFs implies reading, understanding, and exploiting the Natural Language (NL) in many issues in detail. This is cumbersome and labor-intensive, especially if related information is spread over more than one issue. This thesis investigates whether SF detection can be supported automatically. First, the problem is analyzed: (i) An empirical study shows that requests for important SFs reside in ITSs, making ITSs a good target for SF detection. (ii) A second study identifies characteristics of the information and related NL in issues. These characteristics represent opportunities as well as challenges for the automatic detection of SFs.

Based on these problem studies, the Issue Tracking Software Feature Detection Method (ITSoFD) is proposed. The method has two main components and includes an approach to preprocess issues. Each component addresses one of the problems associated with storing SFs in ITSs. ITSoFD is validated in three solution studies: (I) An empirical study researches how NL that describes SFs can be detected with techniques from Natural Language Processing (NLP) and Machine Learning. Issues are parsed, and different characteristics of each issue and its NL are extracted. These characteristics are used to classify the issue's content and identify SF description candidates, thereby addressing problem (1). (II) An empirical study researches how issues that carry information potentially related to an SF can be detected with techniques from NLP and Information Retrieval. Characteristics of the issue's NL are utilized to create a traceability network of related issues, thereby addressing problem (2). (III) An empirical study researches how NL data in issues can be preprocessed using heuristics and hierarchical clustering. Code, stack traces, and other technical information are separated from NL. Heuristics are used to identify candidates for technical information, and clustering improves the heuristics' results. The technique can be applied to support components I and II.
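
The idea behind solution study (II), linking issues whose natural language is similar, can be illustrated with a small sketch. The abstract does not say which Information Retrieval techniques the thesis uses, so the TF-IDF weighting, the cosine similarity, the stop-word list, the threshold of 0.1, and all function names and sample issues below are assumptions chosen for illustration only; this is not the ITSoFD implementation.

```python
import math
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list for this illustration.
STOP_WORDS = {"a", "an", "the", "of", "to", "in", "for", "with", "when", "and"}


def tokenize(text):
    """Lowercase the text, keep alphabetic tokens, and drop stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (term -> weight) per document."""
    tokenized = [tokenize(d) for d in docs]
    n = len(docs)
    # Document frequency: the number of issues in which each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    return [
        {term: count * math.log(n / df[term]) for term, count in Counter(tokens).items()}
        for tokens in tokenized
    ]


def cosine(a, b):
    """Cosine similarity of two sparse vectors; 0.0 if either has zero length."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def link_issues(issues, threshold=0.1):
    """Return (id_a, id_b) pairs of issues whose texts are at least `threshold` similar.

    `issues` maps an issue id to its text. The resulting pairs can be read as
    the edges of a simple traceability network.
    """
    ids = sorted(issues)
    vectors = dict(zip(ids, tfidf_vectors([issues[i] for i in ids])))
    links = []
    for pos, a in enumerate(ids):
        for b in ids[pos + 1:]:
            if cosine(vectors[a], vectors[b]) >= threshold:
                links.append((a, b))
    return links


# Made-up sample issues, not taken from the thesis data.
example_issues = {
    1: "Add export of reports to PDF",
    2: "Implement PDF export for reports in the reporting module",
    3: "NullPointerException when logging in with empty password",
}

print(link_issues(example_issues))
```

With these sample issues, only the feature request (1) and its implementation task (2) are linked, while the unrelated bug report (3) stays isolated. On real ITS data the threshold would need tuning, and technical content such as stack traces would first have to be separated from the natural language, as described in study (III).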

Translation of abstract (from German)

Software Features (SFs) are central artifacts for software development and requirements management. SFs are used, for example, to understand, derive, or document requirements. The planning of development work and the documentation also often rely on SFs. In practice, SFs are usually managed in an Issue Tracking System (ITS) together with other information such as bug descriptions, development tasks, and refactoring tasks. Accordingly, ITSs usually contain a large amount of information that is only partly related to SFs. In practice, however, managing SFs in ITSs brings two major problems: (1) ITSs were built to support software development, not documentation. As a result, the data in ITSs is often miscategorized, and SFs are hidden in lengthy descriptions or comments. (2) Although SFs are usually described in a single sentence, related information is found throughout the ITS. For example, associated implementation tasks are often recorded in a new issue. Detecting SFs is therefore a difficult task: to find SFs manually, several issues, including their comments, must be read and assessed in detail. This is very laborious, especially when related information from several issues must also be gathered. This thesis investigates to what extent SFs can be detected automatically and first analyzes the problem: (i) An empirical study shows that important SFs can be found in ITSs, which makes ITSs a good target for automatic detection. (ii) A further study identifies characteristics of the information and of the natural-language formulations in issues. These characteristics in turn represent challenges, but also opportunities, for the automatic detection of SFs.

Based on the problem analysis, the Issue Tracking Software Feature Detection Method (ITSoFD), a method for detecting SFs in ITSs, is presented. ITSoFD has two main components and addresses the two problems that arise from managing SFs in ITSs. ITSoFD is validated in three studies: (I) A first empirical study investigates to what extent SFs can be detected with techniques from Natural Language Processing (NLP) and Machine Learning. Various characteristics of the issues and of the natural language are extracted and used to classify issues. This study investigates problem (1). (II) A second empirical study investigates to what extent related information in different issues can be brought together using techniques from NLP and Information Retrieval. Various characteristics of the natural language are used to link related issues with one another. This study investigates problem (2). (III) A third empirical study investigates to what extent technical information such as code and stack traces in issues can be separated from natural language. Heuristics are used to determine candidates for technical information, and these candidates are grouped by clustering to compensate for false detections by the heuristics. This technique is used as a preprocessing step for the components above.

Document type:	Dissertation
Supervisor:	Paech, Prof. Dr. Barbara
Date of thesis defense:	10 February 2017
Date Deposited:	14 Feb 2017 13:01
Date:	2017
Faculties / Institutes:	The Faculty of Mathematics and Computer Science > Department of Computer Science
DDC-classification:	004 Data processing Computer science
Further URL:	Research data