
Beyond Supervised Learning: Exploring Alternative Forms of Supervision for Visual Recognition

Dencker, Tobias

thesis_tdencker.pdf (PDF, English, 31 MB) | Terms of use

Citation of documents: Please do not cite the URL shown in your browser's address bar; instead, use the DOI, URN, or persistent URL below, as we can guarantee their long-term accessibility.

Abstract

The rise of deep learning has significantly advanced the field of computer vision. Deep learning, especially in combination with supervised learning, has become the backbone of most computer vision algorithms. However, labeled data is often the bottleneck for data-hungry deep learning methods, limiting their performance and broader application. While large-scale annotation is expensive and time-consuming, humans know from experience that learning with limited supervision is possible. This thesis aims to study and devise computer vision algorithms that enable deep learning for tasks with limited supervision.

Learning without full supervision requires exploiting alternative forms of supervision. A common approach is to use prior knowledge about the task to constrain the learning problem and compensate for the lack of supervision. This is difficult in the case of deep learning models, where an end-to-end learning paradigm largely prevents prior control over the representation learned by the model. Instead, this work pursues the idea of wrapping the supervised learning of a deep model into an iterative procedure that alternates between generating supervision from alternative sources and refining the model. To this end, models and learning procedures are developed and evaluated in the context of two computer vision applications for visual recognition with different variants of limited supervision.
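As an illustration of this alternating scheme, the following Python sketch outlines one generic round structure. The helper callables `generate_supervision` and `train_one_round` are hypothetical placeholders for the task-specific components described later in the abstract; they are not taken from the thesis itself.

```python
from typing import Any, Callable, Iterable, List, Tuple


def iterative_training(
    model: Any,
    unlabeled_data: Iterable[Any],
    generate_supervision: Callable[[Any, Iterable[Any], int], List[Tuple[Any, Any]]],
    train_one_round: Callable[[Any, List[Tuple[Any, Any]]], Any],
    num_rounds: int = 5,
) -> Any:
    """Alternate between generating supervision and refining the model."""
    for round_idx in range(num_rounds):
        # 1) Use the current model together with an alternative supervision
        #    source (self-supervised tasks, weak labels, ...) to build a
        #    pseudo-labeled training set for this round.
        pseudo_labeled = generate_supervision(model, unlabeled_data, round_idx)

        # 2) Refine the model with ordinary supervised training on the
        #    generated (input, target) pairs.
        model = train_one_round(model, pseudo_labeled)
    return model
```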

The first application deals with representation learning for human pose analysis from unsupervised video data. Visual recognition of human pose has many interesting applications but is challenging due to the high variability of pose and appearance as well as the problem of self-occlusion. This thesis employs a self-supervised learning approach, designing two auxiliary tasks that generate their own supervision to learn from the spatiotemporal information in videos. To increase the robustness of the learning procedure, the approach exploits the inherent self-similarity of human motion to refine the generated supervision and creates a learning curriculum that gradually increases in difficulty. The learned representation of human pose achieves performance competitive with the state of the art.
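A toy sketch of the curriculum idea, under the assumption that every self-generated training example can be assigned a scalar difficulty score; in the thesis the ordering is derived from the refined supervision, so the function below is purely illustrative:

```python
from typing import Any, Callable, List, Sequence


def build_curriculum(
    examples: Sequence[Any],
    difficulty: Callable[[Any], float],
    num_stages: int = 3,
) -> List[List[Any]]:
    """Split self-generated training examples into stages of increasing
    difficulty; later stages cumulatively include the earlier, easier ones."""
    ranked = sorted(examples, key=difficulty)  # easiest examples first
    stage_size = max(1, len(ranked) // num_stages)

    stages = []
    for stage in range(num_stages):
        # The final stage always covers the full (refined) training set.
        end = len(ranked) if stage == num_stages - 1 else (stage + 1) * stage_size
        stages.append(ranked[:end])
    return stages
```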

The second application addresses the challenge of reading cuneiform script on ancient clay tablets. To support Assyriologists in their analysis, a weakly supervised approach for training a cuneiform sign detector is proposed that can locate and classify cuneiform signs. Rather than requiring thousands of sign annotations, the approach learns incrementally by alternating between generating supervised data from weak supervision and training the cuneiform sign detector. To improve the precision and recall of the sign detector, the supervision generation implements an exploration-exploitation strategy that produces reliable and diverse examples for learning. The effectiveness of the proposed approach is thoroughly evaluated on the first large-scale dataset for cuneiform sign detection, which is established as part of this thesis. Finally, this thesis investigates an approach for linguistic refinement to further improve the results of the trained sign detector: a text correction model is learned in a self-supervised fashion that combines the bottom-up information from the sign detections with the top-down information from the language encoded in cuneiform script. The approach demonstrates the first steps towards the automatic transliteration of cuneiform clay tablets from images.
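As a rough illustration of an exploration-exploitation selection step, the sketch below mixes high-confidence detections (exploitation, for reliability) with randomly sampled lower-confidence ones (exploration, for diversity). The function name, the confidence-based ranking, and the split ratio are assumptions made for illustration, not the strategy implemented in the thesis.

```python
import random
from typing import Any, List, Tuple


def select_training_examples(
    candidates: List[Tuple[Any, float]],  # (detection, confidence) pairs
    num_examples: int,
    exploit_fraction: float = 0.7,
) -> List[Any]:
    """Mix reliable (high-confidence) detections with randomly sampled ones
    so the generated supervision is both precise and diverse."""
    ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
    num_exploit = min(len(ranked), int(num_examples * exploit_fraction))

    exploit = [det for det, _ in ranked[:num_exploit]]  # reliable examples
    remaining = [det for det, _ in ranked[num_exploit:]]
    num_explore = min(len(remaining), num_examples - num_exploit)
    explore = random.sample(remaining, num_explore)     # diverse examples
    return exploit + explore
```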

Document type: Dissertation
Supervisor: Ommer, Prof. Dr. Björn
Place of Publication: Heidelberg
Date of thesis defense: 30 June 2021
Date Deposited: 15 Jul 2021 06:16
Date: 2021
Faculties / Institutes: The Faculty of Mathematics and Computer Science > Department of Computer Science
DDC-classification: 004 Data processing, computer science
Controlled Keywords: Computer Vision, Machine Learning, Object Detection