PDF, English, main document (64MB) | License: Rights reserved - Free Access |
Abstract
Computer vision has played a crucial role in the recent surge of interest in artificial intelligence. Neural networks, in particular, have led to breakthroughs in many application areas, ranging from the recognition of lung cancer in CT images to novel ways of creating virtual immersive environments and photorealistic avatars. Most computer vision research focuses on demonstrating new technological solutions and showcasing their capabilities. This dissertation attempts a change of perspective from a technology-centred to a user-centred focus to ensure the successful deployment of innovations in real-world scenarios. It explores the intersection of human-computer interaction (HCI) and computer vision, with particular emphasis on the domains of interpretable vision and gaze-aware video conferencing. Comparative user studies with robust baseline conditions are central to this work and form the cornerstone of the methodological strategy.
The first part of this dissertation delves into interpretable vision, evaluating the efficacy of different explanation methods in enhancing users' understanding of image classifiers. Through rigorous experimental design and the development of a novel synthetic dataset, two studies provide nuanced insights into the effectiveness of these explanation methods. Results show that saliency maps can draw users' attention to specific features, while counterfactuals help discover model biases. Notably, results also show that simple example-based explanations can be just as effective overall as more sophisticated methods while being easier to implement. We argue that these explanations should serve as a benchmark for evaluating any future explanation methods. These results highlight the importance of measuring how well users can reason about a model, rather than relying solely on technical evaluations or proxy tasks, when assessing explanation techniques.
The second part of this dissertation shifts the focus to image synthesis. It addresses the quality of the video-conferencing user experience by exploring a conceptual system capable of conveying gaze and attention. Gazing Heads is a round-table virtual meeting concept that enables direct eye contact and signals gaze via controlled head rotation. We built a four-party camera-based simulation to evaluate Gazing Heads against a conventional “Tiled View” video-conferencing system. In contrast to prior concepts, Gazing Heads increases social presence, mutual eye contact, and user engagement. We attribute these novel results to the amplifying effect of head rotations for conveying gaze. In its current design, Gazing Heads unequivocally enhances the experience of users in highly interactive small group meetings. Our work also highlights the remaining challenges in implementing Gazing Heads on commodity hardware and in achieving seamless integration into daily video-conferencing.
Overall, this dissertation contributes to the fields of HCI and computer vision by providing empirical insights into the benefits and limitations of current computer vision applications from a user-centred perspective. Published across top-tier machine learning and HCI venues, this research emphasises the need for more meticulously designed user studies in computer vision. It provides foundational artefacts, such as benchmark datasets, study designs, and system concepts, which can serve as a starting point for future research.
Document type: | Dissertation |
---|---|
Supervisor: | Rother, Prof. Dr. Carsten |
Place of Publication: | Heidelberg |
Date of thesis defense: | 7 May 2025 |
Date Deposited: | 13 May 2025 05:47 |
Date: | 2025 |
Faculties / Institutes: | The Faculty of Mathematics and Computer Science > Department of Computer Science |
DDC-classification: | 004 Data processing Computer science |
Controlled Keywords: | Artificial Intelligence (Künstliche Intelligenz), Computer Vision (Maschinelles Sehen), Deep Learning, Data Science |
Uncontrolled Keywords: | Human-Computer Interaction, User Studies |