The proliferation of videos in recent years has spurred a surge of interest in efficient techniques for automatic video interpretation. The thesis improves the understanding of videos by building structured models that use latent information to detect and recognize instances of actions or abnormalities in videos. It also proposes efficient inference and learning algorithms for these latent structured models that are suited to learning with weak supervision.
An important class of latent variable models is multiple instance learning, where training labels are provided only for bags of instances and not for the instances themselves. Because the latent instance labels are inferred jointly with the training of a classifier on the same data, multiple instance learning is very susceptible to overfitting. To increase the robustness of popular methods for multiple instance learning, the thesis introduces the novel concept of superbags (ensemble of bags of bags), which allows the classifier training and latent label inference steps to be decoupled.
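The coupling between latent label inference and classifier training can be made concrete with a minimal sketch. It is not the superbag method of the thesis; it only illustrates the generic alternation that the paragraph describes, under the common assumption that a bag is positive if at least one of its instances is positive. The nearest-centroid classifier, the function names, and the toy data are all illustrative assumptions.

```python
from statistics import mean


def score(x, w, b):
    """Linear score of one instance (a list of floats)."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def train_centroid_classifier(pos, neg):
    """Hypothetical stand-in classifier: the boundary lies midway between class centroids."""
    dim = len(pos[0])
    mu_p = [mean(x[d] for x in pos) for d in range(dim)]
    mu_n = [mean(x[d] for x in neg) for d in range(dim)]
    w = [p - n for p, n in zip(mu_p, mu_n)]
    mid = [(p + n) / 2 for p, n in zip(mu_p, mu_n)]
    b = -sum(wi * mi for wi, mi in zip(w, mid))
    return w, b


def mil_train(bags, labels, n_iter=10):
    """Alternate between latent instance selection and classifier training."""
    neg = [x for bag, y in zip(bags, labels) if y == 0 for x in bag]
    # Initialisation: treat every instance of a positive bag as positive.
    pos = [x for bag, y in zip(bags, labels) if y == 1 for x in bag]
    w, b = train_centroid_classifier(pos, neg)
    for _ in range(n_iter):
        # Latent inference: keep the top-scoring instance of each positive bag.
        new_pos = [max(bag, key=lambda x: score(x, w, b))
                   for bag, y in zip(bags, labels) if y == 1]
        if new_pos == pos:
            break  # the selected instances no longer change
        pos = new_pos
        # Training step on the same data that was just used for inference.
        w, b = train_centroid_classifier(pos, neg)
    return w, b


def predict_bag(bag, w, b):
    """A bag is predicted positive if its best instance scores above zero."""
    return int(max(score(x, w, b) for x in bag) > 0)


# Toy 1-D data: large values play the role of the hidden positive instances.
bags = [[[0.1], [0.2], [3.0]], [[2.8], [0.0]], [[0.0], [0.3]], [[0.2], [0.1]]]
labels = [1, 1, 0, 0]
w, b = mil_train(bags, labels)
```

Because the same classifier both selects the positive instances and is trained on that selection, an early wrong choice can reinforce itself, which is the overfitting risk mentioned above.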
The thesis proposes a novel latent structured representation to discover instances of action classes in videos and to jointly train an action classifier on them. An action instance typically occupies only part of a video, and that part is not annotated in weakly labeled training videos. Therefore, multiple instance learning is used to find these latent action instances in training videos and to jointly train the action classifier. The thesis proposes a sequential approach to multiple instance learning to increase the robustness of the training.
For the interpretation of crowded scenes, it is important to detect all irregular objects or actions in a video. However, abnormality detection is hindered by the fact that the training set contains no abnormal samples, so abnormalities must be found in a test video without knowing in advance what they are. To address this problem, the thesis proposes a probabilistic graphical model for video parsing that searches for latent object hypotheses that jointly explain all the foreground pixels and, at the same time, are well matched to the normal training samples. By inferring all latent normal hypotheses in a video, the model finds abnormalities indirectly, as those hypotheses that are not supported by normal samples but are still needed to explain the foreground. Video parsing is applied sequentially to individual video frames, where hypotheses are jointly inferred by a local search in a graphical model. The thesis then proposes a spatio-temporal extension of video parsing, for which an efficient inference method based on convex optimization is developed to find abnormal and normal spatio-temporal hypotheses in the video.
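The idea that an abnormality is a hypothesis poorly supported by normal training samples can be illustrated with a much simpler stand-in than the graphical model described above. The sketch below scores each hypothesis by its distance to the nearest normal sample and flags it when that distance exceeds a threshold. The feature vectors, the threshold, and the function names are assumptions made for illustration, and the sketch omits the joint explanation of foreground pixels that the thesis model performs.

```python
import math


def nearest_normal_distance(hypothesis, normal_samples):
    """Euclidean distance from a hypothesis to its closest normal sample."""
    return min(math.dist(hypothesis, s) for s in normal_samples)


def flag_abnormal(hypotheses, normal_samples, threshold):
    """Flag hypotheses that no normal sample supports within the threshold."""
    return [nearest_normal_distance(h, normal_samples) > threshold
            for h in hypotheses]


# Toy 2-D descriptors; values and threshold are purely illustrative.
normal_samples = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
hypotheses = [(0.1, 0.1), (5.0, 5.0)]
flags = flag_abnormal(hypotheses, normal_samples, threshold=1.0)
```

Here the second hypothesis is flagged because it lies far from every normal sample, while the first is treated as normal.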
|Supervisor:||Ommer, Prof. Dr. Björn|
|Date of thesis defense:||22 July 2014|
|Date Deposited:||29 Jul 2014 07:39|
|Faculties / Institutes:||The Faculty of Mathematics and Computer Science > Department of Computer Science; Service facilities > Interdisciplinary Center for Scientific Computing; Service facilities > Heidelberg Collaboratory for Image Processing (HCI)|
|Subjects:||004 Data processing Computer science|