In this paper a method for extracting mid-level semantics from sign language videos is proposed, employing high-level domain knowledge. The semantics concern the labeling of the depicted objects of the head and the right/left hand, as well as of occlusion events, which are essential for interpretation and therefore for subsequent higher-level semantic indexing. Initially, the low-level skin-segment descriptors are extracted after face detection and color modeling. Then the respective labels are assigned to the segments. Occlusions between the hands, between the head and the hands, and between the body and the hands can easily confuse extractors and thus lead to wrong interpretation. Therefore, a Bayesian network is employed to bridge, in a probabilistic fashion, the gap between the high-level knowledge about the valid spatiotemporal configurations of the human body and the extractor. The approach is applied here to sign-language videos, but it can be generalized to any other situation where semantically rich information...
Dimitrios I. Kosmopoulos, Ilias Maglogiannis