A real-time speaker localization and detection system for videoconferencing environments is presented. The system uses a recently proposed modified Steered Response Power - Phase Transform (SRP-PHAT) algorithm as its core processing scheme. The modified SRP-PHAT functional has been shown to provide robust localization performance in indoor environments without requiring a very fine spatial grid, which reduces the computational cost of a practical implementation. Moreover, it has been demonstrated that the statistical distribution of the location estimates obtained while a speaker is active can be used to discriminate between speech and non-speech frames by means of a peakedness criterion. As a result, talking participants can be both detected and located with significant accuracy within a common processing framework.
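The two ideas summarized above can be illustrated with a minimal sketch. It is not the system's implementation: it shows only the conventional PHAT-weighted cross-correlation between two microphone frames (the building block of standard SRP-PHAT, not the modified functional), and it uses excess kurtosis as one possible peakedness measure, which is an assumption since the abstract does not name the measure. The function names `gcc_phat`, `estimate_delay` and `peakedness` are hypothetical.

```python
import numpy as np


def gcc_phat(x1, x2, max_lag):
    """PHAT-weighted cross-correlation of two frames (illustrative only).

    Returns lags in samples and the correlation values; a peak at a
    positive lag means x2 is a delayed copy of x1.
    """
    n = len(x1) + len(x2)  # zero-pad to avoid circular wrap-around
    X1 = np.fft.rfft(x1, n=n)
    X2 = np.fft.rfft(x2, n=n)
    cross = X2 * np.conj(X1)
    cross /= np.abs(cross) + 1e-12  # PHAT: keep phase, discard magnitude
    cc = np.fft.irfft(cross, n=n)
    # Reorder so that lags run from -max_lag to +max_lag.
    cc = np.concatenate((cc[-max_lag:], cc[:max_lag + 1]))
    lags = np.arange(-max_lag, max_lag + 1)
    return lags, cc


def estimate_delay(x1, x2, max_lag):
    """Lag (in samples) at which the PHAT-weighted correlation peaks."""
    lags, cc = gcc_phat(x1, x2, max_lag)
    return int(lags[np.argmax(cc)])


def peakedness(estimates):
    """Excess kurtosis of a set of estimates (assumed peakedness measure).

    Larger values indicate a distribution concentrated around one value,
    as expected when a single speaker is active.
    """
    e = np.asarray(estimates, dtype=float)
    centered = e - e.mean()
    var = centered.var()
    if var == 0:
        return float("inf")  # all estimates identical
    return float(np.mean(centered ** 4) / var ** 2 - 3.0)
```

In a detector built along these lines, the estimates collected over a short window would be labeled as speech when `peakedness` exceeds a threshold; that threshold would have to be tuned on recorded data and is not specified here.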
Amparo Marti, Maximo Cobos, José J. Ló