Multimodal surveillance systems using visible/IR cameras and other sensors are widely deployed today for security purposes, particularly when subjects are at a large distance. However, audio information, an important data source, has not been well explored. One reason is that audio detection using microphones requires installation close to the subjects being monitored. In this paper, we investigate a novel “optical” sensor, the Laser Doppler Vibrometer (LDV), for capturing voice signals at very long range to realize a truly remote, multimodal surveillance system. Speech enhancement approaches are studied based on the characteristics of LDV audio. Experimental results show that remote voice detection via an LDV is promising when appropriate targets close to the human subjects are chosen in the environment.