In this paper we propose the multimodal fusion of super-resolved texture (SRT) features and 3D shape features with acoustic features for 3D audio-video person authentication systems with liveness checks. The proposed SRT features capture information about non-rigid variations in speaking faces, such as expression lines, gestures, and wrinkles, which improves the performance of the system against impostor and spoof attacks. Experiments with multimodal fusion of acoustic, SRT, and 3D shape features on two speaking-face corpora, VidTIMIT and AVOZES, yielded equal error rates (EERs) of less than 0.5% for impostor and type-1 replay attacks (still photo and pre-recorded audio) and less than 3% for more complex type-2 replay attacks (pre-recorded video or fake CG-animated video).
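
For readers unfamiliar with the metric, the EER is the operating point at which the false acceptance rate (FAR) equals the false rejection rate (FRR). The following is a minimal sketch of how an EER could be estimated from match scores; it is not the evaluation code used in the experiments. The function name `equal_error_rate`, the convention that higher scores indicate a genuine client, and the toy scores are all assumptions made for illustration.

```python
import numpy as np

def equal_error_rate(genuine, impostor):
    """Estimate the EER from two sets of similarity scores.

    Assumption: a higher score means the claim is more likely genuine,
    and a claim is accepted when score >= threshold.
    Returns (eer, threshold) at the point where FAR and FRR are closest.
    """
    genuine = np.asarray(genuine, dtype=float)
    impostor = np.asarray(impostor, dtype=float)

    # Candidate thresholds: every distinct observed score.
    thresholds = np.unique(np.concatenate([genuine, impostor]))

    # FAR: fraction of impostor (or replay-attack) scores that are accepted.
    far = np.array([(impostor >= t).mean() for t in thresholds])
    # FRR: fraction of genuine scores that are rejected.
    frr = np.array([(genuine < t).mean() for t in thresholds])

    # With finite data FAR and FRR rarely cross exactly, so take the
    # threshold where they are closest and report their average.
    i = int(np.argmin(np.abs(far - frr)))
    return (far[i] + frr[i]) / 2.0, thresholds[i]

# Toy scores, invented for illustration only (not results from the corpora).
genuine_scores = [0.9, 0.8, 0.7, 0.6]
impostor_scores = [0.1, 0.2, 0.3, 0.65]
eer, threshold = equal_error_rate(genuine_scores, impostor_scores)
print(f"EER = {eer:.2%} at threshold {threshold}")
```

In a replay-attack evaluation of the kind described above, the `impostor` argument would hold the scores produced by the attack trials (type-1 or type-2), so a separate EER can be reported per attack type.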