The paper considers the problem of audio-visual speech recognition in a simultaneous-speaker (target/masker) environment. It follows a conventional multistream approach and examines the specific problem of estimating reliable time-varying audio and visual stream weights. The task is challenging because, in the two-speaker condition, signal-to-noise ratio (SNR)