In this paper, we present a multisensor multiband energy tracking scheme for robust feature extraction in noisy environments. We introduce a multisensor feature extraction algorithm that combines the spatial and frequency information contained in the speech signals captured by a microphone array. The algorithm is based on estimating cross-energies across multiple sensors and minimizing an error term caused by noise, and we provide the corresponding noise analysis. Automatic Speech Recognition (ASR) experiments at various SNR levels demonstrate that the proposed front-end outperforms alternative schemes, especially in noisy conditions.
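The abstract does not specify the energy operator. The sketch below assumes the symmetric discrete Teager-Kaiser cross-energy, Ψ(x, y)[n] = x[n]y[n] - ½(x[n-1]y[n+1] + x[n+1]y[n-1]), to illustrate why cross-energies over multiple sensors help. Noise that is independent across microphones contributes zero cross-energy in expectation. The auto-energy of a single noisy channel, in contrast, is biased upward by roughly the noise variance when the noise is white. The function names and the rule of averaging over all sensor pairs are hypothetical choices for illustration, not the paper's estimator. The multiband (filterbank) stage and the error-minimization step are omitted.

```python
import numpy as np


def cross_energy(x, y):
    """Symmetric discrete Teager-Kaiser cross-energy (assumed form).

    Psi(x, y)[n] = x[n] y[n] - 0.5 * (x[n-1] y[n+1] + x[n+1] y[n-1]).
    The two boundary samples are dropped, so the output has len(x) - 2 samples.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return x[1:-1] * y[1:-1] - 0.5 * (x[:-2] * y[2:] + x[2:] * y[:-2])


def multisensor_energy(channels):
    """Average cross-energy over all distinct sensor pairs (i < j).

    Hypothetical combination rule: only pairs of *different* sensors are used,
    so noise that is independent across sensors has zero mean contribution.
    """
    channels = [np.asarray(c, dtype=float) for c in channels]
    m = len(channels)
    pairs = [(i, j) for i in range(m) for j in range(i + 1, m)]
    total = sum(cross_energy(channels[i], channels[j]) for i, j in pairs)
    return total / len(pairs)


# Demo: one unit-amplitude tone observed by four microphones,
# each corrupted by independent white Gaussian noise (sigma = 0.5).
rng = np.random.default_rng(0)
n = np.arange(20000)
omega = 0.3 * np.pi
clean = np.cos(omega * n)
mics = [clean + 0.5 * rng.standard_normal(n.size) for _ in range(4)]

true_energy = np.sin(omega) ** 2                 # A^2 sin^2(omega) with A = 1
single = cross_energy(mics[0], mics[0]).mean()   # biased upward by ~sigma^2
multi = multisensor_energy(mics).mean()          # noise-noise terms average out
print(f"true={true_energy:.3f} single={single:.3f} multi={multi:.3f}")
```

For a clean cosine A cos(Ωn + φ), the operator returns the constant A² sin²(Ω). In the demo, the single-channel estimate sits near that value plus the noise variance (about 0.25), while the pairwise multisensor average stays close to the clean value.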