The use of psychoacoustic masking models in audio coding applications has been widespread over the past decades. In such applications, it is typically assumed that the original input signal acts as a masker for the distortions introduced by the lossy coding method. Up to now, these masking models have mostly been based on spectral masking. In this paper, we propose a new perceptual model for audio and speech processing algorithms based on spectro-temporal masking. A sophisticated perceptual model is simplified such that the resulting distortion measure can be written as a frequency-weighted ℓ2-norm. This yields the same computational complexity as conventional spectrum-based methods while preserving the temporal fine structure of the clean signal. It is shown that the new model successfully avoids pre-echoes and correctly predicts masking curves for various maskers.
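As a minimal sketch of what a frequency-weighted ℓ2-norm distortion measure generally looks like (the symbols below are illustrative and not the paper's specific notation or derivation): with $X_k$ the transform coefficients of the clean signal, $\hat{X}_k$ those of the processed signal, and $w_k$ perceptual weights derived from the masking model, the distortion takes the form

$$ D = \sum_{k} w_k \left| X_k - \hat{X}_k \right|^2 . $$

Since the weights enter only as a per-coefficient scaling, evaluating such a measure costs no more than a conventional spectral distortion measure, consistent with the complexity claim above.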