Recently, there has been increasing interest in video-based interface techniques, which allow more natural interaction between users and systems than common interface devices do. Here, we present a neural architecture for user localisation, embedded within a complex system for visually-based human-machine interaction (HMI). Localisation of the user is an absolute prerequisite for video-based HMI. Since the main objective is the greatest possible robustness of the localisation, and of the whole visual interface, under highly varying environmental conditions, we propose a multiple-cue approach. This approach combines the features facial structure, head-shoulder contour, skin color, and motion within a multiscale representation. The image region most likely to contain a potential user is then selected via a winner-take-all (WTA) process within the multiscale representation. Preliminary results demonstrate the reliability of the multiple-cue approach.
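
For illustration only, the cue fusion and selection step could be sketched as below. This is a minimal sketch under stated assumptions, not the neural implementation described in the paper: the function fuse_and_select, the uniform cue weighting, and the replacement of the neural WTA dynamics by a simple arg-max over the combined multiscale representation are all assumptions introduced here.

    # Hypothetical sketch: fuse normalised cue maps (facial structure,
    # head-shoulder contour, skin color, motion) at several scales and pick
    # the most likely user position by a simple winner-take-all (arg-max).
    # This stands in for, and does not reproduce, the neural WTA process.
    import numpy as np

    def fuse_and_select(cue_maps_per_scale, weights=None):
        """cue_maps_per_scale: list over scales; each entry is a list of 2D
        arrays (one per cue), normalised to [0, 1] and equally sized within
        a scale. Returns the winning score, scale index, and (row, col)."""
        best = (-np.inf, None, None)
        for s, cue_maps in enumerate(cue_maps_per_scale):
            w = weights or [1.0 / len(cue_maps)] * len(cue_maps)
            combined = sum(wi * m for wi, m in zip(w, cue_maps))
            idx = np.unravel_index(np.argmax(combined), combined.shape)
            if combined[idx] > best[0]:
                best = (combined[idx], s, idx)
        return best

    # Usage with random stand-in cue maps at two scales
    rng = np.random.default_rng(0)
    pyramid = [[rng.random((60, 80)) for _ in range(4)],
               [rng.random((30, 40)) for _ in range(4)]]
    score, scale, pos = fuse_and_select(pyramid)
    print(f"most likely user region: scale {scale}, position {pos}, score {score:.2f}")

In such a sketch, the multiscale representation simply lets the strongest combined response compete across scales as well as across image positions, so a single maximum determines both the location and the approximate size of the candidate user region.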