Speaker diarization aims to automatically answer the question "who spoke when" given a speech signal. In this work, we have focused on applying the FLsD approach, a semi-supervised version of Fisher Linear Discriminant analysis, to both the audio and the video signals in order to form a complete multimodal speaker diarization system. Extensive experiments have shown that the FLsD method boosts the performance of the face diarization task (i.e. the task of discovering faces over time given only the visual signal). In addition, we have shown through experimentation that applying the FLsD method to discriminate between faces is independent of the initial feature space and remains relatively unaffected as the number of faces increases. Finally, a fusion method is proposed that improves performance over the best individual modality, which is the audio signal.

Keywords Speaker diarization · FLsD · FLD · Audio-visual fusion
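To make the core idea concrete, the following is a minimal sketch of a semi-supervised Fisher-discriminant pipeline in the spirit of FLsD: a discriminant projection is fitted only on a confidently labeled subset, all samples are projected into the learned subspace, and the projected points are clustered. The synthetic data, dimensions, labeling fraction, and use of scikit-learn's `LinearDiscriminantAnalysis` and `KMeans` are illustrative assumptions, not the paper's actual implementation.

```python
# Illustrative sketch of a semi-supervised discriminant projection
# followed by clustering; names and data are assumptions, not the
# paper's code.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Synthetic "face embeddings": 3 identities, 60-dimensional features.
n_per, dim = 100, 60
means = rng.normal(0, 5, size=(3, dim))
X = np.vstack([rng.normal(m, 3.0, size=(n_per, dim)) for m in means])
y = np.repeat(np.arange(3), n_per)

# Semi-supervised step: labels are assumed known only for a small,
# confident subset (e.g. segments where a single face is tracked).
labeled = rng.random(len(y)) < 0.2
lda = LinearDiscriminantAnalysis(n_components=2)
lda.fit(X[labeled], y[labeled])

# Project ALL samples into the discriminant subspace, then cluster
# the projections to obtain diarization-style identity clusters.
Z = lda.transform(X)
clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(Z)
```

In this sketch, the discriminant subspace is learned from the small labeled portion and then applied to the entire stream, which is what makes the approach semi-supervised rather than fully supervised.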