An intrusion detection system (IDS) usually has to analyse Giga-bytes of audit information. In the case of anomaly IDS, the information is used to build a user profile characterising normal behaviour. Whereas for misuse IDSs, it is used to test against known attacks. Probabilistic methods, e.g. hidden Markov models, have proved to be suitable to profile formation but are prohibitively expensive. To bring these methods into practise, this paper aims to reduce the audit information by folding up subsequences that commonly occur within it. Using n-grams language models, we have been able to successfully identify the n-grams that appear most frequently. The main contribution of this paper is a n-gram extraction and identification process that significantly reduces an input log file keeping key information for intrusion detection. We reduced log files by a factor of 3.6 in the worst case and 4.8 in the best case. We also tested reduced data using hidden Markov models (HMMs) for intrus...