Text compression algorithms are normally defined in terms of a source alphabet of 8-bit ASCII codes. We consider instead choosing the source alphabet to be one whose symbols are the words of English or, more generally, alternating maximal strings of alphanumeric and non-alphanumeric characters. A compression algorithm operating on such an alphabet can take advantage of longer-range correlations between words and thus achieve better compression. The large size of this alphabet leads to some implementation problems, but these are overcome to construct word-based LZW, word-based Adaptive Huffman, and word-based Context Modelling compression algorithms.
R. Nigel Horspool, Gordon V. Cormack
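
As an illustration of the word-based alphabet described in the abstract, the following is a minimal sketch of splitting text into alternating maximal strings of alphanumeric and non-alphanumeric characters. The function name `split_into_symbols` and the choice of ASCII letters and digits as the "alphanumeric" class are assumptions made for this example, not details taken from the paper.

```python
import re
from typing import List

# Assumption: "alphanumeric" means ASCII letters and digits.
# Each match is a maximal run of one class, so successive matches alternate.
_SYMBOL_RE = re.compile(r"[A-Za-z0-9]+|[^A-Za-z0-9]+")


def split_into_symbols(text: str) -> List[str]:
    """Split text into alternating maximal alphanumeric / non-alphanumeric strings."""
    return _SYMBOL_RE.findall(text)


if __name__ == "__main__":
    sample = "The cat, the hat."
    symbols = split_into_symbols(sample)
    print(symbols)
    # The split is lossless: concatenating the symbols restores the text.
    assert "".join(symbols) == sample
```

Because every character belongs to exactly one run, the symbol sequence can be concatenated to recover the original text, so a compressor working on these symbols need not store any separate spacing information.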