The increasing importance of Unicode for text files, for example with Java and in some modern operating systems, implies a possible doubling of data storage space and data transmission time, with a corresponding need for data compression. However, it is not clear that data compressors designed for 8-bit byte data are well matched to 16-bit Unicode data. This paper investigates the compression of Unicode files, using a variety of established data compressors on a mix of genuine and artificial Unicode files. It is found that while Ziv-Lempel and unbounded-context compressors work well, finite-context compressors are less satisfactory on Unicode. Tests with a simple special-purpose compressor intended for 16-bit data show that it may be useful to design compressors specifically for Unicode files.
Peter M. Fenwick, Simon Brierley
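
The doubling of storage mentioned in the abstract can be illustrated with a minimal sketch. It is not one of the compressors or test files used in the paper: it assumes Python's standard `zlib` module (DEFLATE, a Ziv-Lempel derivative combined with Huffman coding) as a stand-in for a byte-oriented compressor, and the sample text and the helper name `sizes` are hypothetical choices made for illustration only.

```python
import zlib


def sizes(text: str) -> dict:
    """Return raw and zlib-compressed sizes for 8-bit and 16-bit encodings."""
    narrow = text.encode("latin-1")    # one byte per character
    wide = text.encode("utf-16-le")    # two bytes per BMP character, no byte-order mark
    return {
        "raw_8bit": len(narrow),
        "raw_16bit": len(wide),
        "zlib_8bit": len(zlib.compress(narrow, 9)),
        "zlib_16bit": len(zlib.compress(wide, 9)),
    }


# Hypothetical, highly repetitive sample text (not from the paper's corpus).
sample = "The quick brown fox jumps over the lazy dog. " * 200
result = sizes(sample)
for name, size in result.items():
    print(f"{name}: {size} bytes")
```

For text restricted to 8-bit characters, the 16-bit encoding is exactly twice the size before compression. How much of that overhead a given byte-oriented compressor recovers depends on the compressor and on the data, which is the question the paper examines.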