We present a new differential compression algorithm that combines the hash value techniques and suffix array techniques of previous work. Differential compression refers to encoding a file (a version file) as a set of changes with respect to another file (a reference file). Previous differential compression algorithms can be shown empirically to run in linear-time but they have certain drawbacks, namely they do not find the best matches for every offset of the version file. Our algorithm finds the best matches for every offset of the version file, with respect to a certain granularity (or block size) and above a certain length threshold. It has two variations depending on how we choose the block size. If we keep the block size fixed, we show that the compression performance of our algorithm is similar to that of the greedy algorithm, without the expensive space and time requirements. If we vary the block size linearly with the reference file size, we show that our algorithm can run in...
Ramesh C. Agarwal, Suchitra Amalapurapu, Shaili Ja