
I've finished a demonstration version of XDELTA and XPATCH for
experimenting with its value as a replacement for GNU diff in an
RCS-like file format.  Its algorithm does not use the
longest-common-subsequence approach found in GNU diff.  Instead, it
uses the block-copy approach.  Unlike Tichy, who has published a paper
on the (quite simple) algorithm using suffix trees to match strings in
the input and output, mine computes checksums from one of the files
and inserts them in a hash table.  It then computes checksums in the
other file and looks them up in the table.  The trick (found in rsync
and gzip) is to compute checksums efficiently, so that given the
checksum of TEXT[N..M] the checksum of TEXT[N+1..M+1] is easy to
compute.  Using this "rolling" checksum, blocks of text are matched up.

I have not myself tested the suffix-tree approach.  Jean-loup Gailly,
author of zlib, gzip, and _The Data Compression Book_, said he had run
tests on suffix-tree implementations of this type of thing and that
they only win when the input is highly redundant.
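
To make the rolling-checksum idea above concrete, here is a minimal
sketch in Python.  It assumes an rsync-style pair of sums (a plain sum
and a position-weighted sum); the modulus, the function names, and the
choice to index one file at block boundaries are illustrative
assumptions, not details taken from the xdelta source.

```python
# Illustrative only: an rsync-style weak checksum, not xdelta's own code.
MOD = 1 << 16  # assumed modulus


def checksum(block):
    """Compute the pair (a, b) for a block from scratch, O(len(block))."""
    a = b = 0
    n = len(block)
    for i, byte in enumerate(block):
        a = (a + byte) % MOD
        b = (b + (n - i) * byte) % MOD  # first byte has weight n, last has 1
    return a, b


def roll(a, b, out_byte, in_byte, n):
    """Slide a window of length n one byte to the right in O(1)."""
    a = (a - out_byte + in_byte) % MOD
    b = (b - n * out_byte + a) % MOD
    return a, b


def find_matches(src, dst, n):
    """Return (dst_offset, src_offset) pairs where n-byte blocks agree."""
    # Index src at block boundaries (an assumption; the post does not say
    # which file is indexed or at what stride).
    table = {}
    for off in range(0, len(src) - n + 1, n):
        table.setdefault(checksum(src[off:off + n]), off)
    matches = []
    if len(dst) < n:
        return matches
    a, b = checksum(dst[:n])
    pos = 0
    while True:
        off = table.get((a, b))
        # Confirm byte-for-byte, since different blocks can share a checksum.
        if off is not None and src[off:off + n] == dst[pos:pos + n]:
            matches.append((pos, off))
        if pos + n >= len(dst):
            break
        a, b = roll(a, b, dst[pos], dst[pos + n], n)
        pos += 1
    return matches
```

Only the first window of the second file is summed from scratch; every
later window costs a constant number of operations, which is what makes
a lookup at every byte offset affordable.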

A first attempt might be to pick some length L and, using the checksums
I've described, match up blocks of text and output them.  The problem
with this is that if L is too large it will miss shorter matches, and
if L is too small, it's possible to miss the fact that a larger match
was available that could replace several smaller copies with a single
copy.  Further, if L is too small, hash collisions or identical
strings in the input will cause a slow-down.  To avoid this problem, I
use multiple values of L.  Starting with an initial value of L, all
matches of length L are found.  Each region in between matches of
length L is then recursively searched for matches of length L/2.
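
The halving scheme can be sketched as follows.  This is a simplified
illustration, not the xdelta implementation: a dict keyed on the raw
block bytes stands in for the checksum hash table, the table is rebuilt
at every level for brevity, and the names match_region, find_copies and
n_min are hypothetical.  It also leaves out any cap on match failures.

```python
def match_region(src, dst, lo, hi, n, n_min, out):
    """Find n-byte matches in dst[lo:hi], then search gaps with n // 2."""
    if n < n_min or hi - lo < n:
        return
    # Stand-in for the checksum table: every n-byte block of src.
    table = {}
    for off in range(len(src) - n + 1):
        table.setdefault(src[off:off + n], off)
    pos = gap_start = lo
    while pos + n <= hi:
        off = table.get(dst[pos:pos + n])
        if off is None:
            pos += 1
            continue
        # The unmatched stretch before this match gets the smaller length.
        match_region(src, dst, gap_start, pos, n // 2, n_min, out)
        out.append((pos, off, n))  # copy n bytes from src[off] to dst[pos]
        pos += n
        gap_start = pos
    # Trailing gap after the last match at this level.
    match_region(src, dst, gap_start, hi, n // 2, n_min, out)


def find_copies(src, dst, n=16, n_min=4):
    """Return (dst_offset, src_offset, length) copies in dst order."""
    out = []
    match_region(src, dst, 0, len(dst), n, n_min, out)
    return out
```

Long matches are claimed first at the large block size, so a smaller
block size is only ever tried on the stretches that the larger one
could not cover.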

Another neat property of the checksum used is that given the checksums
for TEXT[N..N+D], TEXT[N+D+1..N+2D], ... TEXT[N+(L-1)D+1..N+LD], it is
possible to compute the checksum for TEXT[N..N+LD] in O(L) operations.
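
As an illustration of that combining property, the sketch below folds
the checksums of adjacent blocks into the checksum of their
concatenation, one step per block.  It assumes the same kind of
sum/weighted-sum pair as an rsync-style checksum; the modulus and the
function names are assumptions for the example, not xdelta's.

```python
MOD = 1 << 16  # assumed modulus


def checksum(block):
    """Pair (a, b): plain sum and position-weighted sum of the bytes."""
    a = b = 0
    n = len(block)
    for i, byte in enumerate(block):
        a = (a + byte) % MOD
        b = (b + (n - i) * byte) % MOD
    return a, b


def combine(parts):
    """Fold (a, b, length) triples of adjacent blocks, left to right."""
    a = b = total = 0
    for part_a, part_b, part_len in parts:
        # Every byte already seen gains part_len in weight once part_len
        # more bytes follow it, which adds part_len * a to the weighted sum.
        b = (b + part_len * a + part_b) % MOD
        a = (a + part_a) % MOD
        total += part_len
    return a, b, total
```

The loop never touches the underlying bytes, so the cost depends only
on the number of blocks being joined, not on their size.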

The obvious problem with this algorithm is that when no matches are
found the time complexity becomes exponential.  To prevent this,
I've capped the number of match failures at an experimentally chosen
3.  Suppose C matches are found for an input of length L with a
minimum match length of M, where M*C < L.  It took O(L) time to find
the matches.  Now further suppose that the matches are distributed
evenly so that the C-1 or C+1 gaps to be recursively examined are
approximately even in length.  If T(L) is the time to examine length
L, then the recurrence relation is (ignoring the +- 1, with integer
breakage):

	T(L) = C T(L/C - M) + O(L)

This is clearly less than

	T'(L) = C T'(L/C) + O(L)

which according to the master theorem runs in time O(L log_C(L)).  As
stated before, when C=0 problems arise.
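
For reference, the master-theorem step can be spelled out as below.
The symbols a, b and f are the theorem's conventional names rather than
anything from xdelta, and C is treated as a constant of at least 2.

```latex
\begin{align*}
T'(L) &= a\,T'(L/b) + f(L), \qquad a = C,\quad b = C,\quad f(L) = O(L) \\
L^{\log_b a} &= L^{\log_C C} = L \\
T'(L) &= O\!\left(L^{\log_b a}\,\log L\right) = O(L \log_C L)
\end{align*}
```

Put differently, the recursion is about log_C(L) levels deep and each
level does O(L) work in total.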

There is preprocessing time associated with computing the checksums.
Setting a range for match lengths to be between A and B, where A=2^a
and B=2^b, where the FROM input has length F, the preprocessing takes
time and space O(F (lg (F/B) - a)).

I've compared it against GNU diff, which uses a few heuristics on an
algorithm by E. Myers.  If N is the sum of the input lengths and D is
the length of the shortest edit script between the inputs, his
algorithm is O(N) space and worst case O(ND) time, with an expected
O(N+D^2) time for "similar" inputs.  With the heuristic Paul Eggert
(GNU diff author) applies, he claims the algorithm is O(N**1.5 log N)
and expected to be linear when files are "similar", at the cost of
suboptimal output for large files.  The problem with comparing times
between the two programs is that GNU diff uses lines as the atom being
compared, whereas XDELTA uses bytes.  As a result, though the times may
be similar for similar atom counts, GNU diff performs much faster for
the same input because it breaks the file into lines.

The one improvement I can think of is to limit the length processed.
Files larger than some limit (perhaps a megabyte) would be processed
as a sequence of chunks.  This produces bigger differences when
similar regions cross the chunk boundaries.
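
A rough sketch of what such chunking might look like, and of why it
costs output size.  The post does not say how chunks of the two files
would be paired; pairing them at equal offsets is an assumption made
here for illustration, as are the function name and the default limit.

```python
def chunk_pairs(src, dst, limit=1 << 20):
    """Yield (src_chunk, dst_chunk) pairs taken at the same offsets."""
    for off in range(0, max(len(src), len(dst)), limit):
        yield src[off:off + limit], dst[off:off + limit]


# A two-byte insertion at the front shifts everything after it, so the
# shared text "cd" now sits in a different chunk than in the source and
# can no longer be expressed as a copy within its own pair.
pairs = list(chunk_pairs(b"abcdefgh", b"XXabcdefgh", limit=4))
```

Each pair would then be handed to the delta routine separately, which
bounds the length processed at the price of missing cross-chunk copies.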

Here's some data; comments would be appreciated.  I'm planning on
writing a very, very stripped-down, simple replacement for the RCS file
format for use in a client-server version of PRCS.  I've placed it up
for ftp at

	ftp://ftp.xcf.berkeley.edu/pub/jmacd/xdelta-0.2.0.tar.gz

				TEST 1

Test #1 is run on the xdelta source tree containing 79 text files
ranging from 0 to 76k bytes, 0 to 2535 lines (note: xdelta is only 1100
lines of C source, plus glib, Makefiles, autoconf stuff, etc).  The
output of wc on the actual files is in appendix C.  The total size of
all files is 570204 bytes.

The diff programs were run on each of 79^2=6241 pairs of files with
the script in appendix A.  The result of just copying each file 78 times
amounts to 44475912 bytes.  The timing results are not particularly
accurate, as I was using the machine for other purposes while running
the tests.  They should give a rough idea of the differences.

xdelta:

took (234.82 real, 137.06 user, 69.36 sys)
generated 43523233 bytes

GNU diff:

took (177.17 real, 77.68 user, 78.47 sys)
generated 44188372 bytes

As you can see, for the typical comparisons where files differ
greatly, neither performs very well.  xdelta took 32% longer and
generated 1.5% fewer bytes' worth of differences.  Not very good as far
as time-space tradeoffs go, but a 32% cost in time is acceptable when
you consider the tradeoffs found in other tests.
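
The percentages quoted for test 1 can be rechecked from the raw figures
above.  The variable names are only for this example, and "real"
(wall-clock) time is assumed to be the basis of the 32% figure.

```python
# Figures copied from the TEST 1 results.
xdelta_real, diff_real = 234.82, 177.17        # seconds, wall clock
xdelta_bytes, diff_bytes = 43523233, 44188372  # total delta output
total_size = 570204                            # all 79 files together

copy_bytes = total_size * 78                   # each file copied 78 times
slowdown = xdelta_real / diff_real - 1         # about 0.325
saving = 1 - xdelta_bytes / diff_bytes         # about 0.015

print(f"copy baseline: {copy_bytes} bytes")
print(f"xdelta time overhead: {slowdown:.1%}")
print(f"xdelta size saving:   {saving:.1%}")
```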

				TEST 2

The same test is now run on /sbin on my FreeBSD system:

68 files, only 59 of which are unique due to hard links:
size: 2k to 272k bytes
lines: 38 to 893
total size: 6825006 bytes
copy size: 457275402 bytes

The only difference is that I did not save the diffs; I counted them
using wc and an awk script (see appendix B).  The output of wc on the
actual files is in appendix D.

xdelta:

took (2297.96 real, 2052.10 user, 129.53 sys)
generated 236645893 bytes

GNU diff:

took (489.95 real, 298.72 user, 150.75 sys)
generated 420459709 bytes

This time GNU diff produced 78% more bytes' worth of differences and
took close to one-fifth as long as xdelta.

				TEST 3

79 source files selected from the PRCS source tree (appendix E), all
of which were present in the versions 1.0.0, 1.0.9, and 1.1.0b5.
Between 1.0.0 and 1.0.9 only minor modifications and bug-fixes were
made.  Between 1.0.9 and 1.1.0b5 major modifications were made to some
files.  The differences between files from 1.0.0 and 1.0.9:

xdelta:
took (3.66 real, 2.17 user, 1.41 sys)
generated 94976 bytes

GNU diff:
took (2.81 real, 1.24 user, 1.51 sys)
generated 125389 bytes

The differences between files from 1.0.9 and 1.1.0b5:

xdelta:
took (3.69 real, 2.23 user, 1.36 sys)
generated 170385 bytes

GNU diff:
took (3.14 real, 1.53 user, 1.53 sys)
generated 246348 bytes

				TEST 4

Each version of PRCS was compiled.  The same CFLAGS and CXXFLAGS were
used for each.  The 37 object files were tested as in TEST 3 (appendix
F for file names).  Debugging flags were on.  The
differences between 1.0.0 and 1.0.9:

xdelta:
took (25.45 real, 23.98 user, 0.97 sys)
generated 2234377 bytes

GNU diff:
took (3.72 real, 2.42 user, 1.20 sys)
generated 4207450 bytes

The differences between 1.0.9 and 1.1.0b5:

xdelta:
took (33.20 real, 31.50 user, 1.04 sys)
generated 2751271 bytes

GNU diff:
took (4.07 real, 2.61 user, 1.37 sys)
generated 4698744 bytes

The prcs binary itself, which is around 5M (I separated this to point
out the logarithmic behaviour), from 1.0.0 to 1.0.9:

xdelta:
took (71.22 real, 68.05 user, 0.30 sys)
generated 2222125 bytes

GNU diff:
took (2.52 real, 1.21 user, 0.36 sys)
generated 4297233 bytes

From 1.0.9 to 1.1.0b5:

xdelta:
took (71.28 real, 67.00 user, 0.41 sys)
generated 3695807 bytes

GNU diff:
took (2.87 real, 1.40 user, 0.37 sys)
generated 4831158 bytes

Two FreeBSD kernels, one from 2.2-BETA, the other from a 3.0-SNAP.
Differences from kernel.ORIG to kernel:

kernel:      1084325 bytes
kernel.ORIG: 1020999 bytes

xdelta:
took (11.09 real, 10.69 user, 0.10 sys)
generated 386068 bytes

GNU diff:
took (0.75 real, 0.38 user, 0.09 sys)
generated 1042491 bytes


			      APPENDICES

A: script used for test 1
#!/bin/sh

for i in `find $1 -type f`; do
  for j in `find $1 -type f`; do
    echo diffing $i $j
#    diff -a --rcs $i $j >> diff.diffs
    xdelta $i $j >> xdelta.diffs
  done
done

B: scripts used for test 2
#!/bin/sh

for i in `find $1 -type f`; do
  for j in `find $1 -type f`; do
#    diff -a --rcs $i $j | wc
    xdelta $i $j | wc
  done
done

BEGIN { C = 0; }
{
   C += $3
}
END { print C " bytes"; }

C: test #1 files:
       0       0       0 ../test//AUTHORS
       0       0       0 ../test//COPYING
       0       0       0 ../test//ChangeLog
       0       0       0 ../test//INSTALL
       0       0       0 ../test//NEWS
       0       0       0 ../test//README
       0       0       0 ../test//glib/COPYING
       0       0       0 ../test//glib/INSTALL
       0       0       0 ../test//glib/NEWS
       0       0       0 ../test//glib/README
       1       0       1 ../test//.deps/.P
       1       1      10 ../test//glib/stamp-h
       1       1      10 ../test//glib/stamp-h.in
       1       3      37 ../test//glib/AUTHORS
       3      12     169 ../test//.deps/myers.P
       4      16     241 ../test//.deps/lcs.P
       5       5      80 ../test//file1
       5       5      80 ../test//file2
       5      19     284 ../test//.deps/cm.P
       8      28     141 ../test//test
       9      32     523 ../test//.deps/xdelta.P
       9      32     526 ../test//.deps/xpatch.P
      10      46     383 ../test//glib/ChangeLog
      10      55     398 ../test//.xdelta.prcs_aux
      16      29     227 ../test//Makefile.am
      16      48     327 ../test//glib/libglib.la
      22     154    1100 ../test//config.log
      23      70     626 ../test//xdelta.prj
      31      72     524 ../test//glib/Makefile.am
      35      62     614 ../test//configure.in
      36      91     729 ../test//glib/mkinstalldirs
      36      91     729 ../test//mkinstalldirs
      39     163    1932 ../test//config.cache
      39     163    1932 ../test//glib/config.cache
      43     254    1976 ../test//glib/config.log
      44     145    1293 ../test//aclocal.m4
      60     317    2072 ../test//glib/acconfig.h
      61     201    1272 ../test//glib/gprimes.c
      65     232    1460 ../test//glib/gconfig.h.in
      66     275    1625 ../test//glib/gconfig.h
      85     249    2064 ../test//glib/configure.in
      93     235    2266 ../test//xdelta.h
      97     371    2747 ../test//libtool
     101     392    2878 ../test//glib/libtool
     119     331    2497 ../test//glib/gtimer.c
     125     245    2210 ../test//glib/garray.c
     137     332    2352 ../test//xpatch.c
     165     624    4809 ../test//glib/aclocal.m4
     169     544    5251 ../test//config.status
     211     572    5218 ../test//glib/gcache.c
     236     705    5154 ../test//glib/gerror.c
     238     740    4772 ../test//glib/install-sh
     238     740    4773 ../test//install-sh
     245     633    5355 ../test//xdelta.c
     277     882    6316 ../test//glib/testglib.c
     295     927   10551 ../test//glib/config.status
     324     677    5458 ../test//glib/gslist.c
     349     742    6262 ../test//glib/glist.c
     389    1107   10711 ../test//glib/Makefile.in
     389    1112   10780 ../test//glib/Makefile
     391    1183   11007 ../test//Makefile.in
     391    1189   11080 ../test//Makefile
     428     986    9103 ../test//glib/ghash.c
     487    1127    9067 ../test//glib/gstring.c
     497    1689   13928 ../test//config.guess
     600    2080   17282 ../test//glib/config.guess
     607    1829   16566 ../test//cm.c
     642    2033   18929 ../test//glib/glib.h
     718    1672   15503 ../test//glib/gtree.c
     755    2169   16743 ../test//glib/gutils.c
     805    2338   20126 ../test//glib/gmem.c
     833    2255   17002 ../test//config.sub
     867    2387   17995 ../test//glib/config.sub
     888    2857   22948 ../test//ltconfig
     917    2933   23790 ../test//glib/ltconfig
    1472    4883   37322 ../test//ltmain.sh
    1549    5054   39280 ../test//glib/ltmain.sh
    1681    6781   52863 ../test//configure
    2535   10114   75925 ../test//glib/configure
   22049   70341  570204 total

D: test #2 files:
      38     306    1907 /sbin/nologin
     101    1041   40960 /sbin/comcontrol
     105    1066   40960 /sbin/mknod
     105    1149   40960 /sbin/nextboot
     109    1157   45056 /sbin/md5
     110    1089   40960 /sbin/modunload
     110    1102   40960 /sbin/clri
     112    1122   40960 /sbin/dumpon
     112    1141   45056 /sbin/badsect
     122    1295   49152 /sbin/modload
     126    1224   53248 /sbin/mount_devfs
     126    1224   53248 /sbin/mount_fdesc
     126    1224   53248 /sbin/mount_kernfs
     126    1224   53248 /sbin/mount_procfs
     126    1224   53248 /sbin/mount_std
     127    1172   53248 /sbin/mount_cd9660
     128    1170   49152 /sbin/swapon
     128    1178   53248 /sbin/mount_ext2fs
     129    1182   53248 /sbin/mount_lfs
     131    1220   53248 /sbin/mount_null
     132    1234   53248 /sbin/mount_union
     135    1360   57344 /sbin/adjkerntz
     136    1239   49152 /sbin/ldconfig
     143     572    3259 /sbin/scsiformat
     143    1459   57344 /sbin/ft
     146    1657   69632 /sbin/savecore
     149    1358   49152 /sbin/tunefs
     149    1439   61440 /sbin/nfsiod
     158    1421   61440 /sbin/mount_umap
     163    1542   61440 /sbin/dumplfs
     168    1759   57344 /sbin/fdisk
     172    1744   65536 /sbin/slattach
     179    1821   73728 /sbin/nfsd
     181    1613   61440 /sbin/dumpfs
     193    1818   73728 /sbin/startslip
     194    1718   73728 /sbin/mount
     224    2365   94208 /sbin/dset
     238    2552   98304 /sbin/ccdconfig
     255    2100   73728 /sbin/scsi
     261    2780  102400 /sbin/newlfs
     268    3019  110592 /sbin/dmesg
     320    3539  122880 /sbin/mount_mfs
     320    3539  122880 /sbin/newfs
     342    3250  114688 /sbin/disklabel
     364    3591  143360 /sbin/mount_msdos
     379    3347  126976 /sbin/umount
     384    3788  143360 /sbin/quotacheck
     388    3644  143360 /sbin/shutdown
     398    3522  126976 /sbin/mount_nfs
     405    3577  126976 /sbin/rtquery
     440    3802  139264 /sbin/ifconfig
     449    3782  131072 /sbin/ipfw
     462    3811  131072 /sbin/ping
     474    3933  143360 /sbin/route
     520    4670  172032 /sbin/fastboot
     520    4670  172032 /sbin/fasthalt
     520    4670  172032 /sbin/halt
     520    4670  172032 /sbin/reboot
     549    5136  180224 /sbin/fsck
     555    5726  184320 /sbin/routed
     607    5406  208896 /sbin/mount_portal
     614    5459  204800 /sbin/mountd
     642    5895  200704 /sbin/init
     712    5815  204800 /sbin/dump
     712    5815  204800 /sbin/rdump
     799    7838  266240 /sbin/fsdb
     893    6687  221184 /sbin/restore
     893    6687  221184 /sbin/rrestore
   20265  185349 6825006 total

E: source files used for test 3
src/fnmatch.c
src/memcmp.c
src/vclex.c
src/utils.c
src/getopt.c
src/getopt1.c
src/md5c.c
src/maketime.c
src/partime.c
src/docs.c
src/include/quick.h
src/include/rebuild.h
src/include/repository.h
src/include/setkeys.h
src/include/sexp.h
src/include/syscmd.h
src/include/system.h
src/include/typedefs.h
src/include/utils.h
src/include/vc.h
src/include/checkin.h
src/include/checkout.h
src/include/convert.h
src/include/diff.h
src/include/fileent.h
src/include/lock.h
src/include/memseg.h
src/include/misc.h
src/include/populate.h
src/include/prcs.h
src/include/prcsdir.h
src/include/prcserror.h
src/include/projdesc.h
src/include/dstring.h
src/include/dynarray.h
src/include/getopt.h
src/include/fnmatch.h
src/include/global.h
src/include/md5.h
src/include/maketime.h
src/include/partime.h
src/include/hash.h
src/include/docs.h
src/prcs.cc
src/sexp.cc
src/projdesc.cc
src/fileent.cc
src/checkin.cc
src/checkout.cc
src/repository.cc
src/populate.cc
src/syscmd.cc
src/vc.cc
src/diff.cc
src/info.cc
src/misc.cc
src/package.cc
src/merge.cc
src/lock.cc
src/rebuild.cc
src/prcserror.cc
src/convert.cc
src/memseg.cc
src/prcsver.cc
src/setkeys.cc
src/quick.cc
src/rekey.cc
src/dstring.cc
src/dynarray.cc
src/hash.cc
src/execute.cc

F: object files used for test 4

src/prcs.o
src/sexp.o
src/projdesc.o
src/fileent.o
src/checkin.o
src/checkout.o
src/repository.o
src/populate.o
src/syscmd.o
src/vc.o
src/diff.o
src/info.o
src/misc.o
src/package.o
src/merge.o
src/lock.o
src/rebuild.o
src/prcserror.o
src/convert.o
src/memseg.o
src/prcsver.o
src/setkeys.o
src/quick.o
src/rekey.o
src/dstring.o
src/dynarray.o
src/hash.o
src/execute.o
src/utils.o
src/getopt.o
src/getopt1.o
src/md5c.o
src/maketime.o
src/partime.o
src/vclex.o
src/docs.o
