4. Goals, tricks, ideas and issues

4.1. exact patching

When debpatch or debdelta-upgrade recreates a .deb, it is identical to the desired one, so it may be possible to check it using the security features in APT [1]. See, though, Section 4.5.

4.2. exact recompression

Suppose a .deb contains a huge file /usr/share/doc/foobar/document.info.gz that starts with an RCS tag: then each time the package is released, the file will be different, even though only a few bytes have changed. Other examples are manpages that start with a header containing the version of the command. So, to get a good compression of the difference, I had to be able to gunzip those files, diff them, and then gzip them back so that the result is exactly identical (except possibly for the headers [2]). For this reason, I studied the gzip format, and I wrote in debdelta some Python code that does the trick (90% of the time...). [3]
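
For illustration, here is a minimal sketch of the idea (this is not the actual code in debdelta, which lives in the routine mentioned in [3]): given the uncompressed content and the raw deflate stream extracted from the original .gz file, try the possible gzip compression levels until one reproduces the stream byte for byte.

	import zlib

	def guess_gzip_level(uncompressed_data, original_deflate_stream):
	    # A gzip file stores raw deflate data (wbits=-15) between its
	    # header and its trailer; try each compression level in turn.
	    for level in range(9, 0, -1):
	        comp = zlib.compressobj(level, zlib.DEFLATED, -15)
	        candidate = comp.compress(uncompressed_data) + comp.flush()
	        if candidate == original_deflate_stream:
	            return level          # found: re-gzipping will be identical
	    return None                   # cannot reproduce it exactly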

4.3. speed

4.3.1. some (old) numbers

Warning: this section refers to experiments done in 2006, when the backend for delta encoding was 'xdelta'. On a desktop with an Athlon64 3000 CPU and an average hard disk,


	$ debdelta mozilla-browser_1.7.8-1sarge3_i386.deb \
	mozilla-browser_1.7.8-1sarge6_i386.deb /tmp/m-b.debdelta
      
processes the 10MB of mozilla-browser in ~11 seconds, that is, at a speed of ~900kB per second. Then debpatch applies the above delta in 16 seconds, at a speed of ~600kB per second. Numbers drop on an old PC, or on a notebook (like mine, which has an Athlon 1600MHz and slow disks), where data are chewed at ~200kB per second. Still, since I have an ADSL line that downloads at most 80kB per second, I benefit from downloading deltas. In a theoretical example, indeed, downloading an 80MB package would take 1000 seconds; whereas downloading a delta that is 20% of 80MB takes 200 seconds, and then 80MB / (200kB/sec) = 400 seconds to apply it, for a total of 600 seconds. So I may get a "virtual speed" of 80MB / 600 sec = 130kB/sec. Note that delta downloading and delta patching are done in parallel: if 4 packages as above have to be downloaded, then the total time for downloading the full debs would be 4000 seconds, while the time for the parallel download-and-patch process may be as low as 1400 seconds.
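
The arithmetic of the theoretical example, spelled out as a tiny script (the values are the ones quoted above):

	deb_size = 80 * 1000              # kB
	delta_size = 0.20 * deb_size      # kB, the delta is 20% of the deb
	download_speed = 80               # kB/sec, ADSL line
	patch_speed = 200                 # kB/sec, slow notebook

	time_full_deb = deb_size / download_speed                          # 1000 sec
	time_delta = delta_size / download_speed + deb_size / patch_speed  # 200 + 400 = 600 sec
	virtual_speed = deb_size / time_delta                              # ~130 kB/sec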

This is a real example of running 'debdelta-upgrade':


	Looking for a delta for libc6 from 2.3.6-9 to 2.3.6-11
	Looking for a delta for udev from 0.092-2 to 0.093-1
	Patching done, time: 22sec, speed: 204kB/sec, result: libc6_2.3.6-11_i386.deb
	Patching done, time: 4sec, speed: 57kB/sec, result: udev_0.093-1_i386.deb
	Delta-upgrade download time 28sec speed 21.6k/sec
	total time: 53sec; virtual speed: 93.9k/sec.
      
(Note that the "virtual speed" of 93.9k/sec, while less than the 130kB/sec of the theoretical example above, is still more than the 80kB/sec that my ADSL line would allow.) Of course the above is even better for people with fast disks and/or slow modems. Actually, an apt delta method could make a smart decision about how many deltas to download, and in which order, to optimize the result (given the delta sizes, the package sizes, the downloading speed and the patching speed).

4.3.2. speeding up

The problem is that the process of applying a delta to create a new deb is currently slow, even on very fast machines. One way to overcome this is to parallelize as much as possible. The best strategy that I can imagine is to keep the CPU, the hard disk and the Internet connection always maxed out. This is why 'debdelta-upgrade' has two threads, the "downloading thread" and the "patching thread". The downloading thread downloads deltas (ordered by increasing size) and, as soon as each one is downloaded, queues it to be applied in the "patching thread"; as soon as all available deltas are downloaded, it starts downloading some debs, and goes on for as long as deltas are being applied in the "patching thread". Summarizing, the downloading thread keeps the Internet connection busy while the patching thread keeps the CPU and HDD busy.
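
A minimal sketch of this two-thread design (with hypothetical stand-in helpers, not the real debdelta-upgrade internals): the downloader hands each delta to the patcher through a queue, then keeps the line busy with plain debs.

	import threading, queue, time

	def download(name):                  # stand-in for the real HTTP download
	    time.sleep(0.1)
	    return name

	def apply_delta(delta):              # stand-in for the real patching step
	    time.sleep(0.3)

	def downloading_thread(deltas, debs, patch_queue):
	    for d in sorted(deltas):         # the real code orders deltas by increasing size
	        patch_queue.put(download(d)) # hand each delta to the patcher
	    for d in debs:                   # then keep the Internet busy
	        download(d)                  # downloading plain debs
	    patch_queue.put(None)            # signal "no more work"

	def patching_thread(patch_queue):
	    while True:
	        delta = patch_queue.get()
	        if delta is None:
	            break
	        apply_delta(delta)           # CPU- and disk-bound work

	q = queue.Queue()
	t1 = threading.Thread(target=downloading_thread,
	                      args=(["a.debdelta", "b.debdelta"], ["c.deb"], q))
	t2 = threading.Thread(target=patching_thread, args=(q,))
	t1.start(); t2.start(); t1.join(); t2.join()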

Another speedup strategy is embedded inside the deltas themselves: since bsdiff is a memory hog, when the backend is bsdiff I have to divide the data into chunks; this may lower the compression ratio, but the good point is that the HDD accesses and the calls to bsdiff can run in parallel. With newer xdelta3, the original data can be read from a pipe, so the data are not divided into chunks but rather continuously piped into xdelta3; thus xdelta3 runs at the same time as the data are read from the HDD.
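
As an illustration of the piping idea (a hedged sketch, not debdelta's actual invocation of xdelta3), data can be fed to xdelta3 through its standard input while it is still being read from disk; xdelta3 reads its input from stdin when no input file is given:

	import subprocess

	def encode_delta_from_stream(old_file, delta_file, new_data_chunks):
	    # new_data_chunks: an iterable yielding the new data as it is read
	    # from the HDD; xdelta3 starts working before we finish reading.
	    with open(delta_file, "wb") as out:
	        proc = subprocess.Popen(["xdelta3", "-e", "-s", old_file],
	                                stdin=subprocess.PIPE, stdout=out)
	        for chunk in new_data_chunks:
	            proc.stdin.write(chunk)
	        proc.stdin.close()
	        proc.wait()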

4.3.3. the 10kb trick

Currently, roughly half of the generated deltas [4] are less than 10KB. So debdelta-upgrade downloads deltas in two passes:

  1. in the first pass it tries to download the first 10KB of each delta; if it gets a complete delta, it immediately pipes it into the "patching thread" queue; otherwise, if it gets only a partial download, it adds it to the download queue; if it gets an HTTP 404, it possibly checks for the "toobig" timestamp, and possibly warns the user.

  2. in the second pass, it downloads the rest of the deltas, and queues them for patching.

Why this complex method? Because the first 10KB of a delta contain its info, and those may be used to decide not to download the rest of the delta (if a predictor, still a TODO, decides that it is not worthwhile... see Section 4.3.4).
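
A minimal sketch of the first pass (a hypothetical helper, not the real downloading code): an HTTP Range request asks only for the first 10KB, and the Content-Range answer tells whether we already hold the whole delta.

	import urllib.request, urllib.error

	def fetch_first_10kb(url, size=10 * 1024):
	    req = urllib.request.Request(url, headers={"Range": "bytes=0-%d" % (size - 1)})
	    try:
	        with urllib.request.urlopen(req) as resp:
	            data = resp.read()
	            # "Content-Range: bytes 0-10239/TOTAL" gives the full size
	            content_range = resp.headers.get("Content-Range", "")
	            total = int(content_range.rpartition("/")[2]) if "/" in content_range else len(data)
	            return data, len(data) >= total   # (head of delta, is it complete?)
	    except urllib.error.HTTPError as err:
	        if err.code == 404:
	            return None, False                # no delta (maybe ".debdelta-too-big")
	        raise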

4.3.4. the choice, the predictor

Which deltas should be downloaded, and which debs? Currently there is a rule of thumb: the server immediately deletes any delta that exceeds 70% of the original deb, and replaces it with an empty file whose name ends in ".debdelta-too-big". In such cases, "debdelta-upgrade" will download the deb instead. See the explanation of "debdelta-upgrade --deb-policy" in the man page for more info and for customizing which debs get downloaded.

Some time ago I tried to devise a better way to decide when to download a delta rather than a deb. The code is in the "Predictor" class... but I could not reliably predict the final speed of patching, so currently it is not used.
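
For reference, the simplest rule one could imagine (a sketch only, not the Predictor code): prefer the delta when downloading it and patching it takes less time than downloading the whole deb.

	def delta_is_worthwhile(deb_size, delta_size, download_speed, patch_speed):
	    # all sizes in bytes, speeds in bytes/sec
	    time_with_delta = delta_size / download_speed + deb_size / patch_speed
	    time_with_deb = deb_size / download_speed
	    return time_with_delta < time_with_deb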

4.3.5. State of the art

All in all, I still cannot obtain high speeds: so people that have a fast ADSL Internet connection are usually better off downloading all the debs and ignoring "debdelta-upgrade" altogether. Anyway, the best way to know is to try "debdelta-upgrade -v" and read the final statistics. See Section 4.7 and Section 4.8 for recent developments.

4.4. better deb compression is a worse delta

'xdelta3' can reconstruct data at high speed: on today's processors, it can process up to 2MB per second; but, when applying a delta, 'xdelta3' works on uncompressed data. So if the data are then compressed at a ratio of 1/3, the resulting speed measured on the compressed data is roughly 700KB/sec. Moreover, extra time is needed to actually compress the data.

In recent years, 'dpkg' has transitioned from 'data.tar.gz' to 'data.tar.bz2' to 'data.tar.lzma'; each method compresses better, but is also slower than the previous one. Since it compresses better, it also defeats the ability of 'debdelta' to produce small deltas (with respect to the original deb, of course), and indeed statistics show that deltas are getting larger; since it is slower, it slows down the applying of deltas as well.

4.5. long time recovery

As mentioned above, deltas can rebuild the deb identically to the byte. But the patch.sh script calls the standard tools 'tail', 'head', 'gzip', 'bzip2', 'lzma', etc. to rebuild the deb; so if the command-line interface or the output of any of those tools changes, then a delta may become unusable. As long as deltas are used for the debdelta-upgrade service, this is no big deal: if such a tool changes, then we can adjust the deltas to it, and there is just a few days' disruption of the service [5] (and people will download debs instead of deltas... as we used to).

If anybody wants instead to use debdelta to archive debs for a long time (as the archive.debian.org service was doing), then we should make sure that, at any moment in the future, the deltas can still be applied. A possible solution would be that deltas should contain, in their info files, the versions of all tools that are needed to apply them. A second solution is that debdelta should keep a standard set of those tools inside the package.
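
A sketch of the first solution (purely hypothetical, not implemented): record, in the delta's info files, the version string of every external tool the patch script relies on.

	import subprocess

	TOOLS = ["tail", "head", "gzip", "bzip2", "lzma"]

	def tool_versions():
	    versions = {}
	    for tool in TOOLS:
	        try:
	            out = subprocess.run([tool, "--version"], stdin=subprocess.DEVNULL,
	                                 capture_output=True, text=True)
	            banner = (out.stdout or out.stderr).splitlines()
	            versions[tool] = banner[0] if banner else "unknown"
	        except FileNotFoundError:
	            versions[tool] = "not installed"
	    return versions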

4.6. streaming

Let me summarize. When 'debdelta-upgrade' (or 'debpatch') recreates a deb, one step is reassembling the data.tar part inside it; this part, moreover, is compressed (with gzip, bzip2 or, lately, lzma). This 'reassembling and compressing' takes time (both CPU and HD), and is moreover quite useless, since shortly afterwards 'apt' will call 'dpkg -i', which decompresses and reopens the data.tar in the deb.

It is then reasonable to collapse these two steps, and this would possibly speed up the upgrade a bit. A first step is '--format=unzipped' (Section 4.7); a next step may be '--format=preunpacked' (Section 4.8).

4.7. --format=unzipped

The recently introduced --format=unzipped may speed up package upgrades. If you call 'debdelta-upgrade' with the option '--format=unzipped', then in the recreated deb the data.tar part will not be compressed. This may speed up the 'debdelta-upgrade' + 'apt-get upgrade' process. Indeed, writing to the hard disk is fast (let's say 5MB/sec, but usually much more), whereas compressing random data with 'bzip2 -9' or 'lzma -9' is much slower (let's say 2.0MB/sec and 1.5MB/sec); moreover, the compressed data are then decompressed by dpkg when installing; so avoiding the compress/decompress steps should be a win/win (unless you run out of disk space...). Indeed I see that the recreation of the debs is much faster; but I still do not have enough data collected...

4.8. --format=preunpacked

Here is another idea. When 'debdelta-upgrade' is called to upgrade a package 'foobar', it currently creates 'foobar_2.deb'. With an appropriate command-line switch '--format=preunpacked', instead of creating 'foobar_2.deb', it directly saves all of its files to the filesystem, adding an extension to each file name to make sure that no file name conflicts with (i.e. overwrites) a preexisting file on the filesystem; then it creates a file 'foobar_2.deb_preunpacked', that is, a deb package where 'data.tar.xxx' is replaced with 'data_list', a text file specifying the contents of 'data.tar.xxx' and where regular files were temporarily unpacked.

Note that the above idea overlaps a lot with the SummerOfCode2010 StreamingPackageInstall

'debdelta-upgrade --format=preunpacked' is now implemented as a proof of concept (it does not really write temporary files to the HD yet). The format of data_list is


Files:
 TYPE MODE USER GROUP MTIME
 NAME_FILE_WAS_UNPACKED_TO (if regular file)
 ORIGINAL_FILENAME
 LINK_NAME (if link)
[repeat]
       
Example of data_list

Files:
 d 0755 root root 1304626623
 
 ./etc
 
 - 0644 root root 1304626594
 /./etc/gnashrc_1956_debdelta_preunpacked
 ./etc/gnashrc
 l 0777 root root 1304626629
 
 ./usr/share/man/man1/gtk-gnash.1.gz
 gnash.1.gz
       
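
For illustration, here is a small sketch of how such a data_list could be read (a hypothetical helper, not part of debdelta; it assumes the layout shown above):

	def parse_data_list(lines):
	    it = iter(lines)
	    assert next(it).strip() == "Files:"
	    entries = []
	    for header in it:
	        if not header.strip():
	            continue                           # skip separator lines
	        ftype, mode, user, group, mtime = header.split()
	        unpacked_to = next(it).strip()         # empty unless a regular file
	        original_name = next(it).strip()
	        link_name = next(it).strip() if ftype == "l" else ""
	        entries.append((ftype, mode, user, group, int(mtime),
	                        unpacked_to, original_name, link_name))
	    return entries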

PROS: (1) it may be faster; (2) if you need to upgrade a 100MB package, you do not need to store both the deb and (while running 'dpkg --unpack') the whole new deb data: so there is less risk of running out of disk space.

CONS: (1) you cannot install that "preunpacked deb" twice (so dpkg should probably remove it once it has installed it); (2) you cannot move it to another host; (3) when running "apt-get clean", all temporary files have to be removed as well.

So it may be a good idea to use ".deb_preunpacked" as the extension for them. And I would recommend using '--format=unzipped' for essential packages such as the kernel.

If you like the idea, someone should help in changing 'dpkg' so that it can install starting from 'foobar_2.deb_preunpacked', and in changing APT so that it interacts with 'debdelta' to create the 'foobar_2.deb_preunpacked' files, passes them to dpkg, and cleans them up properly.

Notes

[1]

Note though that debdelta-upgrade saves the reconstructed debs in /var/cache/apt/archives, and APT does not check them there, AFAICT.

[2]

The re-gzipped files are identical except for the headers (indeed gzip headers sometimes contain a timestamp); but this is not a problem, since the reconstructed gzipped file is then piped again into 'xdelta3' or 'bsdiff' to rebuild the 'data.tar', so the header is fixed at that stage.

[3]

This is implemented in the Python routine delta_gzipped_files.

[4]

That is, discarding those that are more than 70% of the corresponding deb.

[5]

this actually already happened some years ago, with libzip