Proposal: A new approach to differential debs

Julian Andres Klode-4
Hi everyone,

(I CCed -devel and deity, but we probably should just discuss
 that on -dpkg)

While at breakfast here at DebConf, the topic of delta upgrades
came up. I think delta debs are generally a thing we should
aim to have, but debdelta is not the implementation we want:

* It is not integrated into APT or dpkg
* It relies on a script shipped in the debdelta to generate
  a new deb from an old deb

We guessed that generating the new deb from the old deb is
actually the slowest part of the whole debdelta process. I
propose a new solution, which does not really have a name
yet.

# Scope
Patching installed packages to match new state

# Not scope

Generating new package from old package

# Design

Introduce a new format 'ddeb' (delta/differential deb)[1]. Like a
.deb file, it contains a control.tar member and a version member;
instead of a data.tar member, however, it contains a diff.tar member.

The diff.tar member contains patches to apply to individual
files of the old package. The specific patch algorithm is still
to be decided.

The control.tar's control file is extended with an Old-Version
field, so we get some sanity checking. Alternatively, it may
be extended with an mtree of the source we are patching.
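
As a sketch, the control file of such a ddeb might then look like this
(package and version purely illustrative):

        Package: hello
        Version: 2.10-2
        Architecture: amd64
        Old-Version: 2.10-1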

The delta files are stored alongside their debs, and referenced
in the Packages files in a Patches-<hash> section:

        Patches-SHA256:
                <hash of delta> <size of delta> <old version> <path>

APT can then grab the delta; if that fails with a 404, it
falls back to the full deb.
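
To make that concrete, a hypothetical Packages stanza (package name and
paths made up) would carry the new section roughly like so:

        Package: hello
        Version: 2.10-2
        Filename: pool/main/h/hello/hello_2.10-2_amd64.deb
        Patches-SHA256:
         <hash of delta> <size of delta> 2.10-1 pool/main/h/hello/hello_2.10-1_2.10-2_amd64.ddeb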

The deltas are not incremental; my suggestion is to generate the following ones:

unstable: (1) against testing (2) against previous unstable
experimental: (1) against unstable (2) against previous experimental
stable:   (1) against last point release (2) against previous security update
              (or rather current one)

All these files will always be around when dak runs anyway, so we do not
need to keep additional historical packages around.

We probably want to make this opt-in, so packages set a field like
Generate-Patches: yes (there might be problems with applying patches to
live systems and bad maintainer scripts).

[1] name should probably change

# Requisites / Extensions to .deb and dpkg database

We need to keep data about the file tree in packages and
in the database. From what I gathered, there is already
a plan to use mtree for that, so it would fit us well.

# Applying a delta deb

Maintainer scripts are run as usual from the
control tarball. The unpack phase is different: instead
of unpacking a file <file> from the tarball into <file>.dpkg-new,
we apply the archive's <file> as a patch to the installed <file>
and store the result in <file>.dpkg-new.

Files not listed in the new mtree but listed in the old
one will be deleted.
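
For illustration, assuming a bsdiff-style per-file patch (the patch
algorithm is left open above, file names are made up), the unpack step
for a single file would roughly be:

        # installed <file> + patch from diff.tar -> <file>.dpkg-new
        bspatch "$file" "$file.dpkg-new" "$file.bsdiff"
        # dpkg later renames <file>.dpkg-new over <file> as usual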

# Issues

We need a way to check whether we can apply the diff.tar member; and
if we can't, APT has to download the full deb instead. This might need
some kind of new patch-prepare state where we basically generate the
.dpkg-new files, but don't apply them (or run any maintscripts).

# Further work

I guess I should do a proof of concept and then we can look if that
is worthwhile, and how it performs.

--
Debian Developer - deb.li/jak | jak-linux.org - free software dev
                  |  Ubuntu Core Developer |
When replying, only quote what is necessary, and write each reply
directly below the part(s) it pertains to ('inline').  Thank you.

Re: Proposal: A new approach to differential debs

Adrian Bunk-3
On Sat, Aug 12, 2017 at 02:16:21PM -0400, Julian Andres Klode wrote:
>...
> I think delta debs are generally a thing we should aim to have,
>...

It sounds like something that would have been a cool feature 20 years
ago when I was downloading Debian updates over an analog modem.

Today the required effort, infrastructure and added complexity would
IMHO not be worth it for a potential few percent of bandwidth decrease.

> The .diff.tar member contains patches to apply to individual
> files of the old package. No idea about specific algorithm
> to choose here, yet.
>...

Do you really want to ship *patches*, and not just copies of all
changed files?

Patching binaries to a new upstream version doesn't sound to me like
something that would make sense.

cu
Adrian

--

       "Is there not promise of rain?" Ling Tan asked suddenly out
        of the darkness. There had been need of rain for many days.
       "Only a promise," Lao Er said.
                                       Pearl S. Buck - Dragon Seed

Re: Proposal: A new approach to differential debs

Paul Wise via nm
On Sun, Aug 13, 2017 at 5:38 AM, Adrian Bunk wrote:

> It sounds like something that would have been a cool feature 20 years
> ago when I was downloading Debian updates over an analog modem.
>
> Today the required effort, infrastructure and added complexity would
> IMHO not be worth it for a potential few percent of bandwidth decrease.

Low-speed connections and low bandwidth quotas (especially on mobile)
are still common enough around the world that delta upgrades make a
difference right now, IIRC even Google uses them for Chrome.

--
bye,
pabs

https://wiki.debian.org/PaulWise

Re: Proposal: A new approach to differential debs

Peter Silva
You are assuming the savings are substantial.  That's not clear.  When
files are compressed, if you then start doing binary diffs, well it
isn't clear that they will consistently be much smaller than plain new
files.  It also isn't clear what the impact on repo disk usage would
be.

The most straightforward option:
The least intrusive way to do this is to add differential files in
addition to the existing binaries, dropping any differential that
exceeds some threshold size (for example 50%) of the original file.
You still end up adding the sum total of the diff files in addition
to the regular files to the repos.  I haven't done the math, but it's
clear to me that it ends up being about double the disk space with
this approach.  It's also costly in that all those files have to be
built and managed, which is likely a substantial ongoing load
(CPU/IO/people).  I think this is what people are objecting to.

A more intrusive and less obvious way to do this is to use zsync
(http://zsync.moria.org.uk/).  With zsync, you build tables of contents
for each file, using the same 4k blocking that rsync does.  To handle
compression efficiently, there needs to be an understanding of
blocking, so a customized gzip needs to be used.  With such a format,
you produce the same .debs as today, plus the .zsyncs (already in
use?), and the author already provides some Debian Packages files as
examples.  The space penalty here is probably only a few percent.

The resource penalty is one read through each file to build the
indices, and you can save that by combining the index building with
the compression phase.  To get differential patches, one just fetches
byte ranges from the existing main files, so no separate diff files
are needed.  And since the same mechanisms can (should?) be used for
repo replication, the added cost is likely a net savings in bandwidth
usage, and relatively little complexity.

The steps would be:
   -- add a gzip-aware flavour to the package creation logic
   -- add zsync index creation on the repos (potentially combined with the first step)
   -- add a zsync client to apt and friends

To me, this makes a lot of sense to do just for repo replication,
completely ignoring the benefits for people on low bandwidth lines,
but it does work for both.
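
For reference, the zsync workflow looks roughly like this (URLs and file
names are illustrative):

   # archive side, next to the .deb:
   zsyncmake -u http://deb.example.org/pool/main/h/hello/hello_2.10-2_amd64.deb \
       hello_2.10-2_amd64.deb        # writes hello_2.10-2_amd64.deb.zsync

   # client side, seeding from the old .deb already on disk:
   zsync -i hello_2.10-1_amd64.deb \
       http://deb.example.org/pool/main/h/hello/hello_2.10-2_amd64.deb.zsync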

On Sun, Aug 13, 2017 at 9:20 AM, Paul Wise <[hidden email]> wrote:

> On Sun, Aug 13, 2017 at 5:38 AM, Adrian Bunk wrote:
>
>> It sounds like something that would have been a cool feature 20 years
>> ago when I was downloading Debian updates over an analog modem.
>>
>> Today the required effort, infrastructure and added complexity would
>> IMHO not be worth it for a potential few percent of bandwidth decrease.
>
> Low-speed connections and low bandwidth quotas (especially on mobile)
> are still common enough around the world that delta upgrades make a
> difference right now, IIRC even Google uses them for Chrome.
>
> --
> bye,
> pabs
>
> https://wiki.debian.org/PaulWise
>

Re: Proposal: A new approach to differential debs

Julian Andres Klode-4
On Sun, Aug 13, 2017 at 10:53:16AM -0400, Peter Silva wrote:
> You are assuming the savings are substantial.  That's not clear.  When
> files are compressed, if you then start doing binary diffs, well it
> isn't clear that they will consistently be much smaller than plain new
> files.  it also isn't clear what the impact on repo disk usage would
> be.

The suggestion is for it to be opt-in, so packages can optimize their file
layout when opting in - for example, the gzipped changelogs and stuff can be
the rsyncable ones (this might need some changes in debhelper to generate
diffable files if the opt-in option is set).

>
> The most straightforward option:
> The least intrusive way to do this is to add differential files in
> addition to the existing binaries, and any time the differential file,
> compared to a new version exceeds some threshold size (for example:
> 50%) of the original file, then you end up adding the sum total of the
> diff files in addition to the regular files to the repos.  I haven't
> done the math, but it's clear to me that it ends up being about double
> the disk space with this approach. it's also costly in that all those
> files have to be built and managed, which is likely a substantial
> ongoing load (cpu/io/people)  I think this is what people are
> objecting to.

I would throw away patches that are too large, obviously. I think
that twice the amount of space seems about reasonable, but then we'll
likely restrict that to

(a) packages where it's actually useful
(b) to updates and their base distributions

So we don't keep oldstable->stable diffs, or unstable->unstable
diffs around. It's questionable if we want them for unstable at
all, it might just be too much there, and we should just use them
for stable-(proposed-)updates (and point releases) and stable/updates;
and experimental so we don't accidentally break it during the
dev cycle.

>
> A more intrusive and less obvious way to do this is to use zsync (
> http://zsync.moria.org.uk/. ) With zsync, you build tables of content
> for each file, using the same 4k blocking that rsync does. To handle
> compression efficiently, there needs to be an understanding of
> blocking, so a customized gzip needs to be used.  With such a format,
> you produce the same .deb's as today, with the .zsyncs (already in
> use?) and the author already provides some debian Packages files as
> examples.  The space penalty here is probably only a few percent.
>
> the resource penalty is one read through each file to build the
> indices, and you can save that by combining the index building with
> the compression phase.  To get differential patches, one just fetches
> byte-ranges in the existing main files, so no separate diff files
> needed.  And since the same mechanisms can (should?) be used for repo
> replication, the added cost is likely a net savings in bandwidth
> usage, and relatively little complexity.
>
> The steps would be:
>    -- add gzip-ware flavour to package creation logic
>    -- add zsync index creation on repos. (potentially combined with first step.)
>    -- add zsync client to apt-* friends.
>
> To me, this makes a lot of sense to do just for repo replication,
> completely ignoring the benefits for people on low bandwidth lines,
> but it does work for both.

It does not work for client use. apt by default automatically deletes
package files after a successful install, and upgrades are fairly infrequent
(so an expiration policy would not help much either); so we can't rely on old
packages being available for upgrades.

--
Debian Developer - deb.li/jak | jak-linux.org - free software dev
                  |  Ubuntu Core Developer |
When replying, only quote what is necessary, and write each reply
directly below the part(s) it pertains to ('inline').  Thank you.

Re: Proposal: A new approach to differential debs

Peter Silva
> apt by default automatically deletes packages files after a successful install,

I don't think it does that.  It seems to keep them around after install,
and even multiple versions.  I don't know the algorithm for the cache,
but it doesn't do what you say.

On my debian box, I have nothing but a straight install, changed nothing.

blacklab% cd /var/cache/apt/archives
blacklab% ls -al | wc -l
1450
blacklab% ls -al xsensors*
-rw-r--r-- 1 root root 17688 Mar  6 21:11 xsensors_0.70-3+b1_amd64.deb
blacklab% dpkg -l xsensors
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name                      Version           Architecture      Description
+++-=========================-=================-=================-=======================================================
ii  xsensors                  0.70-3+b1         amd64             hardware health information viewer
blacklab%

blacklab% ls -al xserver-xorg-video-nvidia_375*
-rw-r--r-- 1 root root 3129858 Feb 23 16:13 xserver-xorg-video-nvidia_375.39-1_amd64.deb
-rw-r--r-- 1 root root 3138320 May 28 11:28 xserver-xorg-video-nvidia_375.66-1_amd64.deb
-rw-r--r-- 1 root root 3101944 Jul 16 17:14 xserver-xorg-video-nvidia_375.66-2~deb9u1_amd64.deb
blacklab%

On Sun, Aug 13, 2017 at 12:43 PM, Julian Andres Klode <[hidden email]> wrote:

> On Sun, Aug 13, 2017 at 10:53:16AM -0400, Peter Silva wrote:
>> You are assuming the savings are substantial.  That's not clear.  When
>> files are compressed, if you then start doing binary diffs, well it
>> isn't clear that they will consistently be much smaller than plain new
>> files.  it also isn't clear what the impact on repo disk usage would
>> be.
>
> The suggestion is for it to be opt-in, so packages can optimize their file
> layout when opting in - for example, the gzipped changelogs and stuff can be
> the rsyncable ones (this might need some changes in debhelper to generate
> diffable files if the opt-in option is set).
>
>>
>> The most straightforward option:
>> The least intrusive way to do this is to add differential files in
>> addition to the existing binaries, and any time the differential file,
>> compared to a new version exceeds some threshold size (for example:
>> 50%) of the original file, then you end up adding the sum total of the
>> diff files in addition to the regular files to the repos.  I haven't
>> done the math, but it's clear to me that it ends up being about double
>> the disk space with this approach. it's also costly in that all those
>> files have to be built and managed, which is likely a substantial
>> ongoing load (cpu/io/people)  I think this is what people are
>> objecting to.
>
> I would throw away patches that are too large, obviously. I think
> that twice the amount of space seems about reasonable, but then we'll
> likely restrict that to
>
> (a) packages where it's actually useful
> (b) to updates and their base distributions
>
> So we don't keep oldstable->stable diffs, or unstable->unstable
> diffs around. It's questionable if we want them for unstable at
> all, it might just be too much there, and we should just use them
> for stable-(proposed-)updates (and point releases) and stable/updates;
> and experimental so we don't accidentally break it during the
> dev cycle.
>
>>
>> A more intrusive and less obvious way to do this is to use zsync (
>> http://zsync.moria.org.uk/. ) With zsync, you build tables of content
>> for each file, using the same 4k blocking that rsync does. To handle
>> compression efficiently, there needs to be an understanding of
>> blocking, so a customized gzip needs to be used.  With such a format,
>> you produce the same .deb's as today, with the .zsyncs (already in
>> use?) and the author already provides some debian Packages files as
>> examples.  The space penalty here is probably only a few percent.
>>
>> the resource penalty is one read through each file to build the
>> indices, and you can save that by combining the index building with
>> the compression phase.  To get differential patches, one just fetches
>> byte-ranges in the existing main files, so no separate diff files
>> needed.  And since the same mechanisms can (should?) be used for repo
>> replication, the added cost is likely a net savings in bandwidth
>> usage, and relatively little complexity.
>>
>> The steps would be:
>>    -- add gzip-ware flavour to package creation logic
>>    -- add zsync index creation on repos. (potentially combined with first step.)
>>    -- add zsync client to apt-* friends.
>>
>> To me, this makes a lot of sense to do just for repo replication,
>> completely ignoring the benefits for people on low bandwidth lines,
>> but it does work for both.
>
> It does not work for client use. apt by default automatically deletes
> packages files after a successful install, and upgrades are fairly infrequent
> (so an expiration policy would not help much either); so we can't rely on old
> packages to be available for upgrades.
>
> --
> Debian Developer - deb.li/jak | jak-linux.org - free software dev
>                   |  Ubuntu Core Developer |
> When replying, only quote what is necessary, and write each reply
> directly below the part(s) it pertains to ('inline').  Thank you.

Re: Proposal: A new approach to differential debs

Christian Seiler
On 08/13/2017 07:11 PM, Peter Silva wrote:
>> apt by default automatically deletes packages files after a successful install,
>
> I don't think it does that.

The "apt" command line tool doesn't, but traditional "apt-get" does, as
does "aptitude". This was documented in the release notes of Jessie and
the changelog of the APT package when the "apt" wrapper was introduced.

Regards,
Christian

Re: Proposal: A new approach to differential debs

Marvin Renich
* Christian Seiler <[hidden email]> [170813 13:19]:
> On 08/13/2017 07:11 PM, Peter Silva wrote:
> >> apt by default automatically deletes packages files after a successful install,
> >
> > I don't think it does that.
>
> The "apt" command line tool doesn't, but traditional "apt-get" does, as
> does "aptitude". This was documented in the release notes of Jessie and
> the changelog of the APT package when the "apt" wrapper was introduced.

This differs from my experience.  My laptop's /var/cache/apt/archives/
directory has 3459 .deb files.  I use aptitude almost exclusively, and I
update several times a week.

Is there an apt.conf parameter that controls this?  I don't remember
seeing any news or changelog about this, but if I did, I might have
disabled it.  Or maybe there is an apt.conf entry, whose default is to
not do apt-get clean after each install, but newly installed systems
have a snippet in /etc/apt.conf.d/ that turns it on?

Can you give the version number in /usr/share/doc/apt/changelog.gz where
this was mentioned?  A cursory glance through the Jessie release notes
(HTML) TOC doesn't give any obvious pointer to where this was mentioned;
can you give a ref for this as well?  (This is an honest question, I'm
not trying to be accusatory.)

Thanks...Marvin

Re: Proposal: A new approach to differential debs

Christian Seiler
(Setting reply-to to debian-devel@ only as I don't think this
should continue on debian-dpkg@ and deity@)

On 08/14/2017 12:29 AM, Marvin Renich wrote:

> * Christian Seiler <[hidden email]> [170813 13:19]:
>> On 08/13/2017 07:11 PM, Peter Silva wrote:
>>>> apt by default automatically deletes packages files after a successful install,
>>>
>>> I don't think it does that.
>>
>> The "apt" command line tool doesn't, but traditional "apt-get" does, as
>> does "aptitude". This was documented in the release notes of Jessie and
>> the changelog of the APT package when the "apt" wrapper was introduced.
>
> This differs from my experience.  My laptop's /var/cache/apt/archives/
> directory has 3459 .deb files.  I use aptitude almost exclusively, and I
> update several times a week.

Erm, rereading my text, I misspoke a bit, I meant the opposite of
what I said:

 - apt-get / aptitude leave /var/cache/apt/archives lying around
 - apt doesn't

> Is there an apt.conf parameter that controls this?

For aptitude I have no idea, haven't used that in quite a while.

For the apt and apt-get utilities there's the following explanation
in /usr/share/doc/apt/NEWS.Debian.gz:

  [ Automatic removal of debs after install ]
  After packages are successfully installed by apt(8),
  the corresponding .deb package files will be
  removed from the /var/cache/apt/archives cache directory.

  This can be changed by setting the apt configuration option
    "Binary::apt::APT::Keep-Downloaded-Packages" to "true". E.g:

  # echo 'Binary::apt::APT::Keep-Downloaded-Packages "true";' \
      > /etc/apt/apt.conf.d/01keep-debs

  Please note that the behavior of apt-get is unchanged. The
  downloaded debs will be kept in the cache directory after they
  are installed. To enable the behavior for other tools, you can set
  "APT::Keep-Downloaded-Packages" to false.

The corresponding changelog entry for 1.2~exp1 is not quite as
obvious, so I should have probably referenced NEWS and not
changelog in my message, sorry about that:

  * Add new APT::Keep-Downloaded-Packages option (Closes: #160743)

> A cursory glance through the Jessie release notes
> (HTML) TOC doesn't give any obvious pointer to where this was mentioned;

Sorry, I was pretty sure it was in there, but after looking
through them you're right: it's not in there. But it was
mentioned by the APT developers towards the end of the
Jessie release cycle - I can't find anyting in
debian-devel-announce about that either though. Maybe it was
in the DebConf15 talk about APT?

Regards,
Christian

Re: Proposal: A new approach to differential debs

Marvin Renich
* Christian Seiler <[hidden email]> [170813 18:59]:

> (Setting reply-to to debian-devel@ only as I don't think this
> should continue on debian-dpkg@ and deity@)
>
> On 08/14/2017 12:29 AM, Marvin Renich wrote:
> > * Christian Seiler <[hidden email]> [170813 13:19]:
> >> On 08/13/2017 07:11 PM, Peter Silva wrote:
> >>>> apt by default automatically deletes packages files after a successful install,
> >>>
> >>> I don't think it does that.
> >>
> >> The "apt" command line tool doesn't, but traditional "apt-get" does, as
> >> does "aptitude". This was documented in the release notes of Jessie and
> >> the changelog of the APT package when the "apt" wrapper was introduced.
> >
> > This differs from my experience.  My laptop's /var/cache/apt/archives/
> > directory has 3459 .deb files.  I use aptitude almost exclusively, and I
> > update several times a week.
>
> Erm, rereading my text, I misspoke a bit, I meant the opposite of
> what I said:
>
>  - apt-get / aptitude leave /var/cache/apt/archives lying around
>  - apt doesn't

Thanks.  My system isn't completely backwards, then!

> > Is there an apt.conf parameter that controls this?
[snipped good answers]

Thanks.  I won't need the apt.conf entry, since it applies to apt, not
apt-get and aptitude, but I appreciate your pointing it out.

...Marvin

Re: Proposal: A new approach to differential debs

Holger Levsen-2
In reply to this post by Paul Wise via nm
On Sun, Aug 13, 2017 at 09:20:28AM -0400, Paul Wise wrote:
> Low-speed connections and low bandwidth quotas (especially on mobile)
> are still common enough around the world that delta upgrades make a
> difference right now, IIRC even Google uses them for Chrome.

Fedora also has them and users enjoy shorter upgrade times there.

Also, with reproducible builds the delta is smaller than without.


--
cheers,
        Holger

Re: Proposal: A new approach to differential debs

Peter Silva
In reply to this post by Christian Seiler
So in spite of being the *default*, it isn't that universal, and in
any event, we can just decide to change the default, no?  One can say
to people with bandwidth limitations that their apt settings should
not delete packages after receipt, so that they can be used as the
basis for updates.  And these types of settings would appear to be
rather common already, so it isn't a huge change.

It strikes me as much simpler and lower-cost to add zsync to the
current repo/apt tools, and that asking clients to do some caching to
support it is reasonable.


On Sun, Aug 13, 2017 at 1:19 PM, Christian Seiler <[hidden email]> wrote:

> On 08/13/2017 07:11 PM, Peter Silva wrote:
>>> apt by default automatically deletes packages files after a successful install,
>>
>> I don't think it does that.
>
> The "apt" command line tool doesn't, but traditional "apt-get" does, as
> does "aptitude". This was documented in the release notes of Jessie and
> the changelog of the APT package when the "apt" wrapper was introduced.
>
> Regards,
> Christian

Re: Proposal: A new approach to differential debs

Julian Andres Klode-4
On Sun, Aug 13, 2017 at 08:24:27PM -0400, Peter Silva wrote:
> o in spite of being the *default*, it isn't that universal, and in
> any event, we can just decide to change the default, no? One can say
> to people with bandwidth limitations, that their apt settings should
> not delete packages after receipt, so that they can be used as the
> basis for updates.  And these types of settings would appear to be
> rather common already, so it isn't a huge change.

We also have to consider embedded devices. I was told there are small
embedded devices with really slow connections around that run Debian
and automatically update using apt - we don't want those to have to
keep old packages around either.

And quite practically speaking, I do not want to keep packages around
on my system, but I do want deltas, because at 10 Mbit/s, it's a pain
to upgrade. I also want to keep my /var small, so I have actually usable
space rather than duplicates of files.

But since deltas require implementation in dpkg, it's not my
decision anyway, it's guillem's. I'm just proposing ideas that
seem reasonable.

--
Debian Developer - deb.li/jak | jak-linux.org - free software dev
                  |  Ubuntu Core Developer |
When replying, only quote what is necessary, and write each reply
directly below the part(s) it pertains to ('inline').  Thank you.

A radically different proposal for differential updates

Christian Seiler
In reply to this post by Julian Andres Klode-4
Hi there,

I've come to believe that binary diff packages are not the best way of
solving this issue. Instead I'd like to propose a radically different
solution to this issue.

The gist of it: instead of adding a format for how deltas work, I
propose to introduce a new format for storing Debian packages that will
be used for both the initial installation _and_ incremental updates.

This idea was inspired by the following talk given by Lennart
Poettering about a new tool he's written (which is already packaged for
Debian by the way):
https://www.youtube.com/watch?v=JnNkBJ6pr9s



Now to my proposal:

A Debian package currently consists of two files: control.tar.gz and
data.tar.xz (or .gz). What I want to propose is a new format that does
not contain a data.tar.xz at all. Instead I'd like to split the
data.tar.xz into chunks and have the new format only contain an index
that references these chunks. Let me call this new format "cdeb" for
"chunked deb".

A .deb package can be decomposed trivially into a .cdeb via the
following process:

  - uncompress data.tar.xz to obtain data.tar
  - split data.tar into chunks (see below)
  - create a list of cryptographically secure hashes of these chunks
    and store them in an index file
  - compress each chunk individually
  - the .cdeb would then contain control.tar.gz and data.chunks.gz,
    where the latter is the aforementioned chunk index

If you have a .cdeb package and all the chunks it references, one can
unpack that by decompressing each chunk in order, concatenating them
and then running tar.
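
As a rough approximation of that process with existing tools (casync
stands in for the custom chunker here; file names and options are
illustrative):

  ar x hello_2.10-2_amd64.deb data.tar.xz   # pull the data member out of the deb
  unxz data.tar.xz                          # back to plain data.tar
  casync make --store=chunks.castr data.tar.caibx data.tar
  # chunks.castr/ now holds the compressed chunks, data.tar.caibx the
  # index; control.tar.gz plus such an index is essentially the .cdeb

  # reassembly from index plus chunk store:
  casync extract --store=chunks.castr data.tar.caibx data.tar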

In and of itself this just gives us a lot more overhead. However, there
are two key ideas that additionally come into play here:

1. The way the chunks are split: instead of splitting this up into
chunks of a fixed size, use a rolling hash function instead. Lennart's
casync tool uses Buzhash for this. The basic idea behind a rolling
hash is that you update the hash function with each byte until the
current hash value fulfills a certain property (such as that it's
divisible by a specific number). That's when you put a chunk boundary
there. This has the consequence that even if you insert a byte into
the data stream that is being chunked at some point this will only have
local consequences, and once you're far away from that position you
will still have the chunk boundaries placed at the same points.

Note that the hash function used here is just for splitting the files
up (and is not a cryptographic hash function); a separate hash function
is to be used to identify chunks once they're split.

2. Reconstruction of data.tar from the currently installed package.
When doing updates we do not want to have to store either the previous
full .deb files or all of the previous chunks on the system. Instead
we can recreate data.tar from the installed package: we do know which
files belong to the package (dpkg has that information) and we can read
those files and recreate a tarball from them. We can then split that
back into chunks - and reuse any chunks that are still the same during
updates. Thus the installed system is an implicit cache of Debian's
packages, and we don't need to store additional data.
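
A minimal sketch of such a reconstruction with stock tools (assuming an
unmodified package and GNU tar; real code would have to normalise member
names, ordering and mtimes to match exactly what dpkg-deb produced):

  dpkg-query -L hello | LC_ALL=C sort | \
      tar --create --no-recursion --files-from=- \
          --owner=root --group=root --numeric-owner \
          --file=data-reconstructed.tar
  # then chunk data-reconstructed.tar as above and reuse matching chunks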

The following would be an example of APT installing a new package:

1. Download the .cdeb for the package.
2. Extract the list of chunks in the package.
3. Download those chunks from the archive
4. Recreate data.tar from these chunks
5. Proceed normally

The following would be an example of APT upgrading an existing package:

1. Download the .cdeb for the package.
2. Extract the list of chunks in the package.
3. Reconstruct a data.tar of the current package with what's currently
   in the filesystem, and apply the same chunking algorithm to it.
4. Any chunks that are common with the new package should be stored in
   the download directory.
5. Download the remaining chunks from the archive.
6. Recreate data.tar from these chunks.
7. Proceed normally


This scheme has a lot of advantages:

 - No need to decide what packages to diff against each other,
   everything "just works"[tm]
 - No need to explicitly cache anything on the client side (no older
   debs etc.)
 - You download only what you actually need. If you haven't modified
   the files installed by a package, an upgrade is going to be cheap.
   If you have you'll need to download more.
 - Deduplication within the archive. I expect that the mirrors will
   actually require less space when using this scheme - in the long
   term at least (see below).
 - Required tooling changes are relatively simple


There are some disadvantages:

 - Obviously this will incur some overhead in case there is no older
   version of a package installed.
 - How to implement this: as we want to support upgrades between Debian
   versions we either have to wait until Buster is released with APT
   supporting this before we change the archive in Bullseye (leading to
   a lot less testing opportunities), _or_ we keep the archive around
   in both formats for at least a release, which would increase the
   mirror sizes quite a bit in the short term
 - You don't have a "single file for a single binary package" concept
   anymore, at least not in transit. You can easily reconstruct such a
   package though
 - Currently packages can set a specific compression level for their
   .deb files if they so require. (For example large game data packages
   might opt to xz -9 to save space, because that might be worth it,
   but you don't want that on all packages, otherwise embedded systems
   might not be able to unpack their own packages due to the RAM
   requirements of xz when used at that level.) But as chunks are
   compressed individually you'll likely have a single compression
   level for all of the chunks. There may be ways around this (i.e.
   detect the compression level when converting and reuse that for the
   chunks).
 - Removing stuff from the archive gets a bit more complicated. We'd
   need a garbage collector for chunks, and an efficient one.

AFAQ (Anticipated frequently asked questions):

Q: How can you reconstruct a tarball from the installed system? Won't
   that change quite a bit? How will you ever be able to get more than
   just a couple of similar chunks out of this scheme?
A: If you modify stuff on your system: yes, the tarball you can
   generate from that will differ from the tarball that is in the .deb.
   However, as we want our .deb files to be reproducible, we are posing
   additional restrictions on the tarballs inside .deb files that are
   not part of the .tar standard. Most importantly the file ordering
   has to be well-defined - and currently this is done via sorting in
   the C locale. If we recreate the tarball from the system with these
   same constrains, we should ideally get an identical tarball back if
   no files are changed.

   Lennart has introduced a different format in his 'casync' tool
   because of three reasons: reproducibility, metadata and
   addressability. The latter two don't apply to us here, so we can
   get away with keeping .tar as our format, and just posing additional
   restrictions on it - which we already do for other reasons in the
   project.

   Also, this scheme will not break down if the tarball in the .deb
   doesn't have a predetermined order, it will just not save any
   bandwidth on updates anymore. So even in pathological cases we do
   degrade gracefully.

Q: Can I reconstruct a .deb file from a .cdeb and all of its chunks?
A: Yes.*

   * If you use a different version of xz than what was used to
     compress data.tar then the output might differ and it won't be
     bit-by-bit identical. With the same version of xz you should be
     able to get the same .deb file on a bit level. In either case
     the .deb will be functionally equivalent.

Q: Does this actually work?
A: Well, I don't have any code to test just now. But if you watch
   Lennart's talk you can see that the general principle works
   really well.

Q: How much will be saved for incremental updates?
A: Unfortunately I don't know yet.

Q: How much overhead will there be?
A: Unfortunately I don't know yet.

Q: Why keep control.tar.gz in the .cdebs?
A: Because it's metadata, and typically really small, and I do think
   that splitting this up further has no additional benefit. I may
   be wrong about that though.

Q: What about really small packages?
A: I think we should make an exception for tiny packages (for example
   less than 10k) and simply store them directly, as the overhead for
   additional network requests is likely going to be too high in that
   case.

Q: What about building Debian packages?
A: That's the beauty of this: people would still upload regular .deb
   packages, so the workflow for developers wouldn't change a single
   bit - and all the chunking magic would happen in the archive.

Q: What about tools like debootstrap etc.?
A: I believe it should be possible to implement this format in
   debootstrap in a relatively straightforward manner.

Q: What kind of hash functions do you want to use? What kind of
   parameters?
A: I don't know yet. This will have to be best determined via tests.
   I could also imagine that we might want to use different
   parameters for the main archive and e.g. the debug archive.

What are the next steps?

 - First I'd like to get some feedback: what do others here think?

 - If there is positive feedback towards this I'd like to write a tool
   that can convert .deb files into .cdeb + chunks and run that over a
   local copy of the entire archive to get some numbers about this:
   What kind of overhead is there? How much does this type of
   deduplication save? How long does this take?

 - With some numbers there we would then decide if this is something
   that we'd want to implement in dpkg, APT, dak, etc. or not.
   
Anyway: thoughts?

Regards,
Christian

Re: A radically different proposal for differential updates

Julian Andres Klode-4
On Tue, Aug 15, 2017 at 09:26:24AM +0200, Christian Seiler wrote:

> Hi there,
>
> I've come to believe that binary diff packages are not the best way of
> solving this issue. Instead I'd like to propose a radically different
> solution to this issue.
>
> The gist of it: instead of adding a format for how deltas work, I
> propose to introduce a new format for storing Debian packages that will
> be used for both the initial installation _and_ incremental updates.
>
> This idea was inspired by the following talk given by Lennart
> Poettering about a new tool he's written (which is already packaged for
> Debian by the way):
> https://www.youtube.com/watch?v=JnNkBJ6pr9s
>
>
>
> Now to my proposal:
>
> A Debian package currently consists of two files: control.tar.gz and
> data.tar.xz (or .gz). What I want to propose is a new format that does
> not contain a data.tar.xz at all. Instead I'd like to split the
> data.tar.xz into chunks and have the new format only contain an index
> that references these chunks. Let me call this new format "cdeb" for
> "chunked deb".
>
[...]
> Anyway: thoughts? Regards, Christian

It's of course an awesome idea. But:

I generally agree with the idea of chunk stores. They'd improve
things a lot. Now, instead of chunking the tarfiles, chunk both
the individual files, and the tarfiles. Then, with an index for
the individual files in control.tar listing the chunks, you can
easily reconstruct just the files that changed on your system
and avoid any rebuilding of debs for upgrades :D
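
A purely hypothetical shape for such a per-file index shipped in
control.tar (nothing like this exists today):

  usr/lib/firefox/libxul.so   <sha256 of chunk 1> <sha256 of chunk 2> ...
  usr/lib/firefox/omni.ja     <sha256 of chunk 1> ...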

That said, I believe that this change won't sell. Replacing the
basic format the repository is made of won't fare well. Too many
tools (most of which probably are not known) rely on the presence
of deb files in archives.

So as sad as it might be, I think we probably have to settle for
delta files.

--
Debian Developer - deb.li/jak | jak-linux.org - free software dev
                  |  Ubuntu Core Developer |
When replying, only quote what is necessary, and write each reply
directly below the part(s) it pertains to ('inline').  Thank you.

Evaluation (Re: Proposal: A new approach to differential debs)

Julian Andres Klode-4
In reply to this post by Julian Andres Klode-4
On Sat, Aug 12, 2017 at 02:16:21PM -0400, Julian Andres Klode wrote:

> Hi everyone,
>
> (I CCed -devel and deity, but we probably should just discuss
>  that on -dpkg)
>
> while breakfast here at DebConf, the topic of delta upgrades
> came up. I think delta debs are generally a thing we should
> aim to have, but debdelta is not the implementation we want:
>
> * It is not integrated into APT or dpkg
> * It relies on a script shipped in the debdelta to generate
>   a new deb from and old deb

> We guessed that generating the new deb from the old deb is
> actually the slowest part of the whole debdelta process. I
> propose a new solution, which does not really have a name
> yet.

# Evaluation

I built a few sample delta debs, using bsdiff (.pdeb) and
xdelta3 (.xdeb). I attached the scripts to generate them
and apply them to an unpacked directory tree.

Feel free to check with other packages, here are the current
evaluations, including a comparison against debdelta.

libreoffice-core (size only):

  -rw-r--r-- 1 jak jak 29M Jul 22 20:02 libreoffice-core_5.3.5~rc1-3_amd64.deb
  -rw-r--r-- 1 jak jak 31M Jul 16 00:10 libreoffice-core_5.4.0~rc2-1_amd64.deb
  -rw-r--r-- 1 jak jak 31M Jul 28 18:29 libreoffice-core_5.4.0-1_amd64.deb
  -rw-r--r-- 1 jak jak  18M Aug 15 23:44 libreoffice-core_5.3.5~rc1-3_5.4.0-1_amd64.pdeb
  -rw-r--r-- 1 jak jak 4.5M Aug 15 23:42 libreoffice-core_5.4.0~rc2-1_5.4.0-1_amd64.pdeb

For 5.4~rc2 to 5.4 it made a huge difference; for 5.3.5~rc1 to 5.4 not so much.
So it probably is a good fit for stable updates, but not for unstable and testing.

firefox (size & performance):

 -rw-r--r-- 1 jak jak 2.3M Aug 15 20:59 firefox_55.0-1_55.0-2_amd64.debdelta
 -rw-r--r-- 1 jak jak 2.4M Aug 15 22:13 firefox_55.0-1_55.0-2_amd64.pdeb
 -rw-r--r-- 1 jak jak 7.4M Aug 15 22:36 firefox_55.0-1_55.0-2_amd64.xdeb
 -rw-r--r-- 1 jak jak  38M Aug 10 06:49 firefox_55.0-1_amd64.deb
 -rw-r--r-- 1 jak jak  38M Aug 10 12:44 firefox_55.0-2_amd64.deb

Generating the -2 deb from the -1 deb using debdelta took about 47 seconds. In
contrast, applying the .pdeb and .xdeb files to an installed directory tree
took about 1.5 seconds.

The .pdeb uses bsdiff compression, the .xdeb uses xdelta 3. It took
96 seconds to generate the pdeb, and 13 seconds to generate the xdeb
on my ThinkPad X230 with 16 GB of RAM and a Core i5-3320M.
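
For the curious, per-file deltas of this kind can be produced and applied
roughly as follows (the attached scripts are the authoritative version;
paths are illustrative and new/removed files are not handled here):

  # build: diff one file of the unpacked old tree against the new tree
  dpkg-deb -R firefox_55.0-1_amd64.deb old
  dpkg-deb -R firefox_55.0-2_amd64.deb new
  bsdiff old/usr/lib/firefox/libxul.so new/usr/lib/firefox/libxul.so libxul.so.bsdiff
  # (or: xdelta3 -e -s old/usr/lib/firefox/libxul.so new/usr/lib/firefox/libxul.so libxul.so.xd3)

  # apply: patch the installed file into a .dpkg-new file
  bspatch /usr/lib/firefox/libxul.so /usr/lib/firefox/libxul.so.dpkg-new libxul.so.bsdiff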

# Conclusions

1. xdelta3 is substantially faster than bsdiff, but bsdiff produces substantially
   smaller update sizes.
2. deltas for major updates are too big to be massively useful, so focusing on
   stable and security updates seems like a good idea (though we do need to have
   a set of pdebs for testing in unstable and/or testing)

# Further extensions

If you put a pristine-tar delta into the delta file, you can fully
reconstruct debs.

--
Debian Developer - deb.li/jak | jak-linux.org - free software dev
                  |  Ubuntu Core Developer |
When replying, only quote what is necessary, and write each reply
directly below the part(s) it pertains to ('inline').  Thank you.

[Attachment: build-patch.sh (902 bytes)]
[Attachment: apply-patch.sh (706 bytes)]

Re: Proposal: A new approach to differential debs

Julian Andres Klode-4
In reply to this post by Peter Silva
On Sun, Aug 13, 2017 at 10:53:16AM -0400, Peter Silva wrote:

> You are assuming the savings are substantial.  That's not clear.  When
> files are compressed, if you then start doing binary diffs, well it
> isn't clear that they will consistently be much smaller than plain new
> files.  it also isn't clear what the impact on repo disk usage would
> be.
>
>> The most straightforward option:
> The least intrusive way to do this is to add differential files in
> addition to the existing binaries, and any time the differential file,
>> compared to a new version exceeds some threshold size (for example:
> 50%) of the original file, then you end up adding the sum total of the
> diff files in addition to the regular files to the repos.  I haven't
> done the math, but it's clear to me that it ends up being about double
> the disk space with this approach. it's also costly in that all those
> files have to be built and managed, which is likely a substantial
> ongoing load (cpu/io/people)  I think this is what people are
> objecting to.
>
> A more intrusive and less obvious way to do this is to use zsync (
> http://zsync.moria.org.uk/. ) With zsync, you build tables of content
> for each file, using the same 4k blocking that rsync does. To handle
> compression efficiently, there needs to be an understanding of
> blocking, so a customized gzip needs to be used.  With such a format,
> you produce the same .deb's as today, with the .zsyncs (already in
> use?) and the author already provides some debian Packages files as
> examples.  The space penalty here is probably only a few percent.

Today's research has shown that rolling hashes do not perform well
on executables because of changing offsets and so on destroying the
hashes. There were no measurable space savings when adding fairly
similar firefox releases to either a casync or borg repository -
and that's on uncompressed tarballs of the latest 2 firefox uploads.
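
For anyone who wants to reproduce that kind of measurement, the rough
procedure is (commands and file names illustrative):

  casync make --store=fx.castr firefox_55.0-1_data.caibx firefox_55.0-1_data.tar
  du -sh fx.castr     # store size after the first tarball
  casync make --store=fx.castr firefox_55.0-2_data.caibx firefox_55.0-2_data.tar
  du -sh fx.castr     # little growth here would mean chunks are shared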

--
Debian Developer - deb.li/jak | jak-linux.org - free software dev
                  |  Ubuntu Core Developer |
When replying, only quote what is necessary, and write each reply
directly below the part(s) it pertains to ('inline').  Thank you.

Re: Proposal: A new approach to differential debs

Julian Andres Klode-4
In reply to this post by Adrian Bunk-3
On Sun, Aug 13, 2017 at 12:38:56PM +0300, Adrian Bunk wrote:

> On Sat, Aug 12, 2017 at 02:16:21PM -0400, Julian Andres Klode wrote:
> >...
> > I think delta debs are generally a thing we should aim to have,
> >...
>
> It sounds like something that would have been a cool feature 20 years
> ago when I was downloading Debian updates over an analog modem.
>
> Today the required effort, infrastructure and added complexity would
> IMHO not be worth it for a potential few percent of bandwidth decrease.
>
> > The .diff.tar member contains patches to apply to individual
> > files of the old package. No idea about specific algorithm
> > to choose here, yet.
> >...
>
> Do you really want to ship *patches*, and not just copies of all
> changed files?
>
> Patching binaries to a new upstream version doesn't sound to me like
> something that would make sense.

bsdiff was specifically invented for patching binaries. See the
evaluation I posted a (few) hour(s) ago. It's used successfully by
FreeBSD, Chrome, Android, the Apple App Store, and other places.

Chrome has meanwhile switched to Courgette, which basically just
disassembles the code and unifies offsets before passing it
to bsdiff.

Especially for security and stable updates, as opposed to new
(major) upstream versions, it makes a substantial difference
(93% savings for the firefox 55.0-1 to 55.0-2 update, from 38 MB
 down to 2.4 MB).

--
Debian Developer - deb.li/jak | jak-linux.org - free software dev
                  |  Ubuntu Core Developer |
When replying, only quote what is necessary, and write each reply
directly below the part(s) it pertains to ('inline').  Thank you.

Re: A radically different proposal for differential updates

Peter Silva
In reply to this post by Julian Andres Klode-4
Isn't there kind of a universal issue that tar and compression happen
sort of in the wrong order?  Wouldn't it make more sense to make files
that were .gz.tar (i.e. compress the files individually, then have an
index into them via tar)?  Then tar works perfectly well for
extracting individual files without having to download the entire
thing.  For people who want the entire file, no problem.  For people
who want to minimize client-side storage, they can fetch byte ranges
of files to extract the table of contents.

The delta dpkgs are just the .gz.tar of the new files. tar already
has --append, which might be used to just add the deltas to the
existing files, so they become a single dpkg that contains multiple
versions.

Once you can seek (and remotely use byte ranges) there are a whole lot
of options, and it is compression after the index is built that takes
all those options away.
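
A crude sketch of that idea (whether standard tools then still cooperate
nicely is exactly the open question):

  # compress each file first, then archive the compressed members
  find pkgroot -type f -exec gzip -9 {} +
  tar -cf hello_2.10-2.gz.tar -C pkgroot .
  # a single member can now be pulled out (or fetched via byte ranges)
  # without decompressing anything else:
  tar -xf hello_2.10-2.gz.tar ./usr/bin/hello.gz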



On Tue, Aug 15, 2017 at 2:51 PM, Julian Andres Klode <[hidden email]> wrote:

> On Tue, Aug 15, 2017 at 09:26:24AM +0200, Christian Seiler wrote:
>> Hi there,
>>
>> I've come to believe that binary diff packages are not the best way of
>> solving this issue. Instead I'd like to propose a radically different
>> solution to this issue.
>>
>> The gist of it: instead of adding a format for how deltas work, I
>> propose to introduce a new format for storing Debian packages that will
>> be used for both the initial installation _and_ incremental updates.
>>
>> This idea was inspired by the following talk given by Lennart
>> Poettering about a new tool he's written (which is already packaged for
>> Debian by the way):
>> https://www.youtube.com/watch?v=JnNkBJ6pr9s
>>
>>
>>
>> Now to my proposal:
>>
>> A Debian package currently consists of two files: control.tar.gz and
>> data.tar.xz (or .gz). What I want to propose is a new format that does
>> not contain a data.tar.xz at all. Instead I'd like to split the
>> data.tar.xz into chunks and have the new format only contain an index
>> that references these chunks. Let me call this new format "cdeb" for
>> "chunked deb".
>>
> [...]
>> Anyway: thoughts? Regards, Christian
>
> It's of course an awesome idea. But:
>
> I generally agree with the idea of chunk stores. They'd improve
> things a lot. Now, instead of chunking the tarfiles, chunk both
> the individual files, and the tarfiles. Then, with an index for
> the individual files in control.tar listing the chunks, you can
> easily reconstruct just the files that changed on your system
> and avoid any rebuilding of debs for upgrades :D
>
> That said, I believe that this change won't sell. Replacing the
> basic format the repository is made of won't fare well. Too many
> tools (most of which probably are not known) rely on the presence
> of deb files in archives.
>
> So as sad as it might be, I think we probably have to settle for
> delta files.
>
> --
> Debian Developer - deb.li/jak | jak-linux.org - free software dev
>                   |  Ubuntu Core Developer |
> When replying, only quote what is necessary, and write each reply
> directly below the part(s) it pertains to ('inline').  Thank you.
>

Re: Proposal: A new approach to differential debs

Jonathan Nieder
In reply to this post by Julian Andres Klode-4
Hi,

Julian Andres Klode wrote:

> Today's research has shown that rolling hashes do not perform well
> on executables because of changing offsets and so on destroying the
> hashes. There were no measurable space savings when adding fairly
> similar firefox releases to either a casync or borg repository -
> and that's on uncompressed tarballs of the latest 2 firefox uploads.

For the same reason, compressors like xz apply a filter to executables
before using more generic compression algorithms on them.  See e.g.
https://git.tukaani.org/?p=xz.git;a=blob;f=src/liblzma/simple/x86.c
(and the rest of the filters in that directory).
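
On the xz command line this corresponds to enabling the BCJ filter
explicitly, e.g.:

  xz --x86 --lzma2=preset=9e libxul.so   # x86 BCJ filter chained before LZMA2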

Thanks and hope that helps,
Jonathan
