sr.ht's source tarballs may not be deterministically generated

git-archive produces stable .tar output, but then hands the result off to an external compressor, namely the gzip binary. Let's compare downloads of the same tag across source hosting platforms:

https://github.com/archlinux/devtools/archive/20190821.tar.gz
https://git.archlinux.org/devtools.git/snapshot/devtools-20190821.tar.gz
https://gitlab.com/eschwartz/devtools/-/archive/20190821/devtools-20190821.tar.gz
https://git.sr.ht/~eschwartz/devtools/archive/20190821.tar.gz

$ sha256sum *devtools-20190821.tar.gz
4557e5db0225db0aab0d26b853907b3308037a05231519a69f7882ee2168b3b3  cgit-devtools-20190821.tar.gz
4557e5db0225db0aab0d26b853907b3308037a05231519a69f7882ee2168b3b3  github-devtools-20190821.tar.gz
4557e5db0225db0aab0d26b853907b3308037a05231519a69f7882ee2168b3b3  gitlab-devtools-20190821.tar.gz
fe222eb819bf0dd410ab6a3201fc196961746e3b2f1866dae5ca5d27142da208  srht-devtools-20190821.tar.gz

One of these tarballs is not like the others! However, the underlying tar is the same.

$ gzip -dk github-devtools-20190821.tar.gz srht-devtools-20190821.tar.gz
$ sha256sum *devtools-20190821.tar
528100dae1d0c2a4747b43b818e6a8776dc66723afcba33e615baac9874eac77  github-devtools-20190821.tar
528100dae1d0c2a4747b43b818e6a8776dc66723afcba33e615baac9874eac77  srht-devtools-20190821.tar
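
To illustrate why the compressed checksums can differ while the tar underneath matches, here is a minimal sketch. It is not what any of the forges run: the payload is a hypothetical stand-in generated with seq, and gzip -1 versus gzip -9 merely stands in for two different gzip implementations.

```shell
#!/bin/sh
# Sketch: one payload, two compressor settings, different .gz bytes.
set -eu

# Print only the SHA-256 hex digest of stdin.
sha256_of_stdin() {
    sha256sum | cut -d' ' -f1
}

workdir=$(mktemp -d)

# Hypothetical stand-in for devtools-20190821.tar.
seq 1 20000 > "$workdir/payload.tar"

# -n omits name and timestamp from the header.
# Levels 1 and 9 are an assumption standing in for two gzip implementations.
gzip -n -1 -c "$workdir/payload.tar" > "$workdir/fast.tar.gz"
gzip -n -9 -c "$workdir/payload.tar" > "$workdir/best.tar.gz"

fast_gz_sum=$(sha256_of_stdin < "$workdir/fast.tar.gz")
best_gz_sum=$(sha256_of_stdin < "$workdir/best.tar.gz")
fast_tar_sum=$(gzip -dc "$workdir/fast.tar.gz" | sha256_of_stdin)
best_tar_sum=$(gzip -dc "$workdir/best.tar.gz" | sha256_of_stdin)

echo "compressed:   $fast_gz_sum vs $best_gz_sum"
echo "decompressed: $fast_tar_sum vs $best_tar_sum"

rm -rf "$workdir"
```

The compressed digests should differ while the decompressed digests agree, which mirrors the sr.ht versus GitHub comparison.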

It seems sr.ht is hosted on Alpine, where the gzip binary is provided by busybox. SSHing into a builds.sr.ht Alpine image and running gzip -n on the .tar shows that this busybox build reproduces the same tarball as git.sr.ht. So this is where git.sr.ht is getting the unusual output.
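
For context on the -n flag used above: with GNU gzip, compressing a named file without -n records the file's name and modification time in the gzip header, so the output depends on metadata as well as content. The sketch below assumes GNU gzip behaviour (other implementations may differ); payload.tar and the helper gz_sum_with_mtime are hypothetical names used only for illustration.

```shell
#!/bin/sh
# Sketch: effect of gzip -n on reproducibility (assumes GNU gzip).
set -eu

# Print the SHA-256 digest of "gzip [flags] -c payload.tar" after forcing
# the file's mtime. $1 = timestamp for touch -t, remaining args = gzip flags.
gz_sum_with_mtime() {
    stamp=$1
    shift
    dir=$(mktemp -d)
    printf 'stand-in archive contents\n' > "$dir/payload.tar"
    touch -t "$stamp" "$dir/payload.tar"
    gzip "$@" -c "$dir/payload.tar" | sha256sum | cut -d' ' -f1
    rm -rf "$dir"
}

# Same content, different mtimes, header metadata kept.
plain_2019=$(gz_sum_with_mtime 201908210000)
plain_2020=$(gz_sum_with_mtime 202008210000)

# Same content, different mtimes, header metadata omitted with -n.
bare_2019=$(gz_sum_with_mtime 201908210000 -n)
bare_2020=$(gz_sum_with_mtime 202008210000 -n)

echo "without -n: $plain_2019 vs $plain_2020"
echo "with -n:    $bare_2019 vs $bare_2020"
```

With -n the two digests should match regardless of mtime, which is why -n is the relevant baseline when comparing compressors.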

Does busybox guarantee a stable output? It is certainly not generating the exact same bytes as GNU gzip is.

More worryingly, I cannot reproduce this output on my Arch Linux laptop. There, busybox gzip -n produces the following sha256sum: 4449fda607906c232ba753c9a5b3299ce4b14750aab1ad1da65a3f774df43a8b

It seems that, at least to some extent, the output you get from busybox gzip depends on which version and/or build of busybox you have. Maybe it would be better to require sr.ht to be hosted on a system with a non-busybox build of gzip.

~eschwartz 4 years ago

This has interesting applications for https://todo.sr.ht/~sircmpwn/git.sr.ht/231, because if the gzip compressor is unreliable it may be better to advise users to sign their sources via git notes --ref=refs/notes/signatures/tar, not tar.gz (of course, an argument could be made that that is more advisable even without this).
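
The same reasoning applies to plain checksums, not just signatures. As a rough sketch, a downloader could pin the hash of the decompressed tar stream instead of the .tar.gz; verify_tar_sha256 is a hypothetical helper name, not something sr.ht or git provides.

```shell
#!/bin/sh
# Sketch: check the digest of the decompressed tar, not of the .tar.gz.
set -eu

# Succeed only if the decompressed contents of $1 hash to the digest in $2.
verify_tar_sha256() {
    archive=$1
    expected=$2
    actual=$(gzip -dc "$archive" | sha256sum | cut -d' ' -f1)
    [ "$actual" = "$expected" ]
}

# Possible usage, with the .tar digest reported earlier in this ticket:
#   verify_tar_sha256 srht-devtools-20190821.tar.gz \
#       528100dae1d0c2a4747b43b818e6a8776dc66723afcba33e615baac9874eac77
```

Because the check ignores the compressed bytes, it would accept the archive from any of the four services as long as the tar inside is identical.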

~eschwartz 4 years ago

See http://lists.busybox.net/pipermail/busybox/2019-September/087449.html

I will need to double-check that reality plays out as expected, but the next major.minor busybox release should change the compressor's output, which invalidates the checksums of all existing busybox-generated gzip files, and instead align it with what GNU gzip creates (where things will hopefully remain permanently).

~eschwartz 4 years ago

So it turns out this is a general issue, and it also causes https://github.com/swaywm/sway/issues/4603

~sircmpwn REPORTED FIXED 4 years ago

I think this is no longer the case.

~eyjhb 3 years ago

This is not fixed; downloading the tars above shows this.

~sircmpwn 3 years ago

Those are from four different services. The git.sr.ht tarball is self-consistent; it doesn't need to match the others.

~eschwartz 3 years ago

The tars above show the status has not changed; srht-devtools-20190821.tar.gz continues to have a checksum of fe222eb819bf0dd410ab6a3201fc196961746e3b2f1866dae5ca5d27142da208

However, if you boot into the alpine/edge image where BusyBox v1.32.0 is installed, busybox gzip -n < devtools-20190821.tar produces the same file as GNU gzip.

See this test case: https://builds.sr.ht/~eschwartz/job/331650

So the fix would be to upgrade busybox on the sourcehut server, e.g. by migrating to a newer version of Alpine.

~eyjhb 3 years ago

If there is a possibility of matching the other sha256 checksums, and it is (maybe) as trivial as updating the Alpine version (I am not an expert in the setup), then it would be nice if that could be done.

The reason is pretty much that you can fall back to any service if the others are unavailable, without needing a different hash for each service, e.g. srht -> gitlab -> github -> else.

~eschwartz 3 years ago

This is now fixed in the production sr.ht instance (which migrated from alpine 3.12 -> 3.13 today and therefore upgraded busybox), so all sourcehut-generated archives undergo a one-time change and should then behave like comparable archives from any GNU gzip-using software forge going forward.
