~nabijaczleweli, do you want to take a whack at this? The thing which brought my attention to this was the lack of observability in git gc runs. It would be nice to cover the number of repos GC'd, a histogram of time required to GC each repo, and total duration of each task in -periodic, plus anything else that seems useful at the time. If we can get any info out of git regarding how much data we were able to save during the GC that would also be cool.
I suspect (but cannot confirm) that the near-miss outage today happened because the introduction of regular GCs increased the rate at which the present state of the filesystem diverged from the snapshots, and thus the rate at which the disk space occupied by snapshots grew.
git-gc(1) is very, hm, laconic in its output (this is code for "there is none except for errors") and the only way to get the savings is to run the equivalent of du(1) (cheap, since the only difference is gonna be the size of the sole pack file under .git/objects/pack/), but I don't really foresee a problem with this.
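The du(1)-equivalent could look something like this (a sketch; the function name is mine, and note that git may keep more than one pack file before a gc coalesces them):

```python
import os

def pack_size(repo: str) -> int:
    """Sum the apparent sizes of everything under .git/objects/pack,
    the directory git-gc(1) repacks; a cheap du --apparent-size."""
    pack_dir = os.path.join(repo, ".git", "objects", "pack")
    total = 0
    for entry in os.scandir(pack_dir):
        if entry.is_file(follow_symlinks=False):
            total += entry.stat(follow_symlinks=False).st_size
    return total

# Savings for one repo: measure around the gc run.
# before = pack_size(repo)
# subprocess.run(["git", "-C", repo, "gc"], check=True)
# saved = before - pack_size(repo)
```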
However, I'm not sure where the metrics would, like, physically go? Should I just slap an LDJSON thing in /var/log, or do you have something that'd integrate better?
The best place to put the metrics would be here:
Maybe we can patch git-gc upstream to add a mode which gives us more insight into the process?
Trying now, here's the output from git-gc(1):
nabijaczleweli@tarta:~/code/meta.sr.ht$ git gc > /dev/null
Enumerating objects: 3644, done.
Counting objects: 100% (3644/3644), done.
Delta compression using up to 24 threads
Compressing objects: 100% (1341/1341), done.
Writing objects: 100% (3644/3644), done.
Total 3644 (delta 2456), reused 3333 (delta 2252)
Note how (a) the info all goes out over stderr, but (b) we do get a metric of how much was done, so to say: re-running the GC made the last line
Total 3644 (delta 2456), reused 3644 (delta 2456)
as there was nothing left to do, but the first run only reused 91% of the objects. However, this output disappears if redirected, and the
subprocess.run(["git", "gc"], stderr=subprocess.PIPE)
we'd use also makes it go blank. An upstream patch would be necessary for this.
Alternatively, we could read the pack directory directly, which'd amount to, what, two-times-five syscalls max? and present the part that actually matters ‒ on-disk usage before/after (a quick test reveals that even an empty GC run changes inode numbers, even if the SHAs and filenames remain the same, so I'm pretty sure it gets unshared from the snapshot anyway?). As for patching git to print out file sizes: no part of git that I know of deals in bytes, much less on-disk bytes of internal data, I'm unconvinced it's worth the effort writing a patch for it to get rejected.
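For on-disk (rather than apparent) usage, st_blocks is what du(1) counts by default; a sketch, assuming the same pack-directory layout as above:

```python
import os

def pack_disk_usage(repo: str) -> int:
    """On-disk bytes of .git/objects/pack: POSIX defines st_blocks
    in 512-byte units regardless of the filesystem block size."""
    pack_dir = os.path.join(repo, ".git", "objects", "pack")
    return sum(e.stat(follow_symlinks=False).st_blocks * 512
               for e in os.scandir(pack_dir)
               if e.is_file(follow_symlinks=False))
```

This is the two-times-five-syscalls-max version: one scandir pass before the gc, one after, and the difference is the reclaimed on-disk space.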
Pushgateway seems workable enough, I'll start on an implementation with that and directory-based metrics.
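A sketch of what that might look like with the prometheus_client library (the metric names, job name, and gateway address are all placeholders, not anything sr.ht has today):

```python
import time

from prometheus_client import (CollectorRegistry, Counter, Gauge,
                               Histogram, push_to_gateway)

registry = CollectorRegistry()
repos_gcd = Counter("gitsrht_gc_repos",
                    "Repositories GC'd this run", registry=registry)
gc_seconds = Histogram("gitsrht_gc_duration_seconds",
                       "Time spent GC'ing each repository",
                       registry=registry)
bytes_saved = Gauge("gitsrht_gc_bytes_saved",
                    "On-disk bytes reclaimed across all repositories",
                    registry=registry)

def gc_one(repo: str) -> None:
    start = time.monotonic()
    # ... run git gc here, sizing the pack directory before/after
    #     and feeding the difference into bytes_saved ...
    gc_seconds.observe(time.monotonic() - start)
    repos_gcd.inc()

# After the -periodic task finishes, push everything in one shot:
# push_to_gateway("localhost:9091", job="gitsrht-periodic",
#                 registry=registry)
```

Using a dedicated CollectorRegistry (rather than the global default) keeps each push self-contained, so a run that GC's nothing doesn't carry stale values forward.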