From readme: I am not CONVINCED that a linker is the best way to handle things. This has implications for things like distributing libraries, defining ABIs, using DLLs, and parallelizing the compiler itself. No solid thoughts here yet, but it is an area worth thinking about. Rust, C, Go and Swift present different points in this design space to look at.
Semi-related: #8, since it looks at "why modules and compilation units" a bit.
For reference, it looks like Zig, like Rust, compiles a whole project as a single compilation unit. This post and the attached video suggest that Zig is also experimenting with rethinking the traditional linker pipeline. It's pretty cool.
https://lobste.rs/s/mxwit7/link_time_optimisation_lto has some interesting conversations. It's pretty clear to me by now that the traditional 1970's linking model can certainly be improved.
The trick to that, imo, is how to effectively parallelize compilation; that's the real problem that linkers try to solve these days. That kinda requires that whole-program analysis either be kept to an absolute minimum or also be effectively parallelized (and stay effective as it's scaled up to hundreds of cores and majillions of LoC). #52 has some related discussion.
plecra had a whole ocean of interesting thoughts, mostly about things I hadn't considered:
icefox: Linking will always be a bottleneck but it's possible you can mitigate that some
plecra: solving this for deterministic builds seems so tricky. There's a bunch of workable solutions when we allow nondeterminism or suboptimal codegen. Best solution I've got is deliberately overallocating functions, so that when we rebuild it's more likely to not change code layout. Seems like there's probably something smart that can be done about allocating SCCs of functions into single blocks so that relative calls within them don't need relocation.
To make it properly O(1) it'll need filesystem support too, so that the compiler doesn't need to rewrite the whole binary :D Deliberately segmenting the virtual memmap would also work well, but it breaks the optimal-code constraint because in principle it means the binary can't use hugepage mappings in a single copy from disk.
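A minimal sketch of that overallocation idea, assuming a flat layout and an arbitrary 256-byte slot granularity (`SLOT`, `padded_layout`, and the sizes below are made up for illustration, not anything from plecra):

```python
# Round every function's slot up to a fixed granularity so that a small size
# change on rebuild usually stays inside its old slot and nothing after it moves.
SLOT = 256  # bytes of padding granularity -- an arbitrary assumption, not a tuned value

def padded_layout(sizes: dict[str, int], base: int = 0x1000) -> dict[str, int]:
    addrs, cursor = {}, base
    for name, size in sizes.items():
        addrs[name] = cursor
        cursor += -(-size // SLOT) * SLOT  # round up to the next slot boundary
    return addrs

# Growing "main" by a few bytes leaves "parse" and "print" where they were,
# so their relocations don't need to be reapplied.
layout = padded_layout({"main": 120, "parse": 340, "print": 64})
```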
With a graph of callers found from the entrypoint:
- identify SCCs and combine them. These are pretty rare, especially in a language that forbids recursion between packages. We're left with a DAG.
- while there's more than one node in the graph: sort the nodes by # of incoming edges; for all nodes with a single incoming edge, merge them with the caller.
looking for a way to organise the linking of a call graph to make it incremental. Does this... work? I can't think of any graphs that wouldn't reduce but it seems too easy 😄. At each step, the root node has a set of callees S. For each S[i] that has > 1 incoming edges, there must be some other S[j] that refers to it. If S[i] transitively refers to S[j], it'd form an SCC and be collapsed in the first step ==> there must be at least one element with a single incoming edge.
hmm, using this to chunk up the graph seems promising for log(n) links. At its core it's just using isolated call graphs to get independent codegen, to turn it into a recursive problem. Having an explicit pass for collecting the call graph lets it organise the code right.
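A hedged sketch of how that chunking could look, as I read it (networkx is used purely for the graph plumbing; the merge policy is my reading of the description above, not plecra's code):

```python
# Collapse SCCs of the call graph, then repeatedly fold any node with exactly
# one caller into that caller, leaving chunks that can be linked independently.
import networkx as nx

def chunk_call_graph(call_graph: nx.DiGraph) -> list[frozenset]:
    # Step 1: collapse strongly connected components (mutual recursion).
    dag = nx.condensation(call_graph)  # each condensation node carries a 'members' set
    chunks = {n: set(dag.nodes[n]["members"]) for n in dag.nodes}

    # Step 2: while possible, merge nodes with a single incoming edge into
    # their unique caller, shrinking the DAG toward one node.
    changed = True
    while changed and dag.number_of_nodes() > 1:
        changed = False
        for node in list(dag.nodes):
            if node not in dag:
                continue
            preds = list(dag.predecessors(node))
            if len(preds) == 1:
                caller = preds[0]
                chunks[caller] |= chunks.pop(node)
                dag = nx.contracted_nodes(dag, caller, node, self_loops=False)
                changed = True
    return [frozenset(c) for c in chunks.values()]
```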
ooh but swizzling it is especially ideal for incremental emit:
```
# gets perfectly minimal rewrites: only a single layer of the call graph
# needs its relocations fixed
# con is that this has esp awful locality
for graph_layer in graph:  # can traverse the dag bottom-up
    for func in graph_layer:
        func_locations.push(func, exe.current_location())
        write(exe, apply_relocations(func_locations, func))
```
plecra: huh yeah ok that's interesting. References to functions with more callers are more unpredictable, because they must be dependent on the size of all callers ==> those should be linked last in their subgraphs.
so the log(n) factor is that when there's a call graph, we want to be able to edit a function at any depth, so each pair of depths is merged in a tree. E.g. when a leaf function is changed, the top half of the call graph can be copied to the output unchanged because its internal links are already done. Which can be guided by the merge traversal, I think. It's ordered by width of links, then by lowest set bit in the edge depth.
oh, got a better way to frame the problem. `link : Section -> Section -> Section` is the primitive, and it's associative + commutative. While `reduce sections empty_section link` works, a version that exploits the associativity to eval it as a tree has the ~idealised parallelism. The smallest sections are static decls + basic blocks, then everything composes up from there. To optimize that, the linker should also reorder the leaves to try to minimise the free names at each level, which means sorting the list of sections to minimise the distance between references. This is where all the graph logic comes in: SCCs are grouped, and dependencies are laid out ~start to end. Incremental linking can then kick in when we have a ~4MiB leaf, and copy it directly on disk. btree-y patterns look useful too for the normal reasons. right, finally clicked.
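A sketch of that framing, assuming `link` is an associative combine over opaque sections (`tree_link`, `link`, and `empty` are placeholders, not a real linker API): evaluating the fold as a balanced tree gives log-depth, and every pair within a level is independent.

```python
# Evaluate `link` over sections as a balanced tree instead of a left fold:
# O(log n) levels, and every pair within a level can be linked in parallel.
from concurrent.futures import ThreadPoolExecutor

def tree_link(sections, link, empty):
    if not sections:
        return empty
    layer = list(sections)
    with ThreadPoolExecutor() as pool:
        while len(layer) > 1:
            pairs = list(zip(layer[0::2], layer[1::2]))
            merged = list(pool.map(lambda ab: link(*ab), pairs))
            if len(layer) % 2:
                merged.append(layer[-1])  # odd section rides up to the next level
            layer = merged
    return layer[0]

# e.g. with strings standing in for sections:
print(tree_link(["a", "b", "c", "d", "e"], lambda x, y: x + y, ""))  # "abcde"
```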
icefox:
Have you heard of how Zig does this? Apparently they do their incremental binary building by having every function call go through a table of function pointers, which lets them freely relocate functions and kinda skips the linking step. And then on non-dev builds they just inline it away the same way they inline away any other indirect call they can. I have no idea if that's a good idea in the long term but it's a very literal way of dealing with it.
plecra:
I have! This is a great strategy for debug builds, but it has problems with determinism (it's very similar to how JITs handle the same requirement): if you want to avoid rewriting the whole binary then you need an allocation scheme that depends on the last compilation. Inlining the GOT for final builds works out to be as expensive as normal linking.
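For reference, a toy illustration of the call-through-a-table idea (this is only the shape of it, not Zig's actual machinery): call sites only know a slot, so swapping a function touches one table entry instead of every caller.

```python
# Toy model of indirect calls through a slot table: callers are never patched;
# an incremental "relink" of a changed function is one table write.
slots = {}

def call(name, *args):
    return slots[name](*args)

def hot_swap(name, new_impl):
    slots[name] = new_impl

slots["greet"] = lambda who: f"hello, {who}"
print(call("greet", "world"))   # hello, world
hot_swap("greet", lambda who: f"hi, {who}")
print(call("greet", "world"))   # hi, world -- no call site was rewritten
```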
Just some more notes
So we're minimizing the maximum ref distance per page; a page with only internal references won't ever need rewrites.
- we use # of incoming edges to approximate the distance when sections are linked, so we prefer links with low incoming pages. For the same reason we prefer linking small sections.
- why is linking a `print` caller bad? Because the result includes a reference to `print`, which has a high incoming edge degree and ==> distant references. This also explains why we merge sections with the same target: it's unlikely to affect the ref distance.
- so each wave of links needs to be conscious of this. expected_distance is based on incoming_section_size + target_symbol_offset. The first layer pairs up sections which minimise the combined max ref distance, then recurse on that.
- single-callers in fact aren't ideal then. They do eliminate a reference well, but might lower the longref density. We also want the extra callees to be low expected distance. When two callers merge, the incoming_section_size is stable.
This is all brainstorming how to structure the release build so that we can get some decent incrementality out of it. Having ~8KiB of code in each lib that only uses internal links is super doable, so the compiler should be using that to skip a bunch of relocations. Common functions like Vec::push are the worst case, but if the compiler can lay them out into denser usage patterns (2 pages that don't use it, then 2 pages that do) it can make the final link cheaper.
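Very much a sketch of that pairing idea, with made-up numbers: estimate a section's expected ref distance from its size and unresolved references, pair the cheapest sections first, and recurse on the merged layer. The `Section` shape, the distance proxy, and the "half the refs resolve internally" assumption are all my guesses, not a worked-out design.

```python
# Greedily pair sections so the combined worst-case reference distance stays
# small, then treat each merged pair as a new section and recurse.
from dataclasses import dataclass

@dataclass
class Section:
    size: int
    external_refs: int  # references that still point outside the section

def expected_distance(s: Section) -> int:
    return s.size + s.external_refs * 64  # crude proxy, not a measured model

def pair_layer(sections: list[Section]) -> list[Section]:
    ordered = sorted(sections, key=expected_distance)
    out = [Section(a.size + b.size, (a.external_refs + b.external_refs) // 2)
           for a, b in zip(ordered[0::2], ordered[1::2])]
    if len(ordered) % 2:
        out.append(ordered[-1])  # odd section rides up to the next layer
    return out
```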
icefox:
Yeah I guess determinism is hard for fast-turnaround incremental builds
plecra:
yeah >.> It's not really a big deal. Even pretty bad linking is only 1s for the release builds, and you only need those outside dev, but it's the principle of it >:)
icefox:
I mean a medium-sized Rust project can take like 10s for linking 😛 partially because it's not actually just linking, it's doing a ton of codegen and LTO and stuff in the process.
plecra:
yeah :) And those bits don't have this fundamental issue in the same way. Can all be parallelised better.
icefox:
true! but they make the bottleneck worse.
plecra:
LTO only depends on the local two functions. The reason applying relocations is so nasty is that every function depends on the size of every other function in the binary.
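A tiny illustration of that dependence, with made-up sizes: in a flat layout each function's address is the running sum of everything before it, so one function growing moves all later ones and forces their callers' relocations to be reapplied.

```python
# One function grows -> every function after it moves -> all references into
# those functions need their relocations applied again.
sizes = {"main": 120, "parse": 340, "print": 64, "helper": 48}

def layout(sizes, base=0x1000):
    addrs, cursor = {}, base
    for name, size in sizes.items():
        addrs[name] = cursor
        cursor += size
    return addrs

before = layout(sizes)
sizes["parse"] += 16                 # parse gains a few instructions
after = layout(sizes)
print([n for n in sizes if before[n] != after[n]])  # ['print', 'helper']
```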
icefox:
yeah, it's kinda a knapsack problem iiuc? like if you could add a layer of indirection it becomes incredibly trivial but then you have that layer of indirection. trying to pack things together to minimize the distance between them sounds like a very nice idea.
plecra:
hmm, I wouldn't say relocations are a knapsack problem really. The packing doesn't super matter. You allocate in one pass and then you know where everything is; there's no way for that to fail. (asm isel can be a knapsack problem)
icefox:
Slightly tangential, but I am also slightly hopeful about the idea of pre-computing inlined functions. It's a local improvement, but if you have something like Vec::push() that's used everywhere then maybe you have the compiler say "okay, once all the inlining and DCE is done, this is actually just a sequence of three instructions". So you short-circuit a bit of the cycle of "generate tons of code -> optimize a ton of it out".
plecra:
true! Relocations really suffer from the non-inlinable part. So malloc is relevant for the grow case in push.
icefox:
yeah, that idea is not really about relocations, just that inlining and linking are kinda enemies of each other
plecra:
seems like a nice middle ground, tho ofc not ideal for release builds where we want to specialize the inlined code
icefox:
ok, I think I get it: the question you're working on is how to make the linking reproducible without having a single-threaded bottleneck where you go through to rearrange and re-link everything. I confess I hadn't thought much about reproducibility, just "yeah of course it will be there", which definitely isn't the case if you're redoing incremental builds. Changing the ordering of how you do things could change the results, so you're coming up with a global ordering that is maximally parallelizable and consistent.
RIGHT so let's try to throw down some actual goals and non-goals out of this, instead of just "let's make linking fast(tm)".
Release builds:
- O(n log n) or better time and memory complexity; O(n^2) could be allowed for some degenerate cases
- Reproducible builds, always
- Good but not necessarily perfect instruction selection for calls/jumps
- Scales linearly with number of cores as much as we damn well can
- Stable ABI for public types/functions
Debug/dev builds:
- At least as good time and memory complexity as release builds
- Incremental builds
- Not necessarily reproducible builds
- Whole-program optimization not necessary for decent performance -- local inlining and such for example
- Scales linearly with number of cores
- No stable ABI, a la Rust??? Not sure if that makes a difference here
- Frankly a real JIT wouldn't necessarily be a bad idea for this sort of thing, though it's probably not feasible
Interesting experiment: https://www.kxxt.dev/blog/fully-dynamically-linked-rust-binary/