So far most of my thoughts are lifted mostly directly from Erlang: https://www.erlang.org/doc/programming_examples/bit_syntax.html and https://www.erlang.org/doc/reference_manual/expressions#bit_syntax
First off the important question: what delimiters do I use for these? Erlang uses << ... >>
, but I kinda want to use those for bit shifts, or at least have the option for it right now. I guess I'll just use b{ ... }
for now.
Erlang has only one real operation on them: pattern matching. So they can be a pattern full of values or unbound variables and you can match a bunch of bits onto it, or you can construct a bunch of values into a pile of bits and assign it to an unbound variable. Or, you know, any combination of the above, because Erlang.
I also want to be able to mesh them with a struct type and have it arrange the struct per that pattern, which I guess is just syntactic sugar for those operations. So I guess you could do something like:
type Thing = struct bar: I8, foo: I32, end layout b{ foo:32/little, bar:16/little, }
And it would tell you the type is an array of 6 bytes, the first 4 making a
I32
and the 5th as aU8
and the 6th unused. I suppose if it werebar:16/big
then thei8
would be in the 6th byte and the 5th unused. So then you could cast a[6]U8
to a Thing and it would be valid, and proooooooobably even Safe?Wouldn't it be easier/more obvious to do something like this?
type Thing = struct foo: I32/little, bar: I8/little, _: I8/little end
Was considering something like that but I want the layout and struct to not necessarily be tied together. But I want to be able to apply a layout to a tuple or sum type in the same way. So it would be equally valid to do:
type Thing = {I8, I32} layout b{ _1:32/little, _0:16/little, }
The
layout
syntax is kinda terrible, but idk what I'd do otherwise.layout(Thing, b{ ... })
? You can't really assign a layout to a variable or type... can you? Well... in Erlang it's a type that you pattern match on. Soooooo... why wouldn't it be a type?I want to be able to apply a layout to a sum type similarly, I guess by putting a layout on each arm? Can we do niche optimization in that way as well? Like, somehow describe Rust's
Option<NonNull<T>>
using a 0 value asNone
? Mmmmmmaybe. Not sure we should try to go too crazy with it though, at least not yet.
Here's another good example of the sort of thing I want to be able to say, the RISC-V instruction set from here: https://secret.club/2023/12/24/riscy-business.html
union Instruction { struct { uint32_t compressed_flags : 2; uint32_t opcode : 5; uint32_t : 25; }; struct { uint32_t opcode : 7; uint32_t rd : 5; uint32_t funct3 : 3; uint32_t rs1 : 5; uint32_t rs2 : 5; uint32_t funct7 : 7; } rtype; struct { uint32_t opcode : 7; uint32_t rd : 5; uint32_t funct3 : 3; uint32_t rs1 : 5; uint32_t rs2 : 5; uint32_t shamt : 1; uint32_t imm : 6; } rwtype; struct { uint32_t opcode : 7; uint32_t rd : 5; uint32_t funct3 : 3; uint32_t rs1 : 5; uint32_t imm : 12; } itype; struct { uint32_t opcode : 7; uint32_t rd : 5; uint32_t imm : 20; } utype; struct { uint32_t opcode : 7; uint32_t rd : 5; uint32_t imm12 : 8; uint32_t imm11 : 1; uint32_t imm1 : 10; uint32_t imm20 : 1; } ujtype; struct { uint32_t opcode : 7; uint32_t imm5 : 5; uint32_t funct3 : 3; uint32_t rs1 : 5; uint32_t rs2 : 5; uint32_t imm7 : 7; } stype; struct { uint32_t opcode : 7; uint32_t imm_11 : 1; uint32_t imm_1_4 : 4; uint32_t funct3 : 3; uint32_t rs1 : 5; uint32_t rs2 : 5; uint32_t imm_5_10 : 6; uint32_t imm_12 : 1; } sbtype; int16_t chunks16[2]; uint32_t bits; }; static_assert(sizeof(Instruction) == sizeof(uint32_t), "");How would that look with our bit patterns? Let's do them one at a time:
alias Rtype = b{ opcode:7, rd:5, funct3:3, rs1:5, rs2:5, funct7:7, } alias Rwtype = b{ opcode:7, rd:5, funct3:3, rs1:5, rs2:5, shamt:1, imm:6, } alias Itype = b{ opcode:7, rd:5, funct3:3, rs1:5, imm:12, } alias Utype = b{ opcode:7, rd:5, imm:20, } alias Ujtype = b{ opcode:7, rd:5, imm12:8, imm11:1, imm1:10, imm20:1, } alias Stype = b{ opcode:7, imm5:5, funct3:3, rs1:5, rs2:5, imm7:7, } alias Sbtype = b{ opcode:7, imm_11:1, imm_1_4:4, funt3:3, rs1:5, rs2:5, imm_5_10:6, imm_12:1, }
Ok, now we come to an interesting portion; we want to be able to make a sum type out of structs with these layouts, with the opcode as the discriminant. Soooo what might that look like? That's actually a difficult question because the discriminant of each member of the sum type is not a single value but a range, which is something I've never considered before, but this is exploratory so let's explore. First let's consider what it might look like without trying to define the discriminant, just making sure it's part of the opcode field:
type Instruction = sum(size = 32, discr = .opcode) rtype: Rtype, rwtype: Rwtype, itype: Itype, utype: Utype, ujtype: Ujtype, stype: Stype, sbtype: Sbtype, end
The naming there is a little fraught, don't sweat it too much for now.
Right now I'm just putting layout meta-info in some kind of argument to the
sum
directive. In Rust it'd be in a#[...]
type dealie either attached to the type or to fields in the type, so it might be worth coming up with some general-purpose syntax that does something similar. For example we might want to specify a different discriminant field for each arm, as long as they occupy the same bits. For now though, we don't need to get that fancy.Here though, we have a choice. We will obviously have an
opcode
enum that looks like:type Opcode = enum -- Rtype instructions Op = 0b0110011, Op32 = ..., -- Itype instructions Load = ..., ... end
So the choice is, do we make it so that discriminants can be specific sets of enum values, or do we make ranges, or does each discriminant have to be a specific value?
RISC-V is also an interesting test case for this because it splits the full opcode across the
opcode
andfunct3
fields, and not all instruction layouts have afunct3
field. And some of the ones that do have afunct7
field, so your discriminant may end up scattered all the damn over the place. Is it worth it to try to represent that directly? Not yet, I think. Erlang would do it via shitloads of pattern matching, so let's take a cue from that for now and not try to make special-purpose tools when you have general-purpose ones you can make work.Anyway, back to just considering opcodes... Ranges don't help in this case 'cause opcodes aren't necessarily contiguous, so we either have multiple valid discriminants per sum type arm, or we have multiple arms with the same layout and do something like this:
type Instruction = sum(size = 32 discr = .opcode : Opcode) Op: Rtype, Op32: Rtype Load: Itype, ... end
so it's just a normal sum type, with the discriminant being the `Opcode enum, and we just have multiple sum type arms with the same layout. That... actually seems kinda better in retrospect?
Ok well, now that we've gone through some exercises, we have some goals for bit patterns. Bitstrings are here to make bitwise operation stuff:
- Easier
- Less error-prone
- Leverage the power of pattern matching
Antigoals:
- Handle every special case possible in as strong a type system as possible
- Be as concise and short as possible -- syntax can always be revised later
- Fit into a perfect idealized theoretical structure
Really, Erlang doesn't have any special type stuff involved, it just uses pattern matching and that's how it handles types. We want to have compile time types with a compiler that tells us more than "this pattern match failed", so we need some extra machinery around them.
Also, as used above, bit patterns are kinda treated as a type. But really they're a pattern, so, hmmmm. Need to spend some time thinking about how that fits into our "everything is a value and a type literal" style system. Bit patterns also look and act quite a bit like a struct, so making them a special case of structs like C does isn't entirely unreasonable.
Good candidate for the next gedankexperiment: UTF-8 character encoding.
Another good use case example here: https://wiki.alopex.li/TheStateOfGarnet2022#the-future
Oh, there's a very convenient paper about this now: Bit Stealing Made Legal https://perso.ens-lyon.fr/thais.baudon/icfp2023.pdf
Hmmm, so what would Rust look like if it had a magical variable-sized
utf8
type? Thenstr/&str/String
would become[utf8]/&[utf8]/Vec<utf8>
, essentially. Silly names aside, it feels like it moves the magic to where it needs to be, instead of&str
being a special-case of&[]
that kinda honestly works exactly the same.Maybe you could then have
utf8
be a single char and[utf8]
be a packed array of UTF-8 chars? THAT shows where the actual magic is, because it implies you can say "a single valueT
has a different representation than an array ofT
's", which is a little bit of a big heckin' deal for some types. If you generalize that idea hard enough it could lead to things like niche-encoding of enums or SoA-AoS transformations. Hmmmmm....
That gets us context-specific encodings, which is really powerful but also not exactly something that gets explored much. A BIG assumption that gets made in a lot of languages is that
*&T
has the same representation/bit-pattern asT
, which is how we build arrays and structs. If we allow context-specific encodings, then we end up in cases where*&some_struct.t != some_t != array_of_t[x]
, which means that those differentT
's actually have different types. (different kinds???) In UTF-8, this works because the bitpattern describes the size of the type, so the language can know that comparing[\u20, 0x00, 0x00, 0x00]
and[\u20, 0x01, 0x02, 0x03]
is equal... TheEq
trait alone is capable of doing that. What is harder is dealing with variable-length types that get packed down to bytes, you can't ask things like "how many more characters can I fit in this array" because the answer is always "it depends".Rust's (currently limited) niche optimizations are another example of this but kinda the other way around. When you have
Option<i32>
andOption<NonZeroI32>
the representation of thei32
really doesn't change (though its allowed bit-patterns do; one is a strict subset of the other), it's the representation of theOption
that changes. Hmmmm....