~icefox/garnet#63: 
Design discussion: Bit patterns and layouts

So far most of my thoughts are lifted mostly directly from Erlang: https://www.erlang.org/doc/programming_examples/bit_syntax.html and https://www.erlang.org/doc/reference_manual/expressions#bit_syntax

First off the important question: what delimiters do I use for these? Erlang uses << ... >>, but I kinda want to use those for bit shifts, or at least have the option for it right now. I guess I'll just use b{ ... } for now.

Status
REPORTED
Submitter
~icefox
Assigned to
No-one
Submitted
1 year, 4 months ago
Updated
6 months ago
Labels
T-DESIGN

~icefox 1 year, 4 months ago

Erlang has only one real operation on them: pattern matching. So they can be a pattern full of values or unbound variables and you can match a bunch of bits onto it, or you can construct a bunch of values into a pile of bits and assign it to an unbound variable. Or, you know, any combination of the above, because Erlang.

I also want to be able to mesh them with a struct type and have it arrange the struct per that pattern, which I guess is just syntactic sugar for those operations. So I guess you could do something like:

type Thing = struct
  bar: I8,
  foo: I32,
end layout b{
  foo:32/little,
  bar:16/little,
}

And it would tell you the type is an array of 6 bytes, the first 4 making a I32 and the 5th as a U8 and the 6th unused. I suppose if it were bar:16/big then the i8 would be in the 6th byte and the 5th unused. So then you could cast a [6]U8 to a Thing and it would be valid, and proooooooobably even Safe?

Wouldn't it be easier/more obvious to do something like this?

type Thing = struct
  foo: I32/little,
  bar: I8/little,
  _: I8/little
end

Was considering something like that but I want the layout and struct to not necessarily be tied together. But I want to be able to apply a layout to a tuple or sum type in the same way. So it would be equally valid to do:

type Thing = {I8, I32} layout b{
  _1:32/little,
  _0:16/little,
}

The layout syntax is kinda terrible, but idk what I'd do otherwise. layout(Thing, b{ ... })? You can't really assign a layout to a variable or type... can you? Well... in Erlang it's a type that you pattern match on. Soooooo... why wouldn't it be a type?

I want to be able to apply a layout to a sum type similarly, I guess by putting a layout on each arm? Can we do niche optimization in that way as well? Like, somehow describe Rust's Option<NonNull<T>> using a 0 value as None? Mmmmmmaybe. Not sure we should try to go too crazy with it though, at least not yet.

~icefox 1 year, 4 months ago*

Here's another good example of the sort of thing I want to be able to say, the RISC-V instruction set from here: https://secret.club/2023/12/24/riscy-business.html

union Instruction
{
    struct
    {
        uint32_t compressed_flags : 2;
        uint32_t opcode           : 5;
        uint32_t                  : 25;
    };

    struct
    {
        uint32_t opcode : 7;
        uint32_t rd     : 5;
        uint32_t funct3 : 3;
        uint32_t rs1    : 5;
        uint32_t rs2    : 5;
        uint32_t funct7 : 7;
    } rtype;

    struct
    {
        uint32_t opcode : 7;
        uint32_t rd     : 5;
        uint32_t funct3 : 3;
        uint32_t rs1    : 5;
        uint32_t rs2    : 5;
        uint32_t shamt  : 1;
        uint32_t imm    : 6;
    } rwtype;

    struct
    {
        uint32_t opcode : 7;
        uint32_t rd     : 5;
        uint32_t funct3 : 3;
        uint32_t rs1    : 5;
        uint32_t imm    : 12;
    } itype;

    struct
    {
        uint32_t opcode : 7;
        uint32_t rd     : 5;
        uint32_t imm    : 20;
    } utype;

    struct
    {
        uint32_t opcode : 7;
        uint32_t rd     : 5;
        uint32_t imm12  : 8;
        uint32_t imm11  : 1;
        uint32_t imm1   : 10;
        uint32_t imm20  : 1;
    } ujtype;

    struct
    {
        uint32_t opcode : 7;
        uint32_t imm5   : 5;
        uint32_t funct3 : 3;
        uint32_t rs1    : 5;
        uint32_t rs2    : 5;
        uint32_t imm7   : 7;
    } stype;

    struct
    {
        uint32_t opcode   : 7;
        uint32_t imm_11   : 1;
        uint32_t imm_1_4  : 4;
        uint32_t funct3   : 3;
        uint32_t rs1      : 5;
        uint32_t rs2      : 5;
        uint32_t imm_5_10 : 6;
        uint32_t imm_12   : 1;
    } sbtype;

    int16_t  chunks16[2];
    uint32_t bits;
};
static_assert(sizeof(Instruction) == sizeof(uint32_t), "");

How would that look with our bit patterns? Let's do them one at a time:

alias Rtype = b{
	opcode:7,
	rd:5,
	funct3:3,
	rs1:5,
	rs2:5,
	funct7:7,
}

alias Rwtype = b{ 
	opcode:7,
	rd:5,
	funct3:3,
	rs1:5,
	rs2:5,
	shamt:1,
	imm:6,
}

alias Itype = b{ 
	opcode:7,
	rd:5,
	funct3:3,
	rs1:5,
	imm:12,
}

alias Utype = b{ 
	opcode:7,
	rd:5,
	imm:20,
}

alias Ujtype = b{ 
	opcode:7,
	rd:5,
	imm12:8,
	imm11:1,
	imm1:10,
	imm20:1,
}

alias Stype = b{ 
	opcode:7,
	imm5:5,
	funct3:3,
	rs1:5,
	rs2:5,
	imm7:7,
}

alias Sbtype = b{
	opcode:7,
	imm_11:1,
	imm_1_4:4,
	funt3:3,
	rs1:5,
	rs2:5,
	imm_5_10:6,
	imm_12:1,
}

Ok, now we come to an interesting portion; we want to be able to make a sum type out of structs with these layouts, with the opcode as the discriminant. Soooo what might that look like? That's actually a difficult question because the discriminant of each member of the sum type is not a single value but a range, which is something I've never considered before, but this is exploratory so let's explore. First let's consider what it might look like without trying to define the discriminant, just making sure it's part of the opcode field:

type Instruction = sum(size = 32, discr = .opcode)
	rtype: Rtype,
	rwtype: Rwtype,
	itype: Itype,
	utype: Utype,
	ujtype: Ujtype,
	stype: Stype,
	sbtype: Sbtype,
end

The naming there is a little fraught, don't sweat it too much for now.

Right now I'm just putting layout meta-info in some kind of argument to the sum directive. In Rust it'd be in a #[...] type dealie either attached to the type or to fields in the type, so it might be worth coming up with some general-purpose syntax that does something similar. For example we might want to specify a different discriminant field for each arm, as long as they occupy the same bits. For now though, we don't need to get that fancy.

Here though, we have a choice. We will obviously have an opcode enum that looks like:

type Opcode = enum
	-- Rtype instructions
	Op = 0b0110011,
	Op32 = ...,
	
	
	-- Itype instructions
	Load = ...,

	...
end

So the choice is, do we make it so that discriminants can be specific sets of enum values, or do we make ranges, or does each discriminant have to be a specific value?

RISC-V is also an interesting test case for this because it splits the full opcode across the opcode and funct3 fields, and not all instruction layouts have a funct3 field. And some of the ones that do have a funct7 field, so your discriminant may end up scattered all the damn over the place. Is it worth it to try to represent that directly? Not yet, I think. Erlang would do it via shitloads of pattern matching, so let's take a cue from that for now and not try to make special-purpose tools when you have general-purpose ones you can make work.

Anyway, back to just considering opcodes... Ranges don't help in this case 'cause opcodes aren't necessarily contiguous, so we either have multiple valid discriminants per sum type arm, or we have multiple arms with the same layout and do something like this:

type Instruction = sum(size = 32 discr = .opcode : Opcode)
	Op: Rtype,
	Op32: Rtype
	Load: Itype,
	...
end

so it's just a normal sum type, with the discriminant being the `Opcode enum, and we just have multiple sum type arms with the same layout. That... actually seems kinda better in retrospect?

Ok well, now that we've gone through some exercises, we have some goals for bit patterns. Bitstrings are here to make bitwise operation stuff:

  • Easier
  • Less error-prone
  • Leverage the power of pattern matching

Antigoals:

  • Handle every special case possible in as strong a type system as possible
  • Be as concise and short as possible -- syntax can always be revised later
  • Fit into a perfect idealized theoretical structure

Really, Erlang doesn't have any special type stuff involved, it just uses pattern matching and that's how it handles types. We want to have compile time types with a compiler that tells us more than "this pattern match failed", so we need some extra machinery around them.

Also, as used above, bit patterns are kinda treated as a type. But really they're a pattern, so, hmmmm. Need to spend some time thinking about how that fits into our "everything is a value and a type literal" style system. Bit patterns also look and act quite a bit like a struct, so making them a special case of structs like C does isn't entirely unreasonable.

Good candidate for the next gedankexperiment: UTF-8 character encoding.

~icefox referenced this from #34 1 year, 4 months ago

~icefox 1 year, 4 months ago

Another good use case example here: https://wiki.alopex.li/TheStateOfGarnet2022#the-future

~icefox 11 months ago

Oh, there's a very convenient paper about this now: Bit Stealing Made Legal https://perso.ens-lyon.fr/thais.baudon/icfp2023.pdf

~icefox 6 months ago*

Hmmm, so what would Rust look like if it had a magical variable-sized utf8 type? Then str/&str/String would become [utf8]/&[utf8]/Vec<utf8>, essentially. Silly names aside, it feels like it moves the magic to where it needs to be, instead of &str being a special-case of &[] that kinda honestly works exactly the same.

Maybe you could then have utf8 be a single char and [utf8] be a packed array of UTF-8 chars? THAT shows where the actual magic is, because it implies you can say "a single value T has a different representation than an array of T's", which is a little bit of a big heckin' deal for some types. If you generalize that idea hard enough it could lead to things like niche-encoding of enums or SoA-AoS transformations. Hmmmmm....

~icefox 6 months ago

That gets us context-specific encodings, which is really powerful but also not exactly something that gets explored much. A BIG assumption that gets made in a lot of languages is that *&T has the same representation/bit-pattern as T, which is how we build arrays and structs. If we allow context-specific encodings, then we end up in cases where *&some_struct.t != some_t != array_of_t[x], which means that those different T's actually have different types. (different kinds???) In UTF-8, this works because the bitpattern describes the size of the type, so the language can know that comparing [\u20, 0x00, 0x00, 0x00] and [\u20, 0x01, 0x02, 0x03] is equal... The Eq trait alone is capable of doing that. What is harder is dealing with variable-length types that get packed down to bytes, you can't ask things like "how many more characters can I fit in this array" because the answer is always "it depends".

Rust's (currently limited) niche optimizations are another example of this but kinda the other way around. When you have Option<i32> and Option<NonZeroI32> the representation of the i32 really doesn't change (though its allowed bit-patterns do; one is a strict subset of the other), it's the representation of the Option that changes. Hmmmm....

Register here or Log in to comment, or comment via email.