and consider adding a strings::bytelen() and strings::runelen()
see https://lists.sr.ht/~sircmpwn/hare-users/%3C36a3388c-58d5-4204-930a-86f82bf17596%40pysoft.co.za%3E
I'm quite strongly opposed to that.
An argument can be made for byte-length being more fundamental than rune-length, see for example https://hsivonen.fi/string-length/ (quite long, but really worth a read, rune-length is called Unicode scalar length there). And Hare is the kind of language where byte-length matters even more because we care a lot about size of objects in memory.
Having len() on strings is one of the two things that set the str type apart from a random struct defined in strings:: as type str = struct { _contents: []u8 };. If we remove len() on strings, the only distinguishing thing left is string equality and that on its own is... just a really bad justification for having a dedicated builtin type.
So if I had to rank solutions to this problem by preference:
- Remove str type entirely
- Keep as is
- Do what this issue suggests
thanks for linking that blog post - i'd read it a while ago, but i didn't succeed at refinding it
string equality is imo quite a good justification for a dedicated type, especially since it allows switching on strings. also, having to do a use types::{str}; at the top of every single hare file which uses strings would get old pretty quickly
ultimately the issue with having a builtin len(str) is that the user should always think carefully before getting any length of a string, and allowing len(str) makes it seem like they can avoid doing that. it also sends the message that byte length is a more useful measure of strings, and while that's true from the language's perspective, it's rarely true from the user's perspective
(also, users should usually be iterating over strings rather than getting their length, which we can direct them to do in the documentation for strings::bytelen and strings::runelen, but we don't currently have a similar place to warn people about len(str))
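For reference, a minimal sketch of what the two proposed helpers could look like. This is an assumption-laden illustration, not stdlib code: bytelen and runelen are the names floated in this thread and are defined locally here; only strings::toutf8, strings::iter and strings::next are taken from the existing strings module.

```hare
use fmt;
use strings;

// Hypothetical helper (name from this thread, not in the stdlib):
// byte length of the UTF-8 encoding, O(1).
fn bytelen(s: str) size = len(strings::toutf8(s));

// Hypothetical helper (name from this thread, not in the stdlib):
// number of runes (Unicode scalar values), O(n) because it must decode.
fn runelen(s: str) size = {
	let it = strings::iter(s);
	let n = 0z;
	for (true) {
		match (strings::next(&it)) {
		case rune =>
			n += 1;
		case =>
			// End of string.
			break;
		};
	};
	return n;
};

export fn main() void = {
	// "h", U+00E9 (two bytes in UTF-8), "l", "l", "o"
	const s = "h\u00e9llo";
	// The two measurements differ: 6 bytes, 5 runes.
	fmt::printfln("bytes: {}, runes: {}", bytelen(s), runelen(s))!;
};
```

Neither number is a grapheme count, which is the point being argued: the caller has to pick the unit explicitly.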
The other justification for a dedicated builtin type is that it's the result type of string literals. The alternatives are either having the result type be a struct which is assignable to an alias defined in the stdlib (which I think is a bad idea), or making the result be []u8, which would make a stdlib str struct more difficult to use. I think it makes at least some sense to keep str separate from slices, even if for no other reason than it discourages directly indexing the string (which is almost always incorrect).
The blog post you linked I think kinda serves as more justification to drop len on strings though? The behavior is inconsistent between languages, and, while I agree that if we do support it that having it count bytes is the Correct choice, it's not immediately intuitive. Which makes sense, because "what is the length of this string" is impossible to answer without making some assumptions on what's actually being asked (as opposed to "what is the length of this slice", which is unambiguous).
also, having to do a use types::{str}; at the top of every single hare file which uses strings would get old pretty quickly
To be fair, most code which uses strings needs to use strings; anyway.
(Dropping len on strings definitely warrants an actual RFC though)
On Sat Mar 9, 2024 at 9:53 PM GMT, ~ecs wrote:
ultimately the issue with having a builtin len(str) is that the user should always think carefully before getting any length of a string, and allowing len(str) makes it seem like they can avoid doing that. it also
You'd have to not understand what encodings are to think there is a one-size-fits-all best way to measure strings. Treating the programmer as if they don't know what encodings or graphemes are is a weird level of coddling. It's reasonable to expect them to know this.
We can at a minimum mention these topics and document [[strings::runelen]] alongside len() in the intro tutorial.
sends the message that byte length is a more useful measure of strings,
When a Hare programmer is introduced to len(), they are taught that it "returns the length of a given slice/array". When they are introduced to Hare strings, they're given the understanding that they are []u8 under the hood. They put two and two together: You give len() a string, you get a byte-length.
(also, users should usually be iterating over strings rather than getting their length
Depends on the problem you're solving.
On Sun Mar 10, 2024 at 12:17 AM UTC, ~torresjrjr wrote:
On Sat Mar 9, 2024 at 9:53 PM GMT, ~ecs wrote:
ultimately the issue with having a builtin len(str) is that the user should always think carefully before getting any length of a string, and allowing len(str) makes it seem like they can avoid doing that. it also
You'd have to not understand what encodings are to think there is a one-size-fits-all best way to measure strings. Treating the programmer as if they don't know what encodings or graphemes are is a weird level of coddling. It's reasonable to expect them to know this.
that's exactly my point: there's no one-size-fits-all way to measure strings, and having a len() implies that there is. getting rid of footguns isn't "coddling", or at least, it's not out of line with hare's design philosophy, and imo len(str), no matter what unit it's in, is a footgun
sends the message that byte length is a more useful measure of strings,
When a Hare programmer is introduced to len(), they are taught that it "returns the length of a given slice/array". When they are introduced to Hare strings, they're given the understanding that they are []u8 under the hood. They put two and two together: You give len() a string, you get a byte-length.
sure, len() being the byte length is more consistent, and makes sense from a language design perspective. the issue is that it's also a footgun, because it makes subtly-wrong semantics look like the simplest ones
a broader theme in hare's design philosophy which applies here is that the easy thing should, by default, be the right thing. propagating errors is usually the right thing, so we made it the easy thing. making sure you don't have out-of-bounds accesses is usually the right thing, so we made it the easy thing. there're always cases outside of that "usually", which is why we have things like unbounded arrays, but we try to make sure that they're not the first thing someone reaches for unless they're absolutely certain
len(str) breaks this, because it's the easy thing to do but byte length isn't what's usually right here, because there isn't any unambiguous "usually right". non-byte lengths definitely can't be builtins for time complexity reasons, byte length is almost always irrelevant in a unicode context, and so imo the thing that makes sense is to just avoid canonizing any one length measurement in len()
ultimately this is an extension of disallowing indexing on strings: byte indexing is Wrong, rune indexing is expensive and still not necessarily correct, grapheme indexing is expensive and extremely complicated and definitely not a good fit for us, so we just don't have any built-in indexing. the same arguments apply for length measurements, though bytewise length is less wrong than bytewise indexing, so imo it makes sense to have in the stdlib
(also, users should usually be iterating over strings rather than getting their length
Depends on the problem you're solving.
for (let i = 0z; i < len(some_str); i += 1) is pretty much never a thing you want to do, and if it is, it's perfectly reasonable to need to explicitly opt into bytewise length via strings::bytelen()
On Sun Mar 10, 2024 at 12:38 AM GMT, ~ecs wrote:
imo the thing that makes sense is to just avoid canonizing any one length measurement in len()
len() isn't special cased for strings. It just does what it always does. It's the string that bends to len()'s will. I don't see that as canonizing. But a footgun? I can see that.
ultimately this is an extension of disallowing indexing on strings: byte indexing is Wrong, rune indexing is expensive and still not necessarily correct, grapheme indexing is expensive and extremely complicated and definitely not a good fit for us, so we just don't have any built-in indexing. the same arguments apply for length measurements, though bytewise length is less wrong than bytewise indexing, so imo it makes sense to have in the stdlib
I do find this point convincing, however: strings are currently a sort of hybrid type which, on a builtin level, can act like a slice in some ways ( len(s) ) but not others ( s[i] ), and that is not good.
Having the user explicitly convert to []u8 is good. For the same reason, I think having the user call convenience functions which do this for them under the hood, like strings::bytelen(), is good.
+1 for dropping len(string).
This discussion should be taking place in an RFC; closing the issue accordingly.
fwiw I've thought about this some more and I'm -1 on removing len for strings (and +1 for keeping it bytewise).