~emersion/soju#52: 
Channel casemapping

We should support channel casemapping. That probably means canonicalizing channel names somehow.

Example where this causes issues:

  • Join #channel
  • Server makes the user join #Channel instead
  • Part (the client sends a PART #Channel)
  • In the DB, #channel is still stored
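
One way to avoid the stale entry above is to key stored channels by a case-folded name while keeping the server's spelling for display. The following is a minimal sketch under that assumption, not soju's actual code: `channelStore`, `channel` and `foldASCII` are hypothetical names, and plain ASCII folding stands in for whatever casemapping the network advertises.

```go
package main

import (
	"fmt"
	"strings"
)

// foldASCII lowercases A-Z only; non-ASCII runes are left untouched.
// (Placeholder: a real store would use the network's casemapping.)
func foldASCII(s string) string {
	return strings.Map(func(r rune) rune {
		if 'A' <= r && r <= 'Z' {
			return r + ('a' - 'A')
		}
		return r
	}, s)
}

// channel keeps the name as the server last spelled it, for display.
type channel struct {
	Name string
}

// channelStore indexes channels by their case-folded name.
type channelStore struct {
	fold     func(string) string
	channels map[string]*channel
}

func newChannelStore(fold func(string) string) *channelStore {
	return &channelStore{fold: fold, channels: make(map[string]*channel)}
}

// Join records the channel, overwriting any entry that folds to the same key.
func (s *channelStore) Join(name string) {
	s.channels[s.fold(name)] = &channel{Name: name}
}

// Part removes the channel regardless of the casing used by the client.
func (s *channelStore) Part(name string) {
	delete(s.channels, s.fold(name))
}

// Get looks a channel up by any casing; it returns nil if not joined.
func (s *channelStore) Get(name string) *channel {
	return s.channels[s.fold(name)]
}

func main() {
	store := newChannelStore(foldASCII)
	store.Join("#channel") // what the client asked for
	store.Join("#Channel") // what the server reported
	fmt.Println(store.Get("#channel").Name)
	store.Part("#Channel")
	fmt.Println(len(store.channels))
}
```

With this layout, the PART for #Channel removes the entry created for #channel, and the display name still follows the server's casing.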
Status: REPORTED
Submitter: ~emersion
Assigned to: No-one
Submitted: 6 months ago
Updated: 2 months ago
Labels: enhancement

~emersion 5 months ago

Also causes issues with history, where echo-messages are stored in #channel and the rest are stored in #Channel.

~emersion 2 months ago

Relevant discussion:

2020-08-24 14:17:23	emersion	i'm writing a modern irc client and i don't want to go out of my way to support case-folding
2020-08-24 14:17:39	emersion	would it be reasonable to support a single hard-coded casefolding algorithm?
2020-08-24 14:18:26	@jwheare	case folding for what?
2020-08-24 14:18:34	emersion	channel names and nicknames
2020-08-24 14:19:03	@jwheare	you don't need to. casemapping rfc or ascii doesn't involve unicode case folding
2020-08-24 14:19:55	emersion	well, it'd be simpler for me to implement just unicode case-folding, and nothing else
2020-08-24 14:20:11	@jwheare	ircds don't
2020-08-24 14:20:46	@jwheare	you would potentially be preventing your users from accessing certain channel
2020-08-24 14:22:10	emersion	hm, how so?
2020-08-24 14:22:28	emersion	i'm starting to wonder whether case-mapping is worthwhile to implement at all
2020-08-24 14:24:28	@jwheare	casemapping is necessary
2020-08-24 14:24:38	emersion	i'm not clear on what "implement case-mapping" means it seems. i was under the impression that clients need to not case-map channel names when sending them to the server, but need to perform case-mapping when receiving state from servers
2020-08-24 14:24:40	hhirtz	how so? because if you consider "#Σ-sigma" and "#σ-sigma", the users won't be able to join one when they've joined the other.
2020-08-24 14:24:48	hhirtz	to be the same*
2020-08-24 14:24:57	emersion	ah, right
2020-08-24 14:25:13	emersion	why is case-mapping necessary?
2020-08-24 14:25:29	<--	eta (~eta@trainsplorer/developer/eta) has quit (Quit: we're here, we're queer, connection reset by peer)
2020-08-24 14:25:31	jess	because servers do it so clients also have to
2020-08-24 14:26:14	emersion	without case-mapping, all a client needs is that servers only use one form of the case-mapped channel name
2020-08-24 14:26:42	emersion	well, kind of. my client doesn't need to know when a JOIN command succeeds
2020-08-24 14:27:25	<--	clokep (~Thunderbi@unaffiliated/clokep) has quit (Read error: Connection reset by peer)
2020-08-24 14:27:39	-->	clokep (~Thunderbi@unaffiliated/clokep) has joined #ircv3
2020-08-24 14:27:51	jess	so if you join #asd} and im also in there, and i send a message to #asd], some ircds will propagate the message but not convert the casing to what you're expecting
2020-08-24 14:28:24	jess	so you'll miss the message
2020-08-24 14:29:09	<--	clokep (~Thunderbi@unaffiliated/clokep) has quit (Read error: Connection reset by peer)
2020-08-24 14:29:40	-->	clokep (~Thunderbi@unaffiliated/clokep) has joined #ircv3
2020-08-24 14:30:03	emersion	right
2020-08-24 14:30:57	jess	not knowing that two nicks are equal or two channel names are equal when they are as far as the ircd is concerned can lead to odd desynchronization
2020-08-24 14:32:12	jess	or if you're doing a wider fold than the ircd is, you might think two nicks are the same when they are not
2020-08-24 14:32:47	jess	which would only be an issue on an ircd that permits nicknames outside of ascii or rfc1459 folding
2020-08-24 14:33:17	 *	hhirtz thought rfc1459 mapping was not used anymore, then he looked at freenode's 005
2020-08-24 14:33:37	emersion	all right, i guess we'll need to implement the full thing then
2020-08-24 14:33:51	@jwheare	it's basic
2020-08-24 14:33:54	emersion	well
2020-08-24 14:33:56	@jwheare	far easier than unicode case folding
2020-08-24 14:33:57	xPaw	https://github.com/matrix-org/matrix-appservice-irc/blob/dcf572772a91e9b2e6f09cf36dee56b03defd276/src/irc/formatting.ts#L332 is this enough?
2020-08-24 14:34:00	emersion	it's basic but is intrusive
2020-08-24 14:34:19	jess	imagine you message jãke with super secret info because you thought jÃke and jãke are the same, but you meant to message jÃake
2020-08-24 14:34:20	jess	uhh
2020-08-24 14:34:23	jess	jÃke
2020-08-24 14:34:24	@jwheare	yeah you need to litter some normalise_irc_case  around the place
2020-08-24 14:34:27	emersion	when you're using a modern lang, unicode case-folding is one strings.ToLower away
2020-08-24 14:34:53	@jwheare	well just swap that strings.ToLower with your custom function
2020-08-24 14:35:15	emersion	yeah, make it per-network, and store the case-mapped channel name in the DB too, etc
2020-08-24 14:35:20	jess	I've gold server.casefold() where server is an instance of the server that knows what the 005 is
2020-08-24 14:35:28	jess	uh
2020-08-24 14:35:32	jess	s/gold/got/
2020-08-24 14:36:16	emersion	which is why a "strict"/"no-op" case-mapping would make everything simpler
2020-08-24 14:36:25	hhirtz	xPaw: .toLowerCase() performs UTF-8 casefolding
2020-08-24 14:36:31	@jwheare	storing in the db normalised you have to do anyway. but yeah you need to know the casemapping per server and pass that info around
2020-08-24 14:36:46	jess	you could mangle any line to preemptively fold params you think should be folded
2020-08-24 14:36:50	xPaw	hhirtz, and thats a problem?
2020-08-24 14:37:05	jess	and then by the time the line has left the context of a server instance, it's already folded
2020-08-24 14:37:14	emersion	ah, and you don't want to case-fold *everything*
2020-08-24 14:37:26	hhirtz	yep, as said previously: you'll consider different nicks to be the same
2020-08-24 14:37:41	emersion	if a message comes from #MySuperChannel, you don't want to display #mysuperchannel
2020-08-24 14:37:50	jess	xPaw: À isn't à in rfc1459/ascii
2020-08-24 14:38:21	xPaw	what server allows these chars, and doesn't casefold them?
2020-08-24 14:38:25	emersion	so having a big "filter" that case-maps every message received doesn't work
2020-08-24 14:38:32	jess	emersion: sure, my session parsing creates a Channel object when i join somewhere, and it has a .name with the correct folding, but everything else deals with the folded version
2020-08-24 14:38:41	jess	xPaw: unrealircd can do it
2020-08-24 14:38:47	emersion	yup
2020-08-24 14:38:48	jess	oragono too
2020-08-24 14:38:57	xPaw	so it allows unicode, but still only lowercases ascii?
2020-08-24 14:39:00	jess	uhh correct casing*
2020-08-24 14:39:03	jess	xPaw: yes
2020-08-24 14:39:06	xPaw	irc pls...
2020-08-24 14:39:23	jess	clients only know about rfc1459 and ascii casemaps so anything else would case problems
2020-08-24 14:39:25	@jwheare	there is no unicode casemapping. doing anything else would break clients
2020-08-24 14:39:38	jess	would cause* fuck sake
2020-08-24 14:39:50	xPaw	anyone have a JS implementation?
2020-08-24 14:40:25	jess	I've got a clever solution in python that might give you ideas
2020-08-24 14:40:53	xPaw	and what do you default it to, if no casemapping?
2020-08-24 14:41:09	jess	default to rfc1459
2020-08-24 14:41:30	jess	here's the fold algo https://github.com/jesopo/ircstates/blob/ca9abfc34b78b12e34bacd5515b5cae64d690dd5/ircstates/casemap.py#L9
2020-08-24 14:41:43	emersion	interesting
2020-08-24 14:41:50	xPaw	no strict-rfc?
2020-08-24 14:42:01	jess	nah
2020-08-24 14:42:15	jess	no one supports it anyway. someone tried to switch a server to it and caused problems
2020-08-24 14:42:26	jess	think it was inspircd's Atilla
2020-08-24 14:42:58	jess	bitbot supports it but I've never seen it
2020-08-24 14:43:51	-->	thomasross (~thomasros@ip-66-159-117-153.dsl.csolve.net) has joined #ircv3
2020-08-24 14:46:20	@jwheare	think i've seen it
2020-08-24 14:46:30	@jwheare	probably on ngircd
2020-08-24 14:46:46	jess	eww
2020-08-24 14:47:15	jess	ngircd seems to use ascii these days
2020-08-24 14:47:27	jess	source: irc.w3.org
2020-08-24 14:47:46	@jwheare	yeah i can't find any now actually
2020-08-24 14:48:52	@jwheare	https://stats.ircdocs.horse/isupport/#token-casemapping
2020-08-24 14:48:54	BitBot	[Title] IRC RPL_ISUPPORT Statistics
2020-08-24 14:48:57	emersion	do we need to handle the case where an irc server changes case-mapping?
2020-08-24 14:49:20	jess	charybdis has been talking about this recently
2020-08-24 14:49:36	jess	how would you switch freenode from rfc1459 to ascii without stopping the whole network
2020-08-24 14:50:35	hhirtz	CASEMAPPING=none should be the only casemapping
2020-08-24 14:50:50	jess	do you mean changing casemapping while still connected
2020-08-24 14:50:55	jess	or changing between connections
2020-08-24 14:51:03	xPaw	so realistically, which strings do you apply casemapping to? nicks and chan names?
2020-08-24 14:51:04	emersion	changing between connections
2020-08-24 14:51:09	hhirtz	oh yeah, it can change while still connected
2020-08-24 14:51:10	jess	oh, yes
2020-08-24 14:51:15	emersion	fun fun fun
2020-08-24 14:51:16	jess	you need to handle that
2020-08-24 14:51:36	jess	it shouldn't need a lot of handling i think?
2020-08-24 14:51:40	emersion	well
2020-08-24 14:51:46	jess	hhirtz: yes, nicks and channel names
2020-08-24 14:51:47	emersion	when you have a DB with channel names in it
2020-08-24 14:52:16	emersion	maybe i can get away with not storing the case-mapped channel name in the DB, i'll have to see
2020-08-24 14:52:56	jess	god i hate irc
2020-08-24 14:53:36	-->	eta (~eta@trainsplorer/developer/eta) has joined #ircv3
2020-08-24 14:56:29	jess	servers changing casemapping is typically rare but inspircd will apparently soon be switching
2020-08-24 14:57:30	xPaw	where did ^~ casefold come from? its not in rfc1459
2020-08-24 14:57:40	xPaw	despite the casemapping suggesting it is...
2020-08-24 14:57:50	jess	2812
2020-08-24 14:58:03	jess	which is why strict exists
2020-08-24 14:58:15	e	it's always been there
2020-08-24 14:58:26	xPaw	its not in 1459, e
2020-08-24 14:58:44	e	1459 is an inaccurate description of reality at the time
2020-08-24 14:59:37	jess	rfc2812 is also inaccurate because it has ~ and ^ the wrong way around but that doesn't cause technical issues
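
The casemappings mentioned in the discussion can be sketched in Go roughly as follows. This is an illustrative sketch, not soju's implementation: the function names (`foldASCII`, `foldRFC1459Strict`, `foldRFC1459`, `parseCaseMapping`) are hypothetical. Defaulting to rfc1459 when no CASEMAPPING token is advertised follows jess's suggestion above; treating unknown values the same way is an assumption. The folds work on bytes, so anything outside ASCII is left alone.

```go
package main

import "fmt"

// foldASCII lowercases A-Z only. Bytes of multi-byte UTF-8 sequences are
// all >= 0x80, so non-ASCII characters pass through unchanged.
func foldASCII(s string) string {
	b := []byte(s)
	for i, c := range b {
		if 'A' <= c && c <= 'Z' {
			b[i] = c + ('a' - 'A')
		}
	}
	return string(b)
}

// foldRFC1459Strict is ASCII folding plus [ ] \ mapped to { } |.
func foldRFC1459Strict(s string) string {
	b := []byte(foldASCII(s))
	for i, c := range b {
		switch c {
		case '[':
			b[i] = '{'
		case ']':
			b[i] = '}'
		case '\\':
			b[i] = '|'
		}
	}
	return string(b)
}

// foldRFC1459 additionally treats ~ and ^ as the same character.
func foldRFC1459(s string) string {
	b := []byte(foldRFC1459Strict(s))
	for i, c := range b {
		if c == '~' {
			b[i] = '^'
		}
	}
	return string(b)
}

// parseCaseMapping picks a fold function from the value of the CASEMAPPING
// ISUPPORT token. Empty or unknown values fall back to rfc1459 (assumption).
func parseCaseMapping(value string) func(string) string {
	switch value {
	case "ascii":
		return foldASCII
	case "rfc1459-strict":
		return foldRFC1459Strict
	default:
		return foldRFC1459
	}
}

func main() {
	fold := parseCaseMapping("rfc1459")
	fmt.Println(fold("#Asd]") == fold("#asd}"))         // true
	fmt.Println(foldASCII("jÃke") == foldASCII("jãke")) // false
}
```

Because the fold function is chosen per connection, a bouncer would presumably need to re-derive stored keys if a network changes its CASEMAPPING between connections, which is the concern raised near the end of the log.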