Moved from https://todo.sr.ht/~nicoco/slidge/118
This was fixed by the wait_for() thing, right?
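(For context, this presumably refers to guarding potentially-hanging bridge calls with asyncio.wait_for() so a stuck call times out instead of wedging the gateway. A minimal sketch, with illustrative names only -- fetch_avatar and the 60-second timeout are not slidge's actual API:

```python
import asyncio

async def fetch_avatar_safely(contact):
    # Wrap the potentially-hanging call so a stuck request times out
    # instead of blocking the whole gateway.
    try:
        return await asyncio.wait_for(contact.fetch_avatar(), timeout=60)
    except asyncio.TimeoutError:
        # Give up on this call rather than deadlocking everything else.
        return None
```
)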
It was for me, though it's unclear whether we're all describing the same issue. Because the underlying problem was timing-related, we might each have seen slightly different effects. Let's keep this open for a week; if nobody reports it again, we can close it.
Alright. I'm confident. This might also have fixed some telegram issues btw.
Fixed! Smallish issue left: the gateway status stays at "qr scan needed", but that's another issue.
I thought this was fixed, but I recently nuked my slidge-whatsapp dir, and then pairing got stuck for both my users at
Updating contact store with 491 push names from history sync
with no apparent activity, blocking everything for both of them. Restarting the container got things moving again.
I think the issue here might be related to large history synchronizations, which seem to accumulate if not handled/ACKed somehow.
I've been planning on adding history synchronisation for (at least) group chats, but that depends on having a persistent store for existing messages (since we need to pass the "oldest message ID" in history sync requests); alas, I have not had time to take a look at doing so yet.
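For the record, the persistent store I have in mind is roughly something like the sketch below: remember the oldest known message ID per chat so that a later history sync request can ask for everything before it. This is a sketch only -- the table, class, and method names are made up for illustration, and the real thing may well end up on the Go side rather than in Python.

```python
import sqlite3

class OldestMessageStore:
    """Illustrative store for the oldest known message ID per chat."""

    def __init__(self, path: str):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS oldest_message "
            "(chat_jid TEXT PRIMARY KEY, message_id TEXT, timestamp INTEGER)"
        )

    def record(self, chat_jid: str, message_id: str, timestamp: int):
        # Only overwrite the row if this message is older than what we have.
        self.db.execute(
            "INSERT INTO oldest_message VALUES (?, ?, ?) "
            "ON CONFLICT(chat_jid) DO UPDATE SET "
            "message_id = excluded.message_id, timestamp = excluded.timestamp "
            "WHERE excluded.timestamp < oldest_message.timestamp",
            (chat_jid, message_id, timestamp),
        )
        self.db.commit()

    def oldest(self, chat_jid: str):
        # Returns the message ID to pass as the "oldest message" in a
        # history sync request, or None if we have nothing for this chat.
        row = self.db.execute(
            "SELECT message_id FROM oldest_message WHERE chat_jid = ?",
            (chat_jid,),
        ).fetchone()
        return row[0] if row else None
```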
Resolving this conclusively might take some time; until then, it would be useful to gather concrete data on where the process is hanging -- do you think it's worth adding better tracing here?
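If we do add tracing, something as simple as the helper below might already narrow down where pairing or history sync stalls. This is an illustrative Python-side sketch, not existing slidge code; the step names and the 5-second threshold are placeholders.

```python
import logging
import time
from contextlib import contextmanager

log = logging.getLogger("slidge_whatsapp.trace")

@contextmanager
def traced(step: str, warn_after: float = 5.0):
    # Time a named step and warn when it takes suspiciously long, so logs
    # show which step was running when things appear stuck.
    start = time.monotonic()
    try:
        yield
    finally:
        elapsed = time.monotonic() - start
        if elapsed > warn_after:
            log.warning("%s took %.1fs", step, elapsed)
        else:
            log.debug("%s took %.1fs", step, elapsed)
```

Usage would be along the lines of `with traced("update contact store from history sync"): ...` around the suspect steps.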
It's been quite bad for me since today's update. I've restarted a few times but it's always stuck, with messages neither coming in nor going out. Logs show the usual "Node handling took 1m.... for <presence" and "Failed getting avatar…". I'll keep restarting a few more times, but to get it back to a working state I might have to nuke my slidge-whatsapp home dir and relink.
Interestingly, the "Node handling took " messages seem to only appear when I start a client, probably when it tries to join the group chats.
It just got unstuck! Nothing new under the sun after all.
Not sure if it's the same issue, but this morning I needed a few restarts, maybe because I had been using the "call go in subthreads" branch for a few days (which may bork "counters"?).
[ERROR] 2024-04-22T05:14:10Z SessionCipher.go:262 ▶ Unable to get or create message keys: received message with old counter (index: 20, count: 15)
[WARNING] 2024-04-22T05:14:10Z SessionCipher.go:214 ▶ failed to get or create message keys: received message with old counter (index: 20, count: 15)
[ERROR] 2024-04-22T05:14:10Z SessionCipher.go:268 ▶ Unable to verify ciphertext mac: mismatching MAC in signal message
WARNING:legacy_module.gateway:Error decrypting message from XXX@s.whatsapp.net in XXX-XXX@g.us: failed to decrypt normal message: no valid sessions
A few restarts later, it finally works. Besides the usual "node handling took…" messages, I have a few new
Apr 22 06:52:32 vps-9bd6f395 whatsapp[1970709]: WARNING:legacy_module.gateway:Missing response in item #1 of response to 3EB002B527872E86E79A66
to report.
We'll get to the bottom of this…
nicoco referenced this ticket in commit 0dd83ef.
nicoco referenced this ticket in commit 8ce975d.
nicoco referenced this ticket in commit 4d9c94e.
nicoco referenced this ticket in commit d63cdaf.
nicoco referenced this ticket in commit 57da953.
nicoco referenced this ticket in commit d743c1f.
Maybe I'm going to jinx it, but I think I haven't had a single deadlock since these last two commits, despite a few restarts. If it does not deadlock on the next re-pair I have to do, I think I'll close it.
Dang, maybe we don't need to refactor after all! Still concerned about some indications of multi-thread performance issues, though we don't really have good data (or large deployments) to verify against.
Yeah, having every call block the main Python thread is not great (especially things like fetching avatars and attachments, obviously), but I can retry my "run_in_executor" approach if it turns out that it really does not deadlock anymore. I'd say let's wait a bit. Let's not sell the bear's skin before we've killed it.
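For reference, the "run_in_executor" idea boils down to something like the following sketch: push blocking work (e.g. downloading an avatar or attachment) onto a thread pool so the main asyncio loop stays responsive. The function and pool size here are placeholders, not slidge's actual calls.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor

_executor = ThreadPoolExecutor(max_workers=4)

async def fetch_in_thread(blocking_fetch, *args):
    # Run the blocking fetch on a worker thread; the event loop keeps
    # handling other sessions while we await the result.
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(_executor, blocking_fetch, *args)
```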
I haven't had any deadlocks for a while, and am pretty happy the changes above improved reliability and responsiveness -- I'm closing this, but we can re-open if we get any more reports.