~nicoco/slidge-whatsapp#4: 
deadlock

Status
RESOLVED CLOSED
Submitter
~nicoco
Assigned to
Submitted
1 year, 10 months ago
Updated
6 months ago
Labels
bug

~nicoco 1 year, 10 months ago*

This was fixed by the wait_for() thing, right?

~deuill 1 year, 10 months ago

It was for me, though it's unclear whether we're all describing the same issue. Because the underlying issue was timing-related, we might've all seen slightly different effects. Let's keep this open for a week and, if nobody reports it again, close it.
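For readers unfamiliar with the "wait_for() thing": a minimal sketch of bounding a possibly-hanging call with `asyncio.wait_for`, so that a stuck call times out instead of blocking forever. This is not the actual fix from the repository; `call_with_timeout`, `slow_bridge_call` and `fast_bridge_call` are hypothetical names used only for illustration.

```python
import asyncio


async def call_with_timeout(coro, timeout: float, default=None):
    """Await `coro`; return `default` if it takes longer than `timeout` seconds."""
    try:
        return await asyncio.wait_for(coro, timeout=timeout)
    except asyncio.TimeoutError:
        # wait_for() cancels the inner coroutine before raising.
        return default


async def slow_bridge_call() -> str:
    # Hypothetical stand-in for a bridge call that does not answer in time.
    await asyncio.sleep(10)
    return "done"


async def fast_bridge_call() -> str:
    # Hypothetical stand-in for a bridge call that answers immediately.
    await asyncio.sleep(0)
    return "done"
```

Something like `asyncio.run(call_with_timeout(slow_bridge_call(), 0.05, "timed out"))` would give up after 50 ms rather than wait the full 10 seconds.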

~nicoco 1 year, 10 months ago

Alright. I'm confident. This might also have fixed some Telegram issues btw.

~nicoco REPORTED FIXED 1 year, 9 months ago

Fixed! Smallish issue left: the gateway status stays at "qr scan needed", but that's another issue.

~nicoco FIXED REPORTED 1 year, 7 months ago

I think this was fixed, but I recently nuked my slidge-whatsapp dir, and then, for both my users, pairing got stuck at "Updating contact store with 491 push names from history sync" with no apparent activity, blocking everything for both users. Restarting the container fixed it.

~deuill 1 year, 7 months ago

I think the issue here might be related to large history synchronizations, which seem to accumulate if not handled/ACKed somehow.

I've been planning on adding history synchronisation for (at least) group chats, but that depends on having a persistent store for existing messages (since we need to pass the "oldest message ID" in history sync requests); alas, I have not had time to take a look at doing so yet.

Resolving this conclusively might require some time, but it might be useful to gather some concrete data on where the process has been hanging until then -- is it worth adding better tracing here, you think?
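As a rough idea of what "better tracing" could look like on the Python side: a minimal sketch of a context manager that logs on entry and reports how long a block took. If the process hangs inside the block, the last "enter" line with no matching "leave" shows where it got stuck. The names `traced`, `warn_after` and `records` are hypothetical, not part of slidge.

```python
import logging
import time
from contextlib import contextmanager
from typing import Optional

log = logging.getLogger("trace_sketch")


@contextmanager
def traced(label: str, warn_after: float = 1.0, records: Optional[list] = None):
    """Log entry and exit of a block, warning if it ran for `warn_after` seconds or more."""
    start = time.monotonic()
    log.debug("enter %s", label)
    try:
        yield
    finally:
        elapsed = time.monotonic() - start
        if records is not None:
            # Optionally keep (label, seconds) pairs for later inspection.
            records.append((label, elapsed))
        if elapsed >= warn_after:
            log.warning("%s took %.3fs", label, elapsed)
        else:
            log.debug("leave %s after %.3fs", label, elapsed)
```

Wrapping a suspect call as `with traced("history sync"): ...` would be enough to see it in the logs. For a process that is already stuck, the standard library's `faulthandler.dump_traceback_later()` can dump all Python thread stacks after a delay, though it would not show what the Go side is doing.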

~nicoco closed duplicate ticket #11 1 year, 5 months ago

~nicoco 1 year, 4 months ago

It's quite bad for me since today's update. I restarted a few times but it's always stuck, with messages not coming in or going out. Logs show the usual "Node handling took 1m.... for <presence" and "Failed getting avatar…". I'll keep on restarting a few times, but to get it back to a working state I might try to nuke my slidge-whatsapp home dir and relink.

Interestingly, the "Node handling took " messages seem to only appear when I start a client, probably when it tries to join the group chats.

~nicoco 1 year, 4 months ago

It just got unstuck! Nothing new under the sun after all.

~nicoco REPORTED CLOSED 1 year, 4 days ago*

Not sure if it's the same issue, but this morning I needed a few restarts, maybe because I had been using the "call go in subthreads" branch for a few days (which may bork "counters"?).

[ERROR] 2024-04-22T05:14:10Z SessionCipher.go:262 ▶ Unable to get or create message keys: received message with old counter (index: 20, count: 15)

[WARNING] 2024-04-22T05:14:10Z SessionCipher.go:214 ▶ failed to get or create message keys: received message with old counter (index: 20, count: 15)

[ERROR] 2024-04-22T05:14:10Z SessionCipher.go:268 ▶ Unable to verify ciphertext mac: mismatching MAC in signal message

WARNING:legacy_module.gateway:Error decrypting message from XXX@s.whatsapp.net in XXX-XXX@g.us: failed to decrypt normal message: no valid sessions

A few restarts later it finally works. Besides the usual "node handling took…" messages, I have a few new

Apr 22 06:52:32 vps-9bd6f395 whatsapp[1970709]: WARNING:legacy_module.gateway:Missing response in item #1 of response to 3EB002B527872E86E79A66

to report.

We'll get to the bottom of this…

~nicoco CLOSED REPORTED 1 year, 4 days ago

~nicoco 1 year, 2 days ago

nicoco referenced this ticket in commit 0dd83ef.

~nicoco 1 year, 2 days ago

nicoco referenced this ticket in commit 8ce975d.

~nicoco 1 year, 2 days ago

nicoco referenced this ticket in commit 4d9c94e.

~nicoco 1 year, 2 days ago

nicoco referenced this ticket in commit d63cdaf.

~nicoco 7 months ago

nicoco referenced this ticket in commit 57da953.

~nicoco 7 months ago

nicoco referenced this ticket in commit d743c1f.

~nicoco 7 months ago

Maybe I'm going to jinx it, but I think I haven't had a single deadlock since these last two commits, despite a few restarts. If it does not deadlock on the next re-pair I have to do, I think I'll close it.

~deuill 7 months ago

Dang, maybe we don't need to refactor after all! Still concerned about some indications of multi-thread performance issues, though we don't really have good data (or large deployments) to verify against.

~nicoco 7 months ago

Yeah, having all calls block the main Python thread is not great (especially stuff like fetching avatars and attachments, obviously), but I can retry my "run_in_executor" stuff if it turns out that it really does not deadlock anymore. I'd say let's wait a bit. Let's not sell the bear's skin before we've killed it.
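For context, a minimal sketch of the general `run_in_executor` pattern: a blocking call is handed to a worker thread so the asyncio event loop keeps running in the meantime. This is not the code from the branch mentioned in this ticket; `blocking_fetch`, `fetch_without_blocking` and `demo` are hypothetical, and whether the Go bindings can safely be called from worker threads is exactly the open question here.

```python
import asyncio
import time


def blocking_fetch(name: str) -> str:
    # Hypothetical stand-in for a blocking call such as fetching an avatar.
    time.sleep(0.2)
    return f"avatar-for-{name}"


async def fetch_without_blocking(name: str) -> str:
    loop = asyncio.get_running_loop()
    # Passing None selects the loop's default ThreadPoolExecutor.
    return await loop.run_in_executor(None, blocking_fetch, name)


async def demo():
    ticks = 0

    async def ticker():
        # Counts event loop iterations while the blocking call runs elsewhere.
        nonlocal ticks
        while True:
            await asyncio.sleep(0.01)
            ticks += 1

    task = asyncio.create_task(ticker())
    result = await fetch_without_blocking("alice")
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        pass
    return result, ticks
```

If `blocking_fetch` were called directly from a coroutine instead, `ticks` would stay at 0 for the duration of the call, which is the "blocking the main thread" effect described above.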

~deuill 6 months ago*

Recent changes to event handling (93c2a4f and d3cdcc9) might have closed this for good, and also seem to have improved how responsive/fast the bridge is in general.

~deuill REPORTED CLOSED 6 months ago

I haven't had any deadlocks for a while, and am pretty happy the changes above improved reliability and responsiveness -- I'm closing this, but we can re-open if we get any more reports.
