~eliasnaur/gio#219: Intel GPU Hang on compute renderer

I tried updating my toy webcam program trideo to the latest gio because I've been running it successfully with the compute renderer for a while now, and I wanted to see the effect of recent changes.

It now hangs my GPU. This was on a Fedora 33 GNOME Desktop running Wayland, but trideo itself was running as an XWayland application.

This is the commit that broke it.

I was able to capture the GPU hang info in case it's useful. You can find that here.

Status: RESOLVED FIXED
Submitter: ~whereswaldon
Assigned to: No-one
Submitted: 5 months ago
Updated: 30 days ago
Labels: No labels applied.

~whereswaldon 5 months ago

I was able to replicate this on different Intel hardware. This GPU hang comes from an Arch Linux system running GNOME Wayland (trideo is still running under XWayland here).

~eliasnaur 5 months ago

The linked commit doesn't seem relevant. Did you mean to link to a Gio commit?

~whereswaldon 5 months ago

No, I was trying to show the version change in my go.mod that took it from working to not working. I haven't bisected this (I have to reboot each time this happens).

~eliasnaur 5 months ago

The commit doesn't change go.mod.

~whereswaldon 5 months ago

Whoops! https://git.sr.ht/~whereswaldon/trideo/commit/a885028bd64274f725e14ee1adc1d0fb6ba12097 is what I meant, sorry. I accidentally copied the parent commit.

~eliasnaur 2 months ago

Can you please try the potential fix described in https://todo.sr.ht/~eliasnaur/gio/214#event-90218? Thanks.

~whereswaldon 2 months ago

Okay, running trideo on the latest gio commit did something interesting with the compute renderer. The GPU did hang, but the Gio program kept right on running and rendering frames. It didn't get a GL error and crash or anything.

My GPU hang (from dmesg):

[Tue Jul 20 07:48:55 2021] i915 0000:00:02.0: [drm] Resetting rcs0 for preemption time out
[Tue Jul 20 07:48:55 2021] i915 0000:00:02.0: [drm] trideo[2705812] context reset due to GPU hang
[Tue Jul 20 07:48:55 2021] i915 0000:00:02.0: [drm] GPU HANG: ecode 9:1:85ddfffa, in trideo [2705812]
[Tue Jul 20 07:49:26 2021] i915 0000:00:02.0: [drm] Reducing the compressed framebuffer size. This may lead to less power savings than a non-reduced-size. Try to increase stolen memory size if available in BIOS.

I'm going to try the kitchen on this machine next, in case it's application-dependent, but I don't want to lose this partially-written comment in a reboot so I'm posting it now.
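
For anyone collecting hang reports like the one above across several runs, here is a minimal Go sketch that scans kernel log lines for the i915 "GPU HANG" message and pulls out the error code, process name, and PID. It is not part of Gio or trideo. The `Hang` type, `ParseHang` function, and the regular expression are illustrative names, and the pattern assumes the exact line format shown in the log above (it would miss process names containing spaces).

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"regexp"
	"strconv"
)

// hangRe matches i915 lines of the form seen in this thread:
//   ... [drm] GPU HANG: ecode 9:1:85ddfffa, in trideo [2705812]
// The format is assumed from that single sample.
var hangRe = regexp.MustCompile(`GPU HANG: ecode ([0-9a-fA-F:]+), in (\S+) \[(\d+)\]`)

// Hang describes one GPU hang report extracted from a kernel log line.
type Hang struct {
	Ecode   string
	Process string
	PID     int
}

// ParseHang returns the hang described by line, or false if the line
// is not a GPU hang report.
func ParseHang(line string) (Hang, bool) {
	m := hangRe.FindStringSubmatch(line)
	if m == nil {
		return Hang{}, false
	}
	pid, err := strconv.Atoi(m[3])
	if err != nil {
		return Hang{}, false
	}
	return Hang{Ecode: m[1], Process: m[2], PID: pid}, true
}

func main() {
	// Reads kernel log lines from standard input.
	sc := bufio.NewScanner(os.Stdin)
	for sc.Scan() {
		if h, ok := ParseHang(sc.Text()); ok {
			fmt.Printf("hang: process=%s pid=%d ecode=%s\n", h.Process, h.PID, h.Ecode)
		}
	}
}
```

Piping `dmesg` output into the program prints one line per hang report and ignores the surrounding reset and framebuffer messages.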

~whereswaldon 2 months ago

Okay, the kitchen is running fine on that latest commit. Something trideo is doing is causing this hang. Trideo is very simple. It draws a bunch of triangles every frame on top of a solid black background. It basically invokes this function hundreds of times each frame. Can you try it on a Linux box with a webcam and see if you can replicate the hang? You may need to adjust the hardcoded webcam device path to the proper one for your local system.
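
To make the workload described above concrete, the following Go sketch generates a frame's worth of random triangles over a black image using only the standard library. It is not trideo's code: the `Triangle` type, `FrameTriangles`, `blackFrame`, and the choice to color each triangle from the pixel under its centroid are all assumptions made for illustration, and the step that would hand the triangles to Gio for drawing is deliberately left out.

```go
package main

import (
	"fmt"
	"image"
	"image/color"
	"math/rand"
)

// Triangle is a hypothetical per-frame primitive: three corners and a fill color.
type Triangle struct {
	A, B, C image.Point
	Fill    color.Color
}

// blackFrame returns an opaque black image of the given size.
func blackFrame(w, h int) *image.RGBA {
	img := image.NewRGBA(image.Rect(0, 0, w, h))
	for i := 3; i < len(img.Pix); i += 4 {
		img.Pix[i] = 0xff // opaque alpha; RGB stay zero (black)
	}
	return img
}

// randomPoint returns a point inside the non-empty rectangle r.
func randomPoint(rng *rand.Rand, r image.Rectangle) image.Point {
	return image.Pt(r.Min.X+rng.Intn(r.Dx()), r.Min.Y+rng.Intn(r.Dy()))
}

// FrameTriangles builds n random triangles over src, coloring each one
// with the source pixel under its centroid. The same seed always yields
// the same triangles, which helps when comparing runs.
func FrameTriangles(seed int64, src image.Image, n int) []Triangle {
	bounds := src.Bounds()
	if bounds.Empty() || n <= 0 {
		return nil
	}
	rng := rand.New(rand.NewSource(seed))
	tris := make([]Triangle, 0, n)
	for i := 0; i < n; i++ {
		a := randomPoint(rng, bounds)
		b := randomPoint(rng, bounds)
		c := randomPoint(rng, bounds)
		cx, cy := (a.X+b.X+c.X)/3, (a.Y+b.Y+c.Y)/3
		tris = append(tris, Triangle{A: a, B: b, C: c, Fill: src.At(cx, cy)})
	}
	return tris
}

func main() {
	frame := blackFrame(640, 480)
	tris := FrameTriangles(1, frame, 2000)
	fmt.Println("triangles this frame:", len(tris))
}
```

A fixed seed keeps the geometry identical between runs, so any difference in behavior comes from the renderer rather than from the input.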

~whereswaldon 2 months ago

Also, I tried the example program for the chat library ~jackmordaunt and I are building, and it seems to exhaust the material atlas?

git clone https://git.sr.ht/~gioverse/chat
cd chat
go get gioui.org@latest
GIORENDERER=forcecompute go run ./example/kitchen

I get a few frames, then the program crashes with:

error: premature window close: compute: no space left in material atlas
exit status 1

Now this could totally be an application error on our part. Perhaps the images we're loading are being duplicated in the texture cache or something without our knowledge. This does run on the old renderer though, so it seems on the surface like a difference in behavior.
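
One way to check the "duplicated images" theory from the application side is to fingerprint the decoded images and count repeats. The Go sketch below does that with the standard library only. It does not inspect Gio's texture cache or material atlas; `fingerprint`, `CountDuplicates`, and `solid` are hypothetical helpers, and it assumes that two images count as duplicates when their dimensions and RGBA pixel data match exactly.

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"image"
	"image/color"
	"image/draw"
)

// fingerprint hashes the size and pixel content of img after converting
// it to RGBA, so separately decoded copies of one picture compare equal.
func fingerprint(img image.Image) [sha256.Size]byte {
	b := img.Bounds()
	rgba := image.NewRGBA(image.Rect(0, 0, b.Dx(), b.Dy()))
	draw.Draw(rgba, rgba.Bounds(), img, b.Min, draw.Src)
	data := append([]byte(fmt.Sprintf("%dx%d:", b.Dx(), b.Dy())), rgba.Pix...)
	return sha256.Sum256(data)
}

// CountDuplicates returns how many of the given images repeat the
// content of an earlier one.
func CountDuplicates(imgs ...image.Image) int {
	seen := make(map[[sha256.Size]byte]bool)
	dups := 0
	for _, img := range imgs {
		fp := fingerprint(img)
		if seen[fp] {
			dups++
		} else {
			seen[fp] = true
		}
	}
	return dups
}

// solid returns a w-by-h opaque image filled with one color (test data).
func solid(w, h int, r, g, b uint8) image.Image {
	img := image.NewRGBA(image.Rect(0, 0, w, h))
	fill := image.NewUniform(color.RGBA{R: r, G: g, B: b, A: 0xff})
	draw.Draw(img, img.Bounds(), fill, image.Point{}, draw.Src)
	return img
}

func main() {
	red1 := solid(4, 4, 255, 0, 0)
	red2 := solid(4, 4, 255, 0, 0) // same content, separate allocation
	blue := solid(4, 4, 0, 0, 255)
	fmt.Println("duplicate images:", CountDuplicates(red1, red2, blue))
}
```

A non-zero count would suggest the application is handing the renderer the same picture more than once; a zero count points back at the renderer's limits rather than at the application.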

~whereswaldon 2 months ago

On a bright note, I am able to run sprig on the compute renderer, so it's definitely working with a non-trivial application. The main branch isn't on the right gio commit, but the material-list branch is.

~eliasnaur 2 months ago

Thanks. Would it be possible to reduce trideo so that it doesn't require a webcam? I don't have an external webcam, unfortunately.

~whereswaldon 2 months ago

I have reduced trideo to just operate on a jpeg image provided as argv[1]. You can find that version on this branch.

It does reproduce the GPU hang for me in this form.

~eliasnaur a month ago

Thank you for the simple reproducer. Unfortunately it doesn't hang for me. I'm running Sway on Fedora 34. Is there a chance you can reproduce the hang on F34?

~whereswaldon a month ago

I can try to do that tomorrow. I got it in GNOME Wayland on Arch. Can you try GNOME on F34 while I work to get an F34 system? I no longer have one since leaving my last job.

~eliasnaur a month ago

No luck on GNOME.

~whereswaldon a month ago

Well, uh, this is awkward. I can no longer reproduce either. After updating to the latest code, it works really well. It doesn't hang at all. I'm going to keep experimenting with it for a bit, but we might be able to close this. Trideo's performance is really excellent now! I can run it with absurd numbers of triangles (2000 per frame) and it keeps up. Gio used to be the bottleneck, but now it's the CPU-bound image processing instead. Well done!

~eliasnaur a month ago

Turns out I'm the idiot: by adding "GIORENDERER=forcecompute" (and spelling it correctly) I can easily reproduce the hang. Sorry for putting you on a wild goose chase.

FWIW, there has been no change to the compute shaders from 4f40b58e0d14..8cec7e04eb71 so if the issue is gone at your end it probably means caching is somehow hiding the issue.
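
Since a misspelled variable cost a round trip here, a reproducer could check the environment before reporting results. The Go sketch below is one way to do that. `CheckRenderer` is a hypothetical helper, not a Gio API; "forcecompute" is the only value taken from this thread, and treating every other value as a likely mistake is an assumption.

```go
package main

import (
	"fmt"
	"os"
)

const (
	rendererEnv  = "GIORENDERER"  // variable name as used in this thread
	forceCompute = "forcecompute" // the only value this sketch accepts
)

// CheckRenderer returns nil when value requests the compute renderer
// exactly as spelled in this thread, and a descriptive error otherwise.
func CheckRenderer(value string) error {
	switch value {
	case forceCompute:
		return nil
	case "":
		return fmt.Errorf("%s is unset; the compute renderer is not being forced", rendererEnv)
	default:
		return fmt.Errorf("%s=%q is not %q; check the spelling", rendererEnv, value, forceCompute)
	}
}

func main() {
	if err := CheckRenderer(os.Getenv(rendererEnv)); err != nil {
		fmt.Fprintln(os.Stderr, "warning:", err)
		return
	}
	fmt.Println("compute renderer requested via", rendererEnv)
}
```

Calling such a check at the top of a reproducer makes a "works for me" result easier to trust, because the renderer actually under test is stated in the output.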

~eliasnaur REPORTED FIXED a month ago

Hm, I spoke too soon. The https://gioui.org/commit/b87cbc04f37453a064201a8590b0a23a169cf3f5 change fixes the hang for me, which indicates that the command stream to the compute programs before the change was sometimes incorrect. This is quite likely, so I'm going to close this issue for now and concentrate on #214 (NVIDIA), #221 (font glitches) and issues on my own Pixel 1 phone.

~eliasnaur a month ago

Chris, I've pushed a set of commits that lifts the restrictions your chat kitchen example ran into. The example now runs without issues on my machine.

~whereswaldon 30 days ago

Yeah, I'm no longer able to hang the chat example or trideo with the compute renderer. Thanks!
