From: Craig Leres
Subject: nvidia_drv.so/Xorg crashes
To: freebsd-hackers@freebsd.org
Date: Thu, 24 Jun 2021 19:30:52 -0700

I have four (12.2-RELEASE) systems between the office and home that are full or part time FreeBSD desktops. All have PNY NVIDIA Quadro 410s. These have been mostly working well for about 6 years.

For months I've been seeing screen corruption when using chrome or kicad; firefox and thunderbird are always ok. But just starting eeschema always damages the root window a little. And it's common when running chrome/kicad to see lines in the console xterm window jump up and down two lines.

But for the last week or two Xorg has been crashing:

    [ 74574.029] (EE) Backtrace:
    [ 74574.032] (EE) 0: /usr/local/bin/Xorg (?+0x0) [0x41c98a]
    [ 74574.033] (EE) unw_get_proc_name failed: no unwind info found [-10]
    [ 74574.033] (EE) 1: /lib/libthr.so.3 (?+0x0) [0x800929b7e]
    [ 74574.035] (EE) unw_get_proc_name failed: no unwind info found [-10]
    [ 74574.035] (EE) 2: /lib/libthr.so.3 (?+0x0) [0x80092913f]
    [ 74574.037] (EE) 3: ? (?+0x0) [0x7ffffffff003]
    [ 74574.038] (EE) 4: /usr/local/lib/xorg/modules/drivers/nvidia_drv.so (?+0x0) [0x801cc8c20]
    [ 74574.038] (EE)
    [ 74574.038] (EE) Segmentation fault at address 0x50
    [ 74574.038] (EE) Fatal server error:
    [ 74574.038] (EE) Caught signal 11 (Segmentation fault). Server aborting

The crashes are always preceded by at least one nvidia "Xid" kernel message:

    Jun 23 ... kernel: : NVRM: Xid (PCI:0000:05:00): 69, pid=6327, Class Error: ChId 0009, Class 0000902d, Offset 000008b4, Data fffffffb, ErrorCode 00000004
    Jun 23 ... kernel: : NVRM: Xid (PCI:0000:05:00): 69, pid=6327, Class Error: ChId 0009, Class 0000902d, Offset 000008b4, Data fffffffb, ErrorCode 00000004
    Jun 23 ... kernel: : NVRM: Xid (PCI:0000:05:00): 69, pid=6327, Class Error: ChId 0009, Class 0000902d, Offset 000008b4, Data ffffffb9, ErrorCode 00000004
    Jun 23 ... kernel: : pid 6327 (Xorg), jid 0, uid 0: exited on signal 6

Worth noting is that it was not unusual to see many Xid ErrorCode 4 kernel messages without crashes. (And it's the only ErrorCode I've ever seen.)

My first thought was a bad nvidia-driver version. But after working my way back, one version at a time, to 460.39 (circa February 2021 -- months before the first crashes) I gave up on that theory.

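For anyone who wants to tally Xid messages like the ones quoted above (for example, to compare the Data values seen before a crash with those seen on quiet days), here is a minimal Python sketch. It assumes the kernel log lines keep exactly the field layout shown in this message; the regular expression and the function name tally_xid are illustrative only and are not part of any NVIDIA tool. Other Xid types may well use a different layout.

```python
import re
from collections import Counter

# Pattern derived only from the three Xid lines quoted in this message.
# Assumption: other Xid types or driver versions may format fields differently.
XID_RE = re.compile(
    r"NVRM: Xid \((?P<pci>[^)]+)\): (?P<xid>\d+), pid=(?P<pid>\d+), "
    r"(?P<kind>[^:]+): ChId (?P<chid>[0-9a-fA-F]+), "
    r"Class (?P<cls>[0-9a-fA-F]+), Offset (?P<offset>[0-9a-fA-F]+), "
    r"Data (?P<data>[0-9a-fA-F]+), ErrorCode (?P<errorcode>[0-9a-fA-F]+)"
)


def tally_xid(lines):
    """Count (xid, errorcode, data) combinations in an iterable of log lines.

    Lines that do not match the pattern are silently skipped.
    """
    counts = Counter()
    for line in lines:
        m = XID_RE.search(line)
        if m:
            counts[(int(m["xid"]), m["errorcode"], m["data"])] += 1
    return counts


if __name__ == "__main__":
    import sys

    # Read log text from standard input and print one line per combination.
    for key, n in sorted(tally_xid(sys.stdin).items()):
        print(key, n)
```

It reads log text on standard input, so it could be fed whatever file syslog writes kernel messages to on the machine in question.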
My next guess was bad hardware, but I swapped Quadros between two systems and the crashes persisted.

Yesterday Xorg crashed often enough for me to zero in on the trigger: it's the use of tvtwm's f.forcemove action (which is like f.move but allows moving a window off the screen) when I move a window slightly off the bottom of the screen. Here's the .twmrc binding I use:

    Button2 = m s : window : f.forcemove

The crash doesn't happen 100% of the time but it's pretty easy to trigger with half a dozen windows open. Just grab a window and randomly dip part of it past the bottom of the screen.

So my new theory is that a frame buffer operation in one of the libraries on the path between Xorg and the nvidia driver has regressed and is asking the nvidia driver to do something that causes it to misbehave.

I run a custom version of tvtwm but was able to easily crash Xorg using x11-wm/twm on a spare Quadro 410 workstation; the key is f.forcemove.

Does anybody know what this issue is? Which recently changed port libraries are likely candidates that I could try downgrading? Should I try opening a ticket with nvidia? Should I try even older 460.XX drivers? What else can I try?

(Thanks for reading this far!)

Craig