Possible race condition when setting indirect space registers #372

@Iceburgino

As I understand it, both the /dev/mem and ryzen_smu backends work by addressing a memory-mapped indirect space register.

Each operation is performed in two steps: setting the target address, then reading from or writing to that address. I suspect this is susceptible to a race condition when the same indirect register is used in two places concurrently. Example:

void smn_reg_write_mem(const os_access_obj_t *obj, const uint32_t addr, const uint32_t data) {

Reader/Writer A -> Sets value for a target address
Reader/Writer B -> Overwrites the address with its own
Reader/Writer A -> Reads/writes from/to the data register
Reader/Writer B -> Reads/writes from/to the data register

Result: reader/writer A ends up reading from, or writing to, the register that reader/writer B selected.
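The interleaving above can be replayed deterministically with a small software model. This is only a sketch: `struct indirect_regs`, `set_index`, `read_data` and `racy_read_seen_by_a` are hypothetical names invented for illustration, and the model stands in for the real memory-mapped registers rather than reproducing ryzenadj's code.

```c
#include <assert.h>
#include <stdint.h>

#define SPACE_SIZE 4u

/* Hypothetical model of an index/data register pair (not ryzenadj code). */
struct indirect_regs {
    uint32_t index;             /* models the shared address register */
    uint32_t space[SPACE_SIZE]; /* models the registers behind the window */
};

/* Step 1 of an access: select the target register. */
static void set_index(struct indirect_regs *r, uint32_t addr) {
    r->index = addr;
}

/* Step 2 of an access: read whatever the index currently selects. */
static uint32_t read_data(const struct indirect_regs *r) {
    assert(r->index < SPACE_SIZE);
    return r->space[r->index];
}

/* Replays: A sets index, B sets index, A reads, B reads.
 * Returns the value A observes. */
uint32_t racy_read_seen_by_a(void) {
    struct indirect_regs r = {
        .index = 0,
        .space = {0xAAAAu, 0xBBBBu, 0u, 0u},
    };

    set_index(&r, 0);                 /* A selects register 0 */
    set_index(&r, 1);                 /* B overwrites the index */
    uint32_t seen_by_a = read_data(&r); /* A reads through B's index */
    (void)read_data(&r);              /* B reads its own register, as intended */

    return seen_by_a;
}
```

Calling `racy_read_seen_by_a()` returns `0xBBBB`, the content of the register B selected, even though A asked for register 0. Nothing in the model is concurrent; the point is only that the two-step access is not atomic unless every party shares one lock.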

And this, if it is a real issue, probably applies not only to /dev/mem but to ryzen_smu too; there, the race window lies between reading the rsp register and deciding that you are free to write. Locks within ryzen_smu wouldn't help, since they only serialize operations within the driver itself: if other drivers access the same registers, the accesses can still collide.
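The check-then-act window on the rsp register can be sketched the same way. Everything here is an assumption made for illustration: the `struct mailbox` layout, the convention that a non-zero `rsp` means "idle", and the command values are hypothetical and not taken from ryzen_smu or amdgpu.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mailbox model; fields are illustrative, not the driver's. */
struct mailbox {
    uint32_t rsp;  /* assumed convention: non-zero means previous command done */
    uint32_t cmd;  /* last command id written */
    uint32_t lost; /* commands overwritten while still pending */
};

/* The "check": is the mailbox free? */
static int mailbox_is_idle(const struct mailbox *m) {
    return m->rsp != 0;
}

/* The "act": clear rsp and post a command. */
static void mailbox_submit(struct mailbox *m, uint32_t cmd) {
    assert(m != NULL);
    if (m->rsp == 0) {
        m->lost++; /* a command was still pending and gets clobbered */
    }
    m->rsp = 0;
    m->cmd = cmd;
}

/* Replays two clients that both check before either one acts.
 * Returns how many pending commands were overwritten. */
uint32_t replay_check_then_act(void) {
    struct mailbox m = { .rsp = 1, .cmd = 0, .lost = 0 };

    int a_idle = mailbox_is_idle(&m); /* client A sees an idle mailbox */
    int b_idle = mailbox_is_idle(&m); /* client B sees it idle as well */

    if (a_idle) mailbox_submit(&m, 1); /* A posts its command */
    if (b_idle) mailbox_submit(&m, 2); /* B posts on top of A's pending one */

    return m.lost;
}
```

In this replay `replay_check_then_act()` returns 1: B acted on a stale "idle" observation and overwrote A's pending command. A lock held by only one of the two clients would not change the outcome, which is the concern raised above.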

https://github.com/amkillam/ryzen_smu/blob/172c316f53ac8f066afd7cb9e1da517084273368/smu.c#L190C9-L190C25

I think I have a potential reproduction of this issue. The clash seems to happen between amdgpu and /dev/mem-backed ryzenadj when I disconnect the laptop from AC power. Here is a log snippet of it happening:

Sep 07 21:16:37 nixos systemd[1]: Started AMD CPU Power Management.
Sep 07 21:16:37 nixos ryzenadj-apply[107468]: no compatible ryzen_smu kernel module found, fallback to /dev/mem
Sep 07 21:16:38 nixos kernel: amdxdna 0000:65:00.1: [drm] *ERROR* aie2_smu_exec: smu cmd 7 timed out
Sep 07 21:16:38 nixos kernel: amdxdna 0000:65:00.1: [drm] *ERROR* npu4_set_dpm: Set soft dpm level 0 failed, ret -110
Sep 07 21:16:39 nixos kernel: amdxdna 0000:65:00.1: [drm] *ERROR* aie2_smu_exec: smu cmd 4 timed out
Sep 07 21:16:39 nixos kernel: amdxdna 0000:65:00.1: [drm] *ERROR* aie2_smu_fini: Power off failed, ret -110
Sep 07 21:16:47 nixos kernel: amdgpu 0000:64:00.0: amdgpu: Dumping IP State
Sep 07 21:17:01 nixos kernel: amdgpu 0000:64:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x0000001A SMN_C2PMSG_82:0x00000000
Sep 07 21:17:01 nixos kernel: amdgpu 0000:64:00.0: amdgpu: Failed to disable gfxoff!
Sep 07 21:17:07 nixos kernel: ACPI Error: Aborting method \_SB.A018 due to previous error (AE_AML_LOOP_TIMEOUT) (20250404/psparse-529)
Sep 07 21:17:07 nixos org_kde_powerdevil[2927]: kf.notifications: Playing audio notification failed: IO error
Sep 07 21:17:12 nixos kernel: amdgpu 0000:64:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x0000001A SMN_C2PMSG_82:0x00000000
Sep 07 21:17:12 nixos kernel: amdgpu 0000:64:00.0: amdgpu: Failed to disable gfxoff!
Sep 07 21:17:21 nixos kernel: amdgpu 0000:64:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x0000001A SMN_C2PMSG_82:0x00000000
Sep 07 21:17:21 nixos kernel: amdgpu 0000:64:00.0: amdgpu: Failed to disable gfxoff!
Sep 07 21:17:30 nixos kernel: amdgpu 0000:64:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x0000001A SMN_C2PMSG_82:0x00000000
Sep 07 21:17:30 nixos kernel: amdgpu 0000:64:00.0: amdgpu: Failed to disable gfxoff!
Sep 07 21:17:30 nixos kernel: amdgpu 0000:64:00.0: amdgpu: Dumping IP State Completed
Sep 07 21:17:30 nixos kernel: amdgpu 0000:64:00.0: amdgpu: [drm] AMDGPU device coredump file has been created
Sep 07 21:17:30 nixos kernel: amdgpu 0000:64:00.0: amdgpu: [drm] Check your /sys/class/drm/card1/device/devcoredump/data
Sep 07 21:17:30 nixos kernel: ACPI Error: Aborting method \_SB.ALIB due to previous error (AE_AML_LOOP_TIMEOUT) (20250404/psparse-529)
Sep 07 21:17:30 nixos systemd-logind[2014]: Power key pressed short.
Sep 07 21:17:30 nixos dbus-daemon[2572]: [session uid=1000 pid=2572] Activating service name='org.kde.LogoutPrompt' requested by ':1.31' (uid=1000 pid=2927 comm="/nix/store/09ds7q6mg0h4rvgwjjdmga8nja3yrih4-powerd")
Sep 07 21:17:30 nixos kernel: amdgpu 0000:64:00.0: amdgpu: ring gfx_0.0.0 timeout, signaled seq=261123, emitted seq=261125
Sep 07 21:17:30 nixos kernel: amdgpu 0000:64:00.0: amdgpu: Process information: process .plasmashell-wr pid 2847 thread plasmashel:cs0 pid 2895
Sep 07 21:17:30 nixos kernel: amdgpu 0000:64:00.0: amdgpu: Starting gfx_0.0.0 ring reset
Sep 07 21:17:33 nixos kernel: amdgpu 0000:64:00.0: amdgpu: MES failed to respond to msg=RESET
Sep 07 21:17:33 nixos kernel: [drm:amdgpu_mes_reset_legacy_queue [amdgpu]] *ERROR* failed to reset legacy queue
Sep 07 21:17:33 nixos kernel: amdgpu 0000:64:00.0: amdgpu: reset via MES failed and try pipe reset -110
Sep 07 21:17:33 nixos kernel: amdgpu 0000:64:00.0: amdgpu: The CPFW hasn't support pipe reset yet.
Sep 07 21:17:33 nixos kernel: amdgpu 0000:64:00.0: amdgpu: Ring gfx_0.0.0 reset failed
Sep 07 21:17:33 nixos kernel: amdgpu 0000:64:00.0: amdgpu: GPU reset begin!
Sep 07 21:17:37 nixos (udev-worker)[107335]: BAT0: Spawned process '/nix/store/dq5fdsfzs2p4ixvy204a8m8j5fkgf6zv-tlp-1.8.0/sbin/tlp auto' [107527] is taking longer than 59s to complete.
Sep 07 21:17:37 nixos systemd-udevd[97862]: BAT0: Worker [107335] processing SEQNUM=6294 is taking a long time
Sep 07 21:17:57 nixos kernel: watchdog: BUG: soft lockup - CPU#0 stuck for 26s! [kworker/u64:0:106018]
Sep 07 21:17:57 nixos kernel: Modules linked in: ccm qrtr xt_set ip_set xt_addrtype xfrm_user xfrm_algo overlay rfcomm snd_seq_dummy snd_hrtimer snd_seq snd_seq_device xt_CHECKSUM xt_MASQUERADE ipt_REJECT nf_reject_ipv4 nft_chain_nat af_packet cmac algif_hash algif_skcipher af_alg bnep xt_conntrack ip6t_rpfilter ipt_rpfilter xt_pkttype xt_LOG nf_log_syslog nft_compat nf_tables sch_fq_codel xt_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 veth tun nvidia_uvm(O) nvidia_drm(O) uvcvideo nvidia_modeset(O) videobuf2_vmalloc uvc videobuf2_memops nls_iso8859_1 videobuf2_v4l2 nls_cp437 videobuf2_common btusb vfat fat btrtl videodev btintel btbcm btmtk mc bluetooth onboard_usb_dev nvidia(O) amdgpu snd_acp_legacy_mach snd_acp_mach snd_soc_nau8821 snd_acp3x_rn mt7925e mt7925_common snd_acp70 snd_acp_i2s mt792x_lib snd_acp_pdm snd_acp_pcm snd_soc_dmic snd_sof_amd_acp70 mt76_connac_lib snd_sof_amd_acp63 snd_sof_amd_vangogh snd_sof_amd_rembrandt mt76 snd_sof_amd_renoir snd_sof_amd_acp snd_sof_pci snd_sof_xtensa_dsp mac80211 snd_sof joydev
Sep 07 21:17:57 nixos kernel:  snd_ctl_led mousedev snd_sof_utils snd_pci_ps snd_hda_codec_realtek snd_soc_acpi_amd_match snd_amd_sdw_acpi soundwire_amd snd_hda_codec_generic soundwire_generic_allocation snd_hda_scodec_component soundwire_bus snd_hda_codec_hdmi snd_hda_intel snd_soc_sdca snd_intel_dspcfg spd5118 hid_multitouch lenovo_wmi_hotkey_utilities intel_rapl_msr wmi_bmof snd_intel_sdw_acpi snd_soc_core cfg80211 snd_hda_codec snd_compress ac97_bus snd_pcm_dmaengine snd_rpl_pci_acp6x snd_acp_pci amdxcp drm_panel_backlight_quirks snd_amd_acpi_mach drm_buddy snd_acp_legacy_common drm_exec drm_suballoc_helper r8169 snd_pci_acp6x drm_display_helper snd_pci_acp5x snd_hda_core snd_rn_pci_acp3x realtek mdio_devres of_mdio fixed_phy sp5100_tco fwnode_mdio watchdog snd_acp_config snd_hwdep edac_mce_amd cec libphy ucsi_acpi amdxdna snd_pcm edac_core typec_ucsi ideapad_laptop snd_soc_acpi i2c_piix4 i2c_algo_bit drm_ttm_helper sparse_keymap ttm rfkill amd_atl intel_rapl_common polyval_clmulni snd_timer video ghash_clmulni_intel rapl roles
Sep 07 21:17:57 nixos kernel:  amd_pmf tpm_crb gpu_sched k10temp i2c_smbus crc16 snd snd_pci_acp3x libarc4 mdio_bus battery soundcore rtc_cmos thermal amdtee i2c_hid_acpi typec evdev wmi i2c_hid tiny_power_button amd_sfh tpm_tis platform_profile mac_hid tpm_tis_core button msr serio_raw ac psmouse loop kvm_amd ccp kvm irqbypass br_netfilter bridge fuse stp llc configfs efi_pstore nfnetlink efivarfs dmi_sysfs ip_tables autofs4 crc32c_cryptoapi dm_crypt encrypted_keys trusted asn1_encoder tee tpm rng_core libaescfb ecdh_generic ecc hid_generic usbhid hid input_leds led_class atkbd nvme libps2 xhci_pci vivaldi_fmap thunderbolt nvme_core xhci_hcd sha512_ssse3 i8042 sha1_ssse3 aesni_intel nvme_keyring serio nvme_auth dm_mod dax btrfs blake2b_generic xor raid6_pq
Sep 07 21:17:57 nixos kernel: CPU: 0 UID: 0 PID: 106018 Comm: kworker/u64:0 Tainted: G     U     O        6.16.5 #1-NixOS PREEMPT(voluntary) 
Sep 07 21:17:57 nixos kernel: Tainted: [U]=USER, [O]=OOT_MODULE
Sep 07 21:17:57 nixos kernel: Hardware name: LENOVO 83F1/LNVNB161216, BIOS RYCN23WW 06/27/2025
Sep 07 21:17:57 nixos kernel: Workqueue: dm_vblank_control_workqueue amdgpu_dm_crtc_vblank_control_worker [amdgpu]
Sep 07 21:17:57 nixos kernel: RIP: 0010:amdgpu_device_rreg.part.0+0x38/0xe0 [amdgpu]
Sep 07 21:17:57 nixos kernel: Code: 00 55 89 f5 53 48 89 fb 4c 3b a7 08 09 00 00 73 1b 83 e2 02 75 09 f6 87 a8 4b 05 00 10 75 77 4c 03 a3 10 09 00 00 45 8b 24 24 <eb> 12 4c 89 e6 48 8b 87 50 09 00 00 ff d0 0f 1f 00 41 89 c4 66 90
Sep 07 21:17:57 nixos kernel: RSP: 0018:ffffcc7e2099fc48 EFLAGS: 00000286
Sep 07 21:17:57 nixos kernel: RAX: ffffffffc2298a50 RBX: ffff8a02c3580000 RCX: 0000000000000000
Sep 07 21:17:57 nixos kernel: RDX: 0000000000000000 RSI: 0000000000003697 RDI: ffff8a02c3580000
Sep 07 21:17:57 nixos kernel: RBP: 0000000000003697 R08: 0000000000002000 R09: 0000000000000980
Sep 07 21:17:57 nixos kernel: R10: ffffcc7e5f91d100 R11: fefefefefefefeff R12: 0000000000000940
Sep 07 21:17:57 nixos kernel: R13: 0000000000000001 R14: ffffcc7e2099fdc7 R15: 0000000000000000
Sep 07 21:17:57 nixos kernel: FS:  0000000000000000(0000) GS:ffff8a0a0bb4f000(0000) knlGS:0000000000000000
Sep 07 21:17:57 nixos kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Sep 07 21:17:57 nixos kernel: CR2: 00007ff372607b09 CR3: 000000012be82000 CR4: 0000000000f50ef0
Sep 07 21:17:57 nixos kernel: PKRU: 55555554
Sep 07 21:17:57 nixos kernel: Call Trace:
Sep 07 21:17:57 nixos kernel:  <TASK>
Sep 07 21:17:57 nixos kernel:  dm_read_reg_func+0x57/0xe0 [amdgpu]
Sep 07 21:17:57 nixos kernel:  dmub_srv_update_inbox_status.part.0+0x12/0xd0 [amdgpu]
Sep 07 21:17:57 nixos kernel:  dmub_srv_wait_for_idle+0x2c/0xa0 [amdgpu]
Sep 07 21:17:57 nixos kernel:  dc_dmub_srv_wait_for_idle+0x50/0x150 [amdgpu]
Sep 07 21:17:57 nixos kernel:  dmub_psr_enable+0x8f/0x110 [amdgpu]
Sep 07 21:17:57 nixos kernel:  edp_set_psr_allow_active+0x27b/0x3b0 [amdgpu]
Sep 07 21:17:57 nixos kernel:  amdgpu_dm_psr_disable+0x51/0x70 [amdgpu]
Sep 07 21:17:57 nixos kernel:  amdgpu_dm_crtc_vblank_control_worker+0x277/0x280 [amdgpu]
Sep 07 21:17:57 nixos kernel:  process_one_work+0x18a/0x340
Sep 07 21:17:57 nixos kernel:  worker_thread+0x225/0x360
Sep 07 21:17:57 nixos kernel:  ? __pfx_worker_thread+0x10/0x10
Sep 07 21:17:57 nixos kernel:  kthread+0xf8/0x250
Sep 07 21:17:57 nixos kernel:  ? finish_task_switch.isra.0+0x99/0x2e0
Sep 07 21:17:57 nixos kernel:  ? __pfx_kthread+0x10/0x10
Sep 07 21:17:57 nixos kernel:  ? __pfx_kthread+0x10/0x10
Sep 07 21:17:57 nixos kernel:  ret_from_fork+0x164/0x190
Sep 07 21:17:57 nixos kernel:  ? __pfx_kthread+0x10/0x10
Sep 07 21:17:57 nixos kernel:  ret_from_fork_asm+0x1a/0x30
Sep 07 21:17:57 nixos kernel:  </TASK>

So my question is: is my intuition right, and is this the issue I'm seeing? Or is it something else?