Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

NAS-124109 / 24.04 / Merge v6.1.50 to add support for AVX extensions #120

Merged
merged 124 commits into from
Sep 14, 2023

Conversation

usaleem-ix
Copy link
Contributor

In Cobia nightlies, support for AVX extension is not available. Due to this, Fletcher4 and RAIDZ performance is hit.

Currently in Cobia nightlies:

root@truenas[~]# cat /sys/module/zfs/parameters/zfs_vdev_raidz_impl
cycle [fastest] original scalar sse2 ssse3 # 
root@truenas[~]# cat /sys/module/zfs/parameters/zfs_fletcher_4_impl 
[fastest] scalar superscalar superscalar4 sse2 ssse3 # 
root@truenas[~]# cat /proc/spl/kstat/zfs/fletcher_4_bench
0 0 0x01 -1 0 8297888349 8256517936506
implementation   native         byteswap       
scalar           3952550218     3127757469     
superscalar      5292677479     4073405329     
superscalar4     4782032535     4162250508     
sse2             8355853056     4976411809     
ssse3            8354812782     7700633411     
fastest          sse2           ssse3  

After the update with v6.1.50:

root@truenas[~]# cat /sys/module/zfs/parameters/zfs_vdev_raidz_impl
cycle [fastest] original scalar sse2 ssse3 avx2 #
root@truenas[~]# cat /sys/module/zfs/parameters/zfs_fletcher_4_impl
[fastest] scalar superscalar superscalar4 sse2 ssse3 avx2 #  
root@truenas[~]# cat /proc/spl/kstat/zfs/fletcher_4_bench
0 0 0x01 -1 0 6880632918 197638602372
implementation   native         byteswap       
scalar           3919565944     3087609160     
superscalar      5287129973     4141508055     
superscalar4     4763883430     4326696588     
sse2             7875843343     5013934356     
ssse3            8365560879     7826325634     
avx2             14541263999    13464817127    
fastest          avx2           avx2    

This also applies to Dragonfish since we are using the same kernel version in both releases.

Fedor Pchelkin and others added 30 commits August 30, 2023 16:10
[ Upstream commit 4e3733f ]

There is a slight issue with error handling code inside
nfs42_proc_getxattr(). If page allocating loop fails then we free the
failing page array element which is NULL but __free_page() can't deal with
NULL args.

Found by Linux Verification Center (linuxtesting.org).

Fixes: a1f2673 ("NFSv4.2: improve page handling for GETXATTR")
Signed-off-by: Fedor Pchelkin <[email protected]>
Reviewed-by: Benjamin Coddington <[email protected]>
Signed-off-by: Trond Myklebust <[email protected]>
Signed-off-by: Sasha Levin <[email protected]>
[ Upstream commit f4e89f1 ]

Another highly rare error case when a page allocating loop (inside
__nfs4_get_acl_uncached, this time) is not properly unwound on error.
Since pages array is allocated being uninitialized, need to free only
lower array indices. NULL checks were useful before commit 62a1573
("NFSv4 fix acl retrieval over krb5i/krb5p mounts") when the array had
been initialized to zero on stack.

Found by Linux Verification Center (linuxtesting.org).

Fixes: 62a1573 ("NFSv4 fix acl retrieval over krb5i/krb5p mounts")
Signed-off-by: Fedor Pchelkin <[email protected]>
Reviewed-by: Benjamin Coddington <[email protected]>
Signed-off-by: Trond Myklebust <[email protected]>
Signed-off-by: Sasha Levin <[email protected]>
[ Upstream commit 895cedc ]

On server-initiated disconnect, rpcrdma_xprt_disconnect() was DMA-
unmapping the Receive buffers, but rpcrdma_post_recvs() neglected
to remap them after a new connection had been established. The
result was immediate failure of the new connection with the Receives
flushing with LOCAL_PROT_ERR.

Fixes: 671c450 ("xprtrdma: Fix oops in Receive handler after device removal")
Signed-off-by: Chuck Lever <[email protected]>
Signed-off-by: Trond Myklebust <[email protected]>
Signed-off-by: Sasha Levin <[email protected]>
[ Upstream commit c1ebead ]

It's just open coded and matches.

Note that Thomas said that his version apparently failed for some
reason, but hey maybe we should try again.

Signed-off-by: Daniel Vetter <[email protected]>
Cc: Dave Airlie <[email protected]>
Cc: Thomas Zimmermann <[email protected]>
Cc: Javier Martinez Canillas <[email protected]>
Cc: Helge Deller <[email protected]>
Cc: [email protected]
Tested-by: Thomas Zimmmermann <[email protected]>
Reviewed-by: Thomas Zimmermann <[email protected]>
Signed-off-by: Thomas Zimmermann <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
Stable-dep-of: 5ae3716 ("video/aperture: Only remove sysfb on the default vga pci device")
Signed-off-by: Sasha Levin <[email protected]>
[ Upstream commit 9b539c4 ]

It's not exactly the same since the open coded version doesn't set
primary correctly. But that's a bugfix, so shouldn't hurt really.

Signed-off-by: Daniel Vetter <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: [email protected]
Reviewed-by: Thomas Zimmermann <[email protected]>
Signed-off-by: Thomas Zimmermann <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
Stable-dep-of: 5ae3716 ("video/aperture: Only remove sysfb on the default vga pci device")
Signed-off-by: Sasha Levin <[email protected]>
[ Upstream commit 80e9939 ]

This one nukes all framebuffers, which is a bit much. In reality
gma500 is igpu and never shipped with anything discrete, so there should
not be any difference.

v2: Unfortunately the framebuffer sits outside of the pci bars for
gma500, and so only using the pci helpers won't be enough. Otoh if we
only use non-pci helper, then we don't get the vga handling, and
subsequent refactoring to untangle these special cases won't work.

It's not pretty, but the simplest fix (since gma500 really is the only
quirky pci driver like this we have) is to just have both calls.

v4:
- fix Daniel's S-o-b address

v5:
- add back an S-o-b tag with Daniel's Intel address

Signed-off-by: Daniel Vetter <[email protected]>
Signed-off-by: Daniel Vetter <[email protected]>
Signed-off-by: Thomas Zimmermann <[email protected]>
Cc: Patrik Jakobsson <[email protected]>
Cc: Thomas Zimmermann <[email protected]>
Cc: Javier Martinez Canillas <[email protected]>
Reviewed-by: Javier Martinez Canillas <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
Stable-dep-of: 5ae3716 ("video/aperture: Only remove sysfb on the default vga pci device")
Signed-off-by: Sasha Levin <[email protected]>
[ Upstream commit 62aeaea ]

Only really pci devices have a business setting this - it's for
figuring out whether the legacy vga stuff should be nuked too. And
with the preceding two patches those are all using the pci version of
this.

Which means for all other callers primary == false and we can remove
it now.

v2:
- Reorder to avoid compile fail (Thomas)
- Include gma500, which retained it's called to the non-pci version.

v4:
- fix Daniel's S-o-b address

v5:
- add back an S-o-b tag with Daniel's Intel address

Signed-off-by: Daniel Vetter <[email protected]>
Signed-off-by: Daniel Vetter <[email protected]>
Signed-off-by: Thomas Zimmermann <[email protected]>
Cc: Thomas Zimmermann <[email protected]>
Cc: Javier Martinez Canillas <[email protected]>
Cc: Maarten Lankhorst <[email protected]>
Cc: Maxime Ripard <[email protected]>
Cc: Deepak Rawat <[email protected]>
Cc: Neil Armstrong <[email protected]>
Cc: Kevin Hilman <[email protected]>
Cc: Jerome Brunet <[email protected]>
Cc: Martin Blumenstingl <[email protected]>
Cc: Thierry Reding <[email protected]>
Cc: Jonathan Hunter <[email protected]>
Cc: Emma Anholt <[email protected]>
Cc: Helge Deller <[email protected]>
Cc: David Airlie <[email protected]>
Cc: Daniel Vetter <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Acked-by: Martin Blumenstingl <[email protected]>
Acked-by: Thierry Reding <[email protected]>
Reviewed-by: Javier Martinez Canillas <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
Stable-dep-of: 5ae3716 ("video/aperture: Only remove sysfb on the default vga pci device")
Signed-off-by: Sasha Levin <[email protected]>
[ Upstream commit 7450cd2 ]

Otherwise it's a bit silly, and we might throw out the driver for the
screen the user is actually looking at. I haven't found a bug report
for this case yet, but we did get bug reports for the analog case
where we're throwing out the efifb driver.

v2: Flip the check around to make it clear it's a special case for
kicking out the vgacon driver only (Thomas)

v4:
- fixes to commit message
- fix Daniel's S-o-b address

v5:
- add back an S-o-b tag with Daniel's Intel address

Link: https://bugzilla.kernel.org/show_bug.cgi?id=216303
Signed-off-by: Daniel Vetter <[email protected]>
Signed-off-by: Daniel Vetter <[email protected]>
Signed-off-by: Thomas Zimmermann <[email protected]>
Cc: Thomas Zimmermann <[email protected]>
Cc: Javier Martinez Canillas <[email protected]>
Cc: Helge Deller <[email protected]>
Cc: [email protected]
Reviewed-by: Javier Martinez Canillas <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
Stable-dep-of: 5ae3716 ("video/aperture: Only remove sysfb on the default vga pci device")
Signed-off-by: Sasha Levin <[email protected]>
[ Upstream commit f1d599d ]

A few reasons for this:

- It's really the only one where this matters. I tried looking around,
  and I didn't find any non-pci vga-compatible controllers for x86
  (since that's the only platform where we had this until a few
  patches ago), where a driver participating in the aperture claim
  dance would interfere.

- I also don't expect that any future bus anytime soon will
  not just look like pci towards the OS, that's been the case for like
  25+ years by now for practically everything (even non non-x86).

- Also it's a bit funny if we have one part of the vga removal in the
  pci function, and the other in the generic one.

v2: Rebase.

v4:
- fix Daniel's S-o-b address

v5:
- add back an S-o-b tag with Daniel's Intel address

Signed-off-by: Daniel Vetter <[email protected]>
Signed-off-by: Daniel Vetter <[email protected]>
Signed-off-by: Thomas Zimmermann <[email protected]>
Cc: Thomas Zimmermann <[email protected]>
Cc: Javier Martinez Canillas <[email protected]>
Cc: Helge Deller <[email protected]>
Cc: [email protected]
Reviewed-by: Javier Martinez Canillas <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
Stable-dep-of: 5ae3716 ("video/aperture: Only remove sysfb on the default vga pci device")
Signed-off-by: Sasha Levin <[email protected]>
[ Upstream commit 40613da ]

When using ACPI PCI hotplug, hotplugging a device with large BARs may fail
if bridge windows programmed by firmware are not large enough.

Reproducer:
  $ qemu-kvm -monitor stdio -M q35  -m 4G \
      -global ICH9-LPC.acpi-pci-hotplug-with-bridge-support=on \
      -device id=rp1,pcie-root-port,bus=pcie.0,chassis=4 \
      disk_image

 wait till linux guest boots, then hotplug device:
   (qemu) device_add qxl,bus=rp1

 hotplug on guest side fails with:
   pci 0000:01:00.0: [1b36:0100] type 00 class 0x038000
   pci 0000:01:00.0: reg 0x10: [mem 0x00000000-0x03ffffff]
   pci 0000:01:00.0: reg 0x14: [mem 0x00000000-0x03ffffff]
   pci 0000:01:00.0: reg 0x18: [mem 0x00000000-0x00001fff]
   pci 0000:01:00.0: reg 0x1c: [io  0x0000-0x001f]
   pci 0000:01:00.0: BAR 0: no space for [mem size 0x04000000]
   pci 0000:01:00.0: BAR 0: failed to assign [mem size 0x04000000]
   pci 0000:01:00.0: BAR 1: no space for [mem size 0x04000000]
   pci 0000:01:00.0: BAR 1: failed to assign [mem size 0x04000000]
   pci 0000:01:00.0: BAR 2: assigned [mem 0xfe800000-0xfe801fff]
   pci 0000:01:00.0: BAR 3: assigned [io  0x1000-0x101f]
   qxl 0000:01:00.0: enabling device (0000 -> 0003)
   Unable to create vram_mapping
   qxl: probe of 0000:01:00.0 failed with error -12

However when using native PCIe hotplug
  '-global ICH9-LPC.acpi-pci-hotplug-with-bridge-support=off'
it works fine, since kernel attempts to reassign unused resources.

Use the same machinery as native PCIe hotplug to (re)assign resources.

Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Igor Mammedov <[email protected]>
Signed-off-by: Bjorn Helgaas <[email protected]>
Acked-by: Michael S. Tsirkin <[email protected]>
Acked-by: Rafael J. Wysocki <[email protected]>
Cc: [email protected]
Signed-off-by: Sasha Levin <[email protected]>
[ Upstream commit f641519 ]

cpu_has_octeon_cache was tied to 0 for generic cpu-features,
whith this generic kernel built for octeon CPU won't boot.

Just enable this flag by cpu_type. It won't hurt orther platforms
because compiler will eliminate the code path on other processors.

Signed-off-by: Jiaxun Yang <[email protected]>
Signed-off-by: Thomas Bogendoerfer <[email protected]>
Stable-dep-of: 5487a7b ("MIPS: cpu-features: Use boot_cpu_type for CPU type based features")
Signed-off-by: Sasha Levin <[email protected]>
[ Upstream commit 5487a7b ]

Some CPU feature macros were using current_cpu_type to mark feature
availability.

However current_cpu_type will use smp_processor_id, which is prohibited
under preemptable context.

Since those features are all uniform on all CPUs in a SMP system, use
boot_cpu_type instead of current_cpu_type to fix preemptable kernel.

Cc: [email protected]
Signed-off-by: Jiaxun Yang <[email protected]>
Signed-off-by: Thomas Bogendoerfer <[email protected]>
Signed-off-by: Sasha Levin <[email protected]>
[ Upstream commit be22255 ]

Since t_checkpoint_io_list was stop using in jbd2_log_do_checkpoint()
now, it's time to remove the whole t_checkpoint_io_list logic.

Signed-off-by: Zhang Yi <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Theodore Ts'o <[email protected]>
Stable-dep-of: 46f881b ("jbd2: fix a race when checking checkpoint buffer busy")
Signed-off-by: Sasha Levin <[email protected]>
[ Upstream commit b98dba2 ]

journal_clean_one_cp_list() and journal_shrink_one_cp_list() are almost
the same, so merge them into journal_shrink_one_cp_list(), remove the
nr_to_scan parameter, always scan and try to free the whole checkpoint
list.

Signed-off-by: Zhang Yi <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Theodore Ts'o <[email protected]>
Stable-dep-of: 46f881b ("jbd2: fix a race when checking checkpoint buffer busy")
Signed-off-by: Sasha Levin <[email protected]>
[ Upstream commit 46f881b ]

Before removing checkpoint buffer from the t_checkpoint_list, we have to
check both BH_Dirty and BH_Lock bits together to distinguish buffers
have not been or were being written back. But __cp_buffer_busy() checks
them separately, it first check lock state and then check dirty, the
window between these two checks could be raced by writing back
procedure, which locks buffer and clears buffer dirty before I/O
completes. So it cannot guarantee checkpointing buffers been written
back to disk if some error happens later. Finally, it may clean
checkpoint transactions and lead to inconsistent filesystem.

jbd2_journal_forget() and __journal_try_to_free_buffer() also have the
same problem (journal_unmap_buffer() escape from this issue since it's
running under the buffer lock), so fix them through introducing a new
helper to try holding the buffer lock and remove really clean buffer.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=217490
Cc: [email protected]
Suggested-by: Jan Kara <[email protected]>
Signed-off-by: Zhang Yi <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Theodore Ts'o <[email protected]>
Signed-off-by: Sasha Levin <[email protected]>
[ Upstream commit ee8b94c ]

Got kmemleak errors with the following ltp can_filter testcase:

for ((i=1; i<=100; i++))
do
        ./can_filter &
        sleep 0.1
done

==============================================================
[<00000000db4a4943>] can_rx_register+0x147/0x360 [can]
[<00000000a289549d>] raw_setsockopt+0x5ef/0x853 [can_raw]
[<000000006d3d9ebd>] __sys_setsockopt+0x173/0x2c0
[<00000000407dbfec>] __x64_sys_setsockopt+0x61/0x70
[<00000000fd468496>] do_syscall_64+0x33/0x40
[<00000000b7e47d51>] entry_SYSCALL_64_after_hwframe+0x61/0xc6

It's a bug in the concurrent scenario of unregister_netdevice_many()
and raw_release() as following:

             cpu0                                        cpu1
unregister_netdevice_many(can_dev)
  unlist_netdevice(can_dev) // dev_get_by_index() return NULL after this
  net_set_todo(can_dev)
						raw_release(can_socket)
						  dev = dev_get_by_index(, ro->ifindex); // dev == NULL
						  if (dev) { // receivers in dev_rcv_lists not free because dev is NULL
						    raw_disable_allfilters(, dev, );
						    dev_put(dev);
						  }
						  ...
						  ro->bound = 0;
						  ...

call_netdevice_notifiers(NETDEV_UNREGISTER, )
  raw_notify(, NETDEV_UNREGISTER, )
    if (ro->bound) // invalid because ro->bound has been set 0
      raw_disable_allfilters(, dev, ); // receivers in dev_rcv_lists will never be freed

Add a net_device pointer member in struct raw_sock to record bound
can_dev, and use rtnl_lock to serialize raw_socket members between
raw_bind(), raw_release(), raw_setsockopt() and raw_notify(). Use
ro->dev to decide whether to free receivers in dev_rcv_lists.

Fixes: 8d0caed ("can: bcm/raw/isotp: use per module netdevice notifier")
Reviewed-by: Oliver Hartkopp <[email protected]>
Acked-by: Oliver Hartkopp <[email protected]>
Signed-off-by: Ziyang Xuan <[email protected]>
Link: https://lore.kernel.org/all/[email protected]
Cc: [email protected]
Signed-off-by: Marc Kleine-Budde <[email protected]>
Signed-off-by: Sasha Levin <[email protected]>
[ Upstream commit 11c9027 ]

syzbot complained about a lockdep issue [1]

Since raw_bind() and raw_setsockopt() first get RTNL
before locking the socket, we must adopt the same order in raw_release()

[1]
WARNING: possible circular locking dependency detected
6.5.0-rc1-syzkaller-00192-g78adb4bcf99e #0 Not tainted
------------------------------------------------------
syz-executor.0/14110 is trying to acquire lock:
ffff88804e4b6130 (sk_lock-AF_CAN){+.+.}-{0:0}, at: lock_sock include/net/sock.h:1708 [inline]
ffff88804e4b6130 (sk_lock-AF_CAN){+.+.}-{0:0}, at: raw_bind+0xb1/0xab0 net/can/raw.c:435

but task is already holding lock:
ffffffff8e3df368 (rtnl_mutex){+.+.}-{3:3}, at: raw_bind+0xa7/0xab0 net/can/raw.c:434

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #1 (rtnl_mutex){+.+.}-{3:3}:
__mutex_lock_common kernel/locking/mutex.c:603 [inline]
__mutex_lock+0x181/0x1340 kernel/locking/mutex.c:747
raw_release+0x1c6/0x9b0 net/can/raw.c:391
__sock_release+0xcd/0x290 net/socket.c:654
sock_close+0x1c/0x20 net/socket.c:1386
__fput+0x3fd/0xac0 fs/file_table.c:384
task_work_run+0x14d/0x240 kernel/task_work.c:179
resume_user_mode_work include/linux/resume_user_mode.h:49 [inline]
exit_to_user_mode_loop kernel/entry/common.c:171 [inline]
exit_to_user_mode_prepare+0x210/0x240 kernel/entry/common.c:204
__syscall_exit_to_user_mode_work kernel/entry/common.c:286 [inline]
syscall_exit_to_user_mode+0x1d/0x50 kernel/entry/common.c:297
do_syscall_64+0x44/0xb0 arch/x86/entry/common.c:86
entry_SYSCALL_64_after_hwframe+0x63/0xcd

-> #0 (sk_lock-AF_CAN){+.+.}-{0:0}:
check_prev_add kernel/locking/lockdep.c:3142 [inline]
check_prevs_add kernel/locking/lockdep.c:3261 [inline]
validate_chain kernel/locking/lockdep.c:3876 [inline]
__lock_acquire+0x2e3d/0x5de0 kernel/locking/lockdep.c:5144
lock_acquire kernel/locking/lockdep.c:5761 [inline]
lock_acquire+0x1ae/0x510 kernel/locking/lockdep.c:5726
lock_sock_nested+0x3a/0xf0 net/core/sock.c:3492
lock_sock include/net/sock.h:1708 [inline]
raw_bind+0xb1/0xab0 net/can/raw.c:435
__sys_bind+0x1ec/0x220 net/socket.c:1792
__do_sys_bind net/socket.c:1803 [inline]
__se_sys_bind net/socket.c:1801 [inline]
__x64_sys_bind+0x72/0xb0 net/socket.c:1801
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x38/0xb0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x63/0xcd

other info that might help us debug this:

Possible unsafe locking scenario:

CPU0 CPU1
---- ----
lock(rtnl_mutex);
        lock(sk_lock-AF_CAN);
        lock(rtnl_mutex);
lock(sk_lock-AF_CAN);

*** DEADLOCK ***

1 lock held by syz-executor.0/14110:

stack backtrace:
CPU: 0 PID: 14110 Comm: syz-executor.0 Not tainted 6.5.0-rc1-syzkaller-00192-g78adb4bcf99e #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 07/03/2023
Call Trace:
<TASK>
__dump_stack lib/dump_stack.c:88 [inline]
dump_stack_lvl+0xd9/0x1b0 lib/dump_stack.c:106
check_noncircular+0x311/0x3f0 kernel/locking/lockdep.c:2195
check_prev_add kernel/locking/lockdep.c:3142 [inline]
check_prevs_add kernel/locking/lockdep.c:3261 [inline]
validate_chain kernel/locking/lockdep.c:3876 [inline]
__lock_acquire+0x2e3d/0x5de0 kernel/locking/lockdep.c:5144
lock_acquire kernel/locking/lockdep.c:5761 [inline]
lock_acquire+0x1ae/0x510 kernel/locking/lockdep.c:5726
lock_sock_nested+0x3a/0xf0 net/core/sock.c:3492
lock_sock include/net/sock.h:1708 [inline]
raw_bind+0xb1/0xab0 net/can/raw.c:435
__sys_bind+0x1ec/0x220 net/socket.c:1792
__do_sys_bind net/socket.c:1803 [inline]
__se_sys_bind net/socket.c:1801 [inline]
__x64_sys_bind+0x72/0xb0 net/socket.c:1801
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x38/0xb0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x63/0xcd
RIP: 0033:0x7fd89007cb29
Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 e1 20 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b0 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007fd890d2a0c8 EFLAGS: 00000246 ORIG_RAX: 0000000000000031
RAX: ffffffffffffffda RBX: 00007fd89019bf80 RCX: 00007fd89007cb29
RDX: 0000000000000010 RSI: 0000000020000040 RDI: 0000000000000003
RBP: 00007fd8900c847a R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
R13: 000000000000000b R14: 00007fd89019bf80 R15: 00007ffebf8124f8
</TASK>

Fixes: ee8b94c ("can: raw: fix receiver memory leak")
Reported-by: syzbot <[email protected]>
Signed-off-by: Eric Dumazet <[email protected]>
Cc: Ziyang Xuan <[email protected]>
Cc: Oliver Hartkopp <[email protected]>
Cc: [email protected]
Cc: Marc Kleine-Budde <[email protected]>
Link: https://lore.kernel.org/all/[email protected]
Signed-off-by: Marc Kleine-Budde <[email protected]>
Signed-off-by: Sasha Levin <[email protected]>
[ Upstream commit 72c2112 ]

Pointer variables of void * type do not require type cast.

Signed-off-by: Yu Zhe <[email protected]>
Reviewed-by: Muhammad Usama Anjum <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Heiko Carstens <[email protected]>
Signed-off-by: Vasily Gorbik <[email protected]>
Stable-dep-of: 4cfca53 ("s390/zcrypt: fix reply buffer calculations for CCA replies")
Signed-off-by: Sasha Levin <[email protected]>
[ Upstream commit 4cfca53 ]

The length information for available buffer space for CCA
replies is covered with two fields in the T6 header prepended
on each CCA reply: fromcardlen1 and fromcardlen2. The sum of
these both values must not exceed the AP bus limit for this
card (24KB for CEX8, 12KB CEX7 and older) minus the always
present headers.

The current code adjusted the fromcardlen2 value in case
of exceeding the AP bus limit when there was a non-zero
value given from userspace. Some tests now showed that this
was the wrong assumption. Instead the userspace value given for
this field should always be trusted and if the sum of the
two fields exceeds the AP bus limit for this card the first
field fromcardlen1 should be adjusted instead.

So now the calculation is done with this new insight in mind.
Also some additional checks for overflow have been introduced
and some comments to provide some documentation for future
maintainers of this complicated calculation code.

Furthermore the 128 bytes of fix overhead which is used
in the current code is not correct. Investigations showed
that for a reply always the same two header structs are
prepended before a possible payload. So this is also fixed
with this patch.

Signed-off-by: Harald Freudenberger <[email protected]>
Reviewed-by: Holger Dengler <[email protected]>
Cc: [email protected]
Signed-off-by: Heiko Carstens <[email protected]>
Signed-off-by: Sasha Levin <[email protected]>
[ Upstream commit b2f59e9 ]

We always assumed that a device might either have AUX or FLAT
CCS, but this is an approximation that is not always true, e.g.
PVC represents an exception.

Set the basis for future finer selection by implementing a
boolean gen12_needs_ccs_aux_inv() function that tells whether aux
invalidation is needed or not.

Currently PVC is the only exception to the above mentioned rule.

Requires: 059ae7ae2a1c ("drm/i915/gt: Cleanup aux invalidation registers")
Signed-off-by: Andi Shyti <[email protected]>
Cc: Matt Roper <[email protected]>
Cc: Jonathan Cavitt <[email protected]>
Cc: <[email protected]> # v5.8+
Reviewed-by: Matt Roper <[email protected]>
Reviewed-by: Andrzej Hajda <[email protected]>
Reviewed-by: Nirmoy Das <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
(cherry picked from commit c827655)
Signed-off-by: Tvrtko Ursulin <[email protected]>
Signed-off-by: Sasha Levin <[email protected]>
[ Upstream commit 78a6ccd ]

All memory traffic must be quiesced before requesting
an aux invalidation on platforms that use Aux CCS.

Fixes: 972282c ("drm/i915/gen12: Add aux table invalidate for all engines")
Requires: a2a4aa0eef3b ("drm/i915: Add the gen12_needs_ccs_aux_inv helper")
Signed-off-by: Jonathan Cavitt <[email protected]>
Signed-off-by: Andi Shyti <[email protected]>
Cc: <[email protected]> # v5.8+
Reviewed-by: Nirmoy Das <[email protected]>
Reviewed-by: Andrzej Hajda <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
(cherry picked from commit ad8ebf1)
Signed-off-by: Tvrtko Ursulin <[email protected]>
Signed-off-by: Sasha Levin <[email protected]>
[ Upstream commit 0fde2f2 ]

For platforms that use Aux CCS, wait for aux invalidation to
complete by checking the aux invalidation register bit is
cleared.

Fixes: 972282c ("drm/i915/gen12: Add aux table invalidate for all engines")
Signed-off-by: Jonathan Cavitt <[email protected]>
Signed-off-by: Andi Shyti <[email protected]>
Cc: <[email protected]> # v5.8+
Reviewed-by: Nirmoy Das <[email protected]>
Reviewed-by: Andrzej Hajda <[email protected]>
Reviewed-by: Matt Roper <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
(cherry picked from commit d459c86)
Signed-off-by: Tvrtko Ursulin <[email protected]>
Signed-off-by: Sasha Levin <[email protected]>
[ Upstream commit 6a35f22 ]

Perform some refactoring with the purpose of keeping in one
single place all the operations around the aux table
invalidation.

With this refactoring add more engines where the invalidation
should be performed.

Fixes: 972282c ("drm/i915/gen12: Add aux table invalidate for all engines")
Signed-off-by: Andi Shyti <[email protected]>
Cc: Jonathan Cavitt <[email protected]>
Cc: Matt Roper <[email protected]>
Cc: <[email protected]> # v5.8+
Reviewed-by: Andrzej Hajda <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
(cherry picked from commit 76ff778)
Signed-off-by: Tvrtko Ursulin <[email protected]>
Signed-off-by: Sasha Levin <[email protected]>
[ Upstream commit b71645d ]

Trace ring buffer can no longer record anything after executing
following commands at the shell prompt:

  # cd /sys/kernel/tracing
  # cat tracing_cpumask
  fff
  # echo 0 > tracing_cpumask
  # echo 1 > snapshot
  # echo fff > tracing_cpumask
  # echo 1 > tracing_on
  # echo "hello world" > trace_marker
  -bash: echo: write error: Bad file descriptor

The root cause is that:
  1. After `echo 0 > tracing_cpumask`, 'record_disabled' of cpu buffers
     in 'tr->array_buffer.buffer' became 1 (see tracing_set_cpumask());
  2. After `echo 1 > snapshot`, 'tr->array_buffer.buffer' is swapped
     with 'tr->max_buffer.buffer', then the 'record_disabled' became 0
     (see update_max_tr());
  3. After `echo fff > tracing_cpumask`, the 'record_disabled' become -1;
Then array_buffer and max_buffer are both unavailable due to value of
'record_disabled' is not 0.

To fix it, enable or disable both array_buffer and max_buffer at the same
time in tracing_set_cpumask().

Link: https://lkml.kernel.org/r/[email protected]

Cc: <[email protected]>
Cc: <[email protected]>
Cc: <[email protected]>
Fixes: 71babb2 ("tracing: change CPU ring buffer state from tracing_cpumask")
Signed-off-by: Zheng Yejian <[email protected]>
Signed-off-by: Steven Rostedt (Google) <[email protected]>
Signed-off-by: Sasha Levin <[email protected]>
[ Upstream commit eecb91b ]

Kmemleak report a leak in graph_trace_open():

  unreferenced object 0xffff0040b95f4a00 (size 128):
    comm "cat", pid 204981, jiffies 4301155872 (age 99771.964s)
    hex dump (first 32 bytes):
      e0 05 e7 b4 ab 7d 00 00 0b 00 01 00 00 00 00 00 .....}..........
      f4 00 01 10 00 a0 ff ff 00 00 00 00 65 00 10 00 ............e...
    backtrace:
      [<000000005db27c8b>] kmem_cache_alloc_trace+0x348/0x5f0
      [<000000007df90faa>] graph_trace_open+0xb0/0x344
      [<00000000737524cd>] __tracing_open+0x450/0xb10
      [<0000000098043327>] tracing_open+0x1a0/0x2a0
      [<00000000291c3876>] do_dentry_open+0x3c0/0xdc0
      [<000000004015bcd6>] vfs_open+0x98/0xd0
      [<000000002b5f60c9>] do_open+0x520/0x8d0
      [<00000000376c7820>] path_openat+0x1c0/0x3e0
      [<00000000336a54b5>] do_filp_open+0x14c/0x324
      [<000000002802df13>] do_sys_openat2+0x2c4/0x530
      [<0000000094eea458>] __arm64_sys_openat+0x130/0x1c4
      [<00000000a71d7881>] el0_svc_common.constprop.0+0xfc/0x394
      [<00000000313647bf>] do_el0_svc+0xac/0xec
      [<000000002ef1c651>] el0_svc+0x20/0x30
      [<000000002fd4692a>] el0_sync_handler+0xb0/0xb4
      [<000000000c309c35>] el0_sync+0x160/0x180

The root cause is descripted as follows:

  __tracing_open() {  // 1. File 'trace' is being opened;
    ...
    *iter->trace = *tr->current_trace;  // 2. Tracer 'function_graph' is
                                        //    currently set;
    ...
    iter->trace->open(iter);  // 3. Call graph_trace_open() here,
                              //    and memory are allocated in it;
    ...
  }

  s_start() {  // 4. The opened file is being read;
    ...
    *iter->trace = *tr->current_trace;  // 5. If tracer is switched to
                                        //    'nop' or others, then memory
                                        //    in step 3 are leaked!!!
    ...
  }

To fix it, in s_start(), close tracer before switching then reopen the
new tracer after switching. And some tracers like 'wakeup' may not update
'iter->private' in some cases when reopen, then it should be cleared
to avoid being mistakenly closed again.

Link: https://lore.kernel.org/linux-trace-kernel/[email protected]

Fixes: d7350c3 ("tracing/core: make the read callbacks reentrants")
Signed-off-by: Zheng Yejian <[email protected]>
Signed-off-by: Steven Rostedt (Google) <[email protected]>
Signed-off-by: Sasha Levin <[email protected]>
[ Upstream commit 05f3d5b ]

On SDP interfaces, frame oversize and undersize errors are
observed as driver is not considering packet sizes of all
subscribers of the link before updating the link config.

This patch fixes the same.

Fixes: 9b7dd87 ("octeontx2-af: Support to modify min/max allowed packet lengths")
Signed-off-by: Hariprasad Kelam <[email protected]>
Signed-off-by: Sunil Goutham <[email protected]>
Reviewed-by: Leon Romanovsky <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
Signed-off-by: Sasha Levin <[email protected]>
[ Upstream commit f05bd8e ]

The devlink code is hard to navigate with 13kLoC in one file.
I really like the way Michal split the ethtool into per-command
files and core. It'd probably be too much to split it all up,
but we can at least separate the core parts out of the per-cmd
implementations and put it in a directory so that new commands
can be separate files.

Move the code, subsequent commit will do a partial split.

Reviewed-by: Jacob Keller <[email protected]>
Reviewed-by: Jiri Pirko <[email protected]>
Signed-off-by: Jakub Kicinski <[email protected]>
Stable-dep-of: 2ebbc97 ("devlink: add missing unregister linecard notification")
Signed-off-by: Sasha Levin <[email protected]>
[ Upstream commit 2ebbc97 ]

Cited fixes commit introduced linecard notifications for register,
however it didn't add them for unregister. Fix that by adding them.

Fixes: c246f9b ("devlink: add support to create line card and expose to user")
Signed-off-by: Jiri Pirko <[email protected]>
Reviewed-by: Simon Horman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
Signed-off-by: Sasha Levin <[email protected]>
…rio gates

[ Upstream commit d44036c ]

The blamed commit resolved a bug where frames would still get stuck at
egress, even though they're smaller than the maxSDU[tc], because the
driver did not take into account the extra 33 ns that the queue system
needs for scheduling the frame.

It now takes that into account, but the arithmetic that we perform in
vsc9959_tas_remaining_gate_len_ps() is buggy, because we operate on
64-bit unsigned integers, so gate_len_ns - VSC9959_TAS_MIN_GATE_LEN_NS
may become a very large integer if gate_len_ns < 33 ns.

In practice, this means that we've introduced a regression where all
traffic class gates which are permanently closed will not get detected
by the driver, and we won't enable oversize frame dropping for them.

Before:
mscc_felix 0000:00:00.5: port 0: max frame size 1526 needs 12400000 ps, 1152000 ps for mPackets at speed 1000
mscc_felix 0000:00:00.5: port 0 tc 0 min gate len 1000000, sending all frames
mscc_felix 0000:00:00.5: port 0 tc 1 min gate len 0, sending all frames
mscc_felix 0000:00:00.5: port 0 tc 2 min gate len 0, sending all frames
mscc_felix 0000:00:00.5: port 0 tc 3 min gate len 0, sending all frames
mscc_felix 0000:00:00.5: port 0 tc 4 min gate len 0, sending all frames
mscc_felix 0000:00:00.5: port 0 tc 5 min gate len 0, sending all frames
mscc_felix 0000:00:00.5: port 0 tc 6 min gate len 0, sending all frames
mscc_felix 0000:00:00.5: port 0 tc 7 min gate length 5120 ns not enough for max frame size 1526 at 1000 Mbps, dropping frames over 615 octets including FCS

After:
mscc_felix 0000:00:00.5: port 0: max frame size 1526 needs 12400000 ps, 1152000 ps for mPackets at speed 1000
mscc_felix 0000:00:00.5: port 0 tc 0 min gate len 1000000, sending all frames
mscc_felix 0000:00:00.5: port 0 tc 1 min gate length 0 ns not enough for max frame size 1526 at 1000 Mbps, dropping frames over 1 octets including FCS
mscc_felix 0000:00:00.5: port 0 tc 2 min gate length 0 ns not enough for max frame size 1526 at 1000 Mbps, dropping frames over 1 octets including FCS
mscc_felix 0000:00:00.5: port 0 tc 3 min gate length 0 ns not enough for max frame size 1526 at 1000 Mbps, dropping frames over 1 octets including FCS
mscc_felix 0000:00:00.5: port 0 tc 4 min gate length 0 ns not enough for max frame size 1526 at 1000 Mbps, dropping frames over 1 octets including FCS
mscc_felix 0000:00:00.5: port 0 tc 5 min gate length 0 ns not enough for max frame size 1526 at 1000 Mbps, dropping frames over 1 octets including FCS
mscc_felix 0000:00:00.5: port 0 tc 6 min gate length 0 ns not enough for max frame size 1526 at 1000 Mbps, dropping frames over 1 octets including FCS
mscc_felix 0000:00:00.5: port 0 tc 7 min gate length 5120 ns not enough for max frame size 1526 at 1000 Mbps, dropping frames over 615 octets including FCS

Fixes: 11afdc6 ("net: dsa: felix: tc-taprio intervals smaller than MTU should send at least one packet")
Signed-off-by: Vladimir Oltean <[email protected]>
Reviewed-by: Simon Horman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
Signed-off-by: Sasha Levin <[email protected]>
[ Upstream commit 76f3329 ]

*prot->memory_pressure is read/writen locklessly, we need
to add proper annotations.

A recent commit added a new race, it is time to audit all accesses.

Fixes: 2d0c88e ("sock: Fix misuse of sk_under_memory_pressure()")
Fixes: 4d93df0 ("[SCTP]: Rewrite of sctp buffer management code")
Signed-off-by: Eric Dumazet <[email protected]>
Cc: Abel Wu <[email protected]>
Reviewed-by: Shakeel Butt <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
Signed-off-by: Sasha Levin <[email protected]>
jlelli and others added 23 commits August 30, 2023 16:11
commit 111cd11 upstream.

Turns out percpu_cpuset_rwsem - commit 1243dc5 ("cgroup/cpuset:
Convert cpuset_mutex to percpu_rwsem") - wasn't such a brilliant idea,
as it has been reported to cause slowdowns in workloads that need to
change cpuset configuration frequently and it is also not implementing
priority inheritance (which causes troubles with realtime workloads).

Convert percpu_cpuset_rwsem back to regular cpuset_mutex. Also grab it
only for SCHED_DEADLINE tasks (other policies don't care about stable
cpusets anyway).

Signed-off-by: Juri Lelli <[email protected]>
Reviewed-by: Waiman Long <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>
[ Conflict in kernel/cgroup/cpuset.c due to pulling new code/comments.
  Reject all new code. Remove BUG_ON() about rwsem that doesn't exist on
  mainline. ]
Signed-off-by: Qais Yousef (Google) <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
commit 6c24849 upstream.

Qais reported that iterating over all tasks when rebuilding root domains
for finding out which ones are DEADLINE and need their bandwidth
correctly restored on such root domains can be a costly operation (10+
ms delays on suspend-resume).

To fix the problem keep track of the number of DEADLINE tasks belonging
to each cpuset and then use this information (followup patch) to only
perform the above iteration if DEADLINE tasks are actually present in
the cpuset for which a corresponding root domain is being rebuilt.

Reported-by: Qais Yousef (Google) <[email protected]>
Link: https://lore.kernel.org/lkml/[email protected]/
Signed-off-by: Juri Lelli <[email protected]>
Reviewed-by: Waiman Long <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>
Signed-off-by: Qais Yousef (Google) <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
commit c0f78fd upstream.

update_tasks_root_domain currently iterates over all tasks even if no
DEADLINE task is present on the cpuset/root domain for which bandwidth
accounting is being rebuilt. This has been reported to introduce 10+ ms
delays on suspend-resume operations.

Skip the costly iteration for cpusets that don't contain DEADLINE tasks.

Reported-by: Qais Yousef (Google) <[email protected]>
Link: https://lore.kernel.org/lkml/[email protected]/
Signed-off-by: Juri Lelli <[email protected]>
Reviewed-by: Waiman Long <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>
Signed-off-by: Qais Yousef (Google) <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
commit 8598910 upstream.

While moving a set of tasks between exclusive cpusets,
cpuset_can_attach() -> task_can_attach() calls dl_cpu_busy(..., p) for
DL BW overflow checking and per-task DL BW allocation on the destination
root_domain for the DL tasks in this set.

This approach has the issue of not freeing already allocated DL BW in
the following error cases:

(1) The set of tasks includes multiple DL tasks and DL BW overflow
    checking fails for one of the subsequent DL tasks.

(2) Another controller next to the cpuset controller which is attached
    to the same cgroup fails in its can_attach().

To address this problem rework dl_cpu_busy():

(1) Split it into dl_bw_check_overflow() & dl_bw_alloc() and add a
    dedicated dl_bw_free().

(2) dl_bw_alloc() & dl_bw_free() take a `u64 dl_bw` parameter instead of
    a `struct task_struct *p` used in dl_cpu_busy(). This allows to
    allocate DL BW for a set of tasks too rather than only for a single
    task.

Signed-off-by: Dietmar Eggemann <[email protected]>
Signed-off-by: Juri Lelli <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>
Signed-off-by: Qais Yousef (Google) <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
commit 2ef269e upstream.

cpuset_can_attach() can fail. Postpone DL BW allocation until all tasks
have been checked. DL BW is not allocated per-task but as a sum over
all DL tasks migrating.

If multiple controllers are attached to the cgroup next to the cpuset
controller a non-cpuset can_attach() can fail. In this case free DL BW
in cpuset_cancel_attach().

Finally, update cpuset DL task count (nr_deadline_tasks) only in
cpuset_attach().

Suggested-by: Waiman Long <[email protected]>
Signed-off-by: Dietmar Eggemann <[email protected]>
Signed-off-by: Juri Lelli <[email protected]>
Reviewed-by: Waiman Long <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>
Signed-off-by: Qais Yousef (Google) <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
…ug onwards

commit 583893a upstream.

Previously, on unplug events, the TMU mode was disabled first
followed by the Time Synchronization Handshake, irrespective of
whether the tb_switch_tmu_rate_write() API was successful or not.

However, this caused a problem with Thunderbolt 3 (TBT3)
devices, as the TSPacketInterval bits were always enabled by default,
leading the host router to assume that the device router's TMU was
already enabled and preventing it from initiating the Time
Synchronization Handshake. As a result, TBT3 monitors experienced
display flickering from the second hot plug onwards.

To address this issue, we have modified the code to only disable the
Time Synchronization Handshake during TMU disable if the
tb_switch_tmu_rate_write() function is successful. This ensures that
the TBT3 devices function correctly and eliminates the display
flickering issue.

Co-developed-by: Sanath S <[email protected]>
Signed-off-by: Sanath S <[email protected]>
Signed-off-by: Sanjay R Mehta <[email protected]>
Cc: [email protected]
Signed-off-by: Mika Westerberg <[email protected]>
[ USB4v2 introduced support for uni-directional TMU mode as part of
  d49b4f0 ("thunderbolt: Add support for enhanced uni-directional TMU mode")
  This is not a stable candidate commit, so adjust the code for backport. ]
Signed-off-by: Mario Limonciello <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
commit 9c7c4bc upstream.

sizeof(struct ublksrv_io_cmd) is 16bytes, which can be held in 64byte SQE,
so not necessary to check IO_URING_F_SQE128.

With this change, we get chance to save half SQ ring memory.

Fixed: 71f28f3 ("ublk_drv: add io_uring based userspace block driver")
Signed-off-by: Ming Lei <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
commit c275a17 upstream.

Commit ee8b94c ("can: raw: fix receiver memory leak") introduced
a new reference to the CAN netdevice that has assigned CAN filters.
But this new ro->dev reference did not maintain its own refcount which
lead to another KASAN use-after-free splat found by Eric Dumazet.

This patch ensures a proper refcount for the CAN nedevice.

Fixes: ee8b94c ("can: raw: fix receiver memory leak")
Reported-by: Eric Dumazet <[email protected]>
Cc: Ziyang Xuan <[email protected]>
Signed-off-by: Oliver Hartkopp <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
…folio for sharing check

commit 0e0e9bd upstream.

Commit 98b211d ("madvise: convert madvise_free_pte_range() to use a
folio") replaced the page_mapcount() with folio_mapcount() to check
whether the folio is shared by other mapping.

It's not correct for large folios. folio_mapcount() returns the total
mapcount of large folio which is not suitable to detect whether the folio
is shared.

Use folio_estimated_sharers() which returns a estimated number of shares.
That means it's not 100% correct. It should be OK for madvise case here.

User-visible effects is that the THP is skipped when user call madvise.
But the correct behavior is THP should be split and processed then.

NOTE: this change is a temporary fix to reduce the user-visible effects
before the long term fix from David is ready.

Link: https://lkml.kernel.org/r/[email protected]
Fixes: 98b211d ("madvise: convert madvise_free_pte_range() to use a folio")
Signed-off-by: Yin Fengwei <[email protected]>
Reviewed-by: Yu Zhao <[email protected]>
Reviewed-by: Ryan Roberts <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Kefeng Wang <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Vishal Moola (Oracle) <[email protected]>
Cc: Yang Shi <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
commit 1bd3a76 upstream.

Commit 41320b1 ("scsi: snic: Fix possible memory leak if device_add()
fails") fixed the memory leak caused by dev_set_name() when device_add()
failed. However, it did not consider that 'tgt' has already been released
when put_device(&tgt->dev) is called. Remove kfree(tgt) in the error path
to avoid double free of 'tgt' and move put_device(&tgt->dev) after the
removed kfree(tgt) to avoid a use-after-free.

Fixes: 41320b1 ("scsi: snic: Fix possible memory leak if device_add() fails")
Signed-off-by: Zhu Wang <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Martin K. Petersen <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
commit 60c5fd2 upstream.

The raid_component_add() function was added to the kernel tree via patch
"[SCSI] embryonic RAID class" (2005). Remove this function since it never
has had any callers in the Linux kernel. And also raid_component_release()
is only used in raid_component_add(), so it is also removed.

Signed-off-by: Zhu Wang <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Reviewed-by: Bart Van Assche <[email protected]>
Fixes: 04b5b5c ("scsi: core: Fix possible memory leak if device_add() fails")
Signed-off-by: Martin K. Petersen <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
[ Upstream commit 2746f13 ]

The COMMON_CLK config is not enabled in some of the architectures.
This causes below build issues:

pwm-rz-mtu3.c:(.text+0x114):
undefined reference to `clk_rate_exclusive_put'
pwm-rz-mtu3.c:(.text+0x32c):
undefined reference to `clk_rate_exclusive_get'

Fix these issues by moving clk_rate_exclusive_{get,put} inside COMMON_CLK
code block, as clk.c is enabled by COMMON_CLK.

Fixes: 55e9b8b ("clk: add clk_rate_exclusive api")
Reported-by: kernel test robot <[email protected]>
Closes: https://lore.kernel.org/all/[email protected]/
Signed-off-by: Biju Das <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Stephen Boyd <[email protected]>
Signed-off-by: Sasha Levin <[email protected]>
…node_to_map()

[ Upstream commit 661efa2 ]

Fix the below random NULL pointer crash during boot by serializing
pinctrl group and function creation/remove calls in
rzg2l_dt_subnode_to_map() with mutex lock.

Crash log:
    pc : __pi_strcmp+0x20/0x140
    lr : pinmux_func_name_to_selector+0x68/0xa4
    Call trace:
    __pi_strcmp+0x20/0x140
    pinmux_generic_add_function+0x34/0xcc
    rzg2l_dt_subnode_to_map+0x314/0x44c
    rzg2l_dt_node_to_map+0x164/0x194
    pinctrl_dt_to_map+0x218/0x37c
    create_pinctrl+0x70/0x3d8

While at it, add comments for bitmap_lock and lock.

Fixes: c4c4637 ("pinctrl: renesas: Add RZ/G2L pin and gpio controller driver")
Tested-by: Chris Paterson <[email protected]>
Signed-off-by: Biju Das <[email protected]>
Reviewed-by: Geert Uytterhoeven <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Geert Uytterhoeven <[email protected]>
Signed-off-by: Sasha Levin <[email protected]>
…node_to_map()

[ Upstream commit f982b9d ]

Fix the below random NULL pointer crash during boot by serializing
pinctrl group and function creation/remove calls in
rzv2m_dt_subnode_to_map() with mutex lock.

Crash logs:
    pc : __pi_strcmp+0x20/0x140
    lr : pinmux_func_name_to_selector+0x68/0xa4
    Call trace:
    __pi_strcmp+0x20/0x140
    pinmux_generic_add_function+0x34/0xcc
    rzv2m_dt_subnode_to_map+0x2e4/0x418
    rzv2m_dt_node_to_map+0x15c/0x18c
    pinctrl_dt_to_map+0x218/0x37c
    create_pinctrl+0x70/0x3d8

While at it, add a comment for lock.

Fixes: 92a9b82 ("pinctrl: renesas: Add RZ/V2M pin and gpio controller driver")
Signed-off-by: Biju Das <[email protected]>
Reviewed-by: Geert Uytterhoeven <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Geert Uytterhoeven <[email protected]>
Signed-off-by: Sasha Levin <[email protected]>
…group,{add,remove}_function}

[ Upstream commit 8fcc1c4 ]

The pinctrl group and function creation/remove calls expect
caller to take care of locking. Add lock around these functions.

Fixes: b59d0e7 ("pinctrl: Add RZ/A2 pin and gpio controller")
Signed-off-by: Biju Das <[email protected]>
Reviewed-by: Geert Uytterhoeven <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Geert Uytterhoeven <[email protected]>
Signed-off-by: Sasha Levin <[email protected]>
[ Upstream commit e531fdb ]

If a signal callback releases the sw_sync fence, that will trigger a
deadlock as the timeline_fence_release recurses onto the fence->lock
(used both for signaling and the the timeline tree).

To avoid that, temporarily hold an extra reference to the signalled
fences until after we drop the lock.

(This is an alternative implementation of https://patchwork.kernel.org/patch/11664717/
which avoids some potential UAF issues with the original patch.)

v2: Remove now obsolete comment, use list_move_tail() and
    list_del_init()

Reported-by: Bas Nieuwenhuizen <[email protected]>
Fixes: d3c6dd1 ("dma-buf/sw_sync: Synchronize signal vs syncpt free")
Signed-off-by: Rob Clark <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
Reviewed-by: Christian König <[email protected]>
Signed-off-by: Christian König <[email protected]>
Signed-off-by: Sasha Levin <[email protected]>
[ Upstream commit ab4109f ]

If a GPIO simulator device is unbound with interrupts still requested,
we will hit a use-after-free issue in __irq_domain_deactivate_irq(). The
owner of the irq domain must dispose of all mappings before destroying
the domain object.

Fixes: cb8c474 ("gpio: sim: new testing module")
Signed-off-by: Bartosz Golaszewski <[email protected]>
Reviewed-by: Andy Shevchenko <[email protected]>
Signed-off-by: Sasha Levin <[email protected]>
[ Upstream commit 6e39c1a ]

Associate the swnode of the GPIO device's (which is the interrupt
controller here) with the irq domain. Otherwise the interrupt-controller
device attribute is a no-op.

Fixes: cb8c474 ("gpio: sim: new testing module")
Signed-off-by: Bartosz Golaszewski <[email protected]>
Reviewed-by: Andy Shevchenko <[email protected]>
Signed-off-by: Sasha Levin <[email protected]>
[ Upstream commit c008323 ]

Lenovo 82SJ doesn't have DMIC connected like 82V2 does.  Narrow
the match down to only cover 82V2.

Reported-by: [email protected]
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=217063
Fixes: 2232b2d ("ASoC: amd: yc: Add Lenovo Yoga Slim 7 Pro X to quirks table")
Signed-off-by: Mario Limonciello <[email protected]
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Mark Brown <[email protected]
Signed-off-by: Sasha Levin <[email protected]>
[ Upstream commit cfeb6ae ]

The current implementation of append may cause duplicate data and/or
incorrect ranges to be returned to a reader during an update.  Although
this has not been reported or seen, disable the append write operation
while the tree is in rcu mode out of an abundance of caution.

During the analysis of the mas_next_slot() the following was
artificially created by separating the writer and reader code:

Writer:                                 reader:
mas_wr_append
    set end pivot
    updates end metata
    Detects write to last slot
    last slot write is to start of slot
    store current contents in slot
    overwrite old end pivot
                                        mas_next_slot():
                                                read end metadata
                                                read old end pivot
                                                return with incorrect range
    store new value

Alternatively:

Writer:                                 reader:
mas_wr_append
    set end pivot
    updates end metata
    Detects write to last slot
    last lost write to end of slot
    store value
                                        mas_next_slot():
                                                read end metadata
                                                read old end pivot
                                                read new end pivot
                                                return with incorrect range
    set old end pivot

There may be other accesses that are not safe since we are now updating
both metadata and pointers, so disabling append if there could be rcu
readers is the safest action.

Link: https://lkml.kernel.org/r/[email protected]
Fixes: 54a611b ("Maple Tree: add new data structure")
Signed-off-by: Liam R. Howlett <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Sasha Levin <[email protected]>
commit fd0a7ec upstream.

The vangogh driver just gained a link time dependency that now causes
randconfig builds to fail:

x86_64-linux-ld: sound/soc/amd/vangogh/pci-acp5x.o: in function `snd_acp5x_probe':
pci-acp5x.c:(.text+0xbb): undefined reference to `snd_amd_acp_find_config'

Fixes: e89f45e ("ASoC: amd: vangogh: Add check for acp config flags in vangogh platform")
Signed-off-by: Arnd Bergmann <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Mark Brown <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Tested-by: Conor Dooley <[email protected]>
Tested-by: Linux Kernel Functional Testing <[email protected]>
Tested-by: Salvatore Bonaccorso <[email protected]>
Tested-by: Joel Fernandes (Google) <[email protected]>
Tested-by: SeongJae Park <[email protected]>
Tested-by: Bagas Sanjaya <[email protected]>
Tested-by: Sudip Mukherjee <[email protected]>
Tested-by: Takeshi Ogasawara <[email protected]>
Tested-by: Shuah Khan <[email protected]>
Tested-by: Florian Fainelli <[email protected]>
Tested-by: Guenter Roeck <[email protected]>
Tested-by: Jon Hunter <[email protected]>
Tested-by: Pavel Machek (CIP) <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
This is the 6.1.50 stable release

Signed-off-by: Umer Saleem <[email protected]>
@bugclerk
Copy link

@amotin amotin merged commit 6752e97 into truenas/linux-6.1 Sep 14, 2023
6 checks passed
@amotin amotin deleted the nwrelmrg branch September 14, 2023 19:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.