Questioning our use of cgroups #439

Closed
eroussy opened this issue Mar 20, 2024 · 17 comments
Labels
Debian · enhancement (New feature or request) · question (Further information is requested) · Yocto

Comments

@eroussy
Member

eroussy commented Mar 20, 2024

Context
There are currently two ways to handle the CPUs a VM has access to:

  • Using the isolated VM feature in the inventory: this feature pins the KVM threads running the VM's vCPUs on the CPUs described in the cpuset list.
    It only pins the KVM threads, not the qemu thread responsible for managing the VM.
  • Putting the VM in the machine-rt or machine-nort slice: these cgroups are configured during the Ansible setup, with their allowed CPUs defined in the cpumachinesrt and cpumachinesnort Ansible variables.
    Both the KVM and qemu threads of the VM then execute on the allowed CPUs (see the illustration below).
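
For illustration (the exact inventory layout may differ), the slice path is driven by Ansible variables such as:

cpumachinesrt: "4-7"      # becomes AllowedCPUs=4-7 in machine-rt.slice
cpumachinesnort: "2-3"    # becomes AllowedCPUs=2-3 in machine-nort.slice

while the isolated VM feature translates into per-VM vcpupin entries in the libvirt XML (affecting only the KVM vCPU threads, as noted above). The CPU lists here are illustrative.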

These two configurations have the same purpose but not the same philosophy: they duplicate a feature and do not interact easily with each other.
Moreover, the cgroup configuration currently exists only on Debian.
We have to clarify which isolation feature we want in SEAPATH.

Concerns regarding cgroups
I see two problems with these cgroups today:

The second point can cause a problem; for example:

  • Give two CPUs to machine-rt.slice.
  • Deploy an RT VM with two vCPUs.
    In that case, the two RT KVM threads prevent qemu from executing, and the VM never boots.
    We have given the slice exactly the number of CPUs we wanted (here: 2), yet it does not work; we would have to give machine-rt.slice 3 CPUs to make it work. (A minimal sketch of this failing configuration follows below.)
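
A minimal sketch of that failing configuration, with illustrative CPU numbers. The slice (machine-rt.slice) is restricted to exactly two CPUs:

[Slice]
AllowedCPUs=4-5

and the VM pins both of its SCHED_FIFO vCPUs on those same two CPUs:

  <cputune>
    <vcpupin vcpu='0' cpuset='4'/>
    <vcpupin vcpu='1' cpuset='5'/>
    <vcpusched vcpus='0' scheduler='fifo' priority='2'/>
    <vcpusched vcpus='1' scheduler='fifo' priority='2'/>
  </cputune>

The qemu emulator threads live in the same slice but are not pinned: with both allowed CPUs occupied by FIFO vCPU threads, they never get scheduled and the guest hangs at boot.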

Isolation of non-RT VMs
The use of the machine-nort cgroup allows isolating threads of non-RT VMs.
Is it relevant to isolate them if we do not have special RT needs? Wouldn't it be better to let the Linux scheduler handle these VMs on the system's CPUs?

We now need to choose the isolation method we want and use it on both the Debian and Yocto versions.
I leave this question open, feel free to add your remarks below.

@eroussy
Member Author

eroussy commented Mar 20, 2024

Another question is ease of use.
I understand the idea of setting the allowed CPUs during the Ansible deployment and not having to think about it later, but I actually find it more confusing than the "isolated" VM feature.

@ebail
Member

ebail commented Mar 20, 2024

I think that in any case we need a common way to configure RT capabilities for VMs.
A VM should be deployable on both Debian and Yocto without any change.
@insatomcat @dupremathieu could you please share your opinion ?

Best,

@insatomcat
Member

insatomcat commented Mar 20, 2024

Slices/cgroups (cpuset, actually) are the recommended way to do CPU isolation (isolcpus is deprecated), so I think it's nice that SEAPATH already proposes something with cpusets.
The vcpupin feature of libvirt is complementary.
If you really want to isolate a core and dedicate it to a vCPU for low-latency purposes, I think you need both:
vcpupin ensures all the work of a vCPU is done by a specific physical core, while slices (libvirt partitions) make sure no other workload lands on that physical core. The only other way is to use isolcpus, but as mentioned before it is supposed to be deprecated.
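
To make the combination concrete, here is a hedged sketch of how the two could fit together in a domain XML, assuming the usual libvirt/systemd mapping where the resource partition /machine/rt corresponds to machine-rt.slice (CPU number and priority illustrative):

  <domain type='kvm'>
    ...
    <resource>
      <partition>/machine/rt</partition>   <!-- run the whole guest inside machine-rt.slice -->
    </resource>
    <cputune>
      <vcpupin vcpu='0' cpuset='2'/>       <!-- must lie inside the slice's AllowedCPUs -->
      <vcpusched vcpus='0' scheduler='fifo' priority='2'/>
    </cputune>
  </domain>

The slice bounds where every thread of the guest may run; vcpupin then nails each vCPU to one core inside those bounds.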

Anyway, all those configurations are optional, so in the end I feel we can't really choose (because a seapath user may need both), but it's not really an issue because we don't have to.

@dupremathieu dupremathieu added the question Further information is requested label Mar 20, 2024
@dupremathieu
Member

I think you have missed the issue, @insatomcat.
All processes spawned by libvirt for the virtualization will be in the same cgroup and will share the same cpuset.
Here is an example of processes spawned by libvirtd:

qemu-system-x86
qemu-system-x86
log
msgr-worker-0
msgr-worker-1
msgr-worker-2
service
io_context_pool
io_context_pool
ceph_timer
ms_dispatch
ms_local
safe_timer
safe_timer
safe_timer
safe_timer
taskfin_librbd
vhost-16369
IOmon_iothread
CPU0/KVM
kvm
kvm-nx-lpage-recovery-16369
kvm-pit/16369

We can pin some of these processes by tweaking the libvirt XML (vcpupin, emulatorpin and iothreadpin), but we can only pin them inside the cgroup cpuset, and not all of them can be pinned. The unpinned processes are free to run on any CPU inside the cpuset, even the pinned ones.

It is usually not an issue, but if you have RT KVM tasks pinned on all the available CPUs, all the other non-RT tasks will never be scheduled and the VM will never boot.

So to avoid this in our implementation, we have to reserve an extra CPU core only for these processes.

There are two ways to solve that: either remove all cpusets and use the isolcpus domain flag, or keep the VM in the machine slice, remove vcpupin from the XML, and create a qemu hook that moves the KVM threads into the machine-rt slice and applies pinning and RT priority. (A rough sketch of such a hook follows below.)
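
The hook part could look roughly like this hypothetical sketch (pinning and RT priority only; moving individual threads into machine-rt.slice is left out, since with cgroup v2 that additionally requires a threaded cpuset subtree). The CPU list and priority are illustrative, and this is not SEAPATH code:

#!/bin/sh
# Hypothetical /etc/libvirt/hooks/qemu sketch.
# libvirt calls this hook as: <guest_name> <operation> <sub-operation> ...
GUEST="$1"
OPERATION="$2"

RT_CPUS="5 6"   # one host core per vCPU to isolate (illustrative)
RT_PRIO=2       # SCHED_FIFO priority to apply (illustrative)

if [ "$OPERATION" = "started" ]; then
    # qemu is started by libvirt with "-name guest=<name>,...", so find its PID that way.
    PID=$(pgrep -f "guest=${GUEST}," | head -n 1)
    [ -n "$PID" ] || exit 0

    i=1
    # vCPU threads are named "CPU <n>/KVM" (see the ps listings in this thread).
    ps -T -p "$PID" -o tid=,comm= | grep 'CPU .*/KVM' | while read -r TID _; do
        CPU=$(echo "$RT_CPUS" | cut -d' ' -f"$i")
        taskset -cp "$CPU" "$TID"     # pin this vCPU thread to one core
        chrt -f -p "$RT_PRIO" "$TID"  # and give it SCHED_FIFO priority
        i=$((i + 1))
    done
fi
exit 0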

Regarding isolcpus being deprecated: it is just the recommended approach that has changed. I don't know whether the PREEMPT_RT patch modifies anything in this area.

@eroussy, if you do not want to use cpusets, just do not set them in the Ansible inventory and add the isolcpus domain kernel parameter instead (example below).
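
For reference, that isolcpus variant is just a kernel command-line change, e.g. on Debian (CPU list illustrative):

# /etc/default/grub
GRUB_CMDLINE_LINUX="... isolcpus=domain,managed_irq,4-7"

followed by update-grub and a reboot. The domain flag removes the listed CPUs from the scheduler's balancing domains.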

@insatomcat
Member

insatomcat commented Mar 20, 2024

All processes spawned by libvirt for the virtualization will be in the same cgroup and will share the same cpuset.

I do not notice this on my setup. Of course I use isolcpus, since this is still something SEAPATH does.
Is your setup running only the slice isolation, with no isolcpus?
In our example inventory, cpumachinesrt is the same as isolcpus.

I have an RT VM with 2 vCPUs:

# virsh dumpxml debian | grep cpu
  <vcpu placement='static'>2</vcpu>
  <cputune>
    <vcpupin vcpu='0' cpuset='2'/>
    <vcpupin vcpu='1' cpuset='14'/>
    <emulatorpin cpuset='4'/>
    <vcpusched vcpus='0' scheduler='fifo' priority='2'/>
    <vcpusched vcpus='1' scheduler='fifo' priority='2'/>
  </cputune>
  <cpu mode='custom' match='exact' check='full'>
  </cpu>

And if I ignore the core used for "emulation", this is what I see on the 2 dedicated cores:

# ps -eT -o comm,cmd,pid,tid,rtprio,policy,psr |  grep " 2$"
cpuhp/2         [cpuhp/2]                        37      37      - TS    2
idle_inject/2   [idle_inject/2]                  38      38     50 FF    2
irq_work/2      [irq_work/2]                     39      39      1 FF    2
migration/2     [migration/2]                    40      40     99 FF    2
rcuc/2          [rcuc/2]                         41      41     10 FF    2
ktimers/2       [ktimers/2]                      42      42      1 FF    2
ksoftirqd/2     [ksoftirqd/2]                    43      43      - TS    2
kworker/2:0-eve [kworker/2:0-events]             44      44      - TS    2
irq/125-PCIe PM [irq/125-PCIe PME]              325     325     50 FF    2
kworker/2:1     [kworker/2:1]                   334     334      - TS    2
irq/151-megasas [irq/151-megasas0-msix3]        488     488     50 FF    2
CPU 0/KVM       /usr/bin/qemu-system-x86_64    4739    4775      2 FF    2

# ps -eT -o comm,cmd,pid,tid,rtprio,policy,psr |  grep " 14$"
cpuhp/14        [cpuhp/14]                      160     160      - TS   14
idle_inject/14  [idle_inject/14]                161     161     50 FF   14
irq_work/14     [irq_work/14]                   162     162      1 FF   14
migration/14    [migration/14]                  163     163     99 FF   14
rcuc/14         [rcuc/14]                       164     164     10 FF   14
ktimers/14      [ktimers/14]                    165     165      1 FF   14
ksoftirqd/14    [ksoftirqd/14]                  166     166      - TS   14
kworker/14:0-ev [kworker/14:0-events]           167     167      - TS   14
kworker/14:1    [kworker/14:1]                  339     339      - TS   14
irq/163-megasas [irq/163-megasas0-msix15]       500     500     50 FF   14
irq/194-eno8403 [irq/194-eno8403-tx-0]         3226    3226     50 FF   14
CPU 1/KVM       /usr/bin/qemu-system-x86_64    4739    4777      2 FF   14

Basically nothing besides the vCPU threads and the bound kthreads...
So really I don't know what those unpinned libvirt processes are about.

@eroussy
Member Author

eroussy commented Mar 21, 2024

Basically nothing besides the vCPU threads and the bound kthreads...
So really I don't know what those unpinned libvirt processes are about.

Here you are only looking at the processes running exactly on the two cores you chose for the RT VM.
You have to look at all the cores in the machine-rt.slice allowed CPUs.

For example, on my setup the machine-rt slice allowed CPUs are 4-7:

root@seapath:/home/virtu# cat /etc/systemd/system/machine-rt.slice | grep AllowedCPUs
AllowedCPUs=4-7

And the processes on cores 4 to 7 (only part of the output is shown):

root@seapath:/home/virtu# ps -eT -o comm,cmd,pid,tid,rtprio,policy,psr |  grep " [4-7]$"
[...]
qemu-system-x86 /usr/bin/qemu-system-x86_64  158603  158603      - TS    4
call_rcu        /usr/bin/qemu-system-x86_64  158603  158607      - TS    4
log             /usr/bin/qemu-system-x86_64  158603  158608      - TS    4
msgr-worker-0   /usr/bin/qemu-system-x86_64  158603  158609      - TS    4
msgr-worker-1   /usr/bin/qemu-system-x86_64  158603  158610      - TS    4
msgr-worker-2   /usr/bin/qemu-system-x86_64  158603  158611      - TS    4
service         /usr/bin/qemu-system-x86_64  158603  158615      - TS    4
io_context_pool /usr/bin/qemu-system-x86_64  158603  158616      - TS    4
io_context_pool /usr/bin/qemu-system-x86_64  158603  158617      - TS    4
ceph_timer      /usr/bin/qemu-system-x86_64  158603  158618      - TS    4
ms_dispatch     /usr/bin/qemu-system-x86_64  158603  158619      - TS    4
ms_local        /usr/bin/qemu-system-x86_64  158603  158620      - TS    4
safe_timer      /usr/bin/qemu-system-x86_64  158603  158621      - TS    4
safe_timer      /usr/bin/qemu-system-x86_64  158603  158622      - TS    4
safe_timer      /usr/bin/qemu-system-x86_64  158603  158623      - TS    4
safe_timer      /usr/bin/qemu-system-x86_64  158603  158624      - TS    4
taskfin_librbd  /usr/bin/qemu-system-x86_64  158603  158625      - TS    4
vhost-158603    /usr/bin/qemu-system-x86_64  158603  158648      - TS    4
vhost-158603    /usr/bin/qemu-system-x86_64  158603  158649      - TS    4
IO mon_iothread /usr/bin/qemu-system-x86_64  158603  158650      - TS    4
CPU 0/KVM       /usr/bin/qemu-system-x86_64  158603  158651      1 FF    5
CPU 1/KVM       /usr/bin/qemu-system-x86_64  158603  158652      1 FF    6
kvm             [kvm]                        158626  158626      - TS    4
kvm-nx-lpage-re [kvm-nx-lpage-recovery-1586  158627  158627      - TS    4
kvm-pit/158603  [kvm-pit/158603]             158654  158654      - TS    4
kworker/4:0-kdm [kworker/4:0-kdmflush/254:0 2099770 2099770      - TS    4

The questions are: should all these threads run on these CPUs? And if so, how can we run them on CPUs other than CPU 4?

(These questions are also related to issue #438.)

@dupremathieu
Member

@insatomcat, you didn't notice it because we have a large cpuset range.

Reduce your machine-rt cpuset to a number of CPU cores equal to the number of your virtual CPUs.

@insatomcat
Member

I have a slice with cpuset 2-6,14-18 and a guest with 4 vCPUs:

# cat /etc/systemd/system/machine-rt.slice
[Unit]
Description=VM rt slice
Before=slices.target
Wants=machine.slice

[Slice]
AllowedCPUs=2-6,14-18

# virsh dumpxml XXX | grep cpu
  <vcpu placement='static'>4</vcpu>
  <cputune>
    <vcpupin vcpu='0' cpuset='5'/>
    <vcpupin vcpu='1' cpuset='6'/>
    <vcpupin vcpu='2' cpuset='17'/>
    <vcpupin vcpu='3' cpuset='18'/>
    <emulatorpin cpuset='16'/>
    <vcpusched vcpus='0' scheduler='fifo' priority='5'/>
    <vcpusched vcpus='1' scheduler='fifo' priority='5'/>
    <vcpusched vcpus='2' scheduler='fifo' priority='5'/>
    <vcpusched vcpus='3' scheduler='fifo' priority='5'/>
  </cputune>

If I look at all the cores, I do see the processes you are talking about, but they are all on core 16, which is the core chosen by emulatorpin:

# for i in 2 3 4 5 6 14 15 16 17 18; do ps -eT -o comm,cmd,pid,tid,rtprio,policy,psr | grep " $i$"; done
cpuhp/2         [cpuhp/2]                        38      38      - TS    2
idle_inject/2   [idle_inject/2]                  39      39     50 FF    2
irq_work/2      [irq_work/2]                     40      40      1 FF    2
migration/2     [migration/2]                    41      41     99 FF    2
rcuc/2          [rcuc/2]                         42      42     10 FF    2
ktimers/2       [ktimers/2]                      43      43      1 FF    2
ksoftirqd/2     [ksoftirqd/2]                    44      44      - TS    2
kworker/2:0-eve [kworker/2:0-events]             45      45      - TS    2
irq/125-PCIe PM [irq/125-PCIe PME]              326     326     50 FF    2
kworker/2:1     [kworker/2:1]                   335     335      - TS    2
irq/134-megasas [irq/134-megasas0-msix3]        491     491     50 FF    2
irq/281-iavf-en [irq/281-iavf-ens1f1v1-TxRx    3485    3485     50 FF    2
irq/247-i40e-en [irq/247-i40e-ens1f1-TxRx-1    3810    3810     50 FF    2
irq/174-i40e-en [irq/174-i40e-ens1f0-TxRx-8    3963    3963     50 FF    2
irq/264-vfio-ms [irq/264-vfio-msix[0](0000:    7314    7314     50 FF    2
cpuhp/3         [cpuhp/3]                        48      48      - TS    3
idle_inject/3   [idle_inject/3]                  49      49     50 FF    3
irq_work/3      [irq_work/3]                     50      50      1 FF    3
migration/3     [migration/3]                    51      51     99 FF    3
rcuc/3          [rcuc/3]                         52      52     10 FF    3
ktimers/3       [ktimers/3]                      53      53      1 FF    3
ksoftirqd/3     [ksoftirqd/3]                    54      54      - TS    3
kworker/3:0-eve [kworker/3:0-events]             55      55      - TS    3
irq/126-PCIe PM [irq/126-PCIe PME]              327     327     50 FF    3
kworker/3:1     [kworker/3:1]                   336     336      - TS    3
irq/135-megasas [irq/135-megasas0-msix4]        492     492     50 FF    3
irq/282-iavf-en [irq/282-iavf-ens1f1v1-TxRx    3486    3486     50 FF    3
irq/248-i40e-en [irq/248-i40e-ens1f1-TxRx-1    3811    3811     50 FF    3
irq/175-i40e-en [irq/175-i40e-ens1f0-TxRx-9    3964    3964     50 FF    3
irq/266-vfio-ms [irq/266-vfio-msix[1](0000:    7293    7293     50 FF    3
cpuhp/4         [cpuhp/4]                        58      58      - TS    4
idle_inject/4   [idle_inject/4]                  59      59     50 FF    4
irq_work/4      [irq_work/4]                     60      60      1 FF    4
migration/4     [migration/4]                    61      61     99 FF    4
rcuc/4          [rcuc/4]                         62      62     10 FF    4
ktimers/4       [ktimers/4]                      63      63      1 FF    4
ksoftirqd/4     [ksoftirqd/4]                    64      64      - TS    4
kworker/4:0-eve [kworker/4:0-events]             65      65      - TS    4
irq/127-PCIe PM [irq/127-PCIe PME]              328     328     50 FF    4
kworker/4:1     [kworker/4:1]                   337     337      - TS    4
irq/136-megasas [irq/136-megasas0-msix5]        493     493     50 FF    4
irq/249-i40e-en [irq/249-i40e-ens1f1-TxRx-1    3812    3812     50 FF    4
irq/176-i40e-en [irq/176-i40e-ens1f0-TxRx-1    3965    3965     50 FF    4
irq/268-vfio-ms [irq/268-vfio-msix[2](0000:    7294    7294     50 FF    4
cpuhp/5         [cpuhp/5]                        68      68      - TS    5
idle_inject/5   [idle_inject/5]                  69      69     50 FF    5
irq_work/5      [irq_work/5]                     70      70      1 FF    5
migration/5     [migration/5]                    71      71     99 FF    5
rcuc/5          [rcuc/5]                         72      72     10 FF    5
ktimers/5       [ktimers/5]                      73      73      1 FF    5
ksoftirqd/5     [ksoftirqd/5]                    74      74      - TS    5
kworker/5:0-eve [kworker/5:0-events]             75      75      - TS    5
irq/128-PCIe PM [irq/128-PCIe PME]              329     329     50 FF    5
kworker/5:1     [kworker/5:1]                   338     338      - TS    5
irq/137-megasas [irq/137-megasas0-msix6]        494     494     50 FF    5
irq/250-i40e-en [irq/250-i40e-ens1f1-TxRx-2    3813    3813     50 FF    5
irq/177-i40e-en [irq/177-i40e-ens1f0-TxRx-1    3966    3966     50 FF    5
CPU 0/KVM       /usr/bin/qemu-system-x86_64    5899    5954      5 FF    5
irq/270-vfio-ms [irq/270-vfio-msix[3](0000:    7295    7295     50 FF    5
cpuhp/6         [cpuhp/6]                        78      78      - TS    6
idle_inject/6   [idle_inject/6]                  79      79     50 FF    6
irq_work/6      [irq_work/6]                     80      80      1 FF    6
migration/6     [migration/6]                    81      81     99 FF    6
rcuc/6          [rcuc/6]                         82      82     10 FF    6
ktimers/6       [ktimers/6]                      83      83      1 FF    6
ksoftirqd/6     [ksoftirqd/6]                    84      84      - TS    6
kworker/6:0-eve [kworker/6:0-events]             85      85      - TS    6
kworker/6:1-mm_ [kworker/6:1-mm_percpu_wq]      316     316      - TS    6
irq/129-PCIe PM [irq/129-PCIe PME]              330     330     50 FF    6
irq/138-megasas [irq/138-megasas0-msix7]        495     495     50 FF    6
irq/251-i40e-en [irq/251-i40e-ens1f1-TxRx-2    3814    3814     50 FF    6
irq/178-i40e-en [irq/178-i40e-ens1f0-TxRx-1    3967    3967     50 FF    6
CPU 1/KVM       /usr/bin/qemu-system-x86_64    5899    5956      5 FF    6
irq/273-vfio-ms [irq/273-vfio-msix[4](0000:    7298    7298     50 FF    6
cpuhp/14        [cpuhp/14]                      161     161      - TS   14
idle_inject/14  [idle_inject/14]                162     162     50 FF   14
irq_work/14     [irq_work/14]                   163     163      1 FF   14
migration/14    [migration/14]                  164     164     99 FF   14
rcuc/14         [rcuc/14]                       165     165     10 FF   14
ktimers/14      [ktimers/14]                    166     166      1 FF   14
ksoftirqd/14    [ksoftirqd/14]                  167     167      - TS   14
kworker/14:0-ev [kworker/14:0-events]           168     168      - TS   14
kworker/14:1    [kworker/14:1]                  339     339      - TS   14
irq/210-ahci[00 [irq/210-ahci[0000:00:17.0]     471     471     50 FF   14
irq/147-megasas [irq/147-megasas0-msix15]       503     503     50 FF   14
irq/236-i40e-en [irq/236-i40e-ens1f1-TxRx-6    3798    3798     50 FF   14
irq/186-i40e-en [irq/186-i40e-ens1f0-TxRx-2    3975    3975     50 FF   14
cpuhp/15        [cpuhp/15]                      171     171      - TS   15
idle_inject/15  [idle_inject/15]                172     172     50 FF   15
irq_work/15     [irq_work/15]                   173     173      1 FF   15
migration/15    [migration/15]                  174     174     99 FF   15
rcuc/15         [rcuc/15]                       175     175     10 FF   15
ktimers/15      [ktimers/15]                    176     176      1 FF   15
ksoftirqd/15    [ksoftirqd/15]                  177     177      - TS   15
kworker/15:0-ev [kworker/15:0-events]           178     178      - TS   15
kworker/15:1    [kworker/15:1]                  340     340      - TS   15
irq/146-megasas [irq/146-megasas0-msix16]       504     504     50 FF   15
irq/237-i40e-en [irq/237-i40e-ens1f1-TxRx-7    3799    3799     50 FF   15
irq/187-i40e-en [irq/187-i40e-ens1f0-TxRx-2    3976    3976     50 FF   15
cpuhp/16        [cpuhp/16]                      181     181      - TS   16
idle_inject/16  [idle_inject/16]                182     182     50 FF   16
irq_work/16     [irq_work/16]                   183     183      1 FF   16
migration/16    [migration/16]                  184     184     99 FF   16
rcuc/16         [rcuc/16]                       185     185     10 FF   16
ktimers/16      [ktimers/16]                    186     186      1 FF   16
ksoftirqd/16    [ksoftirqd/16]                  187     187      - TS   16
kworker/16:0-ev [kworker/16:0-events]           188     188      - TS   16
kworker/16:0H-e [kworker/16:0H-events_highp     189     189      - TS   16
kworker/16:1    [kworker/16:1]                  341     341      - TS   16
irq/148-megasas [irq/148-megasas0-msix17]       505     505     50 FF   16
irq/254-i40e-00 [irq/254-i40e-0000:18:00.1:    1597    1597     50 FF   16
irq/238-i40e-en [irq/238-i40e-ens1f1-TxRx-8    3800    3800     50 FF   16
irq/188-i40e-en [irq/188-i40e-ens1f0-TxRx-2    3977    3977     50 FF   16
qemu-system-x86 /usr/bin/qemu-system-x86_64    5899    5899      - TS   16
qemu-system-x86 /usr/bin/qemu-system-x86_64    5899    5907      - TS   16
log             /usr/bin/qemu-system-x86_64    5899    5927      - TS   16
msgr-worker-0   /usr/bin/qemu-system-x86_64    5899    5928      - TS   16
msgr-worker-1   /usr/bin/qemu-system-x86_64    5899    5929      - TS   16
msgr-worker-2   /usr/bin/qemu-system-x86_64    5899    5930      - TS   16
service         /usr/bin/qemu-system-x86_64    5899    5934      - TS   16
io_context_pool /usr/bin/qemu-system-x86_64    5899    5935      - TS   16
io_context_pool /usr/bin/qemu-system-x86_64    5899    5936      - TS   16
ceph_timer      /usr/bin/qemu-system-x86_64    5899    5937      - TS   16
ms_dispatch     /usr/bin/qemu-system-x86_64    5899    5938      - TS   16
ms_local        /usr/bin/qemu-system-x86_64    5899    5939      - TS   16
safe_timer      /usr/bin/qemu-system-x86_64    5899    5940      - TS   16
safe_timer      /usr/bin/qemu-system-x86_64    5899    5941      - TS   16
safe_timer      /usr/bin/qemu-system-x86_64    5899    5942      - TS   16
safe_timer      /usr/bin/qemu-system-x86_64    5899    5943      - TS   16
taskfin_librbd  /usr/bin/qemu-system-x86_64    5899    5944      - TS   16
vhost-5899      /usr/bin/qemu-system-x86_64    5899    5952      - TS   16
IO mon_iothread /usr/bin/qemu-system-x86_64    5899    5953      - TS   16
SPICE Worker    /usr/bin/qemu-system-x86_64    5899    5982      - TS   16
vhost-5899      /usr/bin/qemu-system-x86_64    5899    6003      - TS   16
kworker/16:1H-k [kworker/16:1H-kblockd]        5906    5906      - TS   16
kvm-nx-lpage-re [kvm-nx-lpage-recovery-5899    5945    5945      - TS   16
cpuhp/17        [cpuhp/17]                      191     191      - TS   17
idle_inject/17  [idle_inject/17]                192     192     50 FF   17
irq_work/17     [irq_work/17]                   193     193      1 FF   17
migration/17    [migration/17]                  194     194     99 FF   17
rcuc/17         [rcuc/17]                       195     195     10 FF   17
ktimers/17      [ktimers/17]                    196     196      1 FF   17
ksoftirqd/17    [ksoftirqd/17]                  197     197      - TS   17
kworker/17:0-ev [kworker/17:0-events]           198     198      - TS   17
kworker/17:1    [kworker/17:1]                  342     342      - TS   17
irq/149-megasas [irq/149-megasas0-msix18]       506     506     50 FF   17
irq/239-i40e-en [irq/239-i40e-ens1f1-TxRx-9    3801    3801     50 FF   17
irq/166-i40e-en [irq/166-i40e-ens1f0-TxRx-0    3955    3955     50 FF   17
irq/189-i40e-en [irq/189-i40e-ens1f0-TxRx-2    3978    3978     50 FF   17
CPU 2/KVM       /usr/bin/qemu-system-x86_64    5899    5957      5 FF   17
cpuhp/18        [cpuhp/18]                      201     201      - TS   18
idle_inject/18  [idle_inject/18]                202     202     50 FF   18
irq_work/18     [irq_work/18]                   203     203      1 FF   18
migration/18    [migration/18]                  204     204     99 FF   18
rcuc/18         [rcuc/18]                       205     205     10 FF   18
ktimers/18      [ktimers/18]                    206     206      1 FF   18
ksoftirqd/18    [ksoftirqd/18]                  207     207      - TS   18
kworker/18:0-ev [kworker/18:0-events]           208     208      - TS   18
kworker/18:1    [kworker/18:1]                  343     343      - TS   18
irq/151-megasas [irq/151-megasas0-msix19]       507     507     50 FF   18
irq/240-i40e-en [irq/240-i40e-ens1f1-TxRx-1    3802    3802     50 FF   18
irq/167-i40e-en [irq/167-i40e-ens1f0-TxRx-1    3956    3956     50 FF   18
CPU 3/KVM       /usr/bin/qemu-system-x86_64    5899    5958      5 FF   18

What's the emulatorpin setting in your setup?

@dupremathieu
Member

What we want for emulatorpin is to use the non-RT cpuset (0-1,7-13,19-N in your case) to avoid reserving and losing a core for it.

@insatomcat
Member

insatomcat commented Mar 21, 2024

I don't think you can do that while at the same time asking libvirt to use the machine-rt slice for the same guest.

@dupremathieu
Member

It should be possible using a qemu hook, but to me we should just mention it in the documentation.

I suggest indicating in the documentation that:

  • The pinned processes have to be pinned inside the cgroup cpuset.
  • If you use RT-privileged KVM threads, the minimal cpuset size has to be: total number of vCPUs + 1.
  • If that is not what you want, do not define any cgroup cpuset and use the isolcpus domain flag instead.

@insatomcat
Member

If you use RT-privileged KVM threads, the minimal cpuset size has to be: total number of vCPUs + 1

--> total number of "isolated" vCPUs + 1

You don't have to isolate all the vCPUs. You may even need a "non-isolated" vCPU for the housekeeping inside the VM.
So basically, if you need N isolated cores for your real-time workload, your guest may need N+1 cores.
But this "+1" can be shared with other RT guests and with the emulator.
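
A quick worked example with illustrative numbers: a guest with 4 vCPUs, 3 of them isolated for the real-time workload, needs at least 3 dedicated cores plus 1 shared core in the machine-rt cpuset (3 + 1 = 4); the shared core carries the non-isolated vCPU and the emulator threads, and can also be shared with other RT guests.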

@eroussy
Member Author

eroussy commented Mar 22, 2024

I suggest indicating in the documentation that:
* The pinned processes have to be pinned inside the cgroup cpuset.
* If you use RT-privileged KVM threads, the minimal cpuset size has to be: total number of vCPUs + 1.
* If that is not what you want, do not define any cgroup cpuset and use the isolcpus domain flag instead.

I don't think it is a good idea to propose both.
We should choose one isolation method and explain why we chose it.

Personally,

  • I don't like that the emulatorpin cpuset has to be within the machine-rt.slice allowed CPUs: we cannot pin it wherever we want.
  • I don't like that we have to assign number_of_isolated_vcpu + 1 cores to the slice. I find this confusing, and it also requires one extra core, which is difficult on machines with few cores.
  • I also don't like that the allowed CPUs have to be chosen during the Ansible setup. I think they should be chosen for each VM at deployment time.
  • Finally, I think all the emulator threads should be handled by the Linux scheduler on the system's cores.

The only argument I see (for now) in favor of using cgroups is that it helps protect against hardware processor attacks like Meltdown and Spectre.
Do we really want to protect against these types of attacks?
Is there any other argument I'm missing?

@eroussy
Member Author

eroussy commented Mar 26, 2024

Hi all,

We need to close this question.
I discussed it with Mathieu and we concluded that:

  • Cgroups are useful for specific configurations.
  • Cgroups must remain optional.
  • Cgroups are an advanced feature of SEAPATH and must not be presented directly to newcomers.
  • The isolcpus question will soon be handled with tuned, which is better.

So, regarding the work to do:

  • All the systemd slices (system, user, ovs, machine, machine-rt and machine-nonrt) are already optional and must stay optional.
  • Their configuration must be backported to Yocto (and also remain optional).
  • Tuned will be merged to handle isolcpus (see the sketch below).
  • Documentation should be written about the technical questions discussed in this issue (emulatorpin, qemu threads, number_of_isolated_vcpu + 1, etc.).
  • The slice variables (cpusystem, cpuuser, ...) must be removed from the inventory examples and described in the inventories README as an advanced feature.
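
As a hint of what the tuned-based approach could look like (assuming the stock realtime profile from tuned-profiles-realtime; CPU list illustrative):

# /etc/tuned/realtime-variables.conf
isolated_cores=4-7

# then activate the profile
tuned-adm profile realtime

The profile then takes care of isolcpus and the related kernel command-line tuning for those cores.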

@insatomcat @dupremathieu, what do you think of that? Did I miss anything?

@eroussy eroussy moved this from Todo to In Progress in SEAPATH Board Mar 26, 2024
@insatomcat
Member

I'm ok with all that.

@ebail
Member

ebail commented Mar 27, 2024

Great. @eroussy, maybe it is worth documenting this on the LF Energy wiki?

@eroussy
Member Author

eroussy commented Apr 8, 2024

The topic is now covered in this wiki page: https://wiki.lfenergy.org/display/SEAP/Scheduling+and+priorization
Feel free to reopen if you have questions or remarks.

@eroussy eroussy closed this as completed Apr 8, 2024
@github-project-automation github-project-automation bot moved this from In Progress to Done in SEAPATH Board Apr 8, 2024