Questioning our use of cgroups #439

Closed
eroussy opened this issue Mar 20, 2024 · 17 comments
Labels
Debian · enhancement (New feature or request) · question (Further information is requested) · Yocto

Comments

@eroussy
Member

eroussy commented Mar 20, 2024

Context
There are currently two ways to handle the CPUs a VM has access to:

  • Using the isolated VM feature in the inventory: this feature pins the KVM threads running the VM's vCPUs on the CPUs described in the cpuset list.
    It only pins the KVM threads, not the qemu thread responsible for managing the VM.
  • Putting the VM in the machine-rt or machine-nort slice: these cgroups are configured during the Ansible setup, with their allowed CPUs defined in the cpumachinesrt and cpumachinesnort Ansible variables.
    Both the KVM and qemu threads of the VM then execute on the allowed CPUs (see the illustration below).
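
For illustration (the exact inventory layout may differ), the slice path is driven by Ansible variables such as:

cpumachinesrt: "4-7"      # becomes AllowedCPUs=4-7 in machine-rt.slice
cpumachinesnort: "2-3"    # becomes AllowedCPUs=2-3 in machine-nort.slice

while the isolated VM feature translates into per-VM vcpupin entries in the libvirt XML (affecting only the KVM vCPU threads, as noted above). The CPU lists here are illustrative.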

These two configurations have the same purpose but not the same philosophy: they duplicate a feature and do not interact easily with each other.
Moreover, the cgroup configuration currently exists only on Debian.
We have to clarify which isolation feature we want in SEAPATH.

Concerns regarding cgroups
I see two problems with these cgroups today:

The second point can cause a problem; for example:

  • Give two CPUs to machine-rt.slice.
  • Deploy an RT VM with two vCPUs.
    In that case, the two RT KVM threads prevent qemu from executing, and the VM never boots.
    We have given the slice exactly the number of CPUs we wanted (here: 2), yet it does not work; we would have to give machine-rt.slice 3 CPUs to make it work. (A minimal sketch of this failing configuration follows below.)
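
A minimal sketch of that failing configuration, with illustrative CPU numbers. The slice (machine-rt.slice) is restricted to exactly two CPUs:

[Slice]
AllowedCPUs=4-5

and the VM pins both of its SCHED_FIFO vCPUs on those same two CPUs:

  <cputune>
    <vcpupin vcpu='0' cpuset='4'/>
    <vcpupin vcpu='1' cpuset='5'/>
    <vcpusched vcpus='0' scheduler='fifo' priority='2'/>
    <vcpusched vcpus='1' scheduler='fifo' priority='2'/>
  </cputune>

The qemu emulator threads live in the same slice but are not pinned: with both allowed CPUs occupied by FIFO vCPU threads, they never get scheduled and the guest hangs at boot.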

Isolation of non-RT VMs
The use of the machine-nort cgroup allows isolating threads of non-RT VMs.
Is it relevant to isolate them if we do not have special RT needs? Wouldn't it be better to let the Linux scheduler handle these VMs on the system's CPUs?

We now need to choose the isolation method we want and use it on both the Debian and Yocto versions.
I leave this question open, feel free to add your remarks below.

@eroussy
Member Author

eroussy commented Mar 20, 2024

Another question is ease of use.
I understand the idea of setting the allowed CPUs during the Ansible deployment and not having to think about it later, but I actually find it more confusing than the "isolated" VM feature.

@ebail
Member

ebail commented Mar 20, 2024

I think that in any case we need a common way to configure RT capabilities for VMs.
A VM should be deployable on both Debian and Yocto without any change.
@insatomcat @dupremathieu could you please share your opinion ?

Best,

@insatomcat
Member

insatomcat commented Mar 20, 2024

Slices/cgroups (cpuset, actually) are the recommended way to do CPU isolation (isolcpus is deprecated), so I think it's nice that SEAPATH already proposes something with cpusets.
The vcpupin feature of libvirt is complementary.
If you really want to isolate a core and dedicate it to a vCPU for low-latency purposes, I think you need both:
vcpupin ensures all the work of a vCPU is done by a specific physical core, while slices (libvirt partitions) make sure no other workload lands on that physical core. The only other way is to use isolcpus, but as mentioned before it is supposed to be deprecated.
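
To make the combination concrete, here is a hedged sketch of how the two could fit together in a domain XML, assuming the usual libvirt/systemd mapping where the resource partition /machine/rt corresponds to machine-rt.slice (CPU number and priority illustrative):

  <domain type='kvm'>
    ...
    <resource>
      <partition>/machine/rt</partition>   <!-- run the whole guest inside machine-rt.slice -->
    </resource>
    <cputune>
      <vcpupin vcpu='0' cpuset='2'/>       <!-- must lie inside the slice's AllowedCPUs -->
      <vcpusched vcpus='0' scheduler='fifo' priority='2'/>
    </cputune>
  </domain>

The slice bounds where every thread of the guest may run; vcpupin then nails each vCPU to one core inside those bounds.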

Anyway, all those configurations are optional, so in the end I feel we can't really choose (because a seapath user may need both), but it's not really an issue because we don't have to.

@dupremathieu dupremathieu added the question Further information is requested label Mar 20, 2024
@dupremathieu
Member

I think you have missed the issue, @insatomcat.
All processes spawned by libvirt for the virtualization will be in the same cgroup and will share the same cpuset.
Here is an example of processes spawned by libvirtd:

qemu-system-x86
qemu-system-x86
log
msgr-worker-0
msgr-worker-1
msgr-worker-2
service
io_context_pool
io_context_pool
ceph_timer
ms_dispatch
ms_local
safe_timer
safe_timer
safe_timer
safe_timer
taskfin_librbd
vhost-16369
IOmon_iothread
CPU0/KVM
kvm
kvm-nx-lpage-recovery-16369
kvm-pit/16369

We can pin some of these processes by tweaking the libvirt XML (vcpupin, emulatorpin and iothreadpin), but we can only pin them inside the cgroup cpuset, and not all of them can be pinned. The unpinned processes are free to run on any CPU inside the cpuset, even the pinned ones.

It is usually not an issue, but if you have RT KVM tasks pinned on all the available CPUs, all the other non-RT tasks will never be scheduled and the VM will never boot.

So to avoid this in our implementation, we have to reserve an extra CPU core only for these processes.

There are two ways to solve that: either remove all cpusets and use the isolcpus domain flag, or keep the VM in the machine slice, remove vcpupin from the XML, and create a qemu hook that moves the KVM threads into the machine-rt slice and applies pinning and RT priority. (A rough sketch of such a hook follows below.)
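
The hook part could look roughly like this hypothetical sketch (pinning and RT priority only; moving individual threads into machine-rt.slice is left out, since with cgroup v2 that additionally requires a threaded cpuset subtree). The CPU list and priority are illustrative, and this is not SEAPATH code:

#!/bin/sh
# Hypothetical /etc/libvirt/hooks/qemu sketch.
# libvirt calls this hook as: <guest_name> <operation> <sub-operation> ...
GUEST="$1"
OPERATION="$2"

RT_CPUS="5 6"   # one host core per vCPU to isolate (illustrative)
RT_PRIO=2       # SCHED_FIFO priority to apply (illustrative)

if [ "$OPERATION" = "started" ]; then
    # qemu is started by libvirt with "-name guest=<name>,...", so find its PID that way.
    PID=$(pgrep -f "guest=${GUEST}," | head -n 1)
    [ -n "$PID" ] || exit 0

    i=1
    # vCPU threads are named "CPU <n>/KVM" (see the ps listings in this thread).
    ps -T -p "$PID" -o tid=,comm= | grep 'CPU .*/KVM' | while read -r TID _; do
        CPU=$(echo "$RT_CPUS" | cut -d' ' -f"$i")
        taskset -cp "$CPU" "$TID"     # pin this vCPU thread to one core
        chrt -f -p "$RT_PRIO" "$TID"  # and give it SCHED_FIFO priority
        i=$((i + 1))
    done
fi
exit 0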

Regarding isolcpus being deprecated: it is just the recommended approach that has changed. I don't know whether the PREEMPT_RT patch modifies anything in this area.

@eroussy, if you do not want to use cpusets, just do not set them in the Ansible inventory and add the isolcpus domain kernel parameter instead (example below).
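
For reference, that isolcpus variant is just a kernel command-line change, e.g. on Debian (CPU list illustrative):

# /etc/default/grub
GRUB_CMDLINE_LINUX="... isolcpus=domain,managed_irq,4-7"

followed by update-grub and a reboot. The domain flag removes the listed CPUs from the scheduler's balancing domains.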

@insatomcat
Member

insatomcat commented Mar 20, 2024

All processes spawned by libvirt for the virtualization will be in the same cgroup and will share the same cpuset.

I do not notice this on my setup. Of course I use isolcpus, since this is still something SEAPATH does.
Is your setup running only the slice isolation, with no isolcpus?
In our example inventory, cpumachinesrt is the same as isolcpus.

I have an RT VM with 2 vCPUs:

# virsh dumpxml debian | grep cpu
  <vcpu placement='static'>2</vcpu>
  <cputune>
    <vcpupin vcpu='0' cpuset='2'/>
    <vcpupin vcpu='1' cpuset='14'/>
    <emulatorpin cpuset='4'/>
    <vcpusched vcpus='0' scheduler='fifo' priority='2'/>
    <vcpusched vcpus='1' scheduler='fifo' priority='2'/>
  </cputune>
  <cpu mode='custom' match='exact' check='full'>
  </cpu>

And if I ignore the core used for "emulation", this is what I see on the 2 dedicated cores:

# ps -eT -o comm,cmd,pid,tid,rtprio,policy,psr |  grep " 2$"
cpuhp/2         [cpuhp/2]                        37      37      - TS    2
idle_inject/2   [idle_inject/2]                  38      38     50 FF    2
irq_work/2      [irq_work/2]                     39      39      1 FF    2
migration/2     [migration/2]                    40      40     99 FF    2
rcuc/2          [rcuc/2]                         41      41     10 FF    2
ktimers/2       [ktimers/2]                      42      42      1 FF    2
ksoftirqd/2     [ksoftirqd/2]                    43      43      - TS    2
kworker/2:0-eve [kworker/2:0-events]             44      44      - TS    2
irq/125-PCIe PM [irq/125-PCIe PME]              325     325     50 FF    2
kworker/2:1     [kworker/2:1]                   334     334      - TS    2
irq/151-megasas [irq/151-megasas0-msix3]        488     488     50 FF    2
CPU 0/KVM       /usr/bin/qemu-system-x86_64    4739    4775      2 FF    2

# ps -eT -o comm,cmd,pid,tid,rtprio,policy,psr |  grep " 14$"
cpuhp/14        [cpuhp/14]                      160     160      - TS   14
idle_inject/14  [idle_inject/14]                161     161     50 FF   14
irq_work/14     [irq_work/14]                   162     162      1 FF   14
migration/14    [migration/14]                  163     163     99 FF   14
rcuc/14         [rcuc/14]                       164     164     10 FF   14
ktimers/14      [ktimers/14]                    165     165      1 FF   14
ksoftirqd/14    [ksoftirqd/14]                  166     166      - TS   14
kworker/14:0-ev [kworker/14:0-events]           167     167      - TS   14
kworker/14:1    [kworker/14:1]                  339     339      - TS   14
irq/163-megasas [irq/163-megasas0-msix15]       500     500     50 FF   14
irq/194-eno8403 [irq/194-eno8403-tx-0]         3226    3226     50 FF   14
CPU 1/KVM       /usr/bin/qemu-system-x86_64    4739    4777      2 FF   14

Basically nothing besides the vCPU threads and the bound kthreads...
So really I don't know what those unpinned libvirt processes are about.

@eroussy
Member Author

eroussy commented Mar 21, 2024

Basically nothing besides the vCPU threads and the bound kthreads...
So really I don't know what those unpinned libvirt processes are about.

Here you are only looking at the processes running exactly on the two cores you chose for the RT VM.
You have to look at all the cores in the machine-rt.slice allowed CPUs.

For example, on my setup the machine-rt slice allowed CPUs are 4-7:

root@seapath:/home/virtu# cat /etc/systemd/system/machine-rt.slice | grep AllowedCPUs
AllowedCPUs=4-7

And the processes on cores 4 to 7 (only part of the output is shown):

root@seapath:/home/virtu# ps -eT -o comm,cmd,pid,tid,rtprio,policy,psr |  grep " [4-7]$"
[...]
qemu-system-x86 /usr/bin/qemu-system-x86_64  158603  158603      - TS    4
call_rcu        /usr/bin/qemu-system-x86_64  158603  158607      - TS    4
log             /usr/bin/qemu-system-x86_64  158603  158608      - TS    4
msgr-worker-0   /usr/bin/qemu-system-x86_64  158603  158609      - TS    4
msgr-worker-1   /usr/bin/qemu-system-x86_64  158603  158610      - TS    4
msgr-worker-2   /usr/bin/qemu-system-x86_64  158603  158611      - TS    4
service         /usr/bin/qemu-system-x86_64  158603  158615      - TS    4
io_context_pool /usr/bin/qemu-system-x86_64  158603  158616      - TS    4
io_context_pool /usr/bin/qemu-system-x86_64  158603  158617      - TS    4
ceph_timer      /usr/bin/qemu-system-x86_64  158603  158618      - TS    4
ms_dispatch     /usr/bin/qemu-system-x86_64  158603  158619      - TS    4
ms_local        /usr/bin/qemu-system-x86_64  158603  158620      - TS    4
safe_timer      /usr/bin/qemu-system-x86_64  158603  158621      - TS    4
safe_timer      /usr/bin/qemu-system-x86_64  158603  158622      - TS    4
safe_timer      /usr/bin/qemu-system-x86_64  158603  158623      - TS    4
safe_timer      /usr/bin/qemu-system-x86_64  158603  158624      - TS    4
taskfin_librbd  /usr/bin/qemu-system-x86_64  158603  158625      - TS    4
vhost-158603    /usr/bin/qemu-system-x86_64  158603  158648      - TS    4
vhost-158603    /usr/bin/qemu-system-x86_64  158603  158649      - TS    4
IO mon_iothread /usr/bin/qemu-system-x86_64  158603  158650      - TS    4
CPU 0/KVM       /usr/bin/qemu-system-x86_64  158603  158651      1 FF    5
CPU 1/KVM       /usr/bin/qemu-system-x86_64  158603  158652      1 FF    6
kvm             [kvm]                        158626  158626      - TS    4
kvm-nx-lpage-re [kvm-nx-lpage-recovery-1586  158627  158627      - TS    4
kvm-pit/158603  [kvm-pit/158603]             158654  158654      - TS    4
kworker/4:0-kdm [kworker/4:0-kdmflush/254:0 2099770 2099770      - TS    4

The questions are: should all these threads run on these CPUs? And if so, how can we run them on CPUs other than CPU 4?

(These questions are also related to issue #438.)

@dupremathieu
Member

@insatomcat, you didn't notice it because we have a large cpuset range.

Reduce your machine-rt cpuset to a number of CPU cores equal to the number of your virtual CPUs.

@insatomcat
Member

I have a slice with cpuset 2-6,14-18 and a guest with 4 vCPUs:

# cat /etc/systemd/system/machine-rt.slice
[Unit]
Description=VM rt slice
Before=slices.target
Wants=machine.slice

[Slice]
AllowedCPUs=2-6,14-18

# virsh dumpxml XXX | grep cpu
  <vcpu placement='static'>4</vcpu>
  <cputune>
    <vcpupin vcpu='0' cpuset='5'/>
    <vcpupin vcpu='1' cpuset='6'/>
    <vcpupin vcpu='2' cpuset='17'/>
    <vcpupin vcpu='3' cpuset='18'/>
    <emulatorpin cpuset='16'/>
    <vcpusched vcpus='0' scheduler='fifo' priority='5'/>
    <vcpusched vcpus='1' scheduler='fifo' priority='5'/>
    <vcpusched vcpus='2' scheduler='fifo' priority='5'/>
    <vcpusched vcpus='3' scheduler='fifo' priority='5'/>
  </cputune>

If I look at all the cores, I do see the processes you are talking about, but they are all on core 16, which is the core chosen by emulatorpin:

# for i in 2 3 4 5 6 14 15 16 17 18; do ps -eT -o comm,cmd,pid,tid,rtprio,policy,psr | grep " $i$"; done
cpuhp/2         [cpuhp/2]                        38      38      - TS    2
idle_inject/2   [idle_inject/2]                  39      39     50 FF    2
irq_work/2      [irq_work/2]                     40      40      1 FF    2
migration/2     [migration/2]                    41      41     99 FF    2
rcuc/2          [rcuc/2]                         42      42     10 FF    2
ktimers/2       [ktimers/2]                      43      43      1 FF    2
ksoftirqd/2     [ksoftirqd/2]                    44      44      - TS    2
kworker/2:0-eve [kworker/2:0-events]             45      45      - TS    2
irq/125-PCIe PM [irq/125-PCIe PME]              326     326     50 FF    2
kworker/2:1     [kworker/2:1]                   335     335      - TS    2
irq/134-megasas [irq/134-megasas0-msix3]        491     491     50 FF    2
irq/281-iavf-en [irq/281-iavf-ens1f1v1-TxRx    3485    3485     50 FF    2
irq/247-i40e-en [irq/247-i40e-ens1f1-TxRx-1    3810    3810     50 FF    2
irq/174-i40e-en [irq/174-i40e-ens1f0-TxRx-8    3963    3963     50 FF    2
irq/264-vfio-ms [irq/264-vfio-msix[0](0000:    7314    7314     50 FF    2
cpuhp/3         [cpuhp/3]                        48      48      - TS    3
idle_inject/3   [idle_inject/3]                  49      49     50 FF    3
irq_work/3      [irq_work/3]                     50      50      1 FF    3
migration/3     [migration/3]                    51      51     99 FF    3
rcuc/3          [rcuc/3]                         52      52     10 FF    3
ktimers/3       [ktimers/3]                      53      53      1 FF    3
ksoftirqd/3     [ksoftirqd/3]                    54      54      - TS    3
kworker/3:0-eve [kworker/3:0-events]             55      55      - TS    3
irq/126-PCIe PM [irq/126-PCIe PME]              327     327     50 FF    3
kworker/3:1     [kworker/3:1]                   336     336      - TS    3
irq/135-megasas [irq/135-megasas0-msix4]        492     492     50 FF    3
irq/282-iavf-en [irq/282-iavf-ens1f1v1-TxRx    3486    3486     50 FF    3
irq/248-i40e-en [irq/248-i40e-ens1f1-TxRx-1    3811    3811     50 FF    3
irq/175-i40e-en [irq/175-i40e-ens1f0-TxRx-9    3964    3964     50 FF    3
irq/266-vfio-ms [irq/266-vfio-msix[1](0000:    7293    7293     50 FF    3
cpuhp/4         [cpuhp/4]                        58      58      - TS    4
idle_inject/4   [idle_inject/4]                  59      59     50 FF    4
irq_work/4      [irq_work/4]                     60      60      1 FF    4
migration/4     [migration/4]                    61      61     99 FF    4
rcuc/4          [rcuc/4]                         62      62     10 FF    4
ktimers/4       [ktimers/4]                      63      63      1 FF    4
ksoftirqd/4     [ksoftirqd/4]                    64      64      - TS    4
kworker/4:0-eve [kworker/4:0-events]             65      65      - TS    4
irq/127-PCIe PM [irq/127-PCIe PME]              328     328     50 FF    4
kworker/4:1     [kworker/4:1]                   337     337      - TS    4
irq/136-megasas [irq/136-megasas0-msix5]        493     493     50 FF    4
irq/249-i40e-en [irq/249-i40e-ens1f1-TxRx-1    3812    3812     50 FF    4
irq/176-i40e-en [irq/176-i40e-ens1f0-TxRx-1    3965    3965     50 FF    4
irq/268-vfio-ms [irq/268-vfio-msix[2](0000:    7294    7294     50 FF    4
cpuhp/5         [cpuhp/5]                        68      68      - TS    5
idle_inject/5   [idle_inject/5]                  69      69     50 FF    5
irq_work/5      [irq_work/5]                     70      70      1 FF    5
migration/5     [migration/5]                    71      71     99 FF    5
rcuc/5          [rcuc/5]                         72      72     10 FF    5
ktimers/5       [ktimers/5]                      73      73      1 FF    5
ksoftirqd/5     [ksoftirqd/5]                    74      74      - TS    5
kworker/5:0-eve [kworker/5:0-events]             75      75      - TS    5
irq/128-PCIe PM [irq/128-PCIe PME]              329     329     50 FF    5
kworker/5:1     [kworker/5:1]                   338     338      - TS    5
irq/137-megasas [irq/137-megasas0-msix6]        494     494     50 FF    5
irq/250-i40e-en [irq/250-i40e-ens1f1-TxRx-2    3813    3813     50 FF    5
irq/177-i40e-en [irq/177-i40e-ens1f0-TxRx-1    3966    3966     50 FF    5
CPU 0/KVM       /usr/bin/qemu-system-x86_64    5899    5954      5 FF    5
irq/270-vfio-ms [irq/270-vfio-msix[3](0000:    7295    7295     50 FF    5
cpuhp/6         [cpuhp/6]                        78      78      - TS    6
idle_inject/6   [idle_inject/6]                  79      79     50 FF    6
irq_work/6      [irq_work/6]                     80      80      1 FF    6
migration/6     [migration/6]                    81      81     99 FF    6
rcuc/6          [rcuc/6]                         82      82     10 FF    6
ktimers/6       [ktimers/6]                      83      83      1 FF    6
ksoftirqd/6     [ksoftirqd/6]                    84      84      - TS    6
kworker/6:0-eve [kworker/6:0-events]             85      85      - TS    6
kworker/6:1-mm_ [kworker/6:1-mm_percpu_wq]      316     316      - TS    6
irq/129-PCIe PM [irq/129-PCIe PME]              330     330     50 FF    6
irq/138-megasas [irq/138-megasas0-msix7]        495     495     50 FF    6
irq/251-i40e-en [irq/251-i40e-ens1f1-TxRx-2    3814    3814     50 FF    6
irq/178-i40e-en [irq/178-i40e-ens1f0-TxRx-1    3967    3967     50 FF    6
CPU 1/KVM       /usr/bin/qemu-system-x86_64    5899    5956      5 FF    6
irq/273-vfio-ms [irq/273-vfio-msix[4](0000:    7298    7298     50 FF    6
cpuhp/14        [cpuhp/14]                      161     161      - TS   14
idle_inject/14  [idle_inject/14]                162     162     50 FF   14
irq_work/14     [irq_work/14]                   163     163      1 FF   14
migration/14    [migration/14]                  164     164     99 FF   14
rcuc/14         [rcuc/14]                       165     165     10 FF   14
ktimers/14      [ktimers/14]                    166     166      1 FF   14
ksoftirqd/14    [ksoftirqd/14]                  167     167      - TS   14
kworker/14:0-ev [kworker/14:0-events]           168     168      - TS   14
kworker/14:1    [kworker/14:1]                  339     339      - TS   14
irq/210-ahci[00 [irq/210-ahci[0000:00:17.0]     471     471     50 FF   14
irq/147-megasas [irq/147-megasas0-msix15]       503     503     50 FF   14
irq/236-i40e-en [irq/236-i40e-ens1f1-TxRx-6    3798    3798     50 FF   14
irq/186-i40e-en [irq/186-i40e-ens1f0-TxRx-2    3975    3975     50 FF   14
cpuhp/15        [cpuhp/15]                      171     171      - TS   15
idle_inject/15  [idle_inject/15]                172     172     50 FF   15
irq_work/15     [irq_work/15]                   173     173      1 FF   15
migration/15    [migration/15]                  174     174     99 FF   15
rcuc/15         [rcuc/15]                       175     175     10 FF   15
ktimers/15      [ktimers/15]                    176     176      1 FF   15
ksoftirqd/15    [ksoftirqd/15]                  177     177      - TS   15
kworker/15:0-ev [kworker/15:0-events]           178     178      - TS   15
kworker/15:1    [kworker/15:1]                  340     340      - TS   15
irq/146-megasas [irq/146-megasas0-msix16]       504     504     50 FF   15
irq/237-i40e-en [irq/237-i40e-ens1f1-TxRx-7    3799    3799     50 FF   15
irq/187-i40e-en [irq/187-i40e-ens1f0-TxRx-2    3976    3976     50 FF   15
cpuhp/16        [cpuhp/16]                      181     181      - TS   16
idle_inject/16  [idle_inject/16]                182     182     50 FF   16
irq_work/16     [irq_work/16]                   183     183      1 FF   16
migration/16    [migration/16]                  184     184     99 FF   16
rcuc/16         [rcuc/16]                       185     185     10 FF   16
ktimers/16      [ktimers/16]                    186     186      1 FF   16
ksoftirqd/16    [ksoftirqd/16]                  187     187      - TS   16
kworker/16:0-ev [kworker/16:0-events]           188     188      - TS   16
kworker/16:0H-e [kworker/16:0H-events_highp     189     189      - TS   16
kworker/16:1    [kworker/16:1]                  341     341      - TS   16
irq/148-megasas [irq/148-megasas0-msix17]       505     505     50 FF   16
irq/254-i40e-00 [irq/254-i40e-0000:18:00.1:    1597    1597     50 FF   16
irq/238-i40e-en [irq/238-i40e-ens1f1-TxRx-8    3800    3800     50 FF   16
irq/188-i40e-en [irq/188-i40e-ens1f0-TxRx-2    3977    3977     50 FF   16
qemu-system-x86 /usr/bin/qemu-system-x86_64    5899    5899      - TS   16
qemu-system-x86 /usr/bin/qemu-system-x86_64    5899    5907      - TS   16
log             /usr/bin/qemu-system-x86_64    5899    5927      - TS   16
msgr-worker-0   /usr/bin/qemu-system-x86_64    5899    5928      - TS   16
msgr-worker-1   /usr/bin/qemu-system-x86_64    5899    5929      - TS   16
msgr-worker-2   /usr/bin/qemu-system-x86_64    5899    5930      - TS   16
service         /usr/bin/qemu-system-x86_64    5899    5934      - TS   16
io_context_pool /usr/bin/qemu-system-x86_64    5899    5935      - TS   16
io_context_pool /usr/bin/qemu-system-x86_64    5899    5936      - TS   16
ceph_timer      /usr/bin/qemu-system-x86_64    5899    5937      - TS   16
ms_dispatch     /usr/bin/qemu-system-x86_64    5899    5938      - TS   16
ms_local        /usr/bin/qemu-system-x86_64    5899    5939      - TS   16
safe_timer      /usr/bin/qemu-system-x86_64    5899    5940      - TS   16
safe_timer      /usr/bin/qemu-system-x86_64    5899    5941      - TS   16
safe_timer      /usr/bin/qemu-system-x86_64    5899    5942      - TS   16
safe_timer      /usr/bin/qemu-system-x86_64    5899    5943      - TS   16
taskfin_librbd  /usr/bin/qemu-system-x86_64    5899    5944      - TS   16
vhost-5899      /usr/bin/qemu-system-x86_64    5899    5952      - TS   16
IO mon_iothread /usr/bin/qemu-system-x86_64    5899    5953      - TS   16
SPICE Worker    /usr/bin/qemu-system-x86_64    5899    5982      - TS   16
vhost-5899      /usr/bin/qemu-system-x86_64    5899    6003      - TS   16
kworker/16:1H-k [kworker/16:1H-kblockd]        5906    5906      - TS   16
kvm-nx-lpage-re [kvm-nx-lpage-recovery-5899    5945    5945      - TS   16
cpuhp/17        [cpuhp/17]                      191     191      - TS   17
idle_inject/17  [idle_inject/17]                192     192     50 FF   17
irq_work/17     [irq_work/17]                   193     193      1 FF   17
migration/17    [migration/17]                  194     194     99 FF   17
rcuc/17         [rcuc/17]                       195     195     10 FF   17
ktimers/17      [ktimers/17]                    196     196      1 FF   17
ksoftirqd/17    [ksoftirqd/17]                  197     197      - TS   17
kworker/17:0-ev [kworker/17:0-events]           198     198      - TS   17
kworker/17:1    [kworker/17:1]                  342     342      - TS   17
irq/149-megasas [irq/149-megasas0-msix18]       506     506     50 FF   17
irq/239-i40e-en [irq/239-i40e-ens1f1-TxRx-9    3801    3801     50 FF   17
irq/166-i40e-en [irq/166-i40e-ens1f0-TxRx-0    3955    3955     50 FF   17
irq/189-i40e-en [irq/189-i40e-ens1f0-TxRx-2    3978    3978     50 FF   17
CPU 2/KVM       /usr/bin/qemu-system-x86_64    5899    5957      5 FF   17
cpuhp/18        [cpuhp/18]                      201     201      - TS   18
idle_inject/18  [idle_inject/18]                202     202     50 FF   18
irq_work/18     [irq_work/18]                   203     203      1 FF   18
migration/18    [migration/18]                  204     204     99 FF   18
rcuc/18         [rcuc/18]                       205     205     10 FF   18
ktimers/18      [ktimers/18]                    206     206      1 FF   18
ksoftirqd/18    [ksoftirqd/18]                  207     207      - TS   18
kworker/18:0-ev [kworker/18:0-events]           208     208      - TS   18
kworker/18:1    [kworker/18:1]                  343     343      - TS   18
irq/151-megasas [irq/151-megasas0-msix19]       507     507     50 FF   18
irq/240-i40e-en [irq/240-i40e-ens1f1-TxRx-1    3802    3802     50 FF   18
irq/167-i40e-en [irq/167-i40e-ens1f0-TxRx-1    3956    3956     50 FF   18
CPU 3/KVM       /usr/bin/qemu-system-x86_64    5899    5958      5 FF   18

What's the emulatorpin setting in your setup?

@dupremathieu
Member

What we want for emulatorpin is to use the non-RT cpuset (0-1,7-13,19-N in your case) to avoid reserving and losing a core for it.

@insatomcat
Member

insatomcat commented Mar 21, 2024

I don't think you can do that while at the same time asking libvirt to use the machine-rt slice for the same guest.

@dupremathieu
Member

It should be possible using a qemu hook, but to me we should just mention it in the documentation.

I suggest indicating in the documentation that:

  • The pinned processes have to be pinned inside the cgroup cpuset.
  • If you use RT-privileged KVM threads, the minimal cpuset size has to be: total number of vCPUs + 1.
  • If that is not what you want, do not define any cgroup cpuset and use the isolcpus domain flag instead.

@insatomcat
Member

If you use RT-privileged KVM threads, the minimal cpuset size has to be: total number of vCPUs + 1

--> total number of "isolated" vCPUs + 1

You don't have to isolate all the vCPUs. You may even need a "non-isolated" vCPU for the housekeeping inside the VM.
So basically, if you need N isolated cores for your real-time workload, your guest may need N+1 cores.
But this "+1" can be shared with other RT guests and with the emulator.
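
A quick worked example with illustrative numbers: a guest with 4 vCPUs, 3 of them isolated for the real-time workload, needs at least 3 dedicated cores plus 1 shared core in the machine-rt cpuset (3 + 1 = 4); the shared core carries the non-isolated vCPU and the emulator threads, and can also be shared with other RT guests.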

@eroussy
Member Author

eroussy commented Mar 22, 2024

I suggest indicating in the documentation that:
* The pinned processes have to be pinned inside the cgroup cpuset.
* If you use RT-privileged KVM threads, the minimal cpuset size has to be: total number of vCPUs + 1.
* If that is not what you want, do not define any cgroup cpuset and use the isolcpus domain flag instead.

I don't think it is a good idea to propose both.
We should choose one isolation method and explain why we chose it.

Personally,

  • I don't like that the emulatorpin cpuset has to be within the machine-rt.slice allowed CPUs: we cannot pin it wherever we want.
  • I don't like that we have to assign number_of_isolated_vcpu + 1 cores to the slice. I find this confusing, and it also requires one extra core, which is difficult on machines with few cores.
  • I also don't like that the allowed CPUs have to be chosen during the Ansible setup. I think they should be chosen for each VM at deployment time.
  • Finally, I think all the emulator threads should be handled by the Linux scheduler on the system's cores.

The only argument I see (for now) in favor of using cgroups is that it helps protect against hardware processor attacks like Meltdown and Spectre.
Do we really want to protect against these types of attacks?
Is there any other argument I'm missing?

@eroussy
Member Author

eroussy commented Mar 26, 2024

Hi all,

We need to close this question.
I discussed it with Mathieu and we concluded that:

  • Cgroups are useful for specific configurations.
  • Cgroups must remain optional.
  • Cgroups are an advanced feature of SEAPATH and must not be presented directly to newcomers.
  • The isolcpus question will soon be handled with tuned, which is better.

So, regarding the work to do:

  • All the systemd slices (system, user, ovs, machine, machine-rt and machine-nonrt) are already optional and must stay optional.
  • Their configuration must be backported to Yocto (and also remain optional).
  • Tuned will be merged to handle isolcpus (see the sketch below).
  • Documentation should be written about the technical questions discussed in this issue (emulatorpin, qemu threads, number_of_isolated_vcpu + 1, etc.).
  • The slice variables (cpusystem, cpuuser, ...) must be removed from the inventory examples and described in the inventories README as an advanced feature.
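
As a hint of what the tuned-based approach could look like (assuming the stock realtime profile from tuned-profiles-realtime; CPU list illustrative):

# /etc/tuned/realtime-variables.conf
isolated_cores=4-7

# then activate the profile
tuned-adm profile realtime

The profile then takes care of isolcpus and the related kernel command-line tuning for those cores.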

@insatomcat @dupremathieu, what do you think of that? Did I miss anything?

@eroussy eroussy moved this from Todo to In Progress in SEAPATH Board Mar 26, 2024
@insatomcat
Member

I'm ok with all that.

@ebail
Member

ebail commented Mar 27, 2024

Great. @eroussy, maybe it is worth documenting this on the LF Energy wiki?

@eroussy
Member Author

eroussy commented Apr 8, 2024

The topic is now covered in this wiki page: https://wiki.lfenergy.org/display/SEAP/Scheduling+and+priorization
Feel free to reopen if you have questions or remarks.

@eroussy eroussy closed this as completed Apr 8, 2024
@github-project-automation github-project-automation bot moved this from In Progress to Done in SEAPATH Board Apr 8, 2024