hubertf's NetBSD Blog
 
[20161113] Learning more about the NetBSD scheduler (... than I wanted to know)
I've had another chat with Michael on the scheduler issue, and we agreed that someone should review his proposed patch. Some interesting things came out of that:
  1. I learned a bit more about the scheduler from Michael. With multiple CPUs, each CPU has a queue of processes that are either "on the CPU" (running) or waiting to be serviced (run) on that CPU. Those processes count as "migratable" in runqueue_t. Every now and then, the system checks all its run queues to see if a CPU is idle, and can thus "steal" (migrate) processes from a busy CPU. This is done in sched_balance().

    Such "stealing" (migration) has the positive effect that the process doesn't have to wait to be serviced on the CPU it's currently queued on. On the other hand, migrating the process affects the CPU's data and instruction caches, so switching CPUs shouldn't be taken lightly.

    If migration happens, the process should be taken from the CPU with the most processes waiting for CPU time. This calculation counts not only the current number of waiting processes but also a bit of the CPU's history, so processes that just started on a CPU are not taken away again immediately. This is done with the number of currently migratable processes (r_mcount) and a "historic" average, which is carried over from the previous round in r_avgcount. More or less weight can be given to this history, and it seems that the current number of migratable processes had too little weight overall to be considered.

    What happens in effect is that a process is not taken from its CPU and is left waiting there while another CPU sits idle. Which is exactly what I saw in the first place.

  2. What I also learned from Michael was that there are a number of sysctl variables that can be used to influence the scheduler. Those are available under the "kern.sched" sysctl-tree:
    % sysctl -d kern.sched
    kern.sched.cacheht_time: Cache hotness time (in ticks)
    kern.sched.balance_period: Balance period (in ticks)
    kern.sched.min_catch: Minimal count of threads for catching
    kern.sched.timesoftints: Track CPU time for soft interrupts
    kern.sched.kpreempt_pri: Minimum priority to trigger kernel preemption
    kern.sched.upreempt_pri: Minimum priority to trigger user preemption
    kern.sched.rtts: Round-robin time quantum (in milliseconds)
    kern.sched.pri_min: Minimal POSIX real-time priority
    kern.sched.pri_max: Maximal POSIX real-time priority 
    The above text shows that much more can be written about the scheduler and its inner workings, but this remains to be done by someone else (volunteers welcome!).

  3. Now, while digging into this, I also learned that I'm not the first to discover this issue, and there is already another PR on this. I have opened PR kern/51615 but there is also kern/43561. Funnily enough, the solution proposed there is about the same, though with a slightly different implementation. Still, *2 and <<1 are the same, as are /2 and >>1, so no change there. And renaming variables for fun doesn't count anyways. ;) Last but not least, it's worth noting that this whole issue is not Xen-specific.
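The averaging described in item 1 can be sketched in a few lines of C. This is only an illustration under stated assumptions, not the NetBSD source and not the patch from either PR: the function names, the plain integer mean, and the fixed-point variant with a percentage weight are all hypothetical. It shows one way in which a small current count (r_mcount) can end up with too little weight, namely by being rounded away in integer arithmetic.

```c
#include <assert.h>

/*
 * Hypothetical stand-ins for the r_avgcount / r_mcount bookkeeping.
 * These are NOT the NetBSD functions; names and formulas are assumptions.
 */

/* Plain integer mean of the old average and the current count.
 * For unsigned values ">> 1" gives the same result as "/ 2". */
unsigned avg_plain(unsigned avg, unsigned mcount)
{
    return (avg + mcount) >> 1;
}

/* Weighted mean kept with 8 fractional bits (the value 256 means 1.0).
 * weight_pct is the share of the old average, in percent. */
unsigned avg_fixpoint(unsigned avg_fp, unsigned mcount, unsigned weight_pct)
{
    assert(weight_pct <= 100);
    return (weight_pct * avg_fp + (100 - weight_pct) * 256 * mcount) / 100;
}

/* Apply avg_plain for a number of balance rounds, starting from 0. */
unsigned rounds_plain(unsigned mcount, int rounds)
{
    unsigned avg = 0;
    for (int i = 0; i < rounds; i++)
        avg = avg_plain(avg, mcount);
    return avg;
}

/* Apply avg_fixpoint for a number of balance rounds, starting from 0. */
unsigned rounds_fixpoint(unsigned mcount, unsigned weight_pct, int rounds)
{
    unsigned avg_fp = 0;
    for (int i = 0; i < rounds; i++)
        avg_fp = avg_fixpoint(avg_fp, mcount, weight_pct);
    return avg_fp;
}
```

With a single migratable process, rounds_plain(1, 10) stays at 0 no matter how many rounds pass, so in this sketch such a CPU never looks busier than an idle one. The fixed-point variant yields 128 after one round at a 50% weight, which is nonzero and can be compared across CPUs.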
So, with this in mind, I went to do a bit of testing. I had already tested running concurrent, long-running processes that did use up all the CPU they got, and the test was good.
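A test with pure CPU hogs like the one mentioned above could look roughly like the following. This is a minimal sketch, not the program actually used for the test: run_hogs() and burn_cpu() are hypothetical names, and the spin loop based on time(3) is just one simple way to keep a process busy.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Child body: spin until `seconds` of wall-clock time have passed. */
static void burn_cpu(int seconds)
{
    time_t end = time(NULL) + seconds;
    volatile unsigned long counter = 0;

    while (time(NULL) < end)
        counter++;
}

/* Fork `n` CPU-bound children, wait for all of them, and return
 * how many exited cleanly. */
int run_hogs(int n, int seconds)
{
    int started = 0, ok = 0;

    for (int i = 0; i < n; i++) {
        pid_t pid = fork();
        if (pid == 0) {
            burn_cpu(seconds);
            _exit(0);
        }
        if (pid > 0)
            started++;
    }
    for (int i = 0; i < started; i++) {
        int status;
        if (wait(&status) > 0 && WIFEXITED(status) && WEXITSTATUS(status) == 0)
            ok++;
    }
    return ok;
}
```

Calling run_hogs() with as many children as there are CPUs and a run time of a minute or so, while watching top(1) in another terminal, should show whether every CPU gets one of the hogs or whether some stay idle.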

To test a different load on the system, I started a "build.sh -j8" on a (VMware Fusion) VM with 4 CPUs on a MacBook Pro, and it nearly brought the machine to a halt. What I saw was lots of idle time on all CPUs, though. I aborted the exercise to get some CPU cycles back for myself. I blame the VM handling here, not the guest operating system.

I restarted the exercise with 2 CPUs in the same VM, and there I saw the load distributed across both CPUs (not much of a surprise with -j8), but there was also quite some idle time in the 'make clean / install' phases, and I'm not sure that is normal. During the actual build phases I wasn't able to see idle time, though the system spent quite some time in the kernel (system). Example top(1) output:

    load averages:  9.01,  8.60,  7.15;               up 0+01:24:11      01:19:33
    67 processes: 7 runnable, 58 sleeping, 2 on CPU
    CPU0 states:  0.0% user, 55.4% nice, 44.6% system,  0.0% interrupt,  0.0% idle
    CPU1 states:  0.0% user, 69.3% nice, 30.7% system,  0.0% interrupt,  0.0% idle
    Memory: 311M Act, 99M Inact, 6736K Wired, 23M Exec, 322M File, 395M Free
    Swap: 1536M Total, 21M Used, 1516M Free
    
    PID USERNAME PRI NICE   SIZE   RES STATE      TIME   WCPU    CPU COMMAND
    27028 feyrer    20    5    62M   27M CPU/1      0:00  9.74%  0.93% cc1
      728 feyrer    85    0    78M 3808K select/1   1:03  0.73%  0.73% sshd
    23274 feyrer    21    5    36M   14M RUN/0      0:00 10.00%  0.49% cc1
    21634 feyrer    20    5    44M   20M RUN/0      0:00  7.00%  0.34% cc1
    24697 feyrer    77    5  7988K 2480K select/1   0:00  0.31%  0.15% nbmake
    24964 feyrer    74    5    11M 5496K select/1   0:00  0.44%  0.15% nbmake
    18221 feyrer    21    5    49M   15M RUN/0      0:00  2.00%  0.10% cc1
    14513 feyrer    20    5    43M   16M RUN/0      0:00  2.00%  0.10% cc1
      518 feyrer    43    0    15M 1764K CPU/0      0:02  0.00%  0.00% top
    20842 feyrer    21    5  6992K  340K RUN/0      0:00  0.00%  0.00% x86_64--netb
    16215 feyrer    21    5    28M  172K RUN/0      0:00  0.00%  0.00% cc1
     8922 feyrer    20    5    51M   14M RUN/0      0:00  0.00%  0.00% cc1 
All in all, I'd say the patch is a good step forward from the current situation, which does not properly distribute pure CPU hogs at all.



Disclaimer: All opinion expressed here is purely my own. No responsibility is taken for anything.

Copyright (c) Hubert Feyrer