Syscalls are really not that expensive. The hardware overhead to cross the privilege boundary both ways is only around 100 nanoseconds on modern hardware, rising to maybe a few hundred nanoseconds with full speculation mitigations enabled. Essentially everything else is work that someone in the system needs to do. That work would not go away even if you were to link everything into the kernel, so you really only save the hardware overhead by reducing the number of syscalls. So, unless your code is mainly hot loops <1-5 us in duration with a syscall needed per loop, you would not really gain that much efficiency by reducing the number of syscalls. To gain real benefits you need to make the syscall implementations themselves faster, which is certainly possible with a different syscall paradigm.
It's expensive enough that the first thing I do when diagnosing slow software is to break out "strace -c", because shockingly often the cause is someone who thought syscalls wouldn't be that expensive.
You're right, they're not that expensive, but when you end up triggering multiple syscalls in a tight loop where only a tiny fraction of them were needed, it often does totally kill performance.
Especially because while the call itself is not that expensive, you rarely call something that does no work. So the goal should be not just to remove the syscall, but to remove the reason why someone made lots of syscalls in the first instance.
E.g. a pet peeve of mine: people who call read() with size 1 because they don't know the length of the remaining data, instead of doing async reads into a userspace buffer. I have cut CPU use drastically so many times in applications that did that. The problem then is of course not just the context switch, but the massive number of read calls.
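A minimal sketch of the difference (in Python for brevity; the file name and sizes are just for illustration): both loops below return the same bytes, but the first issues one read(2) per byte while the second issues one per 64 KiB buffer.

```python
import os
import tempfile

def read_byte_at_a_time(path):
    """Anti-pattern: one read() syscall per byte of input."""
    out = bytearray()
    fd = os.open(path, os.O_RDONLY)
    try:
        while chunk := os.read(fd, 1):
            out += chunk
    finally:
        os.close(fd)
    return bytes(out)

def read_buffered(path, bufsize=64 * 1024):
    """Read into a large userspace buffer; slice/parse in userspace."""
    out = bytearray()
    fd = os.open(path, os.O_RDONLY)
    try:
        while chunk := os.read(fd, bufsize):
            out += chunk
    finally:
        os.close(fd)
    return bytes(out)

# For a 64 KiB file the first variant makes ~65k read() calls, the second one.
data = os.urandom(64 * 1024)
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(data)
    path = f.name
try:
    assert read_byte_at_a_time(path) == read_buffered(path) == data
finally:
    os.unlink(path)
```

Real code would usually get the buffering for free from stdio / io.BufferedReader; the point is just where the per-byte work happens.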
FYI shells do this when reading line-by-line from non-seekable streams (e.g. pipes). It's not really feasible to do buffered I/O when you have subprocesses around, and you can't just ungetc(3) into arbitrary file descriptors. A performance pitfall worth noting IMO (that is, if you ever do text processing with a bunch of read commands and not sed/awk/whatever like a normal person). Of course this doesn't apply to seekable files, but even then the shell has to rewind back to immediately after the last newline character, so there's that.
100 nanoseconds out of 1 millisecond is 1/10000 or ~0.01%, which is really not very significant. As I said, if you have hot loops with a 1 microsecond budget then you might see material benefits with a kernel architecture that requires fewer privilege crossings. However, in most such cases there are fairly easy ways of reorganizing the code to use the system architecture more efficiently, such as by batching multiple writes in userspace, that will also result in fewer syscalls in your critical path (and thus fewer privilege crossings) without changing the kernel architecture. It is actually a fairly rare problem where the only solution is to issue a vast number of syscalls quickly, and in those cases that is usually a limitation of the kernel data structures and syscall implementation. Just to clarify the distinction: in the former you issue the same number of "syscalls" but the new kernel architecture results in fewer privilege-crossing events, whereas my statement here is to rearchitect so you need to issue fewer syscalls.
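Batching writes in userspace can be sketched like this (hedged Python illustration; the record format is made up). Both variants produce byte-identical output, but one makes a syscall per record and the other one syscall total:

```python
import os
import tempfile

def write_unbatched(fd, records):
    """One write() syscall (and privilege crossing) per record."""
    for rec in records:
        os.write(fd, rec)

def write_batched(fd, records):
    """Coalesce records in userspace, then cross into the kernel once."""
    os.write(fd, b"".join(records))

records = [b"event %06d\n" % i for i in range(1000)]

paths = []
for writer in (write_unbatched, write_batched):
    with tempfile.NamedTemporaryFile(delete=False) as f:
        paths.append(f.name)
    fd = os.open(paths[-1], os.O_WRONLY | os.O_TRUNC)
    writer(fd, records)
    os.close(fd)

a, b = (open(p, "rb").read() for p in paths)
assert a == b  # same bytes on disk, 1000 write() calls vs 1
for p in paths:
    os.unlink(p)
```

In practice you would often get this for free from a buffered writer, or use writev(2) (os.writev) when the pieces can't be concatenated cheaply.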
Cases where you have soft realtime constraints like the above are often interactive apps like browsers or games that do a lot of input, audio and graphics related syscalls that don't batch easily.
I measure ~80ns here for the most-trivial syscalls (I measured getpid() and ftruncate(-1)).
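A rough way to sanity-check figures like this from a high-level language (hedged: the numbers below include the interpreter's own call overhead, so they are an upper bound well above the ~80 ns you'd measure from C):

```python
import os
import time

def ns_per_call(fn, n=200_000):
    """Average wall-clock nanoseconds per call of fn()."""
    t0 = time.perf_counter_ns()
    for _ in range(n):
        fn()
    return (time.perf_counter_ns() - t0) / n

# os.getppid() performs a real getppid(2) syscall on each call
# (it is not served from the vDSO), making it a handy trivial syscall.
baseline = ns_per_call(lambda: None)  # pure interpreter call overhead
syscall = ns_per_call(os.getppid)     # interpreter overhead + one syscall
print(f"no-op: ~{baseline:.0f} ns/call, getppid: ~{syscall:.0f} ns/call")
```

Subtracting the no-op baseline gives a crude estimate of the per-syscall cost; for serious numbers you'd use a C microbenchmark or perf.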
However, the "real" cost often is higher, because a syscall essentially prevents out-of-order execution from hiding the cost of those cycles. Quoting the Intel manual:
> Instruction ordering. Instructions following a SYSCALL may be fetched from memory before earlier instructions complete execution, but they will not execute (even speculatively) until all instructions prior to the SYSCALL have completed execution (the later instructions may execute before data stored by the earlier instructions have become globally visible)
On modern out-of-order CPUs that's often going to cause a lot of knock-on slowdown. Waiting for all prior instructions to retire will often mean waiting for memory accesses etc that would effectively be "free" without the syscall.
And then you have the icache, TLB costs of the syscall - harder to measure, because it won't show up in simple test programs...
Yes, there is some overhead due to not being able to speculate across the privilege boundary. However, the cache effects that you mention are not syscall overhead that you could get rid of by reducing privilege crossings. They are caused by the fact that the syscall is actually executing code. You would see the same effects even if you were to link everything together into the kernel and it were just a subroutine call.
That means that if cache effects due to syscalls are your major problem then you will basically not benefit even if you were to use io_uring or equivalent mechanisms to reduce your privilege crossings. The actual problem is that you are asking the kernel to do too much work and the only solution there is to rearchitect to require less time manipulating kernel data structures.
> However, the cache effects that you mention are not syscall overhead that you could get rid of by reducing privilege crossings. They are caused by the fact that the syscall is actually executing code.
That's part of it, sure. But not the whole issue. You need code mapped elsewhere leading to icache/iTLB effects, the GP registers need to be saved to memory costing you data cache entries, etc.
> That means that if cache effects due to syscalls are your major problem then you will basically not benefit even if you were to use io_uring or equivalent mechanisms to reduce your privilege crossings. The actual problem is that you are asking the kernel to do too much work and the only solution there is to rearchitect to require less time manipulating kernel data structures.
I don't agree at all with this. When executing nontrivial syscalls, the various cache misses inside the kernel alone are a major performance issue.
Compare e.g. the perf stat output for these two fio runs. Both use the following "base" command:
fio --time_based=1 --runtime=10 --name test --filename=/dev/nvme6n1 --numjobs=1 --rw randread --allow_file_create=0 --invalidate=0 --ioengine io_uring --direct=1 --bs=4k --registerfiles --fixedbufs --gtod_reduce=1 --iodepth 128
One uses --iodepth_batch_submit=1 --iodepth_batch_complete_max=1
and the other --iodepth_batch_submit=128 --iodepth_batch_complete_max=128, which leads the first to submit and complete IOs one at a time and the second in batches of up to 128.
There's a decent IO throughput difference between the two (non-batched IOPS=402k, batched IOPS=586k), but that's partially due to the different number of interrupts, so I don't want to focus on that too much.
The difference in instructions executed / IPC (34B vs 46B instructions, 1.27 vs 1.7 IPC), iTLB misses (64M vs 20M), and icache misses (2.9B vs 441M) IMO shows pretty clearly that there's a significant difference in cache usage.
You can see similar effects even if you eliminate the actual storage device. E.g. when doing buffered reads from a file in page cache I get extreme differences in icache misses (1.7B vs 31M) and iTLB misses (27M vs 600k), yielding IOPS=790k vs IOPS=1098k (with memory copying being the bottleneck on this machine, thanks to Intel server CPU coherency handling, but proportions are similar on my laptop).
Indeed. That's equivalent to one cache miss. Spending engineering effort on memory locality and making your data structures compact probably pays off better for most apps. But maybe eBPF could help with that as well?