Syscalls are really not that expensive. The hardware overhead to cross the privilege boundary both ways is only around 100 nanoseconds on modern hardware, rising to maybe a few hundred nanoseconds with full speculation mitigations enabled. Essentially everything else is work that someone in the system needs to do. That work would not go away even if you were to link everything into the kernel, so you really only save the hardware overhead by reducing the number of syscalls. So, unless your code is mainly hot loops <1-5 us in duration with a syscall needed per loop, you would not really gain that much efficiency by reducing the number of syscalls. To gain real benefits you need to make the syscall implementations themselves faster, which is certainly possible with a different syscall paradigm.
It's expensive enough that the first thing I do when diagnosing slow software is to break out "strace -c", because shockingly often the cause is someone who thought syscalls wouldn't be that expensive.
You're right, they're not that expensive, but when you end up triggering multiple syscalls in a tight loop where only a tiny fraction of them were needed, it often does totally kill performance.
Especially because while the call itself is not that expensive, you rarely call something that does no work. So the goal should be not just to remove the syscall, but to remove the reason why someone made lots of syscalls in the first instance.
E.g. a pet peeve of mine: people who call read() with size 1 because they don't know the length of the remaining data, instead of doing async reads into a userspace buffer. I have cut CPU use drastically so many times in applications that did that. The problem then is of course not just the context switch, but the massive number of read calls.
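A minimal sketch of the difference (in Python for brevity; the file name and sizes are just for illustration): both loops below return the same bytes, but the first issues one read(2) per byte while the second issues one per 64 KiB buffer.

```python
import os
import tempfile

def read_byte_at_a_time(path):
    """Anti-pattern: one read() syscall per byte of input."""
    out = bytearray()
    fd = os.open(path, os.O_RDONLY)
    try:
        while chunk := os.read(fd, 1):
            out += chunk
    finally:
        os.close(fd)
    return bytes(out)

def read_buffered(path, bufsize=64 * 1024):
    """Read into a large userspace buffer; slice/parse in userspace."""
    out = bytearray()
    fd = os.open(path, os.O_RDONLY)
    try:
        while chunk := os.read(fd, bufsize):
            out += chunk
    finally:
        os.close(fd)
    return bytes(out)

# For a 64 KiB file the first variant makes ~65k read() calls, the second one.
data = os.urandom(64 * 1024)
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(data)
    path = f.name
try:
    assert read_byte_at_a_time(path) == read_buffered(path) == data
finally:
    os.unlink(path)
```

Real code would usually get the buffering for free from stdio / io.BufferedReader; the point is just where the per-byte work happens.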
FYI shells do this when reading line-by-line from non-seekable streams (e.g. pipes). It's not really feasible to do buffered I/O when you have subprocesses around, and you can't just ungetc(3) into arbitrary file descriptors. A performance pitfall worth noting IMO (that is, if you ever do text processing with a bunch of read commands and not sed/awk/whatever like a normal person). Of course this doesn't apply to seekable files, but even then the shell has to rewind back to immediately after the last newline character, so there's that.
100 nanoseconds out of 1 millisecond is 1/10000 or ~0.01%, which is really not very significant. As I said, if you have hot loops with a 1 microsecond budget then you might see material benefits with a kernel architecture that requires fewer privilege crossings. However, in most such cases there are fairly easy ways of reorganizing the code to use the system architecture more efficiently, such as by batching multiple writes in userspace, that will also result in fewer syscalls in your critical path (and thus fewer privilege crossings) without changing the kernel architecture. It is actually a fairly rare problem where the only solution is to issue a vast number of syscalls quickly, and in those cases that is usually a limitation of the kernel data structures and syscall implementation. Just to clarify the distinction: in the former you issue the same number of "syscalls" but the new kernel architecture results in fewer privilege-crossing events, whereas my statement here is to rearchitect so you need to issue fewer syscalls.
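Batching writes in userspace can be sketched like this (hedged Python illustration; the record format is made up). Both variants produce byte-identical output, but one makes a syscall per record and the other one syscall total:

```python
import os
import tempfile

def write_unbatched(fd, records):
    """One write() syscall (and privilege crossing) per record."""
    for rec in records:
        os.write(fd, rec)

def write_batched(fd, records):
    """Coalesce records in userspace, then cross into the kernel once."""
    os.write(fd, b"".join(records))

records = [b"event %06d\n" % i for i in range(1000)]

paths = []
for writer in (write_unbatched, write_batched):
    with tempfile.NamedTemporaryFile(delete=False) as f:
        paths.append(f.name)
    fd = os.open(paths[-1], os.O_WRONLY | os.O_TRUNC)
    writer(fd, records)
    os.close(fd)

a, b = (open(p, "rb").read() for p in paths)
assert a == b  # same bytes on disk, 1000 write() calls vs 1
for p in paths:
    os.unlink(p)
```

In practice you would often get this for free from a buffered writer, or use writev(2) (os.writev) when the pieces can't be concatenated cheaply.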
Cases where you have soft realtime constraints like the above are often interactive apps like browsers or games that do a lot of input, audio and graphics related syscalls that don't batch easily.
I measure ~80ns here for the most-trivial syscalls (I measured getpid() and ftruncate(-1)).
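A rough way to sanity-check figures like this from a high-level language (hedged: the numbers below include the interpreter's own call overhead, so they are an upper bound well above the ~80 ns you'd measure from C):

```python
import os
import time

def ns_per_call(fn, n=200_000):
    """Average wall-clock nanoseconds per call of fn()."""
    t0 = time.perf_counter_ns()
    for _ in range(n):
        fn()
    return (time.perf_counter_ns() - t0) / n

# os.getppid() performs a real getppid(2) syscall on each call
# (it is not served from the vDSO), making it a handy trivial syscall.
baseline = ns_per_call(lambda: None)  # pure interpreter call overhead
syscall = ns_per_call(os.getppid)     # interpreter overhead + one syscall
print(f"no-op: ~{baseline:.0f} ns/call, getppid: ~{syscall:.0f} ns/call")
```

Subtracting the no-op baseline gives a crude estimate of the per-syscall cost; for serious numbers you'd use a C microbenchmark or perf.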
However, the "real" cost often is higher, because a syscall essentially prevents out-of-order execution from hiding the cost of those cycles. Quoting the Intel manual:
> Instruction ordering. Instructions following a SYSCALL may be fetched from memory before earlier instructions complete execution, but they will not execute (even speculatively) until all instructions prior to the SYSCALL have completed execution (the later instructions may execute before data stored by the earlier instructions have become globally visible)
On modern out-of-order CPUs that's often going to cause a lot of knock-on slowdown. Waiting for all prior instructions to retire will often mean waiting for memory accesses etc that would effectively be "free" without the syscall.
And then you have the icache, TLB costs of the syscall - harder to measure, because it won't show up in simple test programs...
Yes, there is some overhead due to not being able to speculate across the privilege boundary. However, the cache effects that you mention are not syscall overhead that you could get rid of by reducing privilege crossings. They are caused by the fact that the syscall is actually executing code. You would see the same effects even if you were to link everything together into the kernel and it were just a subroutine call.
That means that if cache effects due to syscalls are your major problem then you will basically not benefit even if you were to use io_uring or equivalent mechanisms to reduce your privilege crossings. The actual problem is that you are asking the kernel to do too much work and the only solution there is to rearchitect to require less time manipulating kernel data structures.
> However, the cache effects that you mention are not syscall overhead that you could get rid of by reducing privilege crossings. They are caused by the fact that the syscall is actually executing code.
That's part of it, sure. But not the whole issue. You need code mapped elsewhere leading to icache/iTLB effects, the GP registers need to be saved to memory costing you data cache entries, etc.
> That means that if cache effects due to syscalls are your major problem then you will basically not benefit even if you were to use io_uring or equivalent mechanisms to reduce your privilege crossings. The actual problem is that you are asking the kernel to do too much work and the only solution there is to rearchitect to require less time manipulating kernel data structures.
I don't agree at all with this. When executing nontrivial syscalls, the various cache misses inside the kernel alone are a major performance issue.
Compare e.g. the perf stat output for these two fio runs. Both use the following "base" command:
fio --time_based=1 --runtime=10 --name test --filename=/dev/nvme6n1 --numjobs=1 --rw randread --allow_file_create=0 --invalidate=0 --ioengine io_uring --direct=1 --bs=4k --registerfiles --fixedbufs --gtod_reduce=1 --iodepth 128
One uses --iodepth_batch_submit=1 --iodepth_batch_complete_max=1
and the other --iodepth_batch_submit=128 --iodepth_batch_complete_max=128, which leads the first to submit and complete IOs one at a time and the second in batches of up to 128.
There's a decent IO throughput difference between the two (non-batched IOPS=402k, batched IOPS=586k), but that's partially due to the different number of interrupts, so I don't want to focus on that too much.
The difference in instructions executed / IPC (34B vs 46B instructions, 1.27 vs 1.7 IPC), iTLB misses (64M vs 20M), and icache misses (2.9B vs 441M) IMO shows pretty clearly that there's a significant difference in cache usage.
You can see similar effects even if you eliminate the actual storage device. E.g. when doing buffered reads from a file in page cache I get extreme differences in icache misses (1.7B vs 31M) and iTLB misses (27M vs 600k), yielding IOPS=790k vs IOPS=1098k (with memory copying being the bottleneck on this machine, thanks to Intel server CPU coherency handling, but proportions are similar on my laptop).
Indeed. That's equivalent to one cache miss. Spending engineering effort on memory locality and making your data structures compact probably pays off better for most apps. But maybe eBPF could help with that as well?