Pipe Pingpong
Here is a benchmark program rewritten from perf bench
to measure the IPC latency between a parent and a child process. It is quite similar to lm_bench
I mentioned in Worksheet 5’s bonus challenge.
You can compile everything by typing make.
Here’s how it works:
- The parent writes a 4-byte integer to the child through the p2c pipe.
- The parent then immediately reads the c2p pipe. Since the child has not yet written anything into c2p, this pipe is empty, and the parent process blocks (goes to sleep).
- Because the parent is now blocked, the OS scheduler must find another process to run on that CPU core.
- The scheduler will likely pick the child, which was just made runnable. The scheduler must perform a context switch to transfer control from the parent to the child.
- The child performs the same sequence in reverse: it reads the integer from p2c, writes it back into c2p to wake the parent, and then blocks on reading p2c for the next round, forcing another context switch back to the parent.

Each round trip in the benchmark triggers two context switches, as sketched in the code below. Because the amount of data being sent is trivial, the total time measured is dominated by the overhead of the system calls and the context switches.

Three versions

In the first version, we force the two processes to run on the same CPU core. In the second version, we force the two processes to run on two separate CPU cores. In the third version, we let the CPU scheduler decide which CPU core to run each process on.
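To make the round trip concrete, here is a minimal, self-contained sketch of what such a benchmark looks like. It is an approximation, not the exact source in this repository: the pipe names, the 100000 round-trips, and the printed labels are taken from the output below, while the timing with clock_gettime and the error handling are my own assumptions. The pinned variants would additionally set CPU affinity (see the sketch under Version 1).

```c
/* Minimal pipe ping-pong sketch (illustrative, not the lab's exact code). */
#include <stdio.h>
#include <time.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>

#define N 100000                      /* round-trips, as in the runs below */

int main(void) {
    int p2c[2], c2p[2];               /* parent->child and child->parent pipes */
    if (pipe(p2c) < 0 || pipe(c2p) < 0) { perror("pipe"); return 1; }

    pid_t pid = fork();
    if (pid < 0) { perror("fork"); return 1; }

    if (pid == 0) {                   /* child: echo every integer back */
        int v;
        /* ping-00/ping-01 would pin the child to a CPU here (see Version 1) */
        for (int i = 0; i < N; i++) {
            if (read(p2c[0], &v, sizeof(v)) != (ssize_t)sizeof(v)) _exit(1);
            if (write(c2p[1], &v, sizeof(v)) != (ssize_t)sizeof(v)) _exit(1);
        }
        _exit(0);
    }

    /* parent: write 4 bytes, then block reading the reply */
    /* ping-00/ping-01 would pin the parent to CPU0 here (see Version 1) */
    struct timespec t0, t1;
    int v = 42;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < N; i++) {
        if (write(p2c[1], &v, sizeof(v)) != (ssize_t)sizeof(v)) return 1;
        if (read(c2p[0], &v, sizeof(v)) != (ssize_t)sizeof(v)) return 1;
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    wait(NULL);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("total time: %f (s)\n", secs);
    printf("per round-trip: %.2f (ns)\n", secs * 1e9 / N);
    printf("per context switch (approx): %.2f (ns)\n", secs * 1e9 / N / 2.0);
    printf("round-trips per second: %.0f\n", N / secs);
    return 0;
}
```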
@yunchih ➜ /workspaces/lab-pipe-pingpong (master) $ ./ping-00
# 100000 pipe round-trips (parent->child->parent)
# Affinity: parent->CPU0, child->CPU0
# Assumed 2 context switches per round-trip
total time: 0.945113 (s)
per round-trip: 9451.13 (ns)
per context switch (approx): 4725.57 (ns)
round-trips per second: 105807

@yunchih ➜ /workspaces/lab-pipe-pingpong (master) $ ./ping-01
# 100000 pipe round-trips (parent->child->parent)
# Affinity: parent->CPU0, child->CPU1
# Assumed 2 context switches per round-trip
total time: 2.618768 (s)
per round-trip: 26187.68 (ns)
per context switch (approx): 13093.84 (ns)
round-trips per second: 38186

@yunchih ➜ /workspaces/lab-pipe-pingpong (master) $ ./ping
# 100000 pipe round-trips (parent->child->parent)
# No Affinity
# Assumed 2 context switches per round-trip
total time: 2.231797 (s)
per round-trip: 22317.97 (ns)
per context switch (approx): 11158.98 (ns)
round-trips per second: 44807

yunchih@angel:~/code/lab-pipe-pingpong$ ./ping-00
# 100000 pipe round-trips (parent->child->parent)
# Affinity: parent->CPU0, child->CPU0
# Assumed 2 context switches per round-trip
total time: 0.252196 (s)
per round-trip: 2521.96 (ns)
per context switch (approx): 1260.98 (ns)
round-trips per second: 396517

yunchih@angel:~/code/lab-pipe-pingpong$ ./ping-01
# 100000 pipe round-trips (parent->child->parent)
# Affinity: parent->CPU0, child->CPU1
# Assumed 2 context switches per round-trip
total time: 0.272840 (s)
per round-trip: 2728.40 (ns)
per context switch (approx): 1364.20 (ns)
round-trips per second: 366515

yunchih@angel:~/code/lab-pipe-pingpong$ ./ping
# 100000 pipe round-trips (parent->child->parent)
# No Affinity
# Assumed 2 context switches per round-trip
total time: 0.251951 (s)
per round-trip: 2519.51 (ns)
per context switch (approx): 1259.76 (ns)
round-trips per second: 396902
Version 1: Parent and Child on the Same Core
Both processes are pinned to CPU0, so every round trip plays out exactly as described above: the parent writes its integer into p2c, then reads c2p and blocks. After the context switch, the child reads the integer and echoes it back into c2p (write), while the parent is still sleeping inside its read() system call. This will again unblock the parent and cause another context switch.
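The pinning itself is presumably done with sched_setaffinity (the "# Affinity" lines in the output suggest as much); here is a minimal helper sketch, not necessarily the lab's exact code:

```c
#define _GNU_SOURCE           /* needed for cpu_set_t / sched_setaffinity */
#include <sched.h>
#include <stdio.h>

/* Pin the calling process to a single CPU core.
 * ping-00: parent and child both pin themselves to CPU0.
 * ping-01: the parent pins itself to CPU0, the child to CPU1.
 * ping:    neither pins itself; the scheduler decides. */
static int pin_to_cpu(int cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    /* a pid of 0 means "the calling process" */
    if (sched_setaffinity(0, sizeof(set), &set) != 0) {
        perror("sched_setaffinity");
        return -1;
    }
    return 0;
}
```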
Version 2: Parent and Child on Different Cores

The parent is pinned to CPU0 and the child to CPU1. The sequence is the same: the parent writes into p2c, then reads c2p and blocks. Now, however, waking the process on the other core requires an Inter-Processor Interrupt, so the overhead of each round trip consists of two system calls, two Inter-Processor Interrupts, two context switches, and cache misses.

What can we learn?

The two processes in this experiment are not parallelizable: they are dependent on each other. Forcing them onto two cores does not speed anything up; it only exposes the communication costs between those cores. A good scheduler might have eventually migrated both processes to the same core to optimize for cache locality. That's why a scheduler might not always choose to be work conserving (that is, it might not utilize all the available CPU cores): doing so does not always lead to higher throughput.
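One way to check this on your own machine, outside the lab's code, is to have each process occasionally report which core it is running on; a small hypothetical helper using sched_getcpu would do:

```c
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

/* Hypothetical helper: report which core a process is on right now.
 * Calling it every few thousand iterations in the parent's and child's
 * loops of the no-affinity version shows whether the scheduler ever
 * migrates the two processes onto the same core. */
static void report_cpu(const char *who, long iteration) {
    printf("%s: iteration %ld on CPU %d\n", who, iteration, sched_getcpu());
}
```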
Time Sharing
Running ping-00 on Github Codespace and on my system shows that the overhead of a single context switch is approximately 4.7 microseconds and 1.3 microseconds, respectively. On typical Linux systems, a process gets a time slice (time quantum) of about 1 to 100 milliseconds. This is the amount of CPU time a process can use before the scheduler forces a context switch to let another runnable process use the CPU.

A single time quantum is at the millisecond scale, while the cost of one context switch is at the microsecond scale. This is a roughly 1000x difference, which is why the kernel can afford to switch processes frequently. If the context switch overhead were much larger, time sharing would not work, because most CPU time would be wasted switching rather than running user code.
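As a rough worked example using the numbers above: with a 1 ms quantum and the ~1.3 µs per-switch cost measured on my machine, switching consumes about 1.3 µs / 1000 µs ≈ 0.13% of each quantum; even with the ~4.7 µs measured on Codespace, the overhead stays below 0.5%.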