Pipe Pingpong

Here is a benchmark program, adapted from perf bench, that measures the IPC latency between a parent and a child process. It is quite similar to lm_bench, which I mentioned in Worksheet 5’s bonus challenge.

You can compile everything by typing make.


Here’s how it works:

  1. The parent writes a 4-byte integer to the child through the p2c pipe.
  2. The parent then immediately reads from the c2p pipe. Since the child has not yet written anything into c2p, the pipe is empty, and the parent process blocks (goes to sleep).
  3. Because the parent is now blocked, the OS scheduler must find another process to run on that CPU core.
  4. The scheduler will likely pick the child, which was just made runnable. The scheduler must perform a context switch to transfer control from the parent to the child.

The child performs the same sequence in reverse: it reads from p2c, writes a reply to c2p, and then blocks on its next read of p2c, forcing another context switch back to the parent. Each round trip in the benchmark therefore triggers two context switches. Because the amount of data being sent is trivial, the total time measured is dominated by the overhead of the system calls and the context switches.
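
The core loop might look roughly like the following sketch. This is a simplified reconstruction based on the description above (two pipes named p2c and c2p, a 4-byte token, 100000 round trips), not the exact benchmark source; error handling and closing of unused pipe ends are omitted for brevity.

/* Simplified sketch of the ping-pong loop (not the exact benchmark source). */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/wait.h>
#include <time.h>

#define ROUNDS 100000

int main(void)
{
    int p2c[2], c2p[2];            /* p2c: parent -> child, c2p: child -> parent */
    if (pipe(p2c) < 0 || pipe(c2p) < 0) { perror("pipe"); exit(1); }

    pid_t pid = fork();
    if (pid == 0) {                /* child: echo every 4-byte token back */
        int token;
        for (int i = 0; i < ROUNDS; i++) {
            read(p2c[0], &token, sizeof(token));   /* blocks until the parent writes */
            write(c2p[1], &token, sizeof(token));  /* wakes the parent up */
        }
        exit(0);
    }

    struct timespec start, end;
    clock_gettime(CLOCK_MONOTONIC, &start);

    int token = 0;
    for (int i = 0; i < ROUNDS; i++) {
        write(p2c[1], &token, sizeof(token));      /* wake the child */
        read(c2p[0], &token, sizeof(token));       /* block until the child replies */
    }

    clock_gettime(CLOCK_MONOTONIC, &end);
    wait(NULL);

    double total = (end.tv_sec - start.tv_sec) + (end.tv_nsec - start.tv_nsec) / 1e9;
    printf("total time: %f (s)\n", total);
    printf("per round-trip: %.2f (ns)\n", total * 1e9 / ROUNDS);
    printf("per context switch (approx): %.2f (ns)\n", total * 1e9 / ROUNDS / 2);
    return 0;
}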

Three versions

In the first version, we force the two processes to run on the same CPU core. In the second version, we force them to run on two separate CPU cores. In the third version, we let the CPU scheduler decide which CPU core to run each process on.
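
The pinning itself is presumably done with the sched_setaffinity(2) system call; here is a minimal sketch of a helper that pins a process to one core (the CPU numbers follow the benchmark output: both processes on CPU 0 in ping-00, CPU 0 and CPU 1 in ping-01).

#define _GNU_SOURCE            /* for sched_setaffinity and the CPU_* macros */
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>

/* Pin the process identified by pid (0 means the calling process) to one CPU core. */
void pin_to_cpu(pid_t pid, int cpu)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    if (sched_setaffinity(pid, sizeof(set), &set) < 0) {
        perror("sched_setaffinity");
        exit(1);
    }
}

/* ping-00: parent and child each call pin_to_cpu(0, 0)            -> same core        */
/* ping-01: parent calls pin_to_cpu(0, 0), child calls pin_to_cpu(0, 1) -> two cores   */
/* ping:    no affinity call at all                                -> scheduler's choice */

Below are the results of the three versions, first on GitHub Codespaces and then on my own machine.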

@yunchih ➜ /workspaces/lab-pipe-pingpong (master) $ ./ping-00
# 100000 pipe round-trips (parent->child->parent)
# Affinity: parent->CPU0, child->CPU0
# Assumed 2 context switches per round-trip
total time: 0.945113 (s)
per round-trip: 9451.13 (ns)
per context switch (approx): 4725.57 (ns)
round-trips per second: 105807

@yunchih ➜ /workspaces/lab-pipe-pingpong (master) $ ./ping-01
# 100000 pipe round-trips (parent->child->parent)
# Affinity: parent->CPU0, child->CPU1
# Assumed 2 context switches per round-trip
total time: 2.618768 (s)
per round-trip: 26187.68 (ns)
per context switch (approx): 13093.84 (ns)
round-trips per second: 38186

@yunchih ➜ /workspaces/lab-pipe-pingpong (master) $ ./ping
# 100000 pipe round-trips (parent->child->parent)
# No Affinity
# Assumed 2 context switches per round-trip
total time: 2.231797 (s)
per round-trip: 22317.97 (ns)
per context switch (approx): 11158.98 (ns)
round-trips per second: 44807
The same three binaries, run on my own machine instead of GitHub Codespaces:

yunchih@angel:~/code/lab-pipe-pingpong$ ./ping-00
# 100000 pipe round-trips (parent->child->parent)
# Affinity: parent->CPU0, child->CPU0
# Assumed 2 context switches per round-trip
total time: 0.252196 (s)
per round-trip: 2521.96 (ns)
per context switch (approx): 1260.98 (ns)
round-trips per second: 396517

yunchih@angel:~/code/lab-pipe-pingpong$ ./ping-01
# 100000 pipe round-trips (parent->child->parent)
# Affinity: parent->CPU0, child->CPU1
# Assumed 2 context switches per round-trip
total time: 0.272840 (s)
per round-trip: 2728.40 (ns)
per context switch (approx): 1364.20 (ns)
round-trips per second: 366515

yunchih@angel:~/code/lab-pipe-pingpong$ ./ping
# 100000 pipe round-trips (parent->child->parent)
# No Affinity
# Assumed 2 context switches per round-trip
total time: 0.251951 (s)
per round-trip: 2519.51 (ns)
per context switch (approx): 1259.76 (ns)
round-trips per second: 396902

Version 1: Parent and Child on the Same Core

  1. The parent writes to p2c.
  2. The parent then reads from c2p and blocks.
  3. The scheduler on CPU 0 sees that the parent is no longer runnable. It looks at its local runqueue and finds that the child process is now ready to run (it was unblocked by the parent’s write).
  4. The scheduler performs a context switch from the parent to the child. This is an entirely software-based operation managed by the kernel on a single CPU. It’s very fast.
  5. The child runs, reads the 4-byte integer, writes a response (which unblocks the parent), and then blocks on its next read() system call, causing another context switch back to the parent. The sketch below shows one way to verify the switch count.
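
The “2 context switches per round-trip” figure printed by the benchmark is an assumption, not a measurement. One way to sanity-check it (a sketch, not part of the original program) is to read the parent’s voluntary context-switch counter before and after the loop with getrusage(2):

/* Sketch: count the parent's voluntary context switches around the loop. */
#include <stdio.h>
#include <sys/resource.h>

long voluntary_switches(void)
{
    struct rusage ru;
    getrusage(RUSAGE_SELF, &ru);
    return ru.ru_nvcsw;        /* voluntary context switches (blocking waits) */
}

/* Usage inside the parent, around the ping-pong loop:
 *
 *   long before = voluntary_switches();
 *   ... run the 100000 round trips ...
 *   long after = voluntary_switches();
 *   printf("voluntary switches in parent: %ld\n", after - before);
 *
 * If the value is close to 100000, the parent blocked (and was switched out)
 * roughly once per round trip, consistent with two switches per round trip
 * across the two processes. */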

Version 2: Parent and Child on Different Cores

  1. The parent writes to p2c.
  2. The parent then reads from c2p and blocks.
  3. The kernel now needs to wake up the child process. However, the child is on a different physical core (CPU 1).
  4. The kernel on CPU 0 must send an Inter-Processor Interrupt (IPI) to CPU 1. An IPI is a hardware-level signal used to get the attention of another core. This is a lot slower than a software context switch.
  5. CPU 1 receives the IPI, and its scheduler wakes up the child process.
  6. The child reads the data. This will result in a cache miss (the data written by CPU core 0 might need to be fetched across the CPU interconnect into CPU core 1’s cache).
  7. The child writes its response and blocks. The kernel on CPU 1 now sends an IPI back to CPU 0 to wake the parent.

The overhead consists of two system calls, two Inter-Processor Interrupts, two context switches, and cache misses.

What can we learn?

The two processes in this experiment are not parallelizable: they depend on each other. Forcing them onto two cores does not speed anything up; it only exposes the communication costs between those cores. A good scheduler might eventually migrate both processes onto the same core to optimize for cache locality. That’s why a scheduler might not always choose to be work-conserving (i.e., it may leave some CPU cores idle even when there are runnable tasks), because using every available core does not always lead to higher throughput.
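
Whether the no-affinity run actually ends up with both processes on the same core can be observed directly. A small sketch (not part of the original benchmark) using sched_getcpu(3), which each process could call occasionally inside its loop:

#define _GNU_SOURCE            /* for sched_getcpu */
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

/* Print which CPU core the calling process is currently running on.
 * The answer can change between calls if the scheduler migrates the process. */
void report_cpu(const char *who)
{
    printf("%s (pid %d) is on CPU %d\n", who, (int)getpid(), sched_getcpu());
}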

Time Sharing

Running ping-00 on GitHub Codespaces and on my own machine shows that the overhead of a single context switch is approximately 4.7 microseconds and 1.3 microseconds, respectively. On typical Linux systems, a process gets a time slice (time quantum) of about 1 to 100 milliseconds. This is the amount of CPU time a process can use before the scheduler forces a context switch to let another runnable process use the CPU.

A single time quantum is at the millisecond scale, while the cost of one context switch is at the microsecond scale, roughly a 1000x difference. For example, with a 10-millisecond time slice and a 1.3-microsecond context switch, switching consumes only about 0.013% of the CPU’s time. That’s why the kernel can switch processes frequently. If the context switch overhead were much larger, time sharing would not work well, because most CPU time would be wasted on switching rather than on running user code.
