fork()/exec()

Why did UNIX choose a two-step interface (fork then exec) instead of a single API for process creation?

The biggest reason is that, between fork() and exec(), the parent can adjust many aspects of the child process’s execution environment:

  1. scheduling priority (nice)
  2. resource limits (rlimit)
  3. open files (dup2)
  4. file-creation permissions (umask)
  5. working directory (chdir)
  6. user ID (setuid)
  7. signal handling

For example, a management process running as root can fork a child, drop the child’s privileges from root to the nobody user, and only then exec() the target binary, minimizing the security risk.
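
A minimal sketch of this fork-then-adjust-then-exec pattern in Python. Since setuid() requires root, the child here makes unprivileged tweaks instead (chdir, umask, dup2); /bin/pwd is assumed to exist:

```python
import os

r, w = os.pipe()
if os.fork() == 0:
    # child: adjust its own environment before exec()
    os.chdir("/tmp")                 # working directory; a root process would
    os.umask(0o077)                  # also call os.setuid() here to drop privileges
    os.dup2(w, 1)                    # open files: route stdout into the pipe
    os.execv("/bin/pwd", ["pwd"])    # replace the child image; tweaks survive exec
os.close(w)
out = os.read(r, 1024).decode().strip()
os.wait()
print(out)
```

Everything the child changed before exec() is still in effect inside the new program image, which is why pwd reports /tmp.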

With a single process-creation API, you would instead have to cram every one of these knobs into one very large struct of options.

When we exec() a program, we can also pass inputs to it, in two ways:

  • command-line arguments (via argv)
  • environment variables (via envp)
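
Both channels show up directly in the execve() call. A hedged sketch using Python’s os.execve (the /usr/bin/env binary, assumed present, simply prints the environment it receives):

```python
import os

r, w = os.pipe()
if os.fork() == 0:                    # child
    os.dup2(w, 1)                     # capture the child's stdout via a pipe
    argv = ["env"]                    # command-line arguments
    envp = {"GREETING": "hello"}      # environment variables
    os.execve("/usr/bin/env", argv, envp)
os.close(w)
out = os.read(r, 1024).decode()
os.wait()
print(out.strip())
```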

Read OSTEP-proc-5 to find out more!

Environment Variables

Have you ever run an AI tool (like this one) that needs an API key to talk to OpenAI? It is a bad idea to hard-code the key:

# inside my python code ...
api_key = "sk-ABCDEF123456"

Why? Because if you commit your code to GitHub, people can see your API key 🙃.

It is also a bad idea to pass the API key as a command-line argument like this:

$ python3 llama.py api="sk-ABCDEF123456"

Why? Because if you run the program on a public workstation, anyone on the same workstation can see your secret by looking at top or ps.
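You can see this for yourself on Linux, where /proc/&lt;pid&gt;/cmdline exposes every process’s argv to all users (this is exactly what ps reads). A sketch with a fake secret:

```python
import subprocess, sys, time

# start a helper process with a fake secret on its command line
p = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(5)",
                      "api=sk-FAKE"])
time.sleep(0.5)                       # give the child a moment to finish exec()
# on Linux, any user can read another process's argv via /proc
with open(f"/proc/{p.pid}/cmdline", "rb") as f:
    argv = f.read().replace(b"\0", b" ").decode().strip()
print(argv)
p.kill()
```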

Instead, the best practice is to pass the API key, or any secret, as an environment variable:

import getpass
import os

if "OPENAI_API_KEY" not in os.environ:
    os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter your OpenAI API key: ")

Before we run the code, we set the environment variable by pasting the API key we get from OpenAI:

export OPENAI_API_KEY=sk-ABCDEF123456

Environment variables let us pass context to the code we’re running without hard-coding it.

Try it on GitHub Codespaces

Let’s play with an example script that queries the Taipei YouBike 2.0 real-time API. Open a previous GitHub Codespace, and run the following in the terminal to install some packages:

sudo apt update
sudo apt install -y jq curl

  1. Save this script as ubike.sh:
#!/usr/bin/env bash
#
# Simple helper for Taipei YouBike 2.0 real-time data
#   • If $STATION is unset/empty  : list all station names (sna)
#   • If $STATION is set          : print the specific field for the matching sna

DATA_URL="https://tcgbusfs.blob.core.windows.net/dotapp/youbike/v2/youbike_immediate.json"

json=$(curl -s "$DATA_URL")

if [[ -z "${STATION:-}" ]]; then
  echo "Available stations:"
  echo "$json" | jq -r '.[] | .sna' | sort -u
else
  echo "Details for station: $STATION"
  echo "$json" | jq -r --arg sna "$STATION" '
    .[]
    | select(.sna == $sna)
    | to_entries[]
    | "\(.key)=\(.value)"
  '
fi

  2. Make it executable: chmod +x ubike.sh

  3. Run it:

# List every station name
./ubike.sh # STATION is empty

# Inspect one specific station (Chinese names work fine)
export STATION="YouBike2.0_捷運科技大樓站"
./ubike.sh

As you can see, we changed the behavior of ubike.sh without passing any arguments, just by setting an environment variable.

Real-World Example: Starting a PostgreSQL Container

Docker container images usually don’t change once they are built: their contents (binaries, scripts, configuration) are fixed. Environment variables let us inject dynamic values, such as API keys or passwords, into a container at runtime.

The following command launches a PostgreSQL database container. Note the -e flags: they inject environment variables into the container (more customizable variables here).

docker run --name some-postgres \
  -e POSTGRES_USER=myuser \
  -e POSTGRES_PASSWORD=mypassword \
  -e POSTGRES_DB=mydatabase \
  -d postgres

Under the hood, Docker calls execve() to start the database process. The database process reads values like POSTGRES_USER from the environment.
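
Inside the container, reading such a value is just an environment lookup. A sketch (simulating Docker’s -e flag by setting the variable ourselves):

```python
import os

os.environ["POSTGRES_USER"] = "myuser"   # what `-e POSTGRES_USER=myuser` injects
user = os.environ.get("POSTGRES_USER", "postgres")  # fall back to a default if unset
print(user)
```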

Question: Why not just use command-line arguments: docker run postgres --user=myuser --password=mypassword?

Running Programs in the Background: Daemonize

You ssh into a Linux server, start a long-running program (say ./my_model_training), and then your network drops or you log out. You probably know what happens: the program gets killed…

How do we keep it alive? These days we usually use tmux and detach. But the classic UNIX way is to daemonize: turn the program into a background service that is no longer tied to your terminal.

How PTT Did It (The Old Days)

In the 1990s, PTT ran in 杜奕瑾’s dormitory, started manually from a terminal. To keep the program running, PTT calls a daemonize() function from logind, its login daemon that handles thousands of log-ins per second. The daemonize() function works like this:

  • fork() → parent exits, child keeps running in the background.
  • setsid() → child starts a new “session” with no terminal attached.
  • fork() again (double fork) → ensures the process can never accidentally grab a terminal again.
  • redirect input/output → stdin/stdout/stderr go to /dev/null or log files.
  • write a PID file → so admins can later find and control the daemon.
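
The steps above can be sketched in Python like this (the PID-file path and the logfile handling are illustrative, not PTT’s actual code):

```python
import os

def daemonize(logfile=None, pidfile="/tmp/mydaemon.pid"):
    if os.fork() > 0:
        os._exit(0)                  # parent exits; child keeps running
    os.setsid()                      # new session, no controlling terminal
    if os.fork() > 0:
        os._exit(0)                  # double fork: never re-grab a terminal
    devnull = os.open(os.devnull, os.O_RDWR)
    os.dup2(devnull, 0)              # stdin  -> /dev/null
    os.dup2(devnull, 1)              # stdout -> /dev/null
    if logfile:                      # stderr -> log file (or /dev/null)
        fd = os.open(logfile, os.O_WRONLY | os.O_APPEND | os.O_CREAT, 0o644)
        os.dup2(fd, 2)
    else:
        os.dup2(devnull, 2)
    with open(pidfile, "w") as f:    # PID file so admins can find the daemon
        f.write(str(os.getpid()))
```

A program would call daemonize() once at startup; everything after that runs detached from the terminal.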

That’s why PTT kept running even after the admin logged out.

Here’s the specific code that redirects stderr to the logfile. A daemon has no screen, so errors go to a file instead.

if (logfile) {
    if ((fd = OpenCreate(logfile, O_WRONLY | O_APPEND)) < 0) {
        perror("Can't open logfile");
        exit(1);
    }
    if (fd != 2) {
        dup2(fd, 2);
        close(fd);
    }
}

And here’s the code that sends stdin and stdout to the black hole, /dev/null:

if ((fd = open("/dev/null", O_RDWR)) < 0) {
    perror("Can't open /dev/null");
    exit(1);
}

dup2(fd, 0);
dup2(fd, 1);
if (!logfile)
    dup2(fd, 2);

This is equivalent to running this in a shell:

$ ./ptt </dev/null >/dev/null 2>>"$logfile"

Today: systemd Instead of DIY

On modern Linux servers, you rarely write daemonize() yourself. Instead, you write a systemd service unit:

[Service]
ExecStart=/usr/local/bin/myapp
Environment="API_KEY=sk-XXX"
Restart=always

Systemd then:

  • starts your program in the background,
  • keeps it alive if it crashes,
  • captures error logs (so you can debug later),
  • injects environment variables.
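
Installing such a unit typically looks like this (assuming the unit above is saved as myapp.service; the paths and unit name are illustrative):

```shell
sudo cp myapp.service /etc/systemd/system/
sudo systemctl daemon-reload          # make systemd re-read its unit files
sudo systemctl enable --now myapp     # start now, and on every boot
journalctl -u myapp -f                # follow the captured logs
```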

Have you noticed that systemd is the ancestor of all the processes in the system? Services like databases, web servers, and messaging systems run for months, listening on a network port and responding to requests. If they ever crash, they must be restarted immediately to keep the system available.
