OSS reference · PostgreSQL · RDBMS Core

PostgreSQL — Architecture, Query Lifecycle, Storage & WAL, Runtime, Source Tree

PostgreSQL is a multi-process relational database whose defining choices — process-per-connection rather than thread-per-connection, MVCC visibility rather than locking readers, WAL-based durability rather than direct fsync, and a mailing-list contribution workflow rather than GitHub PRs — propagate through every part of the design. This page is a compact engineer-facing reference to those choices: how the processes are laid out, how a SQL statement moves through them, where bytes live at runtime, how PostgreSQL compares to other RDBMS, and where the source code is.

1. Architecture

A PostgreSQL cluster is one postmaster process plus a fixed set of auxiliary processes plus one backend process per client connection. The postmaster never serves SQL itself — it owns the listening socket, accepts new connections, and fork()s a child backend for each one. Backends share state through a fixed-size shared memory segment (allocated at startup) that holds the buffer pool, the WAL buffers, the lock tables, and the catalog cache invalidation queues. Everything else — the planner, the executor, the per-statement memory contexts — lives in the backend's private address space.

The auxiliary processes are durable for the lifetime of the cluster: walwriter flushes WAL pages out of the WAL buffer ring, bgwriter writes dirty shared-buffer pages back to OS so backends don't stall on victim search, checkpointer drives the periodic checkpoint that bounds recovery time, autovacuum launcher + autovacuum workers reclaim space and update planner statistics, archiver copies completed WAL segments to off-cluster storage, walsender serves WAL to physical replicas + logical subscribers (one per peer), and on each replica a walreceiver consumes from the corresponding walsender.

CLIENTS libpq / JDBC / psql one TCP per connection replica walreceiver streaming replication logical subscriber CDC / pub-sub POSTMASTER + BACKENDS postmaster listens on the socket fork() backend per connect backend (per session) parse · plan · execute per-statement mem context walsender (per peer) tails WAL, streams to replica forked by postmaster SHARED MEMORY (single segment, fixed size at start) shared_buffers · WAL buffers · lock tables · proc array · catalog cache invalidation queue backends + auxiliary processes attach here at fork time no copy-on-write surprise — shared memory is genuinely shared, only private heap is COW AUXILIARY PROCESSES walwriter flushes WAL buffers to pg_wal/ bgwriter writes dirty buffers to ease backend stalls checkpointer periodic checkpoint bounds recovery time autovacuum launcher + N workers reclaim + ANALYZE archiver archive_command per complete WAL segment logical replication workers apply worker on subscriber spawned per subscription connect fork connect (replication slot) blue = client / replication TCP · green = replication slot connect · amber = logical pub/sub No threads — every box above is a separate OS process. Costs in this model live mostly in fork() + per-process page tables, not in lock contention.

Process-per-connection (vs. MySQL's thread-per-connection) is the single biggest architectural decision. It gives PostgreSQL strong isolation between sessions and lets the OS handle scheduling for free, at the cost of higher per-connection RAM and slower connection turnover. Connection poolers (PgBouncer / pgpool) exist precisely to amortise this cost over many short-lived client connections.

2. Query Lifecycle

A single SQL statement travels through five stages inside the backend that runs it. The boundary between stages is intentional — every stage has its own data structure that can be inspected (and is, by EXPLAIN + the auto_explain contrib module), and replaced (the rewriter is how rules + RLS predicates fold into the query, the planner is what extensions like pg_hint_plan hook into).

  1. Parseraw_parser() turns the SQL string into a list of raw parse trees using the gram.y bison grammar.
  2. Analyzeparse_analyze() resolves names against the catalog (tables, columns, types, functions) and produces a Query tree with semantic info attached.
  3. RewriteQueryRewrite() applies rules: view substitution, RLS predicates, UPDATE/DELETESELECT for cursor planning. Output is still a Query tree (possibly several, if a rule expands one statement into many).
  4. Planpg_plan_query() picks join orders, access methods (Seq Scan / Index Scan / Bitmap Heap Scan / ...), and sort / aggregation strategies. Output is a PlannedStmt tree of executor nodes. Cost-based; uses statistics gathered by ANALYZE.
  5. ExecuteExecutorRun() walks the plan tree using a Volcano-style iterator (ExecProcNode → child ExecProcNode). Tuples flow up one at a time (or in batches under VECTOR mode in 17+). The executor talks to the buffer manager for pages, the lock manager for row locks, and the WAL machinery for any modifications.

3. Storage & WAL

Every PostgreSQL table is a flat sequence of 8 KB pages on disk (the page size is compile-time, default 8K). A page holds a header, a line-pointer array, and tuples packed from the end backwards. Each tuple carries an xmin / xmax pair — the transaction that created it and the transaction that deleted it. This is MVCC: an UPDATE never overwrites; it writes a new tuple version with xmin = current and sets the old tuple's xmax = current. Readers consult their snapshot to decide which version is "visible to me". Dead tuples are reclaimed by vacuum.

Backends never touch disk pages directly. They go through shared_buffers, a fixed-size pool of 8 KB slots in shared memory (typical sizing: 25% of RAM). When a backend needs a page, it asks the buffer manager; the buffer manager either returns a pinned slot or evicts a victim (clock-sweep) and reads the page from disk. The page underneath is in the OS page cache, so a "miss" in shared_buffers is usually a hit in page cache — only the worst case is real disk I/O.

Durability is provided by the WAL (Write-Ahead Log), not by writing data pages on every change. Every modification first appends a WAL record describing it; only the WAL is fsynced on COMMIT (controlled by synchronous_commit). Data pages get written back asynchronously — by bgwriter as part of buffer eviction, and by checkpointer at checkpoint_timeout / max_wal_size boundaries. If the cluster crashes, recovery replays WAL from the last checkpoint to bring the data pages back to a consistent state.

BACKEND (one process per session) backend INSERT / UPDATE / DELETE mutate page in shared_buffers append WAL record SHARED MEMORY shared_buffers (≈25% of RAM) 8 KB slots · clock-sweep eviction pinned by current users · dirty bit backed by OS page cache underneath WAL buffers (ring) wal_buffers (default 16 MB) every change → record here first walwriter / commit drain → disk AUXILIARY PROCESSES checkpointer flush dirty buffers on timeout / WAL size bgwriter trickle write of dirties to ease backend stalls walwriter flush WAL buffers commit fsync() archiver / walsender archive_command + stream WAL to replicas DISK ($PGDATA) base/ — data files (8 KB pages, per relation) pg_wal/ — WAL segments (16 MB each, sequential) mutate append WAL record async commit fsync

The hot path on a write is tuple → shared_buffer → WAL buffer → fsync(WAL) → return. Data files (base/) are only updated asynchronously by checkpointer and bgwriter. The reason synchronous_commit = off is so much faster is that it skips the fsync(WAL) at commit — durability falls back to the next WAL flush boundary, which is the trade-off you accept.

4. PostgreSQL vs Other RDBMS

Four open / source-available OLTP databases that come up in the same room as PostgreSQL. The differences are real and shape what a workload feels like — connection pooling needs, index design, replication options, on-call playbook — even when the SQL surface looks the same.

PostgreSQL MySQL (InnoDB) SQLite DuckDB
Process model Process per connection (postmaster fork()s a backend) Thread per connection (one server process, N threads) In-process library (no server) In-process library (no server)
Storage layout Heap (row store) · secondary indexes hold ctid Clustered index on primary key · secondary indexes hold PK Single-file row store · B-tree Column store (vectorised)
Concurrency MVCC · readers never block writers · UPDATE creates a new tuple version MVCC via UNDO log · UPDATE writes in-place + UNDO entry Database-level write lock (WAL mode allows readers during writer) Single-writer multi-reader inside one process
Durability WAL (write-ahead log) · fsync on commit InnoDB redo log · doublewrite buffer for page tearing WAL or rollback journal mode WAL (since 0.9) · optimised for bulk OLAP
Replication Physical (streaming WAL) + logical (decoded WAL → publisher/subscriber) Statement / row binlog · semi-sync / async External (litestream, rqlite) — not first-class Not built-in (single-node analytical engine)
Extensibility Procedural languages (PL/pgSQL, PL/Python, PL/V8), C extensions, custom data types / index access methods (GIN, GiST, BRIN, SP-GiST) UDFs, stored procedures · pluggable storage engines (InnoDB, MyISAM, NDB) Loadable extensions (FTS5, JSON1, R*Tree) Loadable extensions (httpfs, parquet, json) · UDFs in Python / R
Use-case sweet spot General-purpose OLTP · mixed workloads · "default SQL database when in doubt" Web-tier OLTP at scale · clustered-index heavy reads Embedded / on-device / single-writer apps Single-node analytics on columnar files
License PostgreSQL License (BSD-style, permissive) GPLv2 (server) · proprietary (Oracle) Public Domain MIT
Governance PostgreSQL Global Development Group (PGDG) · mailing-list driven · no single vendor Oracle (since 2010) · pluggable storage engines from other vendors (Percona, MariaDB fork) One maintainer (Hwaci) · external contributions not accepted DuckDB Labs (Mark Raasveldt, Hannes Mühleisen) + CWI Amsterdam
Contribution workflow Patches → pgsql-hackers mailing list → CommitFest review → committer pushes GitHub PR to mysql/mysql-server · MySQL Worklog (WL) for design Not accepting external contributions GitHub PR to duckdb/duckdb · very fast turnaround

5. CPU / Memory / Network at Runtime

A running PostgreSQL cluster's resource profile is dominated by three things: connection count (because each is a real OS process), shared_buffers + work_mem × concurrency (RAM), and WAL throughput (the fsync hot path). Almost every tuning conversation reduces to a knob in one of these three categories.

CPU / process model

  • One OS process per connection. The cost of an idle backend is small (a few MB of private heap + page tables), the cost of a busy backend is whatever its query is doing. There is no thread pool to size — concurrency is sized by max_connections and by the OS limit on processes.
  • No internal thread pool. A single query uses one backend; parallel query (parallel_workers) lets a backend spawn worker backends for that one query but they exit when it finishes.
  • Auxiliary processes (walwriter, bgwriter, checkpointer, autovacuum workers) are mostly idle, with periodic CPU spikes — checkpointer's spike is the most visible, controlled by checkpoint_completion_target which spreads writes across the interval.
  • Connection poolers (PgBouncer, pgpool-II) sit between clients and PostgreSQL to amortise the fork() + per-process RAM cost when applications open and close many short-lived connections. Almost every real deployment uses one.

Memory layout

  • shared_buffers — one fixed segment allocated at startup; sized to ~25% of RAM is the usual starting point. Holds 8 KB pages of all relations being touched + the dictionary that maps page-id → slot. Backed by OS page cache underneath, which is why "more shared_buffers" past a certain point gives diminishing returns.
  • WAL buffers — small (16 MB default) ring of WAL pages waiting for walwriter or commit-driven flush. Almost never the bottleneck.
  • Per-backend private memorywork_mem per sort / hash operator (not per connection — a single query with 3 sorts can use 3× work_mem), plus relcache / catcache caches (typically a few MB per backend), plus temp_buffers for temp tables. Scaling formula: peak RAM ≈ shared_buffers + (avg per-backend × max_connections).
  • OS page cache — everything not in shared_buffers lives here. PostgreSQL deliberately keeps shared_buffers small relative to RAM so the OS can do its job.

Network traffic

  • Client ↔ backend — one TCP per client connection, long-lived if the application reuses the connection, otherwise fork() per request. PostgreSQL's wire protocol is a simple length-prefixed message format; libpq / JDBC / pgx all speak it directly.
  • Replication — physical replicas open one TCP per replica to the primary's walsender, which streams WAL records as they're flushed. Backpressure is the replica's recovery_min_apply_delay + how fast it can apply.
  • Logical replication / CDC — same walsender mechanism, but the WAL is decoded into row-level changes that match the subscriber's interest (table / publication filter). Used by every modern CDC tool (Debezium, pgoutput, wal2json).
  • Archive shippingarchive_command runs per completed 16 MB WAL segment, copies to off-cluster storage (S3 / GCS / NFS). PITR + base backup + archive = the standard backup recipe.

6. Source Tree (for contributors)

Official source: read-only GitHub mirror at github.com/postgres/postgres; the canonical git repository is git.postgresql.org/postgresql.git. PostgreSQL does not accept GitHub pull requests — every contribution goes through the pgsql-hackers@lists.postgresql.org mailing list and is tracked in the per-cycle CommitFest. The patch submission guide walks the workflow end-to-end. Build is ./configure && make (a parallel Meson build also lives in the tree); JDK is not relevant — most of the codebase is C, with bison + flex for the grammar.

Top-level modules

  • src/backend/ — the server. By far the largest tree. Sub-directories track the query lifecycle: parser/ (gram.y), analyzer/ + rewrite/, optimizer/ (planner), executor/. Plus storage/ (buffer manager, lock manager, smgr), access/ (heap, btree, gist, gin, brin, hash — the access methods), commands/ (DDL implementations), catalog/ (system tables + catalog cache), libpq/ (the wire protocol server-side), replication/ (walsender, walreceiver, logical decoding), tcop/ (the postmaster main loop), utils/ (date/time, hash, sort, memory contexts, the float comparison helpers used by every numeric opclass).
  • src/include/ — public + internal headers. Mirrors src/backend/'s tree. Touching a header here is how cross-module changes start.
  • src/bin/ — client-side tools (psql, pg_dump, pg_restore, pg_basebackup, initdb, pg_rewind, pg_upgrade, vacuumdb). Most contributors land their first patches here — small, well-bounded surface.
  • src/interfaces/libpq/ — the C client library that everything else (JDBC, psycopg, pgx) wraps or re-implements. The protocol contract lives here.
  • src/pl/ — procedural languages (plpgsql/, plpython/, plperl/, pltcl/). Each is its own loadable module.
  • src/test/ — regression infrastructure. regress/ is the main suite (SQL + expected output diffs, run via pg_regress), isolation/ covers concurrency, recovery/ exercises crash recovery, subscription/ covers logical replication, plus TAP tests under perl/.
  • contrib/ — bundled extensions that ship with the server tarball but are loaded explicitly via CREATE EXTENSION. postgres_fdw (PostgreSQL foreign data wrapper), pg_stat_statements, pgcrypto, btree_gin / btree_gist, hstore, ltree, pgbench, pageinspect, amcheck, dblink, fuzzystrmatch, isn, dict_int, ... Each has its own .c sources, SQL extension scripts, and a sql/ + expected/ regression test pair. Reviewers are more tolerant of contrib changes than core changes — this is the recommended on-ramp for a first contribution.
  • doc/src/sgml/ — the user-facing documentation in DocBook XML. Doc-only patches are a real entry path; many committers got their first commit by improving a confusing paragraph.
  • config/, configure, meson.build — autoconf + meson build systems. Rarely touched.

Version history

  • 0.01 → 6.x (1995 – 1997) — derived from POSTGRES at Berkeley by Andrew Yu and Jolly Chen. SQL added (replacing the original QUEL).
  • 7.x (2000 – 2004) — WAL (7.1), foreign keys (7.0), schemas (7.3), prepared statements (7.4). The era when PostgreSQL stopped being a research project.
  • 8.x (2005 – 2009) — point-in-time recovery, native Windows port (8.0), two-phase commit, table partitioning via inheritance, common table expressions (8.4).
  • 9.x (2010 – 2016) — streaming replication (9.0), JSON / JSONB (9.2 / 9.4), materialized views (9.3), foreign data wrappers, logical decoding (9.4), UPSERT (9.5), parallel query (9.6).
  • 10 – 13 (2017 – 2020) — logical replication (publisher / subscriber), declarative partitioning (10 / 11), JIT (11), incremental sorts (13).
  • 14 – 17 (2021 – 2024) — pipeline mode for libpq (14), SQL/JSON path (12 → 17 polish), MERGE (15), logical replication on standby (16), incremental backup (17), bgwriter / WAL improvements every cycle.
  • 18 (current stable) — released 2025-09. Asynchronous I/O for many sequential paths, OAuth in libpq, more SQL/JSON, virtual columns.
  • 19 (next, in development) — open CF /59/ window; my btree_gist NaN patch targets this cycle.

Branching & release model

The tree's main branch is master; each released major (REL_17_STABLE, REL_18_STABLE, ...) is a long-lived branch. Bug fixes are committed to master first and back-patched by the committer to every still-supported branch (typically the last five majors). Major releases ship roughly once a year in autumn (September / October). Minor releases ship four times a year (Feb / May / Aug / Nov) and contain bug fixes + security fixes only — no features. The release manager for each cycle is rotated; the release process is run on the pgsql-release@lists.postgresql.org internal list.

Related