OSS reference · PostgreSQL · RDBMS Core
PostgreSQL — Architecture, Query Lifecycle, Storage & WAL, Runtime, Source Tree
PostgreSQL is a multi-process relational database whose defining choices — process-per-connection rather than thread-per-connection, MVCC visibility rather than locking readers, WAL-based durability rather than direct fsync, and a mailing-list contribution workflow rather than GitHub PRs — propagate through every part of the design. This page is a compact engineer-facing reference to those choices: how the processes are laid out, how a SQL statement moves through them, where bytes live at runtime, how PostgreSQL compares to other RDBMS, and where the source code is.
1. Architecture
A PostgreSQL cluster is one postmaster process plus a fixed set of auxiliary processes plus one backend process per client connection. The postmaster never serves SQL itself — it owns the listening socket, accepts new connections, and fork()s a child backend for each one. Backends share state through a fixed-size shared memory segment (allocated at startup) that holds the buffer pool, the WAL buffers, the lock tables, and the catalog cache invalidation queues. Everything else — the planner, the executor, the per-statement memory contexts — lives in the backend's private address space.
The auxiliary processes are durable for the lifetime of the cluster: walwriter flushes WAL pages out of the WAL buffer ring, bgwriter writes dirty shared-buffer pages back to OS so backends don't stall on victim search, checkpointer drives the periodic checkpoint that bounds recovery time, autovacuum launcher + autovacuum workers reclaim space and update planner statistics, archiver copies completed WAL segments to off-cluster storage, walsender serves WAL to physical replicas + logical subscribers (one per peer), and on each replica a walreceiver consumes from the corresponding walsender.
Process-per-connection (vs. MySQL's thread-per-connection) is the single biggest architectural decision. It gives PostgreSQL strong isolation between sessions and lets the OS handle scheduling for free, at the cost of higher per-connection RAM and slower connection turnover. Connection poolers (PgBouncer / pgpool) exist precisely to amortise this cost over many short-lived client connections.
2. Query Lifecycle
A single SQL statement travels through five stages inside the backend that runs it. The boundary between stages is intentional — every stage has its own data structure that can be inspected (and is, by EXPLAIN + the auto_explain contrib module), and replaced (the rewriter is how rules + RLS predicates fold into the query, the planner is what extensions like pg_hint_plan hook into).
- Parse —
raw_parser()turns the SQL string into a list of raw parse trees using thegram.ybison grammar. - Analyze —
parse_analyze()resolves names against the catalog (tables, columns, types, functions) and produces aQuerytree with semantic info attached. - Rewrite —
QueryRewrite()applies rules: view substitution, RLS predicates,UPDATE/DELETE→SELECTfor cursor planning. Output is still aQuerytree (possibly several, if a rule expands one statement into many). - Plan —
pg_plan_query()picks join orders, access methods (Seq Scan / Index Scan / Bitmap Heap Scan / ...), and sort / aggregation strategies. Output is aPlannedStmttree of executor nodes. Cost-based; uses statistics gathered by ANALYZE. - Execute —
ExecutorRun()walks the plan tree using a Volcano-style iterator (ExecProcNode→ childExecProcNode). Tuples flow up one at a time (or in batches underVECTORmode in 17+). The executor talks to the buffer manager for pages, the lock manager for row locks, and the WAL machinery for any modifications.
3. Storage & WAL
Every PostgreSQL table is a flat sequence of 8 KB pages on disk (the page size is compile-time, default 8K). A page holds a header, a line-pointer array, and tuples packed from the end backwards. Each tuple carries an xmin / xmax pair — the transaction that created it and the transaction that deleted it. This is MVCC: an UPDATE never overwrites; it writes a new tuple version with xmin = current and sets the old tuple's xmax = current. Readers consult their snapshot to decide which version is "visible to me". Dead tuples are reclaimed by vacuum.
Backends never touch disk pages directly. They go through shared_buffers, a fixed-size pool of 8 KB slots in shared memory (typical sizing: 25% of RAM). When a backend needs a page, it asks the buffer manager; the buffer manager either returns a pinned slot or evicts a victim (clock-sweep) and reads the page from disk. The page underneath is in the OS page cache, so a "miss" in shared_buffers is usually a hit in page cache — only the worst case is real disk I/O.
Durability is provided by the WAL (Write-Ahead Log), not by writing data pages on every change. Every modification first appends a WAL record describing it; only the WAL is fsynced on COMMIT (controlled by synchronous_commit). Data pages get written back asynchronously — by bgwriter as part of buffer eviction, and by checkpointer at checkpoint_timeout / max_wal_size boundaries. If the cluster crashes, recovery replays WAL from the last checkpoint to bring the data pages back to a consistent state.
The hot path on a write is tuple → shared_buffer → WAL buffer → fsync(WAL) → return. Data files (base/) are only updated asynchronously by checkpointer and bgwriter. The reason synchronous_commit = off is so much faster is that it skips the fsync(WAL) at commit — durability falls back to the next WAL flush boundary, which is the trade-off you accept.
4. PostgreSQL vs Other RDBMS
Four open / source-available OLTP databases that come up in the same room as PostgreSQL. The differences are real and shape what a workload feels like — connection pooling needs, index design, replication options, on-call playbook — even when the SQL surface looks the same.
| PostgreSQL | MySQL (InnoDB) | SQLite | DuckDB | |
|---|---|---|---|---|
| Process model | Process per connection (postmaster fork()s a backend) |
Thread per connection (one server process, N threads) | In-process library (no server) | In-process library (no server) |
| Storage layout | Heap (row store) · secondary indexes hold ctid | Clustered index on primary key · secondary indexes hold PK | Single-file row store · B-tree | Column store (vectorised) |
| Concurrency | MVCC · readers never block writers · UPDATE creates a new tuple version | MVCC via UNDO log · UPDATE writes in-place + UNDO entry | Database-level write lock (WAL mode allows readers during writer) | Single-writer multi-reader inside one process |
| Durability | WAL (write-ahead log) · fsync on commit | InnoDB redo log · doublewrite buffer for page tearing | WAL or rollback journal mode | WAL (since 0.9) · optimised for bulk OLAP |
| Replication | Physical (streaming WAL) + logical (decoded WAL → publisher/subscriber) | Statement / row binlog · semi-sync / async | External (litestream, rqlite) — not first-class | Not built-in (single-node analytical engine) |
| Extensibility | Procedural languages (PL/pgSQL, PL/Python, PL/V8), C extensions, custom data types / index access methods (GIN, GiST, BRIN, SP-GiST) | UDFs, stored procedures · pluggable storage engines (InnoDB, MyISAM, NDB) | Loadable extensions (FTS5, JSON1, R*Tree) | Loadable extensions (httpfs, parquet, json) · UDFs in Python / R |
| Use-case sweet spot | General-purpose OLTP · mixed workloads · "default SQL database when in doubt" | Web-tier OLTP at scale · clustered-index heavy reads | Embedded / on-device / single-writer apps | Single-node analytics on columnar files |
| License | PostgreSQL License (BSD-style, permissive) | GPLv2 (server) · proprietary (Oracle) | Public Domain | MIT |
| Governance | PostgreSQL Global Development Group (PGDG) · mailing-list driven · no single vendor | Oracle (since 2010) · pluggable storage engines from other vendors (Percona, MariaDB fork) | One maintainer (Hwaci) · external contributions not accepted | DuckDB Labs (Mark Raasveldt, Hannes Mühleisen) + CWI Amsterdam |
| Contribution workflow | Patches → pgsql-hackers mailing list → CommitFest review → committer pushes | GitHub PR to mysql/mysql-server · MySQL Worklog (WL) for design | Not accepting external contributions | GitHub PR to duckdb/duckdb · very fast turnaround |
5. CPU / Memory / Network at Runtime
A running PostgreSQL cluster's resource profile is dominated by three things: connection count (because each is a real OS process), shared_buffers + work_mem × concurrency (RAM), and WAL throughput (the fsync hot path). Almost every tuning conversation reduces to a knob in one of these three categories.
CPU / process model
- One OS process per connection. The cost of an idle backend is small (a few MB of private heap + page tables), the cost of a busy backend is whatever its query is doing. There is no thread pool to size — concurrency is sized by
max_connectionsand by the OS limit on processes. - No internal thread pool. A single query uses one backend; parallel query (
parallel_workers) lets a backend spawn worker backends for that one query but they exit when it finishes. - Auxiliary processes (walwriter, bgwriter, checkpointer, autovacuum workers) are mostly idle, with periodic CPU spikes — checkpointer's spike is the most visible, controlled by
checkpoint_completion_targetwhich spreads writes across the interval. - Connection poolers (PgBouncer, pgpool-II) sit between clients and PostgreSQL to amortise the fork() + per-process RAM cost when applications open and close many short-lived connections. Almost every real deployment uses one.
Memory layout
- shared_buffers — one fixed segment allocated at startup; sized to ~25% of RAM is the usual starting point. Holds 8 KB pages of all relations being touched + the dictionary that maps page-id → slot. Backed by OS page cache underneath, which is why "more shared_buffers" past a certain point gives diminishing returns.
- WAL buffers — small (16 MB default) ring of WAL pages waiting for walwriter or commit-driven flush. Almost never the bottleneck.
- Per-backend private memory —
work_memper sort / hash operator (not per connection — a single query with 3 sorts can use 3×work_mem), plus relcache / catcache caches (typically a few MB per backend), plustemp_buffersfor temp tables. Scaling formula: peak RAM ≈ shared_buffers + (avg per-backend × max_connections). - OS page cache — everything not in shared_buffers lives here. PostgreSQL deliberately keeps shared_buffers small relative to RAM so the OS can do its job.
Network traffic
- Client ↔ backend — one TCP per client connection, long-lived if the application reuses the connection, otherwise
fork()per request. PostgreSQL's wire protocol is a simple length-prefixed message format; libpq / JDBC / pgx all speak it directly. - Replication — physical replicas open one TCP per replica to the primary's walsender, which streams WAL records as they're flushed. Backpressure is the replica's
recovery_min_apply_delay+ how fast it can apply. - Logical replication / CDC — same walsender mechanism, but the WAL is decoded into row-level changes that match the subscriber's interest (table / publication filter). Used by every modern CDC tool (Debezium, pgoutput, wal2json).
- Archive shipping —
archive_commandruns per completed 16 MB WAL segment, copies to off-cluster storage (S3 / GCS / NFS). PITR + base backup + archive = the standard backup recipe.
6. Source Tree (for contributors)
Official source: read-only GitHub mirror at github.com/postgres/postgres; the canonical git repository is git.postgresql.org/postgresql.git. PostgreSQL does not accept GitHub pull requests — every contribution goes through the pgsql-hackers@lists.postgresql.org mailing list and is tracked in the per-cycle CommitFest. The patch submission guide walks the workflow end-to-end. Build is ./configure && make (a parallel Meson build also lives in the tree); JDK is not relevant — most of the codebase is C, with bison + flex for the grammar.
Top-level modules
src/backend/— the server. By far the largest tree. Sub-directories track the query lifecycle:parser/(gram.y),analyzer/+rewrite/,optimizer/(planner),executor/. Plusstorage/(buffer manager, lock manager, smgr),access/(heap, btree, gist, gin, brin, hash — the access methods),commands/(DDL implementations),catalog/(system tables + catalog cache),libpq/(the wire protocol server-side),replication/(walsender, walreceiver, logical decoding),tcop/(the postmaster main loop),utils/(date/time, hash, sort, memory contexts, the float comparison helpers used by every numeric opclass).src/include/— public + internal headers. Mirrorssrc/backend/'s tree. Touching a header here is how cross-module changes start.src/bin/— client-side tools (psql,pg_dump,pg_restore,pg_basebackup,initdb,pg_rewind,pg_upgrade,vacuumdb). Most contributors land their first patches here — small, well-bounded surface.src/interfaces/libpq/— the C client library that everything else (JDBC, psycopg, pgx) wraps or re-implements. The protocol contract lives here.src/pl/— procedural languages (plpgsql/,plpython/,plperl/,pltcl/). Each is its own loadable module.src/test/— regression infrastructure.regress/is the main suite (SQL + expected output diffs, run viapg_regress),isolation/covers concurrency,recovery/exercises crash recovery,subscription/covers logical replication, plus TAP tests underperl/.contrib/— bundled extensions that ship with the server tarball but are loaded explicitly viaCREATE EXTENSION.postgres_fdw(PostgreSQL foreign data wrapper),pg_stat_statements,pgcrypto,btree_gin/btree_gist,hstore,ltree,pgbench,pageinspect,amcheck,dblink,fuzzystrmatch,isn,dict_int, ... Each has its own.csources, SQL extension scripts, and asql/+expected/regression test pair. Reviewers are more tolerant of contrib changes than core changes — this is the recommended on-ramp for a first contribution.doc/src/sgml/— the user-facing documentation in DocBook XML. Doc-only patches are a real entry path; many committers got their first commit by improving a confusing paragraph.config/,configure,meson.build— autoconf + meson build systems. Rarely touched.
Version history
- 0.01 → 6.x (1995 – 1997) — derived from POSTGRES at Berkeley by Andrew Yu and Jolly Chen. SQL added (replacing the original QUEL).
- 7.x (2000 – 2004) — WAL (7.1), foreign keys (7.0), schemas (7.3), prepared statements (7.4). The era when PostgreSQL stopped being a research project.
- 8.x (2005 – 2009) — point-in-time recovery, native Windows port (8.0), two-phase commit, table partitioning via inheritance, common table expressions (8.4).
- 9.x (2010 – 2016) — streaming replication (9.0), JSON / JSONB (9.2 / 9.4), materialized views (9.3), foreign data wrappers, logical decoding (9.4),
UPSERT(9.5), parallel query (9.6). - 10 – 13 (2017 – 2020) — logical replication (publisher / subscriber), declarative partitioning (10 / 11), JIT (11), incremental sorts (13).
- 14 – 17 (2021 – 2024) — pipeline mode for libpq (14), SQL/JSON path (12 → 17 polish), MERGE (15), logical replication on standby (16), incremental backup (17), bgwriter / WAL improvements every cycle.
- 18 (current stable) — released 2025-09. Asynchronous I/O for many sequential paths, OAuth in libpq, more SQL/JSON, virtual columns.
- 19 (next, in development) — open CF /59/ window; my btree_gist NaN patch targets this cycle.
Branching & release model
The tree's main branch is master; each released major (REL_17_STABLE, REL_18_STABLE, ...) is a long-lived branch. Bug fixes are committed to master first and back-patched by the committer to every still-supported branch (typically the last five majors). Major releases ship roughly once a year in autumn (September / October). Minor releases ship four times a year (Feb / May / Aug / Nov) and contain bug fixes + security fixes only — no features. The release manager for each cycle is rotated; the release process is run on the pgsql-release@lists.postgresql.org internal list.
Related
- Sourcepostgres/postgres (read-only mirror)
- Upstream gitgit.postgresql.org/postgresql.git
- Docspostgresql.org/docs
- Submit a patchwiki.postgresql.org/Submitting_a_Patch
- CommitFestcommitfest.postgresql.org
- My PRsPostgreSQL contributions
- SiblingApache Kafka — Architecture, Message Flow, CLI
- Blogdevbilllab.tistory.com