OSS reference · PostgreSQL · RDBMS Core

PostgreSQL — Architecture, Query Lifecycle, Storage & WAL, Runtime, Source Tree

PostgreSQL is a multi-process relational database whose defining choices — process-per-connection rather than thread-per-connection, MVCC visibility rather than locking readers, WAL-based durability rather than direct fsync, and a mailing-list contribution workflow rather than GitHub PRs — propagate through every part of the design. This page is a compact engineer-facing reference to those choices: how the processes are laid out, how a SQL statement moves through them, where bytes live at runtime, how PostgreSQL compares to other RDBMS, and where the source code is.

Architecture
Query Lifecycle
Storage & WAL
PostgreSQL vs Other RDBMS
CPU / Memory / Network at Runtime
Source Tree (for contributors)

1. Architecture

A PostgreSQL cluster is one postmaster process plus a fixed set of auxiliary processes plus one backend process per client connection. The postmaster never serves SQL itself — it owns the listening socket, accepts new connections, and fork()s a child backend for each one. Backends share state through a fixed-size shared memory segment (allocated at startup) that holds the buffer pool, the WAL buffers, the lock tables, and the catalog cache invalidation queues. Everything else — the planner, the executor, the per-statement memory contexts — lives in the backend's private address space.

The auxiliary processes are durable for the lifetime of the cluster: walwriter flushes WAL pages out of the WAL buffer ring, bgwriter writes dirty shared-buffer pages back to OS so backends don't stall on victim search, checkpointer drives the periodic checkpoint that bounds recovery time, autovacuum launcher + autovacuum workers reclaim space and update planner statistics, archiver copies completed WAL segments to off-cluster storage, walsender serves WAL to physical replicas + logical subscribers (one per peer), and on each replica a walreceiver consumes from the corresponding walsender.

Process-per-connection (vs. MySQL's thread-per-connection) is the single biggest architectural decision. It gives PostgreSQL strong isolation between sessions and lets the OS handle scheduling for free, at the cost of higher per-connection RAM and slower connection turnover. Connection poolers (PgBouncer / pgpool) exist precisely to amortise this cost over many short-lived client connections.

2. Query Lifecycle

A single SQL statement travels through five stages inside the backend that runs it. The boundary between stages is intentional — every stage has its own data structure that can be inspected (and is, by EXPLAIN + the auto_explain contrib module), and replaced (the rewriter is how rules + RLS predicates fold into the query, the planner is what extensions like pg_hint_plan hook into).

Parse — raw_parser() turns the SQL string into a list of raw parse trees using the gram.y bison grammar.
Analyze — parse_analyze() resolves names against the catalog (tables, columns, types, functions) and produces a Query tree with semantic info attached.
Rewrite — QueryRewrite() applies rules: view substitution, RLS predicates, UPDATE/DELETE → SELECT for cursor planning. Output is still a Query tree (possibly several, if a rule expands one statement into many).
Plan — pg_plan_query() picks join orders, access methods (Seq Scan / Index Scan / Bitmap Heap Scan / ...), and sort / aggregation strategies. Output is a PlannedStmt tree of executor nodes. Cost-based; uses statistics gathered by ANALYZE.
Execute — ExecutorRun() walks the plan tree using a Volcano-style iterator (ExecProcNode → child ExecProcNode). Tuples flow up one at a time (or in batches under VECTOR mode in 17+). The executor talks to the buffer manager for pages, the lock manager for row locks, and the WAL machinery for any modifications.

3. Storage & WAL

Every PostgreSQL table is a flat sequence of 8 KB pages on disk (the page size is compile-time, default 8K). A page holds a header, a line-pointer array, and tuples packed from the end backwards. Each tuple carries an xmin / xmax pair — the transaction that created it and the transaction that deleted it. This is MVCC: an UPDATE never overwrites; it writes a new tuple version with xmin = current and sets the old tuple's xmax = current. Readers consult their snapshot to decide which version is "visible to me". Dead tuples are reclaimed by vacuum.

Backends never touch disk pages directly. They go through shared_buffers, a fixed-size pool of 8 KB slots in shared memory (typical sizing: 25% of RAM). When a backend needs a page, it asks the buffer manager; the buffer manager either returns a pinned slot or evicts a victim (clock-sweep) and reads the page from disk. The page underneath is in the OS page cache, so a "miss" in shared_buffers is usually a hit in page cache — only the worst case is real disk I/O.

Durability is provided by the WAL (Write-Ahead Log), not by writing data pages on every change. Every modification first appends a WAL record describing it; only the WAL is fsynced on COMMIT (controlled by synchronous_commit). Data pages get written back asynchronously — by bgwriter as part of buffer eviction, and by checkpointer at checkpoint_timeout / max_wal_size boundaries. If the cluster crashes, recovery replays WAL from the last checkpoint to bring the data pages back to a consistent state.

The hot path on a write is tuple → shared_buffer → WAL buffer → fsync(WAL) → return. Data files (base/) are only updated asynchronously by checkpointer and bgwriter. The reason synchronous_commit = off is so much faster is that it skips the fsync(WAL) at commit — durability falls back to the next WAL flush boundary, which is the trade-off you accept.

4. PostgreSQL vs Other RDBMS

Four open / source-available OLTP databases that come up in the same room as PostgreSQL. The differences are real and shape what a workload feels like — connection pooling needs, index design, replication options, on-call playbook — even when the SQL surface looks the same.

	PostgreSQL	MySQL (InnoDB)	SQLite	DuckDB
Process model	Process per connection (postmaster `fork()`s a backend)	Thread per connection (one server process, N threads)	In-process library (no server)	In-process library (no server)
Storage layout	Heap (row store) · secondary indexes hold ctid	Clustered index on primary key · secondary indexes hold PK	Single-file row store · B-tree	Column store (vectorised)
Concurrency	MVCC · readers never block writers · UPDATE creates a new tuple version	MVCC via UNDO log · UPDATE writes in-place + UNDO entry	Database-level write lock (WAL mode allows readers during writer)	Single-writer multi-reader inside one process
Durability	WAL (write-ahead log) · fsync on commit	InnoDB redo log · doublewrite buffer for page tearing	WAL or rollback journal mode	WAL (since 0.9) · optimised for bulk OLAP
Replication	Physical (streaming WAL) + logical (decoded WAL → publisher/subscriber)	Statement / row binlog · semi-sync / async	External (litestream, rqlite) — not first-class	Not built-in (single-node analytical engine)
Extensibility	Procedural languages (PL/pgSQL, PL/Python, PL/V8), C extensions, custom data types / index access methods (GIN, GiST, BRIN, SP-GiST)	UDFs, stored procedures · pluggable storage engines (InnoDB, MyISAM, NDB)	Loadable extensions (FTS5, JSON1, R*Tree)	Loadable extensions (httpfs, parquet, json) · UDFs in Python / R
Use-case sweet spot	General-purpose OLTP · mixed workloads · "default SQL database when in doubt"	Web-tier OLTP at scale · clustered-index heavy reads	Embedded / on-device / single-writer apps	Single-node analytics on columnar files
License	PostgreSQL License (BSD-style, permissive)	GPLv2 (server) · proprietary (Oracle)	Public Domain	MIT
Governance	PostgreSQL Global Development Group (PGDG) · mailing-list driven · no single vendor	Oracle (since 2010) · pluggable storage engines from other vendors (Percona, MariaDB fork)	One maintainer (Hwaci) · external contributions not accepted	DuckDB Labs (Mark Raasveldt, Hannes Mühleisen) + CWI Amsterdam
Contribution workflow	Patches → pgsql-hackers mailing list → CommitFest review → committer pushes	GitHub PR to mysql/mysql-server · MySQL Worklog (WL) for design	Not accepting external contributions	GitHub PR to duckdb/duckdb · very fast turnaround

5. CPU / Memory / Network at Runtime

A running PostgreSQL cluster's resource profile is dominated by three things: connection count (because each is a real OS process), shared_buffers + work_mem × concurrency (RAM), and WAL throughput (the fsync hot path). Almost every tuning conversation reduces to a knob in one of these three categories.

CPU / process model

One OS process per connection. The cost of an idle backend is small (a few MB of private heap + page tables), the cost of a busy backend is whatever its query is doing. There is no thread pool to size — concurrency is sized by max_connections and by the OS limit on processes.
No internal thread pool. A single query uses one backend; parallel query (parallel_workers) lets a backend spawn worker backends for that one query but they exit when it finishes.
Auxiliary processes (walwriter, bgwriter, checkpointer, autovacuum workers) are mostly idle, with periodic CPU spikes — checkpointer's spike is the most visible, controlled by checkpoint_completion_target which spreads writes across the interval.
Connection poolers (PgBouncer, pgpool-II) sit between clients and PostgreSQL to amortise the fork() + per-process RAM cost when applications open and close many short-lived connections. Almost every real deployment uses one.

Memory layout

shared_buffers — one fixed segment allocated at startup; sized to ~25% of RAM is the usual starting point. Holds 8 KB pages of all relations being touched + the dictionary that maps page-id → slot. Backed by OS page cache underneath, which is why "more shared_buffers" past a certain point gives diminishing returns.
WAL buffers — small (16 MB default) ring of WAL pages waiting for walwriter or commit-driven flush. Almost never the bottleneck.
Per-backend private memory — work_mem per sort / hash operator (not per connection — a single query with 3 sorts can use 3× work_mem), plus relcache / catcache caches (typically a few MB per backend), plus temp_buffers for temp tables. Scaling formula: peak RAM ≈ shared_buffers + (avg per-backend × max_connections).
OS page cache — everything not in shared_buffers lives here. PostgreSQL deliberately keeps shared_buffers small relative to RAM so the OS can do its job.

Network traffic

Client ↔ backend — one TCP per client connection, long-lived if the application reuses the connection, otherwise fork() per request. PostgreSQL's wire protocol is a simple length-prefixed message format; libpq / JDBC / pgx all speak it directly.
Replication — physical replicas open one TCP per replica to the primary's walsender, which streams WAL records as they're flushed. Backpressure is the replica's recovery_min_apply_delay + how fast it can apply.
Logical replication / CDC — same walsender mechanism, but the WAL is decoded into row-level changes that match the subscriber's interest (table / publication filter). Used by every modern CDC tool (Debezium, pgoutput, wal2json).
Archive shipping — archive_command runs per completed 16 MB WAL segment, copies to off-cluster storage (S3 / GCS / NFS). PITR + base backup + archive = the standard backup recipe.

6. Source Tree (for contributors)

Official source: read-only GitHub mirror at github.com/postgres/postgres; the canonical git repository is git.postgresql.org/postgresql.git. PostgreSQL does not accept GitHub pull requests — every contribution goes through the pgsql-hackers@lists.postgresql.org mailing list and is tracked in the per-cycle CommitFest. The patch submission guide walks the workflow end-to-end. Build is ./configure && make (a parallel Meson build also lives in the tree); JDK is not relevant — most of the codebase is C, with bison + flex for the grammar.

Top-level modules

src/backend/ — the server. By far the largest tree. Sub-directories track the query lifecycle: parser/ (gram.y), analyzer/ + rewrite/, optimizer/ (planner), executor/. Plus storage/ (buffer manager, lock manager, smgr), access/ (heap, btree, gist, gin, brin, hash — the access methods), commands/ (DDL implementations), catalog/ (system tables + catalog cache), libpq/ (the wire protocol server-side), replication/ (walsender, walreceiver, logical decoding), tcop/ (the postmaster main loop), utils/ (date/time, hash, sort, memory contexts, the float comparison helpers used by every numeric opclass).
src/include/ — public + internal headers. Mirrors src/backend/'s tree. Touching a header here is how cross-module changes start.
src/bin/ — client-side tools (psql, pg_dump, pg_restore, pg_basebackup, initdb, pg_rewind, pg_upgrade, vacuumdb). Most contributors land their first patches here — small, well-bounded surface.
src/interfaces/libpq/ — the C client library that everything else (JDBC, psycopg, pgx) wraps or re-implements. The protocol contract lives here.
src/pl/ — procedural languages (plpgsql/, plpython/, plperl/, pltcl/). Each is its own loadable module.
src/test/ — regression infrastructure. regress/ is the main suite (SQL + expected output diffs, run via pg_regress), isolation/ covers concurrency, recovery/ exercises crash recovery, subscription/ covers logical replication, plus TAP tests under perl/.
contrib/ — bundled extensions that ship with the server tarball but are loaded explicitly via CREATE EXTENSION. postgres_fdw (PostgreSQL foreign data wrapper), pg_stat_statements, pgcrypto, btree_gin / btree_gist, hstore, ltree, pgbench, pageinspect, amcheck, dblink, fuzzystrmatch, isn, dict_int, ... Each has its own .c sources, SQL extension scripts, and a sql/ + expected/ regression test pair. Reviewers are more tolerant of contrib changes than core changes — this is the recommended on-ramp for a first contribution.
doc/src/sgml/ — the user-facing documentation in DocBook XML. Doc-only patches are a real entry path; many committers got their first commit by improving a confusing paragraph.
config/, configure, meson.build — autoconf + meson build systems. Rarely touched.

Version history

0.01 → 6.x (1995 – 1997) — derived from POSTGRES at Berkeley by Andrew Yu and Jolly Chen. SQL added (replacing the original QUEL).
7.x (2000 – 2004) — WAL (7.1), foreign keys (7.0), schemas (7.3), prepared statements (7.4). The era when PostgreSQL stopped being a research project.
8.x (2005 – 2009) — point-in-time recovery, native Windows port (8.0), two-phase commit, table partitioning via inheritance, common table expressions (8.4).
9.x (2010 – 2016) — streaming replication (9.0), JSON / JSONB (9.2 / 9.4), materialized views (9.3), foreign data wrappers, logical decoding (9.4), UPSERT (9.5), parallel query (9.6).
10 – 13 (2017 – 2020) — logical replication (publisher / subscriber), declarative partitioning (10 / 11), JIT (11), incremental sorts (13).
14 – 17 (2021 – 2024) — pipeline mode for libpq (14), SQL/JSON path (12 → 17 polish), MERGE (15), logical replication on standby (16), incremental backup (17), bgwriter / WAL improvements every cycle.
18 (current stable) — released 2025-09. Asynchronous I/O for many sequential paths, OAuth in libpq, more SQL/JSON, virtual columns.
19 (next, in development) — open CF /59/ window; my btree_gist NaN patch targets this cycle.

Branching & release model

The tree's main branch is master; each released major (REL_17_STABLE, REL_18_STABLE, ...) is a long-lived branch. Bug fixes are committed to master first and back-patched by the committer to every still-supported branch (typically the last five majors). Major releases ship roughly once a year in autumn (September / October). Minor releases ship four times a year (Feb / May / Aug / Nov) and contain bug fixes + security fixes only — no features. The release manager for each cycle is rotated; the release process is run on the pgsql-release@lists.postgresql.org internal list.

Sourcepostgres/postgres (read-only mirror)
Upstream gitgit.postgresql.org/postgresql.git
Docspostgresql.org/docs
Submit a patchwiki.postgresql.org/Submitting_a_Patch
CommitFestcommitfest.postgresql.org
My PRsPostgreSQL contributions
SiblingApache Kafka — Architecture, Message Flow, CLI
Blogdevbilllab.tistory.com