What is Write-Ahead Logging (WAL)?
Write-Ahead Logging (WAL) is a technique for ensuring data durability in databases. The principle is simple but powerful: before modifying any data, first write the intended changes to a persistent log.
The Golden Rule of WAL:
┌─────────────────────────────────────────────────────────────┐
│ "Never modify the data page before logging the change." │
└─────────────────────────────────────────────────────────────┘
This guarantees that if the system crashes at any point, the database can recover to a consistent state by replaying the log.
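In code, the golden rule boils down to a strict ordering. Here is a minimal sketch (the object names are illustrative, not a real API):
def write_with_wal(wal, data_pages, change):
    # 1. Describe the change and append it to the log
    wal.append(change)
    # 2. Force the log record to stable storage (the "ahead" in write-ahead)
    wal.fsync()
    # 3. Only now is it safe to modify the data itself
    data_pages.apply(change)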
Why WAL Matters
Consider what happens without WAL:
Without WAL - Disaster Scenario:
1. User: UPDATE balance = balance - 100 WHERE user = 'Alice'
2. User: UPDATE balance = balance + 100 WHERE user = 'Bob'
3. Database writes Alice's new balance to disk ✓
4. 💥 CRASH before writing Bob's balance
5. On restart: Alice lost $100, Bob gained nothing!
→ Money disappeared from the system
With WAL:
With WAL - Safe Recovery:
1. User: UPDATE balance = balance - 100 WHERE user = 'Alice'
2. User: UPDATE balance = balance + 100 WHERE user = 'Bob'
3. Database writes BOTH changes to WAL ✓
4. Database writes Alice's balance to disk ✓
5. 💥 CRASH before writing Bob's balance
6. On restart: Replay WAL → Apply Bob's update
→ Transaction fully committed, system consistent!
The ACID Connection
WAL is the mechanism behind the D in ACID (Durability), and it supports the other properties as well:
| ACID Property | What It Means | How WAL Helps |
|---|---|---|
| Atomicity | All or nothing | Undo log enables rollback |
| Consistency | Valid state only | Recovery replays committed txns |
| Isolation | Transactions don't interfere | MVCC uses WAL records |
| Durability | Committed = permanent | WAL persists before data |
How WAL Works: Deep Dive
The Write Path
┌──────────────────────────────────────────────────────────────┐
│ WRITE OPERATION │
└──────────────────────────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────────────┐
│ 1. Generate WAL Record │
│ - Transaction ID │
│ - Operation type (INSERT, UPDATE, DELETE) │
│ - Before image (for UNDO) │
│ - After image (for REDO) │
│ - Timestamp / LSN (Log Sequence Number) │
└──────────────────────────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────────────┐
│ 2. Append to WAL Buffer (in memory) │
│ - Fast sequential write │
│ - Batched for efficiency │
└──────────────────────────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────────────┐
│ 3. Flush WAL to Disk (fsync) │
│ - On commit │
│ - Or when buffer is full │
│ - Or at configured intervals │
└──────────────────────────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────────────┐
│ 4. Apply Changes to Data Pages (in buffer pool) │
│ - May be deferred (lazy) │
│ - Pages marked "dirty" │
└──────────────────────────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────────────┐
│ 5. Background: Flush dirty pages to data files │
│ - Checkpoint process │
│ - Order doesn't matter (WAL protects us) │
└──────────────────────────────────────────────────────────────┘
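Putting steps 1-4 together, a single UPDATE looks roughly like the sketch below. Names such as make_record, wal_buffer, and buffer_pool are illustrative, not NexaDB's actual internals; step 5 runs independently in a background checkpointer.
def update(self, txn_id, page_id, offset, new_bytes):
    page = self.buffer_pool.get_page(page_id)

    # Step 1: build the WAL record with before/after images and a fresh LSN
    record = self.make_record(txn_id, 'UPDATE', page_id, offset,
                              before=page.read(offset, len(new_bytes)),
                              after=new_bytes)

    # Step 2: append to the in-memory WAL buffer (cheap sequential append)
    self.wal_buffer.append(record)

    # Step 3 happens later: fsync on COMMIT, when the buffer fills, or on a timer

    # Step 4: apply the change to the cached page and mark it dirty
    page.write(offset, new_bytes)
    page.lsn = record.lsn
    page.dirty = True
Notice that nothing here touches the data files; only the log needs to reach disk before the transaction can commit.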
WAL Record Structure
A typical WAL record contains:
from dataclasses import dataclass

@dataclass
class WALRecord:
    lsn: int               # Log Sequence Number (monotonically increasing)
    transaction_id: int    # Which transaction made this change
    operation: str         # INSERT, UPDATE, DELETE, COMMIT, ABORT
    table_id: int          # Which table was modified
    page_id: int           # Which page was modified
    offset: int            # Offset within the page
    before_image: bytes    # Old data (for UNDO)
    after_image: bytes     # New data (for REDO)
    timestamp: int         # When this happened
    checksum: int          # Data integrity verification
Log Sequence Numbers (LSN)
The LSN is the backbone of WAL ordering:
import threading

class LSNGenerator:
    """
    Generates monotonically increasing Log Sequence Numbers.
    LSN = unique identifier for each WAL record.
    """
    def __init__(self):
        self._current = 0
        self._lock = threading.Lock()

    def next(self) -> int:
        with self._lock:
            self._current += 1
            return self._current
Each data page tracks its PageLSN - the LSN of the most recent WAL record that modified it:
Page Header:
┌───────────────────────────────────────┐
│ PageLSN: 1000 │ ← Last WAL record that touched this page
│ Checksum: 0xABCD1234 │
│ Free Space: 2048 bytes │
├───────────────────────────────────────┤
│ Data... │
└───────────────────────────────────────┘
WAL: [LSN 998] [LSN 999] [LSN 1000] [LSN 1001] ...
↑
This record modified the page
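The PageLSN is what makes replay idempotent: a logged change is reapplied only if the page has not already absorbed it. A minimal sketch of the check (apply_change is a hypothetical helper):
def maybe_redo(page, record):
    if page.lsn >= record.lsn:
        return                              # page already contains this change; skip it
    apply_change(page, record.after_image)  # reapply the change
    page.lsn = record.lsn                   # advance the PageLSN past this record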
Crash Recovery: ARIES Algorithm
Most relational databases use ARIES (Algorithms for Recovery and Isolation Exploiting Semantics) or a recovery protocol derived from it:
The Three Phases
┌─────────────────────────────────────────────────────────────┐
│ ARIES RECOVERY │
├─────────────────────────────────────────────────────────────┤
│ │
│ Phase 1: ANALYSIS │
│ ───────────────── │
│ • Scan WAL from last checkpoint │
│ • Build transaction table (active txns at crash) │
│ • Build dirty page table (pages not flushed) │
│ • Determine REDO start point │
│ │
├─────────────────────────────────────────────────────────────┤
│ │
│ Phase 2: REDO │
│ ───────────── │
│ • Replay WAL from REDO start point │
│ • Reapply ALL changes (even if page on disk) │
│ • Use LSN comparison to skip already-applied changes │
│ • Brings database to exact crash state │
│ │
├─────────────────────────────────────────────────────────────┤
│ │
│ Phase 3: UNDO │
│ ────────── │
│ • Roll back uncommitted transactions │
│ • Walk backward through WAL │
│ • Apply before-images to undo changes │
│ • Write compensation log records (CLRs) │
│ │
└─────────────────────────────────────────────────────────────┘
Implementation
import heapq

class ARIESRecovery:
    def recover(self):
        # Phase 1: Analysis
        active_txns, dirty_pages, redo_lsn = self.analysis_phase()
        # Phase 2: Redo
        self.redo_phase(redo_lsn)
        # Phase 3: Undo
        self.undo_phase(active_txns)
        print("Recovery complete. Database is consistent.")

    def analysis_phase(self):
        """Scan WAL to understand crash state."""
        active_txns = {}   # txn_id -> last_lsn
        dirty_pages = {}   # page_id -> rec_lsn (first LSN that dirtied it)
        # Start from last checkpoint
        checkpoint_lsn = self.read_checkpoint()
        for record in self.wal.scan_from(checkpoint_lsn):
            if record.operation == 'BEGIN':
                active_txns[record.transaction_id] = record.lsn
            elif record.operation in ('COMMIT', 'ABORT'):
                active_txns.pop(record.transaction_id, None)
            else:
                # Data modification
                active_txns[record.transaction_id] = record.lsn
                if record.page_id not in dirty_pages:
                    dirty_pages[record.page_id] = record.lsn
        # REDO starts from min of all rec_lsn values
        redo_lsn = min(dirty_pages.values()) if dirty_pages else checkpoint_lsn
        return active_txns, dirty_pages, redo_lsn

    def redo_phase(self, start_lsn):
        """Replay WAL to restore crash state."""
        for record in self.wal.scan_from(start_lsn):
            if record.operation in ('INSERT', 'UPDATE', 'DELETE'):
                page = self.buffer_pool.get_page(record.page_id)
                # Only redo if the page hasn't seen this change yet
                if page.lsn < record.lsn:
                    self.apply_redo(page, record)
                    page.lsn = record.lsn

    def undo_phase(self, active_txns):
        """Roll back uncommitted transactions."""
        # Max-heap of LSNs to undo (negated so the largest LSN comes out first)
        to_undo = [-lsn for lsn in active_txns.values()]
        heapq.heapify(to_undo)
        while to_undo:
            lsn = -heapq.heappop(to_undo)
            record = self.wal.read(lsn)
            if record.operation in ('INSERT', 'UPDATE', 'DELETE'):
                # Apply the undo using the before-image
                page = self.buffer_pool.get_page(record.page_id)
                self.apply_undo(page, record)
                # Write a Compensation Log Record
                self.write_clr(record)
            # Follow the transaction's undo chain backward
            if record.prev_lsn:
                heapq.heappush(to_undo, -record.prev_lsn)
Checkpointing: Limiting Recovery Time
Without checkpoints, recovery would replay the entire WAL from the beginning. Checkpoints limit how far back we need to go.
Types of Checkpoints
1. Consistent Checkpoint (Quiesce)
Steps:
1. Stop accepting new transactions
2. Wait for all active transactions to complete
3. Flush all dirty pages to disk
4. Write checkpoint record to WAL
5. Resume operations
Pros: Simple recovery
Cons: Database unavailable during checkpoint (can be minutes!)
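A rough sketch of a quiesce checkpoint, with illustrative names (block_new_transactions, CheckpointRecord, and friends are not a real API):
def consistent_checkpoint(self):
    self.txn_manager.block_new_transactions()         # 1. stop new work
    self.txn_manager.wait_for_active_transactions()   # 2. drain in-flight txns
    for page_id in self.buffer_pool.get_dirty_pages():
        self.buffer_pool.flush_page(page_id)          # 3. flush every dirty page
    lsn = self.wal.append(CheckpointRecord())         # 4. log the checkpoint
    self.wal.fsync()
    self.update_checkpoint_pointer(lsn)
    self.txn_manager.unblock_transactions()           # 5. resume operations
Everything before the checkpoint record is now safely on disk, so recovery never needs to look further back than this LSN.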
2. Fuzzy Checkpoint (Non-blocking)
Steps:
1. Write BEGIN_CHECKPOINT to WAL
2. Record active transactions and dirty pages
3. Continue normal operations (don't block!)
4. Asynchronously flush dirty pages
5. Write END_CHECKPOINT to WAL
Pros: No blocking
Cons: More complex recovery
NexaDB's Checkpoint Strategy
import time

class CheckpointManager:
    def __init__(self, wal, buffer_pool, config):
        self.wal = wal
        self.buffer_pool = buffer_pool
        self.checkpoint_interval = config.get('checkpoint_interval', 300)  # 5 min
        self.max_wal_size = config.get('max_wal_size', 1024 * 1024 * 1024)  # 1GB
        self.last_checkpoint = time.time()

    def should_checkpoint(self):
        """Determine if a checkpoint is needed."""
        # Time-based
        if time.time() - self.last_checkpoint > self.checkpoint_interval:
            return True
        # WAL size-based (prevent unbounded growth)
        if self.wal.size() > self.max_wal_size:
            return True
        return False

    def fuzzy_checkpoint(self):
        """Perform a non-blocking fuzzy checkpoint (a generator the caller drives incrementally)."""
        # Record start
        begin_lsn = self.wal.append(CheckpointBegin(
            active_txns=self.get_active_transactions(),
            dirty_pages=self.buffer_pool.get_dirty_pages()
        ))
        # Flush dirty pages, yielding between pages so other operations can proceed
        for page_id in self.buffer_pool.get_dirty_pages():
            self.buffer_pool.flush_page(page_id)
            yield
        # Record end
        self.wal.append(CheckpointEnd(begin_lsn=begin_lsn))
        # Update master checkpoint record
        self.update_checkpoint_pointer(begin_lsn)
        # Truncate old WAL segments
        self.wal.truncate_before(begin_lsn)
        self.last_checkpoint = time.time()
WAL Durability Levels
Different applications need different durability guarantees:
Level 1: Full Durability (fsync on commit)
def commit_full_durability(self, txn_id):
    """Guaranteed durability - data survives any crash."""
    record = CommitRecord(txn_id=txn_id, timestamp=time.time())
    self.wal.append(record)
    self.wal.fsync()  # Force to disk
    return True

# Latency: ~1-10ms per commit (disk dependent)
# Guarantees: Committed transaction NEVER lost
# Use for: Financial transactions, critical data
Level 2: Group Commit (batched fsync)
import threading

class GroupCommit:
    """Batch multiple commits into a single fsync."""
    def __init__(self, wal, max_delay_ms=10, max_batch_size=100):
        self.wal = wal
        self.pending = []
        self.max_delay = max_delay_ms / 1000
        self.max_batch = max_batch_size
        self.lock = threading.Lock()
        self.commit_event = threading.Event()

    def commit(self, txn_id):
        with self.lock:
            record = CommitRecord(txn_id=txn_id)
            self.wal.append(record)
            self.pending.append(txn_id)
            event = self.commit_event  # the event for the batch we joined
            if len(self.pending) >= self.max_batch:
                self._flush()
                return True
            # Otherwise schedule a flush in case the batch never fills up
            threading.Timer(self.max_delay, self._maybe_flush).start()
        # Wait outside the lock until some flush covers this commit
        event.wait()
        return True

    def _maybe_flush(self):
        with self.lock:
            self._flush()

    def _flush(self):
        # Caller must hold self.lock
        if self.pending:
            self.wal.fsync()
            self.pending.clear()
            # Wake current waiters, then install a fresh event for the next batch
            self.commit_event.set()
            self.commit_event = threading.Event()

# Latency: ~1-10ms amortized across batch
# Throughput: 10-100x higher than individual fsync
# Use for: High-throughput applications with slight latency tolerance
Level 3: Async Durability (periodic fsync)
def commit_async(self, txn_id):
    """Fast commit, eventual durability."""
    record = CommitRecord(txn_id=txn_id, timestamp=time.time())
    self.wal.append(record)  # In-memory only
    return True

# Background thread:
def background_sync(self):
    while True:
        time.sleep(0.01)  # 10ms
        self.wal.fsync()

# Latency: <0.1ms per commit
# Risk: Up to 10ms of data loss on crash
# Use for: Caching layers, non-critical data, metrics
NexaDB Configuration
# nexadb.yaml
wal:
sync_mode: "group" # full | group | async
group_commit_delay: 10 # ms
group_commit_size: 100 # max transactions per batch
checkpoint_interval: 300 # seconds
max_wal_size: 1GB
WAL in Different Databases
PostgreSQL: Full-Featured WAL
PostgreSQL WAL Features:
├── Full ARIES recovery
├── Streaming replication (replicas read WAL)
├── Point-in-Time Recovery (PITR)
├── WAL archiving
├── Logical decoding (change data capture)
└── Configurable sync levels
MySQL/InnoDB: Redo Log + Doublewrite
InnoDB adds a "doublewrite buffer" for torn page protection:
1. Write pages to doublewrite buffer (sequential)
2. fsync doublewrite buffer
3. Write pages to actual locations
4. On recovery: if page is torn, restore from doublewrite
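In rough Python-style pseudocode (illustrative names, not InnoDB's actual code), the recovery-side check looks like this:
def recover_page(page_id, data_file, doublewrite_buffer):
    page = data_file.read_page(page_id)
    if page.checksum_valid():
        return page                        # page is intact; nothing to do
    # Torn write: the copy in the doublewrite buffer was fsynced first,
    # so a complete copy is guaranteed to exist there
    clean_copy = doublewrite_buffer.find(page_id)
    data_file.write_page(page_id, clean_copy)
    return clean_copy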
SQLite: WAL Mode
-- Enable WAL mode
PRAGMA journal_mode=WAL;
-- SQLite WAL benefits:
-- • Readers don't block writers
-- • Writers don't block readers
-- • Faster for many workloads
-- • Automatic checkpointing
LSM-Tree Databases (LevelDB, RocksDB, NexaDB)
In LSM-trees, WAL protects the MemTable:
Write → WAL → MemTable → (flush) → SSTable
If crash before flush:
→ Replay WAL to rebuild MemTable
→ Continue from there
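A hedged sketch of that startup path for an LSM engine (MemTable and max_persisted_lsn are illustrative names, not NexaDB's exact internals):
def open_lsm_store(wal, sstables):
    memtable = MemTable()
    # Everything already flushed to SSTables is durable; only the MemTable
    # contents were lost in the crash, so rebuild them from the WAL
    last_flushed_lsn = sstables.max_persisted_lsn()
    for record in wal.scan_from(last_flushed_lsn + 1):
        memtable.put(record.key, record.value, record.lsn)
    return memtable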
NexaDB optimization: Parallel WAL writing during MemTable flush
Implementation: Building a WAL
Here's a simplified but complete WAL implementation:
import struct
import os
import zlib
from dataclasses import dataclass
from typing import Iterator

@dataclass
class WALRecord:
    lsn: int
    transaction_id: int
    operation: int    # 1=INSERT, 2=UPDATE, 3=DELETE, 4=COMMIT, 5=ABORT
    key: bytes
    value: bytes
    checksum: int

    # Fixed header: lsn(8) + txn_id(8) + op(4) + key_len(4) + val_len(4) = 28 bytes,
    # followed by key, value, and a trailing 4-byte checksum
    HEADER_SIZE = 28

    def serialize(self) -> bytes:
        key_len = len(self.key)
        val_len = len(self.value)
        header = struct.pack(
            '<QQIII',
            self.lsn,
            self.transaction_id,
            self.operation,
            key_len,
            val_len
        )
        data = header + self.key + self.value
        checksum = self._calculate_checksum(data)
        return data + struct.pack('<I', checksum)

    @classmethod
    def deserialize(cls, data: bytes) -> 'WALRecord':
        # '<QQIII' is 28 bytes, matching HEADER_SIZE
        lsn, txn_id, op, key_len, val_len = struct.unpack('<QQIII', data[:cls.HEADER_SIZE])
        offset = cls.HEADER_SIZE
        key = data[offset:offset + key_len]
        offset += key_len
        value = data[offset:offset + val_len]
        offset += val_len
        checksum = struct.unpack('<I', data[offset:offset + 4])[0]
        # Verify checksum
        expected = cls._calculate_checksum(data[:offset])
        if checksum != expected:
            raise ValueError("WAL record corrupted: checksum mismatch")
        return cls(lsn, txn_id, op, key, value, checksum)

    @staticmethod
    def _calculate_checksum(data: bytes) -> int:
        # CRC32 fits the 4-byte slot and matches the "CRC checksums" design
        return zlib.crc32(data) & 0xFFFFFFFF
class WriteAheadLog:
    """
    A durable, append-only log for crash recovery.

    Features:
    - Append-only writes (no seeking, fast sequential I/O)
    - CRC checksums for corruption detection
    - Segment-based storage for easier management
    - Configurable sync modes
    """
    SEGMENT_SIZE = 64 * 1024 * 1024  # 64MB segments

    def __init__(self, directory: str, sync_mode: str = 'group'):
        self.directory = directory
        self.sync_mode = sync_mode
        self.current_segment = None
        self.current_lsn = 0
        self.segments = []
        os.makedirs(directory, exist_ok=True)
        self._recover_segments()
        self._open_current_segment()

    def _recover_segments(self):
        """Find and recover existing WAL segments."""
        for filename in sorted(os.listdir(self.directory)):
            if filename.startswith('wal_') and filename.endswith('.log'):
                segment_id = int(filename[4:-4])
                self.segments.append(segment_id)
                # Find highest LSN
                for record in self._scan_segment(segment_id):
                    self.current_lsn = max(self.current_lsn, record.lsn)

    def _open_current_segment(self):
        """Open or create the current segment file."""
        if not self.segments:
            self.segments.append(0)
        segment_id = self.segments[-1]
        path = os.path.join(self.directory, f'wal_{segment_id:010d}.log')
        self.current_segment = open(path, 'ab')
        self.current_segment.seek(0, os.SEEK_END)  # position at end so tell() reports the size
        # Check if we need a new segment
        if self.current_segment.tell() >= self.SEGMENT_SIZE:
            self._rotate_segment()

    def _rotate_segment(self):
        """Create a new segment file."""
        if self.current_segment:
            self.current_segment.close()
        new_segment_id = self.segments[-1] + 1
        self.segments.append(new_segment_id)
        path = os.path.join(self.directory, f'wal_{new_segment_id:010d}.log')
        self.current_segment = open(path, 'ab')

    def append(self, transaction_id: int, operation: int,
               key: bytes, value: bytes = b'') -> int:
        """
        Append a record to the WAL.
        Returns the LSN of the new record.
        """
        self.current_lsn += 1
        lsn = self.current_lsn
        record = WALRecord(
            lsn=lsn,
            transaction_id=transaction_id,
            operation=operation,
            key=key,
            value=value,
            checksum=0
        )
        data = record.serialize()
        # Write length prefix + data
        self.current_segment.write(struct.pack('<I', len(data)))
        self.current_segment.write(data)
        # Sync based on mode
        if self.sync_mode == 'full':
            self.fsync()
        # Rotate if needed
        if self.current_segment.tell() >= self.SEGMENT_SIZE:
            self._rotate_segment()
        return lsn

    def fsync(self):
        """Force WAL to disk."""
        self.current_segment.flush()
        os.fsync(self.current_segment.fileno())

    def scan_from(self, start_lsn: int) -> Iterator[WALRecord]:
        """Scan WAL records starting from a given LSN."""
        for segment_id in self.segments:
            for record in self._scan_segment(segment_id):
                if record.lsn >= start_lsn:
                    yield record

    def _scan_segment(self, segment_id: int) -> Iterator[WALRecord]:
        """Scan all records in a segment."""
        path = os.path.join(self.directory, f'wal_{segment_id:010d}.log')
        if not os.path.exists(path):
            return
        with open(path, 'rb') as f:
            while True:
                length_data = f.read(4)
                if len(length_data) < 4:
                    break
                length = struct.unpack('<I', length_data)[0]
                record_data = f.read(length)
                if len(record_data) < length:
                    break  # Incomplete record (crash during write)
                try:
                    yield WALRecord.deserialize(record_data)
                except ValueError:
                    break  # Corrupted record

    def truncate_before(self, lsn: int):
        """Remove WAL segments that are fully before the given LSN."""
        segments_to_remove = []
        for segment_id in self.segments[:-1]:  # Never remove the current segment
            max_lsn = 0
            for record in self._scan_segment(segment_id):
                max_lsn = max(max_lsn, record.lsn)
            if max_lsn < lsn:
                segments_to_remove.append(segment_id)
        for segment_id in segments_to_remove:
            path = os.path.join(self.directory, f'wal_{segment_id:010d}.log')
            os.remove(path)
            self.segments.remove(segment_id)

    def close(self):
        """Close the WAL."""
        if self.current_segment:
            self.fsync()
            self.current_segment.close()
Performance Considerations
1. Sequential vs Random I/O
WAL writes are sequential:
┌─────────────────────────────────────────────────┐
│ HDD: Sequential Write = 100-200 MB/s │
│ HDD: Random Write = 0.5-2 MB/s │
│ Improvement: 50-400x faster! │
├─────────────────────────────────────────────────┤
│ SSD: Sequential Write = 500-3000 MB/s │
│ SSD: Random Write = 50-500 MB/s │
│ Improvement: 5-10x faster │
└─────────────────────────────────────────────────┘
2. Fsync Costs
# Benchmark: fsync latency by storage type
results = {
    'RAM Disk': '0.01ms',
    'NVMe SSD': '0.05-0.1ms',
    'SATA SSD': '0.1-0.5ms',
    'HDD': '2-10ms',
    'Network Storage': '1-100ms'
}
# Takeaway: Group commit is essential for HDDs
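You can get a rough number for your own hardware with a quick (and unscientific) micro-benchmark; the figures above are typical ranges, not guarantees:
import os
import tempfile
import time

def measure_fsync(iterations=100):
    fd, path = tempfile.mkstemp()
    try:
        start = time.perf_counter()
        for _ in range(iterations):
            os.write(fd, b'x' * 4096)   # small sequential write
            os.fsync(fd)                # force it to stable storage
        return (time.perf_counter() - start) / iterations * 1000  # avg ms per fsync
    finally:
        os.close(fd)
        os.remove(path)

print(f"avg fsync latency: {measure_fsync():.3f} ms")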
3. WAL Size Management
Problem: WAL grows forever without checkpoints
Solution: Regular checkpointing + WAL truncation
NexaDB approach:
├── Checkpoint every 5 minutes
├── Checkpoint when WAL > 1GB
├── Keep minimum 2 segments for safety
└── Async page flushing during checkpoint
Common Issues and Solutions
Issue 1: WAL Growing Too Large
# Increase checkpoint frequency
wal:
  checkpoint_interval: 60   # Every minute
  max_wal_size: 256MB       # Trigger checkpoint at 256MB
Issue 2: Slow Commits (fsync bottleneck)
# Enable group commit
wal:
  sync_mode: group
  group_commit_delay: 5     # 5ms batching window
  group_commit_size: 200    # Max 200 txns per batch
Issue 3: Recovery Taking Too Long
# More frequent checkpoints = faster recovery
wal:
  checkpoint_interval: 60   # Checkpoint every minute

# Recovery replays at most ~1 minute of WAL (plus undo of any long-running transactions)
Issue 4: Torn Page Writes
# On power failure, a page write might be partial
# Solution: Full page images in WAL after checkpoint
def write_with_fpi(self, page_id, changes):
    page = self.buffer.get(page_id)
    # If this is the first modification since the last checkpoint, log the full page
    if page.lsn <= self.last_checkpoint_lsn:
        self.wal.append_full_page_image(page_id, page.data)
    # Apply changes and log them
    for change in changes:
        self.apply_change(page, change)
        self.wal.append(change)
NexaDB's WAL Implementation
NexaDB uses a high-performance WAL optimized for LSM-Tree storage:
import time
from concurrent.futures import ThreadPoolExecutor

class NexaDBWAL:
    """
    NexaDB's production WAL implementation.

    Features:
    - Lock-free concurrent appends
    - Automatic group commit
    - Parallel recovery
    - Compression (optional)
    """
    def __init__(self, config):
        self.directory = config['wal_directory']
        self.sync_mode = config.get('sync_mode', 'group')
        self.segment_size = config.get('segment_size', 64 * 1024 * 1024)
        # Group commit settings
        self.group_delay = config.get('group_commit_delay', 0.01)
        self.group_size = config.get('group_commit_size', 100)
        # Performance stats
        self.stats = {
            'writes': 0,
            'syncs': 0,
            'bytes_written': 0
        }

    def write(self, operation):
        """
        Append an operation to the WAL.
        Thread-safe, uses lock-free techniques where possible.
        """
        lsn = self._allocate_lsn()
        record = self._serialize(lsn, operation)
        self._append_to_buffer(record)
        self.stats['writes'] += 1
        self.stats['bytes_written'] += len(record)
        if self.sync_mode == 'full':
            self._sync_now()
        return lsn

    def recover(self):
        """
        Replay the WAL after a crash.
        Uses parallel recovery when multiple segments exist.
        """
        print("Starting WAL recovery...")
        start = time.time()
        # Parallel scan of segments
        with ThreadPoolExecutor() as executor:
            records = list(executor.map(self._scan_segment, self.segments))
        # Merge and sort by LSN
        all_records = sorted(
            [r for segment in records for r in segment],
            key=lambda r: r.lsn
        )
        # Replay
        for record in all_records:
            self._apply_record(record)
        elapsed = time.time() - start
        print(f"WAL recovery complete: {len(all_records)} records in {elapsed:.2f}s")

# Usage in NexaDB:
# Every write goes through the WAL first
def put(self, key, value):
    # 1. Log to WAL
    lsn = self.wal.write(PutOperation(key, value))
    # 2. Apply to MemTable
    self.memtable.put(key, value, lsn)
    # 3. Return success (data is durable in WAL)
    return True
Conclusion
Write-Ahead Logging is the unsung hero of database reliability. By understanding WAL, you gain insight into:
- Why commits are fast (an in-memory write plus a sequential log append)
- How databases survive crashes (ARIES recovery)
- Why checkpointing matters (it bounds recovery time)
- Performance tradeoffs (durability vs latency)
Key takeaways:
- WAL before data: Never modify data pages before logging
- LSN ordering: Every change has a unique sequence number
- Checkpoints: Limit how much WAL needs replaying
- Group commit: Essential for high throughput
- Recovery: Analysis → Redo → Undo
NexaDB implements a modern WAL with group commit, parallel recovery, and tight integration with its LSM-Tree storage engine.
Learn more about NexaDB's internals: LSM-Tree Storage Engine | Getting Started
Further Reading
- ARIES: A Transaction Recovery Method (IBM Research)
- PostgreSQL WAL Internals
- RocksDB Wiki: Write Ahead Log
- SQLite WAL Mode
Questions about WAL or database durability? Join our Discord community or explore NexaDB's source code.