Wednesday, 25 February 2026

How does a DB read data from disk? (WAL/Page/Offset)

In a DB, what's the role of the redo/undo logs and the buffer? Which comes first, what kind of info do they save, and where do they reside?

Great question — this is core database internals. Since you're into system design and storage engines, let’s break this down from first principles.

We’ll answer:

  1. What is the buffer?

  2. What are redo logs?

  3. What are undo logs?

  4. What happens first during a write?

  5. Where do they reside?

  6. Why are all three needed?


1️⃣ Buffer (Buffer Pool)

Think of the buffer pool as:

📦 An in-memory cache of database pages (usually 8KB or 16KB blocks).

Databases like:

  • MySQL (InnoDB)

  • PostgreSQL

  • Oracle Database

do not modify disk directly.

Instead:

Disk → Buffer Pool (RAM) → Modified in memory → Later flushed to disk

What buffer stores

  • Data pages (rows, indexes)

  • Recently accessed pages

  • Dirty pages (modified but not flushed)

Why buffer exists

  • Disk is slow

  • RAM is fast

  • Most DB performance comes from avoiding disk I/O
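
As a rough illustration of what the buffer pool does, here is a minimal cache sketch (the `BufferPool` class and its method names are hypothetical, not any real engine's API):

```python
PAGE_SIZE = 8192

class BufferPool:
    """Minimal in-memory cache of fixed-size pages (illustrative sketch)."""

    def __init__(self, disk):
        self.disk = disk          # file-like object opened in binary mode
        self.pages = {}           # page_id -> bytearray (the cached page)
        self.dirty = set()        # page_ids modified but not yet flushed

    def get_page(self, page_id):
        # Serve from RAM if cached; otherwise do exactly one disk read.
        if page_id not in self.pages:
            self.disk.seek(page_id * PAGE_SIZE)
            self.pages[page_id] = bytearray(self.disk.read(PAGE_SIZE))
        return self.pages[page_id]

    def mark_dirty(self, page_id):
        # Page was modified in memory; flushing happens later, asynchronously.
        self.dirty.add(page_id)
```

A real pool also evicts pages (LRU or clock) when RAM fills up; that is omitted here.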


2️⃣ Redo Log (Write-Ahead Log)

Redo log = durability guarantee

“If DB crashes, I can redo committed changes.”

It stores:

  • Physical changes (page-level or record-level)

  • What was modified

  • Enough information to reapply the change

Example:

UPDATE users SET balance = 500 WHERE id = 10;

Redo log stores something like:

Page 42
Offset 128
Old: 300
New: 500

Key concept: Write-Ahead Logging (WAL)

Before a transaction is marked COMMITTED:

Redo log must be written to disk first

This guarantees durability.

PostgreSQL literally calls it:

WAL (Write Ahead Log)
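
The write-ahead rule can be sketched in a few lines. `append_redo_and_commit` is a hypothetical helper, and one JSON line per record is a simplification; real engines write compact binary records:

```python
import json
import os

def append_redo_and_commit(log_path, records):
    """Append redo records for a transaction and force them to disk
    BEFORE the commit is acknowledged (the write-ahead rule, sketched)."""
    with open(log_path, "a") as log:
        for rec in records:               # e.g. {"lsn": 1001, "page": 42, ...}
            log.write(json.dumps(rec) + "\n")
        log.flush()                       # push from Python's buffer to the OS
        os.fsync(log.fileno())            # force OS buffers to stable storage
    return "COMMITTED"                    # only now may the client see success
```

The `fsync` is the whole point: once it returns, the redo record survives a crash, so the commit can be acknowledged even though the data page is still only in RAM.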


3️⃣ Undo Log

Undo log = rollback + MVCC support

It stores:

  • The previous version of the row

  • So we can:

    • Rollback transactions

    • Provide consistent reads (MVCC)

Example:

Before update:

balance = 300

Undo log stores:

Old value = 300

If rollback happens:
→ DB restores 300

In MVCC systems (like MySQL InnoDB, PostgreSQL), undo logs allow:

  • Transaction A updates row

  • Transaction B still sees old version

  • Without blocking

That’s how high concurrency works.
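
A minimal sketch of how undo versions make this possible. `RowVersion` and `visible_value` are hypothetical names, and the visibility rule here is heavily simplified compared to real MVCC:

```python
class RowVersion:
    """One version of a row; `prev` points into the undo chain."""
    def __init__(self, value, created_by_txn, prev=None):
        self.value = value
        self.created_by_txn = created_by_txn
        self.prev = prev

def visible_value(latest, reader_snapshot):
    """Walk back through undo versions until we hit one created by a
    transaction the reader is allowed to see (simplified visibility)."""
    v = latest
    while v is not None and v.created_by_txn > reader_snapshot:
        v = v.prev          # the older version lives in the undo log
    return v.value if v else None
```

Transaction B (snapshot 5) keeps reading 300 from the undo chain while transaction 9's update to 500 sits at the head, and neither blocks the other.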


4️⃣ What Happens First During UPDATE?

Let’s walk through it step by step.

Suppose:

UPDATE users SET balance = 500 WHERE id = 10;

Step 1: Load page into Buffer (if not already)

Disk → Buffer Pool

Step 2: Create Undo Log

Before changing anything:

  • Save old value (300) into undo log

Step 3: Modify Buffer Page

Change balance to 500 (in memory)

Page is now:

Dirty page

Step 4: Write Redo Log to Disk

Write change record to redo log file on disk

Step 5: Commit

Once redo log is safely on disk:
→ Transaction is committed

Step 6: Later (asynchronously)

Background process flushes dirty buffer pages to disk


🔥 Important Order (Critical for Interviews)

1) Undo log written
2) Data page modified in buffer
3) Redo log written to disk
4) Commit acknowledged
5) Data page flushed later

This is called:

STEAL + NO-FORCE policy

Used by almost all modern databases.
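
The ordering above can be sketched end to end, with dicts and lists standing in for the buffer pool and the logs (`update_row` is a hypothetical helper, not any engine's API):

```python
def update_row(buffer_pool, undo_log, redo_log, page_id, offset, new_value):
    """Sketch of the STEAL + NO-FORCE write order described above."""
    page = buffer_pool[page_id]                    # 1) page is in the buffer
    old_value = page[offset]
    undo_log.append((page_id, offset, old_value))  # 2) undo gets the OLD value first
    page[offset] = new_value                       # 3) modify in memory: dirty page
    redo_log.append((page_id, offset, new_value))  # 4) redo gets the NEW value
    return "COMMIT OK"                             # 5) ack only once redo is durable
    # 6) (not shown) a background writer flushes the dirty page later
```

Note the asymmetry: undo stores the old value (for rollback), redo stores the new value (for replay).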


5️⃣ Where Do They Reside?

Component      Memory or Disk?                  Purpose
Buffer Pool    RAM                              Fast data access
Undo Log       Disk (often separate segment)    Rollback + MVCC
Redo Log       Disk (sequential file)           Crash recovery

Why is the redo log sequential?

Because sequential disk writes are fast.

Databases append to redo log like:

LSN 1001 → LSN 1002 → LSN 1003 → ...

LSN = Log Sequence Number


6️⃣ What Happens During Crash?

Imagine crash after commit but before data page flush.

Data page on disk:

balance = 300

Redo log says:

Change to 500

During restart:

  • DB scans redo log

  • Reapplies changes

  • Balance becomes 500

That’s durability.


7️⃣ Why We Need All Three

If only buffer:
→ Crash loses data

If only redo:
→ Can't rollback

If only undo:
→ Can't recover committed changes

Together they give:

Property       Mechanism
Atomicity      Undo log
Durability     Redo log
Isolation      Undo (MVCC)
Performance    Buffer pool

That’s basically ACID implemented internally.


🧠 First-Principles Summary

Database write path is optimized for:

  1. Fast writes → write sequential redo log

  2. Crash safety → replay redo

  3. Concurrency → undo for MVCC

  4. Performance → buffer pool

So the real trick is:

Modify memory fast, log to disk safely, flush lazily.


If you want, next we can go even deeper into:

  • ARIES recovery algorithm

  • STEAL vs NO-STEAL

  • FORCE vs NO-FORCE

  • How LSN works internally

  • Checkpoints and fuzzy checkpoints

  • How PostgreSQL WAL differs from MySQL redo

This is proper storage-engine level knowledge.

You said: "During restart, the DB scans the redo log, reapplies changes, and the balance becomes 500." This is a question I've had for a long time. The redo log (WAL) might hold thousands of records (or are they removed as soon as the DB gets updated? not sure). So my question is: during a DB restart, how does the DB know, among thousands of records in the redo log, that it needs to update the balance to 500?

🔥 Excellent. This is the core recovery question.

You’re asking:

Redo log may have thousands/millions of records.
During restart, how does DB know which ones to apply?
Does it replay everything blindly?

This is where real storage engine intelligence begins.

Let’s break it carefully.


First: Redo Logs Are NOT Deleted Immediately

In systems like:

  • PostgreSQL

  • MySQL (InnoDB)

Redo logs:

  • Are written sequentially

  • Stored in log files on disk

  • Reused in circular fashion

  • Not deleted per transaction

So yes — logs can contain thousands or millions of records.


The Real Answer: LSN (Log Sequence Number)

Every redo log record has:

LSN = Log Sequence Number

Think of LSN as:

1001 1002 1003 1004 ...

Monotonically increasing.


🔑 Critical Concept: Each Data Page Stores Its Own LSN

Here’s the key magic:

Every page on disk stores:

pageLSN

That means:

“This page already includes all changes up to LSN X”


Now Let’s Walk Through Crash Recovery

Imagine:

Disk page shows:
balance = 300
pageLSN = 1000

Redo log contains:

LSN 1001 → balance = 400
LSN 1002 → balance = 500
LSN 1003 → some other row

Crash happens.


During Restart

The DB does:

Step 1: Find Last Checkpoint

Database does NOT start scanning from beginning of log.

It starts from:

Last checkpoint LSN

Checkpoint means:

“All changes before this are already safely on disk.”

So DB starts from checkpoint LSN.


Step 2: Replay Log Records After Checkpoint

Now for each redo record:

For each record:

  1. Load the page from disk

  2. Compare:

if pageLSN >= redoLSN:
    skip (already applied)
else:
    apply redo

That’s how it knows.


Your Specific Question:

How does DB know I need to update balance to 500?

Answer:

Because:

pageLSN < redo record LSN

So it knows:
→ This change was committed
→ But page was not flushed
→ So it must apply redo


⚡ Very Important

Redo logs contain:

  • Page ID

  • Offset

  • New value

  • LSN

So DB doesn’t guess.

It reads redo record like:

LSN 1002
Page 42
Offset 128
Write value = 500

Then checks page 42’s pageLSN.


What If There Are Thousands of Records?

No problem.

Database:

  • Starts from checkpoint

  • Sequentially scans logs

  • Applies only needed changes

  • Skips already applied ones

Sequential scan is very fast.

This is why redo log is append-only.
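
The restart procedure (start at the checkpoint, scan forward, skip already-applied changes) can be sketched like this, with `recover` as a hypothetical helper and dicts standing in for pages:

```python
def recover(redo_log, pages, checkpoint_lsn):
    """Replay redo records written after the last checkpoint,
    skipping any change a page has already absorbed (pageLSN check)."""
    for rec in redo_log:                       # the log is scanned sequentially
        if rec["lsn"] <= checkpoint_lsn:
            continue                           # already safely on disk
        page = pages[rec["page_id"]]
        if page["page_lsn"] >= rec["lsn"]:
            continue                           # page already contains this change
        page["data"][rec["offset"]] = rec["value"]   # reapply the change
        page["page_lsn"] = rec["lsn"]          # page now includes this LSN
    return pages
```

This is exactly the idempotence trick: replaying the log twice is harmless, because the pageLSN comparison filters out anything already applied.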


Why This Works (Deep Insight)

Because of:

🔥 Write-Ahead Logging Rule

Before commit:

Redo log must be flushed to disk

So if redo record exists:
→ It was committed

Even if data page wasn’t flushed.


What About Undo?

After redo phase:

Database does:

  1. REDO phase → repeat history

  2. UNDO phase → rollback uncommitted transactions

This is part of the ARIES recovery algorithm.

Used by:

  • IBM Db2

  • Oracle Database

  • Microsoft SQL Server

  • PostgreSQL (variation)


🎯 Ultra Simple Mental Model

Think of it like this:

Redo log = instruction history

Data page = notebook page with last applied instruction number

On restart:

For each instruction:
    if notebook already has it → skip
    else → apply it

That’s it.


One More Advanced Detail (You’ll Like This)

Databases also maintain:

Dirty Page Table
Transaction Table

Saved at checkpoint.

So during recovery, DB even knows:

  • Which pages were dirty

  • Which transactions were active

So it doesn’t blindly scan everything.


Final Summary

During restart:

  1. Find checkpoint

  2. Scan redo logs from that point

  3. For each record:

    • Compare redo LSN with pageLSN

    • Apply only if needed

  4. Then undo incomplete transactions

That’s how it handles thousands/millions of records safely.


If you want next level:

We can walk through:

  • Exact ARIES phases (Analysis, Redo, Undo)

  • How fuzzy checkpoint works

  • Why PostgreSQL WAL is slightly different from InnoDB

  • What happens if redo log itself is corrupted

This is real storage engine mastery.

Let's take an example:

LSN 1001 → update employee set salary = 500 where empid=202
LSN 1002 → update address set pincode = 500039 where addressid=777
LSN 1003 → delete from employee where empid=606

I am still not able to visualize how things are saved internally; I guess this is how the WAL looks. You said the DB looks at LSN numbers to pick the changes. Now I don't understand: does the DB maintain the LSN for each and every table separately (tables like employee/address)? Or at row level?

Excellent — now you’re asking the correct low-level question.

You’re imagining WAL like this:

LSN 1001 → update employee ...
LSN 1002 → update address ...
LSN 1003 → delete employee ...

That’s how humans think.

But internally, WAL does NOT think in terms of tables.

It thinks in terms of:

📦 Pages (blocks), not tables or rows

This is the key shift.


🔥 First Important Correction

Redo/WAL does NOT track:

  • Per table ❌

  • Per row ❌

It tracks:

✅ Per page


Step 1: How Data Is Physically Stored

In databases like:

  • PostgreSQL

  • MySQL (InnoDB)

Tables are stored as:

Table → multiple pages (8KB / 16KB blocks)

Example:

employee table → page 10, page 11, page 12
address table  → page 55, page 56

Each page has:

Page ID
pageLSN
actual row data

Step 2: What WAL Actually Stores

Your example:

LSN 1001 → update employee set salary=500 where empid=202

Internally becomes something like:

LSN: 1001
PageID: 10
Offset: 128 bytes
Old: 300
New: 500

Notice:

❌ No table name
❌ No SQL
❌ No row-level concept

Just:

"Modify Page 10 at offset 128"


Step 3: Where Is LSN Stored?

Every page contains:

pageLSN

Example:

Page 10: pageLSN = 1001

That means:

“All changes up to LSN 1001 are already applied to this page.”


🔥 So To Answer Your Question Directly

Does DB maintain LSN per table?

No.

Does DB maintain LSN per row?

No.

Where is LSN maintained?

At page level.

Each page knows the last log record applied to it.


Step 4: Let's Apply Your Example Physically

WAL contains:

LSN 1001 → modify Page 10 (employee row 202)
LSN 1002 → modify Page 55 (address row 777)
LSN 1003 → modify Page 11 (employee row 606 delete)

After crash:

DB does:

For LSN 1001:

  • Load Page 10

  • If pageLSN < 1001 → apply change

  • Update pageLSN = 1001

For LSN 1002:

  • Load Page 55

  • Same check

For LSN 1003:

  • Load Page 11

  • Same check

That’s it.


⚡ Why Page-Level Instead of Row-Level?

Because database storage engine works in:

Disk blocks

Disk cannot update a single row.

Smallest I/O unit:

One full page (8KB)

So everything is page-based.


🧠 Deep Insight

Tables are a logical abstraction.

Internally database only understands:

Files → Pages → Offsets inside pages

SQL layer thinks in rows.

Storage layer thinks in pages.

WAL belongs to storage layer.


Step 5: What Happens If Multiple Rows In Same Page?

Suppose:

Page 10 contains:

empid 202
empid 203
empid 204

If two updates happen:

LSN 1001 → empid 202
LSN 1004 → empid 204

Both target Page 10.

During recovery:

DB checks pageLSN once per record.

If pageLSN already >= 1004:
→ skip both.


🔥 Very Important: LSN Is Global

LSN is:

  • Global counter

  • Across entire database

  • Not per table

Example:

1001 → employee
1002 → address
1003 → orders
1004 → employee

All share same LSN sequence.


🏗 Mental Architecture

SQL Layer
   ↓
Execution Engine
   ↓
Storage Engine
   ↓
Buffer Pool
   ↓
WAL (Redo Log)

WAL belongs to storage engine.

Storage engine does NOT care about table names.

Only page IDs.


🎯 Ultra Clean Mental Model

Redo log record looks like:

struct RedoRecord {
    LSN
    PageID
    Offset
    Change
}

Each page stores:

struct Page {
    pageLSN
    data[]
}

Recovery logic:

if page.pageLSN < redoRecord.LSN:
    apply redo

Done.


Why This Design Is Brilliant

Because:

  • Page-level logging is faster

  • No need to interpret SQL

  • No need to find row logically

  • Just physical patching

Recovery becomes:

Deterministic memory patching.


If You Want Even Deeper

Next logical questions would be:

  • How does DB know which page contains empid=202?

  • What if index pages change?

  • What is full-page write?

  • What is torn page problem?

  • How does checkpoint avoid scanning full log?

You're now entering real storage-engine territory.

If the DB understands only "Files → Pages → Offsets inside pages", let's take an example.

Page 10 (pageLSN 1000) contains:
empid 202
empid 203
empid 204

If two updates happen (the WAL contains these 2 lines):
LSN 1001 → empid 202
LSN 1004 → empid 204

I am unable to visualize the concept of "offsets inside pages". Now explain to me how empid 202 and 204 get updated.

Perfect. Now we are at the real physical storage level.

You’re stuck at:

If DB only understands “page + offset”, how does it update empid 202 vs 204 inside the same page?

Good. Let’s slow down and literally visualize a page in memory.


Step 1️⃣ What a Page Actually Looks Like (Physically)

In systems like:

  • PostgreSQL

  • MySQL (InnoDB)

A page (say 8KB) is just:

---------------------------------
| Page Header (metadata)        |
|  - pageLSN                    |
|  - checksum                   |
|  - pointers                   |
---------------------------------
| Row 1 (bytes)                 |
---------------------------------
| Row 2 (bytes)                 |
---------------------------------
| Row 3 (bytes)                 |
---------------------------------
| Free space                    |
---------------------------------

There is NO concept of “empid” at storage level.

It’s just bytes.


Step 2️⃣ How Rows Are Located Inside a Page

Inside the page, rows are stored at specific byte positions.

Example (simplified):

Page 10
Offset 0-100   → Header
Offset 100-150 → Row A (empid 202)
Offset 150-200 → Row B (empid 203)
Offset 200-250 → Row C (empid 204)

Now imagine row structure:

Row bytes: [empid][salary][other columns]

Maybe:

empid = 202 → stored at offset 100
salary      → stored at offset 104

So salary column might be 4 bytes starting at offset 104.


Step 3️⃣ What WAL Actually Records

When you run:

UPDATE employee SET salary = 500 WHERE empid = 202;

The execution engine:

  1. Uses index to find page 10

  2. Finds row at offset 100

  3. Finds salary column at offset 104

Now WAL writes something like:

LSN 1001
PageID = 10
Offset = 104
Length = 4 bytes
NewValue = 500

That’s it.

No empid.
No SQL.
Just “modify these bytes”.


Step 4️⃣ Now Your Example

Page 10 layout:

Offset 100 → empid 202
Offset 104 → salary (for 202)
Offset 150 → empid 203
Offset 200 → empid 204
Offset 204 → salary (for 204)

WAL contains:

LSN 1001 → Page 10, Offset 104 → write 500
LSN 1004 → Page 10, Offset 204 → write 900

During recovery:

DB loads page 10.

Then:

For LSN 1001:

Write 500 at byte 104

For LSN 1004:

Write 900 at byte 204

That’s how empid 202 and 204 get updated.


🔥 Important Realization

At WAL level:

There is no such thing as:

  • “Update empid 202”

  • “Delete empid 204”

There is only:

“Modify bytes at offset X in page Y”


Step 5️⃣ What About Delete?

Delete doesn’t remove row immediately.

Usually it:

  • Marks row as deleted (sets a flag byte)

  • Or updates tuple header

So WAL might say:

LSN 1003
Page 10
Offset 100
Change first byte to 'deleted' flag

Again: just byte patching.


Step 6️⃣ How Does DB Originally Find Offset 104?

Important distinction:

During normal execution:

  1. Index → finds page number

  2. Inside page → slot array tells row location

  3. Row header → tells column offset

  4. Then update happens

But during recovery:

It does NOT search by empid.

It trusts WAL’s physical instruction.

Because WAL already recorded exact offset.


🧠 Visual Mental Model

Think of page as:

Byte array[8192]

WAL is:

At byte 104 → replace 4 bytes with 500
At byte 204 → replace 4 bytes with 900

That’s it.
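
That mental model is directly runnable: a page as a byte array, redo records as physical patch instructions. The offsets and values are from the example above; the little-endian 4-byte integer layout is an assumption:

```python
import struct

# A page is just a fixed-size byte array.
page = bytearray(8192)

# Redo records as physical instructions: (byte offset, 4-byte int to write).
redo_records = [(104, 500), (204, 900)]

for offset, value in redo_records:
    # Overwrite exactly 4 bytes at the given offset (little-endian int).
    struct.pack_into("<i", page, offset, value)

# Reading the bytes back shows the patched salaries.
salary_202 = struct.unpack_from("<i", page, 104)[0]
salary_204 = struct.unpack_from("<i", page, 204)[0]
```

No table names, no SQL, no row lookup: just deterministic byte patching, which is why redo is so fast.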


⚡ Why This Is Powerful

Because recovery becomes:

  • No SQL parsing

  • No index traversal

  • No row lookup

  • Just memory patching

Very deterministic.
Very fast.


Step 7️⃣ What If Row Moved?

Good question (next level).

If update causes row expansion:

  • DB may move row to new page

  • WAL records:

    • Delete old location

    • Insert new location

Still page + offset.


🎯 Final Core Understanding

Tables and rows are:

Logical abstraction

Storage engine only sees:

Files → Pages → Byte offsets

WAL is simply:

A sequence of byte modifications to pages.


If you want, next we can go into:

  • How slot directory works

  • What tuple header contains

  • What happens in PostgreSQL HOT updates

  • How InnoDB clustered index changes layout

  • What full-page write is

You’re now thinking like a storage engine engineer.

Got it now, that's perfect, thank you. Now tell me: what exactly does the index return? Only the page number, or the starting position of that row along with the page number? How does the DB determine to write the salary value at a particular offset?

🔥 Now you’re asking the right storage-engine question.

You understood:

WAL = modify Page X at Offset Y

Now you're asking:

  1. What exactly does an index return?

  2. Does it return only page number?

  3. Or page + row location?

  4. How does DB know exact offset of salary column?

Let’s go step by step.


1️⃣ What Does an Index Actually Return?

It depends on database type, but conceptually:

An index returns:

(pointer to row location)

Not the row itself.

Now what is “row location”?

That depends on engine.


Case A: Heap Table (like PostgreSQL)

PostgreSQL stores:

  • Table separately (heap)

  • Index separately

Index entry looks like:

(key, TID)

Where:

TID = (page_number, slot_number)

Example:

(empid=202) → (Page 10, Slot 3)

So index returns:

Page 10 Slot 3

NOT raw byte offset yet.


Case B: Clustered Index (like MySQL InnoDB)

InnoDB stores:

  • Data inside the primary key index itself

So primary index returns:

Entire row

Secondary index returns:

(primary key value)

Then engine looks up primary index.

Different internal layout, but same principle:

Ultimately we reach a page and a record position.


2️⃣ What Is Slot Number?

Inside a page, there is something called:

Slot directory (or line pointer array)

Page layout looks like:

--------------------------------
| Page Header                  |
--------------------------------
| Row Data Area (grows upward) |
--------------------------------
| Free Space                   |
--------------------------------
| Slot Directory (grows down)  |
--------------------------------

Slot directory contains entries like:

Slot 1 → offset 100
Slot 2 → offset 150
Slot 3 → offset 200

So:

(Page 10, Slot 3)

means:

Look in slot directory:
Slot 3 says → row starts at offset 200

Now we know row start offset.


3️⃣ How DB Finds Salary Column Offset

Row structure is predefined.

Suppose employee table:

empid  INT
salary INT
name   VARCHAR

Row bytes might look like:

| Row Header | empid | salary | name | ...

Engine knows:

  • empid starts at offset +0 (relative to row start)

  • salary starts at offset +4

  • name starts at offset +8

So if:

Row start offset = 200
salary offset = row_start + 4

Then:

200 + 4 = 204

That becomes WAL offset.


🔥 Now Let’s Walk Full Flow

You execute:

UPDATE employee SET salary=500 WHERE empid=202;

Step 1: Index lookup

Index returns:

(Page 10, Slot 3)

Step 2: Storage engine loads Page 10 into buffer

Step 3: From slot directory

Slot 3 → offset 200

Step 4: From row metadata

salary column offset = 4 bytes from row start

So:

final byte offset = 200 + 4 = 204

Step 5: Modify 4 bytes at offset 204

Step 6: WAL records:

Page 10
Offset 204
NewValue 500

That’s it.
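
The offset arithmetic from Steps 3 and 4 can be sketched as follows (`resolve_column_offset` is a hypothetical helper; the slot directory is a plain dict here):

```python
def resolve_column_offset(page, slot_no, column_offset_in_row):
    """Turn an index pointer (page, slot) into a final byte offset:
    the slot directory gives the row start, the row layout gives the
    column's position relative to that start."""
    row_start = page["slots"][slot_no]         # e.g. Slot 3 -> offset 200
    return row_start + column_offset_in_row    # e.g. salary at row_start + 4

# The slot directory of Page 10, as in the example above.
page10 = {"slots": {1: 100, 2: 150, 3: 200}}
```

With `slot_no=3` and a salary column 4 bytes into the row, this yields 204, which is exactly the offset the WAL record carries.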


🎯 So What Exactly Does Index Return?

In heap-based storage (like PostgreSQL):

Index → (Page Number, Slot Number)

In clustered storage (InnoDB primary index):

Index → actual row location inside B+Tree page

But ultimately it becomes:

Page + Record pointer

NOT raw byte offset directly.


🧠 Important Insight

Index layer operates at:

Logical row pointer level

Storage layer operates at:

Physical byte offset level

Execution engine bridges the two.


4️⃣ Why Slot Directory Exists

You might ask:

Why not store rows sequentially and compute offset directly?

Because:

  • Rows can be deleted

  • Rows can move

  • Rows can shrink/expand

  • Page needs compaction

Slot directory allows:

  • Row movement without breaking index pointers

  • Only slot offset changes, not slot number

Very clever design.


🏗 Big Picture Architecture

Index (B+Tree)
   ↓
Returns (PageID, SlotID)
   ↓
Page loaded
   ↓
Slot directory gives row offset
Row metadata gives column offset
   ↓
Byte modification
   ↓
WAL records page + offset

⚡ Why This Is Extremely Powerful

Because:

  • Index doesn’t need to know row layout

  • WAL doesn’t need to know SQL

  • Storage doesn’t need to know business logic

  • Each layer is cleanly separated

This is why databases scale.


If you want, next we can go into:

  • What happens if salary update increases row size?

  • How page compaction works?

  • How HOT update works in PostgreSQL?

  • Why secondary indexes in InnoDB are expensive?

  • How MVCC changes row structure?

You're now thinking at real storage-engine level.

employee table: empid int, name varchar(255), salary int

This is how the data is saved on disk (I know it is saved as bits). Let's assume there are 2 rows saved on disk; the 1st row's offset starts at 200, and the 2nd row's offset starts at 211:

101 Ravi 500
102 Kumar 600

Now my question is: if I want to update the name of empid 101 to "Ravi Kumar Garlapati", what happens internally, and how does the data movement happen? Does employee 102's data move to the right? Or is 101 deleted in the page and fresh-inserted into some other new page?

🔥 Beautiful question.
Now you’re touching the hardest part of storage engines:

What happens when a row grows in size?

Let’s walk through it carefully.


First: Your Example Page

You simplified it as:

Offset 200 → 101 Ravi 500
Offset 211 → 102 Kumar 600

Now you want to update:

name of empid 101 → "Ravi Kumar Garlapati"

Old length ≈ 4 chars
New length ≈ 21 chars

So the row becomes larger.


⚠️ Important Reality

Rows are NOT stored as fixed width if you use:

VARCHAR(255)

They are stored as:

[Row header] [empid 4 bytes] [name length + actual string bytes] [salary 4 bytes]

So increasing name increases total row size.

Now the real question:

What does database do if row size increases?

Answer:

It depends on the engine.


Case 1️⃣: PostgreSQL Behavior (Heap Storage)

In PostgreSQL:

Rows are immutable.

Meaning:

UPDATE is actually DELETE + INSERT (logically)

What happens:

  1. Old row is marked as “dead” (not physically removed immediately)

  2. New row version is inserted

  3. Index updated to point to new location

So in your example:

Offset 200 → old row (marked dead)
Offset ??? → new bigger row inserted

If page has enough free space:
→ new version inserted in same page

If page doesn't have space:
→ new version inserted into another page

So:

❌ 102 does NOT shift right
❌ No in-place expansion

Instead:

A new version of row is created.

This is part of MVCC design.


Case 2️⃣: MySQL InnoDB Behavior

In MySQL (InnoDB):

InnoDB tries in-place update first.

Two possibilities:

✅ If there is enough free space in page:

It may:

  • Shift subsequent rows to the right

  • Update row in-place

But this is expensive.

❌ If not enough space:

It does something similar to PostgreSQL:

  • Delete-mark old record

  • Insert new record elsewhere

  • Update pointers


🔥 Important Concept: Pages Have Free Space Area

Earlier I showed:

--------------------------------
| Page Header                  |
--------------------------------
| Row Data (grows downward)    |
--------------------------------
| Free Space                   |
--------------------------------
| Slot Directory (grows up)    |
--------------------------------

Rows don’t sit tightly packed forever.

There is fragmentation and free space.


Now Let's Walk Your Exact Scenario

Initial Page:

Offset 200 → 101 Ravi 500
Offset 211 → 102 Kumar 600
Free space starts at 222

Now new row size increases from 11 bytes → maybe 28 bytes.

Engine checks:

Is there enough free space in this page?


Scenario A: Enough Space

Then engine may:

  1. Mark old row deleted (if MVCC)

  2. Insert new row at end of page

  3. Update slot directory

Final layout:

Offset 200 → old version (dead)
Offset 211 → 102 Kumar 600
Offset 222 → 101 Ravi Kumar Garlapati 500

Index updated to point to new slot.


Scenario B: Not Enough Space

Engine:

  1. Marks old row deleted

  2. Finds another page with free space

  3. Inserts new row there

  4. Updates index

So row physically moves to another page.
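
Scenarios A and B can be folded into one sketch. `update_growing_row` is a hypothetical helper, and free-space bookkeeping is reduced to a single counter per page:

```python
def update_growing_row(pages, page_id, slot_no, new_row):
    """Sketch: mark the old version dead, insert the new version in the
    same page if it fits, else in another page with free space, and
    return the new location so the index can be repointed."""
    page = pages[page_id]
    page["dead"].add(slot_no)                 # old version stays, marked dead
    # Prefer the same page (Scenario A), then any other page (Scenario B).
    for pid in [page_id] + [p for p in pages if p != page_id]:
        candidate = pages[pid]
        if candidate["free"] >= len(new_row):
            candidate["rows"].append(new_row)
            candidate["free"] -= len(new_row)
            return (pid, len(candidate["rows"]) - 1)   # new (page, slot)
    raise RuntimeError("no page has room; a real engine would allocate one")
```

Note what never happens: no neighboring row is shifted, and the old bytes are not overwritten, which is exactly what MVCC requires.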


❗ Important: Rows Are Not Shifted Like Arrays

Database does NOT behave like:

ArrayList shifting elements right

Because:

  • Shifting 8KB page repeatedly is expensive

  • Index pointers would break

  • Concurrency would suffer

Instead:

New version is written elsewhere.


🧠 Why PostgreSQL Always Inserts New Version

Because of MVCC:

Old version must remain visible to older transactions.

So it cannot overwrite in-place.

That’s why PostgreSQL updates create bloat.

Then VACUUM cleans dead rows later.


What About WAL in This Case?

WAL would record something like:

LSN 2001 → mark old row dead (Page 10, Offset 200)
LSN 2002 → insert new row (Page 10, Offset 222)

OR if moved:

LSN 2002 → insert into Page 25, Offset 100

Still page + offset modifications.


🎯 So Direct Answer to You

When row size increases:

Does 102 shift right?

👉 Usually NO.

Does 101 get deleted and reinserted?

👉 YES (logically almost always).

Depending on engine:

  • PostgreSQL → always new row version

  • InnoDB → sometimes in-place, sometimes new insert


🔥 Deeper Insight

Databases optimize for:

  • Write speed

  • Concurrency

  • Crash safety

NOT memory compactness.

That’s why fragmentation exists.

That’s why VACUUM / OPTIMIZE TABLE exists.


Ultra-Clean Mental Model

If update changes row size:

Old row → marked dead
New row → inserted elsewhere
Index   → updated

No shifting like C arrays.


If you want next level:

We can go into:

  • HOT updates in PostgreSQL

  • How overflow pages work

  • Why fixed-length columns behave differently

  • How TOAST storage works

  • Why large VARCHAR sometimes stored out-of-page

You're now operating at storage-engine architect level.