Wednesday, 25 February 2026

How does a DB read data from disk? (WAL/Page/Offset)

In a DB, what's the role of the redo/undo logs and the buffer? Which comes first, what kind of info do they save, and where do they reside?

Great question — this is core database internals. Since you're into system design and storage engines, let’s break this down from first principles.

We’ll answer:

  1. What is the buffer?

  2. What are redo logs?

  3. What are undo logs?

  4. What happens first during a write?

  5. Where do they reside?

  6. Why are all three needed?


1️⃣ Buffer (Buffer Pool)

Think of the buffer pool as:

📦 An in-memory cache of database pages (usually 8KB or 16KB blocks).

Databases like:

  • MySQL (InnoDB)

  • PostgreSQL

  • Oracle Database

do not modify disk directly.

Instead:

Disk → Buffer Pool (RAM) → Modified in memory → Later flushed to disk

What buffer stores

  • Data pages (rows, indexes)

  • Recently accessed pages

  • Dirty pages (modified but not flushed)

Why buffer exists

  • Disk is slow

  • RAM is fast

  • Most DB performance comes from avoiding disk I/O
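
As a rough illustration of what the buffer pool does, here is a minimal cache sketch (the `BufferPool` class and its method names are hypothetical, not any real engine's API):

```python
PAGE_SIZE = 8192

class BufferPool:
    """Minimal in-memory cache of fixed-size pages (illustrative sketch)."""

    def __init__(self, disk):
        self.disk = disk          # file-like object opened in binary mode
        self.pages = {}           # page_id -> bytearray (the cached page)
        self.dirty = set()        # page_ids modified but not yet flushed

    def get_page(self, page_id):
        # Serve from RAM if cached; otherwise do exactly one disk read.
        if page_id not in self.pages:
            self.disk.seek(page_id * PAGE_SIZE)
            self.pages[page_id] = bytearray(self.disk.read(PAGE_SIZE))
        return self.pages[page_id]

    def mark_dirty(self, page_id):
        # Page was modified in memory; flushing happens later, asynchronously.
        self.dirty.add(page_id)
```

A real pool also evicts pages (LRU or clock) when RAM fills up; that is omitted here.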


2️⃣ Redo Log (Write-Ahead Log)

Redo log = durability guarantee

“If DB crashes, I can redo committed changes.”

It stores:

  • Physical changes (page-level or record-level)

  • What was modified

  • Enough information to reapply the change

Example:

UPDATE users SET balance = 500 WHERE id = 10;

Redo log stores something like:

Page 42
Offset 128
Old: 300
New: 500

Key concept: Write-Ahead Logging (WAL)

Before a transaction is marked COMMITTED:

Redo log must be written to disk first

This guarantees durability.

PostgreSQL literally calls it:

WAL (Write Ahead Log)
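
The write-ahead rule can be sketched in a few lines. `append_redo_and_commit` is a hypothetical helper, and one JSON line per record is a simplification; real engines write compact binary records:

```python
import json
import os

def append_redo_and_commit(log_path, records):
    """Append redo records for a transaction and force them to disk
    BEFORE the commit is acknowledged (the write-ahead rule, sketched)."""
    with open(log_path, "a") as log:
        for rec in records:               # e.g. {"lsn": 1001, "page": 42, ...}
            log.write(json.dumps(rec) + "\n")
        log.flush()                       # push from Python's buffer to the OS
        os.fsync(log.fileno())            # force OS buffers to stable storage
    return "COMMITTED"                    # only now may the client see success
```

The `fsync` is the whole point: once it returns, the redo record survives a crash, so the commit can be acknowledged even though the data page is still only in RAM.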


3️⃣ Undo Log

Undo log = rollback + MVCC support

It stores:

  • The previous version of the row

  • So we can:

    • Rollback transactions

    • Provide consistent reads (MVCC)

Example:

Before update:

balance = 300

Undo log stores:

Old value = 300

If rollback happens:
→ DB restores 300

In MVCC systems (like MySQL InnoDB, PostgreSQL), undo logs allow:

  • Transaction A updates row

  • Transaction B still sees old version

  • Without blocking

That’s how high concurrency works.
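
A minimal sketch of how undo versions make this possible. `RowVersion` and `visible_value` are hypothetical names, and the visibility rule here is heavily simplified compared to real MVCC:

```python
class RowVersion:
    """One version of a row; `prev` points into the undo chain."""
    def __init__(self, value, created_by_txn, prev=None):
        self.value = value
        self.created_by_txn = created_by_txn
        self.prev = prev

def visible_value(latest, reader_snapshot):
    """Walk back through undo versions until we hit one created by a
    transaction the reader is allowed to see (simplified visibility)."""
    v = latest
    while v is not None and v.created_by_txn > reader_snapshot:
        v = v.prev          # the older version lives in the undo log
    return v.value if v else None
```

Transaction B (snapshot 5) keeps reading 300 from the undo chain while transaction 9's update to 500 sits at the head, and neither blocks the other.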


4️⃣ What Happens First During UPDATE?

Let’s walk through it step by step.

Suppose:

UPDATE users SET balance = 500 WHERE id = 10;

Step 1: Load page into Buffer (if not already)

Disk → Buffer Pool

Step 2: Create Undo Log

Before changing anything:

  • Save old value (300) into undo log

Step 3: Modify Buffer Page

Change balance to 500 (in memory)

Page is now:

Dirty page

Step 4: Write Redo Log to Disk

Write change record to redo log file on disk

Step 5: Commit

Once redo log is safely on disk:
→ Transaction is committed

Step 6: Later (asynchronously)

Background process flushes dirty buffer pages to disk


🔥 Important Order (Critical for Interviews)

1) Undo log written
2) Data page modified in buffer
3) Redo log written to disk
4) Commit acknowledged
5) Data page flushed later

This is called:

STEAL + NO-FORCE policy

Used by almost all modern databases.
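
The ordering above can be sketched end to end, with dicts and lists standing in for the buffer pool and the logs (`update_row` is a hypothetical helper, not any engine's API):

```python
def update_row(buffer_pool, undo_log, redo_log, page_id, offset, new_value):
    """Sketch of the STEAL + NO-FORCE write order described above."""
    page = buffer_pool[page_id]                    # 1) page is in the buffer
    old_value = page[offset]
    undo_log.append((page_id, offset, old_value))  # 2) undo gets the OLD value first
    page[offset] = new_value                       # 3) modify in memory: dirty page
    redo_log.append((page_id, offset, new_value))  # 4) redo gets the NEW value
    return "COMMIT OK"                             # 5) ack only once redo is durable
    # 6) (not shown) a background writer flushes the dirty page later
```

Note the asymmetry: undo stores the old value (for rollback), redo stores the new value (for replay).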


5️⃣ Where Do They Reside?

Component      Memory or Disk?                  Purpose
Buffer Pool    RAM                              Fast data access
Undo Log       Disk (often separate segment)    Rollback + MVCC
Redo Log       Disk (sequential file)           Crash recovery

Why is the redo log sequential?

Because sequential disk writes are fast.

Databases append to redo log like:

LSN 1001 → LSN 1002 → LSN 1003 → ...

LSN = Log Sequence Number


6️⃣ What Happens During Crash?

Imagine crash after commit but before data page flush.

Data page on disk:

balance = 300

Redo log says:

Change to 500

During restart:

  • DB scans redo log

  • Reapplies changes

  • Balance becomes 500

That’s durability.


7️⃣ Why We Need All Three

If only buffer:
→ Crash loses data

If only redo:
→ Can't rollback

If only undo:
→ Can't recover committed changes

Together they give:

Property       Mechanism
Atomicity      Undo log
Durability     Redo log
Isolation      Undo (MVCC)
Performance    Buffer pool

That’s basically ACID implemented internally.


🧠 First-Principles Summary

Database write path is optimized for:

  1. Fast writes → write sequential redo log

  2. Crash safety → replay redo

  3. Concurrency → undo for MVCC

  4. Performance → buffer pool

So the real trick is:

Modify memory fast, log to disk safely, flush lazily.


If you want, next we can go even deeper into:

  • ARIES recovery algorithm

  • STEAL vs NO-STEAL

  • FORCE vs NO-FORCE

  • How LSN works internally

  • Checkpoints and fuzzy checkpoints

  • How PostgreSQL WAL differs from MySQL redo

This is proper storage-engine level knowledge.

You said: "During restart, the DB scans the redo log, reapplies changes, and the balance becomes 500." This is a question I've had for a long time. The redo log (WAL) might hold thousands of records (or are they removed as soon as the DB gets updated? not sure). So my question is: during a DB restart, how does the DB know, among thousands of records in the redo log, that it needs to update the balance to 500?

🔥 Excellent. This is the core recovery question.

You’re asking:

Redo log may have thousands/millions of records.
During restart, how does DB know which ones to apply?
Does it replay everything blindly?

This is where real storage engine intelligence begins.

Let’s break it carefully.


First: Redo Logs Are NOT Deleted Immediately

In systems like:

  • PostgreSQL

  • MySQL (InnoDB)

Redo logs:

  • Are written sequentially

  • Stored in log files on disk

  • Reused in circular fashion

  • Not deleted per transaction

So yes — logs can contain thousands or millions of records.


The Real Answer: LSN (Log Sequence Number)

Every redo log record has:

LSN = Log Sequence Number

Think of LSN as:

1001 1002 1003 1004 ...

Monotonically increasing.


🔑 Critical Concept: Each Data Page Stores Its Own LSN

Here’s the key magic:

Every page on disk stores:

pageLSN

That means:

“This page already includes all changes up to LSN X”


Now Let’s Walk Through Crash Recovery

Imagine:

Disk page shows:
balance = 300
pageLSN = 1000

Redo log contains:

LSN 1001 → balance = 400
LSN 1002 → balance = 500
LSN 1003 → some other row

Crash happens.


During Restart

The DB does:

Step 1: Find Last Checkpoint

Database does NOT start scanning from beginning of log.

It starts from:

Last checkpoint LSN

Checkpoint means:

“All changes before this are already safely on disk.”

So DB starts from checkpoint LSN.


Step 2: Replay Log Records After Checkpoint

Now for each redo record:

For each record:

  1. Load the page from disk

  2. Compare:

if pageLSN >= redoLSN:
    skip (already applied)
else:
    apply redo

That’s how it knows.


Your Specific Question:

How does DB know I need to update balance to 500?

Answer:

Because:

pageLSN < redo record LSN

So it knows:
→ This change was committed
→ But page was not flushed
→ So it must apply redo


⚡ Very Important

Redo logs contain:

  • Page ID

  • Offset

  • New value

  • LSN

So DB doesn’t guess.

It reads redo record like:

LSN 1002
Page 42
Offset 128
Write value = 500

Then checks page 42’s pageLSN.


What If There Are Thousands of Records?

No problem.

Database:

  • Starts from checkpoint

  • Sequentially scans logs

  • Applies only needed changes

  • Skips already applied ones

Sequential scan is very fast.

This is why redo log is append-only.
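
The restart procedure (start at the checkpoint, scan forward, skip already-applied changes) can be sketched like this, with `recover` as a hypothetical helper and dicts standing in for pages:

```python
def recover(redo_log, pages, checkpoint_lsn):
    """Replay redo records written after the last checkpoint,
    skipping any change a page has already absorbed (pageLSN check)."""
    for rec in redo_log:                       # the log is scanned sequentially
        if rec["lsn"] <= checkpoint_lsn:
            continue                           # already safely on disk
        page = pages[rec["page_id"]]
        if page["page_lsn"] >= rec["lsn"]:
            continue                           # page already contains this change
        page["data"][rec["offset"]] = rec["value"]   # reapply the change
        page["page_lsn"] = rec["lsn"]          # page now includes this LSN
    return pages
```

This is exactly the idempotence trick: replaying the log twice is harmless, because the pageLSN comparison filters out anything already applied.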


Why This Works (Deep Insight)

Because of:

🔥 Write-Ahead Logging Rule

Before commit:

Redo log must be flushed to disk

So if redo record exists:
→ It was committed

Even if data page wasn’t flushed.


What About Undo?

After redo phase:

Database does:

  1. REDO phase → repeat history

  2. UNDO phase → rollback uncommitted transactions

This is part of the ARIES recovery algorithm.

Used by:

  • IBM Db2

  • Oracle Database

  • Microsoft SQL Server

  • PostgreSQL (variation)


🎯 Ultra Simple Mental Model

Think of it like this:

Redo log = instruction history

Data page = notebook page with last applied instruction number

On restart:

For each instruction:
    if notebook already has it → skip
    else → apply it

That’s it.


One More Advanced Detail (You’ll Like This)

Databases also maintain:

Dirty Page Table
Transaction Table

Saved at checkpoint.

So during recovery, DB even knows:

  • Which pages were dirty

  • Which transactions were active

So it doesn’t blindly scan everything.


Final Summary

During restart:

  1. Find checkpoint

  2. Scan redo logs from that point

  3. For each record:

    • Compare redo LSN with pageLSN

    • Apply only if needed

  4. Then undo incomplete transactions

That’s how it handles thousands/millions of records safely.


If you want next level:

We can walk through:

  • Exact ARIES phases (Analysis, Redo, Undo)

  • How fuzzy checkpoint works

  • Why PostgreSQL WAL is slightly different from InnoDB

  • What happens if redo log itself is corrupted

This is real storage engine mastery.

Let's take an example:

LSN 1001 → update employee set salary = 500 where empid=202
LSN 1002 → update address set pincode = 500039 where addressid=777
LSN 1003 → delete from employee where empid=606

I am still not able to visualize how things are saved internally; I guess this is how the WAL looks. You said the DB looks at LSN numbers to pick the changes. Now I don't understand: does the DB maintain the LSN for each and every table separately (tables like employee/address)? Or at row level?

Excellent — now you’re asking the correct low-level question.

You’re imagining WAL like this:

LSN 1001 → update employee ...
LSN 1002 → update address ...
LSN 1003 → delete employee ...

That’s how humans think.

But internally, WAL does NOT think in terms of tables.

It thinks in terms of:

📦 Pages (blocks), not tables or rows

This is the key shift.


🔥 First Important Correction

Redo/WAL does NOT track:

  • Per table ❌

  • Per row ❌

It tracks:

✅ Per page


Step 1: How Data Is Physically Stored

In databases like:

  • PostgreSQL

  • MySQL (InnoDB)

Tables are stored as:

Table → multiple pages (8KB / 16KB blocks)

Example:

employee table → page 10, page 11, page 12
address table  → page 55, page 56

Each page has:

Page ID
pageLSN
actual row data

Step 2: What WAL Actually Stores

Your example:

LSN 1001 → update employee set salary=500 where empid=202

Internally becomes something like:

LSN: 1001
PageID: 10
Offset: 128 bytes
Old: 300
New: 500

Notice:

❌ No table name
❌ No SQL
❌ No row-level concept

Just:

"Modify Page 10 at offset 128"


Step 3: Where Is LSN Stored?

Every page contains:

pageLSN

Example:

Page 10: pageLSN = 1001

That means:

“All changes up to LSN 1001 are already applied to this page.”


🔥 So To Answer Your Question Directly

Does DB maintain LSN per table?

No.

Does DB maintain LSN per row?

No.

Where is LSN maintained?

At page level.

Each page knows the last log record applied to it.


Step 4: Let's Apply Your Example Physically

WAL contains:

LSN 1001 → modify Page 10 (employee row 202)
LSN 1002 → modify Page 55 (address row 777)
LSN 1003 → modify Page 11 (employee row 606 delete)

After crash:

DB does:

For LSN 1001:

  • Load Page 10

  • If pageLSN < 1001 → apply change

  • Update pageLSN = 1001

For LSN 1002:

  • Load Page 55

  • Same check

For LSN 1003:

  • Load Page 11

  • Same check

That’s it.


⚡ Why Page-Level Instead of Row-Level?

Because database storage engine works in:

Disk blocks

Disk cannot update a single row.

Smallest I/O unit:

One full page (8KB)

So everything is page-based.


🧠 Deep Insight

Tables are a logical abstraction.

Internally database only understands:

Files → Pages → Offsets inside pages

SQL layer thinks in rows.

Storage layer thinks in pages.

WAL belongs to storage layer.


Step 5: What Happens If Multiple Rows In Same Page?

Suppose:

Page 10 contains:

empid 202
empid 203
empid 204

If two updates happen:

LSN 1001 → empid 202
LSN 1004 → empid 204

Both target Page 10.

During recovery:

DB checks pageLSN once per record.

If pageLSN already >= 1004:
→ skip both.


🔥 Very Important: LSN Is Global

LSN is:

  • Global counter

  • Across entire database

  • Not per table

Example:

1001 → employee
1002 → address
1003 → orders
1004 → employee

All share same LSN sequence.


🏗 Mental Architecture

SQL Layer
   ↓
Execution Engine
   ↓
Storage Engine
   ↓
Buffer Pool
   ↓
WAL (Redo Log)

WAL belongs to storage engine.

Storage engine does NOT care about table names.

Only page IDs.


🎯 Ultra Clean Mental Model

Redo log record looks like:

struct RedoRecord {
    LSN
    PageID
    Offset
    Change
}

Each page stores:

struct Page {
    pageLSN
    data[]
}

Recovery logic:

if page.pageLSN < redoRecord.LSN:
    apply redo

Done.


Why This Design Is Brilliant

Because:

  • Page-level logging is faster

  • No need to interpret SQL

  • No need to find row logically

  • Just physical patching

Recovery becomes:

Deterministic memory patching.


If You Want Even Deeper

Next logical questions would be:

  • How does DB know which page contains empid=202?

  • What if index pages change?

  • What is full-page write?

  • What is torn page problem?

  • How does checkpoint avoid scanning full log?

You're now entering real storage-engine territory.

If the DB understands only "Files → Pages → Offsets inside pages", let's take an example.

Page 10 (pageLSN 1000) contains:
empid 202
empid 203
empid 204

If two updates happen (the WAL contains these 2 lines):
LSN 1001 → empid 202
LSN 1004 → empid 204

I am unable to visualize the concept of "offsets inside pages". Now explain to me how empid 202 and 204 get updated.

Perfect. Now we are at the real physical storage level.

You’re stuck at:

If DB only understands “page + offset”, how does it update empid 202 vs 204 inside the same page?

Good. Let’s slow down and literally visualize a page in memory.


Step 1️⃣ What a Page Actually Looks Like (Physically)

In systems like:

  • PostgreSQL

  • MySQL (InnoDB)

A page (say 8KB) is just:

---------------------------------
| Page Header (metadata)        |
|  - pageLSN                    |
|  - checksum                   |
|  - pointers                   |
---------------------------------
| Row 1 (bytes)                 |
---------------------------------
| Row 2 (bytes)                 |
---------------------------------
| Row 3 (bytes)                 |
---------------------------------
| Free space                    |
---------------------------------

There is NO concept of “empid” at storage level.

It’s just bytes.


Step 2️⃣ How Rows Are Located Inside a Page

Inside the page, rows are stored at specific byte positions.

Example (simplified):

Page 10
Offset 0-100   → Header
Offset 100-150 → Row A (empid 202)
Offset 150-200 → Row B (empid 203)
Offset 200-250 → Row C (empid 204)

Now imagine row structure:

Row bytes: [empid][salary][other columns]

Maybe:

empid = 202 → stored at offset 100
salary      → stored at offset 104

So salary column might be 4 bytes starting at offset 104.


Step 3️⃣ What WAL Actually Records

When you run:

UPDATE employee SET salary = 500 WHERE empid = 202;

The execution engine:

  1. Uses index to find page 10

  2. Finds row at offset 100

  3. Finds salary column at offset 104

Now WAL writes something like:

LSN 1001
PageID = 10
Offset = 104
Length = 4 bytes
NewValue = 500

That’s it.

No empid.
No SQL.
Just “modify these bytes”.


Step 4️⃣ Now Your Example

Page 10 layout:

Offset 100 → empid 202
Offset 104 → salary (for 202)
Offset 150 → empid 203
Offset 200 → empid 204
Offset 204 → salary (for 204)

WAL contains:

LSN 1001 → Page 10, Offset 104 → write 500
LSN 1004 → Page 10, Offset 204 → write 900

During recovery:

DB loads page 10.

Then:

For LSN 1001:

Write 500 at byte 104

For LSN 1004:

Write 900 at byte 204

That’s how empid 202 and 204 get updated.


🔥 Important Realization

At WAL level:

There is no such thing as:

  • “Update empid 202”

  • “Delete empid 204”

There is only:

“Modify bytes at offset X in page Y”


Step 5️⃣ What About Delete?

Delete doesn’t remove row immediately.

Usually it:

  • Marks row as deleted (sets a flag byte)

  • Or updates tuple header

So WAL might say:

LSN 1003
Page 10
Offset 100
Change first byte to 'deleted' flag

Again: just byte patching.


Step 6️⃣ How Does DB Originally Find Offset 104?

Important distinction:

During normal execution:

  1. Index → finds page number

  2. Inside page → slot array tells row location

  3. Row header → tells column offset

  4. Then update happens

But during recovery:

It does NOT search by empid.

It trusts WAL’s physical instruction.

Because WAL already recorded exact offset.


🧠 Visual Mental Model

Think of page as:

Byte array[8192]

WAL is:

At byte 104 → replace 4 bytes with 500
At byte 204 → replace 4 bytes with 900

That’s it.
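
That mental model is directly runnable: a page as a byte array, redo records as physical patch instructions. The offsets and values are from the example above; the little-endian 4-byte integer layout is an assumption:

```python
import struct

# A page is just a fixed-size byte array.
page = bytearray(8192)

# Redo records as physical instructions: (byte offset, 4-byte int to write).
redo_records = [(104, 500), (204, 900)]

for offset, value in redo_records:
    # Overwrite exactly 4 bytes at the given offset (little-endian int).
    struct.pack_into("<i", page, offset, value)

# Reading the bytes back shows the patched salaries.
salary_202 = struct.unpack_from("<i", page, 104)[0]
salary_204 = struct.unpack_from("<i", page, 204)[0]
```

No table names, no SQL, no row lookup: just deterministic byte patching, which is why redo is so fast.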


⚡ Why This Is Powerful

Because recovery becomes:

  • No SQL parsing

  • No index traversal

  • No row lookup

  • Just memory patching

Very deterministic.
Very fast.


Step 7️⃣ What If Row Moved?

Good question (next level).

If update causes row expansion:

  • DB may move row to new page

  • WAL records:

    • Delete old location

    • Insert new location

Still page + offset.


🎯 Final Core Understanding

Tables and rows are:

Logical abstraction

Storage engine only sees:

Files → Pages → Byte offsets

WAL is simply:

A sequence of byte modifications to pages.


If you want, next we can go into:

  • How slot directory works

  • What tuple header contains

  • What happens in PostgreSQL HOT updates

  • How InnoDB clustered index changes layout

  • What full-page write is

You’re now thinking like a storage engine engineer.

Got it now, that's perfect, thank you. Now tell me: what exactly does the index return? Only the page number, or the starting position of that row along with the page number? How does the DB determine to write the salary value at a particular offset?

🔥 Now you’re asking the right storage-engine question.

You understood:

WAL = modify Page X at Offset Y

Now you're asking:

  1. What exactly does an index return?

  2. Does it return only page number?

  3. Or page + row location?

  4. How does DB know exact offset of salary column?

Let’s go step by step.


1️⃣ What Does an Index Actually Return?

It depends on database type, but conceptually:

An index returns:

(pointer to row location)

Not the row itself.

Now what is “row location”?

That depends on engine.


Case A: Heap Table (like PostgreSQL)

PostgreSQL stores:

  • Table separately (heap)

  • Index separately

Index entry looks like:

(key, TID)

Where:

TID = (page_number, slot_number)

Example:

(empid=202) → (Page 10, Slot 3)

So index returns:

Page 10 Slot 3

NOT raw byte offset yet.


Case B: Clustered Index (like MySQL InnoDB)

InnoDB stores:

  • Data inside the primary key index itself

So primary index returns:

Entire row

Secondary index returns:

(primary key value)

Then engine looks up primary index.

Different internal layout, but same principle:

Ultimately we reach a page and a record position.


2️⃣ What Is Slot Number?

Inside a page, there is something called:

Slot directory (or line pointer array)

Page layout looks like:

--------------------------------
| Page Header                  |
--------------------------------
| Row Data Area (grows upward) |
--------------------------------
| Free Space                   |
--------------------------------
| Slot Directory (grows down)  |
--------------------------------

Slot directory contains entries like:

Slot 1 → offset 100
Slot 2 → offset 150
Slot 3 → offset 200

So:

(Page 10, Slot 3)

means:

Look in slot directory:
Slot 3 says → row starts at offset 200

Now we know row start offset.


3️⃣ How DB Finds Salary Column Offset

Row structure is predefined.

Suppose employee table:

empid  INT
salary INT
name   VARCHAR

Row bytes might look like:

| Row Header | empid | salary | name | ...

Engine knows:

  • empid starts at offset +0 (relative to row start)

  • salary starts at offset +4

  • name starts at offset +8

So if:

Row start offset = 200
salary offset = row_start + 4

Then:

200 + 4 = 204

That becomes WAL offset.


🔥 Now Let’s Walk Full Flow

You execute:

UPDATE employee SET salary=500 WHERE empid=202;

Step 1: Index lookup

Index returns:

(Page 10, Slot 3)

Step 2: Storage engine loads Page 10 into buffer

Step 3: From slot directory

Slot 3 → offset 200

Step 4: From row metadata

salary column offset = 4 bytes from row start

So:

final byte offset = 200 + 4 = 204

Step 5: Modify 4 bytes at offset 204

Step 6: WAL records:

Page 10
Offset 204
NewValue 500

That’s it.
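
The offset arithmetic from Steps 3 and 4 can be sketched as follows (`resolve_column_offset` is a hypothetical helper; the slot directory is a plain dict here):

```python
def resolve_column_offset(page, slot_no, column_offset_in_row):
    """Turn an index pointer (page, slot) into a final byte offset:
    the slot directory gives the row start, the row layout gives the
    column's position relative to that start."""
    row_start = page["slots"][slot_no]         # e.g. Slot 3 -> offset 200
    return row_start + column_offset_in_row    # e.g. salary at row_start + 4

# The slot directory of Page 10, as in the example above.
page10 = {"slots": {1: 100, 2: 150, 3: 200}}
```

With `slot_no=3` and a salary column 4 bytes into the row, this yields 204, which is exactly the offset the WAL record carries.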


🎯 So What Exactly Does Index Return?

In heap-based storage (like PostgreSQL):

Index → (Page Number, Slot Number)

In clustered storage (InnoDB primary index):

Index → actual row location inside B+Tree page

But ultimately it becomes:

Page + Record pointer

NOT raw byte offset directly.


🧠 Important Insight

Index layer operates at:

Logical row pointer level

Storage layer operates at:

Physical byte offset level

Execution engine bridges the two.


4️⃣ Why Slot Directory Exists

You might ask:

Why not store rows sequentially and compute offset directly?

Because:

  • Rows can be deleted

  • Rows can move

  • Rows can shrink/expand

  • Page needs compaction

Slot directory allows:

  • Row movement without breaking index pointers

  • Only slot offset changes, not slot number

Very clever design.


🏗 Big Picture Architecture

Index (B+Tree)
   ↓
Returns (PageID, SlotID)
   ↓
Page loaded
   ↓
Slot directory gives row offset
Row metadata gives column offset
   ↓
Byte modification
   ↓
WAL records page + offset

⚡ Why This Is Extremely Powerful

Because:

  • Index doesn’t need to know row layout

  • WAL doesn’t need to know SQL

  • Storage doesn’t need to know business logic

  • Each layer is cleanly separated

This is why databases scale.


If you want, next we can go into:

  • What happens if salary update increases row size?

  • How page compaction works?

  • How HOT update works in PostgreSQL?

  • Why secondary indexes in InnoDB are expensive?

  • How MVCC changes row structure?

You're now thinking at real storage-engine level.

employee table: empid int, name varchar(255), salary int

This is how the data is saved on disk (I know it is saved as bits). Let's assume there are 2 rows saved on disk; the 1st row's offset starts at 200, and the 2nd row's offset starts at 211:

101 Ravi 500
102 Kumar 600

Now my question is: if I want to update the name of empid 101 to "Ravi Kumar Garlapati", what happens internally, and how does the data movement happen? Does employee 102's data move to the right? Or is 101 deleted in the page and fresh-inserted into some other new page?

🔥 Beautiful question.
Now you’re touching the hardest part of storage engines:

What happens when a row grows in size?

Let’s walk through it carefully.


First: Your Example Page

You simplified it as:

Offset 200 → 101 Ravi 500
Offset 211 → 102 Kumar 600

Now you want to update:

name of empid 101 → "Ravi Kumar Garlapati"

Old length ≈ 4 chars
New length ≈ 21 chars

So the row becomes larger.


⚠️ Important Reality

Rows are NOT stored as fixed width if you use:

VARCHAR(255)

They are stored as:

[Row header] [empid 4 bytes] [name length + actual string bytes] [salary 4 bytes]

So increasing name increases total row size.

Now the real question:

What does database do if row size increases?

Answer:

It depends on the engine.


Case 1️⃣: PostgreSQL Behavior (Heap Storage)

In PostgreSQL:

Rows are immutable.

Meaning:

UPDATE is actually DELETE + INSERT (logically)

What happens:

  1. Old row is marked as “dead” (not physically removed immediately)

  2. New row version is inserted

  3. Index updated to point to new location

So in your example:

Offset 200 → old row (marked dead)
Offset ??? → new bigger row inserted

If page has enough free space:
→ new version inserted in same page

If page doesn't have space:
→ new version inserted into another page

So:

❌ 102 does NOT shift right
❌ No in-place expansion

Instead:

A new version of row is created.

This is part of MVCC design.


Case 2️⃣: MySQL InnoDB Behavior

In MySQL (InnoDB):

InnoDB tries in-place update first.

Two possibilities:

✅ If there is enough free space in page:

It may:

  • Shift subsequent rows to the right

  • Update row in-place

But this is expensive.

❌ If not enough space:

It does something similar to PostgreSQL:

  • Delete-mark old record

  • Insert new record elsewhere

  • Update pointers


🔥 Important Concept: Pages Have Free Space Area

Earlier I showed:

--------------------------------
| Page Header                  |
--------------------------------
| Row Data (grows downward)    |
--------------------------------
| Free Space                   |
--------------------------------
| Slot Directory (grows up)    |
--------------------------------

Rows don’t sit tightly packed forever.

There is fragmentation and free space.


Now Let's Walk Your Exact Scenario

Initial Page:

Offset 200 → 101 Ravi 500
Offset 211 → 102 Kumar 600
Free space starts at 222

Now new row size increases from 11 bytes → maybe 28 bytes.

Engine checks:

Is there enough free space in this page?


Scenario A: Enough Space

Then engine may:

  1. Mark old row deleted (if MVCC)

  2. Insert new row at end of page

  3. Update slot directory

Final layout:

Offset 200 → old version (dead)
Offset 211 → 102 Kumar 600
Offset 222 → 101 Ravi Kumar Garlapati 500

Index updated to point to new slot.


Scenario B: Not Enough Space

Engine:

  1. Marks old row deleted

  2. Finds another page with free space

  3. Inserts new row there

  4. Updates index

So row physically moves to another page.
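
Scenarios A and B can be folded into one sketch. `update_growing_row` is a hypothetical helper, and free-space bookkeeping is reduced to a single counter per page:

```python
def update_growing_row(pages, page_id, slot_no, new_row):
    """Sketch: mark the old version dead, insert the new version in the
    same page if it fits, else in another page with free space, and
    return the new location so the index can be repointed."""
    page = pages[page_id]
    page["dead"].add(slot_no)                 # old version stays, marked dead
    # Prefer the same page (Scenario A), then any other page (Scenario B).
    for pid in [page_id] + [p for p in pages if p != page_id]:
        candidate = pages[pid]
        if candidate["free"] >= len(new_row):
            candidate["rows"].append(new_row)
            candidate["free"] -= len(new_row)
            return (pid, len(candidate["rows"]) - 1)   # new (page, slot)
    raise RuntimeError("no page has room; a real engine would allocate one")
```

Note what never happens: no neighboring row is shifted, and the old bytes are not overwritten, which is exactly what MVCC requires.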


❗ Important: Rows Are Not Shifted Like Arrays

Database does NOT behave like:

ArrayList shifting elements right

Because:

  • Shifting 8KB page repeatedly is expensive

  • Index pointers would break

  • Concurrency would suffer

Instead:

New version is written elsewhere.


🧠 Why PostgreSQL Always Inserts New Version

Because of MVCC:

Old version must remain visible to older transactions.

So it cannot overwrite in-place.

That’s why PostgreSQL updates create bloat.

Then VACUUM cleans dead rows later.


What About WAL in This Case?

WAL would record something like:

LSN 2001 → mark old row dead (Page 10, Offset 200)
LSN 2002 → insert new row (Page 10, Offset 222)

OR if moved:

LSN 2002 → insert into Page 25, Offset 100

Still page + offset modifications.


🎯 So Direct Answer to You

When row size increases:

Does 102 shift right?

👉 Usually NO.

Does 101 get deleted and reinserted?

👉 YES (logically almost always).

Depending on engine:

  • PostgreSQL → always new row version

  • InnoDB → sometimes in-place, sometimes new insert


🔥 Deeper Insight

Databases optimize for:

  • Write speed

  • Concurrency

  • Crash safety

NOT memory compactness.

That’s why fragmentation exists.

That’s why VACUUM / OPTIMIZE TABLE exists.


Ultra-Clean Mental Model

If update changes row size:

Old row → marked dead
New row → inserted elsewhere
Index   → updated

No shifting like C arrays.


If you want next level:

We can go into:

  • HOT updates in PostgreSQL

  • How overflow pages work

  • Why fixed-length columns behave differently

  • How TOAST storage works

  • Why large VARCHAR sometimes stored out-of-page

You're now operating at storage-engine architect level.