I'm running Linux (Fedora 34, 64-bit, two cores, four CPUs, 32 GB RAM), with PostgreSQL 13.3.
If I run stat -f on some random file as follows:
[pol@fedora inst]$ stat -f blah.txt
ID: f1b798b1610e7067 Namelen: 255 Type: ext2/ext3
Block size: 4096 Fundamental block size: 4096
Blocks: Total: 322411548 Free: 316122834 Available: 299727775
Inodes: Total: 81960960 Free: 81739842
Note: Block size: 4096 = 4096 bytes = 32768 bits.
Now, even if a file is only two bytes long ("Hi"), it will still occupy 4096 bytes on disk - that's essentially the minimum I/O the OS will perform. The OS reads data off disk in 4K "chunks" and writes it back in 4K chunks - see here for a quick overview. You might like to test this on your own system.
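A quick way to see this for yourself (a sketch, assuming GNU coreutils on a typical 4K-block filesystem such as ext4 - the filename is made up):

```shell
# Create a two-byte file and compare its logical size with its disk allocation.
printf 'Hi' > tiny.txt
sync tiny.txt    # force allocation (ext4 may delay it otherwise)
stat --format='size: %s bytes, allocated: %b blocks of %B bytes' tiny.txt
# On a typical ext4 filesystem: 2 bytes of data, but 8 x 512 = 4096 bytes allocated.
```

Note that stat reports allocation in 512-byte units (%B) for historical reasons, so 8 of those units correspond to one 4096-byte filesystem block.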
The disk itself has its own "atomic" unit. With HDDs, this was normally 512 bytes, but see the link above - "and at the hardware level the old drives used 512B sectors while new devices often write data in larger chunks (often 4kB or even 8kB)". See here for HDDs and here for SSDs. (Thanks to @RonJohn for his comment).
Likewise, the database will read data in and out in blocks (also called pages - the terminology can be confusing) - if you change one solitary bit in a record, the database will still have to read the entire page the record is on and write the entire page back onto disk once the modification is completed.
On PostgreSQL, you have the default block size of 8K.
test_1=# SELECT name, setting, short_desc, extra_desc
test_1-# FROM pg_settings
test_1-# WHERE name LIKE '%block%' OR short_desc LIKE '%block%';
      name      | setting |                  short_desc                  | extra_desc
----------------+---------+----------------------------------------------+------------
 block_size     | 8192    | Shows the size of a disk block.              |
 wal_block_size | 8192    | Shows the block size in the write ahead log. |
(2 rows)
It's important that there not be too big a gap between the HDD, the OS and the RDBMS "atomic unit" sizes - otherwise, you run the risk of torn pages - from the link:
Avoiding Torn Pages
For the Postgres file layout, Postgres reads and writes data to disk
8kb at a time. Most operating systems make use of a smaller page size,
such as 4kb. If Postgres is running on one of these operating systems,
an interesting edge case can occur. Since Postgres writes to disk in
units of 8kb and the OS writes to disk in units of 4kb, if the power
went out at just the right time, it is possible that only 4kb of an
8kb write Postgres was performing were written to disk. This edge case
is sometimes referred to as “torn pages”. Postgres does have a way of
working around torn pages, but it does increase the amount of I/O
Postgres needs to perform.
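You can check the relevant sizes on your own machine (a minimal sketch, assuming Linux with GNU coreutils - %S is stat's "fundamental block size"):

```shell
# The OS virtual-memory page size (typically 4096 on x86-64):
getconf PAGESIZE
# The fundamental block size of the root filesystem:
stat -f --format='%S' /
# Both are usually 4096 - half of PostgreSQL's 8192-byte page, which is
# why one 8 kB database write can become two separate 4 kB OS writes.
```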
Also, see here:
Partial Writes / Torn Pages
So what are full page writes about? As the comment in postgresql.conf
says it’s a way to recover from partial page writes – PostgreSQL uses
8kB pages (by default), but other parts of the stack use different
chunk sizes. Linux filesystems typically use 4kB pages (it’s possible
to use smaller pages, but 4kB is the max on x86), and at the hardware
level the old drives used 512B sectors while new devices often write
data in larger chunks (often 4kB or even 8kB).
So when PostgreSQL writes the 8kB page, the other layers of the
storage stack may break this into smaller chunks, managed separately.
This presents a problem regarding write atomicity. The 8kB PostgreSQL
page may be split into two 4kB filesystem pages, and then into 512B
sectors. Now, what if the server crashes (power failure, kernel bug, ...)?
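To make the failure mode concrete, here's a sketch that simulates a torn page with ordinary files (the filenames are made up, and dd stands in for the kernel getting only the first 4 kB of an 8 kB page onto disk before a crash):

```shell
# An 8 kB "page" full of old data (zero bytes)...
dd if=/dev/zero bs=8192 count=1 of=page_old.bin 2>/dev/null
# ...and the new page image the database intends to write (0xFF bytes).
tr '\0' '\377' < page_old.bin > page_new.bin
# Simulate the crash: only the first 4 kB of the new image reaches disk.
cp page_old.bin page_torn.bin
dd if=page_new.bin of=page_torn.bin bs=4096 count=1 conv=notrunc 2>/dev/null
# The on-disk page now matches neither the old nor the new image - it is "torn".
cmp -s page_torn.bin page_old.bin || echo "differs from old image"
cmp -s page_torn.bin page_new.bin || echo "differs from new image"
```

The torn file holds 4 kB of new data followed by 4 kB of old data - exactly the mix of old and new content that full-page writes exist to recover from.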
As with so much related to computer science, it's a matter of trade-offs and compromises - here's a benchmark of PostgreSQL run on the same system, varying only the block size - from the post:
[benchmark chart: Samsung SSD 840, 500 GB - TPS (transactions/second) for various database block sizes]
So, you can see that a naïve "make the db block size as big as possible" approach doesn't work very well. All I will say about this is that database benchmarks are a total quagmire... for some applications 1 MB may well be suitable - although straying beyond 16 KB would require considerable justification (and note that in PostgreSQL the block size is fixed at compile time, via configure's --with-blocksize option, so changing it means rebuilding the server). Systems' default parameters are just that - defaults - chosen to be reasonably good under the widest range of circumstances...
Re. the historical part of the question - yes, a lot of it relates to the days when disks came in 512-byte sectors... HDDs, despite advances in speed and incremental improvements in the composition of the rust, are essentially unchanged since the first HDDs - HDD performance has fallen way behind that of CPUs and RAM... capacity has increased, speed not so much (see here) - hence the birth of the "NoSQL" bandwagon (but that's another day's work!).
There's a lot happening in the area these days...
If you are interested - and have the time - there's an article here on Linux I/O and how it's being revolutionised by io_uring (see the wiki - and links therein). I've perused it a couple of times, but it's a bit above my pay grade...
Intel are also providing an open-source toolkit, the SPDK (the Storage Performance Development Kit), which appears (at least to my untrained eye) to be a way of allowing user-space processes to access storage hardware directly, bypassing the kernel... here's an interesting take on how this could be applied to databases.
And, also arriving on the scene, is byte-addressable storage... for reasons best known to hardware designers, SSDs (at least some of them) also have blocks and pages... They are not a panacea (compare TLC SSD and normal HDD write speeds - only about a 30% gain).
However, on the (far?) horizon, there's Persistent Memory - from the post:
Persistent Memory (such as Intel Optane DC Persistent Memory) natively
operates byte-by-byte rather than in blocks of data like conventional
storage. Data is persisted in chunks of 8-bytes at most in Persistent
Memory (again, using the default behavior). For applications (like
databases) that are based on a BLOCK construct, the 8-byte atomicity
of Persistent Memory can be an issue. Writing 8,192 bytes of data (an
8K block) will get persisted in 1,024 chunks of 8-bytes each. Power
failure or other abnormal condition can leave blocks of data
"fractured" or "torn" into pieces, with portions of the block
containing old data and other portions with new data. Applications
(such as databases) need to be changed to tolerate this sort of
fracturing or tearing of blocks. Otherwise, these are effectively
corrupted blocks of data. Continue reading for more information on the
topic of corruption and especially how this relates to Oracle...
So, we can see how problems such as torn pages can still occur with these systems - but they do offer the possibility - once database programmers catch up - of having a block size of 8 bytes (not 8 KB): if you want to change a BIGINT's value, all you have to do is read 8 bytes and write 8 bytes...
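As a rough illustration (an ordinary file and dd standing in for byte-addressable persistent memory - the filename, slot number and value are all made up), updating one 8-byte slot in an 8 kB "page" touches exactly 8 bytes:

```shell
# An 8 kB "page" of zeroes standing in for a table block.
dd if=/dev/zero of=table.bin bs=8192 count=1 2>/dev/null
# Write one little-endian BIGINT value (42, octal \052) into slot 3,
# i.e. bytes 24..31 only - the other 8184 bytes are never touched.
printf '\052\000\000\000\000\000\000\000' |
  dd of=table.bin bs=8 count=1 seek=3 conv=notrunc 2>/dev/null
# Inspect the updated slot.
od -An -tu1 -j 24 -N 8 table.bin
```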
Maybe if we get down to this level, or even to single-byte granularity, the whole notion of pages will go out the window for the disk, the OS and the RDBMS? I'm sure that there'll still be file systems - I'm just not sure how they'll work.
This is a fascinating area (+1 for the question!), especially for database geeks.