“HDD and SSD Alignment”: What’s that?!
HDD: Hard disk drive, utilising spinning disks to store data that is physically accessed by read/write heads mounted on a moving arm;
SSD: Solid State Drive, utilising solid-state integrated circuits (typically flash devices) where the only movement is with electrons.
In the days of old, when HDDs were the size of washing machines and had a storage capacity that was a mere fraction of even the smallest of today’s memory sticks, certain ways of organising the data on those devices became fixed into the binary of computing for many years to come… Until now, in the year 2011…
We now have Advanced Format (4kByte sectors) introduced for all HDDs. There are now also the newfangled no-moving-parts things called SSDs…
The SSD I’m playing with is a SATA2 device which has a claimed:
- 8kB page size;
- 16kB logical to physical mapping chunk size;
- 2MB erase block size;
- 64MB cache.
The sector size reported to Linux 3.0 is the default 512 bytes!
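(For reference, the sector sizes the kernel believes can be checked directly, /dev/sdd being where this SSD appears on my system:)
# cat /sys/block/sdd/queue/logical_block_size
# cat /sys/block/sdd/queue/physical_block_size
# blockdev --getss --getpbsz /dev/sdd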
Hexdump for the pristine new device (60GB) shows:
# hexdump -C /dev/sdd
00000000 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff |................|
*
df99e6000
For example, at present a multi-level cell (MLC) SSD typically has a quoted lifespan of 5000 writes (program/erase cycles) per cell, and wear levelling is used to distribute the data writes evenly across all the cells. So, are write chunk size and write alignment really of concern?
To give, say, a 20% margin, drop that cell lifespan to 4000 writes to allow for “imperfect” wear-levelling. For an average-sized 120GB SSD, that gives 480TB worth of data writing before cell wear-out failure. That’s 263GB/day for 5 years, or 182MB/min of continuous writing. However, that ignores write amplification due to misalignment or due to using a small write chunk size… It also ignores the write overhead (in effect an ‘indirect’ write amplification effect) of the internal shuffling of static data around for wear levelling.
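(A back-of-envelope check of those numbers, for anyone wanting to rerun the sums with their own device size:)
# echo "120 * 4000 / 1000" | bc       # 480 TB written before nominal wear-out
# echo "480000 / (5 * 365)" | bc      # ~263 GB/day sustained for 5 years
# echo "263 * 1000 / (24 * 60)" | bc  # ~182 MB/min of continuous writing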
For x86 architecture (for example, PCs) Linux systems, a 4kByte page size is used for the system memory. That is often used as the write chunk size for filesystems, and it is also the size of virtual memory blocks written out to a swap partition. For the device example above, that could give a worst-case write amplification of x4 for a lightly utilised device, or x512 for a fully utilised device. That brings the data write numbers down into the range of what can typically be seen even for a home PC system.
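(Again just back-of-envelope, for the chunk sizes claimed for my device above:)
# echo "16 / 4" | bc      # x4:   a 4kB write landing in a 16kB mapping chunk
# echo "2048 / 4" | bc    # x512: a 4kB write forcing a 2MB erase block rewrite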
The story can be far worse for small randomly spread writes, or approaching optimally good if consecutive writes are aggregated by the filesystem or by the SSD firmware and cache.
So… For light use, or non-24/7/365 operation, there is no real worry for a reasonable 5-year lifespan. For heavy or continuous use, perhaps a little care for sympathetic utilisation of an SSD might go a long way to help performance and endurance.
An extreme case is for cheap USB memory sticks that have rudimentary ‘wear levelling’ only for the FAT area expected for a Windows-FAT formatted storage device. They can have 4MByte erase blocks (or larger, worse?) and are intended for storing large files (images, video, or sound) that are written sequentially in large chunks, and rewritten infrequently, so as not to suffer write amplification problems.
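(If you want to guess the erase block and page sizes of such a stick rather than trust the marketing, Arnd Bergmann’s flashbench tool, from the “Optimizing Linux with cheap flash drives” article linked below, infers them from access timings. A typical invocation, with a hypothetical device name:)
# flashbench -a /dev/sdX --blocksize=1024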
Measured examples for two SSDs are given in MLC NAND Flash Hits The Enterprise, with usage cases that suggest a lifespan of anywhere from over 99 years down to less than a year depending on the test case. Another important aspect is that wear-levelling garbage collection can reduce performance from the advertised 100s of MB/s down to a paltry 10s of MB/s.
The rapidly developing Btrfs filesystem looks highly favourable for use on SSDs taking advantage of:
- Subvolumes to avoid the need for a partition table. Just format the entire SSD with one Btrfs root, and then use subvolumes to implement whatever partitioning scheme you might want (see the sketch below);
- Extents based storage to efficiently store large files in sequential blocks;
- Compression to transparently minimise the amount of data written in the first place. The data compression can also in effect speed up the IO bandwidth seen;
- Space-efficient packing of small files;
- Support for bulk TRIM on SSDs;
- And the various tuning intended to optimise Btrfs for use on SSDs.
There are other highly favourable features to make that a ‘must use’ filesystem. Also see Btrfs mount options and btrfs Sysadmin Guide.
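As a minimal sketch of that “no partition table” approach (device name, mount points and subvolume names are hypothetical, and it is not what I ended up doing here given the gotcha below):
# mkfs.btrfs -L SSD_btrfs /dev/sdd
# mount /dev/sdd /mnt/ssd
# btrfs subvolume create /mnt/ssd/root
# btrfs subvolume create /mnt/ssd/home
# mount -o subvol=home,compress=lzo,noatime /dev/sdd /home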
However… A big “Gotcha” with Btrfs on my SSD, as far as avoiding write amplification goes, is:
Using mkfs.btrfs -l and -n for sizes other than 4096 is not supported
(Still the case as of November 2011. See Btrfs changelog. Since fixed, see below.)
As detailed in this series of maillist postings, I’ve also found similar problems for kernel 3.1.0, in that mkfs.btrfs accepts formatting with, for example, 16kByte leaves/nodes/sectors, but you soon get errors when trying to use that format.
Unfortunately, there is no clear information for Btrfs as to how far the developers have got with SSD-sympathetic features and block alignment (note Btrfs design, Data Structures, User:Wtachi/On-disk Format… all a work in progress…). Also, there is as yet no fsck (file system check and repair).
[Edit]
Since fixed. See: mkfs.btrfs(8) Manual Page
… Specify the nodesize, the tree block size in which btrfs stores data. The default value is 16KB (16384) or the page size, whichever is bigger…
[/Edit]
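So with a recent btrfs-progs, something along these lines should now be accepted (a sketch only; 16384 chosen to match the 16kByte mapping chunk size claimed for my SSD):
# mkfs.btrfs -L SSD_btrfs -n 16384 /dev/sdd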
In contrast, there are at least a number of tweaks available to make ext4 sympathetic to SSDs… Hence, instead, partition with due care for 16kByte (mapping chunk size) and 2MByte (erase block) alignment, and tweak the ext4 filesystem to be more SSD sympathetic…
Using the old fdisk and cfdisk utilities gave rather confused numbers for the disk sectors. They both align to 2048 sectors (1MiB boundaries), and seem to do so regardless of what CHS values you try to fool them with… They are rather old and out of date now so I’m not sure there’s any value in chasing features or bugs for those.
One big plus for Btrfs is the facility of mounting “subvolumes” instead of requiring separate physical partitions on a disk. That nicely avoids the old MBR offset problem. Unfortunately, ext4 has no similar feature.
However, the limitations and quirkiness of the old MBR and associated utilities led me to use the new GPT (GUID Partition Table). That allows at least 128 partitions!
Hence, the partition scheme I’ve used is (using gdisk):
Disk /dev/sdd: 117231408 sectors, 55.9 GiB
Logical sector size: 512 bytes
Disk identifier (GUID): xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
Partition table holds up to 128 entries
First usable sector is 34, last usable sector is 117231374
Partitions will be aligned on 32-sector boundaries
Total free space is 30 sectors (15.0 KiB)

Number  Start (sector)    End (sector)  Size        Code  Name
   1              64        16777215   8.0 GiB     8200  SSD_swap
   2        16777216        33554431   8.0 GiB     0700  SSD_home
   3        33554432        67108863   16.0 GiB    0700  SSD_root
   4        67108864        83886079   8.0 GiB     0700  SSD_var
   5        83886080       100663295   8.0 GiB     0700  SSD_misc01
   6       100663296       117231374   7.9 GiB     0700  SSD_misc02
Those numbers show that for what is presumably physically a 64GiB device, 8GiB+ is ‘hidden’ for data shuffling/wear-levelling and remapping by the SSD firmware.
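(For anyone repeating this with more recent tools: sgdisk, from the gdisk package, and parted can enforce and check the alignment for you. A rough sketch, with partition sizes and type codes purely illustrative; “-a 4096” aligns partition starts to 4096 sectors, i.e. the 2MByte erase block:)
# sgdisk -a 4096 -n 1:0:+8G -c 1:SSD_swap -t 1:8200 /dev/sdd
# sgdisk -a 4096 -n 2:0:+8G -c 2:SSD_home -t 2:8300 /dev/sdd
# parted /dev/sdd align-check optimal 1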
Note that the 32 sectors of GPT partition entries are offset by the protective MBR (sector 0) and the GPT header (sector 1). I’m not going to be booting from this device so I have no need of the spare 30 free sectors for GRUB2 or for anything else. Then again, for the sake of a meagre 15.0KiB left unutilised, keeping that old historical artefact is useful for historical compatibility…
When using GRUB2, it is recommended to use a BIOS boot partition on GPT disks to store the GRUB2 core image that would otherwise be put into the commonly unpartitioned area (sectors 1 to 62) on old MBR-partitioned disks. To create one of those, you need to use a tool such as parted >= 1.9.
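A sketch of creating such a BIOS boot partition with parted (only needed if GRUB2 is to boot from the GPT disk; device name and sizes hypothetical):
# parted /dev/sdX mklabel gpt
# parted /dev/sdX mkpart biosboot 1MiB 2MiB
# parted /dev/sdX set 1 bios_grub on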
When written, the GPT partition table above shows on the SSD as:
# hexdump -C /dev/sdd
00000000 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff |................|
*
000001b0 ff ff ff ff ff ff ff ff 00 00 00 00 00 00 00 00 |................|
000001c0 02 00 ee ff ff ff 01 00 00 00 2f cf fc 06 00 00 |........../.....|
000001d0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
*
000001f0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 55 aa |..............U.|
00000200 45 46 49 20 50 41 52 54 00 00 01 00 5c 00 00 00 |EFI PART....\...|
00000210 70 e5 6e f4 00 00 00 00 01 00 00 00 00 00 00 00 |p.n.............|
00000220 2f cf fc 06 00 00 00 00 22 00 00 00 00 00 00 00 |/.......".......|
00000230 0e cf fc 06 00 00 00 00 ae 93 8a 3a 2f 63 92 4f |...........:/c.O|
00000240 a7 a9 f5 d8 15 c9 f9 b8 02 00 00 00 00 00 00 00 |................|
00000250 80 00 00 00 80 00 00 00 37 b7 57 ec 00 00 00 00 |........7.W.....|
00000260 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
*
00000400 6d fd 57 06 ab a4 c4 43 84 e5 09 33 c8 4b 4f 4f |m.W....C...3.KOO|
00000410 dc 47 1b 18 bc 4a 21 4d b1 18 91 7e fd e4 a7 2e |.G...J!M...~....|
00000420 40 00 00 00 00 00 00 00 ff ff ff 00 00 00 00 00 |@...............|
00000430 00 00 00 00 00 00 00 00 53 00 53 00 44 00 5f 00 |........S.S.D._.|
00000440 73 00 77 00 61 00 70 00 00 00 00 00 00 00 00 00 |s.w.a.p.........|
00000450 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
*
00000480 a2 a0 d0 eb e5 b9 33 44 87 c0 68 b6 b7 26 99 c7 |......3D..h..&..|
00000490 a6 6a 73 85 6c 5c f6 49 8e 71 c4 ee b5 ce 8f 60 |.js.l\.I.q.....`|
000004a0 00 00 00 01 00 00 00 00 ff ff ff 01 00 00 00 00 |................|
000004b0 00 00 00 00 00 00 00 00 53 00 53 00 44 00 5f 00 |........S.S.D._.|
000004c0 68 00 6f 00 6d 00 65 00 00 00 00 00 00 00 00 00 |h.o.m.e.........|
000004d0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
*
00000500 a2 a0 d0 eb e5 b9 33 44 87 c0 68 b6 b7 26 99 c7 |......3D..h..&..|
00000510 a6 29 3f fd 5c 6b 45 4c ac 1d 4c 2b 02 8c a5 8a |.)?.\kEL..L+....|
00000520 00 00 00 02 00 00 00 00 ff ff ff 03 00 00 00 00 |................|
00000530 00 00 00 00 00 00 00 00 53 00 53 00 44 00 5f 00 |........S.S.D._.|
00000540 72 00 6f 00 6f 00 74 00 00 00 00 00 00 00 00 00 |r.o.o.t.........|
00000550 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
*
00000580 a2 a0 d0 eb e5 b9 33 44 87 c0 68 b6 b7 26 99 c7 |......3D..h..&..|
00000590 96 ed 2e 63 94 5a d5 42 a5 9d 02 13 5f d3 b8 83 |...c.Z.B...._...|
000005a0 00 00 00 04 00 00 00 00 ff ff ff 04 00 00 00 00 |................|
000005b0 00 00 00 00 00 00 00 00 53 00 53 00 44 00 5f 00 |........S.S.D._.|
000005c0 76 00 61 00 72 00 00 00 00 00 00 00 00 00 00 00 |v.a.r...........|
000005d0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
*
00000600 a2 a0 d0 eb e5 b9 33 44 87 c0 68 b6 b7 26 99 c7 |......3D..h..&..|
00000610 6e d9 32 69 18 1c a8 4e 9f 7b 9e de 6c 42 43 da |n.2i...N.{..lBC.|
00000620 00 00 00 05 00 00 00 00 ff ff ff 05 00 00 00 00 |................|
00000630 00 00 00 00 00 00 00 00 53 00 53 00 44 00 5f 00 |........S.S.D._.|
00000640 6d 00 69 00 73 00 63 00 30 00 31 00 00 00 00 00 |m.i.s.c.0.1.....|
00000650 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
*
00000680 a2 a0 d0 eb e5 b9 33 44 87 c0 68 b6 b7 26 99 c7 |......3D..h..&..|
00000690 5a e1 26 1c de 71 29 4f a0 24 22 ac 26 65 4c 98 |Z.&..q)O.$".&eL.|
000006a0 00 00 00 06 00 00 00 00 0e cf fc 06 00 00 00 00 |................|
000006b0 00 00 00 00 00 00 00 00 53 00 53 00 44 00 5f 00 |........S.S.D._.|
000006c0 6d 00 69 00 73 00 63 00 30 00 32 00 00 00 00 00 |m.i.s.c.0.2.....|
000006d0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
*
00004400 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff |................|
*
df99e1e00 6d fd 57 06 ab a4 c4 43 84 e5 09 33 c8 4b 4f 4f |m.W....C...3.KOO|
df99e1e10 dc 47 1b 18 bc 4a 21 4d b1 18 91 7e fd e4 a7 2e |.G...J!M...~....|
df99e1e20 40 00 00 00 00 00 00 00 ff ff ff 00 00 00 00 00 |@...............|
df99e1e30 00 00 00 00 00 00 00 00 53 00 53 00 44 00 5f 00 |........S.S.D._.|
df99e1e40 73 00 77 00 61 00 70 00 00 00 00 00 00 00 00 00 |s.w.a.p.........|
df99e1e50 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
*
df99e1e80 a2 a0 d0 eb e5 b9 33 44 87 c0 68 b6 b7 26 99 c7 |......3D..h..&..|
df99e1e90 a6 6a 73 85 6c 5c f6 49 8e 71 c4 ee b5 ce 8f 60 |.js.l\.I.q.....`|
df99e1ea0 00 00 00 01 00 00 00 00 ff ff ff 01 00 00 00 00 |................|
df99e1eb0 00 00 00 00 00 00 00 00 53 00 53 00 44 00 5f 00 |........S.S.D._.|
df99e1ec0 68 00 6f 00 6d 00 65 00 00 00 00 00 00 00 00 00 |h.o.m.e.........|
df99e1ed0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
*
df99e1f00 a2 a0 d0 eb e5 b9 33 44 87 c0 68 b6 b7 26 99 c7 |......3D..h..&..|
df99e1f10 a6 29 3f fd 5c 6b 45 4c ac 1d 4c 2b 02 8c a5 8a |.)?.\kEL..L+....|
df99e1f20 00 00 00 02 00 00 00 00 ff ff ff 03 00 00 00 00 |................|
df99e1f30 00 00 00 00 00 00 00 00 53 00 53 00 44 00 5f 00 |........S.S.D._.|
df99e1f40 72 00 6f 00 6f 00 74 00 00 00 00 00 00 00 00 00 |r.o.o.t.........|
df99e1f50 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
*
df99e1f80 a2 a0 d0 eb e5 b9 33 44 87 c0 68 b6 b7 26 99 c7 |......3D..h..&..|
df99e1f90 96 ed 2e 63 94 5a d5 42 a5 9d 02 13 5f d3 b8 83 |...c.Z.B...._...|
df99e1fa0 00 00 00 04 00 00 00 00 ff ff ff 04 00 00 00 00 |................|
df99e1fb0 00 00 00 00 00 00 00 00 53 00 53 00 44 00 5f 00 |........S.S.D._.|
df99e1fc0 76 00 61 00 72 00 00 00 00 00 00 00 00 00 00 00 |v.a.r...........|
df99e1fd0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
*
df99e2000 a2 a0 d0 eb e5 b9 33 44 87 c0 68 b6 b7 26 99 c7 |......3D..h..&..|
df99e2010 6e d9 32 69 18 1c a8 4e 9f 7b 9e de 6c 42 43 da |n.2i...N.{..lBC.|
df99e2020 00 00 00 05 00 00 00 00 ff ff ff 05 00 00 00 00 |................|
df99e2030 00 00 00 00 00 00 00 00 53 00 53 00 44 00 5f 00 |........S.S.D._.|
df99e2040 6d 00 69 00 73 00 63 00 30 00 31 00 00 00 00 00 |m.i.s.c.0.1.....|
df99e2050 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
*
df99e2080 a2 a0 d0 eb e5 b9 33 44 87 c0 68 b6 b7 26 99 c7 |......3D..h..&..|
df99e2090 5a e1 26 1c de 71 29 4f a0 24 22 ac 26 65 4c 98 |Z.&..q)O.$".&eL.|
df99e20a0 00 00 00 06 00 00 00 00 0e cf fc 06 00 00 00 00 |................|
df99e20b0 00 00 00 00 00 00 00 00 53 00 53 00 44 00 5f 00 |........S.S.D._.|
df99e20c0 6d 00 69 00 73 00 63 00 30 00 32 00 00 00 00 00 |m.i.s.c.0.2.....|
df99e20d0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
*
df99e5e00 45 46 49 20 50 41 52 54 00 00 01 00 5c 00 00 00 |EFI PART....\...|
df99e5e10 74 f4 64 bb 00 00 00 00 2f cf fc 06 00 00 00 00 |t.d...../.......|
df99e5e20 01 00 00 00 00 00 00 00 22 00 00 00 00 00 00 00 |........".......|
df99e5e30 0e cf fc 06 00 00 00 00 ae 93 8a 3a 2f 63 92 4f |...........:/c.O|
df99e5e40 a7 a9 f5 d8 15 c9 f9 b8 0f cf fc 06 00 00 00 00 |................|
df99e5e50 80 00 00 00 80 00 00 00 37 b7 57 ec 00 00 00 00 |........7.W.....|
df99e5e60 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
*
df99e6000
From Wikipedia: MBR:
By convention, there are exactly four primary partition table entries in the MBR partition table scheme…
An artefact of hard disk technology from the era of the IBM PC, the partition table subdivides a storage medium using units of cylinders, heads, and sectors (CHS addressing). These values no longer correspond to their namesakes in modern disk drives, and other devices such as solid-state drives do not physically have cylinders or heads.
Sector indices have always begun with a 1, not a zero, and due to an early error in MS-DOS, the heads are generally limited to 255 instead of 256. Both the partition length and partition start address are sector values stored as 32-bit quantities. The sector size is fixed at 512 (2^9) bytes, which implies that either the maximum size of a partition or the maximum start address (both in bytes) cannot exceed 2 TB−512 bytes…
Details for the ext4 layout and tweaks:
- kernelnewbies.org/Ext4
- [linux/kernel/git/torvalds/linux-2.6.git] / Documentation / filesystems / ext4.txt
- Ext4 Disk Layout
- mkfs.ext4(8) – Linux man page
- mount(8) – Linux man page
- tune2fs(8) – Linux man page
- e2fsck(8) – Linux man page
And the (experimental) magic incantations are:
# Format with:
# Add -K to avoid blanking a new pristine SSD! NB: "-f fragment-size" is obsolete.
# mke2fs -v -T ext4 -L fs_label_name -b 4096 -E stride=4,stripe-width=4,lazy_itable_init=0 -O none,dir_index,extent,filetype,flex_bg,has_journal,sparse_super,uninit_bg /dev/sdX
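# (stride=4 and stripe-width=4 are in 4096-byte filesystem blocks,
#  i.e. 4 x 4096 = 16kByte, to match the claimed 16kByte logical-to-physical
#  mapping chunk size of this SSD.)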
#
# (For the sake of my paranoia!)
# To force a check on every mount:
# tune2fs -c 1 /dev/sdX
#
# And mount with (in /etc/fstab):
#/dev/sdX /mnt/ssd_mount_point ext4 journal_checksum,barrier,stripe=4,delalloc,commit=300,max_batch_time=15000,min_batch_time=200,discard,noatime,nouser_xattr,noacl,errors=remount-ro,noexec,nosuid,nodev,noauto 1 2
The “nobarrier” mount option appears to be needed for reliable operation with USB memory sticks.
noatime implies nodiratime: Does noatime imply nodiratime?
And this is a work in progress! A few links to be assembled…
- I/O Limits: block sizes, alignment and I/O hints Comprehensive comment from msnitzer @ redhat 28-May-2010
- Optimal Usage of SSDs Under Linux: Optimize Your I/O Subsystem (PDF Slides) Werner Fischer @ Thomas-Krenn.AG, LinuxCon Europe 2011
- TLDP: Software-RAID-0.4x-HOWTO-8 Performance, Tools & General Bone-headed Questions (See question 8, How does the chunk size (stripe size) influence the speed of my RAID…?)
- The SSD Relapse: Understanding and Choosing the Best SSD by Anand Lal Shimpi on 30/8/2009, The third major SSD article…
- The SSD Anthology: Understanding SSDs and New Drives from OCZ by Anand Lal Shimpi on 18/3/2009, The SSD market appears to have changed a lot…
- SSD followup Linus’ blog, Wednesday, 18 March, 2009, …all the cheaper ones were unusable due to having horrible random write performance, which is something you notice really quickly in real life as nasty pauses. …
- OCZ’s Vertex 2 Pro Preview: The Fastest MLC SSD We’ve Ever Tested by Anand Lal Shimpi on 31/12/2009 – “Enter the SandForce” [controller]. Chart included showing write amplification factors
- OCZ’s Agility 2 Reviewed: The First SF-1200 with MP Firmware by Anand Lal Shimpi on 21/4/2010, … a little graph to illustrate why SSDs are both necessary and incredibly important…
- OCZ Agility 3 (240GB) Review by Anand Lal Shimpi on 24/5/2011 – further details on the Sandforce controllers, compression, and error correction
- The Register: Flashy fists fly as OCZ and DDRdrive row over SSD performance – Shows an example of worst-case write speed for a non-cached SSD compared to the marketed specs
- Seems M$ have been reading the OCZ forums 😉 … m$ are admitting you reviewed SSD incorrectly unless you aligned the partitions on the drive..especially in XP.
- Blocksize benchmarks Btrfs now supports a range of blocksizes for btree metadata, including blocks larger than the page size…
- SSDs Shifting to 25nm NAND – What You Need to Know February 14th, 2011, … doesn’t mean faster drives or greater reliability. SSD manufacturers are starting to release drives with 25nm NAND flash …
- OCZ Vertex 2 25nm Review (OCZSSD2-2VTXE60G) February 15th, 2011, … Buyers can’t tell the difference thanks to OCZ’s dubious marketing, but we can…
- …the HyperX 3K is rated for 3,000 MLC NAND P/E cycles whereas the regular HyperX SSD is rated for 5,000 P/E cycles. …
- SSD Disk Allocation Size Test 28 January 2011 … I then used the ATTO Disk Benchmark tool to test each configuration…
- How I figured out my OCZ Core 64GB SSD Jul 25 … I soon realized it wasn’t so straightforward…
- Using Solid State Disks on Linux February 21st, 2009, … Tuning Linux for SSDs…
- deadline-iosched.txt Deadline IO scheduler tunables
- SSD FAQ … SSD maintenance features…
- Aligning an SSD on Linux
- Write amplification and Basic SSD operation
- data alignment for SSD: Stripe size or sector size given with -s?
- Optimizing Linux with cheap flash drives February 18, 2011, by Arnd Bergmann, … This article will review the properties of typical flash devices and list some optimizations that should allow Linux to get the most out of low-cost flash drives…
- archlinux: Solid State Drives
- I/O Schedulers Very good readable summary
- OCZ: Linux – Tips, tweaks and alignment
- PC Guide: Hard Disk Size Barriers
- Master boot record
- Extended boot record
- GUID Partition Table
- Disk sector
- Hard disk drive
- Cylinder-head-sector
- Advanced Format
- The Impact of 4 KB HDD Sector Size
- 512 Byte Emulation – Issues
Note that this post is slowly being revamped and split into smaller byte-sized pieces.
Meanwhile, SSD technology advances apace and here are two examples of what can be done…
Firstly, note the different over-provisioning and program-erase cycles counts for the SSDs compared on:
Write Endurance: Comparing MLC, eMLC, And SLC
To bust the jargon: “MLC” is “Multi-Level Cell” typically storing 2 bits of data (4 electrical charge/voltage levels) per storage cell, “eMLC” is MLC but supposedly characterized for greater endurance for surviving more program-erase cycles, and “SLC” is “Single-Level Cell” where just one bit of data (2 electrical charge/voltage levels) is stored per storage cell.
And secondly there is:
SMART’s new SSD wrings extra juice from MLC flash
This comment from the forum is much more informative than the marketing article:
Re: “can do 50 full drive writes a day for five years”
So… Better firmware and aggressive caching gives a x25 to x50 lifespan boost for the existing technology?… Suggesting a little care with filesystem config and system usage can also go a long way…
Some informative but fun The Register irreverence:
Flashboys: HEELLLP, we’re trapped in a process size shrink crunch
The 3D die stack tack: Toshiba builds towering column of flash
TLC NAND could penetrate biz with flash-to-flash backup
A range of real numbers are given here for write endurance of some new flash devices:
… SSD Review: For The Enterprise – Results: Write Endurance
The report there suggests a write endurance for the three SSDs under test of 12.49 TB, 68.39 TB and 72.69 TB for 1% “indicated wear”. So… How long would it take you to write over a Peta-Byte of data?…
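(Extrapolating that 1% figure linearly is crude, but it gives the order of magnitude:)
# echo "12.49 * 100" | bc    # ~1249 TB, i.e. roughly 1.2 PB even for the weakest of the three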
Hence, with good wear-levelling for recent devices, this has moved on to being more a question of how you use your SSD as to whether wear is a concern or not. Also, at least with recent GPT-aware partition tools, partition alignment should no longer be a problem.
Thanks go to Jason for finding this good article and comment on SSD wear-out:
Concerns about SSD reliability debunked (again)
The article gives a bit of background and a few plots for what lifetime can be expected for a recent SSD with good wear-leveling.
My own personal view from following all this over the years of development is that now with recent devices and recent software/distros, this is all simply not a problem and no longer a concern.
Indeed for ‘normal’ desktop use, any recent SSD should last longer than the user!
However… If you are running servers with pathological usage with virtual machine images and databases, then you should already know enough to not be too flash-killing silly!
For my own examples of servers using SSDs, including having swap space on there, the daily GBytes writes suggest that the systems will be obsolete and scrapped long before the SSDs should wear out.
… Which is why I’ve never bothered finishing a follow-on article to this old rambling post of old days SSD confusion. (Then again, hopefully this old article makes for a fun investigative read!)
One technical detail to note for SSDs still: an important performance marker is how fast they operate for 4kiB reads/writes. However, their underlying structure usually favours 8kiB, 16kiB or 32kiB block sizes for writes. The only headache then is which (somehow optimum) block size to use for a (block-device-level) RAID setup to avoid crass unwanted write amplification… Then again, all that becomes moot if you instead use the integral filesystem-level RAID functionality of btrfs or glusterfs.
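(For the block-device-level case, the md RAID chunk size can at least be set explicitly to a multiple of the SSD’s preferred write size. A sketch, with hypothetical device names and no claim that 64kiB is the “optimum” for any given SSD:)
# mdadm --create /dev/md0 --level=0 --raid-devices=2 --chunk=64 /dev/sdX /dev/sdY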
Very low power, small physical size, low cost, very high speed, high enough capacity, and more than enough longevity, all now pretty much makes SSDs a “no-brainer” to use.
Hope of interest,
Martin
Noting:
Interestingly (and especially sympathetic for SSDs), the defaults for the btrfs format have recently been changed to use 16kiByte chunks:
For a second note:
That needs qualifying with the comment that it is currently only the case for capacities up to about 128GBytes. Larger SSDs quickly get more expensive and more power hungry than their spinning rust counterparts. The distinction is also blurred by the recent trend for the larger spinning rust HDDs to incorporate a small SSD that the drive firmware uses to transparently cache ‘hot data’. These are the new “hybrid” HDD/SSD drives.
For the sake of completeness: also note that since the 3.10 Linux kernel, there is bcache available to use an SSD as a block-device cache in front of slower HDDs. There is also dm-cache, available since the 3.9 kernel, to use an SSD as a cache device.
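(A minimal bcache sketch, with hypothetical device names, needing bcache-tools and a >= 3.10 kernel; registering the cache and backing devices together attaches them and exposes a /dev/bcache0 to format as usual:)
# make-bcache -C /dev/sdc -B /dev/sdb
# mkfs.ext4 /dev/bcache0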
Hope of interest,
Martin
What happens to your data sent to an SSD during an unexpected power off/fail?… Well… Some SSDs are advertised as using “super-capacitors” to keep the SSD powered for long enough to get any data held in the SSD cache written safely to the SSD flash chips. However… Real World tests suggest otherwise:
FAST ’13 Conference: Understanding the Robustness of SSDs under Power Fault
lkcl.net: Analysis of SSD Reliability during power-outages
From that, I read it to suggest that you cannot trust any manufacturer’s claims. If you care about your data under power-fail conditions, then you need to test operation directly for yourself. Particularly worrying is how some of the tested SSDs corrupted their stored data rather than just benignly failing to save cached data.
Hence also, for any power-down sequence, you must ensure that any SSDs or SSD-HDD hybrid drives remain powered for long enough to fully make safe any cached write data that they may hold immediately before power-off… Do we need a few seconds’ grace between the final sync of the filesystems and the terminal ACPI power-off that kills the system?…
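(One crude way to add such a grace period on a systemd system, purely on the assumption that a few seconds is enough, is an executable drop-in script in the standard late-shutdown hook directory:)
# cat /usr/lib/systemd/system-shutdown/10-ssd-settle.sh
#!/bin/sh
# Runs very late in shutdown, just before the final halt/poweroff/reboot.
sync
sleep 5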
A question from the Idiot Gallery. *IF* you are in a data centre and *IF* you are using SSDs for mission-critical data, wouldn’t you already have redundant drives, UPSs, back-up generators and all the other good stuff a data centre should have?
I can see the super-capacitor thing being vital for home/office use where such kit is rare; but in the data centre?
Such internal power outages happen far too frequently… Even if only from ‘maintenance’ work being done and the wrong plug gets pulled or a circuit breaker gets inadvertently tripped, or even the occasional cable fire…
For just three examples on The Register:
Fasthosts goes titsup after storm-induced power outage
Level 3’s UPS burnout sends websites down in flames
Bad generator and bugs take out Amazon cloud
Here’s a recent article from Phoronix comparing which of the Linux kernel IO schedulers is best suited for SSDs:
Linux 3.16: Deadline I/O Scheduler Generally Leads With A SSD
That matches what I found some long time ago when testing for myself. An alternate recommendation is to try the “noop” IO scheduler to avoid all the futile processing for the HDD mechanical latency aspects that SSD devices simply do not suffer.
Note also the comment and example for using udev rules to detect SSDs and to also set “queue/iosched/fifo_batch=1”.
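(Along the lines of that comment, a udev rule of roughly this form, with the rule file name arbitrary, keys off the “rotational” flag so that only non-rotating devices get the deadline scheduler and the smaller fifo_batch:)
# /etc/udev/rules.d/60-ssd-scheduler.rules
ACTION=="add|change", KERNEL=="sd[a-z]", ATTR{queue/rotational}=="0", ATTR{queue/scheduler}="deadline", ATTR{queue/iosched/fifo_batch}="1"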
And the default CFQ IO scheduler has been ‘tweaked’ for when SSDs are detected to reduce the IO latency that is assumed. However, “deadline” still looks to win out.
Another tip from elsewhere for increased performance is to NOT use the SSD TRIM function via the filesystem, such as for example via the “discard” mount option for ext4 or btrfs. Better/faster is to run fstrim periodically to clean up unused space in larger gulps when your system is expected to be quiet, rather than delaying normal filesystem operation when busy with the many smaller TRIMs.
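(For example, instead of the discard mount option, a simple weekly cron job; the filesystems listed are just placeholders for whatever sits on the SSD:)
# cat /etc/cron.weekly/fstrim
#!/bin/sh
# Trim unused blocks in one batch while the system should be quiet.
fstrim -v /
fstrim -v /home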
Meanwhile, this post is now rather old and out of date. Time for an update sometime!
There is still fun reading there for the sleuthing around the proprietary silly obfuscation! 😐
The usual concern for FLASH storage (SSDs and ‘memory sticks’) is that of “write endurance”. That is, how many times can storage locations be written to before those cells “wear out” to become inoperative. For good recent SSDs, that should no longer be of any worry due to the use of “wear leveling” in the SSD firmware spreading all the data writes across the entire device. However, localized cell wear is still a problem for memory sticks where the wear leveling features can be expected still to be crude or nonexistent.
With SSD write wear now largely no longer a concern (except for extreme or pathological cases), the next question is: for how long are your bits reliably stored before suffering bit rot?
In typical The Register reporting:
FLASH better than DISK for archiving, say academics. Are they stark, raving mad?
All a question of design emphasis, Marketing, and capacity?… All interesting to watch.
ps: One of the comments to that article echoes my experience that memory sticks typically can hold data for about 2 years at UK room temperature before suffering bit loss.
And note that SSDs should be powered up every few months so that their firmware can run a scrub through their data to error-correct any degrading bits and so keep all the bits ‘fresh’. (Does such a bit error check/correct scrub need to be initiated/prompted? Should the user keep a check on the device, and effectively force a full scrub, by running a SHA checksum read check over the full device?…) I believe memory sticks typically do not implement such a data scrub feature.
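(A crude way to force a full read pass, and to get a checksum that can be compared against a later run, assuming the device is /dev/sdd and is not written to in between; whether a plain read pass actually prompts the firmware to refresh weak cells is exactly the open question above:)
# sha256sum /dev/sdd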
There’s good comment on SSD performance for btrfs and the IO scheduler used. From the thread btrfs performance – ssd array:
For myself, I’ve found the “deadline” IO scheduler to give the best performance compromise over the CFQ and NoOp IO schedulers. Using CFQ with the above tweaks could give an improvement at the expense of (negligibly) more CPU effort.
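(For reference, the scheduler in use can be checked and switched per device at runtime, /dev/sdd again being my example device:)
# cat /sys/block/sdd/queue/scheduler
# echo deadline > /sys/block/sdd/queue/scheduler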
All by the power of FLOSS!
Martin