Howto HDD and SSD Alignment

“HDD and SSD Alignment”: What’s that?!

HDD: Hard Disk Drive, utilising spinning platters to store data that is physically accessed by read/write heads mounted on a moving arm;
SSD: Solid State Drive, utilising solid-state integrated circuits (typically flash devices) where the only movement is with electrons.

In the days of old, when HDDs were the size of washing machines and had a storage capacity just a mere fraction of even the smallest of today’s memory sticks, certain ways of organising the data on those devices were fixed into the binary of computing for many years to come… Until now, in the year 2011…

We now have Advanced Format (4kByte sectors) introduced for all HDDs. There are now also the new-fangled no-moving-parts things called SSDs…

The SSD I’m playing with is a SATA2 device which has a claimed:

  • 8kB page size;
  • 16kB logical to physical mapping chunk size;
  • 2MB erase block size;
  • 64MB cache.

The sector size reported to Linux 3.0 is the default 512 bytes!
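
For reference, what the kernel has been told by the drive can be read straight from sysfs (a quick check; “sdd” is just how this particular SSD appears here):

# cat /sys/block/sdd/queue/logical_block_size
# cat /sys/block/sdd/queue/physical_block_size
# cat /sys/block/sdd/queue/optimal_io_size

The first gives the 512 bytes reported above; for a SATA SSD that does not advertise its internal geometry, the physical_block_size typically also reads 512 and optimal_io_size reads 0 (no hint).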

Hexdump for the pristine new device (60GB) shows:

# hexdump -C /dev/sdd
00000000  ff ff ff ff ff ff ff ff  ff ff ff ff ff ff ff ff  |................|
*
df99e6000

For example, at present a multi-level cell (MLC) SSD typically has a quoted lifespan of 5000 writes per cell, and wear levelling is used to distribute the data writes evenly across all the cells. So, are write chunk size and write alignment really of concern?

To give, say, a 20% margin, drop that cell lifespan to 4000 writes to allow for “imperfect” wear-levelling. For an average-sized 120GB SSD, that gives 480TB worth of data writing before cell wear-out failure. That’s 263GB/day for 5 years, or 182MB/min of continuous writing. However, that ignores write amplification due to misalignment or due to using a small write chunk size… It also ignores the write overhead (in effect an ‘indirect’ write amplification) of the internal shuffling of static data around for wear levelling.

For x86 architecture (for example, PCs) Linux systems, a 4kByte page size is used for the system memory. That is often used as the write chunk size for filesystems. It is also the size of virtual memory blocks written out to a swap partition. For the device example above, that could give a worst-case write amplification of x4 for a lightly utilised device, or x512 for a fully utilised device. That brings the data write numbers down into the range of what can typically be seen even for a home PC system.
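
As a back-of-the-envelope check of those figures (a sketch only, using the 120GB capacity and the derated 4000-writes-per-cell figure assumed above):

# Endurance estimate for a 120GB MLC SSD derated to 4000 writes per cell:
echo "120 * 4000" | bc              # 480000 GB, roughly 480TB of total writes
echo "480000 / (5 * 365)" | bc      # about 263 GB/day over a 5 year lifespan
echo "263 * 1000 / (24 * 60)" | bc  # about 182 MB/min of continuous writing
# Worst-case write amplification for 4kByte writes on the SSD described above:
echo "16 / 4" | bc                  # x4 against the 16kB mapping chunk (lightly utilised)
echo "2048 / 4" | bc                # x512 against the 2MB erase block (fully utilised)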

The story can be far worse for small randomly spread writes, or approaching optimally good if consecutive writes are aggregated by the filesystem or by the SSD firmware and cache.

So… For light use, or non-24/7/365 operation, then there is no real worry for a reasonable 5 year lifespan. For heavy or continuous use, then perhaps a little care for sympathetic utilisation of an SSD might go a long way to help performance and endurance.

An extreme case is for cheap USB memory sticks that have rudimentary ‘wear levelling’ only for the FAT area expected for a Windows-FAT formatted storage device. They can have 4MByte erase blocks (or larger, worse?) and are intended for storing large files (images, video, or sound) that are written sequentially in large chunks, and rewritten infrequently, so as not to suffer write amplification problems.

Measured examples for two SSDs are given in MLC NAND Flash Hits The Enterprise, with usage cases that suggest a lifespan of anywhere from over 99 years down to less than a year, depending on the test case. Another important aspect is that wear-levelling garbage collection can reduce performance from the advertised 100s of MB/s down to a paltry 10s of MB/s.

The rapidly developing Btrfs filesystem looks highly favourable for use on SSDs taking advantage of:

  • Subvolumes to avoid the need for a partition table. Just format the entire SSD as one Btrfs root, and then use subvolumes to implement whatever partitioning scheme you might want (see the sketch just after this list);
  • Extents based storage to efficiently store large files in sequential blocks;
  • Compression to transparently minimise the amount of data written in the first place. The data compression can also in effect speed up the IO bandwidth seen;
  • Space-efficient packing of small files;
  • Support for bulk TRIM on SSDs;
  • And the various tuning intended to optimise Btrfs for use on SSDs.
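
For example, the subvolume approach might look something like this (a sketch only, with hypothetical device, mount point and subvolume names):

# Format the whole device as a single Btrfs filesystem; no partition table needed:
mkfs.btrfs -L SSD_btrfs /dev/sdX
mkdir -p /mnt/ssd
mount /dev/sdX /mnt/ssd
# Carve out 'partitions' as subvolumes instead:
btrfs subvolume create /mnt/ssd/home
btrfs subvolume create /mnt/ssd/var
# Each subvolume can later be mounted in its own right:
mount -o subvol=home /dev/sdX /home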

There are other highly favourable features to make that a ‘must use’ filesystem. Also see Btrfs mount options and btrfs Sysadmin Guide.

However… A big “Gotcha” with Btrfs on my SSD, as far as avoiding write amplification goes, is:

Using mkfs.btrfs -l and -n for sizes other than 4096 is not supported

(Still the case as of November 2011. See Btrfs changelog. Since fixed, see below.)

As detailed in this series of maillist postings, I’ve also found similar problems with kernel 3.1.0, in that mkfs.btrfs accepts formatting with, for example, 16kByte leaves/nodes/sectors, but you soon get errors when trying to use the resulting filesystem.

Unfortunately, there is no clear information for Btrfs as to how far the developers have got with SSD-sympathetic features and block alignment (note Btrfs design, Data Structures, User:Wtachi/On-disk Format… All a work in progress…). Also, there is no fsck (file system check and repair).

[Edit]
Since fixed. See: mkfs.btrfs(8) Manual Page

… Specify the nodesize, the tree block size in which btrfs stores data. The default value is 16KB (16384) or the page size, whichever is bigger…

[/Edit]

In contrast, there are at least a number of tweaks available to make ext4 sympathetic to SSDs… Hence, instead, partition with due care for 16kByte (mapping chunk) and 2MByte (erase block) alignment, and tweak the ext4 filesystem to be more SSD-sympathetic…

 

Using the old fdisk and cfdisk utilities gave rather confusing numbers for the disk sectors. They both align to 2048 sectors (1MiB boundaries), and seem to do so regardless of what CHS values you try to fool them with… They are rather old and out of date now, so I’m not sure there’s any value in chasing features or bugs for those.

One big plus for Btrfs is the facility of mounting “subvolumes” instead of needing separate physical partitions on a disk. That nicely avoids the old MBR offset problem. Unfortunately, ext4 has no similar feature.

However, the limitations and quirkiness of the old MBR and its associated utilities led me to use the new GPT (GUID Partition Table). That allows at least 128 partitions!

Hence, the partition scheme I’ve used is (using gdisk):

Disk /dev/sdd: 117231408 sectors, 55.9 GiB
Logical sector size: 512 bytes
Disk identifier (GUID): xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
Partition table holds up to 128 entries
First usable sector is 34, last usable sector is 117231374
Partitions will be aligned on 32-sector boundaries
Total free space is 30 sectors (15.0 KiB)

Number  Start (sector)    End (sector)  Size       Code  Name
   1              64        16777215   8.0 GiB     8200  SSD_swap
   2        16777216        33554431   8.0 GiB     0700  SSD_home
   3        33554432        67108863   16.0 GiB    0700  SSD_root
   4        67108864        83886079   8.0 GiB     0700  SSD_var
   5        83886080       100663295   8.0 GiB     0700  SSD_misc01
   6       100663296       117231374   7.9 GiB     0700  SSD_misc02

Those numbers show that for what is presumably physically a 64GiB device, 8GiB+ is ‘hidden’ for data shuffling/wear-levelling and remapping by the SSD firmware.
Note that the 32 sectors of GPT partition entries are offset by the protective MBR (sector 0) and the GPT header (sector 1), giving the first usable sector of 34. I’m not going to be booting from this device, so I have no need of the spare 30 sectors for GRUB2 or for anything else. Then again, the protective MBR is a historical artefact, and for the sake of a meagre 15.0KiB left unutilised it is worth keeping for compatibility…
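
For reference, an equivalent layout could be scripted non-interactively with sgdisk (a sketch only; the partitions above were actually created interactively in gdisk):

# -a sets the sector alignment, -n creates partition number:start:end,
# -t sets the type code and -c the partition name.
sgdisk -a 32 \
  -n 1:64:16777215         -t 1:8200 -c 1:SSD_swap \
  -n 2:16777216:33554431   -t 2:0700 -c 2:SSD_home \
  -n 3:33554432:67108863   -t 3:0700 -c 3:SSD_root \
  -n 4:67108864:83886079   -t 4:0700 -c 4:SSD_var \
  -n 5:83886080:100663295  -t 5:0700 -c 5:SSD_misc01 \
  -n 6:100663296:117231374 -t 6:0700 -c 6:SSD_misc02 \
  /dev/sdd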

When using GRUB2, it is recommended to use a BIOS boot partition on GPT disks to store the GRUB2 core image that would otherwise be put into the commonly unpartitioned area (sectors 1 to 62) of old MBR-partitioned disks. To create one of those, you need to use a tool such as parted >= 1.9.
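
Not needed here since this SSD is not a boot device, but for reference such a BIOS boot partition can also be created by giving a small partition the EF02 type code (a sketch only, on a hypothetical fresh disk /dev/sdX):

# Reserve ~1MiB at the start of a fresh GPT disk for the GRUB2 core image:
sgdisk -n 1:2048:4095 -t 1:EF02 -c 1:BIOS_boot /dev/sdX
# (With parted, the equivalent is to create the partition and then 'set 1 bios_grub on'.)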

When written, that shows on the SSD as:

# hexdump -C /dev/sdd    
00000000  ff ff ff ff ff ff ff ff  ff ff ff ff ff ff ff ff  |................|
*
000001b0  ff ff ff ff ff ff ff ff  00 00 00 00 00 00 00 00  |................|
000001c0  02 00 ee ff ff ff 01 00  00 00 2f cf fc 06 00 00  |........../.....|
000001d0  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
000001f0  00 00 00 00 00 00 00 00  00 00 00 00 00 00 55 aa  |..............U.|
00000200  45 46 49 20 50 41 52 54  00 00 01 00 5c 00 00 00  |EFI PART....\...|
00000210  70 e5 6e f4 00 00 00 00  01 00 00 00 00 00 00 00  |p.n.............|
00000220  2f cf fc 06 00 00 00 00  22 00 00 00 00 00 00 00  |/.......".......|
00000230  0e cf fc 06 00 00 00 00  ae 93 8a 3a 2f 63 92 4f  |...........:/c.O|
00000240  a7 a9 f5 d8 15 c9 f9 b8  02 00 00 00 00 00 00 00  |................|
00000250  80 00 00 00 80 00 00 00  37 b7 57 ec 00 00 00 00  |........7.W.....|
00000260  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00000400  6d fd 57 06 ab a4 c4 43  84 e5 09 33 c8 4b 4f 4f  |m.W....C...3.KOO|
00000410  dc 47 1b 18 bc 4a 21 4d  b1 18 91 7e fd e4 a7 2e  |.G...J!M...~....|
00000420  40 00 00 00 00 00 00 00  ff ff ff 00 00 00 00 00  |@...............|
00000430  00 00 00 00 00 00 00 00  53 00 53 00 44 00 5f 00  |........S.S.D._.|
00000440  73 00 77 00 61 00 70 00  00 00 00 00 00 00 00 00  |s.w.a.p.........|
00000450  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00000480  a2 a0 d0 eb e5 b9 33 44  87 c0 68 b6 b7 26 99 c7  |......3D..h..&..|
00000490  a6 6a 73 85 6c 5c f6 49  8e 71 c4 ee b5 ce 8f 60  |.js.l\.I.q.....`|
000004a0  00 00 00 01 00 00 00 00  ff ff ff 01 00 00 00 00  |................|
000004b0  00 00 00 00 00 00 00 00  53 00 53 00 44 00 5f 00  |........S.S.D._.|
000004c0  68 00 6f 00 6d 00 65 00  00 00 00 00 00 00 00 00  |h.o.m.e.........|
000004d0  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00000500  a2 a0 d0 eb e5 b9 33 44  87 c0 68 b6 b7 26 99 c7  |......3D..h..&..|
00000510  a6 29 3f fd 5c 6b 45 4c  ac 1d 4c 2b 02 8c a5 8a  |.)?.\kEL..L+....|
00000520  00 00 00 02 00 00 00 00  ff ff ff 03 00 00 00 00  |................|
00000530  00 00 00 00 00 00 00 00  53 00 53 00 44 00 5f 00  |........S.S.D._.|
00000540  72 00 6f 00 6f 00 74 00  00 00 00 00 00 00 00 00  |r.o.o.t.........|
00000550  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00000580  a2 a0 d0 eb e5 b9 33 44  87 c0 68 b6 b7 26 99 c7  |......3D..h..&..|
00000590  96 ed 2e 63 94 5a d5 42  a5 9d 02 13 5f d3 b8 83  |...c.Z.B...._...|
000005a0  00 00 00 04 00 00 00 00  ff ff ff 04 00 00 00 00  |................|
000005b0  00 00 00 00 00 00 00 00  53 00 53 00 44 00 5f 00  |........S.S.D._.|
000005c0  76 00 61 00 72 00 00 00  00 00 00 00 00 00 00 00  |v.a.r...........|
000005d0  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00000600  a2 a0 d0 eb e5 b9 33 44  87 c0 68 b6 b7 26 99 c7  |......3D..h..&..|
00000610  6e d9 32 69 18 1c a8 4e  9f 7b 9e de 6c 42 43 da  |n.2i...N.{..lBC.|
00000620  00 00 00 05 00 00 00 00  ff ff ff 05 00 00 00 00  |................|
00000630  00 00 00 00 00 00 00 00  53 00 53 00 44 00 5f 00  |........S.S.D._.|
00000640  6d 00 69 00 73 00 63 00  30 00 31 00 00 00 00 00  |m.i.s.c.0.1.....|
00000650  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00000680  a2 a0 d0 eb e5 b9 33 44  87 c0 68 b6 b7 26 99 c7  |......3D..h..&..|
00000690  5a e1 26 1c de 71 29 4f  a0 24 22 ac 26 65 4c 98  |Z.&..q)O.$".&eL.|
000006a0  00 00 00 06 00 00 00 00  0e cf fc 06 00 00 00 00  |................|
000006b0  00 00 00 00 00 00 00 00  53 00 53 00 44 00 5f 00  |........S.S.D._.|
000006c0  6d 00 69 00 73 00 63 00  30 00 32 00 00 00 00 00  |m.i.s.c.0.2.....|
000006d0  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00004400  ff ff ff ff ff ff ff ff  ff ff ff ff ff ff ff ff  |................|
*
df99e1e00  6d fd 57 06 ab a4 c4 43  84 e5 09 33 c8 4b 4f 4f  |m.W....C...3.KOO|
df99e1e10  dc 47 1b 18 bc 4a 21 4d  b1 18 91 7e fd e4 a7 2e  |.G...J!M...~....|
df99e1e20  40 00 00 00 00 00 00 00  ff ff ff 00 00 00 00 00  |@...............|
df99e1e30  00 00 00 00 00 00 00 00  53 00 53 00 44 00 5f 00  |........S.S.D._.|
df99e1e40  73 00 77 00 61 00 70 00  00 00 00 00 00 00 00 00  |s.w.a.p.........|
df99e1e50  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
df99e1e80  a2 a0 d0 eb e5 b9 33 44  87 c0 68 b6 b7 26 99 c7  |......3D..h..&..|
df99e1e90  a6 6a 73 85 6c 5c f6 49  8e 71 c4 ee b5 ce 8f 60  |.js.l\.I.q.....`|
df99e1ea0  00 00 00 01 00 00 00 00  ff ff ff 01 00 00 00 00  |................|
df99e1eb0  00 00 00 00 00 00 00 00  53 00 53 00 44 00 5f 00  |........S.S.D._.|
df99e1ec0  68 00 6f 00 6d 00 65 00  00 00 00 00 00 00 00 00  |h.o.m.e.........|
df99e1ed0  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
df99e1f00  a2 a0 d0 eb e5 b9 33 44  87 c0 68 b6 b7 26 99 c7  |......3D..h..&..|
df99e1f10  a6 29 3f fd 5c 6b 45 4c  ac 1d 4c 2b 02 8c a5 8a  |.)?.\kEL..L+....|
df99e1f20  00 00 00 02 00 00 00 00  ff ff ff 03 00 00 00 00  |................|
df99e1f30  00 00 00 00 00 00 00 00  53 00 53 00 44 00 5f 00  |........S.S.D._.|
df99e1f40  72 00 6f 00 6f 00 74 00  00 00 00 00 00 00 00 00  |r.o.o.t.........|
df99e1f50  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
df99e1f80  a2 a0 d0 eb e5 b9 33 44  87 c0 68 b6 b7 26 99 c7  |......3D..h..&..|
df99e1f90  96 ed 2e 63 94 5a d5 42  a5 9d 02 13 5f d3 b8 83  |...c.Z.B...._...|
df99e1fa0  00 00 00 04 00 00 00 00  ff ff ff 04 00 00 00 00  |................|
df99e1fb0  00 00 00 00 00 00 00 00  53 00 53 00 44 00 5f 00  |........S.S.D._.|
df99e1fc0  76 00 61 00 72 00 00 00  00 00 00 00 00 00 00 00  |v.a.r...........|
df99e1fd0  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
df99e2000  a2 a0 d0 eb e5 b9 33 44  87 c0 68 b6 b7 26 99 c7  |......3D..h..&..|
df99e2010  6e d9 32 69 18 1c a8 4e  9f 7b 9e de 6c 42 43 da  |n.2i...N.{..lBC.|
df99e2020  00 00 00 05 00 00 00 00  ff ff ff 05 00 00 00 00  |................|
df99e2030  00 00 00 00 00 00 00 00  53 00 53 00 44 00 5f 00  |........S.S.D._.|
df99e2040  6d 00 69 00 73 00 63 00  30 00 31 00 00 00 00 00  |m.i.s.c.0.1.....|
df99e2050  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
df99e2080  a2 a0 d0 eb e5 b9 33 44  87 c0 68 b6 b7 26 99 c7  |......3D..h..&..|
df99e2090  5a e1 26 1c de 71 29 4f  a0 24 22 ac 26 65 4c 98  |Z.&..q)O.$".&eL.|
df99e20a0  00 00 00 06 00 00 00 00  0e cf fc 06 00 00 00 00  |................|
df99e20b0  00 00 00 00 00 00 00 00  53 00 53 00 44 00 5f 00  |........S.S.D._.|
df99e20c0  6d 00 69 00 73 00 63 00  30 00 32 00 00 00 00 00  |m.i.s.c.0.2.....|
df99e20d0  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
df99e5e00  45 46 49 20 50 41 52 54  00 00 01 00 5c 00 00 00  |EFI PART....\...|
df99e5e10  74 f4 64 bb 00 00 00 00  2f cf fc 06 00 00 00 00  |t.d...../.......|
df99e5e20  01 00 00 00 00 00 00 00  22 00 00 00 00 00 00 00  |........".......|
df99e5e30  0e cf fc 06 00 00 00 00  ae 93 8a 3a 2f 63 92 4f  |...........:/c.O|
df99e5e40  a7 a9 f5 d8 15 c9 f9 b8  0f cf fc 06 00 00 00 00  |................|
df99e5e50  80 00 00 00 80 00 00 00  37 b7 57 ec 00 00 00 00  |........7.W.....|
df99e5e60  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
df99e6000

 

From Wikipedia: MBR:

By convention, there are exactly four primary partition table entries in the MBR partition table scheme…

An artefact of hard disk technology from the era of the IBM PC, the partition table subdivides a storage medium using units of cylinders, heads, and sectors (CHS addressing). These values no longer correspond to their namesakes in modern disk drives, and other devices such as solid-state drives do not physically have cylinders or heads.

Sector indices have always begun with a 1, not a zero, and due to an early error in MS-DOS, the heads are generally limited to 255 instead of 256. Both the partition length and partition start address are sector values stored as 32-bit quantities. The sector size is fixed at 512 (2^9) bytes, which implies that either the maximum size of a partition or the maximum start address (both in bytes) cannot exceed 2 TB−512 bytes…

Details for the ext4 layout and tweaks:

And the (experimental) magic incantations are:

# Format with:
# Add -K to avoid blanking a new pristine SSD! NB: "-f fragment-size" is obsolete.
# stride=4 and stripe-width=4 with 4096 byte blocks give 16kByte chunks, matching
# the claimed logical-to-physical mapping chunk size of this SSD.
# mke2fs -v -T ext4 -L fs_label_name -b 4096 -E stride=4,stripe-width=4,lazy_itable_init=0 -O none,dir_index,extent,filetype,flex_bg,has_journal,sparse_super,uninit_bg /dev/sdX
#
# (For the sake of my paranoia!)
# To force a check on every mount:
# tune2fs -c 1 /dev/sdX
#
# And mount with (in /etc/fstab), noting stripe=4 to match the stride set at format time:
#/dev/sdX /mnt/ssd_mount_point ext4 journal_checksum,barrier,stripe=4,delalloc,commit=300,max_batch_time=15000,min_batch_time=200,discard,noatime,nouser_xattr,noacl,errors=remount-ro,noexec,nosuid,nodev,noauto 1 2

The “nobarrier” mount option appears to be needed for reliable operation with USB memory sticks.

noatime implies nodiratime: Does noatime imply nodiratime?

And this is a work in progress! A few links to be assembled…

11 comments to Howto HDD and SSD Alignment

  • Martin L

    Note that this post is slowly being revamped and split into smaller byte-sized pieces.

    Meanwhile, SSD technology advances apace and here are two examples of what can be done…

    Firstly, note the different over-provisioning and program-erase cycles counts for the SSDs compared on:

    Write Endurance: Comparing MLC, eMLC, And SLC

    To bust the jargon: “MLC” is “Multi-Level Cell” typically storing 2 bits of data (4 electrical charge/voltage levels) per storage cell, “eMLC” is MLC but supposedly characterized for greater endurance for surviving more program-erase cycles, and “SLC” is “Single-Level Cell” where just one bit of data (2 electrical charge/voltage levels) is stored per storage cell.

    And secondly there is:

    SMART’s new SSD wrings extra juice from MLC flash

    89,000 write cycles … are you sure that’s not SLC?

    SMART has introduced a solid state drive that can do 50 full drive writes a day for five years using consumer-grade MLC flash; that’s 89,000 P/E cycles and a 50X jump up from the raw NAND rate. … This compares to the original Optimus SSD which can do one full drive write a day. …

    … The Optimus Ultra Plus drive is sampling now.

    This comment from the forum is much more informative than the Marketing-article:

    Re: “can do 50 full drive writes a day for five years”

    Right. I’ve dug around their website a bit. Most of the stuff is very light on details, the best I can get is from a whitepaper (https://www.smartstoragesys.com/pdfs/WP003_Guardian_Technology.pdf), which, when you get past all the snazzy graphs going upwards, has a few important things in:

    – It includes a “Redundant Array of Memory Elements”, so yes, there is a lot of redundancy.

    – They “treat each cell individually thereby maximizing the effects of stronger flash elements (i.e. those that exhibit higher performance capability) while minimizing the effects of weaker elements”. How they know what is ‘strong’ and ‘weak’ though I have no idea.

    – A job lot of statistical error correction on reads.

    – Lots of cache to reduce writes, with some chunky capacitors for when the power fails.

    Most importantly though is that they will (to a certain extent) put their money where their mouth is: they give a 5 year guarantee for up to 25 full drive capacity‐writes/day.

    So interesting, but I would like more technical information on how they go about this.

    So… Better firmware and aggressive caching gives a x25 to x50 lifespan boost for the existing technology?… Suggesting a little care with filesystem config and system usage can also go a long way…

  • Martin L

    Note that this post is slowly being revamped and split into smaller byte-sized pieces.

    Meanwhile, SSD technology advances [continue] apace…

    Some informative but fun The Register irreverence:

    Flashboys: HEELLLP, we’re trapped in a process size shrink crunch

    The NAND flash industry is facing a process size shrink crunch and no replacement technology is ready. Unless 3D die stacking works, we are facing a solid state storage capacity shortage. …

    … One implication of this is the effect on NAND fab owners foundry investment plans. Why spend … when, in five years time, NAND is at the end of its life…

    … Jon Bennett disagrees … “What’s coming after process shrinkage runs out is high-rise (3D). I have seen wafers of 3D flash. It’s not slide-ware. Once you go to 3D you have lots of room to work with.”

    … “Then, in five years time we’ll have PCM, IBM’s Racetrack or some other technology. … So the fab run-out story is wrong.”

    The 3D die stack tack: Toshiba builds towering column of flash

    The idea of high-rise or 3D chips is that we can sidestep limitations on increasing the storage density of flash or memory chips by stacking them one on top of the other, increasing the storage density on a Mbits/in2 basis by building upwards, in the same way as high-rise housing increases the number of people living in the ground footprint of a block of flats. …

    TLC NAND could penetrate biz with flash-to-flash backup

    … But TLC NAND has a short working life, with 500 – 750 program/erase cycles…

    … 5 to 10 per cent of system data [churn] per day. He calculates that this could equate to thirty full device writes a year and, with a 1,000 PE cycle TLC product, that would give you three to four years of working life; good enough. …

  • Martin L

    A range of real numbers are given here for write endurance of some new flash devices:

    … SSD Review: For The Enterprise – Results: Write Endurance

    … As we’ve seen time and time again, you get what you pay for when it comes to write endurance. … it is still a read-focused product. When you subject the 600 Pro to excessive writes, you basically throw away money. In fact, it took less than one day to consume 1% of its rated lifespan. If you used this SSD for 100% sequential writes, you could theoretically kill it in a little more than a month.

    On the flip side, it’s encouraging that we were able to get almost 5,000 P/E cycles out of 19 nm MLC NAND.

    […]

    Even though this is the first SSD we’ve reviewed seemingly built for read-focused use, we expect the 600 Pro’s performance to mirror similar drives currently on other vendors’ roadmaps. You should start seeing enterprise write endurance fall into three categories: the first involves read-focused MLC with ~5,000 P/E cycles; the second is best for mixed use, featuring eMLC with ~25,000-35,000 P/E cycles; the last is write-oriented SLC capable of 100,000+ P/E cycles.

    The report there suggests a write endurance for the three SSDs under test of 12.49 TB, 68.39 TB and 72.69 TB for 1% “indicated wear”, which scales to roughly 1.2 PB to 7.3 PB for full wear. So… How long would it take you to write over a Peta-Byte of data?…

    Hence, with good wear-leveling for recent devices, this has moved on to being more of a question of how you use your SSD as to whether wear is a concern or not. Also, at least with recent GPT-aware partition tools, partition alignment should no longer be a problem.

  • Martin L

    Thanks go to Jason for finding this good article and comment on SSD wear-out:

    Concerns about SSD reliability debunked (again)

    The myths about how you should use an SSD, and what you should not do with it keep on spinning. Even if there are frequent articles which crunch the actual numbers, the superstition persists. Back in 2008, Robert Penz concluded that your 64 GB SSD could be used for swap, a journalling file system, and consumer level logging, and still last between 20 and 50 years under extreme use.

    Fast forward to 2013, with 120 and 240 GB drives becoming affordable … but people are still worried…

    The article gives a bit of background and a few plots for what lifetime can be expected for a recent SSD with good wear-leveling.

    My own personal view from following all this over the years of development is that now with recent devices and recent software/distros, this is all simply not a problem and no longer a concern.

    Indeed for ‘normal’ desktop use, any recent SSD should last longer than the user!

    However… If you are running servers with pathological usage with virtual machine images and databases, then you should already know enough to not be too flash-killing silly!

    For my own examples of servers using SSDs, including having swap space on there, the daily GBytes writes suggest that the systems will be obsolete and scrapped long before the SSDs should wear out.

    … Which is why I’ve never bothered finishing a follow-on article to this old rambling post of old days SSD confusion. (Then again, hopefully this old article makes for a fun investigative read!)

    One technical detail to note for SSDs still: An important performance marker is how fast they operate for 4kiB read/write. However, their underlying structure usually favours 8kiB, 16kiB or 32kiB block sizes for writes. The only headache there is how best then to fit in what (somehow optimum) block size to use for a (block-device-level) raid setup to avoid crass unwanted write amplification… Then again, all that becomes moot if you instead use the integral filesystem-level raid functionality of btrfs or glusterfs.

    Very low power, small physical size, low cost, very high speed, high enough capacity, and more than enough longevity, all now pretty much makes SSDs a “no-brainer” to use.

    Hope of interest,
    Martin

    • Martin L

      Noting:

      One technical detail to note for SSDs still: An important performance marker is how fast they operate for 4kiB read/write. However, their underlying structure usually favours 8kiB, 16kiB or 32kiB block sizes for writes. The only headache there is how best then to fit in what (somehow optimum) block size to use for a (block-device-level) raid setup to avoid crass unwanted write amplification… Then again, all that becomes moot if you instead use the integral filesystem-level raid functionality of btrfs or glusterfs.

      Interestingly (and especially sympathetic for SSDs), the defaults for the btrfs format have recently been changed to use 16kiByte chunks:

      … The patch below switches our default mkfs leafsize up to 16K. This
      should be a better choice in almost every workload, but now is your
      chance to complain if it causes trouble.

      ——————————–
      16KB is faster and leads to less metadata fragmentation in almost all
      workloads. It does slightly increase lock contention on the root nodes
      in some workloads, but that is best dealt with by adding more subvolumes
      (for now).

      This uses 16KB or the page size, whichever is bigger. If you’re doing a
      mixed block group mkfs, it uses the sectorsize instead. …
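
      So with recent btrfs-progs, explicitly asking for the 16KB node size looks something like this (a sketch only, with a hypothetical device name; newer mkfs.btrfs versions now default to 16KB anyway):

      # Explicitly set 16KB nodes/leaves (the newer default):
      mkfs.btrfs -L SSD_btrfs -n 16384 /dev/sdX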

       

      For a second note:

      Very low power, small physical size, low cost, very high speed, high enough capacity, and more than enough longevity, all now pretty much makes SSDs a “no-brainer” to use.

      That needs qualifying: it is currently the case only for up to about 128GBytes. Larger SSDs quickly get more expensive and more power-hungry than their spinning rust counterparts. The distinction is also blurred by the recent trend for the larger spinning rust HDDs to incorporate a small SSD for the drive firmware to transparently cache ‘hot data’. These are the new “hybrid” HDD/SSD drives.

      For the sake of completeness: Also note that since the 3.10 linux kernel, there is bcache available to use an SSD as a block device cache for HDDs/SSDs. There is also dm-cache available since the 3.9 kernel to use an SSD as a cache device.
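
      As a rough sketch of the bcache route (hypothetical device names, and assuming the bcache-tools userspace utilities; not something tried here):

      # Prepare the HDD as the backing device and the SSD as the cache:
      make-bcache -B /dev/sdb
      make-bcache -C /dev/sdc
      # If udev has not already auto-registered them:
      echo /dev/sdb > /sys/fs/bcache/register
      echo /dev/sdc > /sys/fs/bcache/register
      # Attach the cache set (UUID from 'bcache-super-show /dev/sdc') to the backing device:
      echo <cache-set-uuid> > /sys/block/bcache0/bcache/attach
      # Then format and mount the resulting /dev/bcache0 as usual.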

      Hope of interest,
      Martin

  • Martin L

    What happens to your data sent to an SSD during an unexpected power off/fail?… Well… Some SSDs are advertised as using “super-capacitors” to keep the SSD powered for long enough to get any data held in the SSD cache written safely to the SSD flash chips. However… Real-world tests suggest otherwise:

     

    FAST ’13 Conference: Understanding the Robustness of SSDs under Power Fault

    Modern storage technology (SSDs, No-SQL databases, commoditized RAID hardware, etc.) bring new reliability challenges to the already complicated storage stack. Among other things, the behavior of these new components during power faults—which happen relatively frequently in data centers—is an important yet mostly ignored issue…

    In this paper, we propose a new methodology to expose reliability issues in block devices under power faults. … Applying our testing framework, we test fifteen commodity SSDs from five different vendors using more than three thousand fault injection cycles in total. Our experimental results reveal that thirteen out of the fifteen tested SSD devices exhibit surprising failure behaviors under power faults…

     

    lkcl.net: Analysis of SSD Reliability during power-outages

    … Conclusion

    Right now, there is only one reliable SSD manufacturer: Intel. That really is the end of the discussion. It would appear that Intel is the only manufacturer of SSDs that provide sufficiently large on-board temporary power (probably in the form of supercapacitors) to cover writing back the entire cache when power is pulled, even when the on-board cache is completely full…

    From that, I read it to suggest that you cannot trust any manufacturer’s claims. If you care about your data under powerfail conditions, then you need to test operation directly for yourself. Particularly worrying is how some of the tested SSDs corrupted their stored data rather than just benignly failing to save cached data.

    Hence also, for any power-down sequence, you must also ensure that any SSDs or SSD-HDD hybrid drives remain powered long enough to fully make safe any cached write data that they may hold immediately before power-off… Do we need a few seconds grace between the final sync of the filesystems and the terminal ACPI power-off that kills the system?…

  • roadSurfer

    A question from the Idiot Gallery. *IF* you are in a data centre and *IF* you are using SSDs for mission-critical data; wouldn’t you already have redundant drives, UPSs, back-up generators and all the other good stuff a data centre should have?

    I can see the super-capacitor thing being vital for home/office use where such kit is rare; but in the data centre?

    • Martin L

      … wouldn’t you already have redundant drives, UPSs, back-up generators and all the other good stuff a data centre should have?…

      Such internal power outages happen far too frequently… Even if only from ‘maintenance’ work being done and the wrong plug gets pulled or a circuit breaker gets inadvertently tripped, or even the occasional cable fire…

      For just three examples on The Register:

       

      Fasthosts goes titsup after storm-induced power outage

      … “UPS’ are designed to handle a temporary loss of power. As the supply wasn’t switched back to mains, eventually they drained,”…

       

      Level 3’s UPS burnout sends websites down in flames

      … From the report it looks as though the failure was caused by a busbar blowing in the Uninterruptible Power Supply (UPS) system that is supposed to protect computers and network equipment from unexpected electrical failures…

       

      Bad generator and bugs take out Amazon cloud

      … They did not test the backup generators…

  • Martin L

    Here’s a recent article from Phoronix comparing which of the Linux kernel IO schedulers is best suited for SSDs:

    Linux 3.16: Deadline I/O Scheduler Generally Leads With A SSD

    … Using deadline [IO scheduler] for SSDs generally leads to the best performance…

    That matches what I found some long time ago when testing for myself. An alternate recommendation is to try the “noop” IO scheduler to avoid all the futile processing for the HDD mechanical latency aspects that SSD devices simply do not suffer.

    Note also the comment and example for using udev rules to detect SSDs and to also set “queue/iosched/fifo_batch=1”.
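
    Such a rule might look something like this (a sketch only, untested here; the kernel’s rotational flag distinguishes SSDs from spinning disks):

    # /etc/udev/rules.d/60-ssd-scheduler.rules (sketch)
    ACTION=="add|change", KERNEL=="sd[a-z]", ATTR{queue/rotational}=="0", ATTR{queue/scheduler}="deadline", ATTR{queue/iosched/fifo_batch}="1"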

    And the default CFQ IO scheduler has been ‘tweaked’ for when SSDs are detected to reduce the IO latency that is assumed. However, “deadline” still looks to win out.

     

    Another tip from elsewhere for increased performance is to NOT use the SSD TRIM function via the filesystem, such as for example via the “discard” mount option for ext4 or btrfs. Better/faster is to run fstrim periodically to clean up unused space in larger gulps when your system is expected to be quiet, rather than delaying normal filesystem operation with the many smaller TRIMs when busy.
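
    For example (a sketch only; the mount points and schedule are just illustrative):

    # Trim all unused space in one pass when the system is quiet:
    fstrim -v /
    fstrim -v /home
    # ...or run that on a schedule, for example from a small /etc/cron.weekly/ script,
    # or enable the util-linux 'fstrim.timer' systemd unit where the distro provides one.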

     

    Meanwhile, this post is now rather old and out of date. Time for an update sometime!

    There is still fun reading there for the sleuthing around the proprietary silly obfuscation! 😐

  • Martin L

    The usual concern for FLASH storage (SSDs and ‘memory sticks’) is that of “write endurance”. That is, how many times can storage locations be written to before those cells “wear out” to become inoperative. For good recent SSDs, that should no longer be of any worry due to the use of “wear leveling” in the SSD firmware spreading all the data writes across the entire device. However, localized cell wear is still a problem for memory sticks where the wear leveling features can be expected still to be crude or nonexistent.

    With SSD write wear now largely no longer a concern (except for extreme or pathological cases), the next question is: for how long are your bits reliably stored before suffering bit rot?

    In typical The Register reporting:

    FLASH better than DISK for archiving, say academics. Are they stark, raving mad?

    Comment: An academic paper claims flash could be better than disk for archiving. So just how did this unlikely result come about?

    … The researchers write: “The transitions to the next disk recording technology (HAMR) and its probable successor (Bit-Patterned Media) turn out to be vastly more difficult and expensive than expected, delaying further bit density improvements and thus decreasing the Kryder rate.”

    The archival access pattern is described as write once, read rarely, overwrite rarely. Flash is more expensive per-GB to buy than disk but is cheaper to use, needing less power, space and cooling.

    However, as its geometry scales down its endurance worsens. Flash drive controllers, currently mostly optimised for performance, can be optimised for endurance instead and solve that problem. The researchers suggest flash drives could be made with stronger insulation between cells, at little extra cost, to increase the data retention period…

    … Will we see flash archive products? One issue is flash foundry capacity, as in there isn’t enough of it. Once overall flash fab capacity doubles, we might say, and 3D TLC NAND provides much denser and lower-cost flash capacity, then low-latency flash archives for big data mining might become feasible. Let’s circle back to this topic in a couple of years time and see.

    All a question of design emphasis, Marketing, and capacity?… All interesting to watch.

    ps: One of the comments to that article echoes my experience that memory sticks typically can hold data for about 2 years at UK room temperature before suffering bit loss.

    And note that SSDs should be powered up every few months so that their firmware can run a scrub through their data to error-correct any degrading bits so as to keep all the bits ‘fresh’. (Does such a bit error check/correct scrub need to be initiated/prompted? Should the user keep a check on a device, and effectively force a full read scrub, by running a SHA checksum read check over the full device?…) I believe memory sticks typically do not implement such a data scrub feature.
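
    Such a forced full-device read check is easy enough (a sketch; /dev/sdX is whatever the stick or SSD appears as):

    # Force a read of every block, and note the checksum for a later comparison
    # if the contents are not expected to change:
    sha256sum /dev/sdX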

  • Martin L

    There’s a good comment on SSD performance for btrfs and the IO scheduler used. From the thread btrfs performance – ssd array:

    While I don’t have any conclusive numbers, I have noticed myself that random write based AIO on BTRFS does tend to be slower than on other filesystems. Also, LVM/MD based RAID10 does outperform BTRFS’ raid10 implementation, and probably will for quite a while; however, I’ve also noticed that faster RAM does provide a bigger benefit for BTRFS than it does for LVM (~2.5% greater performance for BTRFS than for LVM when switching from DDR3-1333 to DDR3-1600 on otherwise identical hardware), so you might consider looking into that.

    Another thing to consider is that the kernel’s default I/O scheduler and the default parameters for that I/O scheduler are almost always suboptimal for SSD’s, and this tends to show far more with BTRFS than anything else. Personally I’ve found that using the CFQ I/O scheduler with the following parameters works best for a majority of SSD’s:
    1. slice_idle=0
    2. back_seek_penalty=1
    3. back_seek_max set equal to the size in sectors of the device [size of the device in kibibytes]
    4. nr_requests and quantum set to the hardware command queue depth

    You can easily set these persistently for a given device with a udev rule like this:
    KERNEL=="sda", SUBSYSTEM=="block", ACTION=="add", ATTR{queue/scheduler}="cfq", ATTR{queue/iosched/back_seek_penalty}="1", ATTR{queue/iosched/back_seek_max}="", ATTR{queue/iosched/quantum}="128", ATTR{queue/iosched/slice_idle}="0", ATTR{queue/nr_requests}="128"

    Make sure to replace ‘128’ in the rule with whatever the command queue depth is for the device in question (It’s usually 128 or 256, occasionally more), and with the size of the device in kibibytes.

    The stuff about the I/O scheduler is more general advice for dealing with SSD’s than anything BTRFS specific. I’ve found though that on SATA (I don’t have anywhere near the kind of budget needed for SAS disks, and even less so for SAS SSD’s) connected SSD’s at least, using the no-op I/O scheduler gets better small burst performance, but it causes horrible latency spikes whenever trying to do something that requires bulk throughput with random writes (rsync being an excellent example of this).

    Something else I thought of after my initial reply, due to the COW nature of BTRFS, you will generally get better performance of metadata operations with shallower directory structures (largely because mtime updates propagate up the directory tree to the root of the filesystem).

    For myself, I’ve found the “deadline” IO scheduler to give the best performance compromise over the CFQ and NoOp IO schedulers. Using CFQ with the above tweaks could give an improvement at the expense of (negligibly) more CPU effort.

    All by the power of FLOSS!
    Martin
