Solid State Devices using NAND Flash, how they differ from Hard Drives, and how they affect file deletion and recovery
This article springs from my own curiosity about how SSDs actually work at a technical level. The more I read the more I realised that there are many misconceptions and myths concerning how SSDs perform, and even more misconceptions about file deletion and recovery. I believed in many of those misconceptions too, and parroted them from time to time; it's taken years to gain any clarity. Much of what's published is quite superficial and much repeats obsolete beliefs and dogma. There's little published about the inner workings of SSDs, possibly because SSD controller software is proprietary, more likely because it's unfathomable. Most of the detailed information here is garnered from available publications, and some of the conclusions I've made are from my own tests or from just trying to apply what logic I can along with common sense. Of course I can't say whether what I've written is free of confusion or even true, but as ever it's more of a guide than a bible.
As much as possible has been sourced from numerous corporate and private technical articles, with quite a lot from Seagate and WD. I am obliged to those I have borrowed from, and will also be obliged to those who point out any errors without any reward apart from that of contribution. I've tried to explain what is different with SSDs, and why it is so hard to grasp with our ingrained HDD minds.
The first misconception might be the plural of SSD: grammatically it should be, so I'm told, SSDs, but SSD's is almost as commonplace. Here I will stick to one SSD, many SSDs.
Software and hardware:
This article was written in 2019 onwards and deals almost exclusively with NAND flash in the form of pc or laptop storage devices known as SSDs. I shan't complicate things even more by referring to the ubiquitous flash drive or other NAND flash devices. If significant differences exist I shall try to note them as and when that occurs, but the default is the internal drive. Nowhere here is there anything about flash storage in phones, etc.
Most of the detail was produced whilst my PC was running Windows 10 Home, with a fairly modest internal 2.5" WD Green 120 gb SSD. This uses a Silicon Motion SM2258XT controller and four 32 GiB SanDisk 05497 032G 15 nm TLC memory chips with an inbuilt SLC cache of unknown capacity. As this article tries to discuss the behaviour of SSDs as a whole it shouldn't matter what host operating or file system is used, but in my case it's Windows and NTFS. Nothing here is specific to a particular brand or type of SSD; it should all be generic. We're really dealing with the principles of SSD operation. When those principles are unknown, I shall give my own opinion, for better or worse.
The only additional software applications I have used are Piriform's excellent Recuva, which can list both live and deleted files and their cluster allocations, and HexDen (HxD), a very usable and capable hex editor. Recuva is free from www.piriform.com, and HexDen is also free from www.mh-nexus.de. I use the portable versions of both pieces of software.
All the conclusions and opinions here are entirely my own work, and any data taken from my own pc. It would be wise to verify, or at least agree with my reasoning, before accepting these words as the truth.
SSD Physical Internals:
Looking inside an SSD is something of a disappointment, a small pc board with a few NAND chips and a controller chip, lightweight and a little flimsy. As for the software inside the controller, I can only summarise the common tasks. SSD controller software is proprietary, very complex and highly guarded, but all controllers have to do basic tasks, even if we don't quite know how. Only the basic tasks can be discussed here, the very clever tweaks and tricks will have to remain known only to the manufacturer. I'll start with a little groundwork.
I'm not going to get too involved with a description of NAND flash cells. There are enough frankly bemusing articles on Wikipedia for all that. Sufficient to say that NAND (NOT-AND) flash memory stores information in arrays of memory cells made from floating-gate transistors. The floating gate can either have no charge of electrons, and be in an 'empty' logical state, or be charged with electrons at various voltage thresholds and be in a logical state which represents a value. NAND flash is non-volatile and retains its state even when the SSD is not powered up.
A Single-Level Cell (SLC) has one threshold of electron charge to indicate the state of one bit, one or zero. A Multi-Level Cell (MLC) holds the state of two bits, with three different thresholds representing 11, 10, 00 and 01. A Triple-Level Cell (TLC) holds the state of three bits, with seven thresholds representing 111, 110, 100, 101, 001, 000, 010, and 011. The 15 thresholds used in Quad-Level Cells (QLC) can be deduced if anyone is at all interested. There are other variations of what these threshold values represent in bit terms. High-use commercial SSDs are commonly built from single-level cells, with their greater speed, endurance, reliability and read/write capabilities, and the end-user consumer SSD market gets the cheaper, higher capacity but slower and more fragile multiple-level cells.
NAND flash is organised in multiple two-dimensional arrays of cells to form blocks. A row of cells in a block forms the equivalent of what would be a sector on an HDD, but as this is new technology we get a new word, a Page. Multiple pages form a block. Common page sizes are 4k, 8k or 16k, with 128 to 256 pages making a block size between 512k and 4mb.
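The block size range quoted above is just page size multiplied by pages per block. A minimal sketch of that arithmetic, assuming the 'k' and 'mb' figures are binary units (the function name is my own invention):

```python
KIB = 1024

def block_size_bytes(page_size_kib, pages_per_block):
    """Size of one erase block in bytes."""
    return page_size_kib * KIB * pages_per_block

smallest = block_size_bytes(4, 128)    # 4k pages, 128 pages per block
largest = block_size_bytes(16, 256)    # 16k pages, 256 pages per block

print(smallest // KIB, "KiB")          # 512 KiB
print(largest // (KIB * KIB), "MiB")   # 4 MiB
```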
With multi-level cells physical NAND pages represent two or more logical pages. The two bits belonging to a MLC are separately mapped to two logical pages. Odd numbered pages (including zero) are mapped to the least significant bit, and even numbered pages are mapped to the most significant bit. Similarly, the three bits belonging to a TLC are separately mapped to three logical pages, and a QLC is mapped to four logical pages (The page numbering for TLC and QLC is unknown).
The number of bits a multi-level cell has to support has a significant effect on the cell's performance. The latency of a typical triple-level cell is 4 x worse than a single-level cell for reads, but 6 x worse for writes. Erase latencies are also significantly impacted. The impact isn't proportional: although TLC has twice the read/erase time of MLC it only holds 50% more data. It's even worse for QLC cells.
The reason TLC is slower than MLC or SLC is to do with how data is accessed. With SLC the controller only has to check if one threshold has been exceeded. With MLC the cell can have four values, with TLC eight, and QLC 16. Reading the correct value of the cell requires the SSD controller to use precise voltages and multiple reads to ascertain what charge is held by that cell. It's also apparent that if a single physical page supports multiple logical pages then that page will be read and written more frequently than an SLC page, with consequent effect on its life expectancy. It would seem self-evident that a TLC SSD would need only a third of the physical cells required in an SLC device of the same capacity, so my 120 gb SSD would actually hold only 40 gb worth of NAND cells.
Anyone still following this may have noticed a common factor in both single and multi-level cells, in that an empty page - where the floating gate has no charge - represents ones. Ever since Fibonacci introduced the Hindu-Arabic numeral system with its concept of zero into European mathematics in 1202, the human mind associates zero with empty and one with full. I can see no reason why empty cells should be classed as ones, apart from sheer bloody mindedness.
The Myths and Misconceptions:
And now we come to the myths, misconceptions and the real reason for writing this article, what happens when an SSD page is read, written and rewritten, and how does this affect deleted file recovery? On one hand we have NTFS, designed specifically for HDDs way before SSDs became easily available, NAND flash with its own unique way of operating, and several billion humans with years of ingrained HDD use and expectations. And here, if I haven't already, I shall use SSD interchangeably if incorrectly for NAND flash.
Storage Device Controllers:
All HDDs, SSDs and flash drives have an internal controller. It's the way that the storage device can be, in the words of Microsoft, abstracted from the host. That abstraction is done by logical block addressing, where each cluster capable of being addressed on the device is known to the host by an ascending number (the LBA). The storage device controller maps that number to the sectors or pages on the device. To the host this mapping is constant - a cluster retains the same LBA until the host changes it. On an HDD this relationship is physical and fixed: in its simplest deconstruction an HDD controller just reads and writes whatever sectors the host asks it to. It doesn't have to think about what was there before, it just does what it's told and writes new data on top of the old. It does that because it can, there's nothing preventing a new cluster being written directly on top of the same sectors of an old one. On an SSD it's different.
With an SSD the host still uses the LBA addressing system with the constant reconciliation between LBA and cluster number. It knows that the device is an SSD and has a few tricks to accommodate this, but they will come later. The SSD controller however has many tricks to reconcile the host's file system, written for HDDs, with the demands of NAND flash.
Flash Translation Layer:
The host still uses LBA addressing to address the SSD for reads and writes, as it knows no other. These commands are intercepted by the Flash Translation Layer on the SSD controller. The FTL maintains a map of LBAs to physical block addresses (PBAs), and passes the translated PBA to the controller. This map is required because unlike an HDD the LBA to PBA relationship is volatile. It's volatile because of the way data is written to NAND flash.
An empty page, with all cells uncharged, contains by default all ones. If a hex editor is used to look at an SSD's empty sectors however, it will be presented with clusters of zeroes. This is because empty pages are not allocated to the LBA/PBA mapping table. Instead, if a read request is issued for an empty page a default page of zeroes is returned. This applies to both unallocated clusters and those which are part of a file: the SSD does not allocate a page and change all its cells from ones to zeroes.
It is not possible to read a NAND flash single cell, or even a single page. The page, and the cells within it, need to be isolated within the block. To do this the pages not being read in the block are temporarily disabled.
All cells in the same block row (a page) are connected with a Word Line. All cells in the same block column (cell offset) are connected with a Bit Line. At the end of each bit line is a sense amplifier. When a read takes place a pass-through voltage is applied to all word lines except the page being read. The pass-through voltage is close to or higher than the highest possible threshold and opens all cells in the block that are not being read. All bit lines then have a reference voltage applied. As the cells in the page being read have not had the pass-through voltage applied they will respond to the reference voltage and trigger the sense amplifier.
Only one threshold applies to SLC flash so only one test voltage is required - the cells either will or will not pass the voltage to the sense amplifiers. The LSB cells in multi-level flash also only need one test voltage due to the bit pattern in the cells, and act as if they are pseudo SLC. The MSB cells need two or more reads to determine their state.
It can be seen from this that to read one page in a block requires that all cells in every page receive either a pass-through or one or more reference voltages. It also appears that this will still apply even if some or all of the other pages in the block are empty. This becomes significant in Read Disturbance below.
The most significant aspect of NAND flash, the widest fork in the HDD/SSD path, and the fundamental, pivotal factor in what follows, is that data can only be written to an empty SSD page. This is not new, nor is it in any way unknown, but it has the greatest implications for data security and recovery.
While SSDs can read and write to individual pages, they cannot overwrite pages, as the voltages required to revert a zero to a one would damage adjacent cells. All writes and rewrites need an empty page. Unlike HDDs, where a complete cluster is written to the disk whatever was there previously, the act of writing an SSD page allocates an empty page with its default of all ones, and an electrical charge is applied to the cells that require changing to zeroes. This is as true for multi-level cells as it is for SLCs, as the no-charge all-ones pattern is either replaced with a charge representing another pattern, or is left alone. This is a once-only process.
When a write request is issued an empty page is allocated, usually within the same block, and the data written. The LBA/PBA map in the FTL is updated to allocate the new page to the relevant LBA. The LBA will always remain the same to the host: no matter which page is allocated the host will never know. This is the same process if the user data is being rewritten or if it is a new file allocation: the only difference is that the rewrite will have slightly more work to do. The old page will be flagged as invalid and will be inaccessible to the host, but will still take up space within its block as it cannot be reused.
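The read and write behaviour described so far can be put into a toy model. This is only my own sketch of the principle, not how any real controller is written: the class and attribute names are invented, pages are handed out in simple ascending order, and the page size is shrunk to 16 bytes to keep it readable.

```python
PAGE_SIZE = 16  # bytes, for illustration only; real pages are 4k to 16k

class TinyFTL:
    """Toy flash translation layer: out-of-place writes, LBA to PBA map."""

    def __init__(self, page_count):
        self.pages = [None] * page_count  # None means an erased (empty) page
        self.invalid = set()              # physical pages holding stale data
        self.lba_to_pba = {}              # the LBA/PBA map
        self.next_free = 0                # next empty physical page

    def write(self, lba, data):
        if self.next_free >= len(self.pages):
            raise RuntimeError("no empty pages left, garbage collection needed")
        pba = self.next_free              # a write always takes an empty page
        self.next_free += 1
        self.pages[pba] = data
        old = self.lba_to_pba.get(lba)
        if old is not None:
            self.invalid.add(old)         # a rewrite strands the old page
        self.lba_to_pba[lba] = pba        # the host keeps the same LBA
        return pba

    def read(self, lba):
        pba = self.lba_to_pba.get(lba)
        if pba is None:
            return bytes(PAGE_SIZE)       # unmapped LBA: default page of zeroes
        return self.pages[pba]

ftl = TinyFTL(page_count=8)
ftl.write(5, b"old contents")
ftl.write(5, b"new contents")             # same LBA, different physical page
print(ftl.read(5))                        # b'new contents'
print(ftl.invalid)                        # {0}
print(ftl.read(9))                        # sixteen zero bytes
```

The host only ever sees LBA 5; the stale copy in physical page 0 still exists but there is no LBA through which it can be reached.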
Whilst it's easy to grasp writing to SLC pages, multi-level cell pages are more difficult to visualise. The controller accumulates new writes in the SSD cache until enough logical pages to fill a physical page are gathered, and then writes the physical page. This entails the fewest writes to the page. If a logical page in a multi-level page is amended it would require a new page to be allocated and all logical pages rewritten, as the individual values in the physical page can't be altered. If a logical page is deleted then I surmise that the deleted logical page is flagged as invalid, and when the block becomes a candidate for garbage collection any valid logical pages are consolidated before writing. In other words a multi-level page, or at least the majority of them, will always contain a full complement of logical pages.
It's apparent that if NAND flash handles data writes in this way - and it does - the SSD will eventually become full of valid and invalid pages, and performance will gradually slow to a crawl. Although an individual SSD page can't be erased a block can, and this method is used to return blocks to a writable state. To expedite this, and to ensure that a pool of empty blocks is always available for writes, the SSD controller uses Garbage Collection.
Garbage Collection is enabled on the humblest up to the highest capacity SSD: without it NAND flash would be unusable. Garbage Collection is part of the SSD controller and its work is unknown to the host. In its simplest form GC takes a block holding valid and invalid pages, copies the valid pages to a new empty block, updates the LBA mapping tables, and consigns the old block to the invalid block pool. There the block and its pages are reset to empty state, and the block added to the available block pool. Thus a pool of available blocks should always be available for write activity. As long as there is power to the SSD GC will do its work, it cannot be stopped. There are various sophisticated techniques for GC routines, all proprietary and mainly known only to the manufacturers.
When an SSD arrives new from the factory writes will gradually fill the drive in a progressive, linear pattern until the addressable storage space has been entirely written. However once garbage collection begins, the method by which the data is written – sequential vs random – affects performance. Sequentially written data writes whole blocks, and when the data is replaced the whole block is marked as invalid. During garbage collection nothing needs to be moved to another block. This is the fastest possible garbage collection – i.e. no garbage to collect. When data is written randomly invalid pages are scattered throughout the SSD. When garbage collection acts on a block containing randomly written data, more data must be moved to a new block before the block can be erased.
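The difference between collecting a sequentially written block and a randomly written one can be shown with a small model. It is a sketch of the simplest form of GC described above and nothing more: the data layout and the function name are my own assumptions, and a real controller chooses its victim blocks with far more cunning.

```python
def collect(blocks, lba_map, victim_id, spare_id):
    """Copy the valid pages of the victim block into the spare block,
    update the LBA map, erase the victim. Returns the number of pages moved.
    A page is None (empty) or a tuple (lba, is_valid)."""
    moved = 0
    for entry in blocks[victim_id]:
        if entry is not None and entry[1]:
            lba = entry[0]
            dest = blocks[spare_id].index(None)   # first empty page in the spare
            blocks[spare_id][dest] = (lba, True)
            lba_map[lba] = (spare_id, dest)
            moved += 1
    blocks[victim_id] = [None] * len(blocks[victim_id])  # block erase
    return moved

blocks = {
    0: [(10, False), (11, True), (12, False), (13, True)],    # random writes
    1: [None, None, None, None],                               # empty spare
    2: [(20, False), (21, False), (22, False), (23, False)],  # sequential data, all replaced
    3: [None, None, None, None],                               # empty spare
}
lba_map = {11: (0, 1), 13: (0, 3)}

random_moves = collect(blocks, lba_map, victim_id=0, spare_id=1)
sequential_moves = collect(blocks, lba_map, victim_id=2, spare_id=3)
print(random_moves, sequential_moves)   # 2 0
```

The block that was written and replaced as a whole costs nothing to reclaim, whereas the randomly written block costs two extra page writes before it can be erased.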
The Garbage Collection Conundrum:
Garbage Collection can either take place in the background, when the host is idle, or the foreground, as and when it is needed for a write. Whilst background GC may seem to be preferable, it has drawbacks. If the host uses a power-saving mode when idle, GC will either wait for the device to restart with a consequent user delay for GC to complete, or wake the device up and reduce battery life whilst the host is 'idle'. Furthermore GC has no knowledge of the data it is collecting. Inevitably some data will be subject to GC and then be deleted shortly afterwards, incurring another bout of GC and consequent additional and unnecessary writes (write amplification, the ratio of actual writes to data writes). Foreground GC, seemingly the antithesis of performance, avoids the power-saving problems, only incurs writes when they are actually required, and with fast cache and highly developed GC algorithms presents no noticeable performance penalty to the user. The trend in modern GC appears to be foreground collection.
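Write amplification as defined above is a simple ratio. A minimal sketch, assuming the only extra writes are the valid pages copied by garbage collection (the function name and the figures are made up for illustration):

```python
def write_amplification(host_pages_written, gc_pages_copied):
    """Ratio of pages actually written to NAND to pages the host asked for."""
    nand_pages_written = host_pages_written + gc_pages_copied
    return nand_pages_written / host_pages_written

no_gc = write_amplification(100, 0)       # nothing had to be moved
heavy_gc = write_amplification(100, 150)  # GC copied 150 pages along the way
print(no_gc, heavy_gc)                    # 1.0 2.5
```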
Given foreground garbage collection, and given that most user activity is random, the inevitable conclusion is that the SSD will spend most of its life at full capacity, if by that we mean available blocks, even though the allocated space appears to the host to be low.
However there is another potential problem with SSDs, and that is to do with a historical event: the way that file systems were designed.
File Systems - What you see isn't what you get:
Host file systems were designed in the days when HDDs reigned supreme, simply because SSDs had yet to arrive in an available and affordable form. The file system does not take into account the needs of NAND flash. Files are constantly being updated: they get allocated, moved and deleted, and grow and shrink in size. The way the file system handles this is incompatible with the workings of NAND flash.
It's worth emphasising that storage devices are abstracted from the host operating system. Whilst an array of folders and files is displayed by Explorer in a form wholly comprehensible to a human, it's all an illusion. What Explorer is showing is a logical construct created entirely from metadata held within the file system's tables. The storage device controller knows nothing about files or folders, or tables or operating systems: all an HDD or SSD sees are commands to read or write specific sectors, which it does faithfully. An SSD has one advantage over an HDD however, it knows that some pages hold data, and are mapped to an LBA, and some pages are empty, hold no valid data, and are not mapped to an LBA. Conversely an HDD does not need to know this, to an HDD all sectors are the same.
In NTFS, when a file is deleted the entry in the Master File Table is flagged as such, and the cluster bitmap is amended to flag the file's clusters as available for reuse. The delete process takes place entirely within the MFT and the cluster bitmap. This is perfectly adequate for an HDD, as NTFS can simply reuse the MFT entry and the clusters whenever it wishes. On an SSD the process from NTFS's point of view is exactly the same, as NTFS has no other way of deleting files. However all the SSD sees is exactly what an HDD would see, updates to a few pages. Neither an HDD nor an SSD knows that it's the MFT and cluster bitmap being updated, as they have no knowledge of such things. As there is no activity on the deleted file's clusters, the SSD's pages holding the clusters remain mapped to their LBAs in the FTL. The SSD's FTL has no way of knowing that these pages are no longer allocated by NTFS: to the SSD the pages are still valid and will not be cleaned up by garbage collection.
As these 'dead' pages are allocated to an LBA they could be released when files are allocated or extended and the host uses that LBA. In this case the page will be flagged as invalid and a new page used. However it is inevitable that eventually a significant amount of unused and unwanted baggage which is not flagged for garbage collection will be pointlessly maintained by the SSD controller and be unavailable for reuse. To overcome this, and to correlate the host's view of allocated and unallocated pages with the SSD's, NTFS from Windows 7 onwards acquired the TRIM command.
Although the storage device is abstracted from the File System, to enable some of the file system's SSD tweaks it needs to know whether the device is an HDD or SSD. There are various ways to do this, including querying the rotational speed of the device, which on an SSD should be zero (or perhaps one). This seems the most widely used and most proficient method.
TRIM (it isn't an acronym) is a SATA command sent by the file system whenever a file is deleted or moved and the cluster is flagged as unallocated in the cluster bitmap. TRIM has to be supported and enabled in the file system, and supported in the SSD, to take effect. It tells the FTL that the pages allocated to specific LBAs are to be classed as invalid. The pages can then be gathered by garbage collection and returned to use. TRIM is an asynchronous command that is queued for low-priority operation. It does not need or send a response. The size of the TRIM queue is limited and in times of high activity some TRIM commands may be dropped. There is no indication that this takes place, so some unwanted pages may escape garbage collection.
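What TRIM changes inside the FTL can be sketched in a few lines. The model below is my own simplification: the names are invented, the page size is shrunk to 16 bytes, and the queueing and dropping of TRIM commands is ignored altogether.

```python
PAGE_SIZE = 16  # bytes, for illustration only

# A three-cluster file as the SSD sees it: three LBAs mapped to three pages.
lba_to_pba = {100: 7, 101: 8, 102: 9}
pages = {7: b"A" * PAGE_SIZE, 8: b"B" * PAGE_SIZE, 9: b"C" * PAGE_SIZE}
invalid = set()   # pages the garbage collector is allowed to reclaim

def read(lba):
    pba = lba_to_pba.get(lba)
    return pages[pba] if pba is not None else bytes(PAGE_SIZE)

def trim(lbas):
    """Unmap the given LBAs and flag their pages as invalid."""
    for lba in lbas:
        pba = lba_to_pba.pop(lba, None)
        if pba is not None:
            invalid.add(pba)

# NTFS deletes the file: only the MFT and cluster bitmap change, so the
# SSD still maps the clusters and still returns the old data.
before_trim = read(100)

# The TRIM for the freed clusters arrives.
trim([100, 101, 102])
after_trim = read(100)

print(before_trim)      # b'AAAAAAAAAAAAAAAA'
print(after_trim)       # sixteen zero bytes
print(sorted(invalid))  # [7, 8, 9]
```

Note that `pages[7]` still holds the old data after the TRIM; it is simply no longer reachable through any LBA.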
Windows Defragger - now called Storage Optimiser - has an option to Optimise SSDs. This does not defrag the SSD but sends a series of TRIM commands to all unallocated clusters in the cluster bitmap. This global TRIM (or RETRIM) command is run at a granularity such that the TRIM queue will never exceed its permitted size and no RETRIM commands will be dropped. A RETRIM is run automatically once a month by the storage optimiser.
All NAND flash devices use over-provisioning, additional capacity for extra write operations, controller firmware, failed block replacements, and other features utilised by the SSD controller. This capacity is not physically separated from the user capacity but is simply an amount of space that can't be allocated by the host. According to Seagate, the minimum reserve is the difference between binary and decimal naming conventions. An SSD is marketed as a storage device and its capacity is measured in gigabytes (1,000,000,000 Bytes). NAND flash however is memory and is measured in gibibytes (1,073,741,824 bytes), making the minimum overprovisioning percentage just over 7.37%. Even if an SSD appears to the host to be full, it will still have 7.37% of available space with which to keep functioning and performing writes (although write performance will be diabolical). Manufacturers may further reduce the amount of capacity available to the user and set it aside as additional over-provisioning, in addition to the built-in 7.37%. Additional over-provisioning can also be created by the host by allocating a partition that does not use the drive’s full capacity. The unallocated space will automatically be used by the controller as dynamic over-provisioning.
My humble WD SSD has four 32 gb chips but a specified capacity of 120 gb, meaning that it has 8 gb set aside as additional over-provisioning. Add this to the 7.37% minimum (9.4 gb) and the 17.4 gb equates to almost 15% over-provisioning space.
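The over-provisioning figures can be checked with a few lines of arithmetic. This sketch assumes the chips are 32 GiB each and the 120 gb on the label is decimal gigabytes; the variable names are my own.

```python
GB = 10**9    # decimal gigabyte, used for marketed capacity
GIB = 2**30   # binary gibibyte, used for NAND chips

# Minimum reserve: the gap between a gibibyte and a gigabyte.
min_op = (GIB - GB) / GB

# My drive: four 32 GiB chips sold as 120 GB.
raw_bytes = 4 * 32 * GIB
user_bytes = 120 * GB
spare_gb = (raw_bytes - user_bytes) / GB
total_op = (raw_bytes - user_bytes) / user_bytes

print(f"minimum over-provisioning: {min_op:.2%}")   # 7.37%
print(f"spare capacity: {spare_gb:.1f} GB")         # 17.4 GB
print(f"total over-provisioning: {total_op:.1%}")   # 14.5%
```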
Some files are written once and remain untouched for the rest of their life. Others have few updates, some very many. As a consequence some blocks will hardly ever see the invalid block pool and have a very low erase/write count, and some will be in the pool every few minutes and have a very heavy count. To spread the wear so that all blocks are subject to erase/writes equally, and the performance of the SSD is maintained over its life, wear levelling is used. Wear levelling uses algorithms to identify blocks with the lowest erase count and move the contents to high erase count blocks; and to select low erase count blocks for new allocations. As with garbage collection, wear levelling is far more complex than I could possibly deduce, let alone explain.
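The two selection rules just described can be written down directly. This is only a sketch of the principle with invented names and erase counts; real wear levelling algorithms are proprietary and certainly weigh up far more than a single counter.

```python
def pick_block_for_new_data(free_blocks, erase_counts):
    """New allocations go to the free block with the fewest erases."""
    return min(free_blocks, key=lambda b: erase_counts[b])

def pick_static_move(used_blocks, free_blocks, erase_counts):
    """Rarely rewritten data sits in the used block with the fewest erases.
    Move it into the most worn free block, releasing the fresh block for
    busier data. Returns (source_block, destination_block)."""
    source = min(used_blocks, key=lambda b: erase_counts[b])
    destination = max(free_blocks, key=lambda b: erase_counts[b])
    return source, destination

erase_counts = {0: 3, 1: 250, 2: 40, 3: 900, 4: 12}  # hypothetical counts
used = [0, 2]
free = [1, 3, 4]

new_data_block = pick_block_for_new_data(free, erase_counts)
static_move = pick_static_move(used, free, erase_counts)
print(new_data_block)  # 4
print(static_move)     # (0, 3)
```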
And now we come to deleted file recovery. We know that NTFS deletes files on an SSD in exactly the same way as it does on an HDD, with the exception of an additional TRIM. The TRIM command, assuming it's executed, destroys the chances of deleted file recovery. When the TRIM command is executed by the SSD controller the target page is flagged as invalid and removed from the LBA mapping. Any attempt by the host to access it will, within a few seconds, return a default page of zeroes. The data on the page will still exist until the block is processed by the garbage collector, but that data is not accessible from the host by any means, or any software.
Thus the notion of secure deletion - overwriting data before deletion - is irrelevant, and if any pattern other than zeroes is chosen it is just additional and pointless wear on the SSD. Even overwriting with zeroes will cause transaction log and other files to be written, so secure deletion on an SSD should never be used.
After a session of file deletion, such as running Piriform's CCleaner, run Recuva on the SSD. The headers of the found files (and presumably the rest of the file) will all be zeroes. This is TRIM doing its work in a few seconds, killing any chance of deleted file recovery. Of course TRIM can be disabled, at the cost of performance, but it's probably better to be a little less cavalier when deleting files that might be wanted later.
There is a small possibility of recovering recently deleted files. Power off the SSD immediately and send the SSD to a professional data recovery company. They may recover some data, given enough time and money.
The OCZ Myth:
Some years ago (as a little light relief to all these acres of text) the OCZ forums were buzzing with the latest method of regaining performance on their SSDs: run Piriform's CCleaner Wipe Free Space, with one overwrite pass of zeroes. Although performance may have been regained, logic, and common sense, went out of the window. The theory was that overwriting the pages with zeroes was equivalent to erasing blocks (this was before the days of TRIM). This was nonsense, and should have been apparent from the start. The default state of an empty page is all ones, not zeroes, and how could a piece of software possibly erase NAND flash? The real reason was that as CCleaner was filling the pages with zeroes the SSD controller simply unmapped the pages and showed default pages of zeroes to the host. The invalid pages were then candidates for garbage collection, which gave a much greater pool of blocks to call upon on writes, and hence a better performance. A sort of RETRIM before that was invented.
One of the SSD mantras is that an SSD should never be defragged. Whilst there is little (there is a little) to be gained from rearranging clusters into adjacent pages - an SSD has no significant overhead in random reads - an SSD defrag is not entirely verboten. In fact from Windows 8 onwards the Storage Optimiser will defrag an SSD if certain conditions are met. If System Restore is enabled, the fragmentation level is above 10%, and at least one month has passed since the last defrag, Windows Storage Optimiser Scheduled Maintenance will defrag the SSD. This is what Microsoft calls a Traditional Defrag, it is not an Optimise (RETRIM). The defrag is required to reduce the extents on the volume snapshot files when system restore is enabled.
There is nothing to be afraid of in a monthly defrag. Most users won't hit the 10% fragmentation criterion so a simple RETRIM will be run, and Windows 10 users won't get defragged anyway (System Restore is disabled in Windows 10 by default). The reduction in life of an SSD will not be noticed. Furthermore, although SSDs are not fazed by random reads, files do get fragmented and that means a significant increase in I/Os. An occasional clearup is a boon.
SSD Lifetime:
There are many users worried about the life expectancy of their SSDs. Yes, continuous write/erase cycles, and the added and unseen write amplification, do take a toll on the life of NAND flash. Using an SSD does wear it out. My WD Green 120 gb SSD, a TLC SSD from a reputable manufacturer but at the very lowest cost, has an estimated life of 1 million+ hours and a write limit of 40 terabytes. One million hours is 114 years, so we can forget that. As for writes, at 1 gb a day - far more than my current rate of data use - it would take roughly 110 years to reach 40 tb. Even with massive write overhead this SSD is not going to wear out in the foreseeable future. If all 128 gib of available flash is used equally, the 40 tb equates to 312 writes per cell, a very conservative number.
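For anyone who wants to check the sums, here is the lifetime arithmetic as a short sketch. It assumes 40 tb means 40,000 gb, a year of 365 days, and, as in the text, treats the 128 gib of flash as 128 gb for the writes-per-cell figure. On those assumptions the write limit comes out at roughly 110 years.

```python
HOURS_PER_YEAR = 24 * 365

rated_hours = 1_000_000
write_limit_gb = 40_000     # 40 tb taken as 40,000 gb
daily_writes_gb = 1
flash_gb = 128              # four 32 gb chips, binary/decimal difference ignored

years_by_hours = rated_hours / HOURS_PER_YEAR
years_by_writes = write_limit_gb / daily_writes_gb / 365
writes_per_cell = write_limit_gb / flash_gb

print(f"{years_by_hours:.0f} years by rated hours")   # 114
print(f"{years_by_writes:.0f} years by write limit")  # 110
print(f"{writes_per_cell:.1f} writes per cell")       # 312.5
```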
SSD reads are not quite free, there is a price to pay. As described above, a read of one page generates a pass-through voltage on all other cells in the block. As noted there, this voltage is close to or higher than the highest threshold value that could be held by the cell. Although the pass-through voltage is not as high as the programming voltage, it still generates a weak programming effect on the cells, which can unintentionally shift their threshold voltages. The high pass-through voltage induces electric tunnelling that can shift the voltages of the unread cells to a higher value, disturbing the cell contents. As the size of flash cells is reduced the transistor oxide becomes thinner and in turn increases this tunnelling effect, with fewer read operations required to neighbouring pages for the unread flash cells to become disturbed, and move into a different logical state. Cells holding lower threshold values are more susceptible to read disturbance.
Thus each read can cause the threshold voltages of other unread cells in the same block to shift to a higher value. After a significant number of reads this can cause read errors for those cells. A read count is kept for each block and if it is exceeded the block is rewritten. The count is high for SLC cells, around 1m, lower for 25 nm MLC at around 20,000, and much lower for 15 nm TLC cells.
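The per-block read count can be pictured as a simple counter. This is a sketch of the idea only: the class name is invented, the limit used in the example is absurdly low so that the effect is visible, and I have assumed the count starts again from zero once the block has been rewritten.

```python
class BlockReadCounter:
    """Counts reads against one block and notes when a rewrite is due."""

    def __init__(self, limit):
        self.limit = limit    # reads tolerated before the block is refreshed
        self.reads = 0
        self.rewrites = 0

    def record_read(self):
        self.reads += 1
        if self.reads >= self.limit:
            # The block's contents would be copied to a fresh block here.
            self.rewrites += 1
            self.reads = 0    # assumed: the count restarts after a rewrite

counter = BlockReadCounter(limit=3)   # hypothetical limit
for _ in range(7):
    counter.record_read()

print(counter.rewrites, counter.reads)   # 2 1
```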
The only thing to add is that NAND flash, SSDs, and especially SSD controllers, are far more sophisticated, complex and incomprehensible than what has been written here, what I know, what I could possibly comprehend, and what I could possibly explain. I should also add secret, as their software is proprietary. Whilst an HDD is a marvel of complex electro-mechanical engineering at a ridiculously low cost, the SSD is an equally marvellous and complex piece of electronics and software at a minimally higher cost. We should be thankful for both.
If you have any questions, comments or criticisms at all then I'd be pleased to hear them: please email me at kes at kcall dot co dot uk.
© Webmaster. All rights reserved.