Deduplication is one technique ZFS can use to store file and other data in a pool. If several files contain the same pieces (blocks) of data, or any other pool data occurs more than once in the pool, ZFS will store just one copy of it. In effect instead of storing many copies of a book, it stores one copy and an arbitrary number of pointers to that one copy. Only when no file uses that data, is the data actually deleted. ZFS keeps a reference table which links files and pool data to the actual storage blocks containing “their” data. This is the Deduplication Table (DDT).
The main benefit of deduplication is that, where appropriate, it can greatly reduce the size of a pool and the disk count and cost. For example, if a server stores files with identical blocks, it could store thousands or even millions of copies for almost no extra disk space. When data is read or written, it is also possible that a large block read or write can be replaced by a smaller DDT read or write, reducing disk I/O size and quantity.
The deduplication process is very demanding! There are four main costs to using deduplication: large amounts of RAM, requiring fast SSDs, CPU resources, and a general performance reduction. So the trade-off with deduplication is reduced server RAM/CPU/SSD performance and loss of “top end” I/O speeds in exchange for saving storage size and pool expenditures.
source: https://www.truenas.com/docs/references/zfsdeduplication/
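The bookkeeping described in the quote above can be pictured with a toy model. This is only a rough sketch, not how ZFS implements the DDT: the class name `DedupStore` and its methods are made up for illustration, and SHA-256 is used here as the block checksum.

```python
import hashlib


class DedupStore:
    """Toy block store: each distinct block is kept once, with a reference count."""

    def __init__(self):
        self.blocks = {}    # checksum -> block data (stored only once)
        self.refcount = {}  # checksum -> number of references to that block

    def write(self, block: bytes) -> str:
        """Store a block and return its checksum, reusing an existing copy if present."""
        key = hashlib.sha256(block).hexdigest()
        if key not in self.blocks:
            self.blocks[key] = block
        self.refcount[key] = self.refcount.get(key, 0) + 1
        return key

    def free(self, key: str) -> None:
        """Drop one reference; the data is deleted only when nothing uses it."""
        self.refcount[key] -= 1
        if self.refcount[key] == 0:
            del self.refcount[key]
            del self.blocks[key]


store = DedupStore()
first = store.write(b"same data")
second = store.write(b"same data")   # duplicate: no new block is stored
other = store.write(b"other data")
print(len(store.blocks))             # 2 distinct blocks for 3 writes
store.free(first)
print(len(store.blocks))             # still 2: "same data" is referenced once more
```

The lookup on every write is also where the cost comes from: the table has to be consulted for each block, which is why the real DDT wants plenty of RAM or fast SSDs.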
I experimented with deduplication on TrueNAS and found a few additional points worth noting.
- A lot of data is already deduplicated.
- Data that is compressed (videos, pictures, VMware VMDK files) will not benefit from deduplication.
My best advice to people considering deduplication is to create a dataset or zvol with deduplication enabled and try it out. Put some sample data in it and see how well it dedups. You can check the ratio by opening the shell and running “zpool list”; the DEDUP column shows the deduplication ratio. Note that this ratio is reported for the whole pool, not for a single dataset.
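If you want a rough idea before copying the sample data into the pool, a small script can hash the files in fixed-size blocks and count how many blocks repeat. This is only an approximation under stated assumptions: the function name `estimate_dedup_ratio` is made up for this sketch, it assumes fixed 128 KiB blocks (the default ZFS recordsize), and it ignores compression, metadata and everything else ZFS does. The number from “zpool list” on real data is the one to trust.

```python
import hashlib
import os
import tempfile


def estimate_dedup_ratio(paths, block_size=128 * 1024):
    """Return (total_blocks, unique_blocks, ratio) for the given files."""
    seen = set()
    total = 0
    for path in paths:
        with open(path, "rb") as f:
            while True:
                block = f.read(block_size)
                if not block:
                    break
                total += 1
                seen.add(hashlib.sha256(block).digest())
    ratio = total / len(seen) if seen else 1.0
    return total, len(seen), ratio


def demo():
    """Write two identical 4-block files to a temp dir and estimate the ratio."""
    data = b"".join(bytes([i]) * 4096 for i in range(4))  # 4 distinct 4 KiB blocks
    with tempfile.TemporaryDirectory() as tmp:
        paths = []
        for name in ("a.bin", "b.bin"):
            path = os.path.join(tmp, name)
            with open(path, "wb") as f:
                f.write(data)
            paths.append(path)
        return estimate_dedup_ratio(paths, block_size=4096)


if __name__ == "__main__":
    total, unique, ratio = demo()
    print(f"{total} blocks, {unique} unique, estimated ratio {ratio:.2f}x")
```

To try it on your own sample data, pass a list of file paths to `estimate_dedup_ratio` instead of calling `demo`. A result close to 1.00x suggests deduplication is unlikely to pay for its RAM and CPU cost.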
Log files, text files, and other files with a lot of duplicated data dedup really well.
Always try compression first.