Deriving and Comparing Deduplication Techniques Using a Model-Based Classification

Jürgen Kaiser, André Brinkmann, Tim Süß, Johannes Gutenberg University Mainz, {kaiserj, brinkman, suesst}@uni-mainz.de
Dirk Meister, Pure Storage, [email protected]
https://research.zdv.uni-mainz.de

Data Deduplication

Technique to save storage capacity
•  Exploit redundancy
•  Commonly used for backup systems
•  Data Domain, HP, …

Basic technique
•  Split data up in chunks
•  Fingerprint chunks
•  Compare-by-Hash
•  Remove duplicates

Goals

Goal 1: Uncouple core concepts from implementations
•  Traditional: Systems as inherently linked characteristics
•  But: Systems consist of independent core concepts
•  Dedup. approach is defined by its prefetching approach
•  Prefetching vs. deduplication exactness

[Figure: components of each approach in its exact and approximate variant, grouped into disk and memory]

exact
•  Container Caching: Chunk Index, Bloom Filter, cont. storage, cont. cache
•  Block Locality Caching: Chunk Index, Bloom Filter, Block Index, BLC
•  Sparse Indexing: Chunk Index, Sparse Index, segment storage, manifest cache

approximate
•  Container Caching: Sparse Index, cont. storage, cont. cache
•  Block Locality Caching: Sparse Index, Block Index, BLC
•  Sparse Indexing: Sparse Index, segment storage, manifest cache

Goal 2: Extensive Comparison of all approaches
•  Container Caching and Sparse Indexing not compared on the same data sets
•  Comparison of the approaches
•  {CC, BLC, SI} x {exact, approximate} x {chunk sizes} …
•  IO patterns

Different Data Sets and Chunk Sizes

Different Data Sets
•  Weekly backups
•  3x university home directories, 1x Windows machines (Meyer et al.)

[Figure: results per data set, panels: exact, approx.]

Different RAM sizes for approximate approaches
•  So far: 8 GB total memory (cache size to data set size)
•  But: today 128 GB and more

Different Chunk Sizes
•  Tradeoff between less memory and more duplicate detection
•  4-16 KB (HOME), 8 + 16 KB (Microsoft)

[Figure: results per chunk size, panels: exact, approx.]

IO Access Patterns

[Figure: IO access patterns, panels: Container Caching, Block Locality Caching, Sparse Indexing]
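
Sketch: basic technique

The four steps of the basic technique (split data up in chunks, fingerprint chunks, compare by hash, remove duplicates) can be illustrated with a minimal in-memory sketch. It is not the implementation of any system named on this poster. The function names `deduplicate` and `restore`, the fixed-size chunking, the SHA-256 fingerprint, and the dictionary used as chunk index are all assumptions made for illustration; the 8 KB default is picked from the 4-16 KB range mentioned above.

```python
import hashlib


def deduplicate(data: bytes, chunk_size: int = 8 * 1024, store=None):
    """Split data into chunks and store each distinct chunk only once.

    Assumption: fixed-size chunks and an in-memory dict as chunk index.
    Returns (recipe, store): the recipe is the ordered list of fingerprints
    needed to rebuild the data, the store maps fingerprint -> chunk bytes.
    """
    if store is None:
        store = {}
    recipe = []
    for offset in range(0, len(data), chunk_size):
        # Step 1: split data up in chunks.
        chunk = data[offset:offset + chunk_size]
        # Step 2: fingerprint the chunk (SHA-256 is an assumption).
        fingerprint = hashlib.sha256(chunk).digest()
        # Step 3: compare by hash against the chunk index.
        if fingerprint not in store:
            # Step 4: only chunks not seen before are stored.
            store[fingerprint] = chunk
        recipe.append(fingerprint)
    return recipe, store


def restore(recipe, store) -> bytes:
    """Rebuild the original data from its recipe and the chunk store."""
    return b"".join(store[fingerprint] for fingerprint in recipe)


if __name__ == "__main__":
    # Three 8 KB chunks, the first and third are identical.
    backup = b"A" * 8192 + b"B" * 8192 + b"A" * 8192
    recipe, store = deduplicate(backup)
    print(len(recipe), "chunks referenced,", len(store), "chunks stored")
    assert restore(recipe, store) == backup
```

Passing the same `store` to a second call models a later backup run: chunks already present from the first run are referenced but not stored again. In this sketch every fingerprint is looked up in a complete in-memory index, so it corresponds to exact deduplication without any of the disk, cache, or prefetching components compared on the poster.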