Anton Gostev: VP, Product Management, Veeam

What is data corruption?
Silent data corruption
Why does data corruption happen?
Where does it happen most commonly?
In theory
In practice
Common corruption patterns
Tips and tricks for recovering from corrupted Veeam® backup files
Data corruption is an unintended data change
Can occur on either the write or the read path
Unintended data changes are facts of life!
Data management systems are designed with corruption in mind
Detection algorithms
Via parity bit, checksumming (CRC32), hashing (MD5)
Retransmit on detection
Correction algorithms (includes detection)
Via error correcting code (ECC), erasure codes, etc.
In-place data recovery on detection
Both increase data footprint noticeably (performance, not so much)
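The detect-and-retransmit pattern can be sketched in a few lines of Python; this is an illustrative sketch using CRC32 from the standard library, not Veeam's actual implementation:

```python
import zlib

def send_block(data: bytes) -> tuple[bytes, int]:
    """Sender attaches a CRC32 checksum to every data block."""
    return data, zlib.crc32(data)

def block_is_valid(data: bytes, checksum: int) -> bool:
    """Receiver recomputes the checksum; a mismatch means corruption."""
    return zlib.crc32(data) == checksum

block, crc = send_block(b"backup block payload")
assert block_is_valid(block, crc)                  # intact block passes

corrupted = bytes([block[0] ^ 0x01]) + block[1:]   # single flipped bit
assert not block_is_valid(corrupted, crc)          # detected -> retransmit
```

Note that CRC32 only detects the change; the fix is to retransmit. Correction codes (ECC, erasure codes) go further by repairing the data in place, at the cost of a larger footprint.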
Missing detection in hardware and software results in SILENT data corruption
Most software simply does not implement detection
However, most backup software does have detection algorithms built in
THIS is why you start seeing lots of corruption when dealing with backups
It’s NOT because all backup software is bad!
RAM and CPU electronics problems (sticky bit)
URE (Unrecoverable Read Error) and storage wear
Data transfer noise (wireless, Ethernet, FC, SATA)
Firmware bugs and “optimizations” (NAS software, NIC, RAID controller)
Software bugs
OS kernel
File system
Virtualization
Data management applications
Not just computers (but also routers, NICs, RAID controllers)
Rated BER (Bit Error Rate) from 10⁻¹² (1 error per 100 GB). Still want to
save money by buying non-ECC memory for your servers?
Breaks over time due to transistor wear! Normally takes at least 5-7 years,
but high operating temperatures accelerate the process
Results in:
Unbootable system (a recent Cisco router issue)
Sticky bit silently corrupting all of your data
URE (Unrecoverable Read Error)
Type               BER     One error per
Consumer SATA      10⁻¹⁴   10 TB
Enterprise SATA    10⁻¹⁵   100 TB
Enterprise SAS     10⁻¹⁶   1 PB
Tape (LTO)         10⁻¹⁷   10 PB*
Tape (Enterprise)  10⁻¹⁹   1 EB*
*Assuming proper handling
Consider mechanical wear for tape and classic hard drives, and electronic
wear for SSDs (SLC/MLC/TLC)
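The "one error per" column is just arithmetic on the BER; a quick illustrative sketch:

```python
def bytes_per_error(ber: float) -> float:
    """Expected number of bytes read between unrecoverable bit errors."""
    return (1 / ber) / 8    # one error per 1/BER bits, 8 bits per byte

# Consumer SATA at BER 10^-14: one error per ~12.5 TB read,
# which the table rounds down to the 10 TB order of magnitude
print(bytes_per_error(1e-14) / 1e12)   # ~12.5 (TB)
```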
Bit Error Rate (BER)
Type                BER     One error per       One error every
1 Gb Ethernet       10⁻¹⁰   1 GB transferred    10 seconds
10 Gb Ethernet      10⁻¹²   100 GB transferred  1 minute 40 seconds
4 Gb Fibre Channel  10⁻¹²   250 GB transferred  ~5 minutes
Checksummed and retransmitted if necessary
Wireless uses Symbol Error Rate (SER); is affected by technology,
hardware, modulation, distance, collisions, noise, etc.
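At full line rate, the "one error every" column follows directly from the BER and the link speed; an illustrative sketch:

```python
def seconds_between_errors(ber: float, line_rate_bps: float) -> float:
    """Mean time between bit errors when the link runs at full line rate."""
    return (1 / ber) / line_rate_bps

# 1 Gb Ethernet at BER 10^-10: 10^10 bits per error / 10^9 bits/s = 10 s
print(seconds_between_errors(1e-10, 1e9))    # ~10 seconds
# 10 Gb Ethernet at BER 10^-12: ~100 seconds (1 minute 40 seconds)
print(seconds_between_errors(1e-12, 1e10))   # ~100 seconds
```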
NAS firmware
Underlying base OS is usually solid
Questionable “optimizations” and tweaks
RAID controller firmware
Most overlooked component in software patching
NIC firmware
Poor quality from certain vendors
Also rarely patched, aside from performance troubleshooting cases
OS kernel (Server 2012 “magic 10 bytes” or Linux SLAB bugs)
File system (two evils: bad stability vs. bad architecture)
Virtualization
Too much code in between apps and hardware
Too many moving pieces (hard to test)
Data management applications
Algorithmic bugs (e.g., incremental without full)
Non-transactional I/O handling (common for immature software)
Data corruption bugs are extremely rare (hard to make, easy to catch)
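"Non-transactional I/O handling" typically means a crash mid-write leaves a half-updated file on disk. The standard defense is write-to-temp, fsync, then atomic rename, so readers only ever see the old or the new version. A minimal illustrative sketch, not Veeam's implementation:

```python
import os
import tempfile

def atomic_write(path: str, data: bytes) -> None:
    """Write data so readers see either the old or the new file, never a mix."""
    dirname = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=dirname)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())   # force data to stable storage first
        os.replace(tmp, path)      # atomic rename on POSIX and Windows
    except BaseException:
        os.unlink(tmp)             # leave no half-written temp file behind
        raise
```

Writing the temp file in the same directory matters: rename is only atomic within a single file system.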
Corruption statistics in our support
Over six years and 120,000 customers
Through my prism of perception!
Disclaimer
There are three types of people:
1. People who trust statistics
2. People who do not trust statistics
3. People who make up their own statistics
Network shares
Windows SMB client issues
Low-end NAS and various appliances
Corrupted network traffic
Bad NIC firmware
Vendor ignoring TCP/IP reference
Storage-level corruption
RAID controllers writing rubbish
Corruptions from storage-side data processing (e.g., dedupe)
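A vendor ignoring the TCP/IP reference matters because TCP's own protection is weak to begin with: the Internet checksum (RFC 1071) is a 16-bit ones'-complement sum, which cannot even detect reordered 16-bit words. An illustrative sketch:

```python
def internet_checksum(data: bytes) -> int:
    """RFC 1071 ones'-complement sum over 16-bit words."""
    if len(data) % 2:
        data += b"\x00"                           # pad odd-length input
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
        total = (total & 0xFFFF) + (total >> 16)  # fold the carry back in
    return ~total & 0xFFFF

# Swapping two 16-bit words corrupts the payload but not the checksum
assert internet_checksum(b"ABCD") == internet_checksum(b"CDAB")
```

This is why corrupted traffic from buggy NIC firmware can sail straight through to the application, and why application-level verification is worth having.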
Network shares?
Avoid SMB backup targets
Use internal, DAS or block storage
Corrupt network traffic?
Network traffic verification (6.5 and later)
Requires locally attached storage!
Storage-level corruption?
SureBackup® with full scan option (or script backup verifier)
Copy your backups, remember 3-2-1 rule!
Single/Double Bit Errors
Bad memory (RAM, cache)
Magic 10 bytes
Windows Server 2012 kernel bug
2ⁿ-sized chunks
Linux SLAB bug
64 KB chunks
RAID controller misbehaving
See “Silent Corruptions” CERN research by Peter Kelemen for more info
Simple bit flip in a byte
Usually persistent issue (caused by hardware)
Typically caused by bad memory
1>0 transitions are more frequent than 0>1*

00000000  33 33 33 33 33 33 33 33  33 33 33 33 33 33 33 33  |3333333333333333|
*
35285650  33 33 33 33 33 33 33 33  33 33 33 33 33 33 22 33  |33333333333333"3|
35285660  33 33 33 33 33 33 33 33  33 33 33 33 33 33 33 33  |3333333333333333|

0x33 = 00110011b
0x22 = 00100010b
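XORing the expected byte against the observed byte isolates exactly which bits flipped; a quick sketch for the 0x33 → 0x22 example above:

```python
expected, observed = 0x33, 0x22
diff = expected ^ observed               # set bits mark flipped positions
flipped = [bit for bit in range(8) if diff & (1 << bit)]
print(flipped)                           # [0, 4]
# both flipped bits were set in the expected byte: 1>0 transitions
assert all(expected & (1 << bit) for bit in flipped)
```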
Random 10 bytes get random content
Specific to Windows Server 2012 and above
Usually persistent for hours/days (until environment change)
Typically surfaces on heavy non-sequential I/O (copying large files)
Timing is critical (possible race condition): reducing system load or
installing debug tools makes it go away
Recent occurrence points at system cache corrupting the data before or
during flush
128-512 byte chunks of data
Usually transient
Sometimes contain identifiable user data
Typically observed in the vicinity of out-of-memory conditions
Specific to Linux backup repository
Possible corruption in SLAB (so it may impact other Unix-based OSes)
Multiple large chunks of 64 KB containing “old” data from previous cycles
(can be a few cycles old) or data from another location on disk:
Usually persistent; comes in bursts
Typically associated with I/O command timeouts
Size often matches RAID stripe size (64 KB is a common default)
RAID controller is the primary suspect
Accept and be prepared for corruptions!
Avoid detecting them at restore time:
Use file systems with built-in end-to-end checksumming, data
scrubbing and integrity checking (ZFS, ReFS)
Scrub your RAID arrays regularly
Re-read your tapes
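For storage without built-in scrubbing, the same idea can be approximated at the file level: record a checksum when the backup is written, then periodically re-read and compare. An illustrative sketch with an assumed manifest layout, not a Veeam format:

```python
import hashlib
import json
import os

def record_checksums(directory: str, manifest: str) -> None:
    """Hash every backup file once, right after the backup completes."""
    sums = {}
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if os.path.isfile(path) and path != manifest:
            with open(path, "rb") as f:
                sums[name] = hashlib.sha256(f.read()).hexdigest()
    with open(manifest, "w") as f:
        json.dump(sums, f)

def scrub(directory: str, manifest: str) -> list[str]:
    """Re-read every file and report any whose checksum no longer matches."""
    with open(manifest) as f:
        sums = json.load(f)
    bad = []
    for name, digest in sums.items():
        with open(os.path.join(directory, name), "rb") as f:
            if hashlib.sha256(f.read()).hexdigest() != digest:
                bad.append(name)
    return bad
```

Running the scrub on a schedule finds silent corruption while the original data still exists, instead of at restore time.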
Infamous LZ4/ZLIB/RLE decompression error
Restore job failed
Error: Client error: Zlib decompression error: [-3].
Windows Event Log: The device … has a bad block (Disk Event 7)
This is actually good news!
Indicates point corruption of the backup file (usually a single block)
File-level restore is still possible (unless MFT blocks are hit)
Full VM restore will fail, but is still possible with support tools
This is bad news:
All instances of storage metadata are corrupted. Failed
to load metastore: Failed to load metadata partition.
This backup file is FUBAR :(
Metadata store is redundant (two copies), but still gets corrupted.
We have no idea how blocks map to backed up files (or their order)
Copy your backups!
Extracting data from corrupted backups:
Manual process in the early days
VeeamRAR support tool (v7)
Storage Explorer (v8)
Storage Explorer NEW
Analyzes impacted VM files
Enables image-level restores from corrupt backups
Can fix invalid summary.xml
Includes support for encrypted backups and cloud repositories
SureBackup
Verifies BOTH backup file integrity AND recoverability
Fully automated (set it and forget it)
Available in Enterprise edition (and later)
Backup Validator
Verifies backup file integrity ONLY
Can be scripted to automate integrity checks
Backup files must be imported (will not work on a standalone file)
Available in all product editions
Data corruptions are facts of life. Get them before they get you!
Test your backup integrity regularly to find corruptions
…or find corruptions at restore time, whichever you prefer
And last, but not least:
Copy your backups!
And copy them to a different storage! Storage-based replication between
identical storage devices keeps your data in the same fault domain.
Gostev @ veeam.com (put “corruption” into the subject)
Thank you!