HBase at Xiaomi
Liang Xie / Honghua Feng
{xieliang, fenghonghua}@xiaomi.com
About Us
Honghua Feng
Liang Xie
Outline
Introduction
Latency practice
Some patches we contributed
Some ongoing patches
Q&A
About Xiaomi
Mobile internet company founded in 2010
Sold 18.7 million phones in 2013
Over $5 billion revenue in 2013
Sold 11 million phones in Q1, 2014
Hardware
Software
Internet Services
About Our HBase Team
Founded in October 2012
5 members
Liang Xie
Shaohui Liu
Jianwei Cui
Liangliang He
Honghua Feng
Resolved 130+ JIRAs so far
Our Clusters and Scenarios
15 Clusters : 9 online / 2 processing / 4 test
Scenarios
MiCloud
MiPush
MiTalk
Perf Counter
Our Latency Pain Points
Java GC
Stable page write in OS layer
Slow buffered IO (FS journal IO)
Read/Write IO contention
HBase GC Practice
Bucket cache with off-heap mode
Xmn/SurvivorRatio/MaxTenuringThreshold
PretenureSizeThreshold & repl src size
GC concurrent thread number
GC time per day :
[2500, 3000]s -> [300, 600]s !!!
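As a minimal sketch (assuming 0.96/0.98-era configuration keys; the values are illustrative only), the off-heap bucket cache named above is enabled roughly as follows, while the Xmn/SurvivorRatio/MaxTenuringThreshold/PretenureSizeThreshold and GC-thread flags go into hbase-env.sh:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class OffHeapBlockCacheConfig {
    public static Configuration create() {
        Configuration conf = HBaseConfiguration.create();
        // Serve data blocks from an off-heap bucket cache instead of the on-heap LRU cache.
        conf.set("hbase.bucketcache.ioengine", "offheap");
        // Off-heap cache size in MB (illustrative value); raise -XX:MaxDirectMemorySize
        // in hbase-env.sh accordingly.
        conf.setInt("hbase.bucketcache.size", 4096);
        // Keep the remaining on-heap block cache small; it now mostly holds index/bloom blocks.
        conf.setFloat("hfile.block.cache.size", 0.15f);
        return conf;
    }
}
```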
Write Latency Spikes
HBase client put
->HRegion.batchMutate
->HLog.sync
->SequenceFileLogWriter.sync
->DFSOutputStream.flushOrSync
->DFSOutputStream.waitForAckedSeqno <Stuck here often!>
===================================================
DataNode pipeline write, in BlockReceiver.receivePacket() :
->receiveNextPacket
->mirrorPacketTo(mirrorOut) //write packet to the mirror
->out.write/flush //write data to local disk. <- buffered IO
Added instrumentation (HDFS-6110) showed the stalled write() was the culprit; strace results confirmed it
Root Cause of Write Latency Spikes
write() is expected to return quickly
But it is sometimes blocked by stable page write-back!
Stable Page Write Issue Workaround
Workaround : downgrade or upgrade the kernel
2.6.32-279 (RHEL 6.3) -> 2.6.32-220 (RHEL 6.2)
or
2.6.32-279 (RHEL 6.3) -> 2.6.32-358 (RHEL 6.4)
Try to avoid deploying RHEL 6.3 / CentOS 6.3 in an extremely latency-sensitive HBase cluster!
Root Cause of Write Latency Spikes
...
0xffffffffa00dc09d : do_get_write_access+0x29d/0x520 [jbd2]
0xffffffffa00dc471 : jbd2_journal_get_write_access+0x31/0x50 [jbd2]
0xffffffffa011eb78 : __ext4_journal_get_write_access+0x38/0x80 [ext4]
0xffffffffa00fa253 : ext4_reserve_inode_write+0x73/0xa0 [ext4]
0xffffffffa00fa2cc : ext4_mark_inode_dirty+0x4c/0x1d0 [ext4]
0xffffffffa00fa6c4 : ext4_generic_write_end+0xe4/0xf0 [ext4]
0xffffffffa00fdf74 : ext4_writeback_write_end+0x74/0x160 [ext4]
0xffffffff81111474 : generic_file_buffered_write+0x174/0x2a0 [kernel]
0xffffffff81112d60 : __generic_file_aio_write+0x250/0x480 [kernel]
0xffffffff81112fff : generic_file_aio_write+0x6f/0xe0 [kernel]
0xffffffffa00f3de1 : ext4_file_write+0x61/0x1e0 [ext4]
0xffffffff811762da : do_sync_write+0xfa/0x140 [kernel]
0xffffffff811765d8 : vfs_write+0xb8/0x1a0 [kernel]
0xffffffff81176fe1 : sys_write+0x51/0x90 [kernel]
XFS on a recent kernel can relieve the journal IO blocking issue and is friendlier to
metadata-heavy workloads such as HBase on HDFS
Write Latency Spikes Testing
8 YCSB threads; 20 million rows written, each 3*200 bytes; 3 DataNodes; kernel 3.12.17
Counted stalled write() calls that took > 100 ms
The largest write() latency on ext4 : ~600 ms !
Hedged Read (HDFS-5776)
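HDFS-5776 adds client-side hedged reads; as a usage sketch (assuming Hadoop 2.4+, where the feature landed, and with illustrative values), the RegionServer's DFSClient turns them on through two configuration keys:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class HedgedReadConfig {
    public static Configuration create() {
        Configuration conf = HBaseConfiguration.create();
        // Thread pool for issuing the "hedge" requests; 0 (the default) disables the feature.
        conf.setInt("dfs.client.hedged.read.threadpool.size", 20);
        // If the first replica has not answered within this many milliseconds,
        // send the same read to another DataNode and take whichever replies first.
        conf.setLong("dfs.client.hedged.read.threshold.millis", 50);
        return conf;
    }
}
```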
Other Meaningful Latency Work
Long first “put” issue (HBASE-10010)
Token invalid (HDFS-5637)
Retry/timeout setting in DFSClient
Reduce write traffic? (HLog compression)
HDFS IO Priority (HADOOP-10410)
Wish List
Real-time HDFS, especially priority-related work
GC-friendly core data structures
More off-heap; Shenandoah GC
TCP/Disk IO characteristic analysis
Need more eyes on OS
Stay tuned…
Some Patches Xiaomi Contributed
New write thread model (HBASE-8755)
Reverse scan (HBASE-4811)
Per table/CF replication (HBASE-8751)
Block index key optimization (HBASE-7845)
1. New Write Thread Model
Old model :
256 WriteHandler threads append to a shared local buffer, and each WriteHandler also writes the buffered edits to HDFS and syncs them to HDFS
Problem : each WriteHandler does everything, causing severe lock contention!
New Write Thread Model
New model :
WriteHandler
…
WriteHandler
…
WriteHandler
256
Local Buffer
AsyncWriter : write to HDFS
1
AsyncSyncer
: sync to
HDFS
WriteHandler
WriteHandler:sync
:synctotoHDFS
HDFS
AsyncNotifier : notify writers
www.mi.com
4
1
22
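A much-simplified, hypothetical sketch of this pipeline (the real HBASE-8755 code batches by sequence number rather than per-edit latches; the class and queue names here are invented for illustration):

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;

class WalPipelineSketch {
    static class Edit {
        final byte[] payload;
        final CountDownLatch synced = new CountDownLatch(1);
        Edit(byte[] payload) { this.payload = payload; }
    }

    private final BlockingQueue<Edit> pending = new LinkedBlockingQueue<>();  // handlers -> AsyncWriter
    private final BlockingQueue<Edit> written = new LinkedBlockingQueue<>();  // AsyncWriter -> AsyncSyncer
    private final BlockingQueue<Edit> syncedQ = new LinkedBlockingQueue<>();  // AsyncSyncer -> AsyncNotifier

    /** Called by each of the ~256 write handler threads. */
    void append(byte[] payload) throws InterruptedException {
        Edit e = new Edit(payload);
        pending.put(e);       // no handler ever touches HDFS directly
        e.synced.await();     // block until AsyncNotifier reports the edit durable
    }

    /** Single AsyncWriter thread: drain edits and write them to the WAL stream. */
    Runnable asyncWriter(java.io.OutputStream wal) {
        return () -> {
            try {
                while (true) {
                    Edit e = pending.take();
                    wal.write(e.payload);     // stand-in for HLog append
                    written.put(e);
                }
            } catch (Exception ex) { Thread.currentThread().interrupt(); }  // stop on interrupt/IO error (sketch only)
        };
    }

    /** A small pool (4 on the slide) of AsyncSyncer threads: sync the stream to HDFS. */
    Runnable asyncSyncer(java.io.OutputStream wal) {
        return () -> {
            try {
                while (true) {
                    Edit e = written.take();
                    wal.flush();              // stand-in for HLog sync
                    syncedQ.put(e);
                }
            } catch (Exception ex) { Thread.currentThread().interrupt(); }
        };
    }

    /** Single AsyncNotifier thread: wake up the blocked write handlers. */
    Runnable asyncNotifier() {
        return () -> {
            try {
                while (true) syncedQ.take().synced.countDown();
            } catch (Exception ex) { Thread.currentThread().interrupt(); }
        };
    }
}
```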
New Write Thread Model
Low load : No improvement
Heavy load : Huge improvement (3.5x)
2. Reverse Scan
1. All scanners seek to ‘previous’ rows (SeekBefore)
2. Figure out next row : max ‘previous’ row
3. All scanners seek to first KV of next row (SeekTo)
(Diagram : sample KeyValues for Row1 through Row6 spread across several scanners, stepped through in reverse order.)
Performance : 70% of forward scan
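The three steps can be sketched as follows; "Scanner" here is a hypothetical, simplified interface (not the real KeyValueScanner API), used only to illustrate the SeekBefore / max / SeekTo sequence:

```java
import java.util.List;

interface Scanner {
    /** Position at the largest key strictly before 'key'; false if none. */
    boolean seekBefore(byte[] key);
    /** Position at the smallest key >= 'key'; false if none. */
    boolean seekTo(byte[] key);
    /** Row of the current key. */
    byte[] currentRow();
}

class ReverseScanStep {
    /** Returns the previous row to emit, or null when the scan is exhausted. */
    static byte[] previousRow(List<Scanner> scanners, byte[] currentRowFirstKey) {
        byte[] next = null;
        // 1. every scanner seeks to its last KV before the current row (SeekBefore)
        for (Scanner s : scanners) {
            if (!s.seekBefore(currentRowFirstKey)) continue;
            // 2. the next row to return is the maximum of those "previous" rows
            if (next == null || compare(s.currentRow(), next) > 0) next = s.currentRow();
        }
        if (next == null) return null;
        // 3. every scanner re-seeks to the first KV of that row (SeekTo);
        //    the row is then read forward as in a normal scan
        for (Scanner s : scanners) s.seekTo(next);
        return next;
    }

    static int compare(byte[] a, byte[] b) {
        for (int i = 0; i < Math.min(a.length, b.length); i++) {
            int d = (a[i] & 0xff) - (b[i] & 0xff);
            if (d != 0) return d;
        }
        return a.length - b.length;
    }
}
```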
3. Per Table/CF Replication
Source cluster : T1 (cfA, cfB) and T2 (cfX, cfY); PeerA is a full backup; PeerB only wants T2:cfX
PeerB creates T2 only : replication can't work!
PeerB creates T1 & T2 : all data gets replicated!
Need a way to specify which data to replicate!
Per Table/CF Replication
add_peer 'PeerA', 'PeerA_ZK'
add_peer 'PeerB', 'PeerB_ZK', 'T2:cfX'
(PeerA still receives everything; PeerB now receives only T2:cfX.)
4. Block Index Key Optimization
Before : 'Block 2' block index key = "ah, hello world/..." (the first key of Block 2)
Now : 'Block 2' block index key = "ac/...", a shortened key satisfying k1 < key <= k2,
where k1 = "ab" is the last key of Block 1 and k2 = "ah, hello world" is the first key of Block 2
Reduces block index size
Saves seeking the previous block when the search key falls in ['ac', 'ah, hello world']
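A minimal sketch of the shortening idea, assuming plain byte[] row keys rather than full HBase KeyValue keys (the real HBASE-7845 code works on the whole key structure and handles more edge cases):

```java
import java.util.Arrays;

public class ShortBlockIndexKey {
    /**
     * Returns a short "fake" key sep with k1 < sep <= k2, where k1 is the last key of the
     * previous block and k2 is the first key of the next block, e.g. ("ab", "ah, ...") -> "ac".
     */
    static byte[] shortSeparator(byte[] k1, byte[] k2) {
        int i = 0, min = Math.min(k1.length, k2.length);
        while (i < min && k1[i] == k2[i]) i++;                        // length of the shared prefix
        if (i < k1.length && i < k2.length && (k1[i] & 0xff) + 1 < (k2[i] & 0xff)) {
            byte[] sep = Arrays.copyOf(k1, i + 1);                    // room between the differing
            sep[i] = (byte) (k1[i] + 1);                              // bytes: bump k1's byte by one
            return sep;
        }
        // Otherwise use the shortest prefix of k2 that still sorts strictly after k1.
        return Arrays.copyOf(k2, Math.min(i + 1, k2.length));
    }
}
```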
Some Ongoing Patches
Cross-table cross-row transaction (HBASE-10999)
HLog compactor (HBASE-9873)
Adjusted delete semantic (HBASE-8721)
Coordinated compaction (HBASE-9528)
Quorum master (HBASE-10296)
1. Cross-Row Transaction : Themis
http://github.com/xiaomi/themis
Google Percolator : Large-scale Incremental Processing Using
Distributed Transactions and Notifications
Two-phase commit : strong cross-table/row consistency
Global timestamp server : global strictly incremental timestamp
No changes to HBase internals : built on the HBase client and coprocessors
Read : ~90% of raw HBase
Write : ~23% (a similar degradation to Google Percolator)
More details : HBASE-10999
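The protocol itself can be sketched as below; every name here is hypothetical (see the Themis repo and HBASE-10999 for the real client API). The point is the two timestamps from the global timestamp server and the primary-first two-phase commit:

```java
class TwoPhaseCommitSketch {
    interface TimestampOracle { long next(); }          // global, strictly increasing
    interface LockStore {
        boolean prewrite(byte[] row, byte[] value, long startTs);  // write data + lock column
        void commit(byte[] row, long startTs, long commitTs);      // replace lock with write record
        void rollback(byte[] row, long startTs);
    }

    static boolean commitTwoRows(TimestampOracle tso, LockStore store,
                                 byte[] primaryRow, byte[] primaryValue,
                                 byte[] secondaryRow, byte[] secondaryValue) {
        long startTs = tso.next();                      // start timestamp for the transaction
        // Phase 1 (prewrite): lock the primary first, then the secondary;
        // any conflicting lock or newer write aborts the transaction.
        if (!store.prewrite(primaryRow, primaryValue, startTs)) return false;
        if (!store.prewrite(secondaryRow, secondaryValue, startTs)) {
            store.rollback(primaryRow, startTs);
            return false;
        }
        long commitTs = tso.next();                     // Phase 2: commit timestamp
        store.commit(primaryRow, startTs, commitTs);    // commit point: the primary row
        store.commit(secondaryRow, startTs, commitTs);  // secondaries can be rolled forward lazily
        return true;
    }
}
```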
2. HLog Compactor
Problem : Region x gets few writes, but they scatter across many HLogs
(Diagram : HLogs 1-3 each hold a few edits of Region x alongside Regions 1 and 2; memstores flush to HFiles.)
PeriodicMemstoreFlusher : forcefully flushes old memstores
'flushCheckInterval' / 'flushPerChanges' : hard to configure
Results in 'tiny' HFiles
HBASE-10499 : a problematic region can't be flushed!
HLog Compactor
Compact : merge HLog 1,2,3,4 into a new HLog x containing only the still-needed edits (Region x)
Archive : HLog 1,2,3,4 can then be archived
(Diagram : Regions 1, 2 and x with their memstores and HFiles; only HLog x stays live.)
3. Adjusted Delete Semantic
Scenario 1
1. Write kvA at t0
2. Delete kvA at t0, flush to hfile
3. Write kvA at t0 again
4. Read kvA
Result : kvA can’t be read out
Scenario 2
1. Write kvA at t0
2. Delete kvA at t0, flush to hfile
3. Major compact
4. Write kvA at t0 again
5. Read kvA
Result : kvA can be read out
Fix : "a delete can't mask KVs with a larger MVCC (i.e. put later)"
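Scenario 1 above can be reproduced with the ordinary client API (a sketch against the 0.94/0.98-era API; table 't1' with family 'cf' is assumed to exist):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class DeleteMaskRepro {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "t1");
        HBaseAdmin admin = new HBaseAdmin(conf);
        byte[] row = Bytes.toBytes("r"), cf = Bytes.toBytes("cf"), q = Bytes.toBytes("q");
        long t0 = 1000L;

        Put put = new Put(row);
        put.add(cf, q, t0, Bytes.toBytes("v"));     // 1. write kvA at t0
        table.put(put);

        Delete del = new Delete(row);
        del.deleteColumn(cf, q, t0);                // 2. delete kvA at t0 ...
        table.delete(del);
        admin.flush("t1");                          //    ... and flush the delete marker to an HFile
                                                    //    (flush is asynchronous; a real test would wait)
        table.put(put);                             // 3. write kvA at t0 again

        Get get = new Get(row);
        get.setTimeStamp(t0);
        Result r = table.get(get);                  // 4. read kvA
        System.out.println("kvA visible: " + !r.isEmpty());  // false before the fix: the old
                                                             // delete marker masks the newer put
        table.close();
        admin.close();
    }
}
```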
4. Coordinated Compaction
(Diagram : several RegionServers compacting at once against shared HDFS, a compact storm!)
Compaction consumes a global resource (HDFS), yet whether to compact is decided locally by each RegionServer!
Coordinated Compaction
(Diagram : each RegionServer asks the Master "Can I ?" and gets OK or NO before compacting against shared HDFS.)
Compactions are scheduled by the Master, so compact storms no longer occur
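A hypothetical sketch of the coordination idea (none of these classes exist in HBase; HBASE-9528 defines the real mechanism): the Master owns a fixed number of cluster-wide compaction slots, and a RegionServer asks for one before compacting:

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

// Master-side view: a global budget of concurrent compactions for the whole cluster.
class MasterCompactionScheduler {
    private final Semaphore slots;

    MasterCompactionScheduler(int maxConcurrentCompactions) {
        this.slots = new Semaphore(maxConcurrentCompactions);
    }

    /** RegionServer RPC: "Can I compact?" Returns false (NO) if the cluster is saturated. */
    boolean requestSlot(long timeoutMs) throws InterruptedException {
        return slots.tryAcquire(timeoutMs, TimeUnit.MILLISECONDS);
    }

    /** RegionServer RPC: compaction finished, hand the slot back. */
    void releaseSlot() {
        slots.release();
    }
}

// RegionServer-side view: only touch HDFS once the Master has said OK.
class CompactingRegionServer {
    void maybeCompact(MasterCompactionScheduler master) throws InterruptedException {
        if (!master.requestSlot(1000)) {
            return;                      // NO: retry later, no storm on shared HDFS
        }
        try {
            // ... run the compaction against HDFS ...
        } finally {
            master.releaseSlot();        // always return the slot
        }
    }
}
```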
5. Quorum Master
(Diagram : an active Master and a standby Master, both reading region info/states from a 3-node ZooKeeper ensemble, with the RegionServers alongside.)
While the active master serves, the standby master stays 'really' idle
When the standby master becomes active, it needs to rebuild its in-memory state
Quorum Master
(Diagram : three Masters forming a quorum among themselves, with the RegionServers talking directly to them.)
Better master failover perf : no phase to rebuild in-memory state
Better restart perf for big clusters (10K+ regions)
No external (ZooKeeper) dependency
No potential consistency issue
Simpler deployment
Acknowledgement
Hangjun Ye, Zesheng Wu, Peng Zhang
Xing Yong, Hao Huang, Hailei Li
Shaohui Liu, Jianwei Cui, Liangliang He
Dihao Chen
Thank You!
[email protected]
[email protected]