Chapter 11: Indexing and Hashing
Database System Concepts, 6th Ed.
©Silberschatz, Korth and Sudarshan
See www.db-book.com for conditions on re-use
Basic Concepts
Indexing mechanisms are used to speed up access to data.
Search Key - attribute to set of attributes used to look up
records in a file.
An index file consists of records (called index entries) of the
form
search-key
pointer
Indexes are typically much smaller than the original, e.g.:
Table of contents in a book
Index in a book
Catalog in a library
Inventory in a warehouse
Database System Concepts - 6th Edition
11.2
©Silberschatz, Korth and Sudarshan
Basic Concepts
search-key
pointer
Two basic kinds of indices:
Ordered indices: search keys are stored in sorted
order
Hash indices: search keys are distributed
uniformly across “buckets” using a “hash function”.
Database System Concepts - 6th Edition
11.3
©Silberschatz, Korth and Sudarshan
Index Evaluation Metrics
Access types supported efficiently, e.g.:
records with a specified value in the attribute
or records with an attribute value falling in a
specified range of values.
Access time
Insertion time
Deletion time
Space overhead
Database System Concepts - 6th Edition
11.4
©Silberschatz, Korth and Sudarshan
11.2 Ordered Indices
To use an ordered index, we need an index-sequential
file = file that is stored sequentially, according to some
attribute:
This attribute (PK?) will be used as primary index
Database System Concepts - 6th Edition
11.5
©Silberschatz, Korth and Sudarshan
11.2 Ordered Indices
In an ordered index, index entries are stored sorted on the
search key value
E.g. author catalog in library.
Primary index: in a sequentially ordered file, the index
whose search key specifies the sequential order of the
file.
Also called clustering index
The search key of a primary index is usually (but not
necessarily) the PK.
Secondary index: an index whose search key specifies
an order different from the sequential order of the file.
Also called non-clustering index.
Database System Concepts - 6th Edition
11.6
©Silberschatz, Korth and Sudarshan
Ordered index: Dense
Dense index — There is an index record for every search-key value
E.g.: Dense and primary index on ID attribute (PK) of instructor
table
Database System Concepts - 6th Edition
11.7
©Silberschatz, Korth and Sudarshan
Ordered index: Dense
E.g.: Dense and primary index on dept_name, with
instructor file sorted on dept_name (not PK)
Database System Concepts - 6th Edition
11.8
©Silberschatz, Korth and Sudarshan
Ordered index: Sparse
Sparse Index: contains index records for only some of the search-key values.
It is applicable only as primary index, i.e. when the records are physically
ordered on the search-key
Database System Concepts - 6th Edition
11.9
©Silberschatz, Korth and Sudarshan
Ordered index: Sparse
Algorithm for locating a record with search-key value K:
Find index record with largest search-key value < K
Search the file sequentially starting at the record to which the
index record points
Database System Concepts - 6th Edition
11.10
©Silberschatz, Korth and Sudarshan
Ordered index: Sparse
Compared to dense indices, a sparse index:
Requires less space and less maintenance overhead for insertions and
deletions.
It is slower for locating records, b/c it requires multiple accesses to the file.
Good tradeoff: sparse index with an index entry for every block in file,
corresponding to least search-key value in the block.
Database System Concepts - 6th Edition
11.11
©Silberschatz, Korth and Sudarshan
QUIZ
What are the two main type of indices used in DBMS?
Database System Concepts - 6th Edition
11.12
©Silberschatz, Korth and Sudarshan
QUIZ
What are the two main type of indices used in DBMS?
Ordered index
Hash index
Database System Concepts - 6th Edition
11.13
©Silberschatz, Korth and Sudarshan
QUIZ
What is the difference between a primary index and a
secondary index?
Database System Concepts - 6th Edition
11.14
©Silberschatz, Korth and Sudarshan
QUIZ
What is the difference between a primary index and a
secondary index?
The key of the primary (a.k.a. clustered) index are the
ones on which the physical file is sorted.
The key of the secondary (a.k.a. non-clustered) index
is not the one on which the physical file is sorted.
Database System Concepts - 6th Edition
11.15
©Silberschatz, Korth and Sudarshan
QUIZ
What is the difference between a dense index and a
sparse index?
Database System Concepts - 6th Edition
11.16
©Silberschatz, Korth and Sudarshan
QUIZ
Which index is dense and which is sparse?
Database System Concepts - 6th Edition
11.17
©Silberschatz, Korth and Sudarshan
QUIZ
Can there be secondary sparse indices?
Give an example of explain why it’s not possible.
Database System Concepts - 6th Edition
11.18
©Silberschatz, Korth and Sudarshan
Secondary Indices
Remember:
A secondary index is …
Secondary indices have to be dense! (There are no
secondary sparse indices – Why?)
Why do we need them?
Database System Concepts - 6th Edition
11.19
©Silberschatz, Korth and Sudarshan
Secondary Index Example
Secondary index on salary field of instructor
Index record points to “bucket” that contains pointers to all the
physical records with that particular search-key value.
Database System Concepts - 6th Edition
11.20
©Silberschatz, Korth and Sudarshan
QUIZ: Classify these indices as
primary/secondary and dense/sparse
?
Database System Concepts - 6th Edition
11.21
©Silberschatz, Korth and Sudarshan
QUIZ: Classify these indices as
primary/secondary and dense/sparse
Images source: http://webserver.ignou.ac.in/virtualcampus/BIT_StudyMaterials/tri8/cst203-bl1-u3.htm
Database System Concepts - 6th Edition
11.22
©Silberschatz, Korth and Sudarshan
QUIZ: What is the difference between these
secondary indices?
Database System Concepts - 6th Edition
11.23
©Silberschatz, Korth and Sudarshan
A: In the one on the right, the pointers are “approximate” –
they point to the correct block, but not necessarily the
correct record within the block
Database System Concepts - 6th Edition
11.24
©Silberschatz, Korth and Sudarshan
Performance of ordered indices
Pro: substantial speed increase when searching for
records.
Con: when a table in the DB is modified, every index on
that file(s) storing the table must be updated.
Pro: Sequential scan using primary index is efficient
Con: sequential scan using a secondary index is
expensive
Each record access may fetch a new block from disk
Block fetch requires about 5 to 10 ms (HDD), versus
≈ 10-50 ns for main memory access (RAM)
Database System Concepts - 6th Edition
11.25
©Silberschatz, Korth and Sudarshan
Hierarchical/Multilevel Index
For large table, primary index may not fit in main memory
→ store on disk → access becomes expensive.
Solution: treat primary index kept on disk as a
sequential file and construct a sparse index on it.
outer index – a sparse index of primary index
inner index – the primary index file
Database System Concepts - 6th Edition
11.26
©Silberschatz, Korth and Sudarshan
Two-level index
Database System Concepts - 6th Edition
11.27
©Silberschatz, Korth and Sudarshan
Hierarchical/Multilevel Index
If even the outer index is too large to fit in main
memory, yet another level of indexing can be created
(three-level), and so on.
Overhead: Indices at all levels must be updated on
insertion or deletion from the file!
Database System Concepts - 6th Edition
11.28
©Silberschatz, Korth and Sudarshan
Index Update: Deletion
If deleted record was the
only record in the file with its
particular search-key value,
the search-key is deleted
from the index also.
Single-level index entry deletion:
Dense indices – deletion of search-key is similar to file record
deletion.
Sparse indices –
if an entry for the search key exists in the index, it is
deleted by replacing the entry in the index with the next
search-key value in the file (in search-key order).
If
the next search-key value already has an index entry, the
entry is deleted instead of being replaced.
Database System Concepts - 6th Edition
11.29
©Silberschatz, Korth and Sudarshan
Index Update: Insertion
Single-level index insertion:
Perform a lookup using the search-key value appearing in
the record to be inserted.
Dense indices – if the search-key value does not appear in
the index, insert it.
Sparse indices – if index stores an entry for each block of
the file, no change needs to be made to the index unless a
new block is created.
If
a new block is created, the first search-key value
appearing in the new block is inserted into the index.
Multilevel insertion and deletion: algorithms are simple
extensions of the single-level algorithms
Database System Concepts - 6th Edition
11.30
©Silberschatz, Korth and Sudarshan
Disadvantage of indexed-sequential files
Performance of all operations degrades as file grows.
Periodic reorganization of entire file is required
E.g. insertion of a tuple requires either:
Overflow block, or
Shifting of all records following it
Database System Concepts - 6th Edition
11.31
©Silberschatz, Korth and Sudarshan
11.3 B+-Tree Index Files
B+-tree indices are an alternative to indexed-sequential files.
B+-tree is a type of balanced tree.
a.k.a. “skinny” tree
Database System Concepts - 6th Edition
11.32
©Silberschatz, Korth and Sudarshan
11.3 B+-Tree Index Files
Why do we need balanced trees?
Most operations in a tree (search, insert, delete)
are “Big-Oh” of the height
Database System Concepts - 6th Edition
11.33
©Silberschatz, Korth and Sudarshan
B+-Tree
At most n – 1 keys and n pointers in each node. (n = 4 here)
Nodes may be only
partially filled
All leaves must
be on the same
level
To avoid “skinny” trees, all nodes except the root and
leaves must have at least n/2 keys. We say that the
“fill factor” is at least 50%.
Ceiling function
Image
source:
http://en.wikipedia.org/wiki/B%2B_tree
Database System
Concepts
- 6th Edition
11.34
©Silberschatz, Korth and Sudarshan
There are 3 types of pointers in a B+-Tree
• Those inside the tree point to other nodes of the tree.
• Horizontal from one
leaf to the next
• Those in the leaves point to the actual tuple of data
on the disk.
Image
source:
http://en.wikipedia.org/wiki/B%2B_tree
Database System
Concepts
- 6th Edition
11.35
©Silberschatz, Korth and Sudarshan
B+-Tree Node Structure
Logically, the keys are “in between” the pointers
The search-keys in a node are ordered
K1 < K2 < K3 < . . . < Kn–1
(Initially assume no duplicate keys, address duplicates later)
All keys x in the subtree pointed to by Pi lie in between the two
adjacent keys:
Ki-1 ≤ x < Ki
Sometimes we write informally:
Ki-1 ≤ Pi < Ki
Database System Concepts - 6th Edition
11.36
©Silberschatz, Korth and Sudarshan
The B+-tree is a multi-level, index for the leaf nodes!
There’s a mistake in Fig.11.9 of our text.
Database System Concepts - 6th Edition
11.37
©Silberschatz, Korth and Sudarshan
Completing the definition of B+-tree
Each internal node (i.e. not root or leaf) has
at least n/2 pointers (children).
A leaf node has at least (n–1)/2 keys.
In many applications, the
fill factor can be adjusted,
even with separate
values for leaves and
internal nodes!
Special cases:
If the root is not a leaf, it has at least 2
children.
If the root is a leaf (i.e. there are no other
nodes in the tree!), it can have between
0 and (n–1) values.
Database System Concepts - 6th Edition
11.38
©Silberschatz, Korth and Sudarshan
Leaf Nodes in B+-Trees
Properties of a leaf node:
For i = 1, 2, . . ., n–1, pointer Pi points to a file record with
search-key value Ki,
If Li, Lj are leaf nodes and i < j, Li’s search-key values are less
than or equal to Lj’s search-key values
Pn points to next leaf node in search-key order
Database System Concepts - 6th Edition
11.39
©Silberschatz, Korth and Sudarshan
The previous slides contain all the
formal requirements that a B+ tree
must fulfill
To do for next time/review:
List on one page all the requirements.
Include the page in your cheat-sheet.
Database System Concepts - 6th Edition
11.40
©Silberschatz, Korth and Sudarshan
Searching in B+-Trees
Problem: Let’s consider a key Ki in an internal node of the
tree; in which leaf node is Ki to be found?
Q: Is it certain that Ki will be found in a leaf?
Database System Concepts - 6th Edition
11.41
©Silberschatz, Korth and Sudarshan
Searching in B+-Trees
Problem: Let’s consider a key Ki in an internal node of the
tree; in which leaf node is Ki to be found?
Q: Is it certain that Ki will be found in a leaf?
A: Yes, b/c the non-leaf nodes are an index to the
leaves!
Database System Concepts - 6th Edition
11.42
©Silberschatz, Korth and Sudarshan
Searching in B+-Trees
Problem: Let’s consider a key Ki in an internal node of
the tree; in which leaf node is Ki to be found?
Remember the ordering property:
Two cases are possible:
If Ki is just above the leaf level, then the leaf
cannot start with a smaller key!
Database System Concepts - 6th Edition
11.43
©Silberschatz, Korth and Sudarshan
Searching in B+-Trees
Problem: Let’s consider a key Ki in an internal node of
the tree; in which leaf node is Ki to be found?
If Ki is not just above the leaf level, the first key in
the right subtree must be > Ki (why?), so from this
point on we can take only left pointers to reach Ki .
Conclusion: In either case, Ki is
the minimal key in the right
subtree!
Image
http://www.cburch.com/cs/340/reading/btree/index.html
Databasesource:
System Concepts
- 6th Edition
11.44
©Silberschatz, Korth and Sudarshan
B+ tree example
Find:
The value of n
The fill limit for the leaf nodes (# of keys)
The fill limit for internal nodes (# of
pointers/children)
The fill limit for the root
Is this a valid B+ tree?
Database System Concepts - 6th Edition
11.45
©Silberschatz, Korth and Sudarshan
B+ tree example
n = 6 (max # of pointers/children)
Leaf nodes must have between at least 3 keys, acc. to
fmla. (n–1)/2.
Internal nodes must have between at least 3 children,
acc. to fmla. (n/2.
Root must have at least 2 children, since it’s not the only
node in the tree.
Database System Concepts - 6th Edition
11.46
©Silberschatz, Korth and Sudarshan
Another B+ tree example
Find:
The value of n
The fill limit for the leaf nodes (# of keys)
The fill limit for internal nodes (# of
pointers/children)
The fill limit for the root
Is this a valid B+ tree?
Database System Concepts - 6th Edition
11.47
©Silberschatz, Korth and Sudarshan
Design problem
We want each node of a B+ tree to fit as closely as possible in a
block on the disk.
The block size is 4 KB, and the total capacity of the disk is 2 TB.
The keys are integers, represented on 4 Bytes.
The pointers are serial numbers unique to each disk block.
Hint: How many such serial numbers are there?
Find the optimal value of n for this system.
Database System Concepts - 6th Edition
11.48
©Silberschatz, Korth and Sudarshan
Remarks on B+-trees
Since the inter-node connections are done by pointers, logically close
blocks need not be physically close.
The non-leaf levels of the B+-tree form a hierarchy of sparse indices.
The B+-tree contains a relatively small number of levels
Level
Next
..
below root has at least 2* n/2 values
level has at least 2* n/2 * n/2 values
etc.
If there are K search-key values in the file, the tree height is no
more than logn/2(K)
Thus searches can be conducted efficiently.
Insertions and deletions to the main file can be handled efficiently, as
the index can be restructured in logarithmic time (next slides).
Database System Concepts - 6th Edition
11.49
©Silberschatz, Korth and Sudarshan
Queries on B+-Trees
To find record with search-key value V:
1. C=root
2. While C is not a leaf node
{
1. Let Ki be the smallest key in C such that V Ki.
2. If no such exists
set C = last non-null pointer in C
3. Else if (V = Ki ) set C = Pi +1
4. Else
set C = Pi
}
//end while
3. Let Ki be the smallest key in C such that Ki = V
1. If there is such a value i, follow pointer Pi to desired tuple on disk
2. Else
no record with search-key value V exists.
Database System Concepts - 6th Edition
11.50
©Silberschatz, Korth and Sudarshan
QUIZ: Search
for key 55 in
this B+ tree
Database System Concepts - 6th Edition
11.51
©Silberschatz, Korth and Sudarshan
QUIZ: Search
for key 85 in
this B+ tree
Database System Concepts - 6th Edition
11.52
©Silberschatz, Korth and Sudarshan
SKIP Handling Duplicates
Database System Concepts - 6th Edition
11.53
©Silberschatz, Korth and Sudarshan
Efficiency of Queries on B+-Trees
If there are K search-key values in the file, the height of the tree
is no more than logn/2(K).
A node is generally the same size as a disk block, typically 4 KB
and n is typically around 100 (40 bytes per index entry).
With 1 million search key values and n = 100
at most log50(1,000,000) = 4 nodes are accessed in a lookup.
Contrast this with a balanced binary tree with 1 million search key
values — around 20 nodes are accessed in a lookup
above difference is significant since every node access may
need a disk I/O, costing ~ 10 milliseconds
Database System Concepts - 6th Edition
11.54
©Silberschatz, Korth and Sudarshan
Your turn!
If there are K search-key values in the file,
h ≤ logn/2(K).
A node is generally the same size as a disk block, typically 4 KB
Assume:
10 million search key values
Size of one tuple on disk is 100 Bytes
n = 80
Accessing each block on the HDD requires 10 ms
Find and compare the number of nodes accessed with:
B+ tree index
Binary tree index
Sequential search of a non-indexed file
Database System Concepts - 6th Edition
11.55
©Silberschatz, Korth and Sudarshan
Updates on B+-Trees: Insertion
1. Find the leaf node in which the search-key value would appear
2. If the search-key value is already present in the leaf node
1.
Add record to the file (duplicate!)
2.
If necessary add a pointer to the bucket.
3. If the search-key value is not present, then
1.
add the record to the main file (and create a bucket if
necessary)
2.
If there is room in the leaf node, insert (key-value, pointer)
pair in the leaf node
3.
Otherwise, split the node (along with the new (key-value,
pointer) entry) as discussed in the next slide.
Database System Concepts - 6th Edition
11.56
©Silberschatz, Korth and Sudarshan
Insertion may require splitting node(s)
Splitting a leaf node:
take the n (search-key value, pointer) pairs (including the one
being inserted) in sorted order. Place the first n/2 in the
original node, and the rest in a new node.
let the new node be p, and let k be the least key value in p.
Insert (k,p) in the parent of the node being split.
If the parent is full, split it and propagate the split further up.
Splitting of nodes proceeds upwards till a node that is not full is
found.
In the worst case the root node may be split increasing the
height of the tree by 1.
EXAMPLES
Database System Concepts - 6th Edition
11.57
©Silberschatz, Korth and Sudarshan
B+-Tree insertion example 1
Split node containing Brandt, Califieri and Crick.
Next step: insert entry with (Califieri,pointer-to-new-node) into parent
B+-Tree before and after insertion of “Adams”
Database System Concepts - 6th Edition
11.58
©Silberschatz, Korth and Sudarshan
B+-Tree insertion example 2
B+-Tree before and after insertion of “Lamport”
Database System Concepts - 6th Edition
11.59
©Silberschatz, Korth and Sudarshan
Insertion in B+-Trees (Cont.)
Splitting a non-leaf node: when inserting (k,p) into an already full
internal node N
Copy N to an in-memory area M with space for n+1 pointers and n
keys
Insert (k,p) into M
Copy P1,K1, …, K n/2-1,P n/2 from M back into node N
Copy Pn/2+1,K n/2+1,…,Kn,Pn+1 from M into newly allocated node
N’
Insert (K n/2,N’) into parent N
Read full pseudocode: p.494 of text!
Califieri
Adams Brandt Califieri Crick
Database System Concepts - 6th Edition
Adams Brandt
11.60
Crick
©Silberschatz, Korth and Sudarshan
QUIZ: Insert keys 21, 22, 23 in this B+ tree
Database System Concepts - 6th Edition
11.61
©Silberschatz, Korth and Sudarshan
SKIP Deletion
Database System Concepts - 6th Edition
11.62
©Silberschatz, Korth and Sudarshan
Ways to handle non-unique search keys
Buckets on separate block (bad idea)
List of tuple pointers with each key
Extra code to handle long lists
Deletion of a tuple can be expensive if there are many
duplicates on search key (why?)
Low space overhead, no extra cost for queries
Make search key unique by adding a unique record-identifier,
a.k.a. an uniquifier attribute
Extra storage overhead for keys
Simpler code for insertion/deletion
Widely used
Database System Concepts - 6th Edition
11.63
©Silberschatz, Korth and Sudarshan
Ways to handle non-unique search keys
Buckets on separate block (bad idea)
List of tuple pointers with each key
Extra code to handle long lists
Deletion of a tuple can be expensive if there are many
duplicates on search key (why?)
Low space overhead, no extra cost for queries
Make search key unique by adding a unique record-identifier,
a.k.a. an uniquifier attribute
Extra storage overhead for keys
Simpler code for insertion/deletion
What support does
PostgreSQL offer for
uniquifiers?
Widely used
Database System Concepts - 6th Edition
11.64
©Silberschatz, Korth and Sudarshan
See PostgreSQL lab: SERIAL data type!
CREATE SEQUENCE test_seq
nextval('test_seq');
-- Increments the value of the
specified sequence and returns the new value (as an integer)
currval('test_seq');
-- Returns the value returned by
the most recent nextval(). If nextval() hasn't been used yet, an
error is returned (see 2nd cmd. below).
setval('test_seq', n);
-- Sets the current value of the
sequence to n.
DROP SEQUENCE test_seq
Database System Concepts - 6th Edition
11.65
©Silberschatz, Korth and Sudarshan
Conclusion on B+-Tree Index Files
Advantage of B+-tree index files:
automatically reorganizes itself with small, local,
changes, in the face of insertions and deletions.
Reorganization of entire file is not required to
maintain performance.
(Minor) disadvantage of B+-trees:
extra insertion and deletion overhead, space
overhead.
Advantages of B+-trees outweigh disadvantages
B+-trees are used extensively
Database System Concepts - 6th Edition
11.66
©Silberschatz, Korth and Sudarshan
11.4 B+-Tree File Organization
Idea: Use the B+-Tree structure not only for organizing the
index, but for organizing the file itself
The leaf nodes in a B+-tree file organization store records,
instead of pointers
Database System Concepts - 6th Edition
11.67
©Silberschatz, Korth and Sudarshan
11.4 B+-Tree File Organization
Leaf nodes are still required to be half full
Since records are larger than pointers, the maximum number of
records that can be stored in a leaf node is less than the number
of pointers in a nonleaf node.
Insertion and deletion are handled in the same way as in a B+-tree
index.
Database System Concepts - 6th Edition
11.68
©Silberschatz, Korth and Sudarshan
B+-Tree File Organization
Good space utilization is important since records use more space
than pointers.
To improve space utilization, involve sibling nodes in redistribution
during splits (for insertion) (and merges, for deletion)
If node full, first attempt to redistribute to adjacent siblings
If both adjacent siblings are full, split node and redistribute to the
two new nodes and one of the adjacent siblings!
Result: in each node having at least
Database System Concepts - 6th Edition
11.69
2n / 3
entries
©Silberschatz, Korth and Sudarshan
SKIP pp.502-509 in text
EoL3
Database System Concepts - 6th Edition
11.70
©Silberschatz, Korth and Sudarshan
QUIZ B+ trees
Is this a valid B+ tree?
Insert 42.
Database System Concepts - 6th Edition
11.71
©Silberschatz, Korth and Sudarshan
11.6 Hashing
Basic ideas:
Tree indices are efficient (height ≤ logn/2(N)) and
mathematically well-understood, but …
When querying for individual tuples we have too much
precision: although we need only one tuple, reading from
the HDD is still done in entire blocks (~4KB), each of
which usually stores multiple tuples.
The parable of the two storage cabinets …
Database System Concepts - 6th Edition
11.72
©Silberschatz, Korth and Sudarshan
One little drawer for each item
Image source:
www.quantum-bits.com
Database
System Concepts
- 6th Edition
Larger drawers, with multiple items in each
11.73
©Silberschatz, Korth and Sudarshan
Static Hashing
Bucket = unit of storage containing one or more
records (typically a disk block).
Hash file organization: we obtain the bucket of a record
directly from its search-key value using a hash
function.
Hash function h is a function from the set of all search-
key values K to the set of all bucket addresses B.
Hash function is used to locate records for query,
insertion and deletion.
Records w/different search-key values will be mapped
to same bucket → entire bucket is searched
sequentially to locate a record.
Database System Concepts - 6th Edition
11.74
©Silberschatz, Korth and Sudarshan
Example of Hash File Organization
File stores instructor table, using dept_name as key
(See figure in next slide.)
There are 8 buckets
The binary representation of the ith character is assumed to be
the integer i.
The hash function returns the sum of the binary representations
of the characters modulo 8
E.g. h(Music) = 1
h(Physics) = 3
Database System Concepts - 6th Edition
h(History) = 2
h(Elec. Eng.) = 3
11.75
©Silberschatz, Korth and Sudarshan
Database System Concepts - 6th Edition
11.76
©Silberschatz, Korth and Sudarshan
EXAMPLE: Hash functions in Python
The hash function returns the sum of the binary representations of
the characters modulo 8
E.g. h(Music) = 1
h(Physics) = 3
h(History) = 2
h(Elec. Eng.) = 3
Write in Python the function h(string) described
Hints:
The Python library function ord() returns the ASCII
code of its character argument
The modulus is implemented as the operator %
Database System Concepts - 6th Edition
11.77
©Silberschatz, Korth and Sudarshan
Database System Concepts - 6th Edition
11.78
©Silberschatz, Korth and Sudarshan
Hash Functions
Worst case: hash function maps all search-key values to same
bucket; this makes access time proportional to the number of
search-key values in the file.
Ideal hash function is uniform → each bucket is assigned the
same average number of search-key values from the set of all
possible values.
Ideal hash function is random, so each bucket will have the
same number of records assigned to it irrespective of the actual
distribution of search-key values in the file.
To achieve (pseudo)randomness, typical hash functions
perform computations on the internal binary representation
of the search-key, as in the prev. example.
Database System Concepts - 6th Edition
11.79
©Silberschatz, Korth and Sudarshan
What to do when a bucket becomes full?
Database System Concepts - 6th Edition
11.80
©Silberschatz, Korth and Sudarshan
Handling of Bucket Overflows
Bucket overflow can occur because of
Insufficient buckets
Skew in distribution of records → two reasons:
multiple records have same search-key value
chosen hash function produces non-uniform
distribution of key values
Although the probability of bucket overflow can be reduced,
it cannot be eliminated; it is handled by using overflow
buckets.
Database System Concepts - 6th Edition
11.81
©Silberschatz, Korth and Sudarshan
Overflow chaining – the overflow buckets of a given
bucket are chained together in a linked list.
This is called closed hashing.
An alternative exists, called open hashing, which does not
use overflow buckets. Not suitable for DB apps.!
E.g. linear
probing
Database System Concepts - 6th Edition
11.82
©Silberschatz, Korth and Sudarshan
Hash Indices
Hashing can be used not only for file organization, but
also for index-structure creation.
A hash index organizes the search keys, with their
associated record pointers, into a hash file structure.
Strictly speaking, hash indices are always secondary
indices
if the file itself is organized using hashing, a separate
primary hash index on it using the same search-key
is unnecessary.
However, we use the term hash index to refer to
both secondary index structures and hash organized
files.
Database System Concepts - 6th Edition
11.83
©Silberschatz, Korth and Sudarshan
Example of Hash Index
Hash index on instructor.ID
Database System Concepts - 6th Edition
11.84
©Silberschatz, Korth and Sudarshan
Deficiencies of Static Hashing
Function h maps search-key values to a fixed set of B of bucket
addresses. Databases grow or shrink with time.
If initial number of buckets is too small, and file grows,
performance will degrade due to too much overflows.
If space is allocated for anticipated growth, a significant amount of
space will be wasted initially (and buckets will be underfull).
If database shrinks, again space will be wasted.
One solution: periodic re-organization of the file with a new hash
function
Expensive, disrupts normal operations
Better solution: allow the number of buckets to be modified
dynamically.
Database System Concepts - 6th Edition
11.85
©Silberschatz, Korth and Sudarshan
11.7 Dynamic Hashing
Good for DBs that grow and shrink in size
Allows the hash function to be modified dynamically
Extendable hashing – one form of dynamic hashing
Hash function generates values over a large range — typically b-bit
integers, with b = 32.
At any time use only a prefix of the hash function to index into a table
of bucket addresses.
Let the length of the prefix be i bits, 0 i 32.
Bucket address table size = 2i. Initially i = 0
Value of i grows and shrinks as the table grows and shrinks.
Multiple entries in the bucket address table may point to the same
bucket (why?)
Thus, actual # of buckets is < 2i
The # of buckets also changes dynamically due to merging and
splitting of buckets.
Database System Concepts - 6th Edition
11.86
©Silberschatz, Korth and Sudarshan
Extendable Hash Structure
In this structure, i2 = i3 = i, whereas i1 = i – 1 (see next
slide for details)
Database System Concepts - 6th Edition
11.87
©Silberschatz, Korth and Sudarshan
Use of Extendable Hash Structure
Each bucket j stores a value ij
All the entries that point to the same bucket have the same values
on the first ij bits.
To locate the bucket containing search-key Kj:
1. Compute h(Kj) = X
2. Use the first i high order bits of X as a displacement into bucket
address table, and follow the pointer to appropriate bucket
To insert a record with search-key value Kj
follow same procedure as look-up and locate the bucket, say j.
If there is room in the bucket j insert record in the bucket.
Else the bucket must be split and insertion re-attempted (next
slide.)
Overflow
Database System Concepts - 6th Edition
buckets used instead in some cases
11.88
©Silberschatz, Korth and Sudarshan
Insertion in Extendable Hash Structure
To split a bucket j when inserting record with search-key value Kj:
If i > ij (more than one pointer to bucket j)
allocate a new bucket z, and set ij = iz = (ij + 1)
Update the second half of the bucket address table entries originally
pointing to j, to point to z
remove each record in bucket j and reinsert (in j or z)
recompute new bucket for Kj and insert record in the bucket (further
splitting is required if the bucket is still full)
If i = ij (only one pointer to bucket j)
If i reaches some limit b, or too many splits have happened in this
insertion, create an overflow bucket
Else
increment i and double the size of the bucket address table.
replace each entry in the table by two entries that point to the
same bucket.
recompute new bucket address table entry for Kj
Now i > ij so use the first case above.
Database System Concepts - 6th Edition
11.89
©Silberschatz, Korth and Sudarshan
Use of Extendable Hash Structure: Example
Database System Concepts - 6th Edition
11.90
©Silberschatz, Korth and Sudarshan
Example (Cont.)
Initial Hash structure; bucket size = 2
Database System Concepts - 6th Edition
11.91
©Silberschatz, Korth and Sudarshan
Example (Cont.)
Hash structure after insertion of “Mozart”, “Srinivasan”,
and “Wu” records
Database System Concepts - 6th Edition
11.92
©Silberschatz, Korth and Sudarshan
Example (Cont.)
Hash structure after insertion of Einstein record
Database System Concepts - 6th Edition
11.93
©Silberschatz, Korth and Sudarshan
Example (Cont.)
Hash structure after insertion of Gold and El Said records
Database System Concepts - 6th Edition
11.94
©Silberschatz, Korth and Sudarshan
Your turn!
Insert (Katz, Comp.Sci.) record.
Database System Concepts - 6th Edition
11.95
©Silberschatz, Korth and Sudarshan
Example (Cont.)
Hash structure after insertion of Katz record
Database System Concepts - 6th Edition
11.96
©Silberschatz, Korth and Sudarshan
Example (Cont.)
And after insertion of
eleven records
Database System Concepts - 6th Edition
11.97
©Silberschatz, Korth and Sudarshan
Example (Cont.)
And after insertion of
Kim record in previous
hash structure
Database System Concepts - 6th Edition
11.98
©Silberschatz, Korth and Sudarshan
SKIP: Deletion in Extendable Hash
Structure
Database System Concepts - 6th Edition
11.99
©Silberschatz, Korth and Sudarshan
Hashing pros and cons
Benefits of extendable hashing:
Hash performance does not degrade with growth of file
Minimal space overhead
Disadvantages of extendable hashing
Extra level of indirection to find desired record
Bucket address table may itself become very big (larger than
memory)
Cannot allocate very large contiguous areas on disk either
Solution: B+-tree structure to locate desired record in bucket
address table
Changing size of bucket address table is an expensive operation
Linear hashing is an alternative mechanism
Allows incremental growth of its directory (equivalent to bucket
address table) …
… at the cost of more bucket overflows.
Database System Concepts - 6th Edition
11.100
©Silberschatz, Korth and Sudarshan
SKIP
11.8, 11.9
Database System Concepts - 6th Edition
11.101
©Silberschatz, Korth and Sudarshan
11.10 Index definition in SQL
Note: In keeping with the declarative nature of SQL, there is no
standard way provided to specify what type of index, or any other
index details .
Database System Concepts - 6th Edition
11.102
©Silberschatz, Korth and Sudarshan
To do for next time:
Review problems for Ch.11:
3, 4, 5, 6
15, 16, 17, 19, 22
Database System Concepts - 6th Edition
11.103
©Silberschatz, Korth and Sudarshan
Chapter 12 review:
Indexing and Hashing
Basic Concepts
Ordered Indices
B+-Tree Indices
B-Tree Index Files
Static Hashing
Dynamic Hashing
Comparison of Ordered Indexing and Hashing
Database System Concepts - 6th Edition
11.104
©Silberschatz, Korth and Sudarshan
B+-Tree properties
Paths from the root to any leaf have the same length (tree is
perfectly balanced!)
Keys Ki are sorted (assume no duplicates!)
Each inner node (not root or leaf) has between n/2 and n children
Pi points to subtree with key values x s.t. Ki-1 ≤ x < Ki
Leaf node has between (n–1)/2 and n–1 keys
Pi points to tuple with key Ki
Pn points to next leaf
The root normally has to have at least 2 children.
If the root is a leaf (there are no other nodes in the tree), it can
have between 0 and n–1 keys.
Database System Concepts - 6th Edition
11.105
©Silberschatz, Korth and Sudarshan
QUIZ: Verify all the conditions for this B+ tree
n=
Database System Concepts - 6th Edition
11.106
©Silberschatz, Korth and Sudarshan
QUIZ: Verify all the conditions for this B+ tree
n=
Database System Concepts - 6th Edition
11.107
©Silberschatz, Korth and Sudarshan
QUIZ: Are these valid B+ trees?
Database System Concepts - 6th Edition
11.108
©Silberschatz, Korth and Sudarshan
QUIZ: Insert the keys 42, 22, 52, 1, 2, 3 in a
B+ tree with n = 4, that is initially empty
Database System Concepts - 6th Edition
11.109
©Silberschatz, Korth and Sudarshan
QUIZ: Insert keys 21, 55, 22, 23 in this
B+ tree file
Database System Concepts - 6th Edition
11.110
©Silberschatz, Korth and Sudarshan
© Copyright 2025