Chap. 6 Organizing Files for Performance

Kim Joung-Joon
Database Lab.
[email protected]
Chapter Outline
6.1 Data Compression
6.2 Reclaiming Space in Files
6.3 Finding Things Quickly : An Introduction to Internal Sorting and Binary Searching
6.4 Keysorting
File Structures (6)
Konkuk University (DB Lab.)
6.1 Data Compression
Data compression
- encoding the information in a file in such a way that it takes up less space
- some techniques are designed for specific kinds of data, such as speech, pictures, or text
Reasons for compression → smaller files
- use less storage
- can be transmitted faster, decreasing access time
- can be processed faster sequentially
6.1.1 Using a Different Notation
Compact notation
- decrease the number of bits by finding a more compact notation
- (e.g.) a 2-character state code ⇒ binary 6 bits (2^6 = 64), enough for 50 states, stored in 1 byte
Cost
- the file is unreadable by humans (binary encoding)
- cost of encoding and decoding (although these compression algorithms are very simple)
- complexity of the software is increased (since an encoding/decoding module is included)
6.1.2 Suppressing Repeating Sequences
Run-length encoding
(1) choose one special, unused byte value as a run-length code
(2) run-length encoding algorithm
    (i) substitute each run of repeated pixels with the following three bytes
        ① special run-length code indicator (e.g. 0xff)
        ② repeated pixel value
        ③ number of times the value is repeated
(e.g.) 22 23 24 24 24 24 24 24 24 25 26 26 26 26 26 26 25 24
    => 22 23 ff 24 07 25 ff 26 06 25 24
- does not guarantee any particular amount of space savings
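The substitution rule above can be sketched as follows. This is a minimal illustration, not the slides' own code: the function name `rleEncode`, the default marker 0xff, and the `minRun` threshold of 4 are assumptions made here (a run shorter than 4 is copied unchanged because the three-byte code would not make it smaller), and the sketch assumes the marker value never occurs in the pixel data.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not from the slides): run-length encodes `pixels`.
// Assumes `marker` never occurs as a pixel value, and that runs shorter than
// `minRun` are copied unchanged because encoding them would not save space.
std::vector<std::uint8_t> rleEncode(const std::vector<std::uint8_t>& pixels,
                                    std::uint8_t marker = 0xff,
                                    std::size_t minRun = 4) {
    std::vector<std::uint8_t> out;
    std::size_t i = 0;
    while (i < pixels.size()) {
        // Measure the run starting at i (capped at 255 so the count fits in one byte).
        std::size_t run = 1;
        while (i + run < pixels.size() && pixels[i + run] == pixels[i] && run < 255) {
            ++run;
        }
        if (run >= minRun) {
            out.push_back(marker);                          // (1) run-length code indicator
            out.push_back(pixels[i]);                       // (2) repeated pixel value
            out.push_back(static_cast<std::uint8_t>(run));  // (3) number of repetitions
        } else {
            out.insert(out.end(), run, pixels[i]);          // short run: copy as-is
        }
        i += run;
    }
    return out;
}
```

Applied to the 18-byte sequence in the example (read as hexadecimal values), this sketch yields the 11-byte result shown on the slide.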
6.1.3 Assigning Variable-length Codes (1/3)
Variable-length codes (e.g. Morse code)
- based on the principle that some values occur more frequently than others
- the more frequently occurring letters get fewer symbols
- no delimiters
- implemented using a lookup table to encode or decode the data
6.1.3 Assigning Variable-length Codes (2/3)
Lookup table
- used to encode and decode the data
  1. predictable frequency distribution : the table never changes
  2. unpredictable frequency distribution : use Huffman code -> binary tree -> table
Huffman encoding
Letter:  a    b    c    d     e     f     g
Prob.:   0.4  0.1  0.1  0.1   0.1   0.1   0.1
Code:    1    010  011  0000  0001  0010  0011
ex) the string "abde" is encoded as 101000000001
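Table-driven encoding and decoding with the code table above might look like the sketch below. The table contents come from the slide; the names `kCodeTable`, `encode`, and `decode` are illustrative, and bits are kept as '0'/'1' characters for readability rather than packed into bytes.

```cpp
#include <map>
#include <string>

// Code table copied from the slide; all identifiers here are illustrative.
const std::map<char, std::string> kCodeTable = {
    {'a', "1"},    {'b', "010"},  {'c', "011"}, {'d', "0000"},
    {'e', "0001"}, {'f', "0010"}, {'g', "0011"}};

// Encode by concatenating each letter's code; no delimiters are written.
std::string encode(const std::string& text) {
    std::string bits;
    for (char ch : text) {
        bits += kCodeTable.at(ch);  // throws std::out_of_range if ch has no code
    }
    return bits;
}

// Decode by growing a candidate code one bit at a time until it matches a
// table entry. This works because no code is a prefix of another code.
std::string decode(const std::string& bits) {
    std::map<std::string, char> reverse;
    for (const auto& entry : kCodeTable) {
        reverse[entry.second] = entry.first;
    }
    std::string text;
    std::string candidate;
    for (char bit : bits) {
        candidate += bit;
        auto it = reverse.find(candidate);
        if (it != reverse.end()) {
            text += it->second;
            candidate.clear();
        }
    }
    return text;
}
```

With this table, `encode("abde")` gives the 12-bit string from the slide, and `decode` recovers the original letters without any delimiters.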
6.1.3 Assigning Variable-length Codes (3/3)
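Huffman code
- determine the probabilities of each value occurring in the data set and then build a binary tree
- more frequently occurring values are given shorter search paths
[Figure: Huffman tree. From the root, branch 1 leads to the leaf a(1) and branch 0 leads to the internal nodes 00 and 01. Node 01 has the leaves b(010) and c(011). Node 00 has the children 000, with leaves d(0000) and e(0001), and 001, with leaves f(0010) and g(0011).]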
6.1.4 Irreversible Compression Techniques
Irreversible compression
- based on the assumption that some information can be sacrificed
- no way to return to the original data
- (e.g.)
  ① shrinking a raster image from 400x400 pixels to 100x100 pixels (16 pixels --> 1 pixel)
  ② voice coding with varying amounts of distortion
6.2 Reclaiming Space in Files
Modification of a variable-length record (the new record is longer than the original record)
1. append the extra data to the end of the file and put a pointer from the original record space to the extension
   => slower
2. rewrite the whole record at the end of the file (if the file is not sorted), leaving a hole at the original location
   => wasted space
6.2 Reclaiming Space in Files
Three forms of modification
1. record addition
2. record updating : deletion -> addition
3. record deletion
6.2.1 Record Deletion and Storage Compaction (1/2)
Approach to record deletion
- place a special mark in a special field of each deleted record
  (e.g.) asterisk in the first field : Fig. 6.3(a), Fig. 6.3(b)
(a) Ames|Mary|123 Maple|Stillwater|OK|74075|………………………
    Morrison|Sebastian|9035 South Hillcrest|Forest Village|OK|74820|
    Brown|Martha|625 Kimbark|Des Moines|IA|50311|………………..
(b) Ames|Mary|123 Maple|Stillwater|OK|74075|…………………….
    *|rrison|Sebastian|9035 South Hillcrest|Forest Village|OK|74820|
    Brown|Martha|625 Kimbark|Des Moines|IA|50311|………………
- a program ignores the marked records as deleted
- adv. : a record can be undeleted with very little effort
- disadv. : the space is not reused for a while (reclaiming it relies on storage compaction)
6.2.1 Record Deletion and Storage Compaction (2/2)
Storage compaction
- make files smaller by looking for unused places in a file and then recovering this space (how often?)
- a special program reconstructs a file with all the deleted records squeezed out : Fig. 6.3(c)
Compaction methods
1. through a file copy program (out of place)
2. through a more complicated and time-consuming compacting algorithm (in place)
(c) Ames|Mary|123 Maple|Stillwater|OK|74075|………………………
    Brown|Martha|625 Kimbark|Des Moines|IA|50311|…………………
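Out-of-place compaction through a copy can be sketched as below. This is only an in-memory illustration: the records are modeled as strings in a vector rather than read from a file, and `compactFile` is a hypothetical name, not a function from the slides.

```cpp
#include <string>
#include <vector>

// Hypothetical out-of-place compaction: returns a copy of `records` without
// the records whose first field carries the deletion mark '*'.
std::vector<std::string> compactFile(const std::vector<std::string>& records) {
    std::vector<std::string> compacted;
    for (const std::string& record : records) {
        bool deleted = !record.empty() && record[0] == '*';
        if (!deleted) {
            compacted.push_back(record);  // live record: copy to the new file
        }
    }
    return compacted;
}
```

Run on the three records of Fig. 6.3(b), the copy keeps only the Ames and Brown records, as in Fig. 6.3(c).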
6.2.2 Deleting Fixed-length Records for Reclaiming Space Dynamically (1/6)
Dynamic storage reclamation
- reuse the space from deleted records as soon as possible
1. look through the file, record by record, until a deleted record is found (if none is found, append at the end)
   => slow
2. know immediately if there are empty slots in the file and jump directly to one of those slots
   => much quicker
6.2.2 Deleting Fixed-length Records for Reclaiming Space Dynamically (2/6)
Linked list : Fig. 6.4 (avail list)
- links all of the deleted records (≡ available space within the file)
- handle the list as a stack
- pointing is done through RRNs (<- fixed-length records)
- if the avail list is empty, new records are added at the end of the file
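6.2.2 Deleting Fixed-length Records for Reclaiming Space Dynamically (3/6)
[Figure: the avail list drawn as a linked list and as a stack. The Linked List: a head pointer leads to a chain of records, each holding a pointer to the next; the last one holds -1. The Stack: (a) head pointer (5): RRN 5 links to RRN 2, and RRN 2 holds -1; (b) head pointer (3): RRN 3 links to RRN 5, RRN 5 links to RRN 2, and RRN 2 holds -1.]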
6.2.2 Deleting Fixed-length Records for Reclaiming Space Dynamically (4/6)
Where to keep the avail list?
1. maintained in a separate file : no
2. embedded within the data file : yes
6.2.2 Deleting Fixed-length Records for Reclaiming Space Dynamically (5/6)
(a) delete 3, 5 : List head (first available record) = 5
RRN:     0           1         2         3     4          5    6
Record:  Edwards...  Betas...  Wills...  *-1   Masters..  *3   Chavez...
(b) delete 1 : List head (first available record) = 1
RRN:     0           1         2         3     4          5    6
Record:  Edwards...  *5        Wills...  *-1   Masters..  *3   Chavez...
(c) insert three new records : List head (first available record) = -1
RRN:     0           1            2         3            4          5            6
Record:  Edwards..   1st new rec  Wills...  3rd new rec  Masters..  2nd new rec  Chavez...
6.2.2 Deleting Fixed-length Records for Reclaiming Space Dynamically (6/6)
Implementation
1. place deleted records on a linked avail list
2. treat the avail list as a stack
3. keep the RRN of the first available record in a header record
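The three implementation points can be sketched with an in-memory stand-in for the file. Everything named here (`FixedLengthFile`, `firstAvail`, `deleteRecord`, `addRecord`) is hypothetical, and a vector of strings replaces real fixed-length slots on disk; a deleted slot stores "*" followed by the RRN of the next available slot, as in the figure for this section.

```cpp
#include <string>
#include <vector>

// Hypothetical in-memory model of a fixed-length record file with an avail list.
struct FixedLengthFile {
    int firstAvail = -1;             // header record: RRN of first available slot, -1 if none
    std::vector<std::string> slots;  // slots[rrn] stands in for the record at that RRN

    // Delete: mark the slot with '*' plus the old list head, then make this
    // slot the new head (a stack push).
    void deleteRecord(int rrn) {
        slots[rrn] = "*" + std::to_string(firstAvail);
        firstAvail = rrn;
    }

    // Add: reuse the slot at the head of the avail list (a stack pop), or
    // append at the end of the file when the avail list is empty.
    int addRecord(const std::string& record) {
        if (firstAvail == -1) {
            slots.push_back(record);
            return static_cast<int>(slots.size()) - 1;
        }
        int rrn = firstAvail;
        firstAvail = std::stoi(slots[rrn].substr(1));  // link stored after the '*'
        slots[rrn] = record;
        return rrn;
    }
};
```

Deleting RRNs 3, 5, and 1 in that order and then adding three records reuses slots 1, 5, and 3, which is the last-in, first-out behavior the stack treatment implies.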
6.2.3 Deleting Variable-length Records (1/3)
File structure for variable-length records
- length (byte count) field at the beginning of each record
Avail list for variable-length records
- place a special asterisk in the first field, followed by a binary link field pointing to the next deleted record
- use "byte offset" (not RRN) for links
6.2.3 Deleting Variable-length Records (2/3)
Sample file
1. before deletion : Fig. 6.6(a)
2. after deletion of the second record : Fig. 6.6(b)
(a) HEAD.FIRST_AVAIL : -1
40 Ames|Mary|123 Maple|Stillwater|OK|74075|64 Morrison|Sebastian|9035 South Hillcrest|Forest Village|OK|74820|45 Brown|Martha|625 Kimbark|Des Moines|IA|50311|
(b) HEAD.FIRST_AVAIL : 43
40 Ames|Mary|123 Maple|Stillwater|OK|74075|64 *|-1……………………..………………………………………………45 Brown|Martha|625 Kimbark|Des Moines|IA|50311|
6.2.3 Deleting Variable-length Records (3/3)
Access to the avail list
- search through the avail list for a record slot that is the right size or big enough (not as a stack)
Example : a new record of size 55?
(a) Before removal : head -> [size 47] -> [size 38] -> [size 72] -> [size 68] -> -1
(b) After removal  : head -> [size 47] -> [size 38] -> (new link) -> [size 68] -> -1
    Removed record : [size 72]
6.2.4 Storage Fragmentation (1/5)
Internal fragmentation
- wasted space within a record
(e.g.) 64-byte fixed-length records; the unused space (dots) at the end of a record is internal fragmentation
Ames | John | 123 Maple | Stillwater | OK | 74075 |...................................
Morrison | Sebastian | 9035 South Hillcrest | Forest Village | OK | 74820 |
Brown | Martha | 625 Kimbark | Des Moines | IA | 50311 | .........................
6.2.4 Storage Fragmentation (2/5)
File with variable-length records
- eliminates the wasted space due to internal fragmentation
- each record is preceded by its record length
Record[1] : 40 Ames | John | 123 Maple | Stillwater | OK | 74075 |
Record[2] : 64 Morrison | Sebastian | 9035 South Hillcrest | Forest Village | OK | 74820 |
Record[3] : 45 Brown | Martha | 625 Kimbark | Des Moines | IA | 50311 |
Internal or external fragmentation?
ex) Delete Record[2] and insert new Record[i] : 37-byte unused space
Record[i] : 27 Ham | Al | 28 Elm | Ada | OK | 70322 |
6.2.4 Storage Fragmentation (3/5)
Record : deleted and replaced with a shorter record
1. Use the entire original record slot : Fig. 6.10
   - internal fragmentation within a variable-length record
2. Break the original record slot into two parts : Fig. 6.11
   - one part for the new record and the other for another record (=> avail list)
   - no internal fragmentation --> external fragmentation
6.2.4 Storage Fragmentation (4/5)
External fragmentation : Fig. 6.12
- unused space outside or between individual records
- space that is actually on the avail list, but is too fragmented to be reused
6.2.4 Storage Fragmentation (5/5)
Ways to combat external fragmentation
1. Storage compaction
   - regenerate the file when external fragmentation becomes intolerable
2. Coalescing holes
   - combine two physically adjacent record slots on the avail list into a single, larger record slot
   - works well when the avail list is kept in physical record order
3. Placement strategy
   - choose the record slot from the avail list in a way that minimizes fragmentation
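Coalescing holes can be sketched as follows, assuming each avail-list entry is known by its byte offset and size. The `Hole` struct and the `coalesceHoles` function are illustrative names; the sketch sorts the holes by offset first, which corresponds to keeping the avail list in physical record order.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical description of one free slot: where it starts and how long it is.
struct Hole {
    long offset;
    long size;
};

// Merge holes that are physically adjacent (one ends exactly where the next begins).
std::vector<Hole> coalesceHoles(std::vector<Hole> holes) {
    std::sort(holes.begin(), holes.end(),
              [](const Hole& a, const Hole& b) { return a.offset < b.offset; });
    std::vector<Hole> merged;
    for (const Hole& hole : holes) {
        if (!merged.empty() &&
            merged.back().offset + merged.back().size == hole.offset) {
            merged.back().size += hole.size;  // adjacent: grow the previous hole
        } else {
            merged.push_back(hole);           // not adjacent: keep as a separate hole
        }
    }
    return merged;
}
```

For example, holes at offsets 40 (30 bytes), 70 (30 bytes), and 100 (20 bytes) become a single 80-byte hole at offset 40, while a hole at offset 200 stays separate.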
6.2.5 Placement Strategies (1/2)
Placement strategies
1. First-fit placement
   - treat the avail list as a stack
   - look through the avail list to find the first record slot that is big enough
   - not necessarily the best fit (the slot may be 10 times bigger than needed, or a perfect fit)
2. Best-fit placement
   - avail list : in ascending order by size
   - the first big-enough record slot encountered is the smallest one that fits
   ① must search through at least a part of the avail list
   ② results in external fragmentation
   ③ makes best-fit searches longer as time goes on
6.2.5 Placement Strategies (2/2)
3. Worst-fit placement
   - avail list : in descending order by size
   - the first record slot encountered is the largest one
   ① if the first record slot is not large enough, none of the others will be
   ② the unused portion of the slot is as large as possible
      => decreases the likelihood of external fragmentation
Conclusion
- no one placement strategy is superior under all circumstances
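The three strategies differ only in which big-enough slot they pick, which the sketch below makes explicit. It is a simplification: instead of keeping the avail list sorted by size as the slides describe, the hypothetical `chooseSlot` function scans an unsorted list of slot sizes and returns the index of the slot each strategy would select.

```cpp
#include <cstddef>
#include <vector>

enum class Placement { FirstFit, BestFit, WorstFit };

// Hypothetical helper: returns the index of the slot chosen for a record of
// `needed` bytes, or -1 if no slot on the avail list is big enough.
int chooseSlot(const std::vector<int>& slotSizes, int needed, Placement strategy) {
    int chosen = -1;
    for (std::size_t i = 0; i < slotSizes.size(); ++i) {
        if (slotSizes[i] < needed) {
            continue;  // too small for the new record
        }
        int candidate = static_cast<int>(i);
        if (chosen == -1) {
            chosen = candidate;
            if (strategy == Placement::FirstFit) {
                break;  // first fit stops at the first big-enough slot
            }
        } else if (strategy == Placement::BestFit &&
                   slotSizes[i] < slotSizes[chosen]) {
            chosen = candidate;  // smaller slot that still fits
        } else if (strategy == Placement::WorstFit &&
                   slotSizes[i] > slotSizes[chosen]) {
            chosen = candidate;  // larger slot, leaving a bigger leftover
        }
    }
    return chosen;
}
```

With slot sizes 47, 38, 72, 68 and a 55-byte record, first fit and worst fit both pick the 72-byte slot, while best fit picks the 68-byte slot.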
6.3 Finding Things Quickly :
6.3.1 Finding Things in Simple Field and Record Files
Direct access method
1. by Relative Record Number for fixed-length records (--> record's byte offset --> jump to it using direct access)
2. by record's byte offset for variable-length records
3. by record's key value
   - keyed access --> sequential search?
   - when no record or several records match => look through the entire file
6.3.2 Search by Guessing :Binary Search
Binary search : Fig. 6.13, Fig. 6.14
- for a file which is sorted in ascending order by key
int BinarySearch(FixedRecordFile &file, RecType &obj, KeyType &key)
{   // Binary search: if the key is found, obj contains the matching record and 1 is returned
    int low = 0; int high = file.NumRecs() - 1;
    while (low <= high)
    {
        int guess = low + (high - low) / 2;    // midpoint of the current range
        file.ReadByRRN(obj, guess);
        if (obj.Key() == key) return 1;        // record found
        if (obj.Key() < key) low = guess + 1;  // search the part after guess
        else high = guess - 1;                 // search the part before guess
    }
    return 0;   // the loop ended without finding the key
}
6.3.2 Search by Guessing :Binary Search
Example : a file of 1,000 fixed-length records
(1) sequential search
    - at most 1,000 comparisons (on average 500 comparisons)
(2) binary search
    - at most 10 comparisons
6.3.3 Binary Search Versus Sequential Search
Binary search for n records
- at most ⌊log2 n⌋ + 1 comparisons
- on average approximately ⌊log2 n⌋ + 1/2 comparisons
- O(log2 n)
Sequential search for n records
- at most n comparisons
- on average n/2 comparisons
- O(n)
Example : a file of 2,000 records
1. binary search
   - at most ⌊log2 2,000⌋ + 1 = 11 comparisons (=> sorting is needed !!!)
2. sequential search
   - at most n = 2,000 comparisons
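The worst-case counts can be checked with a few lines of code. The function name below is an assumption made for this sketch; it counts how many times a range of n records can be halved before it is empty, which equals floor(log2 n) + 1 for n >= 1.

```cpp
// Hypothetical helper: worst-case number of comparisons for a binary search
// over n records, computed by repeated halving (floor(log2 n) + 1 for n >= 1).
int maxBinarySearchComparisons(long n) {
    int comparisons = 0;
    while (n > 0) {
        n /= 2;         // each comparison discards about half of the range
        ++comparisons;
    }
    return comparisons;
}
```

For n = 1,000 this gives 10 and for n = 2,000 it gives 11, against worst cases of 1,000 and 2,000 comparisons for a sequential search.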
6.3.4 Sorting a Disk File in Memory
Sorting with very large memory
(i) read the entire file from the disk into memory sequentially
(ii) do the internal sorting in memory
[Figure: the unsorted file on disk is read in its entirety (sequential access) into memory, where it is sorted in memory to give the sorted file.]
6.3.5 The Limitations of Binary Searching and Internal Sorting (1/3)
Problems of "sort, then binary search"
1. Binary searching requires more than one or two accesses
   - average case : approximately ⌊log2 n⌋ + 1/2 comparisons (accesses)
     1,000 items ... 9.5 accesses
     100,000 items ... 16.5 accesses
   ① access by RRN (performance) : with a single access
   ② access by key (usefulness)
   : use of index structures
6.3.5 The Limitations of Binary Searching and Internal Sorting (2/3)
2. Keeping a file sorted is very expensive
   ① insertion
      (i) read through half the records
      (ii) shift the records to open up the space
   ② searching
      binary search
   => the benefits of faster retrieval can more than offset the costs of keeping the file sorted
Better solutions for the problem "access by key"
1. do not involve reordering of the records in the file when a new record is added => use of indexes or hashing
2. associate the file with data structures that allow for substantially more rapid, efficient reordering of the file => use of tree structures, such as a B-tree
6.3.5 The Limitations of Binary Searching and Internal Sorting (3/3)
3. An internal sort works only on small files
   small file (in memory) ... internal sort
   large file ... external sort => "keysort"
6.4 Keysorting
Keysort (tag sort)
(i) read the keys from the file into memory
(ii) sort them in memory
(iii) rearrange the records in the file according to the new ordering of the keys
- can sort larger files than a sort of whole records, since only the keys must fit in memory
6.4.1 Description of the Method
KEYNODES array (in memory) and records (on secondary store)

Conceptual view before sorting
KEY        RRN    Records
HARRISON   1      Harrison|Susan|387 Eastern....
KELLOG     2      Kellog|Bill|17 Maple....
HARRIS     3      Harris|Margaret|4343 West....
...
BELL       k      Bell|Robert|8912 Hill....

Conceptual view after sorting keys in memory (the records themselves have not moved)
KEY        RRN    Records
BELL       k      Harrison|Susan|387 Eastern....
HARRIS     3      Kellog|Bill|17 Maple....
HARRISON   1      Harris|Margaret|4343 West....
...
KELLOG     2      Bell|Robert|8912 Hill....
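The method can be sketched in memory as follows. This is an illustration under several assumptions: the file is modeled as a vector of strings, RRNs are counted from 0 (the figure counts from 1), the key is taken to be the first field as written (the figure shows keys in uppercase), and `KeyNode`, `sortKeys`, and `keysort` are hypothetical names.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// One KEYNODES entry: a key and the RRN of the record it came from.
struct KeyNode {
    std::string key;
    int rrn;
};

// Steps (i) and (ii): collect the keys with their RRNs, then sort by key.
std::vector<KeyNode> sortKeys(const std::vector<std::string>& records) {
    std::vector<KeyNode> keynodes;
    for (std::size_t rrn = 0; rrn < records.size(); ++rrn) {
        // Assumption: the key is the first field, i.e. the text before the first '|'.
        std::string key = records[rrn].substr(0, records[rrn].find('|'));
        keynodes.push_back({key, static_cast<int>(rrn)});
    }
    std::sort(keynodes.begin(), keynodes.end(),
              [](const KeyNode& a, const KeyNode& b) { return a.key < b.key; });
    return keynodes;
}

// Step (iii): fetch each record by RRN in key order and write it to the sorted copy.
std::vector<std::string> keysort(const std::vector<std::string>& records) {
    std::vector<std::string> sorted;
    for (const KeyNode& node : sortKeys(records)) {
        sorted.push_back(records[node.rrn]);
    }
    return sorted;
}
```

Note that step (iii) reads the records in RRN order 3, 2, 0, 1 for the four sample records, which on disk would mean one seek per record.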
6.4.2 Limitations of the Keysort Method
Limitations
1. read in the records a second time
   - not sequentially, but randomly by RRN (--> random seeks)
2. write out the new sorted file
   - sequentially (--> seeks)
   => the disk drive must move the head back and forth between the two files as it reads and writes
6.4.3 Another Solution (1/2)
Solution
- do not write the records back in sorted order
- write out the contents of the KEYNODES[] array as an index file
Looking for a record
1. do a binary search on the index file (=> RRN)
2. use the RRN to find the corresponding record
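The lookup through the index can be sketched as below, with the index held in memory as a vector sorted by key. `IndexEntry` and `findRRN` are hypothetical names, RRNs are counted from 0 in this sketch, and the standard library's `std::lower_bound` performs the binary search.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// One index entry: a key and the RRN of its record in the original file.
struct IndexEntry {
    std::string key;
    int rrn;
};

// Step 1: binary search on the sorted index. Returns the RRN for `key`,
// or -1 if the key is not in the index. Step 2 (reading the record at that
// RRN from the original file) is left to the caller.
int findRRN(const std::vector<IndexEntry>& index, const std::string& key) {
    auto it = std::lower_bound(
        index.begin(), index.end(), key,
        [](const IndexEntry& entry, const std::string& k) { return entry.key < k; });
    if (it != index.end() && it->key == key) {
        return it->rrn;
    }
    return -1;
}
```

Because the records never move, adding this index costs one sequential write of the key array instead of rewriting the whole data file.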
6.4.3 Another Solution (2/2)
Index file            Original file
KEY        RRN        Records
BELL       k          Harrison|Susan|387 Eastern....
HARRIS     3          Kellog|Bill|17 Maple....
HARRISON   1          Harris|Margaret|4343 West....
...
KELLOG     2          Bell|Robert|8912 Hill....