Chap. 6 Organizing
Files for Performance
Kim Joung-Joon
Database Lab.
Chapter Outline
6.1 Data Compression
6.2 Reclaiming Space in Files
6.3 Finding Things Quickly : An
Introduction to Internal Sorting and
Binary Searching
6.4 Keysorting
File Structures (6)
Konkuk University (DB Lab.)
6.1 Data Compression

Data compression
encoding the information in a file in such a way
that it takes up less space
Some are designed for specific kinds of data,
such as speech, picture, text …

Reasons for compression
→ make smaller files
use less storage
can be transmitted faster, decreasing access
time
can be processed faster sequentially
6.1.1 Using a Different Notation

Compact notation
decrease the number of bits by finding a more
compact notation
⇒ (e.g.) a state code: 6 binary bits are enough for 50 states (2^6 = 64), so it fits in 1 byte
Cost
the file is unreadable by humans (binary encoding)
cost for encoding and decoding (compression algorithms are very simple)
complexity of the software is increased (since an encoding/decoding module is included)
6.1.2 Suppressing Repeating Sequences

Run-length encoding
(1) choose one special, unused byte value as a run-length code
(2) run-length encoding algorithm
(i) substitute the repeated pixels with the
following three bytes
① special run-length code indicator (e.g. 0xff)
② repeated pixel value
③ number of repeated times

(e.g.) 22 23 24 24 24 24 24 24 24 25 26 26 26 26 26 26 25 24
=> 22 23 ff 24 07 25 ff 26 06 25 24
does not guarantee any particular amount of space savings
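The run-length scheme above can be sketched in C++ as follows. This is a minimal illustration, not the textbook's implementation: the function names, the default indicator byte 0xff, and the `minRun` threshold (runs shorter than 4 are copied unchanged, because the three-byte code would not save space) are assumptions made for the example.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Run-length encode `in`. `marker` is the run-length code indicator and is
// assumed never to occur in the data. Runs shorter than `minRun` are copied
// as-is; longer runs become the three bytes: marker, value, count.
std::vector<std::uint8_t> rleEncode(const std::vector<std::uint8_t>& in,
                                    std::uint8_t marker = 0xff,
                                    std::size_t minRun = 4) {
    std::vector<std::uint8_t> out;
    std::size_t i = 0;
    while (i < in.size()) {
        // Measure the run starting at i (capped at 255 so the count fits in a byte).
        std::size_t run = 1;
        while (i + run < in.size() && in[i + run] == in[i] && run < 255) ++run;
        if (run >= minRun) {
            out.push_back(marker);
            out.push_back(in[i]);
            out.push_back(static_cast<std::uint8_t>(run));
        } else {
            out.insert(out.end(), run, in[i]);
        }
        i += run;
    }
    return out;
}

// Reverse the encoding: expand every (marker, value, count) triple.
std::vector<std::uint8_t> rleDecode(const std::vector<std::uint8_t>& in,
                                    std::uint8_t marker = 0xff) {
    std::vector<std::uint8_t> out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        if (in[i] == marker && i + 2 < in.size()) {
            out.insert(out.end(), static_cast<std::size_t>(in[i + 2]), in[i + 1]);
            i += 2;
        } else {
            out.push_back(in[i]);
        }
    }
    return out;
}
```

Applied to the 18-byte example above (values read as hexadecimal), `rleEncode` yields the 11-byte sequence shown on the slide. A file with no long runs would not shrink at all, which is why the savings are not guaranteed.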
6.1.3 Assigning Variable-length Codes (1/3)

Variable-length codes (Morse code)
based on the principle that some values occur
more frequently than others
the more frequently occurring letters getting fewer
symbols
no delimiters
implemented using a lookup table to encode or
decode the data
6.1.3 Assigning Variable-length Codes (2/3)

Lookup table

used to encode and decode the data
1. predictable frequency distribution
   the table never changes
2. unpredictable frequency distribution
   use Huffman code -> binary tree -> table
Huffman encoding
Letter:  a     b     c     d      e      f      g
Prob.:   0.4   0.1   0.1   0.1    0.1    0.1    0.1
Code:    1     010   011   0000   0001   0010   0011
ex) the string "abde"
1 010 0000 0001 => 101000000001
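The table lookup can be sketched as below, using the code table from this slide. The names `kCodes`, `huffEncode`, and `huffDecode` are illustrative, and bits are kept as a string of '0' and '1' characters for readability; a real encoder would pack them into bytes.

```cpp
#include <map>
#include <string>

// Code table taken from the slide: letter -> variable-length bit string.
const std::map<char, std::string> kCodes = {
    {'a', "1"},    {'b', "010"},  {'c', "011"}, {'d', "0000"},
    {'e', "0001"}, {'f', "0010"}, {'g', "0011"}};

// Encode by concatenating the code of each letter (no delimiters).
// Throws std::out_of_range for a letter that is not in the table.
std::string huffEncode(const std::string& text) {
    std::string bits;
    for (char ch : text) bits += kCodes.at(ch);
    return bits;
}

// Decode by growing a candidate code bit by bit until it matches a table
// entry. This works because no code is a prefix of another code.
std::string huffDecode(const std::string& bits) {
    std::map<std::string, char> inverse;
    for (const auto& entry : kCodes) inverse[entry.second] = entry.first;

    std::string text;
    std::string current;
    for (char bit : bits) {
        current += bit;
        auto it = inverse.find(current);
        if (it != inverse.end()) {
            text += it->second;
            current.clear();
        }
    }
    return text;
}
```

With this table, `huffEncode("abde")` gives the 12-bit string from the example, against 32 bits for four one-byte characters.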
6.1.3 Assigning Variable-length Codes (3/3)

Huffman code
determine the probabilities of each value
occurring in the data set and then build a
binary tree
more frequently occurring values are given
shorter search paths
Huffman tree (each node is labeled with its bit path from the root)
root
  0
    00
      000
        d(0000)
        e(0001)
      001
        f(0010)
        g(0011)
    01
      b(010)
      c(011)
  a(1)
6.1.4 Irreversible Compression Techniques

Irreversible compression
based on assumption that some information can
be sacrificed
no way to return to the original data
(e.g.)
① shrinking a raster image from 400x400 pixels to 100x100 pixels
(16 pixels --> 1 pixel)
② voice coding with varying amounts of distortion
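Example ① can be sketched as follows. The slide does not say how 16 pixels become 1; averaging each 4x4 block is an assumption made here, as are the `Image` type and the function name `shrinkBy4`. The image is assumed to be rectangular.

```cpp
#include <cstddef>
#include <vector>

// Gray levels stored row by row (assumed layout for this sketch).
using Image = std::vector<std::vector<int>>;

// Replace every 4x4 block of pixels with one pixel holding the block average.
// The 15 discarded degrees of freedom per block cannot be recovered afterwards.
Image shrinkBy4(const Image& src) {
    const std::size_t rows = src.size() / 4;
    const std::size_t cols = src.empty() ? 0 : src[0].size() / 4;
    Image dst(rows, std::vector<int>(cols, 0));
    for (std::size_t r = 0; r < rows; ++r) {
        for (std::size_t c = 0; c < cols; ++c) {
            int sum = 0;
            for (std::size_t dr = 0; dr < 4; ++dr)
                for (std::size_t dc = 0; dc < 4; ++dc)
                    sum += src[4 * r + dr][4 * c + dc];
            dst[r][c] = sum / 16;  // integer average of the 16 pixels
        }
    }
    return dst;
}
```

A 400x400 input produces a 100x100 output, one sixteenth of the original size.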
6.2 Reclaiming Space in Files

Modification of a variable-length record
( new record is longer than original record )
1. append the extra data to the end of the file and
put a pointer from the original record space to
the extension
=> slower
2. rewrite the whole record at the end of the file (if
not sorted), leaving a hole at the original location
=> wasted space
6.2 Reclaiming Space in Files

Three forms of modification
1. record addition
2. record updating : deletion -> addition
3. record deletion
6.2.1 Record Deletion and Storage
Compaction (1/2)

Approach to record deletion

mark the record as deleted; programs ignore records that carry the mark
adv.

a record can be undeleted with very little effort
disadv.

the space is not reused for a while (relies on storage compaction)
6.2.1 Record Deletion and Storage
Compaction (2/2)

Storage compaction
make files smaller by looking for unused places in
a file and then recovering this space (how
often ?)
a special program reconstructs a file with all the
deleted records squeezed out : Fig. 6.3(c)
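Storage compaction can be sketched on an in-memory model of the file. The assumptions here are that each record is a string and that a deleted record is marked with '*' as its first character; `isDeleted` and `compact` are illustrative names.

```cpp
#include <algorithm>
#include <iterator>
#include <string>
#include <vector>

// A record counts as deleted when its first character is '*' (assumed marker).
bool isDeleted(const std::string& record) {
    return !record.empty() && record[0] == '*';
}

// Build a new "file" that contains only the live records, in their
// original order, so the deleted records are squeezed out.
std::vector<std::string> compact(const std::vector<std::string>& file) {
    std::vector<std::string> out;
    std::copy_if(file.begin(), file.end(), std::back_inserter(out),
                 [](const std::string& record) { return !isDeleted(record); });
    return out;
}
```

Because the whole file is rewritten, this is typically run as a batch job once enough deleted records have accumulated.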

Dynamic storage reclamation
reuse the space from deleted records as soon as
possible
1. look through the file, record by record, until a
deleted record is found (if not found, append at
the end)
=> slow
2. know immediately if there are empty slots in the
file and jump directly to one of those slots
=> more quickly
6.2.2 Deleting Fixed-length Records for
Reclaiming Space Dynamically(2/6)

Linked list : Fig. 6.4 (avail list)
for all of the deleted records
(≡ available space within the file)
handle a list as a stack
pointing is done through RRNs
(<- fixed-length record)
if avail list is empty, added at the end of
the file
6.2.2 Deleting Fixed-length Records for
Reclaiming Space Dynamically (3/6)
The Linked List
Head pointer -> [ptr] -> [ptr] -> [ptr] -> [ptr] -> -1
The Stack
(a) Head pointer (5) -> RRN 5 [link 2] -> RRN 2 [link -1]
(b) Head pointer (3) -> RRN 3 [link 5] -> RRN 5 [link 2] -> RRN 2 [link -1]
6.2.2 Deleting Fixed-length Records for
Reclaiming Space Dynamically (4/6)

Where to keep the avail list ?
1. maintained in a separate file : no
2. embedded within the data file : yes
6.2.2 Deleting Fixed-length Records for
Reclaiming Space Dynamically (5/6)
(a) delete 3, 5 : List head (first available record) = 5
RRN:  0           1           2         3     4          5     6
      Edwards...  Betas...    Wills...  *-1   Masters..  *3    Chavez...

(b) delete 1 : List head (first available record) = 1
RRN:  0           1           2         3     4          5     6
      Edwards...  *5          Wills...  *-1   Masters..  *3    Chavez...

(c) insert three new records : List head (first available record) = -1
RRN:  0           1            2         3            4          5            6
      Edwards..   1st new rec  Wills...  3rd new rec  Masters..  2nd new rec  Chavez...
6.2.2 Deleting Fixed-length Records for
Reclaiming Space Dynamically (6/6)

Implementation
1. place deleted records on a linked avail list
2. treat the avail list as a stack
3. keep the RRN of the first available record
in a header record
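The three implementation points can be sketched on an in-memory model. `FixedFile` and its members are hypothetical names. A deleted slot is assumed to hold '*' followed by the RRN of the next available slot, and `availHead` stands in for the header record.

```cpp
#include <string>
#include <vector>

// In-memory stand-in for a file of fixed-length records.
struct FixedFile {
    int availHead = -1;              // header: RRN of first available slot, -1 = none
    std::vector<std::string> slots;  // slot i holds the record with RRN i

    // Delete: mark the slot, link it to the old head, make it the new head (push).
    void remove(int rrn) {
        slots[rrn] = "*" + std::to_string(availHead);
        availHead = rrn;
    }

    // Insert: reuse the slot on top of the stack (pop), or append if the
    // avail list is empty. Returns the RRN where the record was stored.
    int insert(const std::string& record) {
        if (availHead == -1) {
            slots.push_back(record);
            return static_cast<int>(slots.size()) - 1;
        }
        const int rrn = availHead;
        availHead = std::stoi(slots[rrn].substr(1));  // follow the link
        slots[rrn] = record;
        return rrn;
    }
};
```

Deleting RRNs 3, 5, and 1 in that order and then inserting three records fills slots 1, 5, and 3, since the most recently freed slot is reused first.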
6.2.3 Deleting Variable-length Records (1/3)

File structure for variable-length
records
length(byte count) field at the beginning
of each record

Avail list for variable-length records
place a special asterisk in the first field,
followed by a binary link field pointing to
the next deleted record
use "byte offset" (not RRN) for links
6.2.3 Deleting Variable-length Records (2/3)

Access to the avail list
search through the avail list for a record slot that
is the right size or big enough (not as a stack)

Example : size 55 ?
(a) Before removal
head -> [Size 47] -> [Size 38] -> [Size 72] -> [Size 68] -> -1
(b) After removal
head -> [Size 47] -> [Size 38] --(new link)--> [Size 68] -> -1
Removed record : [Size 72] -> -1
6.2.4 Storage Fragmentation (1/5)

Variable-length records : to eliminate the wasted space due to
internal fragmentation
(each record starts with its record length)
Record[1] : 40 Ames | Jone | 123 Maple | Stillwater | OK | 740751 |
Record[2] : 64 Morrison | Sebastian | 9035 South Hillcrest | Forest Village | OK | 74820 |
Record[3] : 45 Brown | Martha | 625 Kimbark | Des Moines | IA | 50311 |
ex) Delete Record[2] and Insert New Record[i] : 37-byte unused space (64 - 27 = 37)
Record[i] : 27 Ham | Al | 28 Elm | Ada | OK | 70322 |
Internal or External fragmentation ?
6.2.4 Storage Fragmentation (3/5)

Record : deleted and replaced with a
shorter record
1. Use the entire original record slot : Fig.
6.10

internal fragmentation within a variable-length
record
2. Break the original record slot into two
parts : Fig. 6.11

one part for the new record and the other for
other records (=> avail list)
no internal fragmentation --> external
fragmentation
6.2.4 Storage Fragmentation (4/5)

External Fragmentation : Fig 6.12
unused space outside or between
individual records
space that is actually on the avail list, but
is too fragmented to be reused
6.2.4 Storage Fragmentation (5/5)

Ways to combat external fragmentation
1. Storage compaction

regenerate a file when external fragmentation becomes
intolerable
2. Coalescing holes

combine two physically adjacent record slots on the avail
list to make a single, larger record slot
good when the avail list is kept in physical record order
3. Placement strategy

use it to select a record slot from the avail list to minimize
fragmentation
6.2.5 Placement Strategies (1/2)

Placement strategies
1. First-fit placement

treat the avail list as a stack (newly freed slots are added at the front)
look through the avail list and take the first record slot that
is big enough
not necessarily the best fit (the slot taken may be 10 times bigger than needed, or a perfect fit)
2. Best-fit placement

avail list : in ascending order by size
the first slot encountered that is big enough is the smallest such slot
① must search through at least a part of the avail list
② the small leftover pieces result in external fragmentation
③ small slots collect at the front of the list, making best-fit searches longer as time goes on
6.2.5 Placement Strategies (2/2)
3. Worst-fit placement
avail list : in descending order by size
the first record encountered is the largest
record slot
① if the first record slot is not large enough,
none of the others will be
② unused portion of the slot is as large as
possible
=> decrease the likelihood of external
fragmentation

Conclusion
no one placement strategy is superior for
all circumstances
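The three strategies can be compared with a small sketch. It deliberately simplifies the slides: the avail list is modeled as an unordered vector of slot sizes that is scanned in full for best fit and worst fit, whereas the slides keep the list sorted so that the first acceptable slot is the answer. `Strategy` and `chooseSlot` are illustrative names.

```cpp
#include <cstddef>
#include <vector>

enum class Strategy { FirstFit, BestFit, WorstFit };

// Return the index in `avail` of the slot chosen for a record of `need`
// bytes, or -1 when no slot is big enough (the record is then appended).
int chooseSlot(const std::vector<std::size_t>& avail, std::size_t need,
               Strategy strategy) {
    int chosen = -1;
    for (std::size_t i = 0; i < avail.size(); ++i) {
        if (avail[i] < need) continue;  // too small, skip
        if (chosen == -1) {
            chosen = static_cast<int>(i);
            if (strategy == Strategy::FirstFit) break;  // first big-enough slot wins
            continue;
        }
        const std::size_t best = avail[static_cast<std::size_t>(chosen)];
        const bool better = (strategy == Strategy::BestFit) ? avail[i] < best
                                                            : avail[i] > best;
        if (better) chosen = static_cast<int>(i);
    }
    return chosen;
}
```

For the avail list 47, 38, 72, 68 and a 55-byte record, first fit and worst fit both pick the 72-byte slot, while best fit picks the 68-byte slot and leaves a 13-byte fragment.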
6.3 Finding Things Quickly :
6.3.1 Finding Things in a Simple Field & Record Files

Direct access method
1. by Relative Record Number for fixed-length
records (--> record's byte offset -->
jump to it using direct access)
2. by record's byte offset for variable-length
records
3. by record's key value

keyed access --> sequential search ?
no record or several records
=> look through the entire file
6.3.2 Search by Guessing :Binary Search

Binary search : Fig. 6.13, Fig. 6.14
for a file which is sorted in ascending
order by key
int BinarySearch(FixedRecordFile &file, RecType &obj, KeyType &key)
{ // Binary search: if the key is found, obj holds the matching record and 1 is returned
    int low = 0; int high = file.NumRecs() - 1;
    while (low <= high)
    {
        int guess = low + (high - low) / 2; // midpoint of the current range [low, high]
        file.ReadByRRN(obj, guess);
        if (obj.Key() == key) return 1; // record found
        if (obj.Key() < key) low = guess + 1; // search the part after guess
        else high = guess - 1; // search the part before guess
    }
    return 0; // loop ended without finding the key
}
6.3.2 Search by Guessing :Binary Search

Example : a file of 1,000 fixed-length
records
(1) sequential search

at most 1,000 comparisons
(on average 500 comparisons)
(2) binary search

at most 10 comparisons
6.3.3 Binary Search Versus Sequential Search

Binary search for n records

at most ⌊log2 n⌋ + 1 comparisons
on average about ⌊log2 n⌋ + 1/2 comparisons
O(log2 n)
Sequential search for n records

at most n comparisons
on average n/2 comparisons
O(n)
Example : a file of 2,000 records
1. binary search

at most ⌊log2 2,000⌋ + 1 = 11 comparisons
(=> sorting is needed !!!)
2. sequential search

at most n = 2,000 comparisons
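The worst-case count ⌊log2 n⌋ + 1 can be checked without floating-point arithmetic: it equals the number of times n can be halved before it reaches zero. The function name below is an illustrative choice.

```cpp
// Worst-case number of comparisons made by a binary search over n records,
// that is, floor(log2 n) + 1 for n >= 1 (and 0 for an empty file).
int maxBinaryComparisons(unsigned long n) {
    int comparisons = 0;
    while (n > 0) {
        ++comparisons;  // one probe of the middle record
        n /= 2;         // at most half of the records remain
    }
    return comparisons;
}
```

For a file of 2,000 records this gives 11, against up to 2,000 comparisons for a sequential search.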
6.3.4 Sorting a Disk File in Memory

Sorting with very large memory
(i) read the entire file from the disk into memory
sequentially
(ii) do the internal sorting in memory
[Figure] Disk : unsorted file --(read the entire file, sequential access)--> Memory : unsorted file --(sort in memory)--> sorted file
6.3.5 The Limitations of Binary Searching
and Internal Sorting (1/3)

Problems of "sort, then binary search"
1. Binary searching requires more than one or
two accesses
average case : about ⌊log2 n⌋ + 1/2 comparisons
(accesses)

1,000 items ... 9.5 accesses
100,000 items ... 16.5 accesses
① access by RRN (performance) : with a single access
② access by key (usefulness) : use of index structures
6.3.5 The Limitations of Binary Searching
and Internal Sorting (2/3)
2. Keeping a file sorted is very expensive
① insertion
(i) read through half the records
(ii) shift the records to open up the space
② searching
binary search
=> the benefits of faster retrieval can more than offset the costs of
keeping the file sorted

better solutions for the problem "access by key"
1. not involve reordering of the records in the file when a new
record is added => use of indexes or hashing
2. associate with data structures that allow for substantially
more rapid, efficient reordering of the file => use of tree
structures, such as a B-tree
6.3.5 The Limitations of Binary Searching
and Internal Sorting (3/3)
3. An internal sort works only on small
files
small file (in memory) ... internal sort
large file ... external sort => "keysort"
6.4 Keysorting

Keysort (tag sort)
(i) read the keys from the file into memory
(ii) sort them in memory
(iii) rearrange the records in the file
according to the new ordering of the keys
can sort larger files entirely in memory
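Steps (i) to (iii) can be sketched on an in-memory model of the file. The assumptions are that each record is a '|'-delimited string whose first field is the key, and that RRNs start at 0 here (the figure numbers them from 1). `KeyNode`, `buildSortedKeyNodes`, and `keysort` are illustrative names.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// One entry of the in-memory key array: a key plus the RRN of its record.
struct KeyNode {
    std::string key;
    int rrn;
};

// Steps (i) and (ii): collect (key, RRN) pairs and sort them by key.
// Only the keys are moved around; the records themselves are left untouched.
std::vector<KeyNode> buildSortedKeyNodes(const std::vector<std::string>& records) {
    std::vector<KeyNode> nodes;
    for (std::size_t rrn = 0; rrn < records.size(); ++rrn) {
        const std::string& rec = records[rrn];
        nodes.push_back({rec.substr(0, rec.find('|')), static_cast<int>(rrn)});
    }
    std::sort(nodes.begin(), nodes.end(),
              [](const KeyNode& a, const KeyNode& b) { return a.key < b.key; });
    return nodes;
}

// Step (iii): fetch each record by RRN in key order to produce the sorted file.
std::vector<std::string> keysort(const std::vector<std::string>& records) {
    std::vector<std::string> sorted;
    for (const KeyNode& node : buildSortedKeyNodes(records))
        sorted.push_back(records[static_cast<std::size_t>(node.rrn)]);
    return sorted;
}
```

On a real disk file, step (iii) is where each fetch by RRN becomes a separate seek.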
6.4.1 Description of the Method
Conceptual view before sorting
KEYNODES array (keys in memory)     Records (on secondary store)
KEY        RRN                      RRN
HARRISON   1                        1    Harrison|Susan|387 Eastern....
KELLOG     2                        2    Kellog|Bill|17 Maple....
HARRIS     3                        3    Harris|Margaret|4343 West....
...        ...                      ...
BELL       k                        k    Bell|Robert|8912 Hill....

Conceptual view after sorting
KEYNODES array (keys in memory)     Records (on secondary store)
KEY        RRN                      RRN
BELL       k                        1    Harrison|Susan|387 Eastern....
HARRIS     3                        2    Kellog|Bill|17 Maple....
HARRISON   1                        3    Harris|Margaret|4343 West....
...        ...                      ...
KELLOG     2                        k    Bell|Robert|8912 Hill....
6.4.2 Limitations of the Keysort Method

Limitations
1. read in the records a second time
not sequentially, but randomly by RRN
(--> random seeks)
2. write out the new sorted file
sequentially (--> seeks)

=> disk drive must move the head back and
forth between two files as it reads and
writes
6.4.3 Another Solution (1/2)

Solution
do not write the records back in sorted
order
write out the contents of the
KEYNODES[] array as an index file

Looking for a record
1. do binary search on the index file (=>
RRN)
2. use the RRN to find the corresponding
record
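The two lookup steps can be sketched as follows, with the index modeled as an in-memory vector of (key, RRN) pairs sorted by key. `IndexEntry` and `lookupRrn` are illustrative names; `std::lower_bound` from the C++ standard library performs the binary search.

```cpp
#include <algorithm>
#include <string>
#include <utility>
#include <vector>

// One index entry: (key, RRN of the record in the original, unsorted file).
using IndexEntry = std::pair<std::string, int>;

// Step 1: binary search the sorted index for `key`.
// Returns the RRN to use in step 2, or -1 when the key is not in the index.
int lookupRrn(const std::vector<IndexEntry>& index, const std::string& key) {
    auto it = std::lower_bound(
        index.begin(), index.end(), key,
        [](const IndexEntry& entry, const std::string& k) { return entry.first < k; });
    return (it != index.end() && it->first == key) ? it->second : -1;
}
```

The returned RRN then locates the record in the original file with a single direct access, so the data file never has to be rewritten in sorted order.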
6.4.3 Another Solution (2/2)
Index file                          Original file
KEY        RRN                      RRN  Records
BELL       k                        1    Harrison|Susan|387 Eastern....
HARRIS     3                        2    Kellog|Bill|17 Maple....
HARRISON   1                        3    Harris|Margaret|4343 West....
...        ...                      ...
KELLOG     2                        k    Bell|Robert|8912 Hill....