Chap. 6 Organizing Files for Performance
Kim Joung-Joon
Database Lab.
[email protected]
File Structures (6) Konkuk University (DB Lab.)

Chapter Outline
  6.1 Data Compression
  6.2 Reclaiming Space in Files
  6.3 Finding Things Quickly: An Introduction to Internal Sorting and Binary Searching
  6.4 Keysorting

6.1 Data Compression
  Data compression
    encoding the information in a file in such a way that it takes up less space
    some techniques are designed for specific kinds of data, such as speech, pictures, or text
  Reasons for compression → smaller files
    use less storage
    can be transmitted faster, decreasing access time
    can be processed faster sequentially

6.1.1 Using a Different Notation
  Compact notation
    decrease the number of bits by finding a more compact notation ⇒ binary
    6 bits (2^6 = 64) suffice for 50 states (1 byte)
  Cost
    the file is unreadable by humans (binary encoding)
    cost of encoding and decoding (the compression algorithms are very simple)
    software complexity increases (an encoding/decoding module must be included)

6.1.2 Suppressing Repeating Sequences
  Run-length encoding
    (1) choose one special, unused byte value as the run-length code
    (2) run-length encoding algorithm
        (i) substitute the repeated pixels with the following three bytes
            ① special run-length code indicator (e.g., 0xff)
            ② repeated pixel value
            ③ number of times the value is repeated
    (e.g.) 22 23 24 24 24 24 24 24 24 25 26 26 26 26 26 26 25 24
        => 22 23 ff 24 07 25 ff 26 06 25 24
    does not guarantee any particular amount of space savings

6.1.3 Assigning Variable-length Codes (1/3)
  Variable-length codes (e.g., Morse code)
    based on the principle that some values occur more frequently than others
    the more frequently occurring letters get fewer symbols
    no delimiters
    implemented using a lookup table to encode or decode the data
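The run-length encoding steps of Section 6.1.2 can be sketched in C++ roughly as follows. This is only an illustration under assumptions that the slides do not state: the names runLengthEncode, runLengthDecode, kRunCode and kMinRun are hypothetical, runs shorter than four bytes are copied unchanged (a three-byte code would not save space for them), and the slide's example values are read as hexadecimal bytes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumptions of this sketch (not from the slides): 0xff is the unused byte
// chosen as run-length indicator, and only runs of kMinRun or more are coded.
constexpr std::uint8_t kRunCode = 0xff;
constexpr std::size_t kMinRun = 4;

std::vector<std::uint8_t> runLengthEncode(const std::vector<std::uint8_t>& in) {
    std::vector<std::uint8_t> out;
    std::size_t i = 0;
    while (i < in.size()) {
        assert(in[i] != kRunCode);  // the indicator must not occur in the data
        // Measure the run starting at i; cap it at 255 so the count fits in a byte.
        std::size_t run = 1;
        while (i + run < in.size() && in[i + run] == in[i] && run < 255) ++run;
        if (run >= kMinRun) {
            out.push_back(kRunCode);                        // (1) indicator
            out.push_back(in[i]);                           // (2) repeated value
            out.push_back(static_cast<std::uint8_t>(run));  // (3) repeat count
        } else {
            out.insert(out.end(), run, in[i]);              // short run: copy as is
        }
        i += run;
    }
    return out;
}

std::vector<std::uint8_t> runLengthDecode(const std::vector<std::uint8_t>& in) {
    std::vector<std::uint8_t> out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        if (in[i] == kRunCode && i + 2 < in.size()) {
            out.insert(out.end(), in[i + 2], in[i + 1]);    // expand count copies of value
            i += 2;                                         // skip value and count
        } else {
            out.push_back(in[i]);
        }
    }
    return out;
}
```

Applied to the slide's example sequence, the encoder produces 22 23 ff 24 07 25 ff 26 06 25 24, and decoding that result restores the original 18 bytes. On data without long runs the output is no shorter than the input, which is why no particular saving is guaranteed.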
6.1.3 Assigning Variable-length Codes (2/3)
  Lookup table
    used to encode and decode the data
    a fixed table is appropriate when the frequency distribution
      1. is predictable
      2. never changes
    for an unpredictable frequency distribution: use a Huffman code -> binary tree -> table
  Huffman encoding
    Letter:  a     b     c     d     e     f     g
    Prob.:   0.4   0.1   0.1   0.1   0.1   0.1   0.1
    Code:    1     010   011   0000  0001  0010  0011
    ex) the string "abde" is encoded as 101000000001

6.1.3 Assigning Variable-length Codes (3/3)
  Huffman code
    determine the probability of each value occurring in the data set, then build a binary tree
    more frequently occurring values are given shorter search paths
    tree for the table above (each node is labeled with its path from the root):
      root
        1    -> a(1)
        0
          01   -> b(010), c(011)
          00
            000  -> d(0000), e(0001)
            001  -> f(0010), g(0011)

6.1.4 Irreversible Compression Techniques
  Irreversible compression
    based on the assumption that some information can be sacrificed
    no way to return to the original data
    (e.g.) ① shrinking a raster image from 400x400 pixels to 100x100 pixels (16 pixels --> 1 pixel)
           ② voice coding with varying amounts of distortion

6.2 Reclaiming Space in Files
  Modification of a variable-length record (the new record is longer than the original record)
    1. append the extra data to the end of the file and put a pointer from the original record space to the extension => slower
    2. rewrite the whole record at the end of the file (if the file is not sorted), leaving a hole at the original location => wasted space
  Three forms of modification
    1. record addition
    2. record updating: deletion -> addition
    3. record deletion

6.2.1 Record Deletion and Storage Compaction (1/2)
  Approach to record deletion
    place a special mark in a special field of each deleted record
    e.g., an asterisk in the first field: Fig. 6.3(a), Fig.
6.3(b)
    (a) Ames|Mary|123 Maple|Stillwater|OK|74075|………………………
        Morrison|Sebastian|9035 South Hillcrest|Forest Village|OK|74820|
        Brown|Martha|625 Kimbark|Des Moines|IA|50311|………………..
    (b) Ames|Mary|123 Maple|Stillwater|OK|74075|…………………….
        *|rrison|Sebastian|9035 South Hillcrest|Forest Village|OK|74820|
        Brown|Martha|625 Kimbark|Des Moines|IA|50311|………………
    a program treats a marked record as deleted and ignores it
    adv.: a record can be undeleted with very little effort
    disadv.: the space is not reused for a while (reuse relies on storage compaction)

6.2.1 Record Deletion and Storage Compaction (2/2)
  Storage compaction
    make files smaller by looking for unused places in a file and then recovering this space (how often?)
    a special program reconstructs the file with all the deleted records squeezed out: Fig. 6.3(c)
  Compaction methods
    1. through a file copy program (out of place)
    2. through a more complicated and time-consuming compaction algorithm (in place)
    (c) Ames|Mary|123 Maple|Stillwater|OK|74075|………………………
        Brown|Martha|625 Kimbark|Des Moines|IA|50311|…………………

6.2.2 Deleting Fixed-length Records for Reclaiming Space Dynamically (1/6)
  Dynamic storage reclamation
    reuse the space from deleted records as soon as possible
    1. look through the file, record by record, until a deleted record is found (if none is found, append at the end) => slow
    2. know immediately whether there are empty slots in the file and jump directly to one of them => faster

6.2.2 Deleting Fixed-length Records for Reclaiming Space Dynamically (2/6)
  Linked list: Fig. 6.4
    avail list: a list of all the deleted records (≡ the available space within the file)
    handle the list as a stack
    links are RRNs (possible because the records are fixed-length)
    if the avail list is empty, a new record is added at the end of the file
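A minimal C++ sketch of an avail list handled as a stack, assuming an in-memory vector of strings as a stand-in for the file of fixed-length records. The class FixedRecordStore and its member names are hypothetical and exist only for this illustration; a real file would hold the head RRN in a header record and the links inside the deleted slots on disk.

```cpp
#include <cassert>
#include <string>
#include <vector>

// In-memory stand-in for a file of fixed-length records (hypothetical type).
// A deleted slot holds '*' followed by the RRN of the next free slot.
class FixedRecordStore {
public:
    // Delete: push the slot onto the avail list. This sketch does not guard
    // against deleting the same RRN twice.
    void remove(int rrn) {
        assert(rrn >= 0 && rrn < static_cast<int>(slots_.size()));
        slots_[rrn] = "*" + std::to_string(head_);  // link to the old top of the stack
        head_ = rrn;                                // this slot becomes the new top
    }

    // Insert: pop a free slot if one exists, otherwise append at the end.
    // Returns the RRN where the record was stored.
    int insert(const std::string& record) {
        if (head_ == -1) {
            slots_.push_back(record);
            return static_cast<int>(slots_.size()) - 1;
        }
        int rrn = head_;
        head_ = std::stoi(slots_[rrn].substr(1));   // next free slot becomes the head
        slots_[rrn] = record;
        return rrn;
    }

    int head() const { return head_; }
    const std::string& at(int rrn) const { return slots_[rrn]; }

private:
    std::vector<std::string> slots_;
    int head_ = -1;  // -1 marks an empty avail list
};
```

With seven records in RRNs 0 to 6, deleting 3, then 5, then 1 leaves the head at 1 with the chain 1 -> 5 -> 3 -> -1, and the next three insertions reuse RRNs 1, 5 and 3 in that order before any record is appended.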
6.2.2 Deleting Fixed-length Records for Reclaiming Space Dynamically (3/6)
  The linked list
    Head pointer -> ptr -> ptr -> ptr -> -1
  The stack
    (a) Head pointer (5) -> RRN 5 [link 2] -> RRN 2 [link -1]
    (b) Head pointer (3) -> RRN 3 [link 5] -> RRN 5 [link 2] -> RRN 2 [link -1]

6.2.2 Deleting Fixed-length Records for Reclaiming Space Dynamically (4/6)
  Where to keep the avail list?
    1. maintained in a separate file: no
    2. embedded within the data file: yes

6.2.2 Deleting Fixed-length Records for Reclaiming Space Dynamically (5/6)
    (a) after deleting records 3 and 5
        List head (first available record): 5
        RRN:     0           1         2         3     4           5    6
        Record:  Edwards...  Bates...  Wills...  *-1   Masters...  *3   Chavez...
    (b) after deleting record 1
        List head (first available record): 1
        RRN:     0           1    2         3     4           5    6
        Record:  Edwards...  *5   Wills...  *-1   Masters...  *3   Chavez...
    (c) after inserting three new records
        List head (first available record): -1
        RRN:     0           1            2         3            4           5            6
        Record:  Edwards...  1st new rec  Wills...  3rd new rec  Masters...  2nd new rec  Chavez...

6.2.2 Deleting Fixed-length Records for Reclaiming Space Dynamically (6/6)
  Implementation
    1. place deleted records on a linked avail list
    2. treat the avail list as a stack
    3. keep the RRN of the first available record in a header record

6.2.3 Deleting Variable-length Records (1/3)
  File structure for variable-length records
    length (byte count) field at the beginning of each record
  Avail list for variable-length records
    place a special asterisk in the first field, followed by a binary link field pointing to the next deleted record
    use byte offsets (not RRNs) for links

6.2.3 Deleting Variable-length Records (2/3)
  Sample file
    1. before deletion: Fig. 6.6(a)
    2. after deletion of the second record: Fig.
6.6(b)
    (a) HEAD.FIRST_AVAIL: -1
        40 Ames|Mary|123 Maple|Stillwater|OK|74075|64 Morrison|Sebastian|9035 South Hillcrest|Forest Village|OK|74820|45 Brown|Martha|625 Kimbark|Des Moines|IA|50311|
    (b) HEAD.FIRST_AVAIL: 43
        40 Ames|Mary|123 Maple|Stillwater|OK|74075|64 *|-1……………………………………………………………………45 Brown|Martha|625 Kimbark|Des Moines|IA|50311|

6.2.3 Deleting Variable-length Records (3/3)
  Access to the avail list
    search through the avail list for a record slot that is the right size or big enough (the list cannot be used as a stack)
  Example: a new record of size 55
    (a) Before removal: Size 47 -> Size 38 -> Size 72 -> Size 68 -> -1
    (b) After removal:  Size 47 -> Size 38 -> (new link) -> Size 68 -> -1
        Removed record: Size 72

6.2.4 Storage Fragmentation (1/5)
  Internal fragmentation: wasted space within a record
    unused space --> internal fragmentation
    64-byte fixed-length records:
        Ames|John|123 Maple|Stillwater|OK|74075|...................................
        Morrison|Sebastian|9035 South Hillcrest|Forest Village|OK|74820|
        Brown|Martha|625 Kimbark|Des Moines|IA|50311|.........................

6.2.4 Storage Fragmentation (2/5)
  File with variable-length records
    eliminates the wasted space due to internal fragmentation
    each record is preceded by its record length:
        Record[1]  40 Ames|John|123 Maple|Stillwater|OK|74075|
        Record[2]  64 Morrison|Sebastian|9035 South Hillcrest|Forest Village|OK|74820|
        Record[3]  45 Brown|Martha|625 Kimbark|Des Moines|IA|50311|
  Internal or external fragmentation
    ex) delete Record[2] and insert a new Record[i]: 64 - 27 = 37 bytes of unused space
        Record[i]  27 Ham|Al|28 Elm|Ada|OK|70322|

6.2.4 Storage Fragmentation (3/5)
  Record: deleted and replaced with a shorter record
    1. Use the entire original record slot: Fig. 6.10
       internal fragmentation within a variable-length record
    2.
Break the original record slot into two parts: Fig. 6.11
       one part for the new record and the other for other records (=> avail list)
       no internal fragmentation --> external fragmentation

6.2.4 Storage Fragmentation (4/5)
  External fragmentation: Fig. 6.12
    unused space outside or between individual records
    space that is actually on the avail list, but is too fragmented to be reused

6.2.4 Storage Fragmentation (5/5)
  Ways to combat external fragmentation
    1. Storage compaction
       regenerate the file when external fragmentation becomes intolerable
    2. Coalescing holes
       combine two physically adjacent record slots on the avail list into a single, larger record slot
       works well when the avail list is kept in physical record order
    3. Placement strategy
       select a record slot from the avail list in a way that minimizes fragmentation

6.2.5 Placement Strategies (1/2)
  Placement strategies
    1. First-fit placement
       treat the avail list as a stack
       look through the avail list to find the first record slot that is big enough
       not necessarily the best fit (the slot found may be 10 times too big or a perfect fit)
    2. Best-fit placement
       avail list: in ascending order by size
       the first slot encountered that is big enough is the smallest such slot
       ① must search through at least a part of the avail list
       ② results in external fragmentation
       ③ makes best-fit searches longer as time goes on

6.2.5 Placement Strategies (2/2)
    3. Worst-fit placement
       avail list: in descending order by size
       the first record slot encountered is the largest record slot
       ① if the first record slot is not large enough, none of the others will be
       ② the unused portion of the slot is as large as possible => decreases the likelihood of external fragmentation
  Conclusion
    no one placement strategy is superior in all circumstances
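The three placement strategies can be compared with a small C++ sketch. It departs from the slides in one respect that should be kept in mind: instead of keeping the avail list sorted by size, it scans an unordered list of slot sizes and picks the slot each strategy would end up with. The names Placement, chooseSlot and leftover are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

enum class Placement { FirstFit, BestFit, WorstFit };

// Returns the index of the chosen slot in sizes, or -1 if no slot is big enough.
// sizes holds the slot sizes in avail-list order (unordered in this sketch).
int chooseSlot(const std::vector<std::size_t>& sizes, std::size_t needed,
               Placement strategy) {
    int chosen = -1;
    for (std::size_t i = 0; i < sizes.size(); ++i) {
        if (sizes[i] < needed) continue;             // too small for the new record
        if (chosen == -1) {
            chosen = static_cast<int>(i);            // first slot that fits
            if (strategy == Placement::FirstFit) break;
            continue;
        }
        std::size_t current = sizes[static_cast<std::size_t>(chosen)];
        if (strategy == Placement::BestFit && sizes[i] < current)
            chosen = static_cast<int>(i);            // tighter fit found
        if (strategy == Placement::WorstFit && sizes[i] > current)
            chosen = static_cast<int>(i);            // larger slot found
    }
    return chosen;
}

// Bytes left over in a slot after the new record is placed in it.
std::size_t leftover(std::size_t slotSize, std::size_t needed) {
    assert(slotSize >= needed);
    return slotSize - needed;
}
```

For the avail list of Section 6.2.3 (sizes 47, 38, 72, 68) and a 55-byte record, first fit and worst fit both pick the 72-byte slot (17 bytes left over) while best fit picks the 68-byte slot (13 bytes left over). The small remainder left by best fit is the kind of fragment that tends to be too small to reuse.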
6.3 Finding Things Quickly
6.3.1 Finding Things in Simple Field and Record Files
  Direct access methods
    1. by Relative Record Number for fixed-length records (--> record's byte offset --> jump to it using direct access)
    2. by the record's byte offset for variable-length records
    3. by the record's key value
       keyed access --> sequential search?
       if there is no matching record, or there may be several => look through the entire file

6.3.2 Search by Guessing: Binary Search
  Binary search: Fig. 6.13, Fig. 6.14
    for a file which is sorted in ascending order by key

    int BinarySearch(FixedRecordFile &file, RecType &obj, KeyType &key)
    // Binary search: if the key is found, obj holds the matching record and 1 is returned
    {
        int low = 0;
        int high = file.NumRecs() - 1;
        while (low <= high) {
            int guess = (high + low) / 2;             // probe the middle record
            file.ReadByRRN(obj, guess);
            if (obj.Key() == key) return 1;           // record found
            if (obj.Key() < key) low = guess + 1;     // search the part after guess
            else high = guess - 1;                    // search the part before guess
        }
        return 0;  // loop ended without finding the key
    }

6.3.2 Search by Guessing: Binary Search
  Example: a file of 1,000 fixed-length records
    (1) sequential search: at most 1,000 comparisons (on average 500 comparisons)
    (2) binary search: at most 10 comparisons

6.3.3 Binary Search Versus Sequential Search
  Comparisons needed for n records
                         at most          on average        order
    Binary search        ⌊log2 n⌋ + 1     ⌊log2 n⌋ + 1/2    O(log2 n)
    Sequential search    n                n/2               O(n)
  Example: a file of 2,000 records
    1. binary search: at most ⌊log2 2,000⌋ + 1 = 11 comparisons (=> sorting is needed !!!)
    2. sequential search: at most n = 2,000 comparisons
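The comparison bounds above can be checked with a self-contained C++ sketch that searches a sorted in-memory vector rather than a record file. The probes counter and the helper maxProbes are additions made for this illustration only; they are not part of the chapter's BinarySearch function.

```cpp
#include <cassert>
#include <vector>

// Returns the index of key in the ascending vector keys, or -1 if it is absent.
// probes reports how many elements were examined (added for this sketch).
int binarySearch(const std::vector<int>& keys, int key, int& probes) {
    probes = 0;
    int low = 0;
    int high = static_cast<int>(keys.size()) - 1;
    while (low <= high) {
        int guess = low + (high - low) / 2;       // middle of the remaining range
        ++probes;
        if (keys[guess] == key) return guess;     // found
        if (keys[guess] < key) low = guess + 1;   // continue in the upper half
        else high = guess - 1;                    // continue in the lower half
    }
    return -1;
}

// floor(log2 n) + 1 for n >= 1: the worst-case number of probes.
int maxProbes(int n) {
    assert(n >= 1);
    int count = 0;
    while (n > 0) {
        n /= 2;
        ++count;
    }
    return count;
}
```

maxProbes(1000) evaluates to 10 and maxProbes(2000) to 11, matching the two examples. Writing the midpoint as low + (high - low) / 2 gives the same value as (high + low) / 2 while avoiding integer overflow for very large ranges.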
6.3.4 Sorting a Disk File in Memory
  Sorting with very large memory
    (i) read the entire file from the disk into memory sequentially
    (ii) do the internal sorting in memory
  [Figure: unsorted file on disk --(read the entire file, sequential access)--> unsorted file in memory --(sort in memory)--> sorted file]

6.3.5 The Limitations of Binary Searching and Internal Sorting (1/3)
  Problems of "sort, then binary search"
    1. Binary searching requires more than one or two accesses
       average case: ⌊log2 n⌋ + 1/2 comparisons (accesses)
         1,000 items ... 9.5 accesses
         100,000 items ... 16.5 accesses
       ① access by RRN (performance): a single access
       ② access by key (usefulness): use of index structures

6.3.5 The Limitations of Binary Searching and Internal Sorting (2/3)
    2. Keeping a file sorted is very expensive
       ① insertion
          (i) read through half the records
          (ii) shift the records to open up the space
       ② searching
          binary search => the benefits of faster retrieval can more than offset the costs of keeping the file sorted
       better solutions for the problem of "access by key"
          1. do not involve reordering the records in the file when a new record is added => use of indexes or hashing
          2. are associated with data structures that allow substantially more rapid, efficient reordering of the file => use of tree structures, such as a B-tree

6.3.5 The Limitations of Binary Searching and Internal Sorting (3/3)
    3. An internal sort works only on small files
       small file (fits in memory) ... internal sort
       large file ... external sort => "keysort"

6.4 Keysorting
  Keysort (tag sort)
    (i) read the keys from the file into memory
    (ii) sort them in memory
    (iii) rearrange the records in the file according to the new ordering of the keys
    can sort larger files entirely in memory, since only the keys are held there
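The three keysort steps can be sketched in C++ as follows, under several assumptions made only for illustration: the file is modeled as a vector of '|'-delimited records indexed by a 0-based RRN, the key is the first field converted to upper case, and the names KeyNode, extractKey, buildSortedKeyNodes and keysort are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// One entry of the in-memory key array: a key plus the RRN of its record.
struct KeyNode {
    std::string key;
    int rrn;
};

// Assumption: the key is the first '|'-delimited field, in upper case.
std::string extractKey(const std::string& record) {
    std::string key = record.substr(0, record.find('|'));
    for (char& c : key)
        c = static_cast<char>(std::toupper(static_cast<unsigned char>(c)));
    return key;
}

// Steps (i) and (ii): collect the keys with their RRNs, then sort by key.
std::vector<KeyNode> buildSortedKeyNodes(const std::vector<std::string>& file) {
    std::vector<KeyNode> nodes;
    for (std::size_t rrn = 0; rrn < file.size(); ++rrn)
        nodes.push_back({extractKey(file[rrn]), static_cast<int>(rrn)});
    std::sort(nodes.begin(), nodes.end(),
              [](const KeyNode& a, const KeyNode& b) { return a.key < b.key; });
    return nodes;
}

// Step (iii): fetch each record by RRN in key order and write it to the output.
std::vector<std::string> keysort(const std::vector<std::string>& file) {
    std::vector<std::string> sorted;
    for (const KeyNode& node : buildSortedKeyNodes(file)) {
        assert(node.rrn >= 0 && node.rrn < static_cast<int>(file.size()));
        sorted.push_back(file[node.rrn]);  // on disk this would be a seek by RRN
    }
    return sorted;
}
```

Only the keys and RRNs are sorted in memory; each full record is touched once, in step (iii), where the sketch indexes the vector and a disk-based version would seek to the record.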
6.4.1 Description of the Method
  Conceptual view before sorting
    KEYNODES array (in memory)        Records (on secondary store)
    KEY        RRN                    RRN
    HARRISON   1                      1    Harrison|Susan|387 Eastern....
    KELLOG     2                      2    Kellog|Bill|17 Maple....
    HARRIS     3                      3    Harris|Margaret|4343 West....
    . . . .                                . . . .
    BELL       k                      k    Bell|Robert|8912 Hill....
  Conceptual view after sorting keys in memory
    KEYNODES array (in memory)        Records (on secondary store)
    KEY        RRN                    RRN
    BELL       k                      1    Harrison|Susan|387 Eastern....
    HARRIS     3                      2    Kellog|Bill|17 Maple....
    HARRISON   1                      3    Harris|Margaret|4343 West....
    . . . .                                . . . .
    KELLOG     2                      k    Bell|Robert|8912 Hill....

6.4.2 Limitations of the Keysort Method
  Limitations
    1. the records are read in a second time, not sequentially but randomly by RRN (--> random seeks)
    2. the new sorted file is written out sequentially (--> seeks)
    => the disk drive must move the head back and forth between the two files as it reads and writes

6.4.3 Another Solution (1/2)
  Solution
    do not write the records back in sorted order
    write out the contents of the KEYNODES[] array as an index file
  Looking for a record
    1. do a binary search on the index file (=> RRN)
    2. use the RRN to find the corresponding record

6.4.3 Another Solution (2/2)
    Index file                        Original file
    KEY        RRN                    RRN
    BELL       k                      1    Harrison|Susan|387 Eastern....
    HARRIS     3                      2    Kellog|Bill|17 Maple....
    HARRISON   1                      3    Harris|Margaret|4343 West....
    . . . .                                . . . .
    KELLOG     2                      k    Bell|Robert|8912 Hill....
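The two lookup steps of Section 6.4.3 can be sketched in C++ as below. The sketch assumes that the index and the data file are both held in memory as vectors and that RRNs are 0-based; IndexEntry and findByKey are hypothetical names. The binary search itself is delegated to std::lower_bound from the standard library.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// One index entry: a key and the RRN of the record it identifies.
struct IndexEntry {
    std::string key;
    int rrn;
};

// Step 1: binary search the index (sorted by key) to obtain the RRN.
// Step 2: use the RRN to fetch the record from the unsorted data file.
// Returns nullptr when the key is not in the index.
const std::string* findByKey(const std::vector<IndexEntry>& index,
                             const std::vector<std::string>& file,
                             const std::string& key) {
    auto it = std::lower_bound(
        index.begin(), index.end(), key,
        [](const IndexEntry& entry, const std::string& k) { return entry.key < k; });
    if (it == index.end() || it->key != key) return nullptr;
    assert(it->rrn >= 0 && it->rrn < static_cast<int>(file.size()));
    return &file[it->rrn];  // on disk: one direct access by RRN
}
```

Because only the index is kept in key order, the data file never has to be rewritten, and a lookup costs one binary search over the index plus a single direct access to the data file.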