CHAPTER I
INTRODUCTION

1.1 INTRODUCTION

Compression algorithms reduce the redundancy in data representation in order to decrease the storage required for that data. Data compression also offers an attractive approach to reducing communication costs by using the available bandwidth effectively. Over the last decade, there has been an unprecedented explosion in the amount of digital data transmitted via the Internet. The advent of office automation systems and of newspaper, journal and magazine repositories has brought the issue of maintaining archival storage for search and retrieval to the forefront of research. As an example, the Text REtrieval Conference (TREC) database holds around 800 million static pages containing 6 trillion bytes of plain text, equal to the size of a million books.

Text compression is concerned with techniques for representing digital data in alternative representations that take less space. It not only helps to conserve storage space for archival and online data, but also improves system performance by requiring fewer secondary storage accesses, and improves network bandwidth utilization by reducing transmission time.

Data compression methods are generally classified into lossless and lossy [22][72]. Lossless compression allows the original data to be recovered exactly [117][58]. Lossless compression is required for special classes of data such as medical images, fingerprint data, astronomical images and databases containing mostly vital numerical data, tables and text, whereas lossy compression is typically used for video, audio and still-image applications. The deterioration in quality introduced by lossy compression is usually not detectable by the human perceptual system, and lossy methods exploit this through a process called "quantization" to achieve compression by a factor of ten up to a factor of a few hundred. Lossless methods are important for both lossy and lossless applications, since a lossy algorithm may use a lossless coder at the final stage of encoding.

One of the main aims of computational complexity theory is to determine upper and lower bounds on the amount of resources required to solve a computational problem. The resources considered most commonly are time and space. When several agents are required to arrive at an answer, the input must also be distributed among them. A solution can be judged efficient only when its computational complexity is evaluated. In addition to computational complexity, factors such as compression ratio, compression factor and saving percentage need to be considered. The various parameters used to measure the performance and ability of data compression algorithms are discussed in this chapter.

The purpose of this thesis is to re-examine selected algorithms (Huffman coding, Lempel-Ziv 77 (LZ77), Run Length Encoding (RLE), Arithmetic Coding and Lempel-Ziv-Welch (LZW)) and to evaluate their performance on the basis of parameters such as compression ratio, compression factor and saving percentage. From this re-examination, the LZW data compression algorithm is found to be the best. This thesis then presents a novel approach for optimally reducing the computational complexity of the LZW algorithm. This chapter briefly explains the compression types, the compression measurement parameters, the datasets used for the experimentation and the basic concepts of data compression theory.
1.2 OBJECTIVES OF THE THESIS

The objectives of the thesis are as follows:
• To study in detail the well-known lossless data compression algorithms.
• To analyze the problems of existing compression algorithms and the various parameters used to test them.
• To analyze the computational complexity of LZW, using all the well-known data structures for experimentation in the LZW architecture.
• To improve the performance of binary search through a new proposed algorithm called Binary Insertion Sort.
• To improve the performance of the data structures through a newly designed and proposed clustering algorithm.
• To design and develop a new algorithm that reduces the computational complexity of LZW by combining the proposed clustering algorithm with multiple dictionaries.

1.3 INFORMATION THEORY AND BACKGROUND

Lossy and lossless data compression algorithms play a key role in information technology, especially in data transfer through networks and in memory utilization.

1.3.1 HISTORY OF DATA COMPRESSION

Morse code [5], invented in 1838 for use in telegraphy, is an early example of data compression based on using shorter code words for letters such as "e" and "t" that are more common in English. Modern work on data compression began in the late 1940s with the development of information theory. In 1949, Claude Shannon [103] and Robert Fano [27] devised a systematic way to assign code words based on the probabilities of blocks. An optimal method for doing this was found by David Huffman in 1951 [41]. Early implementations were typically done in hardware, with specific choices of code words made as compromises between compression and error correction. In the mid-1970s, the idea emerged of dynamically updating the code words for Huffman encoding based on the actual data encountered. In the late 1970s, online storage of text files became common and software compression programs began to be developed, almost all based on adaptive Huffman coding. In 1977, Abraham Lempel and Jacob Ziv suggested the basic idea of pointer-based encoding. In the mid-1980s, following the work of Terry Welch [121], the LZW algorithm [44] rapidly became the method of choice for most general-purpose compression systems. It was used in the program PKZip (Phil Katz) and in hardware devices such as modems.

In the late 1980s, digital images became more common, and standards for compressing them emerged. In the early 1990s, lossy compression methods (to be discussed in the next section of this chapter) also began to be used widely. Current image compression standards include: FAX CCITT 3 (Facsimile, International Telegraph and Telephone Consultative Committee), which uses Run-Length Encoding (RLE) with code words determined by Huffman coding from a fixed distribution of run lengths; Graphics Interchange Format (GIF), which uses LZW; JPEG (lossy discrete cosine transform with Huffman or arithmetic coding); BitMaP (Run-Length Encoding, etc.); and Tagged Image File Format (TIFF) (FAX, Joint Photographic Experts Group (JPEG), GIF, etc.). Typical compression ratios currently achieved are around 3:1 for text, around 3:1 for line diagrams and text images, and for photographic images around 2:1 lossless and 20:1 lossy. The mathematical notation used throughout the thesis is described in Definition 1.

DEFINITION 1

The input sequence is X = x_1, x_2, ..., x_n. The sequence length is denoted by n or |X|, i.e., the total number of elements in X, and x_i denotes the i-th element of X. Each element belongs to the finite ordered set A = {a_1, a_2, ..., a_k}, called the alphabet, with the corresponding probabilities P = {p_1, p_2, ..., p_k}.
The number of elements in A is the size of the alphabet, denoted by k or |A|; the elements of the alphabet are called symbols or characters. Let p_i be the weight, or probability, of a_i. The sequence after compression is C = c_1, c_2, ..., c_m, where m or |C| is the length of the compressed sequence. X̄ = x̄_1, x̄_2, ..., x̄_n is the sequence after decompression, the size of this sequence is |X̄|, and the time taken is denoted by T. T_min(E(|X|)) represents the minimum time taken by the algorithm to encode X and T_max(E(|X|)) represents the maximum time taken by the algorithm to encode X. Similarly, T_min(D(|C|)) represents the minimum time taken by the algorithm to decode C and T_max(D(|C|)) represents the maximum time taken by the algorithm to decode C. D is used to represent the dictionary, with size |D|; when the algorithm uses many dictionaries, the number of dictionaries is denoted by M, i.e., the dictionaries are D_1 to D_M. The length of each dictionary is |D_i| = d_i, where i = 1, 2, ..., M and the sum over all dictionaries satisfies Σ d_i = Σ |D_i| = |D|. φ indicates an empty or null dictionary; initially each |D_i| = φ, and m is the number of non-empty dictionaries. Only non-empty dictionaries are considered when measuring computational cost. K indicates the number of clusters, where CL_j indicates the j-th cluster and j ranges from 1 to K, ICL_j indicates the corresponding index cluster and also ranges from 1 to K, and the size of each cluster is denoted by |CL_j|.

1.4 DATA COMPRESSION

In general, data compression consists of taking a stream of symbols X and transforming it into codes. If the compression is effective, the resulting stream of codes is smaller than the original stream of symbols, i.e., |C| < |X| [117][58][3][72]. The decision to output a certain code for a certain symbol or set of symbols is based on a model. The model is simply a collection of data and rules used to process the input symbols and determine which codes to output. A program uses the model to accurately define the probabilities of each symbol, and a coder to produce an appropriate code based on those probabilities P. Decompression is the process of transforming the code C back into X̄, in other words mapping the codes back to the original symbols. Figure 1.1 shows the statistical model of the encoder and decoder.

Figure 1.1 Statistical Models of Encoder and Decoder

1.5 CLASSIFICATION OF DATA COMPRESSION ALGORITHMS

The classification of data compression algorithms is based purely on the amount of data lost during decoding [72]. The primary classification of compression algorithms is listed below.
• Lossless compression
• Lossy compression
Lossless algorithms can be further classified into three broad categories: statistical methods, dictionary methods and transform-based methods. The classification of algorithms is shown in figure 1.2.

1.5.1 LOSSLESS DATA COMPRESSION ALGORITHM

Lossless data compression is a class of data compression algorithms that allows the exact original data to be reconstructed from the compressed data. The term lossless is in contrast to lossy data compression, which only allows an approximation of the original data to be reconstructed, in exchange for better compression rates. Lossless data compression is used in many applications. For example, it is used in the ZIP file format and in the UNIpleXed Information and Computing System (UNIX) tool Gnu ZIP (GZIP). It is often used as a component within lossy data compression technologies (e.g. lossless mid/side joint stereo preprocessing by the LAME MP3 encoder and other lossy audio encoders).

Figure 1.2 Data Compression Classification Chart

Lemma 1: If the compression is lossless, then D(X̄, X) = 0.
Proof: The function D is used to calculate the difference between X and X̄ element by element, where i = 1, 2, ..., n and |X| - |X̄| = 0.
Then [(x̄_1 - x_1) + (x̄_2 - x_2) + ... + (x̄_n - x_n)] = 0; as the percentage of loss is nil, these algorithms come under lossless data compression.

Lossless compression is used in cases where it is important that the original and the decompressed data be identical, or where deviations from the original data could be deleterious. Typical examples are executable programs, text documents and source code. Some image file formats, like Portable Network Graphics (PNG) or GIF, use only lossless compression, while others like TIFF may use either a lossless or a lossy method. Lossless audio formats are most often used for archiving or production purposes, with smaller lossy audio files typically used on portable players and in other cases where storage space is limited or exact replication of the audio is unnecessary.

1.5.1.1 DICTIONARY BASED DATA COMPRESSION ALGORITHM

A dictionary coder [72], also sometimes known as a substitution coder, is a class of lossless data compression algorithms which operate by searching for matches between the text to be compressed and a set of strings contained in a data structure (called the 'dictionary') maintained by the encoder. When the encoder finds such a match, it substitutes a reference to the string's position in the data structure.

Some dictionary coders use a 'static dictionary', one whose full set of strings is determined before coding begins and does not change during the coding process. This approach is most often used when the message or set of messages to be encoded is fixed and large. For instance, an application that stores the contents of the Bible in the limited storage space of a Personal Digital Assistant (PDA) generally builds a static dictionary from a concordance of the text and then uses that dictionary to compress the verses. This scheme of using Huffman coding to represent indices into a concordance is called "Huffword".

More common are methods where the dictionary starts in some predetermined state but its contents change during the encoding process, based on the data that has already been encoded. Both the Lempel-Ziv 77 (LZ77) and Lempel-Ziv 78 (LZ78) algorithms work on this principle. In LZ77, a data structure called the "sliding window" is used to hold the last N bytes of data processed; this window serves as the dictionary, effectively storing every substring that appeared in the past N bytes as a dictionary entry. Instead of a single index identifying a dictionary entry, two values are needed: the length, which indicates the length of the matched text, and the offset (also called the distance), which indicates that the match is found in the sliding window starting offset bytes before the current text.

LZ78 uses a more explicit dictionary structure. At the beginning of the encoding process, the dictionary only needs to contain entries for the symbols of the alphabet used in the text to be compressed, but the indexes are numbered so as to leave space for many more entries. At each step of the encoding process, the longest entry in the dictionary that matches the text is found, its index is written to the output, and the combination of that entry and the character that follows it in the text is then added to the dictionary as a new entry.

Lossless compression algorithms usually exploit statistical redundancy in such a way as to represent the sender's data more concisely, but nevertheless perfectly. Lossless compression is possible because most real-world data has statistical redundancy. For example, in English text, the letter 'e' is more common than the letter 'z', and the probability that the letter 'q' will be followed by the letter 'z' is very low.
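To make the dictionary growth described above concrete, the following is a minimal Java sketch of LZW-style encoding, not the optimized implementations studied in later chapters of this thesis. The class name, the use of a HashMap as the dictionary and the initialization over 256 single-character strings are illustrative assumptions.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class LzwSketch {
    // Minimal LZW-style encoder: the dictionary starts with all single
    // characters; at each step the longest match is output as an index and
    // the match extended by the next character is added as a new entry.
    public static List<Integer> encode(String input) {
        Map<String, Integer> dictionary = new HashMap<String, Integer>();
        for (int c = 0; c < 256; c++) {
            dictionary.put(String.valueOf((char) c), c);
        }
        List<Integer> output = new ArrayList<Integer>();
        String w = "";
        for (char c : input.toCharArray()) {
            String wc = w + c;
            if (dictionary.containsKey(wc)) {
                w = wc;                                // keep extending the current match
            } else {
                output.add(dictionary.get(w));         // emit index of the longest match
                dictionary.put(wc, dictionary.size()); // new dictionary entry
                w = String.valueOf(c);
            }
        }
        if (!w.isEmpty()) {
            output.add(dictionary.get(w));             // flush the final match
        }
        return output;
    }

    public static void main(String[] args) {
        System.out.println(encode("TOBEORNOTTOBEORTOBEORNOT"));
    }
}
```

With a hash-based dictionary each lookup is an expected constant-time operation; later chapters of this thesis examine how the choice of dictionary data structure affects the computational cost of LZW.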
1.5.1.2 STATISTICAL BASED DATA COMPRESSION ALGORITHM

The other type is the statistical data compression algorithm [3][72]. A statistical algorithm uses a static table of probabilities. The cost of building a Huffman tree for every file was considered significant, so it was not frequently performed. Instead, representative blocks of data were analyzed once, giving a table of character frequency counts; Huffman encoding/decoding trees were then built and stored. Compression programs had access to this static model and would compress data using it.

Using a universal static model has limitations. If an input stream X does not match well with the previously accumulated statistics, the compression ratio will be degraded, possibly to the point where the output stream C becomes larger than the input stream X. Building a static table for each file to be compressed has its own advantages: as the table is uniquely adapted to that particular file, it gives better compression than a universal table. But there is some additional overhead, since the table (or the statistics used to build the table) has to be passed to the decoder ahead of the compressed code stream. For an order-0 compression table, the actual statistics used to create the table may take up as little as 256 bytes, not a very large amount of overhead. But trying to achieve better compression through the use of a higher-order table makes the statistics that need to be passed to the decoder grow at an alarming rate: just moving to an order-1 model can boost the size of the statistics table from 256 to 65536 bytes. Although the compression ratio will undoubtedly improve when moving to order-1, the overhead of passing the statistics table will probably wipe out the gains. For this reason, compression research in the last ten years has concentrated on adaptive models. The most famous statistical coders are Huffman coding, Shannon-Fano coding, etc.

1.5.1.3 TRANSFORMATION BASED DATA COMPRESSION ALGORITHM

The word "transform" is used to describe this method because X undergoes a transformation which permutes the elements of X, so that elements with similar lexical context are clustered together in the output. Given X, the forward Burrows-Wheeler Transform (BWT) [13] forms all cyclic rotations of X as the rows of a matrix M, whose rows are sorted lexicographically (according to the ordering of the symbols in X). The last column L of this sorted M, together with the index r of the row of M that contains the original text, is output as the transform. X is divided into blocks, or the entire X may be treated as one block; the transformation is applied to individual blocks separately, and for this reason this type of algorithm is referred to as block sorting [29]. Repetition of the same character within a block may slow the sorting process; to avoid this, a Run Length Encoding step may precede the transform.

The Bzip2 algorithm, which is based on this transformation, uses the following steps: the output of the BWT undergoes a further transformation using Move-To-Front (MTF) encoding or Distance Coding (DC) [9], which exploits the clustering of characters in the BWT output to generate a sequence of numbers dominated by small values. This sequence is sent to an entropy coder (Huffman or arithmetic) to obtain the final compressed form. The inverse operation of recovering X from C proceeds by applying the entropy decoder, then the inverse of MTF or DC, and then the inverse of the BWT. The inverse BWT recovers X̄, where |X̄| = |X|.
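As an illustration of the forward transform just described, the following is a deliberately naive Java sketch that builds the cyclic rotations of a block, sorts them, and returns the last column L together with the row index r. The class and method names are illustrative assumptions, and the rotation-and-sort approach is for clarity only; practical implementations such as bzip2 use far more efficient suffix sorting.

```java
import java.util.Arrays;

public class BwtSketch {
    // Result of the forward BWT: the last column L and the index r of the
    // row of the sorted rotation matrix M that holds the original block.
    public static class Result {
        public final String lastColumn;
        public final int primaryIndex;
        Result(String l, int r) { lastColumn = l; primaryIndex = r; }
    }

    public static Result transform(String block) {
        int n = block.length();
        String[] rotations = new String[n];
        for (int i = 0; i < n; i++) {
            // i-th cyclic rotation of the block
            rotations[i] = block.substring(i) + block.substring(0, i);
        }
        Arrays.sort(rotations);                  // lexicographic sort of matrix M
        StringBuilder last = new StringBuilder();
        int r = -1;
        for (int i = 0; i < n; i++) {
            last.append(rotations[i].charAt(n - 1)); // last column L
            if (rotations[i].equals(block)) {
                r = i;                           // row containing the original text
            }
        }
        return new Result(last.toString(), r);
    }

    public static void main(String[] args) {
        Result res = transform("banana");
        System.out.println(res.lastColumn + " " + res.primaryIndex); // nnbaaa 3
    }
}
```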
1.5.2 LOSSY DATA COMPRESSION ALGORITHM

In lossy data compression, the decompressed data need not be exactly the same as the original data; often, it suffices to have a reasonably close approximation. A distortion measure is a mathematical entity which specifies exactly how close the approximation is [103]. Generally, it is a function which assigns to any two letters x and x̄ in the alphabet a non-negative number, denoted

d(x, x̄) ≥ 0    (1.1)

Here, x is the original symbol, x̄ is its approximation, and d(x, x̄) is the amount of distortion between x and x̄. The most common distortion measure is the Hamming distortion measure:

d(x, x̄) = 0 if x = x̄, and d(x, x̄) = 1 if x ≠ x̄    (1.2)

Lossy data compression is contrasted with lossless data compression. In lossy schemes, some loss of information is acceptable: depending on the application, details can be dropped from the data to save storage space. Generally, lossy data compression schemes are guided by research on how people perceive the data in question. For example, the human eye is more sensitive to subtle variations in luminance than in colour. JPEG image compression works in part by "rounding off" less important visual information. There is a corresponding trade-off between information lost and size reduction. A number of popular compression formats exploit these perceptual differences, including those used in music files, images and video.

Lossy image compression can be used in digital cameras to increase storage capacity with minimal degradation of picture quality. Similarly, Digital Video Discs (DVDs) use the lossy Motion Picture Experts Group (MPEG)-2 video codec for video compression. In lossy audio compression, methods of psychoacoustics are used to remove non-audible (or less audible) components of the signal. Compression of human speech is often performed with even more specialized techniques, so that "speech compression" or "voice coding" is sometimes distinguished as a separate discipline from "audio compression". Different audio and speech compression standards are listed under audio codecs. Voice compression is used in Internet telephony, for example, while audio compression is used for Compact Disc (CD) ripping and is decoded by audio players.

Lemma 2: If the compression is lossy, then D(X̄, X) ≠ 0.
Proof: The function D is used to calculate the difference between X and X̄ element by element, where i = 1, 2, ..., n and |X| - |X̄| may or may not be zero. Then [(x̄_1 - x_1) + (x̄_2 - x_2) + ... + (x̄_n - x_n)] < 0 or > 0. As the percentage of loss of original symbols is not nil, these algorithms come under lossy data compression.
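As a simple illustration of the difference function D used in Lemmas 1 and 2 and of the Hamming distortion of equation (1.2), the following Java sketch counts the positions at which a reconstructed block differs from the original. The class and method names are illustrative assumptions.

```java
public class Distortion {
    // Total Hamming distortion between X and its reconstruction X̄: each
    // position contributes 0 if the symbols match and 1 otherwise (equation
    // 1.2). A total of 0 corresponds to lossless recovery (Lemma 1); a
    // non-zero total corresponds to lossy recovery (Lemma 2).
    public static int hammingDistortion(byte[] original, byte[] reconstructed) {
        if (original.length != reconstructed.length) {
            throw new IllegalArgumentException("sequences must have equal length");
        }
        int d = 0;
        for (int i = 0; i < original.length; i++) {
            if (original[i] != reconstructed[i]) {
                d++;
            }
        }
        return d;
    }

    public static void main(String[] args) {
        byte[] x = {1, 2, 3, 4};
        byte[] xBar = {1, 2, 0, 4};
        System.out.println(hammingDistortion(x, xBar)); // prints 1
    }
}
```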
1.6 PARAMETERS TO MEASURE THE PERFORMANCE OF COMPRESSION ALGORITHMS

Depending on the nature of the application, there are various criteria for measuring the performance of a compression algorithm. When measuring performance, the main concern is space efficiency; time efficiency is another factor. Since the compression behaviour depends on the redundancy of the symbols in X, it is difficult to measure the performance of a compression algorithm in general. The performance depends on the type and the structure of the input source. Additionally, the compression behaviour depends on the category of the compression algorithm: lossy or lossless. If a lossy compression algorithm is used to compress a particular source file, its space efficiency and time efficiency will generally be higher than those of a lossless compression algorithm. Thus measuring general performance is difficult, and different measurements are needed to evaluate the performance of the two compression families. The following are some measurements used for evaluating the performance of lossless algorithms.

COMPRESSION RATIO - Data compression ratio [42], also known as compression power, is a computer science term used to quantify the reduction in data-representation size produced by a data compression algorithm. The data compression ratio is analogous to the physical compression ratio used to measure physical compression of substances, and is defined in the same way, as the ratio between the compressed size and the uncompressed size:

Compression Ratio = |C| / |X|    (1.3)

Thus a representation that compresses a 10 MB file to 2 MB has a compression ratio of 2/10 = 0.2, often notated as an explicit ratio, 1:5 (read "one to five"), or as an implicit ratio, 1/5. Note that this formulation applies equally for compression, where the uncompressed size is that of the original, and for decompression, where the uncompressed size is that of the reproduction.

COMPRESSION FACTOR - The compression factor is the inverse of the compression ratio, i.e., the ratio between the size of the source file and the size of the compressed file:

Compression Factor = |X| / |C|    (1.4)

SAVING PERCENTAGE - The saving percentage expresses the shrinkage of the source file as a percentage:

Saving Percentage = (1 - |C| / |X|) × 100%    (1.5)

COMPRESSION AND DECOMPRESSION TIME - All the above methods evaluate the effectiveness of compression algorithms using file sizes. There are other methods to evaluate the performance of compression algorithms: compression time, computational complexity and probability distribution are also used to measure effectiveness. The time taken for compression and for decompression should be considered separately. In some applications, like transferring compressed video data, the decompression time is more important; in other cases both compression and decompression time are equally important. If the compression and decompression times of an algorithm are at an acceptably low level, the algorithm is acceptable with respect to the time factor. With the development of high-speed computer hardware this factor may give very small values, and those values depend on the performance of the computer. Here T_E(|X|) and T_D(|C|) represent the time taken for encoding and decoding respectively. In reality, the time taken by a process is not constant across executions, and the average is not an adequate representation of the time taken; it always lies between the minimum time T_min and the maximum time T_max required by the process. The time taken per bit for encoding and decoding is calculated as follows:

Time per bit (encoding) = T_E(|X|) / |X|    (1.6)

Time per bit (decoding) = T_D(|C|) / |C|    (1.7)
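The three size-based measures of equations (1.3)-(1.5) can be computed directly from the uncompressed and compressed sizes, as in the following Java sketch; the class and method names are illustrative assumptions.

```java
public class CompressionMeasures {
    // Size-based performance measures from equations (1.3)-(1.5).
    public static void report(long uncompressedBytes, long compressedBytes) {
        double x = uncompressedBytes;   // |X|
        double c = compressedBytes;     // |C|
        double ratio  = c / x;                 // compression ratio, eq. (1.3)
        double factor = x / c;                 // compression factor, eq. (1.4)
        double saving = (1.0 - c / x) * 100.0; // saving percentage, eq. (1.5)
        System.out.printf("ratio=%.3f factor=%.3f saving=%.2f%%%n",
                ratio, factor, saving);
    }

    public static void main(String[] args) {
        // The 10 MB -> 2 MB example from the text:
        // ratio = 0.200, factor = 5.000, saving = 80.00%
        report(10L * 1024 * 1024, 2L * 1024 * 1024);
    }
}
```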
ENTROPY - Shannon borrowed the definition of entropy from statistical physics to capture the notion of how much information is contained in a set of messages and their probabilities. For a set of possible messages S, Shannon defined the entropy [58] as

H(S) = Σ_{s∈S} p(s) log2(1/p(s))    (1.8)

where p(s) is the probability of message s. The definition of entropy is very similar to that used in statistical physics: in physics, S is the set of possible states a system can be in and p(s) is the probability that the system is in state s. The second law of thermodynamics basically says that the entropy of a system and its surroundings can only increase. Returning to messages, Shannon defined the notion of the self-information of an individual message s as

i(s) = log2(1/p(s))    (1.9)

This self-information represents the number of bits of information contained in the message and, roughly speaking, the number of bits that should be used to send that message. The equation says that messages with higher probability contain less information (e.g., a message saying that it will be sunny in LA tomorrow is less informative than one saying that it is going to snow). The entropy is simply a weighted average of the information of each message. Larger entropies represent more information and, perhaps counter-intuitively, the more random a set of messages is (the more even the probabilities), the more information the messages contain on average.

CODE EFFICIENCY - The average code length is the average number of bits required to represent a single code word. If the source and the lengths of the code words are known, the average code length can be calculated using the following equation:

l̄ = Σ_j p_j l_j    (1.10)

where p_j is the occurrence probability of the j-th symbol of the source message, l_j is the length of the code word for that symbol, and L = {l_1, l_2, ..., l_k}. Code efficiency is the ratio between the entropy of the source and the average code length, and is defined as follows:

E(P, L) = H(P) / l̄(P, L)    (1.11)

where E(P, L) is the code efficiency, H(P) is the entropy and l̄(P, L) is the average code length. The above equation gives the code efficiency as a percentage; it can also be computed as a ratio. The code is said to be optimal if the code efficiency is equal to 100% (or 1.0). If the code efficiency is less than 100%, the code words can be optimized further.

1.7 BIG O NOTATION

Big O notation is used in computer science to describe the performance or complexity of an algorithm. Big O specifically describes the worst-case scenario, and can be used to describe the execution time required or the space used (e.g. in memory or on disk) by an algorithm. Big O notation, also called Landau's symbol, is a symbolism used in complexity theory, computer science and mathematics to describe the asymptotic behaviour of functions. For example, when analyzing some algorithm, one may find that the time t (or the number of steps) taken to complete a problem of size n is given by T(n) = 4n² - 2n + 2. After ignoring the constants and the more slowly growing terms, it can be said that T(n) grows on the order of n², written T(n) = O(n²).

1.7.1 TYPES OF ORDER

Table 1.1 is a list of common types of orders and their names.

Table 1.1 Types of Order
Notation        Name
O(1)            Constant
O(log n)        Logarithmic
O(log log n)    Double logarithmic (iterated logarithmic)
o(n)            Sub-linear
O(n)            Linear
O(n log n)      Log-linear, linearithmic, quasi-linear or supra-linear
O(n²)           Quadratic
O(n³)           Cubic
O(n^c)          Polynomial (different class for each c > 1)
O(c^n)          Exponential (different class for each c > 1)
O(n!)           Factorial

O(1) - An algorithm that always executes in the same time (or space) regardless of the size of the input data set.
O(N) - An algorithm whose performance grows linearly and in direct proportion to the size of the input data set. The example below also demonstrates how Big O favours the worst-case performance scenario: a matching string may be found during any iteration of the for loop and the function will return early, but Big O notation always assumes the upper limit, where the algorithm performs the maximum number of iterations.
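The listing referred to above is not reproduced in this copy; the following Java sketch is an equivalent illustration, assumed to match the description: a linear scan over an array of strings that may return early on a match but is still classified as O(N), because Big O describes the worst case in which every element is examined.

```java
public class LinearSearch {
    // Returns true as soon as a matching string is found; the loop may exit
    // early, but Big O describes the worst case, where all N elements are
    // examined, so the method is O(N).
    public static boolean containsValue(String[] strings, String value) {
        for (String s : strings) {
            if (s.equals(value)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String[] data = {"alice29.txt", "asyoulik.txt", "cp.html"};
        System.out.println(containsValue(data, "cp.html"));   // true
        System.out.println(containsValue(data, "bible.txt")); // false
    }
}
```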
O(N²) - Represents performance directly proportional to the square of the size of the input data set. This is common with algorithms that involve nested iterations over the data set; deeper nested iterations will result in O(N³), O(N⁴) and so on.

Properties of Big O
• O(x + y) = O(x) + O(y) - distributive property over addition
• O(x * y) = O(x) * O(y) - distributive property over multiplication

1.7.2 COMPUTATIONAL COMPLEXITY AND LOGARITHMS

Binary search is a technique used to search sorted data sets. It works by selecting the middle element of the data set, essentially the median, and comparing it against a target value [90]. If the values match, it returns success. If the target value is higher than the value of the probe element, it takes the upper half of the data set and performs the same operation against it; likewise, if the target value is lower than the value of the probe element, it performs the operation against the lower half. It continues to halve the data set with every iteration until the value is found or until it can no longer split the data set.

This type of algorithm is described as O(log N). The iterative halving of data sets described in the binary search example produces a growth curve that peaks at the beginning and slowly flattens out as the size of the data set increases: if an input data set containing 10 items takes one second to complete, a data set containing 100 items takes two seconds, and a data set containing 1000 items takes three seconds. Doubling the size of the input data set has little effect on its growth, since after a single iteration of the algorithm the data set is halved and is therefore on a par with an input data set of half the size. This makes algorithms like binary search extremely efficient when dealing with large data sets. The relationship between time complexity and speed [5] is shown in table 1.2.

Table 1.2 Time Complexity and Speed
Complexity   n=10      n=20      n=50      n=100   n=1 000   n=10 000   n=100 000
O(1)         <1 s      <1 s      <1 s      <1 s    <1 s      <1 s       <1 s
O(log n)     <1 s      <1 s      <1 s      <1 s    <1 s      <1 s       <1 s
O(n)         <1 s      <1 s      <1 s      <1 s    <1 s      <1 s       <1 s
O(n log n)   <1 s      <1 s      <1 s      <1 s    <1 s      <1 s       <1 s
O(n²)        <1 s      <1 s      <1 s      <1 s    <1 s      2 s        3-4 min
O(n³)        <1 s      <1 s      <1 s      <1 s    20 s      5 hours    231 days
O(2^n)       <1 s      <1 s      260 days  hangs   hangs     hangs      hangs
O(n!)        <1 s      hangs     hangs     hangs   hangs     hangs      hangs
O(n^n)       3-4 min   hangs     hangs     hangs   hangs     hangs      hangs
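An iterative Java sketch of the binary search just described is given below; the class and method names are illustrative assumptions. Each iteration halves the remaining range, which gives the O(log N) behaviour summarized in table 1.2.

```java
public class BinarySearchSketch {
    // Iterative binary search over a sorted array: probe the middle element,
    // then continue in the upper or lower half until the value is found or
    // the range is empty. Returns the index of the target, or -1.
    public static int search(int[] sorted, int target) {
        int low = 0, high = sorted.length - 1;
        while (low <= high) {
            int mid = low + (high - low) / 2;   // middle (median) element
            if (sorted[mid] == target) {
                return mid;                     // success
            } else if (sorted[mid] < target) {
                low = mid + 1;                  // search the upper half
            } else {
                high = mid - 1;                 // search the lower half
            }
        }
        return -1;                              // not found
    }

    public static void main(String[] args) {
        int[] data = {3, 7, 11, 19, 23, 42, 57};
        System.out.println(search(data, 23));  // 4
        System.out.println(search(data, 5));   // -1
    }
}
```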
1.8 BENCHMARK FILES

The types of benchmark files used for the experimentation are given below:
• The Canterbury Corpus
• The Artificial Corpus
• The Large Corpus
• The Miscellaneous Corpus
• The Calgary Corpus

The Canterbury Corpus - This collection is the main benchmark for comparing compression methods. The Calgary collection is provided for historic interest, the Large corpus is useful for algorithms that cannot "get up to speed" on smaller files, and the other collections may be useful for particular file types. This collection was developed in 1997 as an improved version of the Calgary corpus. The files were chosen because their results on existing compression algorithms are "typical", and so it is hoped that this will also be true for new methods. There are 11 files in this corpus, shown in table 1.3.

Table 1.3 The Canterbury Corpus
File           Abbrev   Category             Size (bytes)
alice29.txt    Text     English text         152089
asyoulik.txt   Play     Shakespeare          125179
cp.html        Html     HTML source          24603
fields.c       Csrc     C source             11150
grammar.lsp    List     LISP source          3721
kennedy.xls    Excel    Excel spreadsheet    1029744
lcet10.txt     Tech     Technical writing    426754
plrabn12.txt   Poem     Poetry               481861
ptt5           Fax      CCITT test set       513216
sum            SPRC     SPARC executable     38240
xargs.1        Man      GNU manual page      4227

The Artificial Corpus - This collection contains files for which compression methods may exhibit pathological or worst-case behaviour: files containing little or no repetition (e.g. random.txt), files containing large amounts of repetition (e.g. alphabet.txt), or very small files (e.g. a.txt). As such, "average" results for this collection have little or no relevance, as the data files are designed to detect outliers. Similarly, times for "trivial" files will be negligible and should not be reported. There are 4 files in this corpus, shown in table 1.4.

Table 1.4 The Artificial Corpus
File           Abbrev     Category                                                                          Size (bytes)
a.txt          a          The letter 'a'                                                                    1
aaa.txt        aaa        The letter 'a', repeated 100,000 times                                            100000
alphabet.txt   alphabet   Enough repetitions of the alphabet to fill 100,000 characters                     100000
random.txt     random     100,000 characters, randomly selected from [a-z|A-Z|0-9|!| ] (alphabet size 64)   100000

The Large Corpus - This is a collection of relatively large files. While most compression methods can be evaluated satisfactorily on smaller files, some require very large amounts of data to achieve good compression, and some are so fast that the larger size makes speed measurement (computational complexity) more reliable. New files can be added to this collection. There are 3 files in this corpus, shown in table 1.5.

Table 1.5 The Large Corpus
File           Abbrev   Category                                   Size (bytes)
E.coli         E.coli   Complete genome of the E. coli bacterium   4638690
bible.txt      bible    The King James version of the Bible        4047392
world192.txt   world    The CIA world fact book                    2473400

The Miscellaneous Corpus - This is a collection of "miscellaneous" files, designed to be added to by researchers and others willing to publish compression results using their own files, shown in table 1.6.

Table 1.6 The Miscellaneous Corpus
File     Abbrev   Category                         Size (bytes)
pi.txt   pi       The first million digits of pi   1000000

The Calgary Corpus - This corpus was developed in the late 1980s, and during the 1990s it became something of a de facto standard for lossless compression evaluation. The collection is now rather dated, but it is still reasonably reliable as a performance indicator. It is still available so that older results can be compared. The collection may not be changed, although there are four files (paper3, paper4, paper5 and paper6) that have been used in some evaluations but are no longer in the corpus because they do not add to the evaluation. The files of the Calgary corpus are shown in table 1.7.
Table 1.7 The Calgary Corpus
File     Abbrev   Category                          Size (bytes)
bib      Bib      Bibliography (refer format)       111261
book1    book1    Fiction book                      768771
book2    book2    Non-fiction book (troff format)   610856
geo      Geo      Geophysical data                  102400
news     News     USENET batch file                 377109
obj1     obj1     Object code for VAX               21504
obj2     obj2     Object code for Apple Mac         246814
paper1   paper1   Technical paper                   53161
paper2   paper2   Technical paper                   82199
pic      Pic      Black and white fax picture       513216
progc    Progc    Source code in "C"                39611
progl    Progl    Source code in LISP               71646
progp    Progp    Source code in PASCAL             49379
trans    Trans    Transcript of terminal session    93695

1.9 TIME COMPLEXITY EVALUATOR TOOL

The focus of this research is mainly the computational complexity of data compression algorithms. There are other methods to evaluate the performance of compression algorithms: compression time, computational complexity and probability distribution are also used to measure effectiveness. The time taken for compression and for decompression should be considered separately [43]. In some applications, like transferring compressed video data, the decompression time is more important, while in other applications both compression and decompression time are equally important. However, the time taken by the same algorithm differs between executions, so a single measured time is not representative, and an average is not a fair basis for the calculation either, because the execution time of an algorithm varies between T_min and T_max over several runs. Finding the minimum time T_min and the maximum time T_max requires many executions of the algorithm on the same file (to find T_min and T_max, an algorithm is executed a hundred times on the same file). Performing this operation manually is very tedious, and so far no tool has been available for the purpose. The tool developed to perform this task, called the Compression Time Complexity Evaluator, is shown in figure 1.3.

The proposed tool is designed in Visual Basic 6.0. It is capable of executing algorithms provided as .java or .exe files. The tool is very useful for fetching the time taken by an algorithm on multiple files. The user can also define the number of executions of an algorithm on the same file ("No. of iterations" in figure 1.3). The tool reports the minimum and maximum time taken by an algorithm on the different files separately. Three types of graph can be plotted with this tool: (i) the time taken by an algorithm on a single file during each execution, (ii) the time taken by an algorithm on different files, and (iii) the time taken by different algorithms on different files. The newly designed tool is very useful when analyzing performance based on the computational complexity of different algorithms. The time recorded by the tool is in nanoseconds.

Figure 1.3 Compression Time Complexity Evaluator

1.10 EXPERIMENTAL SETUP

All experiments have been done on a 2.20 GHz Intel(R) Celeron(R) 900 Central Processing Unit (CPU) equipped with a 3072 KB L2 cache and 2 GB of main memory. As the machine does not have any other significant CPU tasks running, only a single thread of execution is used. The operating system is Windows XP SP3 (32 bit). All programs have been compiled using Java version jdk1.6.0_13. The times have been recorded in nanoseconds. The time taken by each algorithm has been calculated using the Compression Time Complexity Evaluator tool; most graphs are plotted with the same tool and some graphs are plotted using MS Excel. All data structures reside in main memory during computation.
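The measurement automated by the tool can be sketched in a few lines of Java: execute the algorithm repeatedly on the same file, wrap each run with System.nanoTime(), and keep the minimum and maximum elapsed times. The sketch below is illustrative only; the Runnable workload and the iteration count of 100 are assumptions matching the description above, not the tool's actual code.

```java
public class TimingSketch {
    // Runs a task repeatedly on the same input and records the minimum and
    // maximum wall-clock time in nanoseconds, as the evaluator tool does.
    public static long[] minMaxNanos(Runnable task, int iterations) {
        long min = Long.MAX_VALUE;
        long max = Long.MIN_VALUE;
        for (int i = 0; i < iterations; i++) {
            long start = System.nanoTime();
            task.run();
            long elapsed = System.nanoTime() - start;
            if (elapsed < min) min = elapsed;
            if (elapsed > max) max = elapsed;
        }
        return new long[] { min, max };
    }

    public static void main(String[] args) {
        // Illustrative workload standing in for one encoder run on a corpus file.
        long[] t = minMaxNanos(new Runnable() {
            public void run() {
                long sum = 0;
                for (int i = 0; i < 1000000; i++) {
                    sum += i;
                }
            }
        }, 100);
        System.out.println("min=" + t[0] + " ns, max=" + t[1] + " ns");
    }
}
```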
1.11 FRAMEWORK OF THIS THESIS

This thesis consists of eleven chapters. The initial chapter is introductory in nature, containing the thesis statement, the objectives of the study and the list of parameters used to measure the performance of compression algorithms.

CHAPTER II
The second chapter deals with the literature survey of the study on LZW data compression. The computational complexity of the LZW encoding algorithm is large; hence the study of LZW compression techniques is essential to reduce the computational complexity of LZW without affecting the performance of the algorithm.

CHAPTER III
This chapter re-examines and explains the well-known lossless data compression algorithms Huffman Coding, RLE, Arithmetic Coding, LZ77 and LZW. The selected algorithms are tested against compression ratio, compression factor, saving percentage and the time taken for encoding and decoding.

CHAPTER IV
This chapter proposes and explains a simple sorting algorithm called Binary Insertion Sort (BIS). The algorithm is developed by exploiting the concept of binary search. The enhanced BIS uses one comparison per cycle; the average-case, best-case and worst-case analysis of BIS is done in this chapter.

CHAPTER V
This chapter explains and experimentally evaluates how the BST, linear array, chained hash table and BIS play a key role in determining the computational complexity of the LZW encoding algorithm. Each data structure and algorithm employed in LZW is analyzed for its computational cost.

CHAPTER VI
Multiple Dictionary LZW (MDLZW), or Parallel Dictionary LZW, encoding is a simple and effective approach to reducing the computational cost of LZW. This chapter explains and experiments with the MDLZW architecture and shows how the computational complexity is reduced in comparison with conventional LZW and its various data structure implementations. The computational cost of each data structure is implemented and tested.

CHAPTER VII
The computational cost is closely related to the data structures employed in LZW. To optimize the performance of the data structures and the algorithm, a novel data mining approach called the Indexed K Nearest Twin Neighbour (IKNTN) algorithm is proposed, implemented and tested. The computational complexity is analyzed and tested before and after applying it to the BST, linear search, chained hash table and BIS with binary search.

CHAPTER VIII
The computational cost of LZW is reduced by combining the features of IKNTN with LZW. The enhanced implementation is tested with BIS, the chained hash table, the BST and the linear array.

CHAPTER IX
Two novel approaches are proposed and explained; the enhancement is done by combining the features of MDLZW and IKNTN_LZW. The first approach splits each dictionary into clusters, and the second groups each cluster into a dictionary.

CHAPTER X
The penultimate chapter deals with the results of the various implementations of LZW, which give the optimum reduction in computational cost. The results are presented in tables and graphs, and various parameters are used for the calculations.

CHAPTER XI
The final chapter sums up the details of the research work carried out. The limitations of the present study are also mentioned, along with recommendations for enhancements in future applications.