Coding and Entropy

Squeezing out the "Air"
• Suppose you want to ship pillows in boxes and are charged by the size of the box.
• Lossless data compression
• Entropy = lower limit of compressibility

Claude Shannon (1916-2001)
• "A Mathematical Theory of Communication" (1948)

Communication over a Channel
• [Diagram] Source -> S (symbols) -> encoder -> X (coded bits) -> channel -> Y (received bits) -> decoder -> T (decoded message, symbols)
• Encode the source symbols as bits before putting them into the channel; decode the bits back into symbols when they come out of the channel.
• E.g. the transformation from S into X changes "yea" -> 1 and "nay" -> 0; changing Y into T does the reverse.
• For now, assume no noise in the channel, i.e. X = Y.

Example: Telegraphy
• English letters -> Morse Code
• [Diagram] The letter D (Morse code -..) transmitted over the telegraph line between Washington and Baltimore.

Low and High Information Content Messages
• The more frequent a message is, the less information it conveys when it occurs.
• Two weather forecast messages, for Boston and for LA: [forecast graphics omitted]
• In LA, "sunny" is a low-information message and "cloudy" is a high-information message.

Harvard Grades (percentages)
           A    A-   B+   B    B-   C+
    2005   24   25   21   13    6    2
    1995   21   23   20   14    8    3
    1986   14   19   21   17   10    5
• Less information in Harvard grades now than in the recent past.

Fixed Length Codes (Block Codes)
• Example: 4 symbols A, B, C, D: A = 00, B = 01, C = 10, D = 11
• In general, with n symbols, the codewords need to be of length lg n, rounded up.
• For English text, 26 letters + space = 27 symbols, so the length is 5, since 2^4 < 27 < 2^5 (replace all punctuation marks by space).
• AKA "block codes"

Modeling the Message Source
• [Diagram] Source -> Destination
• Characteristics of the stream of messages coming from the source affect the choice of the coding method.
• We need a model for a source of English text that can be described and analyzed mathematically.

How can we improve on block codes?
• Simple 4-symbol example: A, B, C, D.
• If that is all we know, we need 2 bits/symbol.
• What if we know the symbol frequencies? Use shorter codes for more frequent symbols; Morse Code does something like this.
• Example:
      A   .7   0
      B   .1   100
      C   .1   101
      D   .1   110

Prefix Codes
• Only one way to decode, left to right:
      A   .7   0
      B   .1   100
      C   .1   101
      D   .1   110

Minimum Average Code Length?
• Average bits per symbol for the code above: .7·1 + .1·3 + .1·3 + .1·3 = 1.6 bits/symbol (down from 2).
• A better prefix code for the same source:
      A   .7   0
      B   .1   10
      C   .1   110
      D   .1   111
  Average: .7·1 + .1·2 + .1·3 + .1·3 = 1.5 bits/symbol.
• So the entropy of this source is at most 1.5 bits/symbol. Possibly lower? How low?
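To make the comparison concrete, here is a minimal Python sketch (illustrative only; the function names and the sample bit string are not from the slides) that computes the average code length of the two prefix codes above and decodes a bit string left to right.

    freqs = {"A": 0.7, "B": 0.1, "C": 0.1, "D": 0.1}
    code1 = {"A": "0", "B": "100", "C": "101", "D": "110"}   # the 1.6 bits/symbol code
    code2 = {"A": "0", "B": "10", "C": "110", "D": "111"}    # the 1.5 bits/symbol code

    def average_length(freqs, code):
        # Expected codeword length: sum over symbols of frequency * codeword length.
        return sum(p * len(code[s]) for s, p in freqs.items())

    def decode(bits, code):
        # Decode left to right; unambiguous because no codeword is a prefix of another.
        lookup = {word: symbol for symbol, word in code.items()}
        out, buffer = [], ""
        for b in bits:
            buffer += b
            if buffer in lookup:      # a complete codeword has been read
                out.append(lookup[buffer])
                buffer = ""
        return "".join(out)

    print(round(average_length(freqs, code1), 2))   # 1.6
    print(round(average_length(freqs, code2), 2))   # 1.5
    print(decode("010110111", code2))               # ABCD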
Self-Information
• If a symbol S has frequency p, its self-information is H(S) = lg(1/p) = -lg p.
      S       A     B     C     D
      p      .25   .25   .25   .25
      H(S)     2     2     2     2

      p       .7    .1    .1    .1
      H(S)   .51  3.32  3.32  3.32

First-Order Entropy of Source = Average Self-Information
      S          A     B     C     D    -Σ p·lg p
      p         .25   .25   .25   .25
      -lg p       2     2     2     2
      -p·lg p    .5    .5    .5    .5        2

      p          .7    .1    .1    .1
      -lg p     .51  3.32  3.32  3.32
      -p·lg p  .360  .332  .332  .332      1.357

Entropy, Compressibility, Redundancy
• Lower entropy: more redundant, more compressible, less information.
• Higher entropy: less redundant, less compressible, more information.
• A source of "yea"s and "nay"s takes 24 bits per symbol but contains at most one bit per symbol of information:
      010110010100010101000001 = yea
      010011100100000110101001 = nay

Entropy and Compression
• Average length for this code: .7·1 + .1·2 + .1·3 + .1·3 = 1.5
      A   .7   0
      B   .1   10
      C   .1   110
      D   .1   111
• No code that takes only symbol frequencies into account can achieve an average length better than the first-order entropy.
• First-order entropy of this source = .7·lg(1/.7) + .1·lg(1/.1) + .1·lg(1/.1) + .1·lg(1/.1) ≈ 1.357
• First-order entropy of English is about 4 bits/character, based on "typical" English texts.
• "Efficiency" of a code = (entropy of source)/(average code length) = 1.357/1.5 ≈ 0.90

A Simple Prefix Code: Huffman Codes
• Suppose we know the symbol frequencies. We can calculate the (first-order) entropy. Can we design a code to match?
• There is an algorithm that transforms a set of symbol frequencies into a variable-length prefix code whose average code length is approximately equal to the entropy. (David Huffman, 1951)

Huffman Code Example
• Symbol frequencies: A .35, B .05, C .2, D .15, E .25
• Repeatedly merge the two least frequent groups:
      B + D     -> BD     .2
      BD + C    -> BCD    .4
      A + E     -> AE     .6
      AE + BCD  -> ABCDE  1.0
• Labeling the two branches of each merge 0 and 1 and reading the labels from the root gives the codewords:
      A   .35   00
      B   .05   100
      C   .20   11
      D   .15   101
      E   .25   01
• Entropy 2.12, average length 2.20.

Efficiency of Huffman Codes
• Huffman codes are as efficient as possible if only first-order information (symbol frequencies) is taken into account.
• A Huffman code is always within 1 bit/symbol of the entropy.
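Here is a minimal Python sketch of the greedy merging behind Huffman's algorithm, applied to the five-symbol example above. It assumes a standard heap-based formulation rather than the slides' hand construction, and ties may be broken differently, so the exact codewords can differ from the slide even though the average length (2.20) and entropy (2.12) come out the same.

    import heapq
    from math import log2

    freqs = {"A": 0.35, "B": 0.05, "C": 0.20, "D": 0.15, "E": 0.25}

    def huffman(freqs):
        # Heap entries are (weight, tiebreak, tree); a tree is a symbol or a (left, right) pair.
        heap = [(p, i, s) for i, (s, p) in enumerate(sorted(freqs.items()))]
        heapq.heapify(heap)
        counter = len(heap)
        while len(heap) > 1:
            p1, _, t1 = heapq.heappop(heap)    # the two least frequent subtrees...
            p2, _, t2 = heapq.heappop(heap)
            heapq.heappush(heap, (p1 + p2, counter, (t1, t2)))   # ...are merged
            counter += 1
        code = {}
        def walk(tree, prefix):
            if isinstance(tree, str):
                code[tree] = prefix or "0"     # single-symbol edge case
            else:
                walk(tree[0], prefix + "0")
                walk(tree[1], prefix + "1")
        walk(heap[0][2], "")
        return code

    code = huffman(freqs)
    average = sum(p * len(code[s]) for s, p in freqs.items())
    entropy = -sum(p * log2(p) for p in freqs.values())
    print(code)                                  # codeword lengths: A, C, E -> 2 bits; B, D -> 3 bits
    print(round(average, 2), round(entropy, 2))  # 2.2 2.12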
Second-Order Entropy
• Second-order entropy of a source is calculated by treating digrams (pairs of symbols) as single symbols, according to their frequencies.
• Occurrences of q and u are not independent, so it is helpful to treat qu as one symbol.
• Second-order entropy of English is about 3.3 bits/character.

How English Would Look Based on frequencies alone
• 0: xfoml rxkhrjffjuj zlpwcfwkcyj ffjeyvkcqsghyd qpaamkbzaacibzlhjqd
• 1: ocroh hli rgwr nmielwis eu ll nbnesebya th eei alhenhttpa oobttva
• 2: On ie antsoutinys are t inctore st be s deamy achin d ilonasive tucoowe at
• 3: IN NO IST LAT WHEY CRATICT FROURE BIRS GROCID PONDENOME OF DEMONSTURES OF THE REPTAGIN IS REGOACTIONA

How English Would Look Based on word frequencies
• 1) REPRESENTING AND SPEEDILY IS AN GOOD APT OR COME CAN DIFFERENT NATURAL HERE HE THE A IN CAME THE TO OF TO EXPERT GRAY COME TO FURNISHES THE LINE MESSAGE HAD BE THESE
• 2) THE HEAD AND IN FRONTAL ATTACK ON AN ENGLISH WRITER THAT THE CHARACTER OF THIS POINT IS THEREFORE ANOTHER METHOD FOR THE LETTERS THAT THE TIME OF WHO EVER TOLD THE PROBLEM FOR AN UNEXPECTED

What is the entropy of English?
• Entropy is the "limit" of the information per symbol using single symbols, digrams, trigrams, …
• Not really calculable, because English is a finite language!
• Nonetheless it can be estimated experimentally using Shannon's guessing game.
• Answer: a little more than 1 bit/character.

Shannon's Remarkable 1948 paper

Shannon's Source Coding Theorem
• No code can achieve efficiency greater than 1, but
• for any source, there are codes with efficiency as close to 1 as desired.
• The proof does not give a method to find the best codes; it just sets a limit on how good they can be.

Huffman coding used widely
• E.g. JPEGs use Huffman codes for the pixel-to-pixel changes in color values.
• Colors usually change gradually, so there are many small numbers (0, 1, 2) in this sequence.
• JPEGs sometimes use a fancier compression method called "arithmetic coding."
• Arithmetic coding produces about 5% better compression.

Why don't JPEGs use arithmetic coding?
• Because it is patented by IBM.
      United States Patent 4,905,297, Langdon, Jr., et al., February 27, 1990
      "Arithmetic coding encoder and decoder system"
      Abstract: Apparatus and method for compressing and de-compressing binary decision data by arithmetic coding and decoding wherein the estimated probability Qe of the less probable of the two decision events, or outcomes, adapts as decisions are successively encoded. To facilitate coding computations, an augend value A for the current number line interval is held to approximate …
• What if Huffman had patented his code?
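To make the "many small numbers" point from the Huffman-in-JPEG slide concrete, here is a toy Python sketch (the gradient data and helper function are illustrative assumptions, and real JPEG compression also applies a block transform and quantization before entropy coding): for slowly varying pixel values, the pixel-to-pixel differences concentrate on a few small values, so their first-order entropy is far lower than that of the raw values.

    from collections import Counter
    from math import log2

    def first_order_entropy(values):
        # -sum p * lg p over the empirical symbol frequencies.
        counts = Counter(values)
        n = len(values)
        return -sum((c / n) * log2(c / n) for c in counts.values())

    # A smooth gradient: each pixel is 1 or 2 more than the previous one (mod 256).
    pixels = [(i * 3 // 2) % 256 for i in range(1000)]
    diffs = [b - a for a, b in zip(pixels, pixels[1:])]

    print(round(first_order_entropy(pixels), 2))  # values spread over 0..255 -> close to 8 bits
    print(round(first_order_entropy(diffs), 2))   # almost always 1 or 2 -> about 1 bit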