Pages

                           

Monday, January 3, 2011

Huffman Coding



Huffman coding algorithm is a minimal variable-length character coding algorithm based on the frequency of each character.The algorithm was developed by David Huffman.  This is called "lossless" compression and is different from "lossy" compression used in audio and video files where the exact original cannot be reproduced. This is used in programs like Winzip, gzip, stuffit etc. The secret of compressing data lies in the fact that not all characters are equally common. For instance, in typical English text the letter 'e' is much more common than the letter 'z'. In fact there are roughly 70 'e's for every 'z'. If we could encode the letter 'e' with less than the usual 8 bits and in exchange let the letter 'z' be take more.


Huffman Coding program consists of 7 functions. The functions is explained below:


1. Determining Relative frequencies


The function input is a string. A dictionary of letters along with their number of occurrence is created.



def Freq(stri):

freq = {}

for ch in stri:

freq[ch] = freq.get(ch,0)+1

print freq

2. Sorting list of tuples


This function builds and sorts a list of tuples. This list will be the input for the main Huffman algorithm.



def SortFreq(freq):

tuples=[]

for all in freq:

tuples.append((freq[all],all))

tuples.sort()

print tuples


3. Building a Tree


This function pops two elements from the front of the list and combined into a new element.  These 2 elements will have the least frequencies. The frequency of the new branch point is simply the sum of the frequencies of its two parts. Next we add this new element onto the list and sorted to find it's new position. The process is repeated until the list has a single element which represents the complete tree.



def BuildTree(tuples):

global string

while len(tuples)>1:

leastTwo=tuple(tuples[0:2])

rest=tuples[2:]

combFreq=leastTwo[0][0]+leastTwo[1][0]

tuples=rest+[(combFreq,leastTwo)]

tuples.sort()

print tuples 



4. Trimming of the frequencies 



This function trims the frequencies thus resulting in tuples containing only letters with no corresponding frequencies.



def Trim(tuples):

p=tuples[1];

if type(p)==type(""):

return p

else: 

return (Trim(p[0]),Trim(p[1])) 



5. Assigning codes to the Characters


This function assigns codes to the letters in the tree keeping track of the left and right turns in the variable "pat". There is a global dictionary "codes" which will be filled in along the way.




def Coding(node,pat=""):

global code

if type(node)==type(""):

code[node] = pat


else:

 Coding(node[0],pat+'0')

 Coding(node[1],pat+'1')

return code


6. Encoding a Text Stream



This function looks up the bit string for each character from the coded dictionary and replace it..



def encoding(str):

global code

result=""

for let in str:

result+=code[let]

return result 


7. Decoding a Text Stream 


This function does the reverse process of encodeing. In this each "zero" in the bit stream means a left turn up the tree, each "one" a right turn. When a leaf is reached, its character is sent to the output and we restart at the base of the tree for the next character.



def decoding(tree,str):

result=""

p=tree

for num in str:

if num=='0':

p=p[0]

else:

p=p[1]

if type(p)==type(""):

result+=p

p=tree

return result



Output:


This program turns the original string "aaabccdeeeeeffg" into 44 bits that we represent with "00000010010100101010111111111101101110111000". The original text at one byte per character requires 120 bits.




No comments:

Post a Comment