
Little essay about the various methods and viewpoints of crunching
~ version June 1998 ~
by Joa

Courtesy of fravia's page of reverse engineering

Here is the second part of Joa's essay.
See also the first part (Introduction)


Joa's work is extremely clear and will guide and help you to understand these matters, which for obvious reasons no real reverser should ever underestimate. My advice: read, re-read and then -once you have understood the basics- search the web and deepen your knowledge (and then contribute :-)



Little essay about the various methods and viewpoints of crunching.



Part II: Probability and Crunching (Huffman isn't the only way)







By Joa Koester (JoKo2000@HotMail.Com) 

This essay is free to be copied and read and spread as long as

you are decent enough to leave my name in it. 

If you have corrections, comments, better examples than the ones used 

in this essay, etc. - drop me a line.





OK, let's dive into the world of crunchers and munchers...



In the Intro I told you that crunching is a process of reducing
the SPACE data needs to be stored in, but not the MEANING (the
information transmitted in the symbols).

The measure 'Entropy' is used here much as in physics and in
information theory. The more frequent a symbol is, the LESS
information is transmitted through it - less information compared
to the whole of all symbols, that is. A file dominated by a few
very frequent symbols therefore has a low entropy and a lot of
redundancy in it - meaning a big 'crunch me, crunch me' for us.
So we should concentrate on the symbols that carry very little
information. How we measure that for our purposes depends on the
data we want to crunch and with which algorithm we want to crunch it.
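
To put a number on it: in classic information theory a symbol that
appears with probability p carries -log2(p) bits of information, and
the average over all symbols,

   H = - sum( p(symbol) * log2(p(symbol)) ),

is the entropy of the source - the theoretical minimum of bits per
symbol that crunchers of this family can reach.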

I also told you that a certain student called Huffman once came
up with an algorithm to compress data. His idea is based on the
fact that in data (especially speech) some symbols occur
more often than others. Here is one exaggerated example:



aaaaaaaaaabbbbbcccdd



What we have here is an algorithm based on probability. That means
that some symbols are no longer transmitted the way they used to be,
because the algorithm creates new symbols according to the
probabilities of the old ones. The more frequent symbols get
shorter new symbols, and the less frequently appearing symbols
will be transformed into longer ones:



Before:



a = 61h = %0110 0001

b = 62h = %0110 0010

c = 63h = %0110 0011

d = 64h = %0110 0100



and 



after:



????



How do we compute such a new symbol table? OK, before we explore

this algorithm let me change the data from ASCII to the lower

nibble. That should be enough to explain the algorithm:



Before:



a = %0001

b = %0010

c = %0011

d = %0100



and we have the frequencies:



a = 10

b =  5

c =  3 

d =  2

   ===

    20

Now we build the Huffman-tree out of these counts. We always take
the two symbols (or subtrees) with the smallest counts and melt
them together into a new knot whose count is the sum of both:

   c(3)  + d(2)      -> knot(5)
   b(5)  + knot(5)   -> knot(10)
   a(10) + knot(10)  -> root(20)

Walking down from the root and writing a 0 for every 'zero' branch
and a 1 for every 'one' branch we get the new symbols:

a = %0
b = %10
c = %110
d = %111

Instead of 20 * 4 = 80 bits we now need
10*1 + 5*2 + 3*3 + 2*3 = 35 bits for the same data.

Decoding is just walking down the tree again, bit by bit, until we
hit a leaf:

while (input left)
    knot = root
    symbol = NONE
    while (symbol == NONE)
       bit = Get_Next_Bit()
       if ( (bit == 0) && (knot->zero != NULL) )
          knot = knot->zero
       else
          if (knot->one != NULL) knot = knot->one

       if (knot->one == NULL) //is enough. knot->zero is NULL, too
          //symbol found. great. To the output with it
          symbol = knot->symbol
       end if
    end while
    Do_Output_Char(symbol);
end while



Done. 
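
If you prefer to see that walk in real code, here is a minimal C
sketch of it. The node layout (zero/one/symbol) and the little
hard-wired example tree are only my illustration of the idea, not
code from any particular packer:

#include <stdio.h>
#include <stdlib.h>

struct knot {
    struct knot *zero, *one;   /* children; both NULL on a leaf */
    char symbol;               /* only valid on a leaf          */
};

static struct knot *leaf(char c)
{
    struct knot *k = calloc(1, sizeof *k);
    k->symbol = c;
    return k;
}

static struct knot *inner(struct knot *z, struct knot *o)
{
    struct knot *k = calloc(1, sizeof *k);
    k->zero = z;
    k->one  = o;
    return k;
}

int main(void)
{
    /* the tree from the text: a=0, b=10, c=110, d=111 */
    struct knot *root = inner(leaf('a'),
                              inner(leaf('b'),
                                    inner(leaf('c'), leaf('d'))));

    /* "abacd" encoded: 0 10 0 110 111 */
    const char *bits = "0100110111";
    const struct knot *knot = root;

    while (*bits) {
        knot = (*bits++ == '0') ? knot->zero : knot->one;
        if (knot->one == NULL) {       /* leaf reached: zero is NULL, too */
            putchar(knot->symbol);     /* symbol found - to the output with it */
            knot = root;               /* start again at the top for the next one */
        }
    }
    putchar('\n');                     /* prints: abacd */
    return 0;                          /* no free() - it is only a sketch */
}

Run it and it prints 'abacd' - the five symbols hidden in those ten bits.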





         

Hm, I have another little problem...



What is it, Watson?





I did as you told me. I took a table of 256 longs, counted the 256 chars,
sorted them and let the computer calculate an optimized Huffman-tree.
But all I got was a tree in which every leaf sits at a depth of 8 - all
chars are encoded with 8 bits and there is no crunching in that. What happened?



Ah, Dr. Watson,
you did not remember my sentence about Entropy. The file you examined
had all bytes spread nearly equally over its whole length. So,
statistically speaking, the file is thoroughly mixed - maximum entropy,
hardly any redundancy. With a Huffman-tree that considers the whole
file you won't get any crunching here, I fear.
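
What Watson did boils down to something like this little C sketch
(the file handling is just one possible way to feed it):

#include <stdio.h>

int main(int argc, char **argv)
{
    static unsigned long count[256];   /* the 'table of 256 longs' */
    FILE *f;
    int c;

    if (argc < 2 || (f = fopen(argv[1], "rb")) == NULL)
        return 1;

    while ((c = getc(f)) != EOF)       /* count how often every byte occurs */
        count[c]++;
    fclose(f);

    for (c = 0; c < 256; c++)          /* print the histogram */
        if (count[c]) printf("%02x: %lu\n", c, count[c]);
    return 0;
}

If those counts come out nearly equal, every leaf of the resulting
tree sits 8 levels deep, and nothing is gained.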



But what do I do now?



Well, there are three scenarios that will satisfy your desire:

- We use a different algorithm

- We use a brutal tree

- We use an adapting technique



??? Tell me about the brutal tree.



Ok, I got this one from the 1001 Packer on the C64 (anyone ringing a
bell? The 1001 Crew? No? Where have the old days gone...).
They used a tree I can only label a 'brutal tree'. Here's the idea:
when we have data so evenly distributed that no Huffman-tree will
help us, we help ourselves. We take all the byte counters and sort
them. Then we create a table of the 256 chars ordered by those
sorted counters, so that the most frequent chars still come first.
Now we have the 256 possible bytes sorted.

So far so good.

Let's think about a byte. How do we read a byte? Isn't it:



#33 = $21 = %0010 0001  ?



What if we changed our view of the binary into a 2 + 6 split:



#33 = $21 = %00 100001



Our 256 chars now fall into 4 blocks (00, 01, 10, 11) of 64 chars
each, every char addressed by 6 bits inside its block. What if we
now 'huffman' those leading two bits into:



0

10

110

111



So we would have %0 100001 instead of %00 100001. Every byte in the
first 64-byte block is coded with 7 bits, while the others are coded
with either 8 or 9 bits. We can only hope that the file contains
enough bytes from the first block to outweigh the ones from the two
9-bit blocks. But it is amazing how well this little trick works.
And the speed! It is so easy to implement and so fast. Even on the
C64 it took only fractions of a second to decode a program as big as
the whole main memory (other packers took several seconds for the
same size).
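
Here is a small C sketch of that idea - my own reconstruction, not
the 1001 Crew's code: count the bytes, rank them by frequency, and
write every byte as a 1-3 bit block prefix plus 6 bits of its rank
inside the block.

#include <stdio.h>
#include <string.h>

static unsigned long count[256];
static unsigned char rank_of[256];     /* byte value -> rank 0..255 */
static unsigned char byte_of[256];     /* rank       -> byte value  */

/* crude bit writer, just for the sketch */
static unsigned char outbuf[1 << 20];
static unsigned long outbits = 0;

static void put_bit(int b)
{
    if (b) outbuf[outbits >> 3] |= (unsigned char)(0x80 >> (outbits & 7));
    outbits++;
}

static void put_bits(unsigned val, int n)          /* high bits first */
{
    while (n--) put_bit((val >> n) & 1);
}

static void build_rank_table(const unsigned char *data, long len)
{
    long i;
    int r, b, best;

    memset(count, 0, sizeof(count));
    for (i = 0; i < len; i++) count[data[i]]++;

    /* selection sort is fine for 256 entries */
    for (r = 0; r < 256; r++) byte_of[r] = (unsigned char)r;
    for (r = 0; r < 256; r++) {
        best = r;
        for (b = r + 1; b < 256; b++)
            if (count[byte_of[b]] > count[byte_of[best]]) best = b;
        { unsigned char t = byte_of[r]; byte_of[r] = byte_of[best]; byte_of[best] = t; }
    }
    for (r = 0; r < 256; r++) rank_of[byte_of[r]] = (unsigned char)r;
}

static void encode(const unsigned char *data, long len)
{
    static const unsigned prefix[4]  = { 0, 2, 6, 7 };   /* 0, 10, 110, 111 */
    static const int      preflen[4] = { 1, 2, 3, 3 };
    long i;

    for (i = 0; i < len; i++) {
        int r = rank_of[data[i]];
        put_bits(prefix[r >> 6], preflen[r >> 6]);   /* which 64-byte block */
        put_bits((unsigned)(r & 63), 6);             /* position inside it  */
    }
}

int main(void)
{
    const unsigned char demo[] = "aaaaaaaaaabbbbbcccdd";
    long len = (long)(sizeof(demo) - 1);

    build_rank_table(demo, len);
    encode(demo, len);
    printf("%ld input bits -> %lu output bits\n", len * 8, outbits);
    return 0;
}

The decoder only needs the 256-byte table byte_of[] stored along
with the data to undo the mapping.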





Wow, cool. But what about your different algorithms 

and this 'adapting technique' ?





Well, this is another story and will be told in the next chapter...









I hope you enjoyed this chapter. 

There will be more - i promise.

Just have a little patience.



Greetings from a programmer



Joa


In the meantime, after having read Joa's work above, you may enjoy a look at a huffer and an unhuffer, both written in C some time ago, and donated to the world, by Shaun Case... once you are done studying them, maybe you'll want to have a look at sixpack, another more advanced (adaptive Huffman encoding) compressor, written in C as well by Philip Gage for a data compression competition some time ago... since this code includes SIX different algorithms, I believe it could be useful for readers as a 'passage' to the next part of Joa's essay (if Joa doesn't agree with me I'll kill these links of course).



