
Little essay about the various methods and viewpoints of crunching
~ version June 1998 ~
by Joa ~ Part IV

Courtesy of fravia's page of reverse engineering

Well, Joa continues his fundamental paper on crunching; this is part IV (since the preceding one was part 3 :-)
enjoy!

05 June 98 Joa ~ crunchi1.htm Little essay about the various methods and viewpoints of crunching papers ~fra_0126
10 June 98 Joa ~ crunchi2.htm Little essay about the various methods and viewpoints of crunching II papers ~fra_0129
17 June 98 Joa ~ crunchi3.htm Little essay about the various methods and viewpoints of crunching III papers ~fra_012E
17 June 98 Joa ~ crunchi4.htm Little essay about the various methods and viewpoints of crunching IV papers ~fra_012F



Little essay about the various methods and viewpoints of crunching.



Part IV: Leaving Huffman and entering the realms of R'lyeh, eh, RLE







By Joa Koester (JoKo2000@HotMail.Com) 

This essay is free to be copied and read and spread as long as

you are decent enough to leave my name in it. 

If you have corrections, comments, better examples than the ones used 

in this essay, etc. - drop me a line.





OK, let's dive into the world of crunchers and munchers...





Last time we discussed the possible uses of the Huffman technique and
the possibility of using it in an adaptive way. I would like
to conclude this with the remark that the Huffman way is pretty
useless for 'normal' progs. You achieve about 5% - 10% if you
are lucky. Used on normal texts, which contain only characters with
ASCII codes below 128, there will be at least a 25% ratio. Normally there
will be a ratio of about 30%-35%. But there you have a limited
alphabet (the letters in lower and upper case, the digits,
plus some extra symbols). What we want are algorithms usable
for more general (random) data. One of these algorithms (or
better, one of these principles) is RLE.



What does RLE mean?



RLE, Dr. Watson, stands for Run Length Encoding. Its varying
implementations are mostly used when speed is important.
In the Amiga days there was a nice picture format - IFF.
It had an RLE mode (or was RLE always on? Can't remember anymore...)
where the single lines of a picture were crunched with an RLE
algorithm, leading to a space saving of about 30% to 80% depending
on the picture. Nice, isn't it?



But what is the basic idea of RLE?



Ah, Watson, always so eager, aren't you? Well, your desire will

be satisfied... 



Imagine a file with some data and some zeros:



abcdefg00000000000000000000



(you will see data like this a lot in uncompressed bitmaps)



Now, we want to crunch it. We would like to crunch the RUN of the

zeros. For this we have to tell the decoder the LENGTH  of the run. 

One obvious idea would be:



abcdefg{0x14}0



where {0x14} would be a byte telling the decompressor how often
it has to copy the following byte (0x14 = 20 zeros), which would
give us 9 bytes instead of 27 bytes: a saving of 66.66%.
Now you know where the name RLE comes from. Believe me, this is
one of the most primitive ways of thinking up a crunch method.
But as with most primitive ways, it is extremely fast and
astoundingly stable.
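Whatever switch we end up choosing later, every RLE encoder needs the
same core loop: look at the current byte and count how long the run is.
Here is a minimal sketch of that step in C (the function name and the
max_run limit are my own illustration, not part of any format):

    #include <stddef.h>

    /* Count how often the byte at src[pos] repeats, up to max_run. */
    size_t run_length(const unsigned char *src, size_t pos,
                      size_t size, size_t max_run)
    {
        size_t n = 1;
        while (pos + n < size && n < max_run && src[pos + n] == src[pos])
            n++;
        return n;
    }

Each scheme we will look at below only differs in what it does with the
value this loop returns.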



--- Little private note ----
As a programmer you don't win a prize for the most complicated
algorithm you can think of: you aren't paid (enough) for it, and in
most cases there is no time for it. (The exception is the field of
security, where complicated algorithms are expected. But even there a
programmer has to debug his own routines, so he won't use tricks he
won't understand a few weeks later.) So you don't do it.
You program on a more stable level: KISS - Keep It Simple, Stupid!
If there is an easier way of doing it, choose it. Use ready-made
libraries like the C++ STL or (yuk) MFC or Borland's VCL. That's the
way programmers program their applications. Hugh!
--- Little private note end ---



One immediate question of course comes up:
What if the zeros had appeared 0x61 times (0x61 being the ASCII value of 'a')?
The output would have been:
abcdefg{0x61}0 = abcdefga0



How does the decruncher know when to activate a copy loop (crunch) and
when to just copy the actual bytes (no-crunch)?

We have to set up a rule that both the coder and the decoder follow
without exception. The coder has to build its output in such a way that
the decoder can decode its input without problems.



So, what we need is a kind of SWITCH - a switch which the encoder sends.
Linked with the decision to implement the switch is automatically
the question: WHEN do I crunch?
Do I crunch when there are 10 equal bytes following in a run? Certainly.
There will be enough ways to tell the decoder that a run is coming up.
Do I crunch when there are 5 equal bytes following in a run? Yep. Same answer.
Do I crunch when there are 4 equal bytes following in a run? Yep. Same answer.
Do I crunch when there are 3 equal bytes following in a run? Hm, let me think about it.
Do I crunch when there are 2 equal bytes following in a run? Eh, (panic) don't know. HELP!



OK, OK. What you need is a little overview of some ideas I observed while
analysing some RLE methods. In most crunching algorithms there are telling
bits or bytes which tell the decruncher to switch between crunch and normal
mode. After such a switch the following bits/bytes are interpreted
completely differently.



- IFF-like.
  The IFF format was built upon the idea of signed chars. I don't remember the
  algorithm in every detail, but enough to explain the idea. If someone has the
  original IFF readme files, he should correct me here. But the basic idea is
  the following:
  There are TELLING bytes. These telling bytes are either signed or unsigned.
  The crunched file starts with a telling byte.
  The switch is the 8th bit (the sign bit) of this char.
  When the decoder encounters a signed char it knows that, by masking with 0x7f,
  it gets the run length of the following byte. It copies the following byte
  (decoded value + 2) times and goes on with the next byte. +2 because there
  have to be at least 2 bytes, so you can add 2 in your mind, giving a
  crunch-run length from 2 (encoded as %1 + (2 - 2 = 0 = %000 0000) = 0x80)
  to 129 (encoded as %1 + (129 - 2 = 127 = %111 1111) = 0xff).
  If the sign bit is clear, the next X + 1 bytes are to be copied as they are.
  X + 1 because there HAS to be at least one byte, so you can add 1 in your
  mind, giving a range
  from 1 byte (encoded as %0 + (1 - 1 = 0 = %000 0000) = 0x00)
  to 128 bytes (encoded as %0 + (128 - 1 = 127 = %111 1111) = 0x7f).

  The minimum crunching point is 2 equal bytes in a row. They are coded as two
  bytes (0x80, 0x??). Yes, I know, there is no saving here, but the alternative
  would have been to encode these two bytes as no-crunch, leading to a
  three-byte coding (0x01, 0x??, 0x??).
  In cases like this it is most important to keep the garbage down, or else you
  increase the file size with it.

  One example:

  

  abcdefg00000000000000000000 (= abcdefg + 20 '0')

  

  would be coded as:

  

  {0x06}abcdefg{0x92}0. 

  0x06 because we have 7 bytes here and the decoder adds one for you.

  0x92 = %10010010 = %1 0010010 = 128 + 18. 18 because the decoder adds 2 for you.



  It is important that you store the original file length with the crunched
  file, or else your decoder doesn't know when to stop.
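  To make the rules concrete, here is a minimal decoder sketch in C for the
  scheme exactly as described above (my own reconstruction - not the official
  IFF ByteRun1 spec, which differs in details). dst_len is the original file
  length stored with the crunched data:

      #include <stddef.h>

      void decode_iff_like(const unsigned char *src, unsigned char *dst,
                           size_t dst_len)
      {
          size_t d = 0;
          while (d < dst_len) {
              unsigned char telling = *src++;
              if (telling & 0x80) {                    /* sign bit set: a run */
                  size_t count = (telling & 0x7f) + 2;     /* runs of 2..129  */
                  unsigned char value = *src++;
                  while (count--) dst[d++] = value;
              } else {                                 /* sign bit clear: raw */
                  size_t count = (size_t)telling + 1;      /* 1..128 raw bytes */
                  while (count--) dst[d++] = *src++;
              }
          }
      }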



  If you have a file of totally random data (like an already crunched file) you
  will add (filesize / 128) bytes to it. So, if your Word '95 with 3,845,120
  bytes were totally random data, it would grow by 30,040 bytes. But run the
  algorithm on uncompressed bitmaps and watch them shrink.



  Remark: IFF is a very good and simple-to-implement algorithm, and it's a
          shame that the Amiga is dead nowadays. I see IFF only in the AIFF
          sound format. ;(





- one for eight
  We have a telling byte which is viewed as a row of eight status bits. 1 means
  we have a crunch, 0 means we have garbage (or the other way round - your
  choice). Crunch and garbage each have an unsigned counting byte, giving us a
  range from 1 to 256. After we have worked thru all 8 bits, the next telling
  byte is read (thus forming a package, remember?).
  The crunched file starts with such a telling byte.

  One example:



  abcdefg00000000000000000000xyz111222333444abcdefg = 49 bytes



  We step thru the first 7 bytes and consider them Garbage. 

  The next 20 bytes are Crunch.

  Then 3 bytes Garbage.

  Then 3 bytes Crunch.

  Then 3 bytes Crunch.

  Then 3 bytes Crunch.

  Then 3 bytes Crunch.

  Then 7 bytes Garbage.



  So our telling byte would be: %0 1 0 1 1 1 1 0 = 0x5e.

  The counting bytes would be (decremented by one, remember): 

  0x06, 0x13, 0x02, 0x02, 0x02, 0x02, 0x02, 0x06



  The crunching algorithm can already crunch a run of two bytes, and it should:
  coded as a crunch the pair costs a counting byte plus the value byte
  (2 bytes), while coded as garbage it costs a counting byte plus those two
  bytes (3 bytes)!



  It is important that you store the original file length with the crunched
  file, or else your decoder doesn't know when to stop.



  The whole package would look like this:

  {0x5e}{0x06}abcdefg{0x13}0{0x02}xyz{0x02}1{0x02}2{0x02}3{0x02}4{0x06}abcdefg = 31 bytes



  The IFF-coder would have produced:

  {0x06}abcdefg{0x92}0{0x02}xyz{0x81}1{0x81}2{0x81}3{0x81}4{0x06}abcdefg = 30 bytes



  While this algorithm is a little more complicated than the IFF one, it has
  the advantage of a range of 256 bytes instead of 129. In big files this can
  lead to significant savings.
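  A minimal decoder sketch in C for the byte-level variant follows. That the
  telling byte is consumed most significant bit first is my assumption (it
  matches the %01011110 example above); coder and decoder just have to agree:

      #include <stddef.h>

      void decode_one_for_eight(const unsigned char *src, unsigned char *dst,
                                size_t dst_len)
      {
          size_t d = 0;
          while (d < dst_len) {
              unsigned char telling = *src++;
              int bit;
              for (bit = 7; bit >= 0 && d < dst_len; bit--) {
                  size_t count = (size_t)*src++ + 1;  /* counter runs 1..256 */
                  if (telling & (1u << bit)) {        /* 1: crunch - repeat  */
                      unsigned char value = *src++;
                      while (count--) dst[d++] = value;
                  } else {                            /* 0: garbage - copy   */
                      while (count--) dst[d++] = *src++;
                  }
              }
          }
      }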



  I lifted this algorithm from the bit level to the byte level. I have seen
  both implementations, and from my observations the byte level is much faster.
  At the bit level you read a telling BIT. If it's a one (crunch), you read
  8 bits as a counter, add one, read the next 8 bits as the char and copy the
  decoded byte that many times into your destination buffer. If it's a zero,
  you read the next 8 bits as a counter, add one, and copy that many chars of
  8 bits each. It's the same mechanism, but implemented on bytes it's sooooo
  much faster. The bit level is just more pseudo-crypting, because you can't
  read anything in the file anymore :-).



  If you have a file of totally random data (like an already crunched file) you
  will add (filesize / 256) + ((filesize / 256) / 8) bytes to it.
  So, if your Word '95 with 3,845,120 bytes were totally random data, it would
  grow by 15,020 + 1,878 = 16,898 bytes.

  



- No garbage
  This one is most interesting. As you saw above, one problem is the coding of
  the garbage itself. The idea here implements RLE without any garbage
  overhead, with the disadvantage that very short runs gain nothing: a run of 3
  only breaks even, and a run of 2 even costs one extra byte, while the methods
  above could already code a pair without loss.
  The idea is that the decoder itself can recognize a run. If the decoder could
  somehow see on its own that a crunch is coming up, it would just have to read
  a counter byte and could start its copy loop.



  Well, the easiest way to make the decoder recognize a byte run is to let part
  of the run simply STAY in the source. That means we let at least two equal
  bytes in a row stand in the output; whenever the decoder sees two equal
  bytes, it knows that a counting byte follows. This counting byte tells how
  many MORE copies of that byte FOLLOW, from 0 to 255 - and there are no
  garbage bytes anymore.
  In practice this means: for a run of length 2 up to 257 you emit the byte
  twice and then the length minus 2 as the counter. A run of 3 gives a counter
  of 1; a run of 257 gives a counter of 255 - exactly what fits into an
  unsigned char. A run of exactly 2 gets a counter of 0 ('nothing more
  follows'); that costs one extra byte, but it keeps the decoder in sync,
  because the decoder ALWAYS expects a counter after two equal bytes.



  One example:

  abcdefg00000000000000000000xyz111222333444abcdefg = 49 bytes



  abcdefg               is encoded as is
  00000000000000000000  is encoded as crunch: 20 - 2 = 18 more bytes follow, so 00{0x12}
  xyz                   is encoded as is
  111                   is encoded as crunch: 3 - 2 = 1 more byte follows, so 11{0x01}
  222                   is encoded as crunch: 3 - 2 = 1 more byte follows, so 22{0x01}
  333                   is encoded as crunch: 3 - 2 = 1 more byte follows, so 33{0x01}
  444                   is encoded as crunch: 3 - 2 = 1 more byte follows, so 44{0x01}
  abcdefg               is encoded as is

  makes altogether:

  abcdefg00{0x12}xyz11{0x01}22{0x01}33{0x01}44{0x01}abcdefg = 32 bytes





  If you have a file of totally random data (like an already crunched file) you
  add almost nothing to it: garbage costs exactly 0 bytes, and you only pay one
  counter byte for every accidental pair of equal neighbouring bytes. In random
  data such a pair turns up about once every 256 bytes, so your Word '95 with
  3,845,120 bytes would grow by roughly 15,000 bytes - less than with both
  methods above - and data without any equal neighbours passes thru completely
  unchanged. That's an important feature: this algorithm can be let loose on
  practically all data you know, and very little harm can be done.
  The disadvantage is of course that pairs cost instead of save.
  But hey, you can't have it all, can you?
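  Because this scheme is so self-contained, a complete round trip fits in a few
  lines. A sketch in C of both sides, written to the rules above (the code and
  the names are my own illustration):

      #include <stddef.h>

      /* Returns the number of bytes written to dst. */
      size_t encode_no_garbage(const unsigned char *src, size_t src_len,
                               unsigned char *dst)
      {
          size_t s = 0, d = 0;
          while (s < src_len) {
              unsigned char value = src[s];
              size_t run = 1;
              while (s + run < src_len && run < 257 && src[s + run] == value)
                  run++;                        /* measure the run, max 257 */
              dst[d++] = value;
              if (run >= 2) {                   /* a pair signals the decoder */
                  dst[d++] = value;
                  dst[d++] = (unsigned char)(run - 2);  /* 0..255 more copies */
              }
              s += run;
          }
          return d;
      }

      size_t decode_no_garbage(const unsigned char *src, size_t src_len,
                               unsigned char *dst)
      {
          size_t s = 0, d = 0;
          while (s < src_len) {
              unsigned char value = src[s++];
              dst[d++] = value;
              if (s < src_len && src[s] == value) {  /* pair: counter follows */
                  s++;
                  dst[d++] = value;
                  size_t count = src[s++];
                  while (count--) dst[d++] = value;
              }
          }
          return d;
      }

  Note that this decoder needs no stored original file length: the stream
  carries no ambiguity, so it simply stops when the input is used up.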





There are of course a lot more possibilities for coding RLE; I hope you get
the idea. Don't let the statistics deceive you: it strongly depends on the
input data which of the above-mentioned algorithms is best suited for the
task. One good strategy on big files is to load about 10%-20% of the file,
test-crunch that with ALL your algorithms, then choose the best one and
crunch the whole file. On smaller files just load the whole file into a
buffer (say, 64K or 128K or so) and do the same. For most RLE algorithms
speed is the most critical factor. Have this always on your mind when you
invent your own RLE crunchers.



When I started writing little crunchers of my own, I started with the
DEcruncher. That always worked fine for me. First think of a file and a
possible coding. Write it down. Create a file with this coding by hand.
Write a decruncher for it. And only then think of a way to generate such a
coding. The central point of all crunch algorithms will be the pointers into
your source buffer and into internal arrays and/or history tables. Try it.
It's easier than you may think right now.





Eh, I have observed something important, I think.



What is it, Watson?



In your example (abcdefg00000000000000000000xyz111222333444abcdefg) the
sequence 'abcdefg' happens to appear two times. After observing this I
examined some files and texts, and I realized that sequences ('and', 'or',
'in'...) reappear a lot more often in files than equal-byte runs do.
Can we crunch these sequences, too?



Oh yes, Watson, we can! Jacob Ziv and Abraham Lempel published papers dealing
exactly with this problem in 1977 and 1978. The theories described there are
the basis of what we today call the LZxxx algorithms. Programs like UNIX
compress, ZIP or RAR are based on these theories. But this is a little more
complicated and better dealt with in a chapter of its own.







I hope you enjoyed this chapter. 

There will be more - I promise.

Just have a little patience.



Greetings from a programmer



Joa



