+HCU papers

Little essay about the various methods and viewpoints of crunching

~ version December 1998 ~

by Joa ----- Part. VII

Courtesy of reverser's page of reverse engineering

Well, Joa continues his fundamental paper on crunching, this is part VII

enjoy!

12 December 98 |
Joa
| ~ |
crunchi1.htm
| Little essay about the various methods and viewpoints of crunching
| papers
| ~ | fra_0126 |

10 June 98 |
Joa
| ~ |
crunchi2.htm
| Little essay about the various methods and viewpoints of crunching II
| papers
| ~ | fra_0129 |

17 June 98 |
Joa
| ~ |
crunchi3.htm
| Little essay about the various methods and viewpoints of crunching III
| papers
| ~ | fra_012E |

17 June 98 |
Joa
| ~ |
crunchi4.htm
| Little essay about the various methods and viewpoints of crunching IV
| papers
| ~ | fra_012F |

17 June 98 |
Joa
| ~ |
crunchi5.htm
| Little essay about the various methods and viewpoints of crunching V
| papers
| ~ | fra_012F |

17 June 98 |
Joa
| ~ |
crunchi6.htm
| Little essay about the various methods and viewpoints of crunching VI
| papers
| ~ | fra_012F |

Little essay about the various methods and viewpoints of crunching.
Part VII: Arithmetic Crunching (crunching bits apart...)
By Joa Koester (JoKo2000@HotMail.Com)

This essay is free to be copied and read and spread as long as
you are decent enough to leave my name in it.
If you have corrections, comments, better examples than the ones used
in this essay etc. - drop me a line.
hello, hello,
well, shave my legs and call me Grandpa if that wasn't a long delay, but i
had some serious difficulties accessing the Internet recently. I hope that
i can access you more frequently in the near future. And to sweet up the time
till then a little bit here my next essay about crunching methods:
Arithmetic Crunching!
What is Arithmetic Crunching?
Arithmetic Crunching is based on the observation that a certain floating
point number decrements to a much smaller degree the higher the factor is with
which you multiply it. For us that means, that that arithmetic crunching a based
on probabilities like Huffman. But as you will see, much better.
Wanna example? Well, imagine a floating point number. Say, we imagine 1,00.
Then imagine we have a multiplicating factor like 0,9.
And now watch what happens when we multiply the 1,00 with 0,9 and then
multiply the result with 0,9 again and again and ... and that some times:
1,00 * 0,9 = 0,9
0,9 * 0,9 = 0,81
0,81 * 0,9 = 0,729
...
as we continue we see that the value of the number decrements. But it
decrements SLOWLY. And it decrements with the factor 0,9. This is the first key
element for arithmetic crunching: you multiply a number with a factor. The
higher the factor the slower the number decrements, the better the crunching.
You want a practical example? Coming up... Have a look at the following byte
sequence:
aaaaaaaaaabbbbcccddd
which consists of four chars. Counting the appearences of the char reveals the
following:
10 x a
4 x b
3 x c
3 x d
-------
20 bytes
As we know that Arithmetic Crunching has something to do with probabilities,
let's build up a table with the probabilites of the chars:
a = 50%
b = 20%
c = 15%
d = 15%
And to explain the mechanics, let's start at the most basic implementation. The
basic idea is to transform the probabilities into slots, starting - for now - in
floating point mode in the range from 0,0 to 1,0.
So let's scale this table into a range from 0.0 to 1.0, giving the lowest
value the starting range 0.0:
Byte | % | Low | High
a | 50% | 0.50 | 0.999999(999...)
b | 20% | 0.30 | 0.499999(999...)
c | 15% | 0.15 | 0.299999(999...)
d | 15% | 0.00 | 0.149999(999...)
So we have the char 'a' with the range from 0.5 till 0.99999. But what does
that bring us?
Well, we have now TWO values and we automatically have a RANGE, namely the
range of the two actual calculated values. That means we have three values that
are calculated: the actual range, the actual Low and the actual High. The actual
range is the range that is later saved bytewise to disk, the actual High and Low
are the parts that change according to the crunched byte.
Just watch as we step thru the byte-sequence:
(We start with Low = 0.0 and High with 1.0)
Low: 0.0
High: 1.0
Range: (High - Low) = 1.0
char: a
LowOfChar: 0.5
HighOfChar: 0.99999(999...)
High= Low + (Range * HighOfChar)= 0.0 + (1.0 * 0.99999) = 0.99999
Low = Low + (Range * LowOfChar) = 0.0 + (1.0 * 0.5) = 0.5
What happened here? Well at first we calculated the actual RANGE. This range is
our arithmetic number, yielding the values of the chars we are crunching. Then
we calculate the High and then the Low.
These are build upon the slots (the ranges) of the actual char. Because we are
having TWO values we have to calculate also two new values. The High is to be
calculated with the high position of the actual slot (0.99999 for the char a,
0.49999 for the char b, etc.).
We multiply the high and low position with the actual range, because this
range is recreated dynamically with every byte we crunch. At first the range is
1,0. But with every byte the range becomes smaller and smaller until a byte or
so has to emitted to give space again and the range is shifted upwards again.
If for example your range is 0.00025 that would mean that your High and Low
are only 0.00025 away. At this point we have to emit the first number(s) of Low
to disk and shift the rest up until we are behind the decimal point again.
So, if Low was 0.12325 and High was 0.12350, our RANGE would be 0.00025. Too
small to continue. First we have to emit something and then shift the rest
upwards again. We emit for example 0.12 to disk and subtract this from Low and
High.
Low and High would then be 0.00325 / 0.00350. Then we have to shift the
values up again until they are behind the decimal point: 0.325 / 0.350.
The next bytes you crunch will more or less slowly bring the values nearer again
and then we emit again some values, probably 0.3 and shift the rest up again,
and so on.
How many decimal points you emit to disk depends on your personal ideas. You
could send as much decimal numbers to disk until there is a difference in High
and Low (so in the example 0.12350 and 0.12325 we could have sent 0.123 to
disk because these are the values that aren't different) or you could send
a certain number of numbers to disk and shift upwards until a certain difference
is ensured.
Or something completely different. It's up to you.
We calculate the High first, because the High-Value is calculated on the actual
Low-Value. If we would calculate the Low-Value first, we couldn't calculate our
High-Value anymore. When we calculate these values we take the actual range,
multiply it with the factor(s) of the actual char and give the result back into
High or Low.
What that mechanism will do will become clearer when we watch the crunching of
the second byte:
Low: 0.5
High: 0.99999
Range: (High - Low) = 0.49999
char: a
LowOfChar: 0.5
HighOfChar: 0.999999
High= Low + (Range * HighOfChar)= 0.5 + (0.49999 * 0.99999) = 0.999985
Low = Low + (Range * LowOfChar) = 0.5 + (0.49999 * 0.5) = 0.749995
What you see is that the number is still in the range of the char 'a'.
What will happen when we crunch the next 8 a's ? I reckon that the number will
get nearer to each other.
But still it will be in the range of 0.5 - 0.99999.
What we could use now is an example consisting of just 4 chars:
aabc
Ok. Now build the probability table:
a = 50%
b = 25%
c = 25%
And next the transforming-table from the percentages into the range of 0.0 - 1.0.
Byte | % | Low | High
a | 50% | 0.50 | 0.99999
b | 25% | 0.25 | 0.49999
c | 25% | 0.00 | 0.24999
Now again, watch the ranges as we skip the first two crunching steps as the
result is identical with the first example. Let's watch what happens when we now
reach the 'b':
Low: 0.749995
High: 0.999985
Range: (High - Low) = 0.24999
char: b
LowOfChar: 0.25
HighOfChar: 0.49999
High= Low + (Range * HighOfChar)= 0.749995 + (0.24999 * 0.49999) = 0.8749875
Low = Low + (Range * LowOfChar) = 0.749995 + (0.24999 * 0.25) = 0.8124925
Phew. As you can see the values made some heavy jumps. The High sank from
0.999985 down to 0.8749875, while the Low jumped from 0,749995 up to 0,8124925.
I think i will not surprise you when i say the the next step will go in the same
direction:
Low: 0.8124925
High: 0.8749875
Range: (High - Low) = 0.062495
char: b
LowOfChar: 0.0
HighOfChar: 0.24999
High= Low + (Range * HighOfChar)= 0.8124925 + (0.062495 * 0.24999) = 0.828115625
Low = Low + (Range * LowOfChar) = 0.8124925 + (0.062495 * 0.0) = 0.8124925
Oh boy, oh boy, oh boy. Look, what we have here. The Low and the High are damn
near to each other. But the bytes are thru. The crunching is over. And what now?
Well, all we have to do now is to emit the Low and that's all. We are done. The
sequence aabc is coded into 0.8124925! It is most important to understand that we
are not emitting a number, but a collection of RANGES!!! When we decrunch the
number it will become even clearer:
(For now the decoder is supposed to know the model)
Byte | % | Low | High
a | 50% | 0.50 | 0.99999
b | 25% | 0.25 | 0.49999
c | 25% | 0.00 | 0.24999
0.8124925 is in the range of the char a. So we output the char 'a'. And
then? It's simple. We subtract the Low of the char and shift the rest upwards
by multiplying it with the RANGE:
.99999 - .5 = .49999 = Range
.8124925 - .5 (=LowOfChar) = .3124925
.3124925 / .49999 (=Range) = .6249975
(Remember: dividing by a fraction is the same as multiplying with it's reciprocal.
Dividing by .49999 is nearly the same as multiplying with 2.0, meaning for us
the shifting we searched for.)
.6249975 is in the range of the char a. So we output the char 'a'. And proceed as
before:
.99999 - .5 = .49999 = Range
.6249975 - .5 (=LowOfChar) = .1249975
.1249975 / .49999 (=Range) = .25
Well, 0.25 is clearly in the range of b and so we emit a 'b'. Then we proceed:
.49999 - .25 = .24999 = Range
.25 - .25 (=LowOfChar) = .0
.0 / .24999 (=Range) = .0
0.0 is in the range of c and we emit a 'c'. After proceeding we recognize that
we are finished and stop working. Again: It is most important to notice that
in the crunching process we emit RANGES, or maybe better formulated,
interpretations of ranges. Therefore we emit only the Low of the range in
question. As a consequence when we decrunch the 'number' we subtract the
LowOfRange of the char in question. After that we can shift the 'number' up by
the RANGE of the char.
Now two problems are emerging: Is it necessary to emit the model to the
decruncher and second: How do we implement a floating point number that can be
as long as some hundreds of kilobyte when we only have registers of maybe 80 or
128 bits?
To answer the first question: Well, what about an adaptive model? It will not
crunch efficiently in the first few hundreds bytes but then we won't crunch just
texts of 200 bytes, right?
The answer to the second question is a little bit more complex. Of course we
can't emit a floating point number consisting of thousands of decimal numbers.
But with a mathematical trick we can fake this process. I'm sure that nearly all
of you once tried to programm some 3D stuff. And after some time you came up
with the idea to shift up the floating point numbers by, say, 16384, and
calculate then with these now integer values. After the calculation you'd then
shift the result down again by 16384 and would have speeded up your floating
point calculations by the factor 1000 or so. Now, based on this we could also
say that the number 1.0 could be same as 0x010000 and 0.0 could be 0x0000.
We would then transform the whole calculations into the realm of integer values.
We can only hope that 16 bits are enough to simulate the floating point
calculations, but let's think our example thru again:
Byte | % | Low | High
a | 50% | 0x008000 | 0x00ffff
b | 25% | 0x004000 | 0x007fff
c | 25% | 0x000000 | 0x003fff
First byte:
Low: 0x000000
High: 0x010000
Range: (High - Low) = 0x010000
char: a
LowOfChar: 0x008000
HighOfChar: 0x00ffff
High= Low + (Range * HighOfChar)= 0 + (0x010000 * 0x00ffff) = BANG
Ooops. It seems that we have a little overload here. Hm, but there must be a way
to calculate those values in integer format. It it most necessary for to
output single bytes to disk. What do we do?
Well, there are certain ways out of this. I will tell you one i find extremely
elegant and also has the advantage of being extensible and able to crunch
SIMILAR values, too. This solution comes from
Gordon V. Cormack
University of Waterloo
cormack@uwaterloo.ca
and is build upon the idea that you can see the bytes you crunch also as a
sequence of BITS you crunch. Now when you observe the 8 bits of a byte and would
log the countings of the 1-bit and 0-bit states of the byte you would have a
High and a Low. They would just be transformed into 256 entries of two arrays.
In C/C++ these array would be declared: int one[256], int zero[256].
When we then would examine one single bit of a byte we would then have to fetch
the correct entry of the array and calculate the actual fraction with the values
of the entries in the arrays one and zero. What we would get out of this
calculation would be the factor to multiply with the actual range. Well, i
include the sources for both the cruncher and the decruncher at the end of this
essay (they were not formatted by me).
Let's examine the source together:
int max = 0x1000000,
min = 0,
mid,
index,
c,
i,
bytes = 0,
obytes = 3;
int bit;
int one[256], zero[256];
for (i=0;i<256;i++) {
one[i] = 1;
zero[i] = 1;
}
Ok, some ints are created and the arrays are build and initialized. Why they are
initialized with 1 is due to the fact that they are used to be divided with.
Were they not initialized it would create an error "Division by Zero".
for(;;){
c = getchar();
if (c == EOF) {
min = max-1;
fprintf(stderr,"compress done bytes in %d bytes out %d ratio %f\n",
bytes,obytes, (float)obytes/bytes);
break;
}
...
So, forever (or until End Of File) we read a char and if it's EOF we break the
loop and land here:
putchar(min>>16);
putchar((min>>8) & 0xff);
putchar(min & 0x00ff);
}
So we at least output three bytes. Again you can see that we output from the
min-value (which is of course our Low). But that is the starting and the end of
the loop. What happens in the normal course:
bytes++;
for (i=0;i<8;i++){
bit = (c << i) & 0x80;
index = (1<