
Arithmetic data compression

With the help of an article I figured out how to compress a short string
(10 characters) using floating point numbers and calculations, according
to the arithmetic compression algorithm. I used the extended data type
to store the result of the compression as it provides the greatest
precision.

How many consecutive characters can I compress at a time when using
extended, considering that only the fractional part of the number is
relevant? It should be more than 10 characters, as an extended variable
is already 10 bytes long. I guess that after a certain number of
characters, the results of the calculations become too small to store in
a variable of type extended and data loss would evidently occur.
Is there a better way to implement arithmetic compression? How does it
stack up against other compression algorithms, both in terms of speed
and compression ratio?

 

Re:Arithmetic data compression


I don't know if you are a guru, but from your message it seems that you
are not familiar with compression.

For speed, I suggest you use "GNU Compress", and for high(est?)
compression at slow speed I suggest "BZip2". (Source code for both is
available for download at the sunsites, e.g. sunsite.unc.edu.)

If you want an easy one, I suggest Huffman.


Re:Arithmetic data compression


In article <37FF7960.32488...@altavista.net>, Frederic
<f...@altavista.net> writes

Quote
>How many consecutive characters can I compress at a time when using
>extended, considering that only the fractional part of the number is
>relevant?

With a floating point representation there is no "fractional part"
(except as defined by the exponent, which could place the point almost
anywhere, even outside the represented portion of the mantissa). All of
the digits in the mantissa are available to you.

Quote
>It should be more than 10 characters, as an extended variable
>is already 10 bytes long.

That sort of defines "compression" I guess, although remember that some
of the bytes will be exponent: in the 10-byte extended format the top
two bytes hold the sign bit and a 15-bit exponent, leaving a 64-bit
mantissa.
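
A quick way to see the layout for yourself (a minimal sketch of mine,
assuming TP/BP with {$N+} so that extended is available):

var
  x : extended;
  b : array[0..9] of byte absolute x;  { overlay the 10 bytes of x }
  i : integer;
begin
  x := 0.2572167752;
  { bytes 9..8 hold the sign bit and the 15-bit exponent; }
  { bytes 7..0 are the 64-bit mantissa }
  for i := 9 downto 0 do
    write(b[i]:4);
  writeln
end.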

The question is still unanswerable unless you give further details of
the compression algorithm you are using. Some algorithms have no fixed
compression ratio anyway - particularly those which capitalise on
letter-frequency.

Quote
>I guess that after a certain number of
>characters, the results of the calculations become too small to store in
>a variable of type extended and data loss would evidently occur.
>Is there a better way to implement arithmetic compression? How does it
>stack up against other compression algorithms, both in terms of speed
>and compression ratio?

All methods have different characteristics, many are very sensitive to
the distribution of characters in the input stream. You've just got to
do some comparisons. Let us know how you get on.
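
As a yardstick for such comparisons, the zero-order entropy of the
input is the best average bits-per-character any coder driven purely by
letter frequency can achieve. A minimal, untested sketch (the function
name is mine; requires {$N+}):

function Entropy(const s : string) : extended;
var
  counts : array[#0..#255] of longint;
  ch : char;
  i : integer;
  p, h : extended;
begin
  for ch := #0 to #255 do
    counts[ch] := 0;
  for i := 1 to length(s) do
    inc(counts[s[i]]);               { count letter frequencies }
  h := 0;
  for ch := #0 to #255 do
    if counts[ch] > 0 then
    begin
      p := counts[ch] / length(s);
      h := h - p * ln(p) / ln(2)     { ln(p)/ln(2) = log2(p) }
    end;
  Entropy := h
end;

For 'BILL GATES' this gives about 3.12 bits per character, i.e. roughly
31 bits for the whole string.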

--
Marcus Morris - South Croydon, LONDON, UK (Mar...@ntos.demon.co.uk)
http://www.ntos.demon.co.uk

Re:Arithmetic data compression


Quote
Marcus Morris wrote:
> The question is still unanswerable unless you give further details of
> the compression algorithm you are using. Some algorithms have no fixed
> compression ratio anyway - particularly those which capitalise on
> letter-frequency.

I wrote the following program to compress the string 'BILL GATES'. I
strictly followed the instructions I found in an article called
"Arithmetic Coding + Statistical Modeling = Data Compression, Part 1 -
Arithmetic Coding" by Mark Nelson, Dr. Dobb's Journal, February 1991. The
result should be 0.257216775200000000, and that is exactly what my program
outputs.
Note that I am utterly new to data compression, so the following code is
probably badly structured and unoptimized. I hope you will find your way
through, though.

var
  prob  : array[#0..#255] of record
                               p : extended;    { probability of the character }
                               l, h : extended  { low/high ends of its range }
                             end;
  range  : record
             l, h : extended                    { current coding interval }
           end;
  differ : extended;
  ch : char;
  s : string;
  i : word;

begin
  { clear the statistics }
  for ch := #0 to #255 do
    with prob[ch] do
    begin
      p := 0;
      h := 0;
      l := 0
    end;

  s := 'BILL GATES';

  { count character occurrences }
  for i := 1 to length(s) do
    prob[s[i]].p := prob[s[i]].p + 1;

  { turn counts into probabilities }
  for ch := #0 to #255 do
    prob[ch].p := prob[ch].p / length(s);

  { stack the ranges along [0, 1): each character's range }
  { starts where the previous character's range ends }
  prob[#0].l := 0;
  prob[#0].h := prob[#0].l + prob[#0].p;

  for ch := #1 to #255 do
  begin
    prob[ch].l := prob[char(byte(ch)-1)].h;
    prob[ch].h := prob[ch].l + prob[ch].p
  end;

  { encode: narrow the interval once per input character }
  range.l := 0;
  range.h := 1;

  for i := 1 to length(s) do
  begin
    differ := range.h - range.l;

    range.h := range.l + differ * prob[s[i]].h;
    range.l := range.l + differ * prob[s[i]].l
  end;

  writeln('Result: ', range.l:1:18)
end.

Re:Arithmetic data compression


In article <3803453E.F1F20...@altavista.net>, Frederic
<f...@altavista.net> writes

Quote
>   for ch := #0 to #255 do
>    with prob[ch] do
>    begin
>      p := 0;
>      h := 0;
>      l := 0
>    end;

Clear the stats - ok...

Quote
>  s := 'BILL GATES';

Got your source - ok...

Quote
>  for i := 1 to length(s) do
>    prob[s[i]].p := prob[s[i]].p + 1;

Counting letter frequency. You appear to be casting characters to
integers (s[i] as subscript). I'm not sure Pascal should be allowed to
do this! That's very much a dirty C trick!

Quote
>  for ch := #0 to #255 do
>    prob[ch].p := prob[ch].p / length(s);

Convert character counts to fraction of total characters - ok...

Quote
>  prob[#0].l := 0;
>  prob[#0].h := prob[#0].l + prob[#0].p;

Don't understand. If prob[#0].l is 0 then why bother adding it to
prob[#0].p? All that appears to be happening here is the copying of
prob[#0].p to prob[#0].h, and what is prob[#0].p likely to be anyway?
Zero?

Summary - probably got an array where .p contains character frequencies
expressed as fractions and .l and .h are still zero.

Quote
>  for ch := #1 to #255 do
>  begin
>    prob[ch].l := prob[char(byte(ch)-1)].h;
>    prob[ch].h := prob[ch].l + prob[ch].p
>  end;

Copying a zero into all .l in a funny way (why?),
copying all probabilities .p into .h (why?)...

Quote
>  range.l := 0;
>  range.h := 1;

Now we guess that .l and .h are "low" and "high"...

Quote
>  for i := 1 to length(s) do
>  begin
>    differ := range.h - range.l;

>    range.h := range.l + differ * prob[s[i]].h;
>    range.l := range.l + differ * prob[s[i]].l
>  end;

Ah - the "clever bit". I'm going to have to study this for a while to
work out exactly what it's doing. However, it does bother me that 10
characters (e.g. "BILL GATES") from a set of 256 possible characters has
256^10 permutations (something like 1.2E24). Can you represent that many
unique numbers with your "extended" data type? I honestly don't know.
I've never used it and it doesn't appear to be listed in the TP7 help
file.

Quote
>  writeln('Result: ', range.l:1:18)

Have you got the algorithm which turns your result back into "BILL
GATES"? That's the acid test!

I'll go away and try it of course but at the moment the method appears
to be either an elaborate hoax or an interesting but impractical
phenomenon.

In any case, to the nearest byte the result you give would require at
least six bytes (5 mantissa + 1 exp) to represent the original ten
character bytes. Not exactly what I would call "compression" - "gentle
squeezing" more like!!!

--
Marcus Morris - South Croydon, LONDON, UK (Mar...@ntos.demon.co.uk)
http://www.ntos.demon.co.uk

Re:Arithmetic data compression


Quote
Marcus Morris wrote:
> Ah - the "clever bit". I'm going to have to study this for a while to
> work out exactly what it's doing. However, it does bother me that 10
> characters (e.g. "BILL GATES") from a set of 256 possible characters has
> 256^10 permutations (something like 1.2E24). Can you represent that many
> unique numbers with your "extended" data type? I honestly don't know.
> I've never used it and it doesn't appear to be listed in the TP7 help
> file.
> I'll go away and try it of course but at the moment the method appears
> to be either an elaborate hoax or an interesting but impractical
> phenomenon.

Okay, the algorithm is supposed to work like this (the data stream being
'BILL GATES'):

1. First, count the number of occurrences of each character in the data
stream to compress, and store it in prob[char].p.

2. Compute the probabilities: prob[char].p := prob[char].p /
length(data_stream).

3. Assign a range to each character:
   [Space] = 0.0..0.1 (p([Space]) = 0.1)
   A = 0.1..0.2 (p(A) = 0.1)
   B = 0.2..0.3 (p(B) = 0.1)
   E = 0.3..0.4 (p(E) = 0.1)
   G = 0.4..0.5 (p(G) = 0.1)
   I = 0.5..0.6 (p(I) = 0.1)
   L = 0.6..0.8 (p(L) = 0.2)
   S = 0.8..0.9 (p(S) = 0.1)
   T = 0.9..1.0 (p(T) = 0.1)

4. Compute the result as follows:
    B = 0.2..0.3 -> result := 0.2
    I = 0.5..0.6 -> result := 0.25
    L = 0.6..0.8 -> result := 0.256
    etc...

Or, quoting from the article:

Set low to 0.0
Set high to 1.0
While there are still input symbols do
    get an input symbol
    range = high - low
    high = low + range * high_range(symbol)
    low = low + range * low_range(symbol)
End of While
output low

New Character   Low value        High Value
-------------   ---------        ----------
                0.0              1.0
    B           0.2              0.3
    I           0.25             0.26
    L           0.256            0.258
    L           0.2572           0.2576
  SPACE         0.25720          0.25724
    G           0.257216         0.257220
    A           0.2572164        0.2572168
    T           0.25721676       0.2572168
    E           0.257216772      0.257216776
    S           0.2572167752     0.2572167756
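
Decoding just runs the same table in reverse: find the range the number
falls in, emit that character, then rescale. A minimal, untested sketch
of mine (assuming the prob[] table from my program above, plus a
declaration value : extended, and a known message length):

  value := 0.2572167752;
  for i := 1 to 10 do                 { 10 = known length of the string }
  begin
    ch := #0;                         { find the range straddling value }
    while not ((value >= prob[ch].l) and (value < prob[ch].h)) do
      inc(ch);
    write(ch);
    { remove the character's influence and bring the rest back to 0..1 }
    value := (value - prob[ch].l) / (prob[ch].h - prob[ch].l)
  end;
  writeln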

Re:Arithmetic data compression


JRS:  In article <vfIHKAAyW9A4E...@ntos.demon.co.uk> of Wed, 13 Oct 1999
01:43:30 in news:comp.lang.pascal.borland, Marcus Morris

Quote
<Mar...@ntos.demon.co.uk> wrote:
>In article <3803453E.F1F20...@altavista.net>, Frederic
><f...@altavista.net> writes
>>  for i := 1 to length(s) do
>>    prob[s[i]].p := prob[s[i]].p + 1;

>Counting letter frequency. You appear to be casting characters to
>integers (s[i] as subscript). I'm not sure Pascal should be allowed to
>do this! That's very much a dirty C trick!

No, prob is an array [#0..#255] indexed by char.  That's GOOD Pascal.

But, at this stage, .p is a count, and it would be better to use word or
longint, and hence be able to use the faster (I guess) Inc(prob[s[i]].p)

Quote
>>  for ch := #0 to #255 do
>>    prob[ch].p := prob[ch].p / length(s);

and so, with a new extended field (call it pp) holding the probability
while p stays an integer count:
        Rlen := 1.0/length(s) ;
        for ch := #0 to #255 do with prob[ch] do pp := p*Rlen ;
(untested)

Quote
> ...
>Ah - the "clever bit". I'm going to have to study this for a while to
>work out exactly what it's doing. However, it does bother me that 10
>characters (e.g. "BILL GATES") from a set of 256 possible characters has
>256^10 permutations (something like 1.2E24). Can you represent that many
>unique numbers with your "extended" data type? I honestly don't know.
>I've never used it and it doesn't appear to be listed in the TP7 help
>file.

Extended is not indexed, but it's in "real types", the manual, and
http://www.merlyn.demon.co.uk/pas-real.htm#FloatTypes; it's the native
type of the FPU and occupies 10 bytes.  As most combinations of bits
give distinct valid numbers (though some patterns are redundant), it's
nearly enough to encode any ten-character string - and I guess ample
for any ten non-control-character string.

However, it's a limited approach, and string compression is
fundamentally a discrete problem; it's reasonable to use floating point
in deciding how to compress, but not in actual compressing or
decompressing.

--
© John Stockton, Surrey, UK.  j...@merlyn.demon.co.uk   Turnpike v4.00   MIME. ©
 <URL: http://www.merlyn.demon.co.uk/> TP/BP/Delphi/&c., FAQqy topics & links;
 <URL: ftp://garbo.uwasa.fi/pc/link/tsfaqp.zip> Timo Salmi's Turbo Pascal FAQ;
 <URL: http://www.merlyn.demon.co.uk/clpb-faq.txt> Pedt Scragg: c.l.p.b. mFAQ.

Re:Arithmetic data compression


Quote
Frederic wrote:
>How many consecutive characters can I compress at a time when using
>extended [...]? Is there a better way to implement arithmetic
>compression? How does it stack up against other compression algorithms,
>both in terms of speed and compression ratio?

I suggest that you look at _The Data Compression Book_ by Mark Nelson,
M&T Books, 1991.  He describes a technique for arithmetic compression
that uses no floating point numeric types - all of the processing is
done using 16- and 32-bit integers.  The compressed data is the mantissa
of a single floating point number represented as a bit stream of
arbitrary length.  This bit stream is shifted through the integer
variables and bits are extracted as tokens are recognized (Mr. Nelson
explains this so that it actually makes sense...)

As far as how many characters can be packed into an extended variable,
Mr. Nelson cites a (highly contrived) example where a file of 100,000
characters was compressed with his implementation into an output file of
only 3 bytes.  Arithmetic compression is not the fastest process because
of the calculation overhead, but if you have an accurate model of the
data it can be very effective.
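
To give the flavour of that integer style, here is a rough, untested
sketch of the encoder core (the names are mine, not Mr. Nelson's; scale
must stay below about 2^14 so the longint products cannot overflow):

const
  HalfV   = $8000;
  Quarter = $4000;

var
  low, high : longint;          { 16-bit window onto the code string }
  pending : integer;            { opposite bits owed after underflow }

procedure OutputBit(bit : integer);
begin
  write(bit);                   { stand-in for a real bit stream }
  while pending > 0 do
  begin
    write(1 - bit);
    dec(pending)
  end
end;

procedure EncodeSymbol(cumLow, cumHigh, scale : longint);
var
  range : longint;
  done : boolean;
begin
  { narrow the interval, exactly as in the floating point version }
  range := high - low + 1;
  high := low + (range * cumHigh) div scale - 1;
  low  := low + (range * cumLow)  div scale;
  { shift out the leading bits the two ends already agree on }
  done := false;
  repeat
    if high < HalfV then
      OutputBit(0)                             { both in lower half }
    else if low >= HalfV then
    begin
      OutputBit(1);                            { both in upper half }
      dec(low, HalfV);
      dec(high, HalfV)
    end
    else if (low >= Quarter) and (high < HalfV + Quarter) then
    begin
      inc(pending);                            { straddling the middle }
      dec(low, Quarter);
      dec(high, Quarter)
    end
    else
      done := true;
    if not done then
    begin
      low := low * 2;                          { pull in the next bit }
      high := high * 2 + 1
    end
  until done
end;

Initialise low := 0, high := $FFFF and pending := 0 before the first
symbol, and flush a couple of extra bits after the last one to pin the
final interval down.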

This is a very good book for an overview of several compression
algorithms, even if the examples and source code (on disc) are (alas!)
in C.

-Tom Cloutier

Re:Arithmetic data compression


Quote
Frederic wrote:
> With the help of an article I figured out how to compress a short string
> (10 characters) using floating point numbers and calculations, according
> to the arithmetic compression algorithm.

Funny you should ask.  My 1976 Stanford Ph.D. dissertation "Source
coding algorithms for fast data compression" was some of the fundamental
work on arithmetic compression.

Quote
> I used the extended data type
> to store the result of the compression as it provides the greatest
> precision.

> How many consecutive characters can I compress at a time when using
> extended, considering that only the fractional part of the number is
> relevant?

There is no hard-and-fast answer to that.  It depends on the
probabilities assigned to the characters, since the length of the
codestring grows as their surprise (-log(probability)).  You can
encode many characters which are not surprising (high probability)
or few which are surprising (low probability).
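
To make that concrete with the 'BILL GATES' figures from earlier in
the thread: the eight characters with p = 0.1 cost -log2(0.1), about
3.32 bits each, and the two Ls with p = 0.2 cost about 2.32 bits each,
so the whole string needs roughly 8*3.32 + 2*2.32 = 31.2 bits - around
four bytes of mantissa, which squares with the earlier estimate of
five mantissa bytes.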

Quote
> It should be more than 10 characters, as an extended variable
> is already 10 bytes long.

Not all of the bytes in an "extended" variable are devoted to the
"fractional" part (the mantissa).  But yes, you can get past 10
characters if the characters are highly probable.

Quote
> I guess that after a certain number of
> characters, the results of the calculations become too small to store in
> a variable of type extended and data loss would evidently occur.

True!

Quote
> Is there a better way to implement arithmetic compression?

Yes.  Use multi-precision arithmetic, where the words get extended
to arbitrary length as required.  This would normally cause run time
to grow as the square of the length of the string to be encoded, but
there are ways to avoid it by controlling carry-over.

See, for example,
    R.B. Arps, T.K. Truong, D.J. Lu, R.C. Pasco, and T.D.
    Friedman, "A multi-purpose VLSI chip for adaptive data
    compression of bi-level images," IBM Journal of Research
    & Development, Vol.32, No.6, November, 1988, pp. 775-795.

Actually, that entire issue was devoted to the Q-coder, a type of
arithmetic coder which uses fixed-point arithmetic and various other
techniques.

Quote
> How does it
> stack up against other compression algorithms, both in terms of speed
> and compression ratio?

Depends on the data to be compressed, and how well you can model its
expected probabilities!  Arithmetic coding is not the fastest, but
gets mathematically optimum compression, given the right model.

     - Rich

Re:Arithmetic data compression


Quote
Marcus Morris wrote:
> With a floating point representation there is no "fractional part"
> (except as defined by the exponent, which could place the point almost
> anywhere, even outside the represented portion of the mantissa). All of
> the digits in the mantissa are available to you.

I think he meant the "mantissa" (as opposed to the "exponent") when he
used the term "fractional part."

Quote
> The question is still unanswerable unless you give further details of
> the compression algorithm you are using. Some algorithms have no fixed
> compression ratio anyway - particularly those which capitalise on
> letter-frequency.

Exactly!  With arithmetic coding, the "compression ratio" depends on
the "probabilities" assigned to each character by the model, and the
frequencies with which those characters actually occur.

If characters are assigned low probabilities by the model but occur
frequently, the coder can actually expand the data.

     - Rich

Re:Arithmetic data compression


Quote
>For speed, I suggest you use "GNU Compress", and for high(est?)
>compression at slow speed I suggest "BZip2". (Source code for both is
>available for download at the sunsites, e.g. sunsite.unc.edu.)

For zlib (on which GNU Compress is based) Pascal code exists; I don't know
if it will work with TP though (Delphi/FPC).
Do you know of any Pascal code for the BZip2 system?

P.S. A link to the page with paszlib can be found on www.freepascal.org in
the contributed units section. Maybe you can port it to TP, or it may even
be TP-ready already.
