Valhalla Legends Forums Archive | General Programming | Compression

AuthorMessageTime
Arta
Disclaimer: I know nothing about compression.

I need a compression algorithm that is fast, and will provide acceptible compression for (mostly) textual Data. The data in question looks like it ought to be very compressible to me. The most important thing is that the algorithm is as fast as possible. 

Can anyone advise?
November 21, 2004, 12:39 AM
Spht
[quote author=Arta[vL] link=topic=9622.msg89559#msg89559 date=1100997565]
Disclaimer: I know nothing about compression.

I need a compression algorithm that is fast, and will provide acceptible compression for (mostly) textual Data. The data in question looks like it ought to be very compressible to me. The most important thing is that the algorithm is as fast as possible. 

Can anyone advise?
[/quote]

I ran in to a simple algorithm a long time ago which was fairly interesting for someone that doesn't know anything about compression.  I don't recall the name of it, but it just subsituted characters for bits, so for example 'A' could be 1; 'B' is 10, 'C' is 11, 'D' is 100, etc.

"HELLO" would result in being 1000 101 1100 1100 1111, which is 04 5c cf (3 bytes).
November 21, 2004, 12:48 AM
Arta
Isn't that Huffman?

(ok, I might know a tiny bit :))

PS: But I don't know how fast huffman is
November 21, 2004, 12:54 AM
K
decoding huffman is just accessing nodes a binary tree, so the amortized time would be O(log(n)).  Keep in mind that the number of nodes you will have will be greater than the number of characters since data can only be stored at leafs.

encoding huffman is O(1)/character => O(n) if you already have a tree generated.  And since you're dealing with text, you could just pre-generate a tree based on the letter probability of the english language.
November 21, 2004, 1:46 AM
TehUser
How are the compressed characters read out?  For instance, how would you know that 11 is 'C' instead of "AA"?
November 21, 2004, 1:47 AM
K
[quote author=TehUser link=topic=9622.msg89565#msg89565 date=1101001641]
How are the compressed characters read out?  For instance, how would you know that 11 is 'C' instead of "AA"?
[/quote]

When the bit-path reaches a leaf you will have found the character.  There is never a case where one character is represented by the same sequence of bits of another character + some, since data is only stored at leaves.

Edit:

for example, the tree:

[code]
    (root)
  /      \
/  \    /  \
a  /\  d  e
    b c
[/code]

let a 0 bit represent a left-branch and a 1 bit represent a right branch beginning from the root moving downward.  a would be encoded as 00;  b = 010; c =  011; d = 10; e = 11;

pratice example! ;) decode: 0100010 1110
November 21, 2004, 1:48 AM
iago
Huffman encoding is da bomb.  Read up on it, and I have the algorithm coded in Java if you want to have a look.
November 21, 2004, 1:57 AM
K
[quote author=iago link=topic=9622.msg89569#msg89569 date=1101002239]
Huffman encoding is da bomb. Read up on it, and I have the algorithm coded in Java if you want to have a look.
[/quote]

Completely agree ;) -- a good source to read up on it is "Introduction to Algorithms - Second Edition," which I'm sure you can aquire from somewhere.
[me=K]coughs at TehUser[/me]
November 21, 2004, 1:59 AM
Arta
It's *mostly* textual data - there is some binary data too. My tree would have to contain every value from 0x00 to 0xFF: does that make a difference?

From what I remember, huffman relies on using the unused bits in the data (?). I'm not sure I have any!
November 21, 2004, 11:13 AM
St0rm.iD
Huffman uses binary trees to represent bytes, just as K explained. What you want to do if you want good compression is put the more numerous occuring bytes (for example, in a text document the " " character occurs often) near the top of the tree, so it is represented by a smaller number of bits than the lower-level leaves.

After this, your next compression optimization is to scan for often-repeated sequences of characters (i.e. if you're compressing HTML, <td>, <tr>, </td> etc) and place those in the tree as well.

HTH
November 21, 2004, 2:43 PM
iago
Huffman (the way I've done it) will take the most common character and give it the shortest binary sequence, and take the least common character, and give it the longest sequence.  It can probably be optimized to use sequences of characters ("<td>") instead of single characters, but I've never done it like that.
November 21, 2004, 4:10 PM
Kp
One possible problem with Huffman coding is you must know the frequency of each character before compression can begin, so for optimum performance you must scan the data twice: once to compute probabilities, and again to compute the compressed stream based on the tree formed from those probabilities.  This is mainly a drawback if it's a problem to retain the data (i.e. it doesn't work well as a stream cipher, since you'd have to buffer up the entire stream to allow doing that second pass).  Also, you must pack along your Huffman dictionary so that the decompressor can determine how to lay out the tree.  Finally, choose the format of your dictionary carefully.  I've seen some people try to send the dictionary as the probability table, which can cause some interesting results if the decompressor builds the tree differently (for instance, given two nodes of equal weight, which one becomes the left node?  -- if the compressor and decompressor disagree, you produce garbage output)

If these constraints are a problem in your case, look into LZ78 compression.
November 21, 2004, 5:53 PM
Skywing
FYI, zlib might suit your needs (and it's got a liberal license).
November 21, 2004, 6:15 PM
iago
If your data is always similar (always text with a certain occurance of other symbols), you might be able to use a static huffman tree.  Although it might not be completely optimal, it'll save you from having to serialize/send the entire tree.
November 21, 2004, 8:25 PM
K
[quote author=iago link=topic=9622.msg89617#msg89617 date=1101053442]
Huffman (the way I've done it) will take the most common character and give it the shortest binary sequence, and take the least common character, and give it the longest sequence. It can probably be optimized to use sequences of characters ("<td>") instead of single characters, but I've never done it like that.
[/quote]

The standard algorithm to create the tree is
[code]
For All Nodes N
    Insert N into V
While Count[V] > 1
    A = PopSmallest[V]
    B = PopSmallest[V]
    X = new Node
    Left[X] = A
    Right[X] = B
    Weight[X] = Weight[A] + Weight[B]
    Insert X into V

Root = Pop[V]
[/code]
November 21, 2004, 10:48 PM
Adron
[quote author=Arta[vL] link=topic=9622.msg89608#msg89608 date=1101035581]
It's *mostly* textual data - there is some binary data too. My tree would have to contain every value from 0x00 to 0xFF: does that make a difference?

From what I remember, huffman relies on using the unused bits in the data (?). I'm not sure I have any!
[/quote]

You can have different trees for different parts of your data if there are well-defined boundaries. Huffman can handle binary data, but you might not get any compression if there aren't repeating patterns in the data.
November 27, 2004, 2:18 PM
Arta
Here's a sample:

[code]
27/11/2004 15:54:38, SENT:
0000 FE 04 A2 01 41 64 6D 69 6E 69 73 74 72 61 74 6F þ.¢.Administrato
0010 72 00 60 A7 88 00 13 00 50 72 6F 66 69 6C 65 5C r.`§ˆ...Profile\
0020 53 65 78 00 50 72 6F 66 69 6C 65 5C 41 67 65 00 Sex.Profile\Age.
0030 50 72 6F 66 69 6C 65 5C 4C 6F 63 61 74 69 6F 6E Profile\Location
0040 00 50 72 6F 66 69 6C 65 5C 44 65 73 63 72 69 70 .Profile\Descrip
0050 74 69 6F 6E 00 52 65 63 6F 72 64 5C 53 45 58 50 tion.Record\SEXP
0060 5C 30 5C 57 69 6E 73 00 52 65 63 6F 72 64 5C 53 \0\Wins.Record\S
0070 45 58 50 5C 30 5C 4C 6F 73 73 65 73 00 52 65 63 EXP\0\Losses.Rec
0080 6F 72 64 5C 53 45 58 50 5C 30 5C 44 69 73 63 6F ord\SEXP\0\Disco
0090 6E 6E 65 63 74 73 00 52 65 63 6F 72 64 5C 53 45 nnects.Record\SE
00A0 58 50 5C 30 5C 4C 61 73 74 20 47 61 6D 65 00 52 XP\0\Last Game.R
00B0 65 63 6F 72 64 5C 53 45 58 50 5C 30 5C 4C 61 73 ecord\SEXP\0\Las
00C0 74 20 47 61 6D 65 20 52 65 73 75 6C 74 00 52 65 t Game Result.Re
00D0 63 6F 72 64 5C 53 45 58 50 5C 31 5C 57 69 6E 73 cord\SEXP\1\Wins
00E0 00 52 65 63 6F 72 64 5C 53 45 58 50 5C 31 5C 4C .Record\SEXP\1\L
00F0 6F 73 73 65 73 00 52 65 63 6F 72 64 5C 53 45 58 osses.Record\SEX
0100 50 5C 31 5C 44 69 73 63 6F 6E 6E 65 63 74 73 00 P\1\Disconnects.
0110 52 65 63 6F 72 64 5C 53 45 58 50 5C 31 5C 52 61 Record\SEXP\1\Ra
0120 74 69 6E 67 00 52 65 63 6F 72 64 5C 53 45 58 50 ting.Record\SEXP
0130 5C 31 5C 48 69 67 68 20 52 61 74 69 6E 67 00 44 \1\High Rating.D
0140 79 6E 4B 65 79 5C 53 45 58 50 5C 31 5C 52 61 6E ynKey\SEXP\1\Ran
0150 6B 00 52 65 63 6F 72 64 5C 53 45 58 50 5C 31 5C k.Record\SEXP\1\
0160 48 69 67 68 20 52 61 6E 6B 00 52 65 63 6F 72 64 High Rank.Record
0170 5C 53 45 58 50 5C 31 5C 4C 61 73 74 20 47 61 6D \SEXP\1\Last Gam
0180 65 00 52 65 63 6F 72 64 5C 53 45 58 50 5C 31 5C e.Record\SEXP\1\
0190 4C 61 73 74 20 47 61 6D 65 20 52 65 73 75 6C 74 Last Game Result
01A0 00 00                                          ..

27/11/2004 15:54:38, SENT:
0000 FE 02 16 00 41 64 6D 69 6E 69 73 74 72 61 74 6F þ...Administrato
0010 72 00 D0 A0 88 00                              r.Ð ˆ.

27/11/2004 15:54:38, RECV:
0000 FE 04 B1 02 01 00 00 00 60 A7 88 00 41 64 6D 69 þ.±.....`§ˆ.Admi
0010 6E 69 73 74 72 61 74 6F 72 00 13 00 50 72 6F 66 nistrator...Prof
0020 69 6C 65 5C 53 65 78 00 3E 00 BC 00 0A 0A 0A 8C ile\Sex.>.¼....Œ
0030 88 86 87 80 A8 86 81 80 8B 8B AC BD A5 AA 97 B4 ˆ†‡€¨†?€‹‹¬½¥ª—´
0040 95 B7 B0 B6 A5 87 A3 86 86 AE B8 96 B0 B3 9B B2 •·°¶¥‡£††®¸–°³›²
0050 B9 AC 85 96 98 BC BC 89 9B BC BB 9B 85 BA A2 94 ¹¬…–˜¼¼‰›¼»›…º¢”
0060 AD AC 80 A7 A3 93 A9 80 92 BE AF 8F AC AF AC BC ­¬€§£“©€’¾¯?¬¯¬¼
0070 B9 81 9C 92 87 A8 B7 9C BC A7 BB 90 A2 A9 8E 95 ¹?œ’‡¨·œ¼§»?¢©Ž•
0080 84 A1 8C B2 91 B3 85 B8 BF 9F BB AF 9B B9 8E BB „¡Œ²‘³…¸¿Ÿ»¯›¹Ž»
0090 B2 9D A6 80 96 AC 95 94 B8 99 98 8E 87 9E 8F 8B ²?¦€–¬•”¸™˜Ž‡ž?‹
00A0 9C 96 B2 92 AD B0 A1 87 81 A0 BC 9A 93 BC 9B BE œ–²’­°¡‡? ¼š“¼›¾
00B0 8F 8D 8C A1 B5 BD 99 B9 A7 8A A4 B1 8B 8D A6 AD ??Œ¡µ½™¹§Š¤±‹?¦­
00C0 A7 9C 9F 87 BA 8F 98 89 A2 BD A7 83 92 86 AE AB §œŸ‡º?˜‰¢½§ƒ’†®«
00D0 8B 90 9F 95 BD 95 82 BF 9D AB AA B0 B1 8A B2 98 ‹?Ÿ•½•‚¿?«ª°±Š²˜
00E0 9F 95 A5 80 A0 84 91 00 50 72 6F 66 69 6C 65 5C Ÿ•¥€ „‘.Profile\
00F0 41 67 65 00 3E 00 01 00 00 50 72 6F 66 69 6C 65 Age.>....Profile
0100 5C 4C 6F 63 61 74 69 6F 6E 00 3E 00 01 00 00 50 \Location.>....P
0110 72 6F 66 69 6C 65 5C 44 65 73 63 72 69 70 74 69 rofile\Descripti
0120 6F 6E 00 3E 00 01 00 00 52 65 63 6F 72 64 5C 53 on.>....Record\S
0130 45 58 50 5C 30 5C 57 69 6E 73 00 80 03 00 00 52 EXP\0\Wins.€...R
0140 65 63 6F 72 64 5C 53 45 58 50 5C 30 5C 4C 6F 73 ecord\SEXP\0\Los
0150 73 65 73 00 80 03 00 00 52 65 63 6F 72 64 5C 53 ses.€...Record\S
0160 45 58 50 5C 30 5C 44 69 73 63 6F 6E 6E 65 63 74 EXP\0\Disconnect
0170 73 00 80 03 00 00 52 65 63 6F 72 64 5C 53 45 58 s.€...Record\SEX
0180 50 5C 30 5C 4C 61 73 74 20 47 61 6D 65 00 80 03 P\0\Last Game.€.
0190 00 00 52 65 63 6F 72 64 5C 53 45 58 50 5C 30 5C ..Record\SEXP\0\
01A0 4C 61 73 74 20 47 61 6D 65 20 52 65 73 75 6C 74 Last Game Result
01B0 00 80 03 00 00 52 65 63 6F 72 64 5C 53 45 58 50 .€...Record\SEXP
01C0 5C 31 5C 57 69 6E 73 00 80 03 00 00 52 65 63 6F \1\Wins.€...Reco
01D0 72 64 5C 53 45 58 50 5C 31 5C 4C 6F 73 73 65 73 rd\SEXP\1\Losses
01E0 00 80 03 00 00 52 65 63 6F 72 64 5C 53 45 58 50 .€...Record\SEXP
01F0 5C 31 5C 44 69 73 63 6F 6E 6E 65 63 74 73 00 80 \1\Disconnects.€
0200 03 00 00 52 65 63 6F 72 64 5C 53 45 58 50 5C 31 ...Record\SEXP\1
0210 5C 52 61 74 69 6E 67 00 80 03 00 00 52 65 63 6F \Rating.€...Reco
0220 72 64 5C 53 45 58 50 5C 31 5C 48 69 67 68 20 52 rd\SEXP\1\High R
0230 61 74 69 6E 67 00 80 03 00 00 44 79 6E 4B 65 79 ating.€...DynKey
0240 5C 53 45 58 50 5C 31 5C 52 61 6E 6B 00 80 03 00 \SEXP\1\Rank.€..
0250 00 52 65 63 6F 72 64 5C 53 45 58 50 5C 31 5C 48 .Record\SEXP\1\H
0260 69 67 68 20 52 61 6E 6B 00 80 03 00 00 52 65 63 igh Rank.€...Rec
0270 6F 72 64 5C 53 45 58 50 5C 31 5C 4C 61 73 74 20 ord\SEXP\1\Last
0280 47 61 6D 65 00 80 03 00 00 52 65 63 6F 72 64 5C Game.€...Record\
0290 53 45 58 50 5C 31 5C 4C 61 73 74 20 47 61 6D 65 SEXP\1\Last Game
02A0 20 52 65 73 75 6C 74 00 80 03 00 00 00 80 03 00  Result.€....€..
02B0 00                                              .
[/code]

Each of those blocks is a message that would have to be compressed individually.
November 27, 2004, 3:54 PM
Adron
What platforms should this run on? I think zlib would be a good candidate if it works on what you need it to run on.
November 27, 2004, 3:59 PM
Arta
Win32
November 27, 2004, 4:50 PM
Adron
[quote author=Arta[vL] link=topic=9622.msg90241#msg90241 date=1101574223]
Win32
[/quote]

zlib works fine on Win32. Use that :)

For a while I was thinking perhaps you were making something that needed to run on cellphones or something - a bot client. I'll assume this is for storing profile entries in your bnet server?
November 27, 2004, 5:08 PM
Arta
Yes, for account information messages between BNCS nodes and the Database Server. zlib seems to work very well :) Thanks!
November 28, 2004, 3:08 PM

Search