Lossless Text Compression [pdf] (bilalonureskili.com)
66 points by kronxe 9 months ago | 43 comments



I can't tell if this post is presenting a trivial idea as a novel one or not. Typical compression algorithms indeed are not good at compressing a short text, but only because they have no prior knowledge about the input distribution. There are several ways to feed that knowledge to existing algorithms (e.g. zstandard custom dictionaries).
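
For illustration, here is a minimal python-zstandard sketch of that dictionary approach; the sample corpus is made up (training needs a reasonably large, varied set of samples), so treat it as a sketch rather than a benchmark:

  import zstandard as zstd

  # Hypothetical corpus of short, similarly structured records (e.g. plate-like strings).
  samples = [f"{i % 81 + 1:02d} ABC {i:04d}".encode() for i in range(2000)]

  # Train a shared dictionary once, then reuse it for every tiny input.
  dictionary = zstd.train_dictionary(4096, samples)
  compressor = zstd.ZstdCompressor(dict_data=dictionary)
  decompressor = zstd.ZstdDecompressor(dict_data=dictionary)

  frame = compressor.compress(b"34 ABC 1234")
  assert decompressor.decompress(frame) == b"34 ABC 1234"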

The "Type-Based Compression Algorithm" presented in the post does not do any kind of distribution modelling. It simply assumes that every possible input is uniformly distributed, hence incompressible, and that's obviously not true. The post gives an example of Turkish car number plates; its first two digits indicate the province code, so you will see certain two-digit prefixes more frequently. TBCA can't see this distribution.

TBCA could also limit the range of the first two digits for better coding, but that would break with a new province code. Typical compression algorithms might be inefficient on those inputs but can surely handle them. This is perhaps the biggest drawback of TBCA: a wrong assumption about the input data results in a hard failure (unless you have an escape code). It is not the kind of algorithm you can compare with general compression algorithms. It is just a clever database schema that gives you some compression without using those general compression algorithms.


One of my main goals is to preserve the database structure as it is, because in the big data age we need to get data as fast as possible. I see your point about viewing this algorithm as basically some sort of look-up table, but actually it is not.

For example, imagine a well-developed city that gathers all data about the city and its people into one big database. This city then uses a TBCA specialized for that database's needs, like a framework or an engine. However, this specific TBCA is not totally different from the TBCAs used in other kinds of databases, such as a game database, since they have common properties like people's names and surnames: the structure of a name database is generally the same, but TBCA plays a huge role here because you can configure the algorithm to your needs, like an optimization step. Today I am not sure how that should be done; maybe with an ML algorithm.

I know I wrote too much, but my point is that TBCA is not a specific algorithm like gzip or LZW; it is a sub-branch of compression, like a universal set.

In the future, communities may share their specialized algorithms for certain structures along with their datasets (frequency analyses). That would become a pool from which you can choose the best engine (structural method) and the best dataset (frequency analysis).


> I know I wrote too much, but my point is that TBCA is not a specific algorithm like gzip or LZW; it is a sub-branch of compression, like a universal set.

Yes, TBCA is a scheme, not a specific algorithm (I thought I was clear enough in the reply, but sorry if I wasn't). In fact I've done the same thing with my own database in a semi-automated fashion based on a pattern. For example, I had a string with three parts: a number 1--214, an optional ', and a number -9--99. My code accepts a pattern `[1..214]['?].[-9..99]` and generates code that packs this into 8 + 1 + 7 = 16 bits. This works because I was dealing with the Unicode Character Database (the example being kRSJapanese), so I knew its exact pattern without exception and had no migration problem, since upgrading to a newer UCD is a non-trivial task anyway. I also wanted to put the entirety of the UCD into shared memory, so I controlled most of the data structure to make this compression actually worthwhile.
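
To make that concrete, a rough sketch of the kind of packing code such a pattern could generate (the function names are mine, not my actual generated code):

  def pack(n, has_apostrophe, m):
      # [1..214] -> 8 bits, ['?] -> 1 bit, [-9..99] -> 7 bits (offset by 9)
      assert 1 <= n <= 214 and -9 <= m <= 99
      return (n << 8) | (int(has_apostrophe) << 7) | (m + 9)

  def unpack(code):
      return code >> 8, bool((code >> 7) & 1), (code & 0x7F) - 9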

My issue with TBCA presented in this way is that it looks like a direct replacement for general compression algorithms when it isn't. I regard this as a database schema because it is akin to RDBMS normalization: if you compress a name "<given> <sur>" (or vice versa) in this way, you could equally have a separate name-parts table and two indices in the original table. The only difference is that you have hard-coded that name-parts table into your app. I believe you should instead have used caching, so the name-parts table still exists but you refer to the in-memory copy when you can. That makes for a better general approach than TBCA, and also shows that it can't be compared with compression algorithms.


What exactly is lossy text compression then?

EDIT: I love every single answer below


I implemented a "conceptual" lossy text compression a few months ago by using word2vec (glove) to convert each word to a list of floats, laid the result out in a square, then jpeg compressed it at different quality levels. This ensured the decompressed output was all real words, but lowering the jpeg quality basically controlled the descent into synonyms and madness.

This was a follow up to a similar demo where a colleague compressed the raw ascii bytes as jpeg, which started devolving into nonsense (that didn't even look like words anymore) below ~97 quality.

I'm away from my computer for the next week or so but if you catch me after that I'm happy to post some of the results. Would be interesting to try with some other image formats as well.

--

Could also play with the data layout to make the jpeg curves make sense in a different way. Right now the vectors are stacked end to end, like:

  v0 = vec(words[0])
  v1 = vec(words[1])
  out = v0[0...len] + v1[0...len]
If you consider the vector to be an N dimensional space, jpeg compression probably makes more sense if you lay it out like:

  out = v0[0] + v1[0] ... v0[len-1] + v1[len-1]
This would allow the DCT to apply to continuous coordinate spaces from the vectors.
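
In numpy terms (the variable names are mine, not the parent's), the difference is roughly:

  import numpy as np

  vecs = np.random.rand(4, 50)       # stand-in for vec(words[i]), one row per word

  end_to_end = vecs.reshape(-1)      # v0[0..len] + v1[0..len] + ...
  interleaved = vecs.T.reshape(-1)   # v0[0] + v1[0] + ..., then v0[1] + v1[1] + ...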

--

Another idea that would possibly have even more sensible output at good compression ratios, would be to train a language model neural network encoder/decoder pair on a text compression task. Teach it to "downscale" and "upscale" text, perhaps in an unsupervised or semi-supervised manner. Then you can also constrain the intermediate form.


Here's a quick zero shot text compression attempt with GPT3, compressing then decompressing some of my sentences picked from upthread. Looks pretty impressive imo.

Notes: These results include every attempt I made and are not curated in any way, except for the "image formats" compression when I manually asked GPT3 for additional output because it stopped short.

To construct the same request, copy the appropriate prompt and append two more lines matching the prompt, such as `{prompt}\n\nLong: your input text\nShort:` then ask GPT3 for a completion.
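
For reference, a request like that could be made with the legacy (pre-1.0) openai Python client roughly as below; the model name and parameters are my guesses, not necessarily what was used here:

  import openai

  prompt = open("compress_prompt.txt").read()   # the prompt shown below
  text = "I want this sentence, only shorter."

  completion = openai.Completion.create(
      model="text-davinci-002",
      prompt=f"{prompt}\n\nLong: {text}\nShort:",
      max_tokens=64,
      temperature=0,
  )
  print(completion.choices[0].text.strip())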

--

Compression:

Prompt

  Make text shorter or longer, using slang, chat speak, or just add or remove vowels and silent letters.

  Long: This text is compressed.
  Short: dis txt is shrt
Outputs (each prediction starts after Short:)

Long: I want this sentence, only shorter.

Short: I wnt dis sentnce, only shrr

Long: This ensured the decompressed output was all real words, but lowering the jpeg quality basically controlled the descent into synonyms and madness.

Short: dis enrvd th cmpd outut b all rl wrds, but lwg th jpg qulity basically cntrlwd th dscnt into synns and mds.

Long: I'm away from my computer for the next week or so but if you catch me after that I'm happy to post some of the results. Would be interesting to try with some other image formats as well.

Short: I'm awy frm my cmpu 4 da nex wk or so bt if u ctch me aftr dat I'm hppy 2 post sum of d rzltz. Wd b intrestng 2 try w/ sum odder imjd frms as wl.

--

Now I take the above outputs (without the inputs) and feed them back into GPT3 with a different prompt.

Decompression:

Prompt

  Make text longer. Expand slang and chat speak, or just add vowels, silent letters, and punctuation.

  Short: dis txt is shrt
  Long: This text is compressed.

  Short: I wnt dis sentnce, only shrr
  Long: I want this sentence, only shorter.
Outputs (each prediction starts after Long:)

Short: dis enrvd th cmpd outut b all rl wrds, but lwg th jpg qulity basically cntrlwd th dscnt into synns and mds.

Long: This enhanced the compressed output by all real words, but long the JPEG quality basically controlled the descent into synonyms and made-up words.

Short: I'm awy frm my cmpu 4 da nex wk or so bt if u ctch me aftr dat I'm hppy 2 post sum of d rzltz. Wd b intrestng 2 try w/ sum odder imjd frms as wl.

Long: I'm away from my computer for the next week or so but if you catch me after that I'm happy to post some of the results. Would be interesting to try with some other image forms as well.


For example, you can shorten the words by removing the vowels.

An example: https://studio.code.org/projects/applab/hxXJIEGg2yza_Q7t9W04...


"Why waste time say lot word when few word do trick?"


Or, to quote the article, “there is a necessity of knowing dictionary that used in compression process”.


Words said I like. Button press up.


Thank.


+


There's a "translate-shell" package for Linux that provides a "trans" command. It makes for the most fun lossy pipeline, though there's no compression.

  $ echo "It’s Handy That I Have My Library Card Because I’m Totally Checking You Out." | trans -b en:vi | trans -b vi:en
  Completely free when you need to inspect the fish and the whole country to check the water. report.


Surely everyone had to learn how to do lossy text compression in primary school? At least everyone my age in the UK I think. It's called a précis.


If you use a smaller font then you can fit more text on the page, but it's hard to read.


Ths is losy tet comresio


Que?


A summary.

Like with audio and video, you lose some of the original details, but maybe that doesn’t matter to you if it captures enough of what you find important.


Take the English text, translate it into a language with simpler grammar, such as Bahasa Malaysia. Then do lossless compression on the result (you have to do this because e.g. the plural form of a noun in BM is simply the noun, twice).

To decompress do the process in reverse.


d way v used 2 txt b4 on fb


ᵀʰⁱˢ ˢ ᵃⁿ ᵉˣᵃᵐᵖˡ ᵒᶠ ˡᵒˢˢⁱ ᵗᵉˣ ᶜᵒᵐᵖʳᵉˢˢⁱᵒⁿ.


wow it's 50% smaller!


I always use "congrats" instead of "congralutions" because i always misspell it and the other side always get it even though "congrats" could mean other things


Any typo i make i blame on lossy compreson.


1. Start with your text

2. Remove some sentences here and there

3. ???

4. Profit


Yes, I was thinking the same thing when I read the title. I prefer to have my Dostojevski the same after compression/decompression.

But seriously, is lossy text compression a thing (useful)?


or replacing synonyms with single representation


White space removal


Character removal


One thing I'd encourage looking at is dynamic Markov coding. It's easy enough to implement and gets you to about 1/5th of the original size for text compression. Still not at the ~6x ratio of the current best (paq8), but it's close. There's no pre-built dictionary involved: as you encode or decode, you update the probabilities and build the model on the fly, and encode with arithmetic coding.
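
Not DMC itself (a real DMC also clones states as their counts grow), but a sketch of the adaptive part: per-state bit counts that become the probabilities an arithmetic coder consumes, updated as each bit is coded. Names are illustrative only.

  class BitState:
      def __init__(self):
          self.counts = [1.0, 1.0]   # smoothed counts for next bit = 0 / 1
          self.next = [None, None]   # successor states (the Markov structure)

      def p(self, bit):
          # probability handed to the arithmetic coder before coding `bit`
          return self.counts[bit] / sum(self.counts)

      def update(self, bit):
          self.counts[bit] += 1      # adapt as the bit stream is seen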


Prediction by partial matching (PPM) is also very good. The "D" variant, which comes with 7zip, gives very good compression.

The longer the text and the more "common", the better.


Yes, I think Markov coding and Markov chains are great things to learn. But my point in this article is that there should be a stable dictionary, because for big data purposes, especially in databases, it is hard to calculate new probabilities every time.


My favorite is Context Tree Weighting (CTW) (https://ieeexplore.ieee.org/document/382012). Mathematically it's much nicer than PPM. Modern implementations of CTW achieve ratios close to or better than PPM in most domains (see e.g. P. Volf's Thesis "Weighting techniques in data compression")


If you're compressing Turkish, you can use syllables. Turkish has a nice feature of deterministic syllable division, which can be done via a neat state machine.

I've implemented it once; however, I failed to encode it very efficiently on disk. Nevertheless, it's promising. Here's what I've done:

https://www.ccis2k.org/iajit/PDF/vol.8,no.1/12.pdf
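
A rough sketch of the deterministic rule (one vowel per syllable; of the consonants between two vowels, only the last one starts the next syllable). This is my approximation, not the paper's state machine, and it ignores loanword clusters like the "tr" in "elektrik":

  VOWELS = set("aeıioöuüAEIİOÖUÜ")

  def syllables(word):
      vowel_pos = [i for i, ch in enumerate(word) if ch in VOWELS]
      bounds = [0]
      for prev, cur in zip(vowel_pos, vowel_pos[1:]):
          # no consonant in between: break right before the vowel;
          # otherwise the last consonant starts the new syllable
          bounds.append(cur if cur - prev == 1 else cur - 1)
      bounds.append(len(word))
      return [word[a:b] for a, b in zip(bounds, bounds[1:])]

  # syllables("kitaplık") -> ['ki', 'tap', 'lık']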


By the way, I want to add something. I did not include any code in the paper, but I tested all of these ideas in Python, and they can be implemented in most languages.

Also, my other point is that it is not a stable, constant algorithm. Everyone may have different TBCAs for their specific purposes, but there should be generic ideas, like human names, or, in an online game, mapping 'GG' to '~' so that you save many memory locations. For a chat application, if you map 'hello' to some two-byte code, you save both memory and time (because transmitting the longer bytes probably costs more than the compression process, if you don't have a huge dictionary).
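
A toy sketch of that kind of fixed substitution table (the byte codes here are made up, not from the paper):

  TABLE = {"hello": b"\x01\x01", "GG": b"\x7e"}        # 0x7e is '~'
  INVERSE = {v: k for k, v in TABLE.items()}

  def encode(msg: str) -> bytes:
      return TABLE.get(msg, b"\x00" + msg.encode())    # 0x00 = escape: raw text follows

  def decode(data: bytes) -> str:
      return data[1:].decode() if data[:1] == b"\x00" else INVERSE[data]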


Cool idea. I would expect the code to be very clean and simple for this, so I'm surprised there isn't a pseudocode algorithm presented (or a real implementation in some language).


I tested all of these ideas in Python, and that is how I obtained the results. I did not add code because this is actually more of a 'philosophical' piece of research for me.


Isn't it possible to achieve a great compression ratio when you recognize grammar forms?

Statistical compression seems a little random and unpredictable.


It is possible and actively used, but my point in this article is more of a universal engine for big data and for databases that need speed, as the name Type-Based suggests. You cannot use a grammar algorithm on data that is only numbers.


I've always thought you could compress English text easily by taking the top X most common words/phrases and using rarely used ASCII letters/combinations to represent those words, then extending this concept to the remaining words using the most commonly used letter groups (e.g. -ight).

I've been way too busy (cough... lazy) to test this idea.


That's what Re-Pair does. It repeatedly replaces the most frequent pair of adjacent symbols with a new symbol until no pair occurs more than once.

See https://en.wikipedia.org/wiki/Grammar-based_code
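
A very small (greedy, unoptimized) Re-Pair sketch, just to show the shape of it:

  from collections import Counter

  def repair(seq):
      # Repeatedly replace the most frequent adjacent pair with a fresh symbol.
      seq, rules, next_sym = list(seq), {}, 256         # new symbols start above byte values
      while True:
          pairs = Counter(zip(seq, seq[1:]))
          if not pairs:
              break
          pair, count = pairs.most_common(1)[0]
          if count < 2:
              break
          rules[next_sym] = pair                        # the grammar rule: next_sym -> pair
          out, i = [], 0
          while i < len(seq):
              if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                  out.append(next_sym)
                  i += 2
              else:
                  out.append(seq[i])
                  i += 1
          seq, next_sym = out, next_sym + 1
      return seq, rules

  compressed, grammar = repair(b"abcabcabc")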


I had to do this by hand in the 1980s when I ran out of space in the ROM of the embedded controller I was working on for messages to be printed. There was no way to increase the size of ROM (2716 EPROM) at that stage in the development. The text was all 7 bit ASCII so I used the MSB as a flag to say that the remaining bits were a pointer into a table of frequently used fragments. This saved enough space to include the decompressor; it was recursive too so the fragments could also be compressed.
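
A sketch of what such a decompressor could look like (in Python rather than the original assembly; the fragment table here is made up, the real one lived in ROM):

  FRAGMENTS = [b"ERROR: ", b"\x80CHECK "]          # \x80 inside fragment 1 points back at fragment 0

  def expand(data: bytes) -> bytes:
      out = bytearray()
      for byte in data:
          if byte & 0x80:                           # MSB set: low 7 bits index the fragment table
              out += expand(FRAGMENTS[byte & 0x7F]) # recurse: fragments can themselves be compressed
          else:
              out.append(byte)                      # plain 7-bit ASCII
      return bytes(out)

  # expand(b"\x81SUM") == b"ERROR: CHECK SUM"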


This is how LZW and Huffman compression work. LZW finds repeated symbol groups and encodes them as new symbols, and Huffman compression codes frequent symbols with fewer bits and infrequent symbols with more bits.
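
For the LZW half, a minimal encoder sketch that builds its symbol table on the fly:

  def lzw_encode(data: bytes):
      table = {bytes([i]): i for i in range(256)}   # start with all single bytes
      w, out = b"", []
      for b in data:
          wb = w + bytes([b])
          if wb in table:
              w = wb                                # keep growing the current group
          else:
              out.append(table[w])                  # emit the longest known group
              table[wb] = len(table)                # and learn the new, longer one
              w = bytes([b])
      if w:
          out.append(table[w])
      return out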



