
Shoco: a fast compressor for short strings - multipass
http://ed-von-schleck.github.io/shoco/
======
dalke
I have a background project of exploring how to compress SMILES strings, which
is a notation for storing chemical information. For example, "C" is methane,
"CC" is ethane, "C=C" is ethene, "CCO" is ethyl alcohol, "C1CCCCC1" is
cyclohexane, and "c1ccccc1", which contains aromatic carbons, is benzene. The
average length of a SMILES string for real-world molecules is about 50
characters.

I previously evaluated a special purpose tool which identifies the best
n-grams and uses dynamic programming during encoding. That gets about 70%
compression on SMILES strings. I also tried the off-the-shelf femtozip which
got about 60% compression but had more decompression overhead than I like.
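
As a rough illustration of the "n-grams plus dynamic programming" idea (a generic sketch, not the actual tool described above): given a fixed dictionary of n-grams, dynamic programming can find the split of a string into the fewest tokens, where each token is either a dictionary n-gram or a single literal character. The function name `optimal_parse`, the example dictionary, and the cost model of one token per output symbol are all assumptions made for illustration.

```python
def optimal_parse(text, ngrams):
    """Split text into the fewest tokens.

    Each token is either an entry of `ngrams` (hypothetical dictionary)
    or a single literal character. Assumes every token costs the same.
    """
    max_len = max((len(g) for g in ngrams), default=1)
    n = len(text)
    # best[i] = minimum number of tokens needed to cover text[:i]
    # back[i] = start index of the last token in that optimal cover
    best = [0] + [n + 1] * n
    back = [0] * (n + 1)
    for end in range(1, n + 1):
        for length in range(1, min(max_len, end) + 1):
            start = end - length
            piece = text[start:end]
            # Single characters are always allowed as literals.
            if length == 1 or piece in ngrams:
                if best[start] + 1 < best[end]:
                    best[end] = best[start] + 1
                    back[end] = start
    # Walk the back-pointers to recover the tokens in order.
    tokens = []
    i = n
    while i > 0:
        tokens.append(text[back[i]:i])
        i = back[i]
    tokens.reverse()
    return tokens


if __name__ == "__main__":
    # Made-up dictionary; a real one would be chosen from training data.
    example_ngrams = {"CC(", "CC", "(=O)N"}
    print(optimal_parse("CC(=O)N", example_ngrams))
```

With this made-up dictionary, a greedy longest-match encoder would take "CC(" first and then fall back to four literals, while the dynamic-programming parse picks "CC" followed by "(=O)N". That is the kind of gain the extra encoding work buys.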

Shoco, trained on 1,455,763 SMILES strings (average of 56 letters each), and
tested with 100,000 strings from the training set, reports "average
compression ratio: 47%".

~~~
bmh100
Could you provide more information about your SMILES test? How many unique
symbols were there? How does gzip do? This is an interesting use case.

~~~
dalke
Sure. I'm switching this conversation to email though, using the gmail account
in your profile. Short version is, I trained it on the RDKit-generated SMILES
strings from ChEMBL-20. Three of the strings look like this:

    
    
        CC(C)=CCC/C(C)=C/C=C/C(=O)N1CCCC1
        CC(=O)NC(C(=O)N1CCSCC1)[C@H]1CC(C(=O)O)C[C@@H]1N=C(N)N
        O=C(CC(c1ccc(F)cc1)(c1ccc(F)cc1)c1ccc(F)cc1)N1C[C@H](O)C[C@H]1C(=O)N1CCC[C@@H]1C(=O)NC[C@@H]1CCCNC1
    

On the raw data set (one record per line), wc reports:

    
    
         1455763 1455763 82882385
    

while piping the same file through gzip -c | wc -c reports 18773892 bytes.
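
To put those counts in perspective, here is a small sketch that works out the average record length and the whole-file gzip ratio from the figures above. It then compresses the three example strings one at a time with zlib to show what per-string compression looks like. The helper names are hypothetical, and the sketch assumes each record ends in a single newline. The whole-file gzip number is not directly comparable to a per-string compressor such as shoco, because gzip can exploit redundancy across records.

```python
import zlib

def summarize(lines, total_bytes, gzip_bytes):
    """Derive averages and the gzip ratio from the wc and gzip counts."""
    avg_with_newline = total_bytes / lines
    return {
        # Assumes exactly one trailing "\n" per record.
        "avg_chars": avg_with_newline - 1,
        # Compressed size as a fraction of the original size.
        "gzip_ratio": gzip_bytes / total_bytes,
    }

def per_string_zlib_sizes(strings):
    """Return (raw_size, compressed_size) for each string compressed alone."""
    sizes = []
    for s in strings:
        raw = s.encode("ascii")
        sizes.append((len(raw), len(zlib.compress(raw, 9))))
    return sizes

if __name__ == "__main__":
    stats = summarize(lines=1455763, total_bytes=82882385, gzip_bytes=18773892)
    print("average characters per record: %.1f" % stats["avg_chars"])
    print("gzip size / original size: %.3f" % stats["gzip_ratio"])

    examples = [
        "CC(C)=CCC/C(C)=C/C=C/C(=O)N1CCCC1",
        "CC(=O)NC(C(=O)N1CCSCC1)[C@H]1CC(C(=O)O)C[C@@H]1N=C(N)N",
        "O=C(CC(c1ccc(F)cc1)(c1ccc(F)cc1)c1ccc(F)cc1)N1C[C@H](O)C[C@H]1"
        "C(=O)N1CCC[C@@H]1C(=O)NC[C@@H]1CCCNC1",
    ]
    for raw, packed in per_string_zlib_sizes(examples):
        print("raw %d bytes -> zlib %d bytes" % (raw, packed))
```

The whole-file figure comes to a little under 23% of the original size. Compressing each record separately pays zlib's fixed header and checksum overhead on every string, which is the situation short-string compressors are designed for.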

~~~
TheLoneWolfling
> I'm switching this conversation to email though

I wish you wouldn't do that. That defeats the entire point of a website such
as this. Just because you don't think that this is interesting to random
people doesn't mean that random people don't think this is interesting.

~~~
dalke
The email I sent was 178K long, with 1000 real-world examples (to get an idea
of the character distribution), and the .h file model generated by shoco on
the entire data set.

Assuming that bmh100 is both interested in working on this and doesn't have
the domain knowledge, I gave a synopsis of the SMILES notation, its use as a
molecule identifier, a way to reproduce my data set, and a couple of possible
alternatives for getting something similar. (Each method requiring less domain
knowledge and more CS experience.)

Since this is a big chunk to chew on, and this is the weekend, I figure it will
take a few days to digest and be able to respond. Since HN doesn't have
notifications, how long should I actively check this thread for replies?

By sending email, I also invite a response after a couple of months, should
that be the case. (I yesterday got a followup on a topic that was 4 years
old.) So no, supporting these long-term research exchanges is not one of the
main goals of HN.

You'll note that I also answered what bmh100 asked for here. If you find it
interesting, then feel free to ask interesting questions.

~~~
TheLoneWolfling
If it's not confidential (and I am assuming it isn't), why not just link it in
a gist or something? That way other people can also take a crack at it.

Among other things, "a synopsis of the SMILES notation, its use as a molecule
identifier, a way to reproduce my data set, and a couple of possible
alternatives for getting something similar" is something I would be interested
in. And, considering the upvotes I got for my grandparent comment, something
that other people would be interested in as well.

Also: [http://hnnotify.com/](http://hnnotify.com/)

~~~
dalke
I do not like "a gist or something" and have used such services only a handful
of times. I dislike how they decontextualize the conversation and how they
require trust in an additional resource. Eg, when I come across a gist during
a web search, it's hard to figure out the point.

By comparison, an email provides the full context, and is easier to integrate
into a workflow. For example, I can drag an attachment directly into my
editor. A gist requires additional steps.

Regarding hnnotify.com, I enjoy the ability to let go of most HN threads after
a couple of days. This thread is one of a handful of exceptions. Can I really
subscribe to one-and-only-one thread? I don't see that it's worthwhile to
set up a third-party account and activate the service for a rare event. In any
case, if it takes a month for bmh100 to evaluate the code then the HN thread
will be closed, so there's only a narrow window for which this service is
useful.

I do not share your optimism in the random contributions of others. To start,
it's not like I haven't talked about this before. See
[https://bitbucket.org/dalke/smilez](https://bitbucket.org/dalke/smilez) and
[http://www.dalkescientific.com/writings/diary/archive/2007/0...](http://www.dalkescientific.com/writings/diary/archive/2007/06/25/smiles_states.html)
(under "Compressing SMILES") for two examples. Have I gotten _any_ feedback
about them? No. So why put more effort into hoping for a one-in-a-million
event, which is what you suggest, instead of optimizing the chance of getting
a followup from someone who specifically expressed interest? Experience says
that I should optimize for the latter.

What is your interest in the SMILES notation that can't be resolved through
[https://en.wikipedia.org/wiki/Simplified_molecular-input_lin...](https://en.wikipedia.org/wiki/Simplified_molecular-input_line-entry_system)?
I would be glad to tell you more. I have worked with different
aspects of SMILES for over 15 years and co-authored the OpenSMILES
specification. I have also written many blog posts about different aspects of
how to work with SMILES. And gotten few followups.

What skill set do you have, that I might tailor a response? Are you
comfortable installing from source, do you prefer one of the GNU/Linux
packaging systems, or Mac/homebrew? Or are you happiest with extracting data
from a database dump? My 'synopsis .. of possible alternatives' was more an
offer to follow up on any of those options, but was of itself incomplete. It
works because email has the implied statement that I will respond to further
questions.

If you don't have a specific interest, and more generically want to be
informed, then perhaps you can understand why I would prefer to use other
mechanisms, like my blog posts, which are more likely to get the kinds of
responses I'm looking for than spending time tuning an off-topic HN comment.

~~~
TheLoneWolfling
I guess the difference is:

I consider the possibility of a random person coming across something and
finding it interesting a worthy goal in and of itself. You do not.

This is a rather fundamental difference, and as such I do not think that
anything I say will reconcile the matter.

~~~
dalke
My analysis is that there are two classes of random people, while your
analysis has only one. Class 1 is "random person coming across a page" and
class 2 is "random person who came across a page _and expressed interest in
possibly working on the problem_." Both can become a member of the desired
goal, that is, someone who contributes concrete help.

Experience says that both categories are low. Perhaps there's a 1:10,000
chance for a member of class 1, and a 1:500 chance for a member in class 2.

If I do as you suggest, I might raise that to 10:10,000 and 10:500.

However, my belief is that directed email has a higher stickiness, because of
the reasons I mentioned earlier. I believe those statistics become 1:10,000
(ie, unchanged) and 15:500, respectively.

If you work the math out, you'll see that it's overall better to send the
directed email.

Another option is to do both, which you'll see is what I did _for the question
that was asked_. Your complaint is that I shouldn't have sent additional
information in private mail, which is odd given that HN's own guidelines
suggest that there are HN-related questions that are inappropriate to post and
should instead be done by email.

You have also stated that I do not "consider the possibility of a random
person coming across something and finding it interesting". This simply isn't
true, as you can tell from the analysis above, and from the two pages I linked
to, where I have posted information meant for random strangers to
hopefully identify.

You've come across like you are irritated for being left out of the
conversation. I've suggested a few topics I could discuss, but how can I say
more when you haven't expressed any specific interest about the problem
(either on SMILES or short word compression)? When writing, it's good to have
a target audience in mind. Should I assume a basic understanding of arithmetic
compression, or start from the basics? Do I need to explain state machines?
And so on.

My above analysis left out the work factor. Rather than write 40 different
essays, each aimed at a different set of strangers (chemist background, CS
background, math background, web dev background, etc.), and with at best a
1:100 chance of success, it's a better use of my time to just work on the
code. I believe I could do what I want in about 2 months.

------
knodi123
Look how well it can compress "fofofofofofofofofofofo".

50%

Look how well it can compress "ababababababababababab".

0%

------
rurban
Will test against smaz for our internal JSON compressed protocol. smaz
compressed fine but was too slow. The ability to train the model sounds
convincing.

------
Khao
I get negative compression percentage when I put words with "é" in the test
box.

~~~
jozan
By default it doesn't work well with non-ASCII characters.

[https://ed-von-schleck.github.io/shoco/#how-it-works](https://ed-von-
schleck.github.io/shoco/#how-it-works)

~~~
Semiapies
Between this (an ASCII-only compressor in 2015?) and the other aspects brought
up here, it seems downright toylike.

------
techwizrd
I wonder what'd happen if you used this on base64 strings.

~~~
bmh100
I would love to see a blog post about that test, if you're willing.

------
thrownaway2424
I can't tell you how many times I've said to myself "if only these very short
ASCII strings were even shorter!"

~~~
BrandonSmith
At scale, and if you are paying for transmission costs, it can have a massive
impact.

