
Insidious Optimizations II: Machine Text - verisimilitudes
http://verisimilitudes.net/2018-06-06
======
throwaway5153
Your blog article doesn't contain any code or empirical data so you can expect
it to be rightfully ignored.

Here are some of the elements that would make it more compelling.

* Provide a reference implementation in a widely-used language. I would pick C or Rust. The code should be under a widely used open source license, have high test coverage, and be fuzzed.

* Write a specification and conformance tests. What happens when a word is not in the dictionary or the dictionary is corrupt?

* Provide size and performance benchmarks of real-world datasets. You assert that English words can be encoded compactly so try encoding Wikipedia, Project Gutenberg books, or the text files in the Calgary Corpus.

* Compare your proposal against state of the art compression algorithms like PPM or LZMA.

* Fork a text editor application and change it to use your reference implementation. What are the effects on editing latency and RAM usage?

* Shared dictionaries were tried in the 1980s (e.g. Spellswell Word Services) but failed in the marketplace. What's different now?

~~~
verisimilitudes
>Your blog article doesn't contain any code or empirical data so you can
expect it to be rightfully ignored.

That doesn't usually seem to stop them.

>Provide a reference implementation in a widely-used language. I would pick C
or Rust. The code should be under a widely used open source license, have high
test coverage, and be fuzzed.

I suppose I could go ahead and start writing an initial implementation in
Common Lisp; it would likely be under a new article, though. I use the AGPLv3,
don't need tests, and don't need fuzzing.
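
Pending that Common Lisp implementation, the core idea can be roughed out. The following Python sketch is purely illustrative: the dictionary contents, the two-byte code width, and every name in it are assumptions for the example, not the actual Machine Text design. It just shows words packed as fixed-width indices into a shared, sorted dictionary.

```python
# Illustrative sketch only: encode a sequence of known words as
# unsigned 16-bit indices into a shared, sorted dictionary.
import struct

DICTIONARY = sorted(["a", "compactly", "encoded", "English", "words", "be", "can"])
INDEX = {word: i for i, word in enumerate(DICTIONARY)}

def encode(words):
    """Pack each word as a big-endian 16-bit dictionary index."""
    return b"".join(struct.pack(">H", INDEX[w]) for w in words)

def decode(blob):
    """Recover the word sequence from the packed indices."""
    return [DICTIONARY[i] for (i,) in struct.iter_unpack(">H", blob)]

msg = ["English", "words", "can", "be", "encoded", "compactly"]
blob = encode(msg)
assert decode(blob) == msg
assert len(blob) == 2 * len(msg)  # exactly two bytes per word
```

A real implementation would need a much larger dictionary and a policy for unknown words, which is precisely what the specification question above is about.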

>Provide size and performance benchmarks of real-world datasets. You assert
that English words can be encoded compactly so try encoding Wikipedia, Project
Gutenberg books, or the text files in the Calgary Corpus.

I prove they can be encoded compactly, but I suppose a more concrete example
would help, sure.
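
As a back-of-envelope version of that concrete example (the dictionary size and average word length below are assumed figures for illustration, not measurements from the article):

```python
# Assumed figures: a shared dictionary of 2^17 words, and an average
# English word of about five letters plus a following space in ASCII.
import math

dict_size = 2 ** 17                                # 131,072 words (assumption)
bits_per_word = math.ceil(math.log2(dict_size))    # 17 bits to name any word
ascii_bits = 8 * (5 + 1)                           # 48 bits in plain ASCII

print(bits_per_word, ascii_bits)                   # 17 vs 48 bits per word
print(round(ascii_bits / bits_per_word, 2))        # roughly 2.8x smaller
```

This says nothing about real corpora, which is why encoding something like the Calgary Corpus, as suggested, would be more convincing.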

>Compare your proposal against state of the art compression algorithms like
PPM or LZMA.

I can compare file sizes.
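
Such a file-size comparison could be scripted along these lines; the sample text here is a repetitive stand-in, so the numbers it produces mean nothing about real corpora, and a fair test would use the actual corpus files:

```python
# Compress the same bytes with several general-purpose compressors
# from the standard library and record the resulting sizes.
import bz2
import lzma
import zlib

text = ("the quick brown fox jumps over the lazy dog " * 200).encode()

sizes = {
    "raw":  len(text),
    "zlib": len(zlib.compress(text, 9)),
    "lzma": len(lzma.compress(text)),
    "bz2":  len(bz2.compress(text)),
}
for name, size in sorted(sizes.items(), key=lambda kv: kv[1]):
    print(f"{name}: {size} bytes")
```

The Machine Text output size would then be listed alongside these for the same input.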

>Fork a text editor application and change it to use your reference
implementation. What are the effects on editing latency and RAM usage?

This would be a bit much for me at this stage, and editing text as opposed to
reading it is still under consideration; it would likely be implemented with
characters and a simple notice when invalid words are created, though the
details may vary.
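
That notice mechanism might look something like the following sketch; the toy dictionary and every function name here are hypothetical, not part of any existing implementation:

```python
# Hypothetical sketch: as text is edited as ordinary characters, flag
# any completed word that is not in the shared dictionary rather than
# rejecting the edit.
DICTIONARY = {"reading", "text", "is", "under", "consideration"}

def check_words(line):
    """Return a notice for each word not found in the dictionary."""
    return [
        f"notice: '{word}' is not a dictionary word"
        for word in line.split()
        if word not in DICTIONARY
    ]

for notice in check_words("reading text is under considerashun"):
    print(notice)
```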

>Shared dictionaries were tried in the 1980s (e.g. Spellswell Word Services)
but failed in the marketplace. What's different now?

I can't find any information on this, but if it was being sold then I take it
that it was proprietary, which would help explain it. Anyway, Microsoft
products are dominant, which is evidence that the ''Free Market'' isn't a
judge of quality.

In any case, it's nice that something I submitted received any attention
whatsoever. Usually, my work goes entirely ignored, but mine is hardly the
only good work that Hacker News doesn't like or understand and so ignores.

