By the way, here’s an anecdote for the flip side: at one of my internships I was working on a tool to process large log files, and by careful application of Aho-Corasick I was able to make it about 50 times faster on the dataset we were dealing with, which changed using the tool from “let’s go grab lunch while this finishes” to “let’s stream the logs through this live”. Sometimes you do know how to make things faster: just make sure to measure before you change anything, and asking why something is a certain way is always worth doing before proclaiming it broken.
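For readers who haven’t used it, here’s a minimal sketch of multi-pattern matching with Rust’s aho-corasick crate (assuming its 1.x API; the patterns and log line are made up, and this is not the internship tool): the automaton is built once, then each line is scanned in a single pass instead of running one search per pattern.

    use aho_corasick::AhoCorasick;

    fn main() {
        // Hypothetical patterns; the real tool's came from the logs being analyzed.
        let patterns = ["ERROR", "WARN", "connection reset"];
        // Build the automaton once (aho-corasick 1.x returns a Result here).
        let ac = AhoCorasick::new(patterns).expect("failed to build automaton");

        let log_line = "2023-04-01T12:00:00Z WARN connection reset by peer";
        // A single pass over the line reports every pattern occurrence.
        for m in ac.find_iter(log_line) {
            println!("matched {:?} at {}..{}", &log_line[m.range()], m.start(), m.end());
        }
    }

Because the scan cost stays roughly linear in the input no matter how many patterns you add, this is the kind of structure that can turn a “grab lunch” batch job into something you can run on a live stream.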



That's a great example; the most important part of your anecdote is the end, which answers the key question: what was the user impact? There is no a priori reason to believe that a 50x speedup is actually a win; taking an algorithm from 100K nanoseconds to 2K nanoseconds when hitting the file system takes billions of nanoseconds is not a win for the user, and taking an algorithm from 5000 years down to 100 years probably isn't one either.

But a 50x win that, as you note, goes from "let's have lunch" to "let's change this one thing ten times and re-do the analysis to see which gets us the best results" is a huge game changer; it's not just that it saves time, it's that it makes possible new ways to work with data.


Another anecdote: a few months ago, I was adding a bunch of features to an open source library. Before making the change, I studied the library's code to figure out where best to put it. During that study, I found a place where the code performed a slow operation, but I didn't attempt to change it because it was out of scope for my change and I didn't want to introduce bugs [1].

A little while after I made the change, a user filed a bug [2] against the library about slow behavior when a large amount of data was passed to it. They were using the public release instead of git master, so thankfully my code wasn't at fault. Thanks to the knowledge I had gained while making my change, I could quickly confirm that this slow part was precisely the cause of the slowdown, and I was able to file a PR reducing the library's complexity from quadratic to linear [3].

It wasn't very complicated from an algorithmic point of view; I only had to create a lookup structure. But the lookup structure had to be built in a certain way so that the library still behaved exactly as the test suite expected. It also had to support duplicates, since the original implementation supported them too, and in fact as a very important feature (TOML arrays).
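For illustration only, here's a minimal sketch of that general idea (not the actual toml-rs patch): replace the repeated linear scan with a map from key to the indices of every entry using that key, so lookups are no longer linear while duplicates and insertion order are preserved.

    use std::collections::HashMap;

    struct Entries {
        items: Vec<(String, String)>,       // entries in insertion order
        index: HashMap<String, Vec<usize>>, // key -> indices of all entries with that key
    }

    impl Entries {
        fn new() -> Self {
            Entries { items: Vec::new(), index: HashMap::new() }
        }

        fn push(&mut self, key: &str, value: &str) {
            let i = self.items.len();
            self.items.push((key.to_string(), value.to_string()));
            // O(1) amortized instead of scanning all existing entries.
            self.index.entry(key.to_string()).or_default().push(i);
        }

        // All values stored under `key`, in insertion order, duplicates included.
        fn get_all(&self, key: &str) -> Vec<&str> {
            self.index
                .get(key)
                .map(|ids| ids.iter().map(|&i| self.items[i].1.as_str()).collect())
                .unwrap_or_default()
        }
    }

    fn main() {
        let mut e = Entries::new();
        e.push("fruit", "apple");
        e.push("fruit", "pear"); // duplicate key, like repeated entries feeding a TOML array
        e.push("veg", "carrot");
        assert_eq!(e.get_all("fruit"), vec!["apple", "pear"]);
    }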

[1]: https://github.com/alexcrichton/toml-rs/pull/333

[2]: https://github.com/alexcrichton/toml-rs/issues/342

[3]: https://github.com/alexcrichton/toml-rs/pull/349


The open source intrusion detection system Suricata [1] used Aho-Corasick until Intel released Hyperscan [2] as open source. Hyperscan is apparently more performant than Aho-Corasick. If your language can call into C libraries, have you considered trying Hyperscan to see how it compares?

[1] https://suricata-ids.org/ [2] https://www.hyperscan.io/


Based on their use of past tense, I’d assume they aren’t interning there anymore.


I am not, and the company is known for having a fairly strong case of not-invented-here syndrome.


For interest's sake, did you try simply using a decent regex engine as an alternative? Any DFA-based regex engine effectively gives you Aho-Corasick for free when you feed it an alternation of literal patterns.
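For comparison, here's a small sketch of that suggestion using Rust's regex crate (the patterns are hypothetical): an alternation of escaped literals gives you one-pass multi-pattern search much like Aho-Corasick does, though match-priority semantics can differ between engines.

    use regex::Regex;

    fn main() {
        // Hypothetical patterns, same as a hand-rolled Aho-Corasick setup would use.
        let patterns = ["ERROR", "WARN", "connection reset"];
        // Escape each literal and join them into a single alternation.
        let alternation = patterns
            .iter()
            .map(|&p| regex::escape(p))
            .collect::<Vec<_>>()
            .join("|");
        let re = Regex::new(&alternation).expect("invalid pattern");

        let log_line = "2023-04-01T12:00:00Z WARN connection reset by peer";
        // The compiled engine reports all non-overlapping matches in one pass.
        for m in re.find_iter(log_line) {
            println!("matched {:?} at {}..{}", m.as_str(), m.start(), m.end());
        }
    }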


I think the engineer before me put a bit of effort into this, but it didn’t work out all that well in this case. This isn’t surprising considering that the standard regex tool on the platform wasn’t known for being particularly performant.


But sometimes you get a supervisor who, no matter what you show them, won't let you change it because they don't want to learn something new...



