e.g. We could spend 3 weeks using feature vectors and backtesting against prior data to figure out what signals are sent by accounts that are likely to churn (so that we can proactively identify them and attempt to de-risk them, say by having the customer success team help them out directly)... or we could write a two-line if statement based on people's intuitive understanding of what unsuccessful accounts look like. (No login in 30 days and doesn't have $STICKY_FEATURE enabled? Churn risk!)
The chief benefit of the second approach is that it actually ships, 100% of the time, whereas with the first, priorities change, an emergency develops, and suddenly the 75%-complete analysis gets thrown in a git repo and forgotten about. Actually shipping is a very useful feature to have. (A sketch of such an if statement is below.)
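As a rough illustration of the second approach, here's what such a heuristic might look like (the 30-day window and the $STICKY_FEATURE flag are just the hypothetical signals from the example above, not anyone's real rule):

```java
import java.time.Duration;
import java.time.Instant;

public class ChurnHeuristic {
    // Flags an account as a churn risk using the intuitive rule described above:
    // no login in 30 days and the sticky feature never enabled.
    static boolean isChurnRisk(Instant lastLogin, boolean stickyFeatureEnabled) {
        boolean inactive = Duration.between(lastLogin, Instant.now()).toDays() >= 30;
        return inactive && !stickyFeatureEnabled;
    }
}
```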
What hit me was the "processors are so fast now that we can brute-force grep over 100GB in a second" idea.
We are entering a world where 20TB on a magnetic disk is viable, but extracting that data via random access could take months. So how we lay data out on disk will become vitally important to how we use it - not unlike the tape drives of the pre-1980s era, where rewinding to the front of the tape cost you several minutes of (expensive) wait time.
Imagine a scenario where these guys design how logs are streamed to disk to maximise streaming reads, then optimise reading that out to SSD, thence to RAM and L2 and so forth - all designed to drive text past a regex engine running bazillions of times a second.
Let's call it the New Brute Force: it's just as much effort to get out the door as elegant algorithms, but it is much, much, much simpler. And of course, it sells more hardware.
Expect to see Intel Inside wrapped around the New Brute Force any time soon :-)
Edit: And they used Java! I was expecting low-level C optimisations all over the place.
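For a sense of what "brute force in Java" can look like, here is a minimal, hypothetical sketch (not Scalyr's code): memory-map a file and linearly scan it for a byte pattern.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class BruteScan {
    // Count occurrences of `needle` in `file` with a plain linear scan over a memory-mapped view.
    static long countMatches(Path file, byte[] needle) throws IOException {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            // A single map() view is limited to 2GB; a real tool would map the file in chunks.
            long size = Math.min(ch.size(), Integer.MAX_VALUE);
            MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, size);
            long hits = 0;
            for (int i = 0; i + needle.length <= buf.limit(); i++) {
                int j = 0;
                while (j < needle.length && buf.get(i + j) == needle[j]) j++;
                if (j == needle.length) hits++;
            }
            return hits;
        }
    }
}
```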
Today, we're keeping everything on SSD -- twice, in fact, to maintain a hot spare for each server. SSD prices have fallen to the point where, even renting from Amazon, we can make the cost model work; and SSD is a great crutch as we tune our storage layer. But spinning disk will be a lot cheaper.
As for Java: we've been pretty surprised ourselves. Going in, we expected to need a lot more "native" (C) code. As it turned out, currently the only non-Java code we're using is the LZ4 library, and even there we only recently moved to the non-Java implementation. We do soon expect to move a few more bits of code to C, such as the core text-search loop. But based on the performance we've been getting in pure Java, we don't anticipate any need to move more than a fraction of 1% of our overall code base.
We do have some code that lives in a sort of middle ground -- Java code written like it was in C. Our database buffers live in a set of 2GB byte arrays, soon to be replaced by native memory blocks. We do our own (very simple) memory allocation within that, and use a sprinkling of Unsafe operations to access it. This gets us closer to C performance (not all the way there), without any of the hassles of bridging across languages. This part is still well under 1% of our overall codebase.
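For flavour, a hedged sketch of that "Java written like C" pattern (the class, sizes, and method names here are made up for illustration; our actual allocator is more involved): carve allocations out of a large byte[] with a bump pointer, and read fields with sun.misc.Unsafe instead of going through object graphs.

```java
import java.lang.reflect.Field;
import sun.misc.Unsafe;

public class ArenaSketch {
    private static final Unsafe UNSAFE = loadUnsafe();
    private static final long BYTE_ARRAY_BASE = Unsafe.ARRAY_BYTE_BASE_OFFSET;

    private final byte[] buffer = new byte[1 << 20]; // one small arena for illustration
    private int next = 0;                            // bump-pointer allocator

    // Reserve `size` bytes and return the allocation's offset within the arena
    // (no bounds check here, for brevity).
    int allocate(int size) {
        int offset = next;
        next += size;
        return offset;
    }

    // Read/write a 32-bit int at an arbitrary byte offset, bypassing per-object overhead.
    int readInt(int offset) {
        return UNSAFE.getInt(buffer, BYTE_ARRAY_BASE + offset);
    }

    void writeInt(int offset, int value) {
        UNSAFE.putInt(buffer, BYTE_ARRAY_BASE + offset, value);
    }

    private static Unsafe loadUnsafe() {
        try {
            Field f = Unsafe.class.getDeclaredField("theUnsafe");
            f.setAccessible(true);
            return (Unsafe) f.get(null);
        } catch (ReflectiveOperationException e) {
            throw new AssertionError(e);
        }
    }
}
```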
I am reminded of John Carmack's comment on Oculus - he was amazed that the hardware had the power but the headsets performed badly - then he looked at the code and saw seas of abstractions on abstractions.
Good luck stripping out the layers, guys!
HN discussion: https://news.ycombinator.com/item?id=5265513
Instead of thinking "too many abstractions are making this code slow, therefore let's get rid of the abstractions", I'd usually rather pick a better language where abstractions have little or no penalty (Haskell, Scheme, etc).
So the idea is simple (cookie sessions), but the different ways of implementing it can harbour complexity, errors and abstractions.
There is nothing stopping the same thing happening with Haskell - I can, I am sure, write terrible code even in the best languages (see my entire output for proof :-)
That's cute, but while it's impressive what it recognizes, it's generally still stupid, and it will get defeated by bad programmers. (Becoming good at Haskell means, amongst other things, learning what the compiler will screw up.)
Scheme, likewise, doesn't have free abstractions. Unless you mean macros, but those are not really free either imho.
There's one high-level language in wide use that has "free" abstractions, or at least, costs as low as possible, and that's C++.
Ironically, this may only be because grep uses very well-tuned and not immediately straightforward algorithms. See in particular http://lists.freebsd.org/pipermail/freebsd-current/2010-Augu...
Great quote from that referenced post.
Ah, here it is:
And duplicate submissions are listed in those postings.
And on using Java: it's actually pretty fast, and certainly fast enough for a prototype. Once you have the stack built, you can benchmark and optimize the bits that need to be optimized; usually this means C or handcrafted ASM.
A system's complexity is always bounded by the algorithm with the greatest complexity. Since streaming is linear, there will always be a value of n beyond which the computation dominates, even if its complexity is very nearly linear (even an O(n^1.00000001) computation will eventually be slower than the streaming for sufficiently large n).
I always think of big-O in terms of algorithm analysis, while real-world performance is more a function of operation costs/memory costs. Its use in the latter is more colloquial but I get what you mean now.
The reason I talk in cache lines is that it brings the real-world aspect into the analysis of the algorithm. A theoretical analysis, while valuable in some respects, is very far from real-world scenarios. I have done countless theoretical analyses of algorithms, found the actual (hidden) constants in the big-O, etc., but it just doesn't model the real world very well. This is exactly why you see many fast algorithms switch to a naive but cache-friendly algorithm once the input is sufficiently small (e.g. in a divide-and-conquer algorithm).
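A common, concrete instance of that switch (the cutoff of 32 below is an arbitrary illustrative value, not a tuned one): a merge sort that falls back to insertion sort on small subarrays, which is "naive" in big-O terms but scans a small, contiguous region of memory.

```java
import java.util.Arrays;

public class HybridSort {
    private static final int CUTOFF = 32; // below this size, the naive algorithm wins in practice

    // Sorts a[lo..hi) using merge sort, switching to insertion sort for small ranges.
    static void sort(int[] a, int lo, int hi) {
        if (hi - lo <= CUTOFF) {
            insertionSort(a, lo, hi);
            return;
        }
        int mid = lo + (hi - lo) / 2;
        sort(a, lo, mid);
        sort(a, mid, hi);
        merge(a, lo, mid, hi);
    }

    // O(n^2), but cache-friendly: it only touches a few adjacent cache lines.
    private static void insertionSort(int[] a, int lo, int hi) {
        for (int i = lo + 1; i < hi; i++) {
            int key = a[i];
            int j = i - 1;
            while (j >= lo && a[j] > key) {
                a[j + 1] = a[j];
                j--;
            }
            a[j + 1] = key;
        }
    }

    // Standard merge of a[lo..mid) and a[mid..hi), buffering only the left half.
    private static void merge(int[] a, int lo, int mid, int hi) {
        int[] left = Arrays.copyOfRange(a, lo, mid);
        int i = 0, j = mid, k = lo;
        while (i < left.length && j < hi) {
            a[k++] = (left[i] <= a[j]) ? left[i++] : a[j++];
        }
        while (i < left.length) a[k++] = left[i++];
    }
}
```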
I would think that anything more than constant-time overhead would not be compatible with a real-time streaming operation. Just a guess, though.
Thanks for mentioning http://en.wikipedia.org/wiki/Van_Emde_Boas_tree as I hadn't heard of that. The last paragraph on the page gives some nice pointers (use a trie or randomized hash table).
It reminds me a lot of the paper that Poul-Henning Kamp wrote about "1975 programming". Few of us really appreciate how to use modern hardware. https://www.varnish-cache.org/trac/wiki/ArchitectNotes
Just write high-quality code that will remain maintainable into the future. If the business feels that carries too much of a financial or time penalty, the business needs to find money to pay for more high-quality (slower) developers, or to find a different you.
Don't take someone else's problem / role on as your own to solve - it won't help you or them.
I would just like to throw in here: if you are going to be featurizing users to do churn prediction, I would highly recommend building a whole data pipeline and doing profit curves.
A data-driven approach is only as good as the algorithms used.
If you use profit curves to augment churn prediction (churn prediction is a prerequisite), you can figure out how much a user is actually worth bringing back. This also lets you figure out an optimal budget-to-profit ratio for the cost of retaining a user (see the sketch below).
That being said, use it only if you have enough data to be accurate. And also only use it for the right reasons. Done routinely, I believe it can be a great investment.
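For what it's worth, here's a minimal, hypothetical sketch of the profit-curve idea (the retention value, outreach cost, and save rate below are made-up parameters): sweep the churn-score threshold and, for each one, estimate the profit of intervening on everyone above it.

```java
public class ProfitCurve {
    // Hypothetical economics: value of a retained user, cost of one intervention,
    // and the fraction of true churners an intervention actually saves.
    static final double VALUE_OF_RETAINED_USER = 120.0;
    static final double COST_OF_INTERVENTION = 5.0;
    static final double SAVE_RATE = 0.3;

    // scores[i] is the model's churn probability for user i; churned[i] is the observed
    // outcome in a held-out period. Returns the estimated profit of intervening on every
    // user whose score is >= threshold.
    static double profitAt(double threshold, double[] scores, boolean[] churned) {
        double profit = 0.0;
        for (int i = 0; i < scores.length; i++) {
            if (scores[i] >= threshold) {
                profit -= COST_OF_INTERVENTION;
                if (churned[i]) {
                    profit += SAVE_RATE * VALUE_OF_RETAINED_USER;
                }
            }
        }
        return profit;
    }

    // Sweep thresholds to trace the profit curve; the peak tells you how many users
    // are actually worth trying to bring back, and at what budget.
    static void printCurve(double[] scores, boolean[] churned) {
        for (double t = 0.0; t <= 1.0; t += 0.05) {
            System.out.printf("threshold %.2f -> profit %.2f%n", t, profitAt(t, scores, churned));
        }
    }
}
```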
My approach was to, essentially, concatenate all files into a single blob. The search tool would then let the regex loose on the contents of that single file. I had a simple index on the side to map byte positions to filenames. To get a line number from the position I would simply count the number of newlines from the start of the file to the regex match position.
Add some smarts to avoid duplicating work, compression and multithreading and I had a tool that could find all lines that contained "int(8|16|32|64)_t" in the Linux kernel source in a third of a second. This was about 20 times faster than grep (with a hot file cache).
It was a simple solution that scaled well enough to handle the codebases I was working on at the time without problems. I later read Russ Cox's article on how Google code search worked. I recommend that article to anyone who's interested in running regular expressions on a large corpus.
Edit: The source is available on github. Tested on Linux and Windows.
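A minimal sketch of that concatenate-and-scan approach, assuming the corpus fits in memory (the class and method names here are made up, and the original tool also layered compression, multithreading, and work-avoidance on top of this):

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BlobSearch {
    private final StringBuilder blob = new StringBuilder();     // all files concatenated
    private final List<Integer> fileStarts = new ArrayList<>(); // offset where each file begins
    private final List<Path> fileNames = new ArrayList<>();

    void addFile(Path path) throws IOException {
        fileStarts.add(blob.length());
        fileNames.add(path);
        blob.append(new String(Files.readAllBytes(path), StandardCharsets.UTF_8));
    }

    // Let the regex loose on the whole blob, then map each match position back to
    // (file, line) using the side index and a newline count from the file's start.
    void search(String regex) {
        Matcher m = Pattern.compile(regex).matcher(blob);
        while (m.find()) {
            int pos = m.start();
            int file = findFile(pos);
            int line = 1;
            for (int i = fileStarts.get(file); i < pos; i++) {
                if (blob.charAt(i) == '\n') line++;
            }
            System.out.println(fileNames.get(file) + ":" + line + ": " + m.group());
        }
    }

    // Binary search for the last file whose start offset is <= pos.
    private int findFile(int pos) {
        int lo = 0, hi = fileStarts.size() - 1;
        while (lo < hi) {
            int mid = (lo + hi + 1) / 2;
            if (fileStarts.get(mid) <= pos) lo = mid; else hi = mid - 1;
        }
        return lo;
    }
}
```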
It's something hard disk drives were not optimized for.
You have to code for Windows just so. Dear POSIX people in general: there is more to cross-platform coding than being able to build it with Cygwin.
Every grep I've used on Windows handles wildcards, so it must be doing the expansion itself. The right way to do this sort of thing on Windows is to call FindFirstFile/FindNextFile, which is a cross between readdir and stat in that each call returns not only the name of one directory entry that matches the pattern (as per readdir+fnmatch) but also its metadata (as per stat). So, if you were going to call readdir (to get the names) and then stat each name (to get the metadata), you should really be doing all of it in just this one loop so that each bit of data is retrieved just the once.
But this is exactly what POSIX programmers would seemingly never, ever do. Probably because doing it that way would involve calling a Win32 function. Well, more fool them, because getting this data from FindFirstFile/FindNextFile seems pretty quick, and getting it any other way seems comparatively slow.
I cobbled together a native Win32 port of the_silver_searcher a while ago and it was about twice as fast as the stock EXE thanks to retrieving all the file data in one pass. In this case the readdir wrapper just needed to fill in the file type in the dirent struct; luckily it seems that some POSIX-style systems do this, so there was already code to handle this and it just needed the right #define adding. (I have absolutely no idea why MingW32 doesn't do this already.)
Prior to switching to the_silver_searcher, I used to use grep; the grep I usually used is the GNU-Win32 one (http://gnuwin32.sourceforge.net/packages/grep.htm), and it looks to call readdir to get the names, and then stat to get each name's details. I checked that just now, and after all those years I can finally imagine why it's so slow.
GNU-Win32 find is really slow, too.
But the original point is that OPEN is expensive, which it really is, especially over a network connection.
On a more serious note, there's a pretty large gulf between the Windows and Unix development worlds. Try getting a game developer to use curl ...
The source code is there too, but I wouldn't look too closely. "Cobbled together", like I said. Though I've been using it all the time, in conjunction with ag-mode - https://github.com/Wilfred/ag.el - and haven't noticed any obvious problems.
So in effect, these big data serialization systems are attempting to lift the file system into userland so they can make just that optimization.
Usually the solutions end up using techniques like principal component analysis to bit-pack each item in the dataset as small as possible, and then buying a bunch of Tesla GPUs with as much RAM as possible. The algorithm then becomes: load the entire dataset into GPU memory once, then brute-force linear-search the entire dataset for every query. The GPUs have enormous memory bandwidth and parallelism. With a bunch of them running at once, brute-forcing through 30 gigabytes of compressed data can be done in milliseconds.
Also really tiny loops; what slows them down is having to make lots of conditional branches/calls.
This related article could be interesting for those who are curious to know how fast plain C (not even inline Asm) can do string searches: http://www.codeproject.com/Articles/250566/Fastest-strstr-li...
I do wish Intel/AMD would make the REP SCASB/CMPSB instructions faster, since they could enable even faster and more efficient string searches limited only by memory bandwidth. (Or they could add a new instruction for memmem-like functionality, although I'd prefer the former.)
"I’d certainly seen this at Google, where they’re pretty good at that kind of thing. But at Scalyr, we settled on a much more brute-force approach: a linear scan through the log"
Few "fancy algorithms" books actually get into the low level stuff. But Knuth knows: modern computers are extremely fast at linear data, and accessing data linearly (as well as learning algorithms that access data linearly) is a good idea.
I'll bet that a B-tree approach would actually be fastest. B-trees are usually good at balancing "massive linear performance" with "theoretical asymptotic complexity". IIRC, there is some research in this area (look up cache-sensitive search trees).
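To make the trade-off concrete, here is a hypothetical, minimal two-level version of that idea over a sorted int[]: a small top index of block maxima (scanned linearly) plus a linear scan within one cache-line-sized block. A real cache-sensitive search tree generalizes this to more levels.

```java
public class BlockedSearch {
    private static final int BLOCK = 16; // 16 ints = 64 bytes, roughly one cache line
    private final int[] sorted;          // the full sorted data
    private final int[] blockMax;        // largest key in each block

    BlockedSearch(int[] sortedKeys) {
        this.sorted = sortedKeys;
        int blocks = (sortedKeys.length + BLOCK - 1) / BLOCK;
        this.blockMax = new int[blocks];
        for (int b = 0; b < blocks; b++) {
            int end = Math.min((b + 1) * BLOCK, sortedKeys.length) - 1;
            blockMax[b] = sortedKeys[end];
        }
    }

    // Returns true if key is present: one linear pass over the small top-level index
    // to pick a block, then one linear pass within that block.
    boolean contains(int key) {
        int b = 0;
        while (b < blockMax.length && blockMax[b] < key) b++;
        if (b == blockMax.length) return false;
        int start = b * BLOCK;
        int end = Math.min(start + BLOCK, sorted.length);
        for (int i = start; i < end; i++) {
            if (sorted[i] == key) return true;
        }
        return false;
    }
}
```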
"We use some special tricks for searches that are executed frequently, e.g. as part of a dashboard. (We’ll describe this in a future article.)"
"(You might wonder why we store log messages in this 4K-paged, metadata-and-text format, rather than working with raw log files directly. There are many reasons, which boil down to the fact that internally, the Scalyr log engine looks more like a distributed database than a file system. Text searches are often combined with database-style filters on parsed log fields; we may be searching many thousands of logs at once; and simple text files are not a good fit for our transactional, replicated, distributed data"
A better title would be, "How we determined when to use brute force."
Especially given the overhead of the network: for most apps, if the user notices anything other than network latency, you've screwed up.
KISUFI [kiss-you-fee]: Keep it simple, you fantastic individual!
I think it's more that memory is getting slower relative to CPU, as we all know, so complicated data structures that involve lots of random memory accesses aren't so efficient any more compared with linear scans through memory that can benefit from the cache. It's still engineering; just the trade-offs have shifted.
For our core experience, we went with a custom agent because that allowed us to viciously simplify the setup process. But we're very much open to working with other tools as well.
When did "simple" become such a dirty word? Simple isn't synonymous with lazy.
At that price, you can keep most of the data in memory or on SSD and not bother with fancy algorithms.
But what if the pricing were 10x lower?
Reading 300GB from an SSD is going to be way too slow - it will take minutes per search. In a simple test on Linode, the SSD seems to read at about 1GB/s, so a full scan of 300GB is roughly five minutes.
Their pricing seems like a bargain to me.
Commonly, the choice of approach is dictated by the number of end users querying the same data. If it's relatively specific queries run frequently by many users, the aggregation approach can be made snappier and can save lots of processing power.
But for power users, it becomes a nuisance real fast to be limited not only by stored base data, but also by the somewhat arbitrarily chosen aggregate structures. I'm assuming people doing log analyses of server logs fall within "power users" in most shops. At least I'd hope so :)
As an aside, the Go language (which I'm currently flip-flopping between loving and not so much), versus Python et al., seems to be born of somewhat similar thinking.
Current title on Hacker News is misleading as the “simple code” is not compared against any “fancy algorithms”. This is about not spending time devising fancy algorithms when the simple O(n) approach is good enough.
Submitters: It's against the guidelines to rewrite titles to put your own editorial spin on things. Please don't.
I've also implemented a "web search" that was just a complete scan through a text file. But this was back in 1997, and the amount of data we had didn't justify anything more complicated.
Where they do need to pass messages between threads, they use the famous LMAX Disruptor, which keeps data copying and blocking to a minimum.
The basic take-away is to always prefer std::vector and boost::flat_set/flat_map unless you have evidence to the contrary.
Secondly, you need to understand algorithm analysis in a little more detail. If you look at the run-time analysis of your brute-force algorithm, you can determine whether it grows linearly, logarithmically, exponentially, etc., and how to improve upon the core algorithm. Anyone can add more machines, more memory, or a faster processor, which will obviously help performance, but compared to the order of growth in n, all of that pales into the background.