
> A German company holds a very large amount of profile data and wanted to search through it. On disk storage in the 100's of gigabytes. Smart encoding of the data and a clever search strategy allowed the identification of 'candidate' records for matches with fairly high accuracy, fetching the few records that matched and checking if they really were matches sped things up two orders of magnitude over their SQL based solution.

Right. But you've still not made a compelling argument that the "smart encoding of the data" and "clever search strategy" couldn't be done with/in SQL. I'm quite sure you could get an order of magnitude or two of improvement by dropping out of SQL -- but there's a big difference between using the schema you have and setting up a dedicated system.
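
Just to make the question concrete, here's the sort of thing I picture when I read "smart encoding" plus "candidate search" -- a toy Python sketch where the attributes, the bit layout and the verification step are all invented by me, not taken from the actual system:

    # Pack a few coarse profile attributes into one byte per record, scan the
    # packed array for candidate matches, then verify only the candidates
    # against the full records. (Hypothetical layout: 4 bits country id,
    # 3 bits age bucket, 1 bit gender.)

    def encode(country_id, age_bucket, gender):
        return ((country_id & 0x0F) << 4) | ((age_bucket & 0x07) << 1) | (gender & 0x01)

    def find_candidates(packed, query_byte, mask):
        # Cheap linear scan over the compact encoding.
        return [i for i, b in enumerate(packed) if (b & mask) == (query_byte & mask)]

    def search(packed, full_records, query_byte, mask, is_real_match):
        # Fetch and check only the few records that survive the prefilter.
        return [full_records[i] for i in find_candidates(packed, query_byte, mask)
                if is_real_match(full_records[i])]

    profiles = [(3, 2, 1), (3, 5, 0), (7, 2, 1)]    # (country, age bucket, gender)
    packed = bytes(encode(*p) for p in profiles)    # one byte per profile

    query = encode(3, 0, 1)                         # country 3, any age, gender 1
    mask = 0b11110001                               # ignore the age bits
    print(search(packed, profiles, query, mask,
                 lambda rec: rec[0] == 3 and rec[2] == 1))

The interesting part to me is that the prefilter is just a bitmask compare over a dense array, which looks like something you could also express against a plain integer column -- hence the question.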

I was perhaps not clear, but I was wondering whether it isn't often the case with real-world data that you could create a new, tailored schema in SQL and do the analysis on one system -- especially if you allow for something like advanced stored procedures (perhaps in C) and alternate storage engines.

I suppose one could argue about when something stops being SQL and starts becoming merely SQL-like. But the initial statement was that "For some problems SQL would already be way too much overhead." The implication being (as I understood it, at least) that you not only have to change how the data is stored and queried, but that SQL databases are unsuitable for the task -- and that by moving away from your dbs, you can suddenly do something that you couldn't do on a system with X resources.

I'm not saying that's wrong -- I'm just asking if a) that's what you mean, and b) if you can quantify this overhead a bit more. Is using bitfields or whatnot with postgres going to be a 20% overhead, or a 20000% overhead in your scenario? Because if it's 20%, that sounds like it can be "upgraded away".
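
For instance, the kind of measurement I mean would look roughly like this -- purely a sketch; the table name, column name and connection string are made up, and it assumes the packed attributes sit in an integer column:

    # Rough timing of the same bitmask candidate filter done inside postgres
    # versus over an in-memory copy of the packed column. Hypothetical schema:
    # profiles(id int, attrs int). Requires psycopg2 and a local test database.
    import time
    import psycopg2

    QUERY = 0b00110001
    MASK = 0b11110001

    conn = psycopg2.connect("dbname=profiles_test")
    cur = conn.cursor()

    t0 = time.perf_counter()
    cur.execute("SELECT id FROM profiles WHERE (attrs & %s) = %s",
                (MASK, QUERY & MASK))
    in_db = [row[0] for row in cur.fetchall()]
    t1 = time.perf_counter()

    cur.execute("SELECT id, attrs FROM profiles")   # load once, filter in Python
    rows = cur.fetchall()
    t2 = time.perf_counter()
    in_mem = [pid for pid, attrs in rows if (attrs & MASK) == (QUERY & MASK)]
    t3 = time.perf_counter()

    print(f"in-db filter : {t1 - t0:.3f}s, {len(in_db)} candidates")
    print(f"in-mem filter: {t3 - t2:.3f}s, {len(in_mem)} candidates (load excluded)")
    conn.close()

If the in-db version is within a small factor of the raw scan, that's the 20% case; if it isn't even close, that's the 20000% case.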

Tough crowd.


Yeah, sorry about that :-) I really do appreciate your input -- and the previously linked page-rank example shows how manual data layout can squeeze a problem down -- but I still wonder how many of those techniques could've been made to work within SQL, and what the resulting overhead would've been. Size on disk would probably have been a problem -- AFAIK postgres tends to compress data a bit, but probably nowhere near as much as manual rle+gzip. But I wonder how far one could've gotten by just hosting the db on a compressed filesystem...
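
The rle+gzip point at least is easy to play with -- something along these lines, on synthetic mostly-zero data with zlib standing in for gzip (real profile data would obviously behave differently):

    # How much does a manual run-length pass buy on top of a general-purpose
    # compressor for a given column? Synthetic data only -- the answer depends
    # entirely on what the real columns look like.
    import zlib

    def rle(data: bytes) -> bytes:
        # Encode as (run length, value) byte pairs, runs capped at 255.
        out = bytearray()
        i = 0
        while i < len(data):
            j = i
            while j < len(data) and j - i < 255 and data[j] == data[i]:
                j += 1
            out += bytes([j - i, data[i]])
            i = j
        return bytes(out)

    raw = bytes([0] * 990 + [1] * 10) * 1000      # ~1 MB, mostly-zero column
    print("raw bytes  :", len(raw))
    print("zlib only  :", len(zlib.compress(raw, 9)))
    print("rle + zlib :", len(zlib.compress(rle(raw), 9)))

And a compressed filesystem is, very roughly, the "zlib only" line applied per block underneath the db files, which is why I'm curious how big the gap to the hand-rolled layout really is in practice.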
