websiteguy's comments

websiteguy · on July 28, 2012

Maybe I am missing something here, but The python version is: O(N) sequential disk access O(NlogN) cpu O(N) RAM

The shell script is the same for disk and cpu, however, it is O(1) of RAM, and therefor can operate on input of size limited to disk, not RAM

pdonis · on July 28, 2012

You're correct, the Python version is O(N) instead of O(1) in RAM usage for two reasons, one easily fixed and one not. Sorry if I'm belaboring the point, but I think it's worth going into a little more detail.

The easily fixable issue is that the shell pipeline buffers reads, whereas my Python version just uses sys.stdin.read() for simplicity. I could have buffered the reads by using Python's generators/iterators instead. However, that alone isn't enough to get O(1) RAM usage.

The not so easily fixable issue is that the Unix sort command uses temporary files in order to not have to have all of its data in memory at once. See, for example, here:

http://vkundeti.blogspot.com/2008/03/tech-algorithmic-detail...

Python's sorted builtin doesn't work like this; it can take a generator as input, but it returns a list. This is one respect in which Python's built-in functionality lacks a feature that the Unix utilities have. I don't know if anyone has tried to re-implement the Python sorted function to use temporary files and return a generator to reduce memory usage.

[Edit: the Python version would also have to re-implement the uniq function to take a generator as input and return a generator, which would, I believe, also require temporary files.]

websiteguy · on July 28, 2012

Unix sort is a merge sort, using fixed RAM uniq -c is line by line sed is line by line

A superior solution in every way

If you were going to do this in Python (or another similar language), you need to write a sort function that operates on a file, not a Python collection, as the collection is always bound by RAM, which would at best be re-implementing the sort command. Once you can sort a file, everything else is trivial, as counts can be done line by line, or more siply, via uniq -c.

Of course, if you only care about things that fit in memory, you can do it in Python, but it is still far easier to use command line for these type of problems.

pdonis · on July 29, 2012

> Of course, if you only care about things that fit in memory, you can do it in Python, but it is still far easier to use command line for these type of problems.

I agree; I was not trying to claim that my Python solution should be used in preference to the shell pipeline solution in any kind of "production" environment. As I noted in another comment, Knuth's Pascal solution appears to be open to the same criticism.

pdonis · on July 28, 2012

Looking at the original paper that was linked to elsewhere in the comments here, it appears that Knuth's original Pascal program is, like my Python version, O(N) in RAM. It keeps everything in memory and uses no temporary files, as far as I can tell.

websiteguy · on June 27, 2011

Hi jokull,

Would like to understand more, is there a way to contact you ?

Andy

jokull · on June 27, 2011

jokull@solberg.is

websiteguy · on June 27, 2011

(disclosure, I am the author of this post)

Hi hn_decay,

Our use case goes like this: a member database of over a 100M records, and a number of content databases each with tens or hundreds of millions of records (reviews, video, lists, wiki, this, that , the other thing, one content table with over a billion records, where all the content records had a member id. Our primary usage pattern is to grab a set of content records (say 10-200 at a time) with their member information.

Putting everything in one database and doing the join there does not scale for us, and severely reduces flexibility. We would need to continue to scale up our hardware to handle the sum of the content sets, and new content sets are being created on a regular basis. By putting these all into different databases you then have the choice (not the necessity) of keeping them on one or more machines. You can put on one machine a bunch of content sets that are relatively small, and put the big ones on their own machine. You can also scale the hardware to individual content sets - infrequently accessed content sets do not have to be on powerful machines, very frequently accessed sets can be scaled on bigger machines.

There are downsides, the two-query hit being the least significant, the extra query on a tuned database is on order of 1ms. Even if the hit was larger, I would still live it, scalability != performance

Andy

hn_decay · on June 27, 2011

I was replying to the context of the post, and specifically spoke to horizontal scalability so I am confused that you felt it appropriate to "correct" that.

Having said that, hundreds of millions of records equals a small dataset. I still don't understand when that's held as some sort of edge case when it's easily accommodated on commodity low-end hardware.

websiteguy · on June 27, 2011

(disclosure - I am the author of this article)

"At the end of the day hire good ppl. They'll do the right thing most of the time." - absolutely agree.

No system is perfect, ours is not. You need to decide what is important to your organization and emphasize it - they way we do things here at TA has its downsides, but overall, it works very well for us and what we want to do. The benefits for us significantly outweigh the drawbacks.

It is critical to have a boss who understands how software development works, and that mistakes get made. If anything, my boss wants my team to make more mistakes than we do :-)

websiteguy · on June 27, 2011

(disclosure - I am the author of this article) I think this comes down to judgment.

Design appropriately. Short sighted designs are as bad as ivory tower designs. Every situation is different.

If a person does not own something, no one does - everything comes down to the individual. This does not mean that there is no overlap in knowledge.

Source control is a must. So are scripts that detect changes in your area. So are tests.

websiteguy · on June 27, 2011

(disclosure - I am the author of this article) I think it is a matter of judgment. Bugs that bring the site down,I agree. Minor bugs that most people will not notice on a consumer site where we can quickly patch them, I would rather get more projects out. Where you land in between these positions depends on a lot of things.

websiteguy · on June 27, 2011

(disclosure - I am the author of this article) I am not sure about "faster", but it is more scalable and more flexible - especially considering that the datasets to be "joined" are small, page-sized datasets. We have a lot of different content types, with a central member database. Not having to have all of our content on one machine and not having the memory demand of doing joins allows us to scale more easily. This is not to say we "never" do joins (our member database has multiple tables), it is a matter of find the most appropriate modularity.

websiteguy · on June 3, 2011

@latch - Yes, we saw it - would be a fun project, but we are software hackers :-) There are also concerns about heat, vibration, hot swapping, etc. I would happily spend $78K for that same storage instead of $7.8K if it was reliable and was shipped as a complete package.

Closest company doing something like this is CORAID, which might be an option, but they have a number of proprietary interfaces/systems to buy in to.

staunch · on June 4, 2011

(You don't need to write @latch here. You can reply directly to a thread)

websiteguy · on June 3, 2011

For a web startup - large monitors, powerful dev machines, extra servers for testing and running stuff, free drinks and free food

websiteguy · on June 3, 2011

Thanks for the replies -

Need disk storage, would like to use 7200rpm for real-time access to logs, 15k for database/email apps

Would be a lot of fun to roll our own, but we are resource constrained, and while money is not an issue for us, I don't like complicated/expensive solutions, I like simple/inexpensive

Looking at NetApp for example, but having a hard time accepting the price point.