Hacker News new | past | comments | ask | show | jobs | submit login

This is classic case of "Algorithm/Problem Selection" if your algorithm/problem is tailored to a task such as PageRank, surely a single threaded highly optimized code will beat a cluster designed for ETL tasks. In real organizations where there are multiple workflows/algorithms, distributed systems always win out. Systems like Hadoop take care of Administration, Redundancy, Monitoring and Scheduling in a manner that a single machine cannot. Sure you can "Grep" faster on a laptop than AWS EMR with 4 Medium instances, but in reality where you have 12 types of jobs which are run by team of 6 people, you are much better off with a distributed system.



Ditto for computationally intensive work: if it is CPU dominated, more CPU's calculating in parallel will be of advantage, even if the data could fit some RAM.

There's no a single simple answer, but sure, whenever less computers are enough, less should be used.

The recent problem is, some people love "clouds" so much today that they push there the work that could really be done locally.


Part of the problem is that a lot of problems that are CPU dominated on a single system becomes IO dominated once you start distributing it at very low node counts without very careful attention to detail.


Yes, and starting from the distributed version first, one can even miss the fact that everything can be immensely faster once the communication is out of the way.

There's no a single simple answer, except: don't decide about the implementation first, instead competently and without prejudice evaluate your actual possibilities.

Also don't decide the language like "Python" or "Ruby" first. If you expect that the calculations are going to take a month, then code them in Ruby and wait the month for the result, you can miss the fact that you could have had the results in one day by just using another language, most probably without sacrifying the readability of the code much. Only if the task is really CPU-bound, of course.

On another side, if you have a ready solution in Python, and you'd need a month to develop the solution for other language, the first thing you have to consider is how often you plan to repeat the calculations afterwards. Etc.


I'd also toss in ease of correctness if you don't have a really well-known problem with a good test suite: I've seen a number of cases where people developed “tight” C code which gave results which appeared reasonable and then ran it for months before noticing logic errors which would either have been impossible or easier to find in a higher-level language or, particularly, one which had better numeric error handling characteristics.

There is a lot to be said for confirming that you have the right logic and then looking into something like PyPy/Cython/etc. or porting to a lower-level language.


> one which had better numeric error handling characteristics

Which language do you suggest which matches this?

I don't know many people who use the language "numeric error handling characteristic" but I know the benefit of already having good written high-level primitives (or libraries).


The first example which comes to mind are integer overflows, which are notoriously easy to miss in testing with small / insufficiently-representative data. Languages like Python which use a arbitrary-precision system will avoid incorrect results at the expense of performance and I believe a few other languages will raise an exception to force the developer to decide the correct way to handle them.

The other area I've seen this a lot is caused by the way floating point math works not matching non-specialists’ understanding – unexpectedly-failing equality checks, precision loss making the results dependent on the order of operations, etc. That's a harder problem: switching to a Decimal type can avoid the problem at the expense of performance but otherwise it mostly comes down to diligent testing and, hopefully, some sort of linting tool for the language you're using.


The entire reason for the popularity of distributed systems is because application developers in general are very bad at managing I/O load. Most developers only think of CPU/memory constraints, but usually not about disk I/O. There's nothing wrong with that; because if your services are stateless then the only I/O you should have is logging.

In a stateless microservice architecture, disk I/O is only an issue on your database servers. Which is why database servers are often still run on bare metal as it gives you better control over your disk I/O - which you can usually saturate anyway on a database server.

In most advanced organizations, those database servers are often managed by a specialized team. Application servers are CPU/memory bound and can be located pretty much anywhere and managed with a DevOps model. DBAs have to worry about many more things, and there is a deeper reliance on hardware as well. And it doesn't matter which database you use; NoSQL is equally as finicky as a few of my developers recently learned when they tried to deploy a large Couchbase cluster on SAN-backed virtual machines.


For web services, distributed usually makes sense, because the resource usage of a single connection is rarely high. It's largely uninteresting to consider huge machines for that use case for the reasons you outline.

But that's not really what the discussion is about. In the web services case, your dataset often/generally fits in memory because your data set is tiny. You don't need large servers for that, most of the time. Even most databases people have to work with are relatively small or easily sharded.

In the context of this discussion, consider that what matters is the size of the dataset for an individual "job". If you are processing many small jobs, then the memory size to consider is the memory size of an individual job, not the total memory required for all jobs you'd like to run in parallel. In that case many small servers is often cost effective.

If you are processing large jobs, on the other hand, you should seriously consider if there are data dependencies between different parts of your problem, in which case you very easily become I/O bound.


I own a ASUS gaming laptop...

It has two flaws: One, it has nVidia Optimus (that just suck, whoever implemented it should be shot).

Two, the I/O is not that good, even with a 7200 RPM disk, and Windows 8.1 make it much worse (windows for some reason keep running his anti-virus, superfetch, and other disk intensive stuff ALL THE TIME).

This is noticeable when playing emulated games: games that use emulator, even cartridge ones, need to read from the disc/cartridge on the real hardware, on the computer they need to read from disc quite frequently, the frequent slowndowns, specially during cutscenes or map changes is very noticeable.

The funny thing is: I had old computers, with poorer CPU (this one has a i7) that could run those games much better. It seems even hardware manufacturers forget I/O


I can bet in your case it's not the disk I/O actually, unless you have some very strange emulator the file access should still go through the OS disk cache. VMware for example surely benefits from it. How many GB do you have on the machine? How much is still left free when you run the emulator and the other software you need?


I noticed it as disk i/o because I would leave task manager running on second screen and every time the game lagged memory and CPU use were below 30% and disk was 100%, and if I left sorting by disk usage, the first places are the windows stuff, and after them, the emulator.



> There's no a single simple answer, but sure, whenever less computers are enough, less should be used.

This - if you're using this website as your sole calculation then something has gone seriously wrong.

It was more intended to provoke discussion and thinking around overengineering things which can easily be done with, say, awk or a few lines of <scripting language>.

If you have large CPU requirements then sure, use Hadoop/Spark/some other distributed architecture. If you have a >6TB dataset but you only need 1GB of it at a time, then, well, maybe you still only need one computer.


The problem is getting 100% of the work to be parallel. If I have to do that last 10% in one machine there isn't much point to more than 10 machines.


You are talking about https://en.wikipedia.org/wiki/Amdahl's_law , am I right?


In most cases it's probably better to have one machine each for different jobs, then all cpu on every machine can be used for the current job. There are probably very few jobs that take extreme amount of cpu without also using a lot of data which would have to be moved around a cluster anyway.


Cloud seems to have become a Wall Street flag on par with outsourcing. It is something a company does to grab market attention...


Some things take 1 second on 1 machine beefy machine or 6 hours on a cluster due to latency issues. And yes I do mean 20,000 times slower, though 2,000 is far more common due to latency inside a datacenter being around ~500,000 ns vs ~100 ns for main memory vs 0.5 nm from L1 cache.

PS: Not that most systems are built around these kinds of edge cases, but 'just use a cluster' is often not a good option unless each node is sufficiently beefy.




Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact

Search: