
I'm surprised that anyone thinks it's OK when a program has trouble with a data set with over 100 small entries. Can you point to an example of that?

In some Sharepoint 2010 deployments I worked on, it was possible to create Sharepoint workflows that would bog down, fail to process new entries, etc. once the list of items grew beyond 100-150 entries.

Admittedly, this was probably related to misconfiguration and database issues (i.e. having zero oversight or administrative maintenance of the underlying MS SQL Server). That specific local minimum might not apply to the context of the article (optimization in code and systems design).

I've seen Hive take minutes (OMG!) to count a table with 5 rows ... but (other) people still think it's OK, because it scales well. Its latency is terrible for small data sets, but it can handle very large data sets.

It's true, the startup costs of a MapReduce job are immense. I'm surprised by minutes, but I'm not sure this counts, since there are, and always will be, different solutions and different tradeoffs for problems of different orders of magnitude. A solution built for massive scale is often considered cumbersome for a small-scale problem.
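The tradeoff is easy to see with a toy cost model: a fixed startup cost per job plus a tiny per-row cost. The numbers below are illustrative assumptions, not measurements of Hive or any real framework.

```python
# Toy cost model for a batch job: fixed startup overhead + per-row work.
# Both constants are assumptions for illustration only.
STARTUP_S = 30.0   # assumed fixed overhead per job (JVM spin-up, scheduling, etc.)
PER_ROW_S = 1e-6   # assumed per-row processing cost

def job_seconds(rows: int) -> float:
    """Total wall-clock time for one batch job under this model."""
    return STARTUP_S + rows * PER_ROW_S

# Counting 5 rows: the runtime is essentially all startup overhead.
small = job_seconds(5)
# Counting 10 billion rows: the same overhead is negligible.
large = job_seconds(10_000_000_000)

print(f"5 rows:     {small:,.1f}s ({STARTUP_S / small:.0%} overhead)")
print(f"1e10 rows:  {large:,.1f}s ({STARTUP_S / large:.2%} overhead)")
```

Under these (made-up) constants, the 5-row job spends effectively 100% of its time on overhead, while the 10-billion-row job spends well under 1% — which is why "minutes to count 5 rows" and "scales well" can both be true.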

For instance, I find test cases that spin up in-memory, in-process Spark extremely slow, yet that same spin-up cost is negligible in the context of a job that processes gigabytes of data per task.
