
Command-line Tools can be 235x Faster than a Hadoop Cluster (2014) - hd4
https://adamdrake.com/command-line-tools-can-be-235x-faster-than-your-hadoop-cluster.html
======
tluyben2
I have rewritten incredibly overarchitected stuff (Cassandra, Hadoop, Kafka,
Node, Mongo etc. with a plethora of ‘the latest cool programming languages’,
running on big clusters on Amazon and Google) into simple, but not sexy, C#
with MySQL or PgSQL. Despite people commenting on the inefficiency of ORMs and
the unscalable nature of the solutions I picked, they easily outperform the
originals in every way for the real-world cases these systems were used for.
Meaning: far simpler architecture, code that is easier to read, and far better
performance in both latency and throughput, even for workloads that will
probably never happen. Also: one language, fewer engineers needed, less
maintenance and easily swappable databases. I understand that all that other
tech is in fact ‘learning new stuff’ for RDD (resume-driven development), but
it was costing these companies a lot of money with very little benefit. If I
need something for very high traffic and huge data, I still do not know if I
would opt for Cassandra or Hadoop; even with a proper setup, sure they scale,
but at what cost? I had far better results with kdb+, which requires very
little setup and carries very minimal overhead if you do it correctly. Then
again, we will never have to mine petabytes, so maybe the use case works
there: I would love to hear from people who have tried the different solutions
objectively.

~~~
nikanj
A lot of architecture is designed from the "what would enable me to job-hop to
the next position on the seniority pole" vantage point.

~~~
davio
We call that "LinkedIn Driven Development" (LIDD)

~~~
cfontes
Just perfect!

------
3pt14159
I once converted a simulation from plain old Python to Cython.

Because it fit in the CPU cache, the speedup was around 10000x on a single
machine (numerical simulations, amirite?).

Because it was so much faster, all the code required to split it up between a
bunch of servers in a map-reduce job could be deleted, since it only needed a
couple of cores on a single machine for a millisecond or three.

Because it wasn't a map-reduce job, I could take it out of the worker queue
and just handle it on the fly during the web request.

Sometimes it's worth it to just step back and experiment a bit.

~~~
vvanders
Yeah, back when I was in gamedev land and multi-core CPUs started coming on
the scene, it was "Multithread ALL THE THINGS". Shortly thereafter people
realized how nasty cache invalidation is when two cores are contending over
one line. So you can have the same issue show up even in a single-machine
scenario.

Good understanding of data access patterns and the right algorithm go a long
way in both spaces as well.

~~~
notacoward
Even earlier, when SMP was hitting the server room but still far from the
desktop, there was a similar phenomenon of breaking everything down to use
ever finer-grained locks ... until the locking overhead (and errors) outweighed
any benefit from parallelism. Over time, people learned to _think_ about
expected levels of parallelism, contention, etc. and "right size" their locks
accordingly.

Computing history's not a circle, but it's damn sure a spiral.

~~~
akvadrako
A spiral is pretty optimistic; it suggests we are converging on the optimal
solution.

~~~
notacoward
I usually think of it more as a three-dimensional spiral like a spring or a
spiral staircase. Technically that's a helix, but "history is a helix" just
doesn't sound as good for some reason.

------
mpweiher
"You can have a second computer once you've demonstrated you know how to use
one".

------
ikeboy
Recently I was sorting a 10-million-line CSV by the second field, which was
numeric. After an hour went by and it wasn't done, I poked around online and
saw a suggestion to put the field being sorted on first.

One awk command later, my file was flipped. I ran the same sort command on it,
but without specifying a field. It completed in 12 seconds.
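
For the curious, a minimal reconstruction of the trick (the commenter's exact
commands weren't given; the delimiter and file names are illustrative):

    # Swap the first two fields so the numeric field leads, then sort the
    # whole line numerically - no field specifier needed:
    awk -F, -v OFS=, '{ t = $1; $1 = $2; $2 = t; print }' data.csv > flipped.csv
    sort -n flipped.csv > sorted.csv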

Morals:

1. Small changes can have a 3+ orders-of-magnitude effect on performance.

2. Use the Google; it's easier than understanding every tool on a deep enough
level to figure this out yourself ;)

~~~
dorfsmay
CSV files are extremely easy to import into Postgres, and 10M rows (assuming
they're not very large) isn't much to compute, even on a 6- or 7-year-old
laptop. Keep it in mind if you've got something slightly more complicated to
analyse.
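
A minimal sketch of such an import, assuming a three-column file (the table,
column and file names are illustrative):

    # Create a table matching the CSV layout, then bulk-load it client-side:
    psql -c "CREATE TABLE games (white text, black text, result text);"
    psql -c "\copy games FROM 'games.csv' WITH (FORMAT csv, HEADER true)"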

~~~
hobs
If SQL is your game but you don't want to get PG set up - try SQLite -

      wget -O rows.csv "https://data.cityofnewyork.us/api/views/kku6-nxdu/rows.csv"
      sqlite3
      .mode csv
      .import ./rows.csv newyorkdata
      SELECT *
      FROM newyorkdata
      ORDER BY `COUNT PARTICIPANTS`;

~~~
jillesvangurp
Or use csvkit ...

~~~
hobs
Oh nice, I hadn't seen this before, so a similar query would be a bit shorter!

      wget "https://data.cityofnewyork.us/api/views/kku6-nxdu/rows.csv"
      csvsql --tables newyorkdata --query 'SELECT * FROM newyorkdata ORDER BY "COUNT PARTICIPANTS"' rows.csv > new.csv

------
0xcde4c3db
See also: Scalability! But at what COST? [1] [2]

> The COST of a given platform for a given problem is the hardware
> configuration required before the platform outperforms a competent single-
> threaded implementation. COST weighs a system’s scalability against the
> overheads introduced by the system, and indicates the actual performance
> gains of the system, without rewarding systems that bring substantial but
> parallelizable overheads.

[1]
[http://www.frankmcsherry.org/assets/COST.pdf](http://www.frankmcsherry.org/assets/COST.pdf)

[2]
[https://news.ycombinator.com/item?id=11855594](https://news.ycombinator.com/item?id=11855594)

~~~
nisa
Keep in mind it's for graph processing - Hadoop/HDFS still shines for data-
intensive streaming workloads, like indexing a few hundred terabytes of data,
where you can exploit the parallel disk I/O of all the disks in the cluster:
if you have 20 machines with 8 disks each, that's 20 * 8 * 100 MB/s = 16 GB/s
of throughput; for 200 machines it's 160 GB/s.

However, for iterative calculations like PageRank, the overhead of
distributing the problem is often not worth it.

------
mpweiher
For my performance book, I looked at some sample code for converting public
transport data in CSV format to an embedded SQLite DB for use on mobile. A
little bit of data optimization took the time from 22 minutes to under a
second, or ~1000x, for well over 100MB of source data.

The target data went from almost 200MB of SQLite to 7MB of binary that could
just be mapped into memory. Oh, and lookup on the device also became 1000x
faster.

There is a LOT of that sort of stuff out there, our “standard” approaches are
often highly inappropriate for a wide variety of problems.

~~~
mcguire
Normal developer behavior has gone from "optimize everything for machine
usage" (CPU time, memory, etc.) to "optimize everything for developer
convenience". The former is frequently inappropriate, but so is the latter.

(And some would say that it then went to "optimize everything for resume
keywords," which is almost always inappropriate, but I don't want to be too
cynical.)

~~~
mpweiher
Oh, it was also less code.

------
bufferoverflow
Previous posts:

2015:
[https://news.ycombinator.com/item?id=8908462](https://news.ycombinator.com/item?id=8908462)

2016:
[https://news.ycombinator.com/item?id=12472905](https://news.ycombinator.com/item?id=12472905)

2018:
[https://news.ycombinator.com/item?id=16810756](https://news.ycombinator.com/item?id=16810756)

~~~
jwilk
The last one has zero comments, so there's no point linking to it.

~~~
albemuth
Unless you're trying to make a point about reposting.

~~~
slededit
Three times over four years isn't unreasonable for interesting content. With
the exception of the unlucky post, this has generated good discussion each
time, and I think it's safely in the interesting category.

------
notacoward
A lot of people are saying how they've worked on single-machine systems that
performed far better than distributed alternatives. Yawn. So have I. So have
thousands of others. It should almost be a prerequisite for working on those
distributed systems, so that they can understand the real point of those
systems. _Sometimes_ it's about performance, and even then there's no "one
size fits all" answer. Just as often it's about capacity. Seen any _exabyte_
single machines on the market lately? Even more often than that, it's about
redundancy and reliability. What happens when your single-machine wonder has a
single hardware failure?

Sure, a lot of tyros are working on distributed systems because it's cool or
because it enhances their resumes, but there are also a lot of professionals
working on distributed systems because they're the only way to meet
requirements. Cherry-picking examples to favor your own limited skill set
doesn't seem like engineering to me.

------
makapuf
in other words, "Too big for excel is not big data"
[https://www.chrisstucchio.com/blog/2013/hadoop_hatred.html](https://www.chrisstucchio.com/blog/2013/hadoop_hatred.html)

~~~
mseebach
I heard a variation on this: it's not big data until it can't fit in RAM in a
single rack.

~~~
zackelan
The version I've heard is that small data fits on an average developer
workstation, medium data fits on a commodity 2U server, and "big data" needs a
bigger footprint than that single commodity server offers.

I like that better than bringing racks into it, because once you have multiple
machines in a rack you've got distributed systems problems, and there's a
significant overlap between "big data" and the problems that a distributed
system introduces.

~~~
mmt
It's frustrated me for the better part of a decade that the misconception
persists that "big data" begins after 2U. It's as if we're all still living
during the dot-com boom and the only way to scale is buying more "pizza
boxes".

Single-server setups larger than 2U but (usually) smaller than 1 rack can give
_tremendous_ bang for the buck, no matter if your "bang" is peak throughput or
total storage. (And, no, I don't mean spending inordinate amounts on brand-
name "SAN" gear).

There's even another category of servers, arguably non-commodity, since one
can pay a 2x price premium (but only for the server itself, not the storage),
that can quadruple the CPU and RAM capacity, if not I/O throughput of the
cheaper version.

I think the ignorance of what hardware capabilities are actually out there
ended up driving well-intentioned (usually software) engineers to choose
distributed systems solutions, with all their ensuing complexity.

Today, part of the driver is how few underlying hardware choices one has from
"cloud" providers and how anemic the I/O performance is.

It's sad, really, since SSDs have so greatly reduced the penalty for data not
fitting in RAM (while still being local). The penalty for being at the end of
an ethernet, however, can be far greater than that of a spinning disk.

~~~
zackelan
That's a good point, I suppose it'd be better to frame it as what you can run
on a $1k workstation vs. a $10k rackmount server, or something along those
lines.

As a software engineer who builds their own desktops (and has for the last 10
years) but mostly works with AWS instances at $dayjob, are there any resources
you'd recommend for learning about what's available in the land of that
higher-end rackmount equipment? Short of going full homelab, tripling my power
bill, and heating my apartment up to 30C, I mean...

~~~
mmt
> I suppose it'd be better to frame it as what you can run on a $1k
> workstation vs. a $10k rackmount server, or something along those lines.

That's probably better, since it'll scale a bit better with technological
improvements. The problem is, it doesn't have quite the clever sound to it,
especially with the numbers and dollars.

Now, the other main problem is that, though the cost of a workstation is
fairly well-bounded, the cost of that medium-data server can actually vary
quite widely, depending on what you need to do with that data (or, I suppose,
how long you might want to retain data you don't happen to be doing anything
to right at that moment).

I suppose that's part of my point: there's a misperception that, because a
single server (including its attached storage) can be _so_ expensive, to the
tune of many tens of thousands of (US) dollars, that somehow makes it "big"
and undesirable, despite its potentially close-to-linear price-to-performance
curve compared to those small 1U/2U servers. Never mind doing any reasoned
analysis of whether going farther up the single-server capacity/performance
axis, where the price curve gets steeper, is worth it compared to the cost and
complexity of a distributed solution.

> are there any resources you'd recommend for learning about what's available
> in the land of that higher-end rackmount equipment?

Sadly, no great tutorials or blogs that I know of. However, I'd recommend
taking a look at SuperMicro's complete-server products, primarily because, for
most of them, you can find credible barebones pricing with a web search. I
expect you already know how to account for other components (primarily of
concern for the mobos that take only exotic CPUs).

As I alluded to in another comment, you might also look into SAS expanders
(conveniently also well integrated into some, but far from all, SuperMicro
chassis backplanes) and RAID/HBA cards for the direct-attached (but still
external) storage.

------
nisa
See also the wonderful COST paper:
[https://www.usenix.org/conference/hotos15/workshop-
program/p...](https://www.usenix.org/conference/hotos15/workshop-
program/presentation/mcsherry)

But the article is kind of wrong. It depends on your data size and problem:
you can even use command-line tools with Hadoop Map/Reduce via the Streaming
API, and Hadoop is still useful if you have a few terabytes of data that you
can tackle with map and reduce algorithms; in that case multiple machines do
help quite a lot.

Anything that fits on your local SSD/HDD probably does not need Hadoop...
however, you can run the same Unix commands from the article just fine on a
20TB dataset with Hadoop.
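
A hedged sketch of the article's pipeline as a Hadoop Streaming job (the jar
and HDFS paths are illustrative and vary by distribution):

    # The framework sorts map output before the reduce phase, so a bare
    # "uniq -c" stands in for the article's local "sort | uniq -c":
    hadoop jar "$HADOOP_HOME"/share/hadoop/tools/lib/hadoop-streaming-*.jar \
        -input /data/chess \
        -output /data/chess-results \
        -mapper "grep Result" \
        -reducer "uniq -c"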

Hadoop MapReduce/HDFS is a tool for a specific purpose, not magic fairy dust.
The design comes from Google's MapReduce and GFS papers, built for indexing
and storing the web and probably not for calculating some big Excel sheets...

------
jandrese
Most people who think they have big data don't.

~~~
Groxx
At an _absolute_ minimum, I'd say "big data" begins when you can't buy
hardware with that much memory.

Apparently, in 2017, that was somewhere around 16 terabytes.
[https://www.theregister.co.uk/2017/05/16/aws_ram_cram/](https://www.theregister.co.uk/2017/05/16/aws_ram_cram/)
Heck, you can trivially get a 4TB instance from Amazon nowadays:
[https://aws.amazon.com/ec2/instance-
types/](https://aws.amazon.com/ec2/instance-types/)

The biggest DBs I've worked on have been a few tens of billions of rows, and
several hundreds of gigabytes. That's like... nothing. A laughable start. You
can make trivially inefficient mistakes and for the most part e.g. MySQL will
still work fine. Absolutely nowhere _near_ "big data". And odds are pretty
good you could just toss existing more-than-4TB-data through grep and then
into your DB and still ignore "big data" problems.

~~~
gaius
10 years ago I was working on a 50TB data warehouse. Now I see people who
think 50GB is “big data” because Pandas on a laptop chokes on it.

~~~
Swinx43
This, a million times this!!! I see consultants promoting the most
over-engineered solutions I have ever seen, for data that is a few hundred GB
or maybe, at best, a TB.

It makes me want to cry, knowing we handled that with a single server and a
relational database 10 years ago.

Let's also not forget that the majority of data actually has some sort of
structure. There is no point in pretending that every piece of data is a BLOB
or a JSON document.

I have given up on our industry ever becoming sane; I now fully expect each
hype cycle to be pushed to the absolute maximum, only to be replaced by the
next buzzword cycle when the current one starts failing to deliver on its
promises.

~~~
therealdrag0
Yep. I left consulting after working on a project that was a Ferrari when the
customer only needed a Honda. Our architect kept being like "it needs to be
faster", and I'm like "our current rate would process the entire possible
dataset (everyone on earth) in a night, do we really need a faster Apache
Storm cluster?" :S

"Resume driven development" is a great term.

------
Something1234

        cat *.pgn | grep "Result" | sort | uniq -c
    

This pipeline has a useless use of cat. Over time I've found cat to be kind of
slow compared to passing a filename to a command directly when I can. If you
rewrite it as:

    
    
        grep -h "Result" *.pgn | ...
    

It would be much faster. I found this when I was fiddling with my current log
processor to analyze stats on my blog.

~~~
tzahola

       Argument list too long!

~~~
pwg
Then do:

    find . -name '*.pgn' -print0 | xargs -0 grep -h "Result"

Each grep invocation will consume the maximum number of arguments allowed, and
xargs will invoke the minimum number of greps needed to process everything,
with no "args too long" errors.

------
eb0la
In my case it was 2015. I was struggling with a 28GB CSV file from which I
needed to cut only 5 columns.

I tried Spark on my laptop: a waste of time. After 4 hours I killed all the
processes because it hadn't read even 25% of the file yet.

Same for Hadoop, Python with pandas, and a shiny new tool from Google whose
name I forgot a long time ago.

Finally I installed Cygwin on my laptop, and 20 minutes later 'cut' gave me
the results file I needed.
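
For reference, a sketch of the kind of one-liner that does the job (the
delimiter and column numbers are illustrative):

    # Keep only five columns of a comma-separated file; cut streams
    # line by line, so memory use stays constant regardless of file size:
    cut -d, -f1,4,7,9,12 big.csv > five_columns.csv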

~~~
jzwinck
cut solved your problem. Let's talk about why.

cut is line-oriented, like most Unix-style filters. It needs to keep at most
one line in memory.

If you say:

    
    
        pd.read_csv(f)[[x,y,z]]
    

It has to read and parse the entire 28GB into memory (because it is not lazily
evaluated; cf Julia).

If you actually need to operate on three columns in memory and discard the
rest, you should:

    
    
        pd.read_csv(f, usecols=[x,y,z])
    

Then you get exactly what you need, and avoid swapping.

The lack of lazy evaluation does inhibit composition--just look at the myriad
options in read_csv(), some of which are only there to enable eager evaluation
to remain efficient.

~~~
_wmd
While Pandas' CSV parser is quite slow anyway, the reason Pandas is
particularly slow in this case is that it insists on applying type detection
to every field of every row it reads. I have no clue how to disable it, but
it's the default behaviour.

Parsing isn't actually a tough problem –
[https://github.com/dw/csvmonkey](https://github.com/dw/csvmonkey) is a
project of mine; it manages almost 2GB/sec of throughput _per thread_ on a
decade-old Xeon.

------
raghava
For many, the incentive from the gig is:

a) resume enrichment by way of buzzword addition

b) huge budget grants and allocations, purportedly for lofty goals, while
management is really unaware of the real technology needs/options

Much has been said about this already; sharing the links again!

[1]
[https://news.ycombinator.com/item?id=14401399](https://news.ycombinator.com/item?id=14401399)

[2]
[https://www.chrisstucchio.com/blog/2013/hadoop_hatred.html](https://www.chrisstucchio.com/blog/2013/hadoop_hatred.html)

[3] [http://widgetsandshit.com/teddziuba/2010/10/taco-bell-
progra...](http://widgetsandshit.com/teddziuba/2010/10/taco-bell-
programming.html)

[4]
[https://www.reddit.com/r/programming/comments/3z9rab/taco_be...](https://www.reddit.com/r/programming/comments/3z9rab/taco_bell_programming/)

[5] [https://www.mikecr.it/ramblings/taco-bell-
programming/](https://www.mikecr.it/ramblings/taco-bell-programming/)

------
hans_castorp
So there are 2197188 games in that file.

Extracting only the lines with "[Result]" in them into a new file (using grep)
takes about 3 seconds.

Importing that into a local Postgres database on my laptop takes about 1.5
seconds.

Then running a simple:

    
    
       select result, count(*)
       from results
       group by result;
    

Takes about 0.5 seconds.

So the total process took only 5 seconds (and now I can run many more
aggregation queries on that data).

------
exelius
I feel that we have spent 30 years replicating Bash across multiple computer
systems.

The further my clients move to the cloud, the more shell scripts they write,
to the exclusion of other languages. And, just like in this article, I have
clients who have ripped out expensive enterprise data-streaming tools and
replaced them with bash.

The future of enterprise software is going to be a bloodbath.

~~~
hoffbrau99
I want to believe, but instead it will be javascript-driven blockchain at the
command line

~~~
konradb
how else would you control your ML IoT cluster?

------
jgord
Anecdata: we used fast CSV tools and custom node.js scripts to wrangle the
import of ~500GB of geo polygon data into a large single-volume
postgresql+PostGIS host.

We generate SVG maps in pseudo-realtime from this dataset: 2MB maps render
sub-second over the web, which feels 'responsive'.

I only mention this because many marketing people will call 50 million rows or
1TB "Big Data" and therefore suggest big / expensive / complex solutions.
Recent SSD hosts have pushed up the "Big Data" watermark, and offer superb
performance for many data applications.

[ yes, I know you can't beat magnetic disks for storing large videos, but
that's a less common use-case ]

~~~
sitkack
How do you partition your data and/or build your indexes?

~~~
jgord
two kinds of indexes, for different purposes ...

a) basically a preprocessed static inverted index on keywords [ using GIN /
tsv_ etc ]

b) geo location index on GIS geometry field - relying on postGIS to be
performant over geo queries

The data changes infrequently, so these are computed at time of import /
regular update.

We do have a fair amount of RAM set aside for Postgres [ circa 20GB ]
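
A hedged sketch of what those two indexes might look like (table and column
names are illustrative):

    # a) GIN index over a precomputed tsvector expression for keyword search;
    # b) GiST index over the PostGIS geometry column for geo queries.
    psql -c "CREATE INDEX places_keywords_idx ON places USING GIN (to_tsvector('english', keywords));"
    psql -c "CREATE INDEX places_geom_idx ON places USING GIST (geom);"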

------
icc97
Relevant article from 2013: 'Don't use Hadoop - your data isn't that big' [0]
and the most recent HN discussion [1].

[0]:
[https://www.chrisstucchio.com/blog/2013/hadoop_hatred.html](https://www.chrisstucchio.com/blog/2013/hadoop_hatred.html)

[1]:
[https://news.ycombinator.com/item?id=14401399](https://news.ycombinator.com/item?id=14401399)

------
adamdrake
Hi all, author here!

Many thanks for the feedback and comments. If you have any questions, I'm also
happy to try to answer them.

I'm also working on a project,
[https://applybyapi.com](https://applybyapi.com) which may be of interest to
anyone here hiring developers and drowning in resumes.

~~~
badlogic
I gave your service a quick whirl. As a developer, I love the design language.
What I don't like is that I have to provide my credit card information to get
any detailed info (assuming there is such a thing behind the Stripe checkout).

The info available to non-signed-up people is simply not enough. It would be
great if you had a demo account so one can get a feel for how this works.

E.g. I couldn't find info on why I have to add a job description. Wouldn't I
post to a job board, with the posting linking to the ApplyByAPI test?

I could also not find any info on the tests candidates have to do. Yes, APIs,
I get it. But an example wouldn't hurt.

Interesting idea!

~~~
adamdrake
Hi, and thank you for the great feedback!

We will certainly consider those points, try to make things clearer from a
communication perspective, and also look into adding a demo account (great
idea!).

To answer the question about the job description, you're right that most
customers so far have a post on a jobs board which they use to get traffic
over to their ApplyByAPI posting. Once the candidate is there, they get a page
with the job description and some API information they can use to generate
their application.

We'll work on improving the language and information presentation to potential
customers, and thank you again for taking the time to give your perspective!

------
jakosz
You can get very good improvements over Spark, too. I've been using GNU
Parallel + Redis + Cython workers to calculate distance pairs for a
disambiguation problem. But then again, if it fits into a few X1 instances,
it's not big data!

------
dorfsmay
If you need to do something more complicated where SQL would be really handy,
and, like here, you're not going to update your data, give MonetDB a try.
Quite often when I'm about to break down and implore the Hadoop gods, I
remember MonetDB, and most of the time it's sufficient to solve my problem.

MonetDB: a columnar DB for a single machine with a psql-like interface:
[https://www.monetdb.org/](https://www.monetdb.org/)

------
walshemj
Well yes, in the edge case where you don't really have big data, of course it
will.

Where MapReduce, Hadoop et cetera shine is when a single dataset, one of many
you need to process, is bigger than the biggest single disk available - this
changes with time.

Back when I did M/R for BT the dataset sizes were smaller - still, having all
of the UK's largest PR1ME superminis running your code was damn cool - even
though I used a 110-baud dial-up print terminal to control it.
------
libx
Once, I saw a presentation of Unicage, a big-data solution that had (and
perhaps still has) a free version:
[https://youtu.be/h_C5GBblkH8?t=2566](https://youtu.be/h_C5GBblkH8?t=2566) It
seems to have evolved into a company now:
[http://www.unicage.com/products.html](http://www.unicage.com/products.html)

Did anyone try the Unicage solution?

------
ram_rar
I think much of this issue can be attributed to the two most underrated
things:

1. Cache line misses.

2. The so-called definition of Big Data (if the data can easily fit into
memory, then it's not Big, period!).

Many times I have seen simple awk/grep commands outperform Hadoop jobs. I
personally feel it's a lot better to spin up larger instances, compute your
jobs and shut them down than to bear the operational overhead of managing a
Hadoop cluster.

------
verytrivial
Anchoring the search would allow the regex to terminate faster for non-
matching lines:

    
    
        ... | mawk '/^\[Result' ...

------
rurban
Especially if you use multipipe-enhanced coreutils, like dgsh:
[https://www2.dmst.aueb.gr/dds/sw/dgsh/](https://www2.dmst.aueb.gr/dds/sw/dgsh/)

------
noobermin
It would benefit people to actually understand the algorithmic complexity of
what they are doing before they go on these voyages to parallelize everything.
It also helps to know what parallelizes well and what doesn't.

------
insaneirish
Something, something, Joyent Manta: [https://apidocs.joyent.com/manta/job-
patterns.html](https://apidocs.joyent.com/manta/job-patterns.html)

------
ash_gti
Shouldn't this be marked 2014? The article's date is January 18, 2014.

~~~
gaius
It’s probably even truer and more relevant now

------
xvilka
Partially because of the bloated architecture of Hadoop, Kafka, etc. And of
course Java. Implementing a modern and lighter alternative to those in C++, Go
or Rust would be a step forward.

~~~
jfoutz
Weird. I’d always thought of Go as a slower Java with a shallower pool of good
libraries and weak dev tools. Are the compiler and GC that good? Maybe I’ll
have to give it another try.

~~~
Groxx
Depends on what you're targeting. For raw computation, Go's similar and
sometimes noticeably faster, even after hotspot has its turn. For garbage
collection _pause time_ , Go's rather amazing, typically measuring _less than
a millisecond_ even for gigabytes of heap mem[1]. For bin startup time (e.g.
for a CLI tool) try Go instead just because it'll be done _way_ before the JVM
even hands control to your code.

For dev tools, oh hell yes. Stick to Java. Go comes with some nice ones out-
of-the-box which is always appreciated, but the ecosystem of stuff for Java is
among the very best, and even the standard "basic" stuff vastly out-does Go's
builtins.

[1]:
[https://twitter.com/brianhatfield/status/634166123605331968?...](https://twitter.com/brianhatfield/status/634166123605331968?lang=en)
and there's also a blog/video about these same charts and how they got there.
Pretty neat transformations.

------
mirceal
People are suckers for complex, over-engineered things. They associate
mastering complexity with intelligence. Simple things that just work are
boring / not sexy.

I’ll be in my corner asking “do we really need this?” / “have you tested
this?”

------
dunk010
'Nuff said:
[https://github.com/erikfrey/bashreduce](https://github.com/erikfrey/bashreduce)

------
blablabla123
I've heard this so often, especially from my boss, whoever that happens to be.
I've come to the conclusion that this is just a lame excuse, because people
feel out of control and overwhelmed when using this kind of software. It is
heavily Java-based, and to those who have never touched javac, the stack
traces must look both intimidating and ridiculous.

On the other hand, processing data over several steps with a homegrown
solution needs a lot of programming discipline and reasoning, otherwise your
software turns into an unmaintainable and unreliable mess. In fact, this is
the case where I work right now...

------
forinti
This reminds me of the early EJBs, which were really complicated; most people
who used them didn't really need them.

------
koomi
It can be colder at night than outside.

------
erazor42
I closed the tab at "Since the data volume was only about 1.75GB". (And
probably would have up to 500GB+.)

~~~
imtringued
500GB is still laptop-SSD territory... unless you meant 500TB, but even that
is doable with a single 4U Backblaze Storage Pod with 60 x 10TB HDDs.

~~~
dajonker
Just reading 500TB from 60 HDDs will take at least 11-12 hours, assuming a
very optimistic constant 200MB/s throughput per HDD (60 x 200MB/s = 12GB/s,
and 500TB / 12GB/s is roughly 11.6 hours). If you need things to go faster,
you won't be able to do that without resorting to distributed computing.

------
hajderr
Ah yes this is so true!

------
angel_j
The Hadoop solution may have taken more computer time, but that is far less
valuable than a person's.

The command-line solution probably took that person a couple or more hours to
perfect.

------
megaman22
If I do one thing this year, I need to learn more about these tried-and-tested
command-line Unix shell utilities. It's becoming increasingly obvious that so
many complicated and overwrought things are being unleashed on the world
because (statistically) nobody knows this stuff anymore.

~~~
eb0la
You just need to master a few commands:

find (take a look at the -exec option)

cut (or awk)

sed (for simple text/string substitutions)

xargs

dd (that beast can do charset translation from ASCII to EBCDIC for use on
mainframes, and can also wipe disks; see the sketch below)

And bash of course :)
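
A quick sketch of that dd trick (file names are illustrative):

    # Convert an ASCII text file to EBCDIC for consumption on a mainframe:
    dd if=report.txt of=report.ebcdic conv=ebcdic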

~~~
jake_morrison
One fun thing is that xargs supports parallelization.

    
    
        -P max-procs
        Run up to max-procs processes at a time; the default is 1.
        If max-procs is 0, xargs will run as many processes as
        possible at a time.  Use the -n option with -P; otherwise
        chances are that only one exec will be done.
    
        -n max-args
        Use at most max-args arguments per command line.
        Fewer than max-args arguments will be used if the size
        (see the -s option) is exceeded, unless the -x option is
        given, in which case xargs will exit.
    

So here is a "map/reduce" job which takes the log files in a directory and
processes them in parallel on eight CPUs, then combines the results.

    
    
        find . -name "*.log" | sort | xargs -n 1 -P 8 agg_stats.py | sort | merge_periods.py

~~~
Groxx
xargs is one of my favorites; it's so easy to split something apart to speed
it up by a few multiples.

Gotta SSH into 100 machines and ask them all a simple question? xargs will
trivially speed that up by at least 10x, if not better, just by parallelizing
the SSH handshakes.
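
A minimal sketch of that pattern (the host list, parallelism and remote
command are illustrative):

    # Fan out over the hosts in hosts.txt, 16 SSH sessions at a time:
    xargs -P 16 -I {} ssh -o BatchMode=yes {} uptime < hosts.txt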

