Command-line Tools can be 235x Faster than a Hadoop Cluster (2014) (adamdrake.com)
438 points by hd4 on May 23, 2018 | 222 comments



I have rewritten incredibly overarchitected stuff, Cassandra, Hadoop, Kafka, Node, Mongo etc, with a plethora of ‘the latest cool programming languages’ running on big clusters of Amazon and Google, to simple, but not sexy, C# and MySQL or pgsql. Despite people commenting on the inefficiency of ORMs and the unscalable nature of the solution I picked, it easily outperforms in every way for the real-world cases these systems were used for. Meaning: far easier architecture, easier to read, and far better performance in both latency and throughput for workloads that will probably never happen. Also: one language, fewer engineers needed, less maintenance and easily swappable databases. I understand that all that other tech is in fact ‘learning new stuff’ for RDD, but it was costing these companies a lot of money with very little benefit. If I need something for very high traffic and huge data, I still do not know if I would opt for Cassandra or Hadoop; even with proper setup, sure they scale, but at what cost? I had far better results with kdb+, which requires very little setup and very minimal overhead if you do it correctly. Then again, we will never have to mine petabytes, so maybe the use case works there: would love to hear from people who tried different solutions objectively.


The tree swing cartoon has never been more true [1]

Large companies would benefit from devs not over-engineering simple apps and spending less time on their own tools and ticket systems, and spending more time instead on solving/automating more problems for the company.

[1] https://www.tamingdata.com/wp-content/uploads/2010/07/tree-s...


I have had a very similar experience. In particular, with my current project one of the users (a sysadmin, actually) asked if I was using Elasticsearch or similar, because he noticed that searching for a record and filtering was very fast.

My response: nope, MySQL! (plus an ORM, a very hated programming language and very few config optimisations on the server side).

This project's DB has a couple of tables with thousands of records, not billions, and, for now, a few users (40-50)... a good schema and a few well-done queries can do the trick.

I guess some people are so used to seeing sluggish applications that as soon as they see something that goes faster than average, they think it must use some cool latest big tech.


More than once I've run across situations of everyone sitting around going "we need to scale up the server, or switch the database, or rewrite some of this frontend stuff, or something, it's so slow and there's nothing we can do", and solved their intractable performance problems by just adding an index in MySQL that actually covers the columns they're querying on.
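Concretely, that kind of fix is often a single statement. A minimal sketch (the database, table, and column names here are invented for illustration):

    # the query was scanning the whole table (check the plan with EXPLAIN)...
    mysql -e "EXPLAIN SELECT * FROM mydb.orders WHERE customer_id = 42 AND created_at > '2018-01-01'\G"
    # ...until an index matching the filtered columns was added
    mysql -e "ALTER TABLE mydb.orders ADD INDEX idx_customer_created (customer_id, created_at);"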

Lots of people seem to want some silver bullet magical software/technology to solve their problems instead of learning how to use the tools they have. That's not software development.


kdb+ is very cool. Just a single binary, less than 1 MB, and yet an extremely fast and powerful tool for analytics. There's a very nice talk in CMU's series of database videos if you want to know more: https://www.youtube.com/watch?v=AiGdfmxEP68 And it scores extremely well on the 1.1 billion taxi rides benchmark (no cluster, but a seriously big machine): http://tech.marksblogg.com/billion-nyc-taxi-kdb.html


A lot of architecture is designed from the "what would enable me to job-hop to the next position on the seniority pole" vantage point.


RDD probably stands for 'resume-driven development', so yes, the parent agrees.


We call that "LinkedIn Driven Development" (LIDD)


Just perfect!


> seniority pole

Well, I'm not sure "seniority" is the right word - the more tech stuff you know, in general, the _less_ seniority you're going to achieve in terms of org charts, decent seating, respect and actual pull within an organization. You can achieve job security and higher pay that way, though.


And if your data is really big then so long as it is structured something like BigQuery lets you carry on using standard SQL queries...


I once converted a simulation into cython from plain old python.

Because it fit in the CPU cache the speedup was around 10000x on a single machine (numerical simulations, amirite?).

Because it was so much faster all the code required to split it up between a bunch of servers in a map reduce job could be deleted, since it only needed a couple cores on a single machine for a ms or three.

Because it wasn't a map-reduce job, I could take it out of the worker queue and just handle it on the fly during the web request.

Sometimes it's worth it to just step back and experiment a bit.


Yeah, back when I was in gamedev land and multi-cores started coming on the scene it was "Multithread ALL THE THINGS". Shortly thereafter people realized how nasty cache invalidation is when two cores are contending over one line. So you can have the same issue show up even in a single-machine scenario.

Good understanding of data access patterns and the right algorithm go a long way in both spaces as well.


Even earlier, when SMP was hitting the server room but still far from the desktop, there was a similar phenomenon of breaking everything down to use ever finer-grain locks ... until the locking overhead (and errors) outweighed any benefit from parallelism. Over time, people learned to think about expected levels of parallelism, contention, etc. and "right size" their locks accordingly.

Computing history's not a circle, but it's damn sure a spiral.


> Computing history's not a circle, but it's damn sure a spiral.

I think that is my new favorite phrase about computing history. Everything old is new again. There's way too much stuff we can extract from the past for current problems. It's kind of amazing.


I wish I could take credit for it. I know I got it from someone else, could have been lots of people, but Fun Fact: one famous proponent of the idea (though obviously not in a computing context) was Lenin.

https://platypus1917.org/2011/06/01/lenins-liberalism/

A similar sentiment is that history doesn't repeat but it does rhyme.

https://quoteinvestigator.com/2014/01/12/history-rhymes/


Lenin's views on this come directly from Marx - it's the concept of "dialectical materialism" (though Marx did not use that term) - and Hegel. Specifically, Marx noted in Capital that (paraphrased) capitalist property is the first negation (the antithesis) of feudalism (the thesis), but that capitalist production necessarily leads to its own negation in the form of communism - the negation of the negation (the synthesis) - basically applying Hegelian dialectics to "the real world".

In that example the idea is that feudalism represents a form of "community property", which though owned by a feudal lord in the final instance is in practice shared. Capitalism then "negates" that by making the property purely private, before communism negates that again and reverts to shared property but with different specific characteristics.

The three key principles of Hegel's dialectics come from Heraclitus (the idea of inherent conflict within a system pulling it apart), Aristotle (the paradox of the heap; the idea that quantitative changes eventually lead to qualitative change), and Hegel himself (the "negation of the negation" that was popularised by Marx in Capital; the idea that, driven by inherent conflict, qualitative changes will first change a system substantially, before reversing much of the nature of the initial change, but with specific qualitative differences).

The idea is known to have been simultaneously arrived at by others in the 19th century too (e.g. at least one correspondent of Marx's came up with it independently), and it's quite possible that variations of it significantly predate both Marx and Hegel as well.


"Computing history's not a circle, but it's damn sure a spiral."

If it was a circle we'd have better code and framework reuse. Sigh.


A spiral is pretty optimistic; it suggests we are converging on the optimal solution.


Assuming an inward spiral ?


I usually think of it more as a three-dimensional spiral like a spring or a spiral staircase. Technically that's a helix, but "history is a helix" just doesn't sound as good for some reason.


What kind of games? I always thought that e.g. network-synced simulation code in RTSes or MMOs would be extremely amenable to multithreading, since you could just treat it like a cellular automata: slicing up the board into tiles, assigning each tile to a NUMA node, and having the simulation-tick algorithm output what units or particles have traversed into neighbouring tiles during the tick, such that they'll be pushed as messages to that neighbouring tile's NUMA node before the next tick.

(Obviously this wouldn't work for FPSes, though, since hitscan weapons mess the independence-of-tiles logic all up.)


This was back in the X360/PS3 era and mostly FPS/third-person (think Unreal Engine 3 and the like).

What's really ironic is the PS3 actually had the right architecture for this: the SPUs only had 256KB of directly addressable memory, so you had to DMA everything in/out, which forced you to think about memory locality, vectorization, cache contention, etc. However, the X360 hit first, so everyone just wrote for its giant unified memory architecture and then got brutalized when they tried to port to PS3.

What's even funnier in hindsight is all that work you'd do to get the SPUs to hum along smoothly translated to better cache usage on the X360. Engines that went <prev gen> -> PS3 -> X360 ran somewhere between 2-5x faster than the same engine that went <prev gen> -> X360 -> PS3. We won't even talk about how bad the PC -> PS3 ports went :).


Hadoop has its time and place. I love using hive and watching the consumed CPU counter tick up. When I get our cluster to myself and it steps 1hr of CPU every second it's quite a sight to see.


Yeah, I got a bit of a shock when I first used our one and my fifteen minute query took two months of CPU time. Thought maybe I'd done something wrong until I was assured that was quite normal.


Twitter?


"You can have a second computer once you've demonstrated you know how to use one".


Recently I was sorting a 10 million line CSV by the second field, which was numerical. After an hour went by and it wasn't done, I poked around online and saw a suggestion to put the field being sorted on first.

One awk command later my file was flipped. I ran the same exact sort command on this, but without specifying the field. Completed in 12 seconds.
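Presumably something along these lines (file names are placeholders; a reply further down confirms the file only had two fields):

    # swap the two comma-separated fields so the numeric one comes first
    awk -F',' 'BEGIN{OFS=","} {print $2, $1}' input.csv > flipped.csv
    # then a plain numeric sort needs no field specifier
    LC_ALL=C sort -n -r flipped.csv > output.csv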

Morals:

1. Small changes can have a 3+ orders of magnitude effect on performance

2. Use the Google, easier than understanding every tool on a deep enough level to figure this out yourself ;)


CSV files are extremely easy to import into Postgres, and 10M rows (assuming they're not very large) isn't much to compute even on a 6 or 7 year old laptop. Keep it in mind if you've got something slightly more complicated to analyse.


If SQL is your game but you don't want to get PG set up - try SQLite:

  wget "https://data.cityofnewyork.us/api/views/kku6-nxdu/rows.csv" rows.csv
  sqlite3
  .mode csv
  .import ./rows.csv newyorkdata
  SELECT *
  FROM newyorkdata
  ORDER BY `COUNT PARTICIPANTS`;


Nice! Talk about being a productive engineer! Kudos!


Or use csvkit ...


Oh nice, I hadn't seen this before, so a similar query would be a bit shorter!

  wget "https://data.cityofnewyork.us/api/views/kku6-nxdu/rows.csv" 
  csvsql --query 'SELECT * FROM rows ORDER BY "COUNT PARTICIPANTS"' rows.csv > new.csv


What did you use to sort these 10 million lines?


sort in bash.

Specific command:

LC_ALL=c sort -n filename -r > output


This is really interesting to me - I can't seem to reproduce this on a couple different platforms. In the past (and just now) I've definitely seen 100x differences with LC_ALL=C, but you used that on both, I'm sure.

How large was the 10 million line file, and do you know whether sort was paging files to disk?


About 300MB. I just tried to replicate using random data and failed, will look more later, mine is power-law distributed and might make a difference.

Used these commands:

    cat /dev/urandom | tr -dc 'a-zA-Z0-9' | fold -w 32 | head -n 10000000 > text

    cat /dev/urandom | od -DAn | tr -s ' ' '\n' | head -10000000 | sed -e 's/$/\/9.0/' > num

    bc -l num > num2

(that's to create decimal numbers)

    paste -d, text num2 > textfirst

    paste -d, num2 text > numfirst

    time LC_ALL=c sort -n numfirst -r > output

    real    0m10.712s
    user    0m21.959s
    sys     0m1.644s

    time LC_ALL=C sort --field-separator=',' -k2 -n -r textfirst > output

    real    0m9.638s
    user    0m21.940s
    sys     0m1.293s


but what did you use to sort by the second field, in your original setup?


LC_ALL=C sort --field-separator=',' -k2 -n


This uses field 2 to the end of the line as the sort field; would

  -k2,2
have been faster, as it limits to just the second field?


There were only 2 fields so I expect not.


See also: Scalability! But at what COST? [1] [2]

> The COST of a given platform for a given problem is the hardware configuration required before the platform outperforms a competent single-threaded implementation. COST weighs a system’s scalability against the overheads introduced by the system, and indicates the actual performance gains of the system, without rewarding systems that bring substantial but parallelizable overheads.

[1] http://www.frankmcsherry.org/assets/COST.pdf

[2] https://news.ycombinator.com/item?id=11855594


Keep in mind it's for graph processing. Hadoop/HDFS still shines for data-intensive streaming workloads like indexing a few hundred terabytes of data, where you can exploit the parallel disk I/O of all the disks in the cluster: if you have 20 machines with 8 disks each, that's 20 * 8 * 100 MB/s = 16 GB/s of throughput; for 200 machines it's 160 GB/s.

However for iterative calculations like pagerank the overhead for distributing the problems is often not worth it.


For my performance book, I looked at some sample code for converting public transport data in CSV format to an embedded SQLite DB for use on mobile. A little bit of data optimization took the time from 22 minutes to under a second, or ~1000x, for well over 100MB of Source data.

The target data went from almost 200MB of SQLite to 7MB of binary that could just be mapped into memory. Oh, and lookup on the device also became 1000x faster.

There is a LOT of that sort of stuff out there, our “standard” approaches are often highly inappropriate for a wide variety of problems.


Normal developer behavior has gone from "optimize everything for machine usage" (cpu time, memory, etc.) to "optimize everything for developer convenience". The former is frequently inappropriate, but the latter is, as well.

(And some would say that it then went to "optimize everything for resume keywords," which is almost always inappropriate, but I don't want to be too cynical.)


Oh, it was also less code.


Ha! I did the same for my BSc thesis: On-device route planning from GTFS data for Budapest’s public transport network

iOS used to kill processes above 50 megs of RAM usage. Good times!


Cool! Link?


Unfortunately it's no longer in the AppStore (ubiquitous cellular data made it obsolete) and the thesis itself is written in Hungarian :D


:-(

Not so sure it's really obsolete, because connections are often spotty and local data can provide some initial information while waiting for the servers.

Github?


It wasn't put on GitHub back then, and I don't want to publish it now.

But I've sent it to you in email - check it out if you're interested.



The last one has zero comments, so there's no point linking to it.


Unless you're trying to make a point about reposting.


3 times over 4 years isn't unreasonable for interesting content. With the exception of the unlucky post this has generated good discussion each time and I think is safely in the interesting category.


in other words, "Too big for excel is not big data" https://www.chrisstucchio.com/blog/2013/hadoop_hatred.html


I heard a variation on this: it's not big data until it can't fit in RAM in a single rack.


The version I've heard is that small data fits on an average developer workstation, medium data fits on a commodity 2U server, and "big data" needs a bigger footprint than that single commodity server offers.

I like that better than bringing racks into it, because once you have multiple machines in a rack you've got distributed systems problems, and there's a significant overlap between "big data" and the problems that a distributed system introduces.


It's frustrated me for the better part of a decade that the misconception persists that "big data" begins after 2U. It's as if we're all still living during the dot-com boom and the only way to scale is buying more "pizza boxes".

Single-server setups larger than 2U but (usually) smaller than 1 rack can give tremendous bang for the buck, no matter if your "bang" is peak throughput or total storage. (And, no, I don't mean spending inordinate amounts on brand-name "SAN" gear).

There's even another category of servers, arguably non-commodity, since one can pay a 2x price premium (but only for the server itself, not the storage), that can quadruple the CPU and RAM capacity, if not I/O throughput of the cheaper version.

I think the ignorance of what hardware capabilities are actually out there ended up driving well-intentioned (usually software) engineers to choose distributed systems solutions, with all their ensuing complexity.

Today, part of the driver is how few underlying hardware choices one has from "cloud" providers and how anemic the I/O performance is.

It's sad, really, since SSDs have so greatly reduced the penalty for data not fitting in RAM (while still being local). The penalty for being at the end of an ethernet, however, can be far greater than that of a spinning disk.


That's a good point, I suppose it'd be better to frame it as what you can run on a $1k workstation vs. a $10k rackmount server, or something along those lines.

As a software engineer who builds their own desktops (and has for the last 10 years) but mostly works with AWS instances at $dayjob, are there any resources you'd recommend for learning about what's available in the land of that higher-end rackmount equipment? Short of going full homelab, tripling my power bill, and heating my apartment up to 30C, I mean...


> I suppose it'd be better to frame it as what you can run on a $1k workstation vs. a $10k rackmount server, or something along those lines.

That's probably better, since it'll scale a bit better with technological improvements. The problem is, it doesn't have quite the clever sound to it, especially with the numbers and dollars.

Now, the other main problem is that, though the cost of a workstation is fairly well-bounded, the cost of that medium-data server can actually vary quite widely, depending on what you need to do with that data (or, I suppose, how long you might want to retain data you don't happen to be doing anything to right at that moment).

I suppose that's part of my point: there's a mis-perception that, because a single server (including its attached storage) can be so expensive, to the tune of many tens of thousands of (US) dollars, that somehow makes it "big" and undesirable, despite its potentially close-to-linear price-to-performance curve compared to those small 1U/2U servers. Never mind doing any reasoned analysis of whether going farther up the single-server capacity/performance axis, where the price curve gets steeper, is worth it compared to the cost and complexity of a distributed solution.

> are there any resources you'd recommend for learning about what's available in the land of that higher-end rackmount equipment?

Sadly, no great tutorials or blogs that I know of. However, I'd recommend taking a look at SuperMicro's complete-server products, primarily because, for most of them, you can find credible barebones pricing with a web search. I expect you already know how to account for other components (primarily of concern for the mobos that take only exotic CPUs).

As I alluded in another comment, you might also look into SAS expanders (conveniently also well integrated into some, but far from all, SuperMicro chassis backplanes) and RAID/HBA cards for the direct-attached (but still external) storage.


Actually it is sometimes faster to fetch data over a network loading from SSD than it is to read from a local spinning disk. Source: current work.


Well do notice I did say the penalty "can be" not "always is" far greater.

That's primarily because I'm aware of the variability that random access injects into spinning disk performance and that 10GE is now common enough that it takes more than just a single (sequentially accessed) spinning disk to saturate a server's NIC.

Plus, if you're talking about a (single) local spinning disk, I'd argue that's a trivial/degenerate case, especially if compared to a more expensive SSD. Does my assertion stand up better if it had "of comparable cost" tacked on? Otherwise, the choice doesn't make much sense, since a local SSD is the obvious choice.

My overall point is that, though one particular workload may make a certain technology/configuration appear superior to another [1], in the general case, or, perhaps most importantly, in the high performance case, you have to keep an eye on the bottlenecks, especially the ones that carry a high incremental cost of increasing their capacity.

It may be that people think the network, even 10GE now, is too cheap to be one of those bottlenecks, arguably a form of fallacy [2] number 7, but that ignores the question of aggregate (e.g. inter-switch) traffic. 40G and 100G ports can get pricey, and, at 4x and 10x of a single server port, they're far from solving fallacy number 3 at the network layer.

The other tendency I see is for people not to realize just how expensive a "server" is, by which I mean the minimum cost, before any CPUs or memory or storage. It's about $1k. The fancy, modern, distributed system designed on 40 "inexpensive" servers is already spending $40k just on chassis, motherboards, and PSUs. If the system didn't really need all 80 CPU sockets and all those DIMM sockets, it was money down the drain. What's worse, since the servers had to be "cheap", they were cargo-cult sized at 2U with low-end backplanes, severely limiting I/O performance. Then, to expand I/O performance, more of the same servers [3] are added, not because CPU or memory is needed, but because disk slots are needed, and another $4k is spent to add capacity for 2-4 disks.

[1] This has been done on purpose for "competitive" benchmarks since forever

[2] https://en.wikipedia.org/wiki/Fallacies_of_distributed_compu...

[3] Consistency in hardware is generally something I like, for supportability, except it's essentially impossible anyway, given the speed of computer product changes/refreshes, which means I think it's also foolish not to re-evaluate when it's capacity-adding time after 6-9 months.


Actually my example is far simpler and less interesting. Having a console devkit read un-ordered file data from a local disk ends up being slower than reading the same data from a developer's machine from an SSD over a plain gigabit network connection. Simply has to do with the random access patterns and seek latency of the spinning disk versus the great random access capabilities of an SSD. Note this is quite unoptimised reading of the data.


Yes, that is, indeed, a degenerate case, as I suspected.

Is it safe to say that such situations are often found with embedded or otherwise specialized hardware?


I am on a completely different end of the spectrum (embedded devices) - How would I go about learning the capabilities of modern servers?


Good question.

At a theoretical level, as a sysadmin, I learned the theoretical capabilities by reading datasheets for CPUs, motherboards (historically, also chipsets, bridge boards, and the like, but those are much less relevant now), and storage products (HBAs/RAID cards, SAS expander chips, HDDs, SSDs). Make sure you're always aware of the actual payload bandwidth (net of overhead), actual units (base 2 or base 10) and duplex considerations (e.g. SATA).

From a more practical level, I look at various vendors' actual products, since it doesn't matter (for example) if a CPU series can support 8 sockets if the only mobos out there are 2- and 4-socket.

I also look at whatever benchmarks were out there to determine if claimed performance numbers are credible. This is where sometimes even enthusiast-targeted benchmark sites can be helpful, since there's often a close-enough (if not identical) desktop version of a server CPU out there to extrapolate from. Even SAS/SATA RAID cards get some attention, not in a configuration worthy of even "medium" data, but enough for validating marketing specs.


If you have two dozen nodes each with 6 TB of RAM + a few petabytes of HDD storage you're definitively going to need a big data solution.


A C/C++ program can go through that amount of data. You just have to use a simple divide and conquer strategy. It also depends on how the data is stored and the system architecture, but if you have any method that lets you break that data up into chunks, or even just ranges, then spin your program up for each chunk across each node or processing module that has quick access to the data you're processing. Then take those results and have one or more threads merge them, depending on the resulting data. It also depends on whether this is a one-off job or something you want to do continually.

Assuming the locality of data is not a big issue, it can be extremely fast. However, depending on the system architecture, reading from the drives can be a bottleneck. If your system has enough drives for enough parallel reads you will churn through data pretty quickly. Moreover, from my experience most systems or clusters with a few petabytes have enough drives that one can read quite a lot of data in parallel.

However, the worst is when the data references non-local pieces of data, so your processing thread has to fetch data from either a different node or data not in main memory to finish processing. This can be a pain, since it means either the task is just not really parallelizable, or the person who originally generated the data did not take into account that certain groups of data may be referenced together. Sadly, what happens here is that communication or read costs for each thread start to dominate the cost of the computation. If it's a common task and your data is fairly static it makes sense to start duplicating data to speed things up. Also, restructuring data can be quite helpful and pay off in the long run.
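A minimal single-node sketch of that chunk-and-merge pattern using standard tools (the file names, chunk size, and the process_chunk/merge_results programs are all placeholders for whatever your C/C++ workers actually are):

    # split the input into line-aligned chunks
    split -l 10000000 huge_input.csv chunk_
    # run one worker per chunk, 8 at a time
    ls chunk_* | xargs -P 8 -I{} sh -c './process_chunk "{}" > "{}.out"'
    # merge the partial results in a final pass
    cat chunk_*.out | ./merge_results > final_result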


Isn't all this just reinventing Hadoop and MapReduce in C++ instead of Java?


Research programmers using MPI have been dividing up problems for several decades now, but what they often don't get about Hadoop until they've spent serious time with it is that the real accomplishment in it is HDFS. Map-reduce is straightforward entirely because of HDFS.


Cloud, big data, "AI", I wonder what will be the next "me too, look at me" kind of buzzword for the corporate world...


"blockchain"


Ah yes, how did i forget...


I keep coming back to that article every time someone talks about big data :)


A lot of people are saying how they've worked on single-machine systems that performed far better than distributed alternatives. Yawn. So have I. So have thousands of others. It should almost be a prerequisite for working on those distributed systems, so that they can understand the real point of those systems. Sometimes it's about performance, and even then there's no "one size fits all" answer. Just as often it's about capacity. Seen any exabyte single machines on the market lately? Even more often that that, it's about redundancy and reliability. What happens when your single-machine wonder has a single hardware failure?

Sure, a lot of tyros are working on distributed systems because it's cool or because it enhances their resumes, but there are also a lot of professionals working on distributed systems because they're the only way to meet requirements. Cherry-picking examples to favor your own limited skill set doesn't seem like engineering to me.


See also the wonderful COST paper: https://www.usenix.org/conference/hotos15/workshop-program/p...

But the article is kind of wrong. It depends on your data size and problem - you can even use command-line tools with Hadoop Map/Reduce and the Streaming API, and Hadoop is still useful if you have a few terabytes of data that you can tackle with map and reduce algorithms; in that case multiple machines do help quite a lot.

Anything that fits on your local SSD/HDD probably does not need Hadoop... however, you can run the same Unix commands from the article just fine on a 20TB dataset with Hadoop.

Hadoop MapReduce/HDFS is a tool for a specific purpose not magic fairy dust. Google did build it for indexing and storing the web and probably not to calculate some big excel sheets...


For many, the incentive from the gig is:

a) resume enrichment by way of buzzword addition

b) huge budget grants and allocations, purportedly for lofty goals, while management is really unaware of real technology needs/options

Much has been said about this already; sharing the links again!

[1] https://news.ycombinator.com/item?id=14401399

[2] https://www.chrisstucchio.com/blog/2013/hadoop_hatred.html

[3] http://widgetsandshit.com/teddziuba/2010/10/taco-bell-progra...

[4] https://www.reddit.com/r/programming/comments/3z9rab/taco_be...

[5] https://www.mikecr.it/ramblings/taco-bell-programming/


Most people who think they have big data don't.


At an absolute minimum, I'd say "big data" begins when you can't buy hardware with that much memory.

Apparently, in 2017, that was somewhere around 16 terabytes. https://www.theregister.co.uk/2017/05/16/aws_ram_cram/ Heck, you can trivially get a 4TB instance from Amazon nowadays: https://aws.amazon.com/ec2/instance-types/

The biggest DBs I've worked on have been a few tens of billions of rows, and several hundreds of gigabytes. That's like... nothing. A laughable start. You can make trivially inefficient mistakes and for the most part e.g. MySQL will still work fine. Absolutely nowhere near "big data". And odds are pretty good you could just toss existing more-than-4TB-data through grep and then into your DB and still ignore "big data" problems.


15 years ago I was working at a utility and was re-architecting our meter-data solution as the state government was about to mandate a change from manually read meters to remotely read interval meters. Had to rescale the database from processing less than 2 million reads a year, to over 50 million reads _a day_ (for our roughly 1M meters). Needed to keep 16 months online, and several years worth in near-line storage. We went from a bog-standard Oracle database to Oracle Enterprise with partitions by month. This was being fed by an event-driven microservices[0] architecture of individually deployable Pro*C/C++ unix executables. The messaging was Oracle A/Q.

At the time, I thought "wow - this is big!", and it was for me.

[0] we didn't call it microservices back then though that's clearly what it was. The term I used was "event-driven, loosely coupled, transactional programs processing units of work comprised of a day's worth of meter readings per meter". Doesn't roll off the tongue quite so easily.


10 years ago I was working on a 50Tb data warehouse. Now I see people who think 50Gb is “big data” because Pandas on a laptop chokes on it.


This, a million times this!!! I see people doing consulting promoting the most over engineered solutions I have ever seen. Doing it for data that is a few hundred GB or maybe at best a TB.

It makes me want to cry, knowing we handled that with a single server and a relational database 10 years ago.

Let's also not forget that everyone today forgets that the majority of data actually has some sort of structure. There is no point in pretending that every piece of data is a BLOB or a JSON document.

I have given up on our industry ever becoming sane, I now fully expect each hype cycle to be pushed to the absolute maximum. Only to be replaced by the next buzzword cycle when the current starts failing to deliver on promises.


Yep. I left consulting after working on a project that was a Ferrari when the customer only needed a Honda. Our architect kept being like "it needs to be faster", and I'm like "our current rate would process the entire possible dataset (everyone on earth) in a night, do we really need a faster Apache Storm cluster?" :S

"Resume driven development" is a great term.


It's important to remember what drives this - employers often like to think their problems are 'big data' and by god, they need the over-engineered solution. Your peers who interview you will toss your resume in the trash if you are not buzzword compliant. Hate the game not the player.


> I have given up on our industry ever becoming sane, I now fully expect each hype cycle to be pushed to the absolute maximum. Only to be replaced by the next buzzword cycle when the current starts failing to deliver on promises

I sometimes wish I was unscrupulous enough to cash in on these trends, I’d be a millionaire now. Instead I’m just a sucker who tries to make solid engineering decisions on behalf of my employers and clients. It’s depressing to think what professionalism has cost me in cash terms. But you’ve got to be able to look at yourself in the mirror.


Yeah, or R. When I need to do stuff bigger than RAM on a laptop I just reach for sqlite (assuming it can even fit on disk).


The funny thing is we didn’t even consider our 50Tb to be particularly big, we knew guys running 200Tb...

BTW in R data.table is better for larger datasets than the default dataframe... or use the Revolution Analytics stuff on a server.


Oh yeah, the default dataframe is practically only for toy problems / very-already-filtered data (which accounts for a LOT of useful scenarios!). It's more that I've run into "big data" people who (╯°□°)╯︵ ┻━┻ when the default settings + dataframe choke, which is around 200MB of heap.


There's a big difference between a 50Tb data warehouse and a data warehouse which has a number of 10Tb tables. Our data warehouse used to have 20k tables and 18Tb data. Our big data instance has 4k tables and 800Tb data


> Now I see people who think 50Gb is “big data” because Pandas on a laptop chokes on it.

IIRC the term "big data" was coined to refer to data volumes that are too large for existing applications to process without having to rethink how the data and/or the application was deployed and organized. Thus, although the laptop reference is on point, large enough data volumes that make Pandas choke in a desktop environment do fit the definition of big data.


That definition might be a bit loose. If you have an existing application designed to run on a Raspberry Pi, it'll hit that limit pretty quickly.

This is obviously an over-exaggeration, but I don't think a dataset that breaks Pandas in a desktop environment even comes close to big data.


> large enough data volumes that make Pandas choke in a desktop environment do fit the definition of big data

What about too large for Excel, is that “big data” too?

(Neither are)


Sorta, depends on your workload and if you are a solo hero dev or if you are an enterprise requiring shared data resources.

For 16TB you'll definitely get benefits if you are doing something embarrassingly parallel and processor bound over a cluster of a handful of machines. It's just parallelism, and it's a good thing.

I totally agree about the hundreds of Gigs - that is unless you are in a setting where many teams need to access that database and join it with others, in which case a proper data lake implemented on something beefy is a good idea. Hadoop has the benefit of distribution and replication, but another data warehouse might work better - say Oracle or Teradata if you are a small shop.


I'd say the boundary is not on a 1-dimensional axis but more in a size x time x access-pattern space:

100Gb of data may not be that much to you, but it's another story if you have to process all of it every minute. Also, everything is fine if you have nicely structured (and thus indexed...) data, but I do have a few 100s of Gb of images as well...


I'm pretty sure you can buy IBM's newest mainframe, the z14, with 32TB of memory. I don't think that mainframes are that popular in trendy "big data" though.


But we have millions of rows!

-- my actual employer

(the actual problem is our SAN is getting us a grand total of 10 iops. No, that's not wrong, 10. The "what the christ, just buy us a system that actually works" momentum has been building for 2 years now but hasn't managed to take over. Hardware comes out of a different budget than wasted time, yo.)


So you have big data. Now you just need the big hardware /s


A cynical person might point out that for empire-building managers the goals are :

1) minimize hardware spend at all costs

2) maximize lost time (if spent under the manager's control)


I don’t know why these ‘you don’t have big data’ type articles bother me so much, but they really do. I know it isn’t saying NO ONE has big data, but I feel defensive anyway. Some of us DO work for companies that need to process millions of log lines a second. I know the article is not for me, but it still feels like a dismissal of our real, actual, big data problem.


These articles are not targeted at you indeed but to the thousands of companies trying to setup a massive architecture to process 3GB of data per year. All these big data solutions are still necessary of course, it's just not for everyone.


Agreed, they are tools designed to fill a niche. This doesn't make them bad, or unnecessary, just specialized. A spanner drive screw isn't worse than a hex drive, it's just designed for a different, more specific use case.

Really this just goes to show how impressive RDBMSs like Postgres are. There's nothing out there that's drastically better in the general case. So alternative database systems tend to just nibble around the edges.

My rule of thumb is always try to implement it using a relational model first, and only when that proves itself untenable, look into other more specialized tools.


Yeah but the whole cloud paradigm is predicated on big data, so large actors are pushing it where it makes no sense, and it makes everyone less effective. Not more.

Ever notice how with cloud it's actually in the interest of cloud providers to have the worst programmers, worst possible solutions for their clients ? Those will maximize spend on cloud, which is what these companies are going for.

Of course, they hold up the massive carrot of "you can get a job here if you ...".


For every programmer like you there's ten more at places where I've been employed trying to claim they need big data tools for a few million (or in some cases a few hundred thousand) rows in mysql. I get why you could feel attacked when this message is repeated so often, but apparently it still isn't repeated anywhere near enough.


"Relational databases don't scale".

I hope the NoSQL hype is over by now and people are back to choosing relational as the default choice. (The same people will probably be chasing blockchain solutions to everything by now...)


Why should relational be the default choice? There are many cases where people are storing non-relational data and a nosql database can be the right solution regardless of whether scaling is the constraint.

Most nosql SaaS platforms are significantly easier to consume than the average service oriented RDBMS platform. If all the DBMS is doing is handling simple CRUD transactions, there's a good chance that relational databases are overkill for the workload and could even be harmful to the delivery process.

The key is to take the time to truly understand each workload before just assuming that one's preferred data storage solution is the right way to go.


What's non relational data?

You can have minimally relational data such as URL : website, but that's still improved by going URL : ID, ID : website, because you can insert those IDs into the website data.

Now plenty of DBs have terrible designs, but I have yet to hear of actually non-relational data.


That's fair and I'll concede that my terminology is incorrect. I suppose I'm really considering data for which the benefits of normalization are outweighed by the benefits that are offered by database management systems that do not fall under the standard relational model (some of those benefits being lack of schema definition/enforcement* and the availability of fully managed SaaS database management systems).

I'm also approaching this from the perspective of someone who spends more time in the ops world than development. I won't argue that NoSQL would ever "outperform" in the realms of data science and theory, but I question whether a business is going to see more positive impact from finding the perfect normal form from their data or having more flexibility in the ability to deliver new features in their application.

* I'm fully aware that this can be as much of a curse as a blessing depending on the data and the architecture of the application, which reinforces understanding the data and the workload as a significant requirement.


Wait, why is that an improvement? If there is a 1 to 1 mapping of url to website, splitting it into two tables is just bad database design.


Because there are URLs in the website data and/or you want to do something with them. Looking up integers is also much faster than looking up strings. And as I said, you can replace URLs in the data with IDs, saving space.

But there are plenty of other ways to slice and dice that data; for example, a URL is really protocol, domain name, port, path, parameters, etc. So it's a question of how you want to use it.

PS: Using a flat table structure (ID, URL, Data) with indexes on URL and ID is really going to be 2 or 3 tables behind the scenes, depending on the type of indexes used.
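A rough sketch of that split, using SQLite since it appears elsewhere in this thread (the table and column names are invented):

    # two tables instead of one url -> body mapping; the ids are small integers
    sqlite3 demo.db "CREATE TABLE urls (id INTEGER PRIMARY KEY, url TEXT UNIQUE);
                     CREATE TABLE pages (url_id INTEGER REFERENCES urls(id), body TEXT);"
    # lookups and joins now compare integers instead of long URL strings
    sqlite3 demo.db "SELECT p.body FROM pages p JOIN urls u ON u.id = p.url_id WHERE u.url = 'https://example.com/';"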


> The key is to take the time to truly understand each workload before just assuming that one's preferred data storage solution is the right way to go.

Although that is true in principle, in reality that results in the messes I see around me where a small startup (but this often goes for larger corps too) has a plethora of tech running it fundamentally does not need. If your team’s expertise is Laravel with MySql then even if some project might be a slightly better fit for node/mongo (does that happen?), I would still go for what you know vs better fit as it will likely bite you later on. Unfortunately people go for more modern and (maybe) slightly better fit and it does bite them later on.

For most crud stuff you can just take an ORM and it will handle everything as easily as nosql anyway. If your delivery and deployment process have a rdbms, it will be natural anyway and likely easier than anything nosql unless it is something that is only a library and not a server.

Also, when in doubt, you should take an RDBMS imho, not, like a lot of people do, a NoSQL store. A modern RDBMS is far more likely to fit whatever you will be doing, even if it appears to fit NoSQL better at first. All modern DBs have document/JSON storage built in or added on (plugin or ORM): you probably do not have the workload that requires something scale-out like NoSQL promises. If you do, then maybe it is a good fit; however, if you are conflicted, it probably is not anyway.


> There are many cases where people are storing non-relational data

No, there are not. In 99% of applications, the data is able to be modeled relationally and a standard RDBMS is the best option. Non-relational data is the rare exception, not the rule.


Because it's tried and tested and has worked for decades.


Just curious what do you need to process so many log lines for? Is this an ELK type setup, and/or do you use the log processing for business logic?


I was working in a company where they brought in some consultants to do the analytics and they were going to use Hadoop, and they said straight up "We don't need it for the amount of data, but we prefer working in Hadoop"


It's far more common for your assumptions/thinking/design to be the root cause of any real problem in things you build than your choice of tech. It doesn't matter if you use the 'wrong' tech to solve a problem if the solution works effectively, and very often using something you know well means you'll solve the problem faster and better than you would with the 'right' tool. Those consultants were probably doing the right thing.


I actually don't think they had great expertise in Hadoop though, since it was quite new at the time. but maybe


I remember a Microsoft talk about running big data in Excel...


    cat *.pgn | grep "Result" | sort | uniq -c
This pipeline has a useless use of cat. Over time I've found cat to be kind of slow as compared to actually passing a filename to a command when I can. If you rewrite it to be:

    grep -h "Result" *.pgn | ...
It would be much faster. I found this when I was fiddling with my current log processor to analyze stats on my blog.


Yes, it does. Spoiler alert: I remove it at the end. ;)


Still I like the article. I just learned about useless use of cat the hard way. The biggest way I learned was

    cat /dev/zero | pv > /dev/null

    pv > /dev/null < /dev/zero
The second one is much faster.


   Argument list too long!


Then do:

find -name \*.pgn -print0 | xargs -0 grep -h "Result"

Each grep invocation will consume a maximum number of arguments, and xargs will invoke the minimum number of greps to process everything, with no "args too long" errors.


Increase your stack size then.

    ulimit -s 65536
https://unix.stackexchange.com/a/45584/115135


In my case it was 2015. I was struggling with a 28GB CSV file that I needed to cut, grabbing only 5 columns.

Tried Spark on my laptop: waste of time. After 4h I killed all the processes because it hadn't read 25% of the file yet.

Same for Hadoop, Python and pandas, and a shiny new tool from Google whose name I forgot a long time ago.

Finally I installed Cygwin on my laptop and 20 minutes later 'cut' gave me the results file I needed.
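For reference, that kind of column extraction is a one-liner (the column numbers and file names here are made up):

    # keep only columns 1, 4, 7, 9 and 12 of a comma-separated file
    # (note: cut does not understand quoted fields that contain commas)
    cut -d',' -f1,4,7,9,12 big.csv > five_columns.csv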


> I was struggling with a 28GB CSV file

Okay, sounds reasonable, that's larger than most machines' memory...

> Tried spark on my laptop

Yikes, how did we get here? Not to shame you or anything, but that's like two orders of magnitude smaller than the minimum you might consider reaching for cluster solutions...and on a laptop...I'm legitimately curious to hear the sequence of events that led you to pick up Spark for this.

In my opinion this isn't something you should be leaving the command line for. I'm partial to awk; if you needed to get, say, columns 3, 7, 6, 4 and 1 for every line (in that order):

    awk -F',' '{print $3"\t"$7"\t"$6"\t"$4"\t"$1}' file.csv
...where you're using tab as the delimiter for the output.

While I'm at it, this is my favorite awk command, because you can use it for fast deduplication without wasting cycles/memory on sorting:

    awk '!x[$1]++' file.csv
I actually don't know of anything as concise or as fast as that last command, and it's short enough to memorize. It's definitely not intuitive, however...


cut solved your problem. Let's talk about why.

cut is line oriented, like most Unix style filters. It needs to keep only one line in memory at most.

If you say:

    pd.read_csv(f)[[x,y,z]]
It has to read and parse the entire 28GB into memory (because it is not lazily evaluated; cf Julia).

If you actually need to operate on three columns in memory and discard the rest, you should:

    pd.read_csv(f, usecols=[x,y,z])
Then you get exactly what you need, and avoid swapping.

The lack of lazy evaluation does inhibit composition--just look at the myriad options in read_csv(), some of which are only there to enable eager evaluation to remain efficient.


While Pandas' CSV parser is quite slow anyway, the reason Pandas is particularly slow in this case is that it insists on applying type detection to every field of every row read. I have no clue how to disable it, but it's the default behaviour.

Parsing isn't actually a tough problem – https://github.com/dw/csvmonkey is a project of mine, it manages almost 2GB/sec throughput _per thread_ on a decade old Xeon


Or just use Python and iterate over your file using the csv lib to cater for strange cases, doing your computations along the way.


28GB is really too small for Hadoop to get out of bed for, in general. Though I would wonder why it was _that_ slow with Spark (or Hadoop, for that matter).


Spark is going to try and ingest all the data, and it won’t fit in RAM. Wrong tool for the job basically.


Depends on exactly how you do it, I suppose, but it shouldn't necessarily. Most Hadoop-y work can also be accomplished in Spark without much fuss.


Because it was running on a laptop instead of on a cluster :-)


Ehh... sounds like you might've over-eng'd something with python, tbh. I've dumped 50gb+ worth of CSV into a sqlite DB with something like 20 lines of code, and it only took about 30 seconds. (until I added some indexes)


Re: pandas, you could have done it streaming, à la the `chunksize` parameter.


So there are 2197188 games in that file.

Extracting only the lines with "[Result]" in them into a new file (using grep) takes about 3 seconds.

Importing that into a local Postgres database on my laptop takes about 1.5 seconds.

Then running a simple:

   select result, count(*)
   from results
   group by result;
Takes about 0.5 seconds.

So the total process took only 5 seconds (and now I can run many more aggregation queries on that).
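Roughly, the whole pipeline looks like this (the table name, the scratch database, and the exact grep pattern are my assumptions):

    # pull out just the [Result "..."] lines from the PGN files
    grep -h '\[Result' *.pgn > results.txt
    # load them into a scratch Postgres table, one line per row
    psql -c "CREATE TABLE results (result text);"
    psql -c "\copy results FROM 'results.txt'"
    # aggregate
    psql -c "SELECT result, count(*) FROM results GROUP BY result;"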


I feel that we have spent 30 years replicating Bash across multiple computer systems.

The further my clients move to the cloud, the more shell scripts they write, to the exclusion of other languages. And just like this, I have clients who have ripped out expensive enterprise data streaming tools and replaced them with bash.

The future of enterprise software is going to be a bloodbath.


I want to believe, but instead it will be javascript-driven blockchain at the command line


how else would you control your ML IoT cluster?


Anecdata: we used fast CSV tools and custom Node.js scripts to wrangle the import of ~500GB of geo polygon data into a large single-volume PostgreSQL+PostGIS host.

We generate SVG maps in pseudo-realtime from this data set: 2MB maps render sub-second over the web, which feels 'responsive'.

I only mention this as many marketing people will call 50 million rows or 1TB "Big Data" and therefore suggest big/expensive/complex solutions. Recent SSD hosts have pushed up the "Big Data" watermark, and offer superb performance for many data applications.

[ yes, I know you can't beat magnetic disks for storing large videos, but that's a less common use case ]


How do you partition your data and/or build your indexes?


two kinds of indexes, for different purposes ...

a) basically a preprocessed static inverted index on keywords [ using GIN / tsv_ etc ]

b) geo location index on GIS geometry field - relying on postGIS to be performant over geo queries

The data changes infrequently, so these are computed at time of import / regular update.

We do have a fair amount of ram set aside for postgres [ circa 20GB ]
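In Postgres terms that's roughly the following (the table and column names are placeholders):

    # full-text inverted index over the preprocessed keywords
    psql -c "CREATE INDEX idx_keywords ON places USING GIN (to_tsvector('english', keywords));"
    # spatial index on the PostGIS geometry column
    psql -c "CREATE INDEX idx_geom ON places USING GIST (geom);"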


Relevant article from 2013: 'Don't use Hadoop - your data isn't that big' [0] and the most recent HN discussion [1].

[0]: https://www.chrisstucchio.com/blog/2013/hadoop_hatred.html

[1]: https://news.ycombinator.com/item?id=14401399


Hi all, author here!

Many thanks for the feedback and comments. If you have any questions, I'm also happy to try to answer them.

I'm also working on a project, https://applybyapi.com which may be of interest to anyone here hiring developers and drowning in resumes.


I gave your service a quick whirl. Love the design language as a developer. What I don't like is that I have to provide my credit card information to get any detailed info (assuming there is such a thing behind the Stripe checkout).

The info available to non-signed-up people is simply not enough. It would be great if you had a demo account so one can get a feel for how this works.

E.g. I couldn't find info on why I have to add a job description. Wouldn't I post on a job board, with the posting linking to the ApplyByAPI test?

I could also not find any info on the tests candidates have to do. Yes, APIs, I get it. But an example wouldn't hurt.

Interesting idea!


Hi, and thank you for the great feedback!

We will certainly consider those points, and try to make things both more clear from a communication perspective and also consider adding some demo account (great idea!).

To answer the question about the job description, you're right that most customers so far have a post on a jobs board which they use to get traffic over to their ApplyByAPI posting. Once the candidate is there, they get a page with the job description and some API information they can use to generate their application.

We'll work on improving the language and information presentation to potential customers, and thank you again for taking the time to give your perspective!


You can get very good improvements over Spark too. I've been using GNU Parallel + redis + Cython workers to calculate distance pairs for a disambiguation problem. But then again, if it fits into a few X1 instances, it's not big data!


If you need to do something more complicated where SQL would be really handy, and, like here, you're not going to update your data, give MonetDB a try. Quite often when I'm about to break down and implore the Hadoop gods, I remember MonetDB, and most of the time it's sufficient to solve my problem.

MonetDB: a columnar DB for a single machine with a psql-like interface: https://www.monetdb.org/


Well yes in the edge case where you don't really have big data of course it will.

Where MapReduce, Hadoop, etcetera shine is when a single dataset (one of many you need to process) is bigger than the biggest single disk available - this changes with time.

Back when I did M/R for BT the data set sizes were smaller - still, having all of the UK's largest single PR1ME superminis running your code was damn cool - even though I used a 110 baud dial-up print terminal to control it.


I think much of this issue can be attributed to the 2 most underrated things:

1. Cache line misses.

2. The so-called definition of Big Data (if the data can easily fit into memory, then it's not Big, period!).

Many times I have seen simple awk/grep commands outperform Hadoop jobs. I personally feel it's a lot better to spin up larger instances, compute your jobs and shut them down than to bear the operational overhead of managing a Hadoop cluster.


Anchoring the search would allow the regex to terminate faster for non-matching lines:

    ... | mawk '/^\[Result' ...


Esp. if you use multipipe-enhanced coreutils, like dgsh. https://www2.dmst.aueb.gr/dds/sw/dgsh/


It would benefit people to actually understand algorithmic complexity of what they are doing before they go on these voyages to parallelize everything. It also helps to know what helps to parallelize and what doesn't.


Something, something, Joyent Manta: https://apidocs.joyent.com/manta/job-patterns.html


Shouldn't this be marked 2014? The article's date is January 18, 2014.


It’s probably even truer and more relevant now


Partially because of the bloated architecture of Hadoop, Kafka, etc. And of course Java. Implementing modern and lighter alternatives to those in C++, Go or Rust would be a step forward.


Weird. I'd always thought of Go as a slower Java with a shallower pool of good libraries and weak dev tools. Are the compiler and GC that good? Maybe I'll have to give it another try.


Depends on what you're targeting. For raw computation, Go's similar and sometimes noticeably faster, even after hotspot has its turn. For garbage collection pause time, Go's rather amazing, typically measuring less than a millisecond even for gigabytes of heap mem[1]. For bin startup time (e.g. for a CLI tool) try Go instead just because it'll be done way before the JVM even hands control to your code.

For dev tools, oh hell yes. Stick to Java. Go comes with some nice ones out-of-the-box which is always appreciated, but the ecosystem of stuff for Java is among the very best, and even the standard "basic" stuff vastly out-does Go's builtins.

[1]: https://twitter.com/brianhatfield/status/634166123605331968?... and there's also a blog/video? about these same charts and how it got there. pretty neat transformations.


I’m not saying don’t look at other languages, but don’t do so based on what was said. Java still runs better than Go. Hadoop et al. won’t necessarily run better rewritten in C++ or Rust. Distributed system performance is often dominated by setup and other negotiation between the components in the system. Java still has better, well-tested-in-production libs. Shiny is not better.


Anecdotally from my own use cases, I’ve found Go to be good for problems easily made parallel. This is especially true for things that commonly fall under “microservices.”

On the other hand, for more intensive work, I’ve found it doesn’t perform as well as JVM-based applications. A lot of this is either JRuby or data models in Java doing interesting things with data structures.

Finally, for sheer processing speed, we’ve started converting some of our components to Rust. This has a 3-4x speed advantage over Java and Go for similar implementations.

Again - all anecdotes and I’m not really in a position to deeply explain how it all works (proprietary, etc). Just my two cents.


No, it's because those things are for working with large amounts of data. So some overhead is acceptable, as it ends up being a tiny percentage of the overall time spent when you're dealing with actually large data.


Kafka has the opposite of a bloated architecture. It's actually a pretty simple architecture compared to a lot of frameworks.


People are suckers for complex, over-engineered things. They associate mastering complexity with intelligence. Simple things that just work are boring / not sexy.

I’ll be in my corner asking “do we really need this?” / “have you tested this?”



I've heard this so often, especially from my boss - whoever that happens to be. I've come to the conclusion that this is just a lame excuse because people feel out of control and overwhelmed when using this kind of software. It is heavily Java based, and to those who have never touched javac, the stack traces must look both intimidating and ridiculous.

On the other hand, processing data over several steps with a homegrown solution needs a lot of programming discipline and reasoning; otherwise your software turns into an unmaintainable and unreliable mess. In fact, this is the case where I work right now...


This reminds me of the early EJBs which were really complicated and most people who used them didn't really need them.


It can be colder at night than outside.


I closed the tab at "Since the data volume was only about 1.75GB". (And probably would have up until 500GB+)


500GB is still laptop SSD territory... unless you meant 500TB but that's still doable with a single 4U Backblaze Storage Pod with 60 x 10TB HDDs.


Just reading 500TB from 60 HDDs will take at least 12 hours, assuming a very optimistic constant 200MB/s throughput per HDD. If you need things to go faster, you won't be able to do that without resorting to distributed computing.
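Rough back-of-envelope check of that figure, under the same assumptions (60 drives at a constant 200 MB/s each):

    # sequential read time in hours for 500 TB across 60 drives at 200 MB/s
    echo '500*10^12 / (60 * 200*10^6) / 3600' | bc -l    # ≈ 11.6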


Ah yes this is so true!


The Hadoop solution may have taken more computer time, but computer time is far less valuable than a person's.

The command-line solution probably took that person a couple of hours or more to perfect.


If I do one thing this year, I need to learn more about these tried and tested command-line unix shell utilities. It's becoming increasingly obvious that so many complicated and overwrought things are being unleashed on the world because (statistically) nobody knows this stuff anymore.


This applies to much of the software world. Many of the places I've worked have a bad habit of ditching perfectly fine tech in order to chase the latest thing. Part of it is driven by RDD (resume driven development) and positioning the company so they can hire the latest crop of fad chasers. You end up with a bunch of specialists that are siloed into their particular tech. You got the Docker guy. You got the Redis guy. You may even have one of those gray-haired DBAs that still talk about this "Oracle" thing all day long, and no one knows what the hell he's going on about. You got the Ruby guys, but now you need a bunch of React or Angular guys. What to do with the Ruby guys? Can't just fire them yet. You have a dozen of 'em, and it took you years to hire them. Fine, just banish them to the backend somewhere. Give them some fidget spinners and a few pointless projects to keep them busy.


Let's not forget when you're interviewing: the new places expect the latest technologies. Don't have time to learn Spark, Flink, Kafka Streams, and recite the Scala spec whilst having intimate knowledge of all of the REST libraries? BAD DEVELOPER.


Hey now, Ruby ain't dead. I am one of these "Ruby" guys and, in the 48-or-so hours since my profile was activated at a new recruiting shop this week, I've gotten several potential candidate employers (some in seed stage) who want to get in touch.

The real issue is why do we silo ourselves so? It's fucking stupid. I do so much more than Ruby but conveying that to the new breed of tech people is just impossible. All they see is your biggest resume badge, and all they care about is how wide you will spread your legs for them.


This is exactly right. This is real life!! And it hurts.


I have a repo dedicated to cli text-processing..

short-descriptions and tons of examples.. I had fun and learned a lot doing this

link: https://github.com/learnbyexample/Command-line-text-processi...


This looks exactly like what I was looking for. Thank you.


you're welcome, happy learning :)


You just need to master a few commands:

find (take a look at the -exec option)

cut (or awk)

sed (for simple text/string substitutions)

xargs

dd (that beast can do charset translation from ASCII to EBCDIC for use on mainframes, and can also wipe disks)

And bash of course :)
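To give a hedged example of how a few of those compose in practice (the path, delimiter, and field number are all made up):

    # count the most common values of the 3rd space-separated field across a log tree
    find /var/log/app -name '*.log' -exec cat {} + \
      | cut -d' ' -f3 | sort | uniq -c | sort -rn | head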


One fun thing is that xargs supports parallelization.

    -P max-procs
    Run up to max-procs processes at a time; the default is 1.
    If max-procs is 0, xargs will run as many processes as
    possible at a time.  Use the -n option with -P; otherwise
    chances are that only one exec will be done.

    -n max-args
    Use at most max-args arguments per command line.
    Fewer than max-args arguments will be used if the size
    (see the -s option) is exceeded, unless the -x option is
    given, in which case xargs will exit.
So here is a "map/reduce" job which takes the log files in a directory and processes them in parallel on eight CPUs, then combines the results.

    find . -name "*.log" | sort | xargs -n 1 -P 8 agg_stats.py | sort | merge_periods.py


xargs is one of my favorites, it's so easy to split something apart to speed it up by a few multiples.

gotta SSH into 100 machines and ask them all a simple question? xargs will trivially speed that up by at least 10x, if not better, just by parallelizing the SSH handshakes.
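Something along these lines, for instance (hosts.txt and the remote command are placeholders):

    # run a quick command on every host, 20 SSH sessions at a time
    xargs -P 20 -I {} ssh -o BatchMode=yes {} uptime < hosts.txt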


And for cases where xargs parallelism isn't flexible enough, there is GNU parallel.


But that approach is potentially dangerous. See:

https://mywiki.wooledge.org/BashPitfalls#Non-atomic_writes_w...


"master a few commands"

"awk"

brb 15 years


Not really. awk is pretty simple, basically I just assume it's plain C with a huge regexp pattern 'switch' case at the top level, then I google again for the arguments to substr() :-)


Brb in 20 years then.


Might be true. The sed&awk from O'Reilly still lingers on my desk from time to time. And I bought it 20 years ago.


Some of these have come up for me recently, especially sed, awk, and xargs. Xargs in particular is really gnarly and opens up a lot of possibilities from the command line when used in conjunction with other tools. Love it!

Not a program, but another fun bash thing I've learned about recently is brace expansion: https://www.linuxjournal.com/content/bash-brace-expansion
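A couple of tiny examples of what brace expansion buys you (the names are arbitrary):

    $ echo file_{a,b,c}.txt
    file_a.txt file_b.txt file_c.txt
    $ mkdir -p project/{src,test,docs}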


'man' is probably the most indispensable.


I find 'apropos' pretty indispensable, because it lets you find the thing that does the thing you need, so you can read its man page.


That sounds interesting. Is it documented anywhere?


No, it's something neckbeards pass along to junior sysadmins, and it's hidden very deep in /dev/null.


awk is so beautiful. A good awk one-liner can beat fancy pandas all day long in some cases.
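The kind of one-liner meant here is something like a group-by-and-sum over a CSV; the column positions and file name below are assumptions:

    # sum column 2 grouped by the key in column 1
    awk -F, '{sum[$1] += $2} END {for (k in sum) print k, sum[k]}' data.csv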


I hate how awk gets so much praise for one liners when it's so good at building things like state machines too: https://two-wrongs.com/awk-state-machine-parser-pattern.html

I've got some similar ones for work that take data csv like data and output sql. A little bit of vim-foo on the csv (yy10000p) and I've got all the test data I need.
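The CSV-to-SQL direction is roughly this shape (table and column names are invented, and nothing is escaped beyond quoting):

    # turn "id,name" rows into INSERT statements
    awk -F, -v q="'" '{printf "INSERT INTO users (id, name) VALUES (%s, %s%s%s);\n", $1, q, $2, q}' data.csv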


If anything, awk should get some praise for its flexibility to go from one-liners to Doom-style 3D games[1]!

[1] https://github.com/TheMozg/awk-raycaster


An example here would be nice.


Don't forget grep!


Also, 'parallel' if you've got any idle cores on your infrastructure.


And perl -pe or -ne ! :)


^ the moment you do anything even slightly non-trivial with sed, abandon it and use perl instead. for even simple regexes with a couple non-"match this char" segments I regularly see almost 100x speedup, it's rather crazy. it's also a LOT more portable in my experience because the regex engine is so much more capable and consistent, even across fairly major perl versions.
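For example, the sort of substitution that is painful to write portably in sed but trivial in perl (the file name is a placeholder):

    # reorder ISO dates to DD/MM/YYYY
    perl -pe 's{(\d{4})-(\d{2})-(\d{2})}{$3/$2/$1}g' dates.txt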


Honorable mention to jq. Also get comfy with Unix pipes and streams.

e.g. testing in CI a JSON file is properly formatted:

  jq . < "$f" | diff -u "$f" -


Personally, I'm not too interested in any commands that take string arguments in their own homegrown programming language[0]. I'd rather use something like this[1] than jq, to get some extra ROI out of one of the languages I already know.

[0]: https://stedolan.github.io/jq/manual/

[1]: https://github.com/antonmedv/fx


Add socat into the mix if you do network-related things! It’s amazing what you can do when you replace bits and pieces of a pipe chain with network endpoints.
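For instance, shipping the tail end of a pipeline to a remote listener instead of a local file (host and port are placeholders):

    # send matching lines over TCP instead of writing them locally
    grep ERROR app.log | socat - TCP:collector.example.com:9999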


I made a list of a good set of commands for a full stack engineer:

https://github.com/nemild/cli_for_full_stack


This is awesome! thanks.


Companies aren't making these complicated and overwrought things because nobody knows this stuff anymore. The ones actually doing the making generally know all the tried and tested ways to do what they want, because they used them to scale. Then they eventually outgrow the existing solutions and create these complicated systems to account for whatever specific technical or business issues necessitated hopping from existing solutions to building their own.

The problem tends to come along with the visibility that comes when one of these get released publicly. People use these things without doing the same level of due diligence to see if something less complicated fits their needs. Either deliberately (resume driven development) or simply because it doesn't occur to them to do so and "if it's good enough for this unicorn it's good enough for me". Or in some cases they do due diligence, but incorrectly evaluate their needs and rule out simpler solutions.

But the most common reason I've run into is simply that it's more fun to play with these projects than it is to use crufty old stuff, no matter how tried and tested and potentially appropriate it is.

At a previous company, I had a python script running on a cron job every 5 minutes to do some data processing needs. Once or twice a month, the batch would be so large that it took 6-7 minutes to complete and the cron job would trigger the script again before it finished, causing the second instance of the script to see the lock file, log an error, and exit. It didn't cause any problem for these periodically skipped runs, because the business need only required data to be processed within 24 hours of coming in. The 5 minute cron job was just to even out resource usage throughout the day instead of doing a larger nightly batch job. A piece of data not getting processed for 8 minutes instead of 5 did not make any material impact.

Another team had noticed the errors popping up in the log and were in the process of testing out a whole bunch of real time data pipelines like Kafka, leveraging my error messages to justify the need for a "real time" system without ever even asking me about the errors. After I found out other people were noticing those superfluous errors, I moved the cron job to a 30 minute window to stop them from happening anymore. Turns out they didn't have any other justification for their data pipeline greenfield project and weren't happy to go back to their normal day to day work. I offered to let them maintain and expand my python scripts if they were interested in data processing work, but for some reason they never took me up on that offer. :(


> more fun to play with

I think this is a dangerous, irritating mythology that programmers permit to their detriment. Skipping work entirely to go play pool or watch a movie is "fun". Evaluating a new technology stack to see if it fits business needs (present or future) might be _intellectually stimulating_, but deriding it as "fun" - and allowing management to write it off as time-wasting - hurts everybody. This isn't "fun", it's research, just the same as particle physics experiments are, and it's a big part of what we went to college to learn to do effectively.


> Companies aren't making these complicated and overwrought things because nobody knows this stuff anymore. The ones actually doing the making generally know all the tried and tested ways to do what they want, because they used them to scale

Right, Yahoo made Hadoop because (probably) they needed it. 99% of companies... just don’t.


Read The AWK Programming Language by Alfred Aho, Peter Weinberger, and Brian Kernighan

"This is an absolute classic which although ancient in Computing years is an absolute gem full of relevance yet."


You can write many of these yourself. Just reading records and sending them across pipes in ways you can compose is a good start. I find PyPy is excellent for this kind of work and fast to write; if the pipeline runs for more than 10 seconds, there is plenty of time for the JIT to warm up and amortize the cost. If you don't need fancy libs, LuaJIT is amazing and great for doing transforms in a pipeline.

I rewrote a Hadoop job from Java to a native program using Streaming (pipes, the basic Unix model). Just that switch was 10x faster, mostly because of Hadoop overheads.
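That map/sort/reduce shape is the same one you can run locally with nothing but pipes; the script names here are placeholders:

    # the streaming shape of a map/reduce job, on one machine
    cat input.tsv | ./map.py | sort | ./reduce.py > output.tsv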


They try to make the cloud relevant for a lot of things that could easily be accomplished locally... Most of the time, they do that just to get your data.


Vendor lock-in is what they're after in the cloud case. Every time we talk to our cloud provider reps they ask us which of their SaaS offerings we're using. Their heads spin when we tell them we're just using VMs and ask us why.


I started with Unix Shell Programming http://a.co/au9OUrB


So much internal tooling at a company can be avoided if you can just get people comfortable with basic *nix programs. No, we don’t need a web interface to show that data. No, we don’t need to code a dropdown to sort it differently.


And yet the UX people, particularly around Linux, want to bury the *nix CLI so damned deep...


CLIs have a discoverability problem. Knowing how to use any given one is a skill that takes years to hone. Learning a single command can take up to an hour. Most webapps are more constrained, but a common operation can usually be accomplished by someone who hasn't seen it before in less than 3 minutes.


The 'apropos' command is very handy on machine or OS you aren't familiar with.

If you want to find out what tools are available on the CLI that operate on jpegs, you'd try 'apropos jpeg' and get a list of things that mention jpeg in their man page.


Can't say I blame them. The base concept is sound but the implementation is archaic and obtuse as hell.


Once, I saw a presentation of Unicage, a big data solution that had (perhaps still has) a free version https://youtu.be/h_C5GBblkH8?t=2566 It seems that it has evolved to a company now: http://www.unicage.com/products.html

Did anyone try the unicage solution?



