
We Don't Need No Stinkin' Databases - ingve
https://btorpey.github.io/blog/2017/05/10/join/
======
mosselman
Title is a bit clickbait-y.

Better: "With unix's join command you can join files somewhat similar to
database joins"
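
For anyone who hasn't used it, here is a minimal sketch of what that looks like. The file names and contents are made up for illustration; `join` matches on the first field by default and expects both inputs to already be sorted on it.

```shell
# Hypothetical sample data, already sorted on the first (join) field.
workdir=$(mktemp -d)
printf '1 alice\n2 bob\n3 carol\n' > "$workdir/users.txt"
printf '1 login\n3 logout\n4 login\n' > "$workdir/events.txt"

# Inner join: only keys present in both files survive.
inner=$(join "$workdir/users.txt" "$workdir/events.txt")
printf '%s\n' "$inner"
# 1 alice login
# 3 carol logout

# -a 1 also prints unpaired lines from the first file,
# which behaves roughly like a left outer join.
left=$(join -a 1 "$workdir/users.txt" "$workdir/events.txt")
printf '%s\n' "$left"
# 1 alice login
# 2 bob
# 3 carol logout
```

Key 4 appears only in the second file, so it shows up in neither result.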

~~~
jenkstom
Clickbait-y or not, it sums up the feelings of a lot of developers. I don't
understand it myself, but it seems a lot of developers want to reinvent the
database to create something "better" or just find some way to never have to
use SQL.

~~~
j45
Using SQL and being able to think both relationally and non-relationally are
valuable skills.

Working with text based datasets is a skill I don't see enough of even though
it is quite prevalent. (I have spent more time than I've wanted using tools
like Monarch to make text databases relational through an automated workflow)

The reality of development is working with an existing dataset, or several of
them, homogeneously. Cool use of the join command; a title like "Join text
files using the command line" would probably receive a positive response here.

It'd be sweet if everything was a nice and clean API data call with a bow on
it, but part of being a developer is enabling that where it needs to be.

~~~
wallstprog
Like I said above, the title is a bit of a joke, and a tease.

If you read my blog, you'll see that I try to keep things light. (No cat
pictures, though -- at least not yet ;-)

Having said that, the description in the header is "Data manipulation with
plain text files". Maybe I'll see if I can find a way to show that on GitHub
...

~~~
j45
Oh, I read the post, I assure ya. My beef is the binary link baityness of
folks being one way or the other.

------
combatentropy
Reminds me of:

\- Command-line tools can be faster than your Hadoop cluster
[[https://news.ycombinator.com/item?id=8908462](https://news.ycombinator.com/item?id=8908462)]

\- Going Deep
[[https://news.ycombinator.com/item?id=8902739](https://news.ycombinator.com/item?id=8902739)]

~~~
wallstprog
Interesting comment -- thanks for that.

It also reminds me of one of my favorite aphorisms: "Perfect is the enemy of
good enough".

What some other commenters appear to have missed is the use case I'm
discussing here: scraping log files to generate text files to feed to gnuplot.

And while the approach presented here may not work at Google scale, it works
just fine for my situation. It takes me about 10 minutes on my MacBook to
churn through around 50GB of logs to produce a few dozen charts. It took about
a week to put the scripts together, and now anybody on my team with a shell
prompt can do the same thing (including the QA folks).

Sometimes you need to dot all the i's and cross all the t's, but sometimes
it's just about getting the job done quickly, with minimal dependencies.

------
CaptainZapp
Good luck with getting rid of your steenking database, when it comes to the
real world.

Like figuring out a reasonable access path between tables when you're dealing
with billions of rows

Like guaranteeing consistency between your relations. While the join command
sounds useful (I didn't know it and I'll look into it) it's not really a
replacement for all the constraints, which a database provides to ensure that
you don't fubar your data

Backup, recovery? Just hope that you really grab everything. Not only the
data, but also your carefully crafted queries. Sounds like a bummer to figure
them out anew

Implementing ACID also sounds like quite a challenge

A database is not just a shipping container that you cram your data into and
that provides a few tools to organize it. There's a lot more under the hood,
which is just not supported in plain old files.

~~~
haney
You're assuming quite a lot about the use case. I personally don't have a
problem that's solved by this particular utility, but it's not uncommon for
our analytics team to use pandas to join CSV files on their way into our data
warehouse (or to otherwise reshape small datasets). I could definitely see the
utility in plain text relational tools.
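
As a rough illustration of that kind of CSV join done with plain text tools, here is a sketch under some assumptions: the file names and rows are invented, there are no header rows, and no field contains a quoted comma (naive comma splitting would break on those).

```shell
# Hypothetical CSV inputs, deliberately unsorted.
export LC_ALL=C   # make sort and join agree on collation order
workdir=$(mktemp -d)
printf 'c2,Bob\nc1,Alice\nc3,Carol\n' > "$workdir/customers.csv"
printf 'c3,widget\nc1,gadget\nc1,gizmo\n' > "$workdir/orders.csv"

# join needs both inputs sorted on the key, so sort on field 1 first.
sort -t, -k1,1 "$workdir/customers.csv" > "$workdir/customers.sorted"
sort -t, -k1,1 "$workdir/orders.csv" > "$workdir/orders.sorted"

# -t, sets the field separator for both input and output.
merged=$(join -t, "$workdir/customers.sorted" "$workdir/orders.sorted")
printf '%s\n' "$merged"
```

A customer with several orders produces one output row per order, and a customer with none (c2 here) is dropped, as in an inner join.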

------
marvel_boy
Also available on Mac. The join command conforms to IEEE Std 1003.1-2001
("POSIX.1").

~~~
wallstprog
Thanks for pointing that out! I should have mentioned that in the article, and
in fact I do much of this type of work on my MacBook.

Hmmm...that might be a good topic for another article.

In the meantime, here are a few packages you can install using Homebrew (brew
install <package>) that help bridge the gap between Mac and Linux for other
tools:

bash binutils coreutils gawk gnu-sed gnuplot
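
One thing to watch for: by default Homebrew's coreutils installs the GNU tools with a `g` prefix (GNU join becomes `gjoin`) and leaves the system versions in place. A small sketch of how a script might pick whichever is available; the `JOIN` variable name is just for illustration.

```shell
# One-time setup on a Mac (package list from the comment above):
#   brew install bash binutils coreutils gawk gnu-sed gnuplot

# Prefer GNU join (gjoin under Homebrew's coreutils), else the system join.
if command -v gjoin >/dev/null 2>&1; then
    JOIN=gjoin
else
    JOIN=join
fi
echo "using: $JOIN"
```

Scripts can then call `"$JOIN"` instead of hard-coding either name.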

------
kevrone
I had no idea this tool existed. Thanks! Very cool!

~~~
wallstprog
Glad you liked it! Like I said in the article, I continue to be amazed at the
things you can do with a simple bash prompt.

~~~
avmich
[http://www.leancrew.com/all-this/2011/12/more-shell-less-egg...](http://www.leancrew.com/all-this/2011/12/more-shell-less-egg/)

> Knuth wrote his program in WEB, a literate programming system of his own
> devising that used Pascal as its programming language. His program used a
> clever, purpose-built data structure for keeping track of the words and
> frequency counts; and the article interleaved with it presented the program
> lucidly. McIlroy’s review started with an appreciation of Knuth’s
> presentation and the literate programming technique in general. He discussed
> the cleverness of the data structure and Knuth’s implementation, pointed out
> a bug or two, and made suggestions as to how the article could be improved.

> And then he calmly and clearly eviscerated the very foundation of Knuth’s
> program.
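
The excerpt stops before the punchline. As it is usually reproduced, McIlroy's alternative was a six-stage shell pipeline along these lines; the `wordfreq` function wrapper and the sample sentence are added here for illustration and are not part of the quote.

```shell
# Print the N most frequent words on stdin, most frequent first.
wordfreq() {
    tr -cs 'A-Za-z' '\n' |   # turn every run of non-letters into one newline
    tr 'A-Z' 'a-z' |         # fold to lower case
    sort |                   # bring identical words together
    uniq -c |                # collapse duplicates, prefixing a count
    sort -rn |               # order by count, descending
    sed "${1}q"              # stop after the first N lines
}

# Hypothetical input; awk only normalizes the padding that uniq -c adds.
top=$(printf 'the cat and the dog and the bird\n' | wordfreq 2 | awk '{print $1, $2}')
printf '%s\n' "$top"
# 3 the
# 2 and
```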

