* How do you deal with words like "---" in your text that look like the match separator of grep?
* What if your filenames contain spaces?
* Are the sed/awk/perl one liners really all that readable and correct?
* How to catch and report failure conditions ... in pipe steps?
This stuff is great for interactive use and one-off ETL, not for applications.
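For the filename and failure-reporting questions above, here is a minimal Python sketch of one way to handle them. The names `run_steps` and `StepFailed` are made up for illustration. It assumes each step fits in memory, and it runs the steps one after another instead of streaming them concurrently like a real shell pipe.

```python
import subprocess
import sys


class StepFailed(RuntimeError):
    """Raised when one step of the chain exits with a non-zero status."""


def run_steps(steps, data=b""):
    """Feed `data` through each command in `steps`, in order.

    Each step is an argv list, so arguments such as filenames with
    spaces are passed through verbatim, with no shell word splitting.
    """
    for index, argv in enumerate(steps):
        result = subprocess.run(argv, input=data, capture_output=True)
        if result.returncode != 0:
            detail = result.stderr.decode(errors="replace").strip()
            raise StepFailed(
                f"step {index} {argv!r} exited with "
                f"{result.returncode}: {detail}"
            )
        data = result.stdout  # output of this step is input to the next
    return data
```

A call would look like `run_steps([["sort", "my file.txt"], ["uniq", "-c"]])`. Note that some tools use a non-zero exit status for non-errors (grep exits with 1 when nothing matches), so a real version would need per-step rules for what counts as failure.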
Not sure what real alternatives are that give you:
- parallel execution
- seamless composition (like |)
- object passing, not byte streams
- quick to write
Most of the time I switch to Python for this, but it does not give you sane parallelism.
Sure you can do this with Java + Akka, but this takes days to build out...
Dask is super easy and quick to learn. It provides similar features to Spark but can be somewhat easier for the Pandas crowd. There's also Modin/Ray for this, but I haven't tried it yet.
For very fast processing and ease of writing, SparkSQL is the tool I reach for. Start a single-node Spark instance (super easy), then interactively wrangle your data declaratively with SQL. Great for quick and dirty cleaning and aggregation of big-ish data.
If you're into Google Cloud, BigQuery is currently my top tool for quick and dirty processing, but you can do a lot more for your $5/TB with a giant high-memory Compute Engine instance and Dask or SparkSQL.
Everybody I show it to likes it even more than working with data frames once they grok it.
For this one you can use "--" to signal the end of options, so that everything after it is treated as an argument rather than a flag.
$ grep -rn -- -
However, intermixing text with separators is not trivial. There are reasons we use JSON/XML for exchanging structured data.
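To make the structured-data point concrete, here is a small Python sketch that writes one JSON document per line. The helper names and the sample records are hypothetical. Because the JSON encoder escapes newlines inside strings, text containing "--" or line breaks cannot be mistaken for a record boundary.

```python
import json


def encode_records(records):
    """Serialize each record as exactly one line of JSON."""
    return "".join(json.dumps(record) + "\n" for record in records)


def decode_records(text):
    """Parse the output of encode_records back into a list of records."""
    return [json.loads(line) for line in text.split("\n") if line]
```

The receiving side only ever splits on literal newlines, so the content of the text never needs to be inspected for separators.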
1. Store data locally for offline retrieval.
2. Support indexing big sites including Stack Overflow, Wikipedia, Reddit, news.ycombinator.com, microsoft.com docs, and a bunch of other domains.
3. Make it easy to add a single URL to the index from the command line and, optionally, a browser plugin. Index only that page; this would replace my bookmarks.
4. Optionally auto-store browser history for a configurable period of time and purge it when it expires.
Does anything like that exist?
Code is open and there are a ton of already created data dumps + indexes. You don't have to spend time rebuilding a Wikipedia/Wikidata/Stack Overflow dump and index by yourself.
But once I was done, I realized that I had it backwards. Therefore I wrote a PHP page to grep the list, and figure out in which box a specific item was.
Even if you can't ship a bash script to production, these are great tools for ad-hoc exploration and validation.
So this kind of thing should be quite doable in a short Ruby script, or a few short scripts, albeit written in "shell" style, with e.g. -n or -p (both wrap the code in "while gets ... end"; -p also prints $_ after each iteration), probably along with -a (automatically split each line into $F).
It's in some senses an entirely different dialect of Ruby, though.
Some examples here:
It even has it as a feature!
CREATE VIRTUAL TABLE something USING fts5(x, y, z)
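Assuming the statement above refers to SQLite's FTS5 full-text extension, here is a rough Python sketch of using it through the standard `sqlite3` module. The helper names and sample rows are invented for illustration, and it only works if the SQLite library your Python links against was built with FTS5 enabled.

```python
import sqlite3


def build_index(rows):
    """Create an in-memory FTS5 table and load (x, y, z) rows into it."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE VIRTUAL TABLE something USING fts5(x, y, z)")
    conn.executemany(
        "INSERT INTO something (x, y, z) VALUES (?, ?, ?)", rows
    )
    return conn


def search(conn, query):
    """Return rows matching the full-text query, best match first."""
    cursor = conn.execute(
        "SELECT x, y, z FROM something WHERE something MATCH ? ORDER BY rank",
        (query,),
    )
    return cursor.fetchall()
```

For example, after `conn = build_index([("box 1", "hdmi cable", "garage")])`, calling `search(conn, "cable")` looks the word up in all three columns.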