'sort' and 'uniq' should also be near the top of the list. And once you're doing more on the command-line, 'join' and 'comm' can help you merge data from multiple files.
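For example (file names here are just placeholders):

    # count duplicate lines, most frequent first
    sort access.log | uniq -c | sort -rn | head
    # merge two files on a shared first-column key (process substitution needs bash/zsh)
    join <(sort -k1,1 users.txt) <(sort -k1,1 orders.txt)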
I'm guessing regexes are beyond a Data Scientist [0], but throw sed and vim into the mix and there are very few one-off problems manageable on a single CPU that you can't handle, and what's more handle more efficiently than with any other tool chain. The overhead of throwing it into a SQL database or whatever is so big that these simple tools simply blow them away if you are doing it just once.
[0] I'm guessing a "Data Scientist" is someone who knows a lot about the data and the scientific domain that created it, and to whom a computer is just another hammer you hit the data with. A hammer that someone deliberately made insanely and unnecessarily complex for job security, or something.
I can't tell you how many times a combination of sort and join, with a bit of awk, has saved my bacon. Seems to be a rather rare skill to have among the various Unix admins I've worked with in the past.
One thing to note: set LANG=C before doing operations with sort and join. I'm not sure if it is a bug, or if it is in all versions, but if you have, for example, LANG=en_US.utf8, then sort will use one order ("_" comes before "-") while join uses ASCII order. Note, you don't have to export LANG=C; just put it in front of the command you are launching to apply it to that one command only.
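For example, sorting and joining two files on their first field with consistent collation (file names are placeholders):

    LANG=C sort -t, -k1,1 a.csv > a.sorted
    LANG=C sort -t, -k1,1 b.csv > b.sorted
    LANG=C join -t, a.sorted b.sorted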
However, in terms of data science, don't most CSV files contain a header? Most of mine do, and the example 'data.csv' has the header "var_x".
Using "shuf" means there's a very high chance that the sampled CSV file either won't include the header line, or will have it somewhere other than the first line.
Doesn't that mean that most data scientists will rarely use 'shuf' for sampling from CSV files?
FWIW that's "xsv sample", and it handles headers by default (with an option for headerless CSV):
Randomly samples CSV data uniformly using memory proportional to the size of
the sample.
When an index is present, this command will use random indexing if the sample
size is less than 10% of the total number of records. This allows for efficient
sampling such that the entire CSV file is not parsed.
This command is intended to provide a means to sample from a CSV data set that
is too big to fit into memory (for example, for use with commands like 'xsv
frequency' or 'xsv stats'). It will however visit every CSV record exactly
once, which is necessary to provide a uniform random sample. If you wish to
limit the number of records visited, use the 'xsv slice' command to pipe into
'xsv sample'.
Usage:
    xsv sample [options] <sample-size> [<input>]
    xsv sample --help

Common options:
    -h, --help             Display this message
    -o, --output <file>    Write output to <file> instead of stdout.
    -n, --no-headers       When set, the first row will be considered as part
                           of the population to sample from. (When not set,
                           the first row is the header row and will always
                           appear in the output.)
    -d, --delimiter <arg>  The field delimiter for reading CSV data.
                           Must be a single character. (default: ,)
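A typical invocation, which keeps the header row in the output (file names are placeholders):

    xsv sample -o sample.csv 1000 data.csv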
There is no standard for CSV per se, although experience suggests most files have a header. Use head/tail as needed to preserve or skip it in pipelines.
The GNU utils are usually the way to go; the Mac/Darwin/BSD variants have some weird quirks and usually aren't worth the bother to fight with. awk and sort are notably deficient.
Ease off the "data science" a tad--this is just scripting, a perfectly honorable pastime! Why else would GNU coreutils exist in the first place?
There is a CSV standard, RFC 4180. There are also many variants. I have used products which specifically say that they follow that RFC.
It's awkward to preserve a header using shuf, even with head/tail. Here's what I came up with:
head -1 x.csv && (awk 'NR>1' x.csv | shuf -n 5)
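A variant without the awk step that does the same thing (assuming a regular file rather than a pipe, since it reads the file twice):

    (head -1 x.csv; tail -n +2 x.csv | shuf -n 5)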
My point is that "shuf" is GNU specific, while the author said it was a Unix tool like the other ones. Linux != Unix. Also, there are no Mac/Darwin/BSD variants of shuf.
"Ease off"? Why? The term "data scientist" appear in the title and each of the first three paragraphs, and the comment I replied to was "I would be a little bit shocked if any of the data scientists at my day job didn't know all seven of these". I want to know why data scientists who work with CSV files should be using shuf and not some more appropriate tool like csvkit.
That is, the entire topic is about scripting as applied to data science, not "just scripting".
RFC 4180 is really a well-intentioned attempt to document popular ways CSV is used in the wild, but it does not set a standard. There is no such thing as CSV well-formedness in the way that exists for XML. Can a CSV field value span lines or not? How are linebreaks encoded within a field? Can backslash escape quote (or other) characters, or is the double-doublequote approach contemplated in the RFC proper? How is one to know definitively if the first line is a header or a data row? What about comments?
My beef with "data science" is that this is not science--it's even less science than political science--it's munging.
Shrug. Sure, yes, I pointed out there are many variants. But you said there wasn't a standard when there is. You meant to say that there is no widely used consistent CSV format, I think, which is a bit different.
RFC 4180 says a CSV field can span multiple lines. Section 2.6:
> Fields containing line breaks (CRLF), double quotes, and commas should be enclosed in double-quotes.
You asked "Can backslash escape quote (or other) characters". 4180 says "no", and that double-double quoting is used. The grammar in 2.7 is:
The RFC says that the MIME type should use header=present or header=absent to indicate if the first line is a header. "Implementors choosing not to use this parameter must make their own decisions as to whether the header line is present or absent."
I have never seen a CSV file with comments.
Note also that the Python csv module, which supports a large number of CSV dialects, also does not support comments.
While you have a beef, that doesn't change that you wanted to drag me off topic (from "data science" to "just scripting"), when my point was that that example wasn't a good data science example in the first place. It also isn't a good scripting example.
One command I didn't see mentioned that was surprising was cut[1], especially the -d flag. (Maybe it's less necessary if you know awk?)
It's really great for cleaning up the output of poorly structured, wordy scripts. It's also a quick and dirty way to generate CSVs with the --output-delimiter flag.
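For instance (fields and paths are just placeholders; --output-delimiter is GNU cut):

    cut -d: -f1,7 /etc/passwd                        # user name and login shell
    cut -d: -f1,7 --output-delimiter=, /etc/passwd   # same fields, emitted as CSV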
However they did include ‘cat’ and only mentioned its most basic common use rather than its primary reason for being.
Ultimately, guides like the one submitted are just one person’s braindump of stuff they think might be helpful for others. It’s probably best not to overthink why something was added or omitted.
Feeding that TSV to vd (VisiData) makes it an interactive (if you think ncurses is interactive) spreadsheet with plotting and pivot tables and mouse support :)
You can also save your keypresses in vd to a file and then re-run them at a later stage – I've got some scripts to re-run an analysis and then run vd on it and immediately set all columns to floats and open the frequency table so I can see if I managed to lower the median this time.
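If I remember the flags right (double-check against your VisiData version; file names here are placeholders), a batch replay looks something like:

    vd -b -p analysis.vd -o results.csv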
If you have a lot of files that may be processed by a `find` command and speed is important, it’s worth knowing about the plus-sign variation of the `-exec` expression. The command in the original article uses the semicolon-terminated form; a sketch of both forms follows below.
The difference is that the first version (where the `-exec` expression is terminated with a semi-colon) forks a new process to run the `grep` command for each individual file “found” by the preceding expressions. So, if there were 50 such `setup.py` files, the `grep` command would be invoked 50 times. Sometimes this is desired behaviour, but in this case `grep` can accept multiple pathnames as arguments.
With the second version (expression is terminated with a plus-sign), the pathnames of the files are collected into sets so that the `grep` command is only called once for each set (similar to how the `xargs` utility works to avoid exceeding the limits on the number of arguments that can be passed to a command). This is much more efficient because only 1 `grep` child process is forked – instead of 50.
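A sketch of the two forms (the search pattern here is a placeholder, not the article's exact command):

    # semicolon: one grep process per setup.py found
    find . -name 'setup.py' -exec grep -H 'install_requires' {} \;
    # plus-sign: pathnames are batched, so grep runs only a handful of times
    find . -name 'setup.py' -exec grep -H 'install_requires' {} +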
This functionality was added to the POSIX specification [1] a number of years ago and I’ve been using it for at least 10 years on GNU/Linux systems. I imagine it should be available on other Unix-like environments (including BSD [2]) that data scientists are likely to be using – though the last time I had to work on a friend’s Mac the installed versions of the BSD utilities were quite old.
> Prints on the screen (or to the standard output) the contents of files. Simple like that.
While it's not exactly false, it's also not a good explanation for cat. If you just want to operate on the contents of a single file, you should use redirection. The cat utility is for concatenating files.
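For example (file names are placeholders):

    grep -c ERROR < app.log               # a single file: redirection is enough
    cat jan.log feb.log mar.log > q1.log  # cat doing its actual job: concatenation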
tldr: grep, cat, find, head / tail, wc, awk, shuf with bonuses of xargs, and man.
I've never needed shuf, and awk is a bit out of place in the list, but head and tail have saved me from many a large file. The interesting data is usually in head, tail, or grep anyway.
awk '{print $2,$4;}' is useful and easy to remember. $NF refers to the last field. FS is the variable to override the default field separator of white space. Here's an example of its use. OFS is the output field separator.
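For instance, something along these lines (using /etc/passwd purely as an illustration, with ':' as the separator):

    # print each user name and its number of fields, comma-separated
    awk 'BEGIN { FS = ":"; OFS = "," } { print $1, NF }' /etc/passwd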
> awk '{print $2,$4;}' is useful and easy to remember. $NF refers to the last field. FS is the variable to override the default field separator of white space. Here's an example of its use. OFS is the output field separator.
Yeah, also cut is just inconvenient; awk works much better when working with the usual tabulated data.
> Works on CSV files too with the right field separator.
I have used awk several times in different ways. I still don't find it easy to remember a command with as many punctuation characters as letters, like your first one.
And I'm not even sure what the second one is trying to do. Print the user name, followed by the number of fields separated by colons? What good is that? Perhaps you've forgotten that not everyone can read awk syntax right off the page.
When it comes to learning awk, my suggestion is to read the man file of one of the simpler implementations; original-awk[0] or plan9's[1] awk. They lack the GNU additions (many of which admittedly are useful), but they decrease the learning curve a lot. I pushed off learning awk for years since I got overwhelmed every time I typed 'man awk', but after stumbling on the man page from plan9 I learned it in no time. Sure, I don't know all the GNU extensions in gawk, but now I can look them up when I need them.
There is a lot of low-hanging fruit available in teaching people from research backgrounds about what we would consider baseline computer use and software development practices. It turns out there is a large population of brilliant people who can accomplish things in Python that I can barely comprehend in abstract, much less line for line, who have also never heard of Git, unit testing, or modularity.
I guess I was under the impression that the label indicated some level of expertise in the basic tools of the trade that would differentiate it from statistics or applied math.
Of course there is nothing wrong with specialization or lacking experience when it is not particularly relevant to your chosen field, but something like 'grep' would seem to be the bread and butter of data science.
Everyone starts from somewhere, no matter how “developed” a field is. I think we should encourage the learning no matter what level people are at, instead of being scornful of how “advanced” a field of study or the people within it are. There are many reasons people might not be familiar with these commands. One super simple one is if they come from a Windows background, where GUIs often take precedence over the command line.