
big.tsv is basically the same line repeated 2 000 000 times:

  $ awk \
    'BEGIN{for (i=0; i<2000000; i++){print "abcdef\tghijk\tlmnop\tqrstuv\twxyz1234\tABCDEF\tHIJK\tLMNOP\tQRSTUV\tWXYZ123"}}' \
    > big.tsv

  $ cat big.tsv |sort|uniq -c
  2000000 abcdef ghijk lmnop qrstuv wxyz1234 ABCDEF HIJK LMNOP QRSTUV WXYZ123

You can first sort all the lines and count the duplicates with uniq -c; the first column of the output is then the number of occurrences of each distinct line.

The idea is that if two lines are identical, they trigger exactly the same number of "count++" increments, so each distinct line only needs to be processed once, with its contribution weighted by the count from uniq -c.
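
For example, on a tiny input (the file name small.tsv and its three lines are just for illustration), counting every matching line directly and counting each distinct line weighted by its uniq -c count give the same total:

  $ printf 'abc\nabc\nxyz\n' > small.tsv   # toy input just for this sketch

  $ awk '/bc/{count++} END{print count}' small.tsv
  2

  $ sort small.tsv | uniq -c | awk '/bc/{count += $1} END{print count}'
  2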

I took your gawk code, but I start the for loop at 2 instead of 1 (field 1 is now the count prepended by uniq -c), and I removed the -F'\t' option so that awk's default whitespace splitting separates that count from the first data field.

  $ time cat big.tsv | sort | uniq -c \
  | gawk '{for (i=2; i <= NF; i++) {if (index(tolower(substr($i, 1, 3)), "bc") != 0) {count += $1}}}END{print count}'
  4000000

  real 0m0.724s
  user 0m0.605s
  sys 0m0.322s
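
For comparison, the per-line version being replaced would look roughly like this (a reconstruction from the description above, not necessarily the exact code from the parent comment); it should print the same 4000000, just with the inner loop running on all 2 000 000 lines instead of once per distinct line:

  $ # reconstructed sketch of the per-line version, not the exact parent code
  $ cat big.tsv \
  | gawk -F'\t' '{for (i=1; i <= NF; i++) {if (index(tolower(substr($i, 1, 3)), "bc") != 0) {count++}}}END{print count}'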

Edit: added backslashes to split lines.


