LINQ Ruined My Favorite Interview Question (scottchamberlin.tumblr.com)
133 points by scottcha on July 11, 2013 | 196 comments



I hate to be the bearer of bad news, but I think there may be even simpler solutions to this problem:

    (take 10 (reverse (sort-by (comp first rest) (frequencies (string/split ... #"\s+")))))

The above is a Clojure one-liner example that I believe satisfies the original problem. So while LINQ may have simplified from the C-language family solutions he had seen, it's clearly possible to take it one step further with the expressivity of modern languages like Clojure...

Edit: remember to sort! (Forgot my coffee this morning...)

Edit again: and aphyr's solution is even more concise and idiomatic, where `s` is the first paragraph of the blog post:

    => (->> (string/split s #"\s+") frequencies (sort-by val) reverse (take 10))
    (["I" 7] ["to" 6] ["the" 5] ["a" 5] ["of" 4] ["candidates" 3] ["is" 3] ["question" 3] ["in" 3] ["their" 2])


The OP blog-post doesn't really represent it succinctly in C#. This is the most succinct way I could think of:

  var top = (from w in text.Split(' ') 
             group w by w into g 
             orderby g.Count() descending 
             select g.Key).Take(10);


Though that will result in a lot more calls to Count(). A let would fix that.


I really really really should learn more Linq. Where would I use let, and why?


    from word in text.Split(' ') 
    group word by word into g
    let count = g.Count()
    orderby count descending 
    select g.Key
Under the covers, when it is compiled, it's turned into a select. That way it only does N counts, not potentially N log N of them (one for each comparison in the orderby), where N is the number of items you enumerate (so if you .Take() just 10 it doesn't really matter).
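For the curious, roughly what that lowering looks like in method syntax (a sketch, not the compiler's exact output; the anonymous type stands in for the compiler's transparent identifier):

    var top = text.Split(' ')
        .GroupBy(word => word)
        .Select(g => new { g, count = g.Count() })   // the let clause becomes this Select
        .OrderByDescending(x => x.count)
        .Select(x => x.g.Key);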


Indeed! It wouldn't be quite as concise though, which was kind of my point :)


While we're golfing, (comp first rest) is equivalent to the stdlib function called second, or, perhaps more idiomatic for map entries, val. We can also flatten deeply nested chains of computation using the -> and ->> macros, like so:

  (->> (string/split ... #"\s+")
       frequencies
       (sort-by val)
       reverse
       (take 10))


Think you also need a (map key) there to be equivalent to the C# code. At which point it's starting to look very similar to well written C# code solving the same problem.

Much as I love Clojure, it's worth pointing out that Clojure's lazy sequences (including intermediate sequences) get cached, while LINQ evaluates more like Clojure's reducers. (You can even get it to do so in parallel.)
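For the parallel aside, presumably PLINQ is what's meant; a minimal sketch reusing the earlier query (assuming `text` holds the input string and `using System.Linq;` is in scope):

    var top = text.Split(' ')
        .AsParallel()                          // PLINQ: same operators, evaluated in parallel
        .GroupBy(w => w)
        .OrderByDescending(g => g.Count())
        .Select(g => g.Key)
        .Take(10)
        .ToList();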


You're right; it does need a (map key) line to be equivalent. I was thinking of http://www.leancrew.com/all-this/2011/12/more-shell-less-egg..., which wants the counts too. :)

On a side note, I'd like to point out a key difference between this Clojure example and corresponding variants in C#, Python, &c: the Clojure variant has no variables. This isn't just a matter of concision: coming up with descriptive names is hard, and usually means duplicated information, either in the name or the type declaration. Often, those names are re-used over and over again with subtly different meanings, for each stage in a pipeline--forcing the reader to reason carefully about the declaring scope at each use.

Consider, for example:

  Swimmer swimmer = new Swimmer("foo");
  swimmer.setStyle("butterfly");
  swimmer.swim();
  return swimmer;

  (doto (Swimmer. "foo")
        (.setStyle "butterfly")
        .swim)
Same operation--but with (doto), four uses of a variable and one type declaration have been cleared away. Consider the original post:

  var words = s.Split(' ');
  var wordCounts = words.GroupBy(x => x).Select(x => new { Name = x.Key, Count = x.Count() }).OrderByDescending(x => x.Count);  
  var countedWords = wordCounts.Select(x => x.Name).Take(10).ToList();
  return ExtractTopTen(countedWords);
Six uses of three formal variables, including the bewilderingly confusable countedWords and wordCounts, plus nine uses of the delightfully generic "x". Even in the more compact C# example from this thread, consider:

  var top = (from w in text.Split(' ') 
             group w by w into g 
             orderby g.Count() descending 
             select g.Key).Take(10);
This variant refers to the temporary variables w (for "words") and g (for "groups"?) six times. Both authors felt the need to reduce the repetition of variables, but the best they could do was to choose single-character names.

This is the real power of ->, .., ->>, doto, and friends: eliminating names for things. By thinking about the composition of transformations, instead of the intermediate results, you can make an algorithm easier to understand and change.


Yeah, I tend to prefer the lambda notation over the sqlish notation, but either way the lambda form of your solution would be similarly terse. The select call that constructed an anonymous-typed object was unnecessary, as were all the temporary variables... C# coders should never be afraid to embrace the Rubyish leading dot approach.

  return s.Split(' ')
    .GroupBy(word => word)
    .OrderByDescending(wordGroup => wordGroup.Count())
    .Select(wordGroup => wordGroup.Key)
    .Take(10).ToList();


Pointless style gets a bad rap, but the idea that you're thinking of just the transformations instead of the data is a powerful mindset to have.


In C# that could be:

  return new Swimmer("foo")
  { Style = "butterfly"}
  .swim();
It is unidiomatic to have Swimmer.swim return 'this', though.

  var result = new Swimmer("foo")
  { Style = "butterfly"};
  result.swim();
  return result;


The thing being that C# has a feature for a whole bunch of idioms, but none of them are extensible, and writing new ones requires a new version of the language. Clojure's reducers, doto, ->>, and the go macro could all have been implemented in 1.0. (And two of them were.)

I speak as someone who codes in C# for work and Clojure as a hobby. There's lots of things that just aren't quite possible in C#, and _everything_ is possible in Clojure.

Pointfree is beautiful to read and grok. It's a little trickier to debug, since pretty much every debugger on earth needs points to inspect values.


> Six uses of three formal variables, including the bewilderingly confusable countedWords and wordCounts, plus nine uses of the delightfully generic "x".

I totally agree with you on this.

>Even in the more compact C# example from this thread, consider: var top = (from w in text.Split(' ') group w by w into g orderby g.Count() descending select g.Key).Take(10);

> This variant refers to the temporary variables w (for "words") and g (for "groups"?) six times. Both authors felt the need to reduce the repetition of variables, but the best they could do was to choose single-character names.

> This is the real power of ->, .., ->>, doto, and friends: eliminating names for things. By thinking about the composition of transformations, instead of the intermediate results, you can make an algorithm easier to understand and change.

You lost me somewhere along the way....are you saying the "from w in text"... snippet is bad, and something more along the lines of "This is the real power of ->, .., ->>" is more appropriate?

I ask because that code seems extremely readable to me. Personally, I don't give a shit if it's 30% more verbose or runs 50% slower, for 99% of code (written in the world), optimum performance doesn't matter. Maintainability does matter though. All of this software being written today has to be either maintained by someone, or replaced by something else. And the top ~2% of programmers like you sure as hell aren't going to be taking maintenance jobs any time soon.

I'm curious what the thoughts of a technically smart person such as yourself are on the subject of what companies will be left with 5 to 10 years down the road when consultants have come through and implemented using the currently most optimum platform/language/algorithms?

And I honestly don't mean for this question to be disrespectful. I'm just coming from a situation where I'm a former developer but on a project where I'm not coding, and I ask for features and the developers say they can't do it, or it will be a performance problem. And I know these guys are far more like you than me intelligence/education wise, but the things I ask for I've done tons of times in the past with 10 to 1000 times the data size, without a problem, on far older hardware.

I'm just curious what kind of a support problem you ultra smart people are leaving behind, or if you ever think about the idea that almost no one else is as smart as you?


Over the last ten years, I've read a lot of code. I maintain my own code, and I maintain others' code. Because I am not that smart, I try to simplify everything--to make large problems tractable for my little brain to reason about. In this process, I've come to the conclusion that good, maintainable code is:

1. As simple as humanly possible, so that the algorithm is easy to understand and change. Each component is isolated and can be understood and modified in isolation. Boundaries between component and environment are clearly thought-out.

2. As simply expressed as possible: broken up into distinct, well-organized functions, with descriptive, regular names for functions and variables, in context.

3. Well-documented; each function and each namespace come with contextual docs explaining their motivation, arguments, consequences, invariants, etc., with examples.

4. As short as possible, because humans have trouble holding large amounts of context in their head. Minimize the amount of scrolling or jumping between files necessary to understand the algorithm.

5. Well-tested, so that changes can be made freely. The test suite needs to be fast, so one can get feedback within seconds of making a change to the file; ideally a few milliseconds. A balance of typechecking, logical tests with mocks, integration tests, and full stress tests provides a continuum of safety.

I don't see these goals as particularly constrained to any language, but I will say that I feel best able to achieve them in a Lisp. Dunno whether that helps you project your problems at work onto me personally, though. ;-)


If the aim is golf, it's probably best to keep #" ". If the aim is accuracy, we might get better results by using #"[^\w]+". Though that does nasty things to words like "it's".

Edit: Actually, we can also get rid of the 'reverse' by replacing (sort-by val) with (sort-by (comp - val)). Not going to be shorter in terms of character count, though we win on line count.


sort-by takes an optional comparator, so doing (comp - val) is a bit unneeded. Just add in > as an argument to sort-by:

  (->> (string/split s #"\s+")
       frequencies
       (sort-by val >)
       (take 10))
Just mentioning it here because I find it vastly more readable than (comp - val).


Oh, we can use our favorite language in our new job? Here is a Python solution.

    from collections import Counter
    Counter(s1.split(' ')).most_common(10)


Well, sure, once you're allowed to use external libraries anything is a one line solution. In JS:

   doStuff = require("doStuff");
   var result = doStuff(theString);
isn't JS so efficient?!?


It looks like collections is part of the standard library: http://docs.python.org/2/library/collections.html. I think this disqualifies your JS solution, as long as doStuff isn't some fancy nodejs core module I missed.


I think the point still stands though, when you think of it as an interview question. The point is to get to the bottom of how the candidate would process such a query themselves, not to test their knowledge of the core libraries of their favourite language. I feel like that's what the post author meant when he said that LINQ had "ruined" the question -- because it kind of does the same thing.


Isn't it kind of axiomatic that any question that provides for a concise and measurable solution will eventually be refactored into a library? It seems like kind of a weakness in these kinds of "programmer" questions, that they all regress to trivia. Why not an open-ended business problem and/or deliverables-type question? I'm sure that would speak more to the daily job, where I've never heard of someone being tasked with "OK, find the top 10 words by frequency in a random Wikipedia article."


Surely the 'trivia' in this instance is knowing the names of libraries, as opposed to being able to think through programming concepts?

Just like math homework back in the day, the teacher didn't do it to check your answer, they did it to check how you arrived at your answer.


My memory of the math homework you describe brings up two images: geometry proofs and story problems. Never did I have to explain multiplication when doing my times tables, nor quadratics.


collections.Counter.most_common() uses heapq.nlargest() which is slightly more efficient than a full n*log n sort. But implementing a heap correctly from scratch is I guess outside the scope the OP intends for the interview question.

I would expect a good candidate to:

1. know that a heap is optimal here (remembering whether Counter uses it is optional)

2. express reluctance to implement it from scratch because surely the stdlib can do it better.

So IMHO using libraries like this ruins the question only in the sense of showing it's not a challenge for the candidate. Something so simple should not take a page of code.


> collections.Counter.most_common() uses heapq.nlargest() which is slightly more efficient than a full n*log n sort.

Creating a heap is O(n) if I remember correctly, so that may well be the most efficient solution.


"Sorry, we're stuck with Python 2.3.7 and collections isn't part of the standard library."


"""Thank you, I think I've heard enough. I don't think continuing the interview would be a productive use of time."""


I like the fact that you triple-quoted that statement.


Because of an older version of python?!


If they're stuck on a version of software that hasn't been supported in 5 years for bugs or security with a set of features that was originally released nearly 10 years ago, it means that there are some culture issues in the workplace that will most likely drive me insane.

Therefore, I am unlikely to take the job.


Most companies are not startups, installing software as soon as new releases become available.

Just to give another example, our consulting company still gets requests for projects to be deployed against Java 1.4!


Without imports:

    d = {}
    for word in s1.split(' '):
        try:
            d[word] += 1
        except KeyError:
            d[word] = 1
    print [(x, d[x]) for x in sorted(d, key=d.get, reverse=True)][:10]


Or, use a defaultdict(int), or more in line with yours:

d[word] = d.get(word,0) + 1

dictionary.get is quite useful.

Also, I'd consider s.split(None), instead of s.split(' '). It will group whitespace, so that any double space or other whitespace is collapsed into one delimiter.


or just `s.split()`


I didn't know about split(None). Thanks!


I'm a newbie, but the question seemed approachable so I went for it. This is what I came up with. (Python, btw.)

  def top_ten(s): 
    words = s.split(' ') 
    word_list = set(words) 
    return sorted(word_list, key=lambda x: words.count(x))[:10]
The question didn't ask for word counts, so I didn't see the need for a dictionary. I'd appreciate any advice on my solution. I'd be thrilled if I'm not too far off from being capable of starting to apply for jobs.


Nice. Just a little fixing and polishing:

    def top_ten(s):
        words = s.split()
        return sorted(set(words), key=words.count, reverse=True)[:10]
(to get the most common words instead of the least).


The need for the dictionary shows up when your text gets significantly large. Each call to `words.count` is going to re-examine each word of the text to count 'em up, so, if n is the number of words in the text, and m is the number of distinct words in the text, then this solution is at least O(mn + nlog(n)) whereas the dictionary-based solution is O(nlog(n)). That is, we're re-reading the word list over and over, whereas the dict-based solution only reads it once. It's therefore more likely to be more efficient.

But I like the readability of this solution and there's a strong argument to be made for it on that basis, especially if the string is short. If this were a job interview, this would be a totally acceptable solution, though it'd be important to be able to discuss why other solutions might be faster and why you prefer this one anyway.


Isn't the difference between O(mn + nlog(n)) vs O(nlog(n)) running time going to get less significant as the value of n gets larger?

I thought the whole point of Big O / asymptotic analysis is that you can ignore lower-order terms and constant factors because they are insignificant for any appreciably large input size. And also because the lower order terms and constant factors vary too much depending on the programming language, the compiler or VM, the hardware, etc.


I'd never heard of Big O until this thread, but from what I can tell it'll only be when a solution is in the O(logn) that "running time will get less significant as n gets larger". This is simply the way you describe logarithmic growth, so maybe that's where you got confused. Also, wouldn't it be fair to say that a logarithm (nlogn) is a lower order term than mn? In which case your definition stands.

At any rate, I wanted to test this out, so I made a naive benchmark for running these functions. The dict solution was ten times faster (0.0011s vs 0.015s) than the list version with ~1350 words. The dict solution ran in 0.13s at ~162,000 words, while I waited a couple minutes before killing the list version on that input.


Oh, you're right, m is not a constant but also varies somewhat independently of n, so you can't exclude it from the big O notation. And whether mn or nlogn is the lower order of the two terms depends highly on the value of m, which is the number of unique words in the text. A really long text that's just the same word over and over will have a big n but a small m.


Worst-case big-O runtime is therefore O(n^2), and best case is O(n log(n)) :)


Hey, thanks much! I'm going to have to spend some time with these logarithmic evaluations to really get what you're saying, but I dig the basic concept. Very helpful.


Ooh, this is fun!

In Ruby, without imports/requires:

    def toptenwords(str)
      words = str.split
      words.sort_by { |word| words.count(word) }.uniq.reverse.take(10)
    end
or as a one-liner, without any variable declarations in the function scope:

    def toptenwords(str) str.split.sort_by{|word| str.split.count(word)}.uniq.reverse.take(10) end


I think this version and spenuke's version are impractically slow, but yours is actually O(N²). If you give it an 824 000 word input, it will do roughly 0.7 trillion word comparisons, which you will probably not be willing to wait for. A more practical solution:

    counts = Hash.new { 0 }
    IO.read('bible-pg10.txt').split.each { |w| counts[w] += 1; }
    counts.keys.sort_by { |w| -counts[w] }.take 10
This is still O(N lg N) instead of O(N lg 10) like the Python version, but it's good enough this time; it still gave me ["the", "and", "of", "to", "And", "that", "in", "shall", "he", "unto"] reasonably quickly.

I'd be interested to see if there's a way to do this in a single expression in Ruby.


I actually was working on a hash-based solution first, using group_by, but I switched to using just array after seeing the Python version. I didn't really think about the time complexity of the count operation inside the sort_by operation, thanks for pointing that out.

You definitely can do a hash-based solution in a single expression in Ruby. Here's a very ugly and kludgy example that you could probably improve on if you wanted to. I don't think it's n^2 because the group_by just counts the occurrences of each word and returns a hash where the count is the key:

  str.split.group_by { |w| str.split.count(w) }.sort_by { |k, v| k }.reverse.flatten.uniq.keep_if { |w| w.is_a?(String) }.take(10)

I'm also trying to work out a better way to do this using "chunk" because although hashes are fast to access, they are not fundamentally sortable, and sort_by returns a 2d array just like chunk does anyway.


> I'd appreciate any advice on my solution.

You should always compare your results to what is expected. For a problem like this, use a small set of test data that can easily be counted and sorted in your head or on paper.

You forgot reverse=True and your results show the 10 least common words. ;)

This kind of error happens to all of us. That's why we have unit tests and QA teams. If you made this mistake during an interview I wouldn't give it much importance and we would have a good laugh about it.


Ha, nice! Funny you should mention that. I did exactly that when tooling around in the repl and forgot the reverse kw, saw something fishy, and fixed it. Promptly forgot it again when I typed it in this comment. :)


This is a great approach. Even better because you carefully examined the requirements and found an interpretation which allowed you to make an optimization (words only, not counts, therefore sets are useful)


This isn't an optimization. He turned an O(n) algorithm into an O(n^2) one.


In case anybody was wondering for a second when JS got so awesome -- it didn't, this is a snippet of Python. D'oh. Sigh.


JS isn't that bad with underscore or d3:

    d3.entries((s.split(" ").reduce(function(p, v){
        v in p ? p[v]++ : p[v] = 1;
        return p;}, {})))
      .sort(function(a, b){ return a.value - b.value; })
      .map(function(d){ return d.key;})
      .slice(-10);


collections isn't an external library. It's part of the Python standard library.


Agreed, if I was looking for a candidate that will always "reinvent the wheel" then I would want them to implement from scratch. I find it perfectly valid for the interviewee to say "Hey! I wrote a module for <language> for this exact purpose! Want to check out my github page?".


My first thought after LINQ was a JS or CoffeeScript Map/Reduce/Sort/Splice.


Pfft <?php function top10($s) { $a = array_count_values(preg_split('/\b\s+/', $s)); arsort($a); return array_slice($a, 0, 10); }


You managed to create an example of why I don't care for PHP...

    array_count_values(...)
    arsort(...)
    array_slice(...)
Thinking.. two of these things belong together, two of these things are kind of the same.. but one of these things is doing his own thing...

Seriously, built-in method naming and argument ordering don't follow any consistent convention in PHP; that's always been the most irksome thing to me...


Welp oops: <?php function top10($s) { $a = array_count_values(preg_split('/\b\s+/', $s)); arsort($a); return array_slice(array_keys($a), 0, 10); }


I find it silly that everyone is arguing below, but when:

s1 = 'a a a a a a a a a a a a a a a a A A A A A A A A A A A A A a. a. a. a. a. a. a. a. a.'

those two lines give:

[('a', 16), ('A', 13), ('a.', 9), ('', 1)]

(HN might collapse the double space in s1)


Indent with four spaces for any bits of text you want quoted and monospaced on HN.

    Like      this      here


In a similar spirit, but without libraries and with a programming language designed 25 years ago.

    Commonest[StringSplit[string], 10]


To be fair, Commonest appeared in Mathematica 6.0 about six years ago.


meh, just use sum instead of counter

    sorted([(i,sum([i==j for j in s.split()])) for i in list(set(s.split()))],key=lambda x:x[1],reverse=True)[:10]


expressivity of modern languages like Clojure

Uh, no. Your solution just uses a bunch of standard library functions (at least, I hope they're not syntactic forms… and why a function as specific as "frequencies" is not in some namespace boggles my mind). I could write that in C with an appropriate standard library.

Ironically, expressing this in something like SQL actually speaks to the expressivity of the language because the solution was produced entirely with syntactic forms.


That may be true, but it's still a pretty idiomatic Clojure solution. Ask 10 Clojure programmers to solve this problem and most of them will come up with something similar. Ask 10 C programmers and they will likely come back with a program much more similar to the one in the post.

The fact that such functions are at a Clojure programmer's fingertips is exceptionally important. If you strip out the core functions (and macros) from any lisp or Clojure you are left with almost nothing (see http://stackoverflow.com/questions/3482389/how-many-primitiv...).


`frequencies` is surprisingly useful, enough to justify its inclusion into core. I once pondered why.

Consider that in Clojure, a common act is transforming one data structure into another. Typically only a few kinds: maps, sets, and some kind of sequence.

What is a highly common generic transformation of a sequence to a map? `frequencies`. Particularly since you'd only be doing this with pretty finite sequences, given the finite nature of the built-in maps. Finite here means countable. What generic, domain-independent thing would you be counting? Often, the items themselves.


Ironically, expressing this in something like SQL actually speaks to the expressivity of the language because the solution was produced entirely with syntactic forms.

What? Pure untyped lambda calculus does things "only with syntactic forms". Don't tell me that your solution wouldn't use some basic operators other than rudimentary syntax. Keep in mind that in Lisp-family languages, many things that are a part of syntax in other languages are actually simple functions - arithmetic operations being a case in point - so you really can't avoid using library functions.

I believe that the major point of Clojure etc. is that they provide concise basic operations that you can use everywhere to compose complex operations better than you can do with, say, C#.


So, the line of code in the OP that's missing from the LINQ version of the solution is:

    using System.Linq;


  s.Split(' ').GroupBy(x => x).OrderByDescending(x => x.Count()).Select(x => x.Key).Take(10).ToList();
I'd say 13 characters isn't really a huge differentiator. They're both using exactly the same algorithm, almost exactly the same built-in library functions, and have exactly the same flexibility for programmers to add their own (C# is using extension methods which lets you add methods that look like they belong to the class, but are actually static methods defined elsewhere).

There's a lot of benefits to Clojure over C#, but the ability to chain functional list operators really isn't one of them.


For a sys-admin job:

  tr 'a-z' 'A-Z' | sed 's/[^A-Z][^A-Z]*/\
  /g' | grep -v '^$' | sort | uniq -c | sort -nrk1 | head -10

edit: note there is a literal newline after the backslash in the sed command (HN formatting kept eating it).


You can use tr instead of sed and grep, too. Also -k1 and -n 10 are default for sort and head:

    $ < bible-pg10.txt tr -cs a-zA-Z '\n' | sort | uniq -c | sort -nr | head
      62265 the
      38915 and
      34588 of
      13474 to
      12846 And
      12589 that
      12387 in
       9762 shall
       9668 he
       8942 unto
I have a `bins` script that is just `sort | uniq -c | sort -nr`, so that reduces to `tr -cs a-zA-Z '\n' | bins | head`.

Unix for Poets, man. It's the shit.


I ALWAYS forget about squeeze, thanks!


This is too complicated...

echo $sentence | rs -T | sort | uniq -c | sort -rn | head -10


Ooh that's a clever use of reshape, but wrong sadly. Yet you're right I could make it simpler, it was just what came first to mind, I did not even check it. Imagine typical sentences:

  $ echo Foo foo foo. | rs -T
  Foo
  foo
  foo.

  $ echo Foo foo foo. | tr 'a-z' 'A-Z' | sed 's/[^A-Z]/\
  /g' | grep -v '^$'
  FOO
  FOO
  FOO
Also the link mentioned using wiki articles for testing, so they would have paragraphs and that's where reshape shines for things like emails, but fails here:

  $ cat foo
  Foo
  
  foo

  foo.
  $ <foo rs -T
  Foo   foo   foo.  
  $ <foo tr 'a-z' 'A-Z' | sed 's/[^A-Z]/\
  /g' | grep -v '^$'
  FOO
  FOO
  FOO
That said, though, it makes me think: should I treat contractions specially? What about plurals? It's starting to get silly now.


I think it goes beyond the original question now. :) If one wants to do the proper thing, then contractions, word variations, etc. should definitely be considered. For this, I guess something like OpenNLP can be used.

What's interesting though is how the UNIX shell, being essentially a symbolic FP language, allows one to solve the problem in a clear and concise way. And if one desires, the program can be easily modified to read sentences, say, from the network over ssh tunneled via HTTPS. That kind of flexibility can rarely be achieved in languages like C# with tight coupling between the units of abstraction.


I like how you snuck in "modern languages (like Clojure)", as if C# is not a "modern language". Both languages have their place, but let's not pretend that Clojure is somehow superior for all applications.


You'll need to sort the result of frequencies: it's an unordered map.

In practice, if you use ->> and write neater LINQ, the code is very close. Although LINQ only has group and sum/reduce, no dedicated frequencies function.


how does FREQUENCIES know how to compare strings?

In Common Lisp you'd usually supply a TEST keyword parameter (#'STRING= or #'EQUAL in this case). Does Clojure use Java's type system to infer?


No expert, but from the source it looks like it just uses whatever a 'map' uses to match keys:

  (defn frequencies
    "Returns a map from distinct items in coll to the number of times
    they appear."
    {:added "1.2"
     :static true}
    [coll]
    (persistent!
     (reduce (fn [counts x]
               (assoc! counts x (inc (get counts x 0))))
             (transient {}) coll)))


Clojure uses a unified compare test, sometimes called equiv. I think it's based on this paper: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.23.9...



Since we're already on the JVM, here's my solution for Scala:

    def top10(s: String) = s.split(' ').groupBy(identity).mapValues(_.size).toList.sortBy(-_._2).take(10).map(_._1)


That's almost identical to what I came up with, but yours is better...I didn't know about the mapValues function.

    def mostCommon(str: String, num: Int) = {
      str.split(" ").groupBy { s => s }
        .map { case (k, v) => k -> v.length }.toList
        .sortBy { _._2 }.reverse.take(num).map { _._1 }
    }


That works as well. On the other hand, I'm not a fan of these huge one-liners (including the one I posted). I split mine up a bit and posted it here: https://coderwall.com/p/ha-c2w


LINQ tends to get thought of as "database syntax sugar", but it's way more than that. It's C#'s version of the lazy collection operations you find in most functional languages, just given friendlier sqlish names. (i.e., "Select" instead of "map" and "Where" instead of "filter")

I almost feel a bit gross when I have to write a "foreach" loop at this point, because there's almost always an equivalent way to do it in LINQ (although it's a tradeoff, as the one downside of LINQ is that given its deferred nature, it's harder to debug).
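A minimal illustration of that name mapping (needs `using System.Linq;`):

    var numbers = new[] { 1, 2, 3, 4, 5, 6 };
    var oddsSquared = numbers
        .Where(n => n % 2 == 1)    // filter
        .Select(n => n * n)        // map
        .ToList();                 // nothing runs until something enumerates the query
    var sum = numbers.Aggregate(0, (acc, n) => acc + n);   // fold/reduce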


>It's C#'s version of the lazy collection operations you find in most functional languages, just given friendlier sqlish names. (IE, "Select" instead of "map" and "Where" instead of "filter")

Actually........

It's more generic than that. It's C#'s monadic comprehension syntax ala Scala's for and Haskell's Do.

It works on more than just IEnumerable, it actually works on anything that implements Select, SelectMany, and Where...
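A sketch of that point (not from the thread): query syntax binds structurally to whatever Select/SelectMany it finds, so it works on a hypothetical Maybe<T> option type that has nothing to do with IEnumerable:

    using System;

    // Maybe<T> is a made-up option type, used only to show the pattern-based binding.
    struct Maybe<T>
    {
        public readonly bool HasValue;
        public readonly T Value;
        public Maybe(T value) { HasValue = true; Value = value; }

        public Maybe<R> Select<R>(Func<T, R> f)
        {
            return HasValue ? new Maybe<R>(f(Value)) : default(Maybe<R>);
        }

        public Maybe<R> SelectMany<U, R>(Func<T, Maybe<U>> bind, Func<T, U, R> project)
        {
            if (!HasValue) return default(Maybe<R>);
            var u = bind(Value);
            return u.HasValue ? new Maybe<R>(project(Value, u.Value)) : default(Maybe<R>);
        }
    }

    class Demo
    {
        static void Main()
        {
            var r = from a in new Maybe<int>(2)
                    from b in new Maybe<int>(3)
                    select a + b;                  // desugars to SelectMany, yields Maybe(5)
            Console.WriteLine(r.Value);
        }
    }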


This is super obvious if you watch some of Erik Meijer's excellent videos on Channel 9. Recommended viewing for anyone reading this who hasn't. Watching him bounce between Haskell and C# felt like someone explaining etymology by moving between Latin and English.


The most annoying thing about LINQ is that they use the SQL language. So unless you use C# and LINQ all of the time it's hard to figure out what functions translate to the ones used by every functional language in the world.

I wish they would at least alias map, reduce, some, etc.


Many C# users have experience with SQL, but not necessarily with functional programming languages. The number of these users likely far outnumbers the number of C# programmers who know functional programming languages, but not SQL.

It makes sense to use terms that many C# (and SQL) users are already familiar with, rather than terms that they may not know of.


I understand the rationale, but in my experience knowing some SQL didn't help me understand LINQ at all. It was only after picking up some functional experience elsewhere and coming back to C# much later that LINQ made sense to me. I'd be really curious to see how the SQLesque naming's worked for others, but it was a bit of a miss for me personally.


You really only have to figure it out once. And for users not familiar with the "standard" names, I think linq's naming is actually more intuitive.


tbh I don't think either are particularly intuitive in a "can guess it" sense. Both are intuitive in a "oh, that's what it's called, I can remember that now" sense.


Well, SQL is by far the most used functional language in the world.


I do Lazy LINQ, by which I mean I write the looping code with "foreach"es and get it debugged & working, but then allow Resharper to convert it to LINQ if it can.

I think this is mostly because I'm still not comfortable with the syntax, and partly because of the set/iterative impedance mismatch that is there. "It's not you, LINQ. It's me."


You would really benefit from learning a functional language (any) to help you roll your own LINQ. Some of ReSharper's refactoring to LINQ is awful and does not make for cleaner code.


I almost feel a bit gross when I have to write a "foreach" loop at this point

Agreed. I've transitioned to spending most of my time in JS, and wherever I can I use .map(), but the chaining isn't quite the same as LINQ. Someday I intend to write a library of Array addons to provide GroupBy and so on, but I can't imagine it'll be super efficient.


This is the problem with doing functional list stuff in languages that permit side effects: the order of execution for each function has to be strictly defined (in both senses of the word), so the compiler can't optimise the code. In Haskell the solution (see my comment on the original page) would probably (I haven't actually checked the GHC core output) be folded into a single loop.


In Underscore, you can do something like:

    _.chain([ ... ]).map( ... ).filter( ... ).value()
It's not exactly extending the native array...but there are far fewer side effects to doing it this way.


The problem with this is that it's doing multiple iterations. LINQ on the other hand composes lazy operators, so the whole pipeline runs as a single pass when `ToList()` (or some other method which gets the results) is eventually called.
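A small demonstration of that deferred, single-pass behaviour (the interleaved output is the giveaway):

    using System;
    using System.Linq;

    class Demo
    {
        static void Main()
        {
            var source = new[] { 1, 2, 3 };
            var query = source
                .Where(n => { Console.WriteLine("where " + n); return n % 2 == 1; })
                .Select(n => { Console.WriteLine("select " + n); return n * 10; });

            // Nothing has executed yet; ToList() drives one pass, each element
            // flowing through the whole pipeline before the next one starts:
            var result = query.ToList();   // where 1, select 1, where 2, where 3, select 3
            Console.WriteLine(string.Join(",", result));   // 10,30
        }
    }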


Underscore has groupBy and some of the others you may be looking for.


Yes.. I do understand why extending the base Array object in JS is deeply frowned upon but at the same time I often wish that these sort of extensions were on the Array object. It feels clunky to use them otherwise.


You should check out Sugar, by far my favourite standard JS lib to use in any complex project: http://sugarjs.com/


This may be what you're looking for. I think it was developed by MS specifically for this reason:

https://github.com/Reactive-Extensions/RxJS


I am a big fan of LINQ but it gives and takes away complexity at the same time.

It makes a 'whole class of things that you would have to do with loops' go away. Once you are comfortable with the syntax, it makes code a lot more readable.

But you lose track of when and where things are getting executed.

For example, it is easy to make something that you intend to execute inside of SQL server run inside of C# code. And then, all of a sudden, string comparison is case-sensitive.

You wind up having to context switch between procedural and set mentality without the same kind of visual cues you used to get.
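A sketch of that trap, assuming an Entity Framework-style `db.Users` IQueryable (the names are illustrative, not from the article):

    var inDb = db.Users
        .Where(u => u.Name == "smith")    // stays IQueryable: translated to SQL, uses the DB collation
        .ToList();

    var inMemory = db.Users
        .AsEnumerable()                   // from here on it's LINQ to Objects
        .Where(u => u.Name == "smith")    // now an ordinal, case-sensitive .NET comparison
        .ToList();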


It's also really easy to shoot yourself in the foot with unless you understand everything and think each problem through.

Four killer issues I've seen so far:

We had a major production performance issue which turned out to be a stray ToList which was causing a massive memory ballooning. Didn't get noticed in test as the test cases passed but it hit prod and 2000 users bashed it and tried to allocate 20Mb each causing our cluster to shit a brick.

Null reference exceptions! There are so many dereferencing operations in an average LINQ expression that you really have no idea which one is blowing if it goes pop in production in release config.

People using .Single(...) and getting more or less than one result back. So frustrating.

If you push an IEnumerable<T> over an interface boundary the performance and memory semantics are not preserved and you end up with a leaky abstraction. These are shits to resolve. Example: queries executing inside the view which is outside the transaction scope.

We've had to ban it in some circumstances.


I agree with your pain points.

We try to use ICollection in our APIs instead of IEnumerable, since the latter can have surprising semantics like being a wrapper for some operation which may not be valid anymore, or might be slower than you expect to do things like .Count(). IMO it's really not best for transporting across interface boundaries in the most common case; only when you're specifically trying to avoid having the whole collection in memory or something like that.

Another thing that can help is this wonderful May<T> library[1]. It's a great option type[2] for .NET. It helps make operations more composable.

[1] https://github.com/Strilanc/May

[2] http://en.wikipedia.org/wiki/Option_type


But you lose track of when and where things are getting executed.

That is the curse of modern software development. But LINQ is a minor offender here compared to some kinds of remoting, code reflected out of endless XML configs wrapped in a gazillion interfaces, and some inventive ORMs. Sadly our tools have not kept up with the complexity that is external to the executing code. The debugger in the pre- inversion of control/bean injection days was almighty because you had all of the program state in front of you.

A little known fact is that if you wrap Java/C# code in enough unit tests you get Haskell.


> A little known fact is that if you wrap Java/C# code in enough unit tests you get Haskell.

It is enough to make you want to switch to a pure message passing system, isn't it?


I've run into this a little bit as well, but I feel the benefits far outweigh the disadvantages. Once you understand the basics, your code becomes so much cleaner.

You do have to be careful for hidden pitfalls, like an accidental N+1 select against the database, but even that is usually fixed pretty easily if you simply call .Include() in the original expression.

From the article, I really don't understand the concept of not updating your knowledge of LINQ. It's been out for years now. And then the first concern is with performance? That sounds very problematic to me. There are a lot of .NET developers out there that are still stuck in the previous decade. There are a lot of shops that are still on 2.0, which is too bad, because 3.5 was when the framework and C# really started to take off.


There is also some trickiness around when things get executed. If you return an IEnumerable<T> that is really a LINQ query, and the caller doesn't immediately enumerate it, they run the risk of getting exceptions when they do enumerate if you mutate the collection the query runs over. You can of course force its evaluation, but that also loses some of the benefits of the laziness. Bring in multi-threading and even if you properly synchronize collection access (and that is hard to do without custom collection types, since LINQ is like the honey badger w.r.t. synchronized access) you can still see the collection modified exceptions talked about above.


The "execute inside SQL but ends up in C#" is because of the terrible decision of the C# language designers to make reified code (expression trees) have the exact same syntax as lambda functions that are normal code.

If you had to tell the compile if you wanted code or a tree, that problem would be solved. It'd also be one step closer to allowing type inference for lambdas assigned to locals.
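To make that concrete (a minimal example):

    using System;
    using System.Linq.Expressions;

    class Demo
    {
        static void Main()
        {
            // Same lambda text, two very different things, chosen only by the target type:
            Func<int, bool> asCode = x => x > 3;               // compiled delegate, runs as C#
            Expression<Func<int, bool>> asData = x => x > 3;   // expression tree, data a provider can translate

            Console.WriteLine(asCode(5));              // True
            Console.WriteLine(asData.Body);            // (x > 3)
            Console.WriteLine(asData.Compile()(5));    // True, but only after explicit compilation
        }
    }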


I hate to be the arrogant know it all on Hacker News, but seriously, if you're writing c# and not using LINQ all the time, you need to catch up.

I'm tired of seeing people answer interview questions with anything _other_ than LINQ.


I don't really agree with your statement, but I did find it perplexing that a C# .NET technical interviewer would not know the fundamentals of LINQ 5 years after its inception.


Time flies when you're building stuff, I know, but LINQ was actually announced at PDC 2005 - 8 years ago this fall.

That a C# developer can seriously say they're not up to speed with "new stuff" like LINQ (2005) and dynamic (2008) in 2013 is kind of ridiculous. What they mean is that they figured out how to map what they learned about Java in school into C#'s syntax and now they're done learning.


I wouldn't call you arrogant, LINQ has been around for 5 years now and any C# interview should include it.

That said, not all questions are best solved with it, but it certainly has made working with enumerables much easier.


LINQ is not a magic bullet. It has some serious pitfalls.

I'd like to see people using the right tools for the job.


LINQ is a great mechanism for expressing complex operations in a succinct manner. Sure, LINQ may not be a magic bullet and can be abused, but you know the old adage about premature optimisation. Perhaps you could elaborate on "It has some serious pitfalls.".


I didn't write the post you responded to, but I can tell you of the main pitfall I have encountered. By far the worst is evaluating the same enumerable again and again by mistake. It doesn't show up in unit tests because they use small data sets, but as soon as you do any kind of load test or put it into production it's painfully slow. The solution is often very simple, but you really have to get your hands dirty before you recognize where it is a problem.
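A minimal sketch of that pitfall:

    using System;
    using System.Linq;

    class Demo
    {
        static void Main()
        {
            // 'expensive' is a deferred query; every enumeration re-runs the (pretend) costly work.
            var expensive = Enumerable.Range(0, 1000)
                .Select(i => { /* imagine a slow lookup here */ return i * i; });

            var count = expensive.Count();   // first full pass
            var max = expensive.Max();       // second full pass -- all the work again

            // Materializing once avoids the repeat:
            var cached = expensive.ToList();
            count = cached.Count;
            max = cached.Max();
        }
    }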



I disagree, but for different reasons than the other commenter. LINQ is great for transformations that are intuitively described by the verbs LINQ provides. If your transformation is most easily understood as a process of mapping, filtering, grouping, selecting, etc., then by all means. But trying to pigeonhole every iterative process into LINQ is a recipe for code that's hard to decipher. As always it's a question of how one communicates a process most effectively. Succinctness for its own sake is a fool's errand. Clarity is key, and sometimes that means foregoing LINQ for standard iteration.


I answered a phone interview question once with LINQ -- it was about combining string arrays while avoiding duplicates.

The HR interviewer followed up my answer asking about runtime and memory usage. While these are good questions, I got the feeling they didn't want to receive an answer using a single line of LINQ.
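For reference, the LINQ answer presumably looked something like this (a sketch; `first` and `second` are hypothetical string arrays):

    var combined = first.Union(second).ToArray();   // Union is a set union, so duplicates are dropped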


Do you know the runtime and memory usage of a LINQ query? That's sort of a complication of LINQ.


but seriously, if you're writing c# and not using LINQ all the time

You don't sound arrogant, but rather sound naive. I agree that someone who recruits surely should have known about and have experienced LINQ significantly by now, but the notion that you should be using it "all the time" is absolute nonsense.

I avoid LINQ. I encourage others to avoid LINQ. It is almost always a sign of bad code.

LINQ is syntactical sugar over basic set operations. It is perfectly fine if you're doing naive activities, such as the example given -- a contrived example of brute forcing a problem, where two approaches to the same very basic need unsurprisingly yield the same complexity -- but it falls apart in real-world persistent code with considered algorithms and storage. In real long term code, it is usually the canary in the mineshaft telling you that the developers aren't using proper algorithms or storage.

In many large-scale projects it invariably turns into the performance nightmare that ends up causing whole rewrites.

Again, not because of LINQ itself, which of course can do basic operations like grouping and sorting as quickly as you could do "by hand", but that it makes it so easy to do those things that developers start to resort to that as a catch-all magical solution that is costless because how much could a line or two of code cost? -- a master List<stuff> that they just sort and group by and select from all over the code (O(n) * O(n) * O(n)...why not?). The end result is that the data structures and encapsulation that should have happened never did, so while it might seem more concise and obvious on a one-to-one comparison with a loop perspective, neither case should ever have happened.

LINQ is the gun by which a lot of terrible programmers are repeatedly shooting themselves in the foot with, all while gloating about their concise code.


(Note: When I say LINQ I am referring to the functional style it encourages, not the query syntax. The query syntax is nice, but it's just a trivial syntactic transformation.)

Correct me if I'm wrong, but the world is moving towards functional programming (i.e. LINQ) not away from it. Personally, I find LINQ far, far easier to read, write, and analyze. (On the other hand, I understand the deferred semantics and watch for warning signs like enumerating a sequence more than once.)

Honestly, a C# company avoiding LINQ sounds to me like the canary in the mineshaft telling you the company has programmers falling behind the times and doing things the hard way.


I'm definitely in agreement with the last point. However, I originally said that good C# code contains a lot of LINQ, not that C# code with a lot of LINQ is necessarily good. Deferred execution is something that confuses people, but frankly, it's a concept they need to learn. It's been in the language since yield return got added in 2.0.

LINQ's deferred execution was the right choice for performance, but it's the hardest one. You've got to know when to call .ToList(). I'm not denying that I've seen people evaluate the same expensive list 100 times, use a join when precomputing a Dictionary would have been much faster, close a connection before the result is actually evaluated. But I've never seen C++ programmers say you can make mistakes with pointers, so don't use them.

Clojure's lazy sequences are guaranteed to evaluate once, but that comes at the expense of storage. In particular, a query like the one in the original blog post will evaluate multiple intermediate lists that then need to be thrown away. And indeed, they've introduced reducers to address this, which behave more like LINQ.


In particular, a query like the one in the original blog post will evaluate multiple intermediate lists that then need to be thrown away.

Why not fuse the operations, if the values are immutable?


Well, there's two ways to do that in Clojure: write your own list processing (usually regarded as bad style) or use reducers. Both are appropriate in specific instances, but the fact remains that the lazy seq code above is the idiomatic way of doing it in Clojure. So the _default_ way of doing it is slower than LINQ's default way of doing things.

It wouldn't be hard to make LINQ behave like Clojure, either.


Don't forget about Memoize(). Often it can be a better choice than ToList().


I'm not sure what you're actually arguing for or against. There is nothing at all wrong with functional programming, and such had nothing whatsoever to do with my comment. That LINQ happens to use some functional techniques doesn't make my criticism a criticism of fP.


Functional languages use the concepts you're criticizing even more extensively than LINQ does.


What are the "concepts I'm criticizing"? I suspect that you have absolutely no idea what the context of my comment was, because your diversion on the topic of functional programming is just grossly out of place.

LINQ encourages the belief that set operations are free, such that you no longer have to concern yourself with concepts like memoization or appropriate structures. This has nothing to do with functional programming. Literally at all.


You've actually been pretty vague about the concepts you're criticizing, except for "LINQ". Now I can see you were alluding to just the deferred execution part (the "lack of memoization").

Basically, you're saying it's too easy to accidentally perform a computation twice by iterating through a sequence twice. Recurse on that problem and you get an exponential blowup.

That is actually a problem mostly unique to LINQ and it is really important that a programmer understand the deferred semantics. Some tools, e.g. ReSharper, will detect multiple enumerations and warn you about it. Judicious use of ToArray/ToList solves most issues with defer-splosion.


I haven't been vague whatsoever.


In your first post, everything is "linq is bad" in generic terms except for these two bits:

> "it falls apart in real-world persistent code with considered algorithms and storage"

> "it makes it so easy to do [grouping / sorting / etc] that developers start to resort to that as a catch-all magical solution that is costless because how much could a line or two of code cost"

With hindsight I know your issue is with LINQ's deferred non-memoized execution. It's really easy to enumerate a sequence twice, but every place you do that might add a factor of 2 to the runtime and so there's a danger of exponential blowup that must be avoided. But that's not what I see when I read the above.

The above reads as complaints about easy composability and worries about code monkeys failing to consider what the code they're writing will do. Those are standard "functional / high-level programming is bad" complaints. With hindsight I know that's not what you meant, but that's what it reads like.


I get what you're saying, I understand it, and I agree with the pitfalls. But I really really really don't agree that one should avoid LINQ. Sure, if you don't understand how it works and how it impacts performance, you should use it less. But don't try and drag the language down because you find it difficult. LINQ is easily one of the best things that have happened to a mainstream language in a long time, and if used right it produces highly maintainable code that is very easy to read and understand. A lot of data manipulation really is done on sets, and LINQ makes it a breeze to do.


But don't try and drag the language down because you find it difficult.

Oh give me a break. My exact complaint about LINQ is the opposite -- that it makes expensive operations seem simple, which with all innovations in language means that mediocre programmers will start abusing it in all codebases.

LINQ has a place. To say that one should be using it "all the time", however, is exactly as I said -- a canary in a coal mine. The only time I've ever come across the extensive use of LINQ it has been in horrific code.


So I'm guess you haven't done any programming with the reactive extensions yet?


And why would you guess that? Because I don't consider LINQ over Rx observables to be a magic cure-all?

In fact, LINQ on Rx is a symptom of the classic LINQ issue -- don't bother with subscriber or source filtering, or coherent, useful events -- simply broadcast everything, and through the magic, free filtering of LINQ all is solved in the consumer.


Wouldn't it be (O(n) + O(n) + ...) at the worst? And if it's on lazy sequences, which I thought it was, then composing a .Select and an .Aggregate should only involve a single pass through the sequence.


I'm not talking about individual LINQ queries, which themselves can be perfectly fine: Microsoft is pretty smart, so if you're doing basic LINQ to Objects grouping and sorting and selecting, they're going to use decent algorithms given the structures used. There is nothing surprising that the submitted usage shows very similar performance, as in the end both cases are doing essentially the same thing.

The problem is that it makes it so conveniently easy to do brute-force tactics that....oh the horrible things I've seen...code gets littered with LINQ doing naive queries repeatedly over massive sets of data. Of course you need good coders and good code audits, but LINQ, I think, gives an unwarranted sense of comfort that one is making good code (where if people had to code these as loops, it would become very evident very early on that maybe they should rethink their approach).


/me wonders

are candidates allowed to choose their favorite language?

man bash | tr '[:upper:] ' '[:lower:]\n' | sed '/^$/d' | sort | uniq -c | sort -rn | head | awk '{ print $2 }' | fmt

or do you only hire windows coders?


Awesome. I was going to post my two line javascript version, but instead I'm going to study this. There are three commands I've never heard of there. And it's definitely time to take a first crack at awk.


You don't need that tr, just sort -f and then uniq -i. Of course, whoever put -f as the ignore-case flag for sort should be shot methinks, or rather the one who first added -i as "ignore nonprinting characters".

Edit: and while we're on the subject, you can get rid of head and use awk 'NR<=10 {print $2}'

Edit²: I would also like to point out that your version is superior to any short C# program, because sort can sort sequences that exceed the amount of RAM you have.


It sounds like OP has been writing C# for ten years straight, so I'd guess there's a certain level of Windows/MSSQL/.NET experience expected.


This is how I do this sort of thing all the time. But, the sort is O(n log n), so it's asymptotically less satisfying.


True, a bag of words would perform better.

man bash | tr '[:upper:] ' '[:lower:]\n' | awk '/./ { bag[$1]++ } END { for(word in bag) { print bag[word], word } }' | sort -rn | awk '{ print } NR>=10 { exit(0) }'

But it would require more typing and thinking.

sure one could do this in awk completely,

  man bash | awk '
  /./ { 
    for (i = 1; i<=NF; i++) { 
      bag[tolower($i)]++
    } 
  } 
  END {
    for (i = 1; i<=10; i++) {
      score=0;
      for(word in bag) {
        if (bag[word] > score) {
          score=bag[word];
          best=word
        }
      } 
      printf "%s ", best
      delete bag[best]
    } 
    printf "\n" 
  }'
if the requirement is: please choose one language and not the complete Unix babylon.


I should have looked below, I wrote a similar thing above, but you forgot to split multiple words in a line into lines. That's okay, a lot of people above forgot about case and punctuation even!


Perhaps, if you did it in Powershell?


I had a class where we built a DB engine from scratch, and I ported my code to C# just for LINQ and proper list handling. While it's not the most efficient, it reduced many of the complex sections down to a few lines, like reordering the values being inserted to match the order necessary for a record by comparing the schema to the parameter list in the insert statement.

If you like LINQ and wish it were available in JS, look at underscore or lo-dash.


LINQ does make C# more readable but it doesn't add much value over normal functional collections which usually end up more concise, simpler and easier to reason about - visible in my Dart port of C#'s 101 LINQ samples (which are also lazy): https://github.com/dartist/101LinqSamples

Performance of LINQ to Objects is not that great either, and it doesn't add much value readability-wise over rewriting the same task in other dynamic languages: https://github.com/dartist/sudoku_solver


My main problem with LINQ is that it seems to perform terribly on mobile. I was looking for map/reduce/etc type functions for C# in Unity, and thought I found it with LINQ. To my dismay, LINQ creates so many crazy intermediate objects to pull off its "laziness" that our GC high water mark was being crossed all the time. I went and just reimplemented everything from underscore.js in C# and got way better performance. I imagine this is probably not an issue on desktops/servers.


If you'd like to share your code, I'm a functional programmer at heart and am always looking for optimization opportunities for functional-style programming in C#. If you send me some of the code you were having trouble with I can see if your use case maps well into some compiler optimization strategies.


It's not mobile; it's old versions of Mono having a completely shit GC. Do you know whether Unity has upgraded to SGEN yet?


I don't believe so. I know we had to use the same pooling workarounds that everyone else seems to need because of this issue.


The interviewer doesn't see the difference between LINQ query syntax and extension methods, and he hasn't provided a single line of query syntax in his post.


You're confusing the syntax with the functionality.
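
For what it's worth, the query syntax is just sugar for the extension-method calls; a minimal sketch (words here is assumed to be something enumerable, e.g. the result of a Split):

    // These compile to the same thing, so both are "LINQ":
    var q1 = from w in words group w by w into g select g.Key;
    var q2 = words.GroupBy(w => w).Select(g => g.Key);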


In the vein of http://xkcd.com/353/ :

    >>> from collections import Counter
    >>> Counter("here are some words here are".split()).most_common(3)
    [('are', 2), ('here', 2), ('words', 1)]


What is it with people's obsession over lines of code?

As someone who doesn't do C# or LINQ, that second solution seems to me like someone really wanted to have as few lines of code as possible.

I don't claim to have an impressive programming pedigree, but while I take simplicity and performance into account, I never take "conciseness" into account. Conciseness usually means "this is opaque as shit but at least it's short". And who really cares about short? What's the purpose of "short"? None that I can find, other than impressing interviewers. Anyone who believes otherwise should probably be writing in Clojure or Haskell (and probably is), but I personally just don't see the point.

But that's just my opinion. My favorite language is Python.


Less code means fewer places to insert bugs and less to read. The majority of time spent "writing" software is actually spent reading the existing code, so "less to read" is really important. "Concise" does not mean "short"; it means "short and clear". Obviously if your short code is opaque or bug-prone then you're defeating the purpose.


The O(n) solutions sound way more interesting!

1. Iterate through all key/value pairs once, keeping track of the 10 most common (see the sketch below).

2. Run quickselect 10 times.

3. Coolest (and an interview question in its own right): modify quickselect to return the top 10!
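
A minimal C# sketch of option 1 (names are mine; text is assumed to hold the input string, and it needs the usual System.Collections.Generic using): scan the counts once, keeping at most 10 entries in a small sorted set, so the selection step stays O(n).

    // Count the words.
    var counts = new Dictionary<string, int>();
    foreach (var w in text.Split(' '))
        counts[w] = counts.TryGetValue(w, out var c) ? c + 1 : 1;

    // One pass over the counts, keeping only the 10 largest entries.
    // The set never exceeds 11 elements, so each step is effectively O(1).
    var top = new SortedSet<(int Count, string Word)>();
    foreach (var kv in counts)
    {
        top.Add((kv.Value, kv.Key));
        if (top.Count > 10)
            top.Remove(top.Min);   // evict the least frequent of the 11
    }
    // top now holds the ten most frequent words, least frequent first.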


> “Return the top 10 most frequently occurring words in a string.”

...

> var words = s1.Split(' ');

Wrong. Yet another example where an interviewer cannot correctly solve his own questions.


Care to elaborate? It probably lacks punctuation handling, case-insensitivity, and the StringSplitOptions.RemoveEmptyEntries option, but is there anything else missing?
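
A sketch of a split that at least covers those three (still naive about i18n; assumes using System.Linq and System.Text.RegularExpressions):

    // Lowercase, split on runs of non-word characters, drop empty pieces.
    var words = Regex.Split(s1.ToLowerInvariant(), @"\W+")
                     .Where(w => w.Length > 0);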


中国四分之一地区六亿人受雾霾影响 (a Chinese headline, roughly "600 million people across a quarter of China affected by smog") contains more than one word.


Depends on the definition of a word. I18n is normally hard, and likely would make for an exciting discussion instead of just a tech interview.


> there anything else missing?

You named more than I detected.


I'm not sure that language features making code more succinct really ruin the question.

The C# isn't even that succinct compared to doing a similar thing in other popular languages. For example, in Haskell:

  import Data.List (group, sort, sortBy)
  import Data.Ord (comparing)
  topTenWords :: String -> [String]
  topTenWords = take 10 . map fst . sortBy (flip (comparing snd)) . map (\l -> (head l, length l)) . group . sort . words


flip (comparing whatever) is a neat trick! I'll have to remember that.

For \l -> (head l, length l) I tend to use head &&& length (the &&& combinator from Control.Arrow).


The LINQ really is much more succinct, but the original code did not set the bar very high. Why should one write a 12-line comparing function when 'b.Value - a.Value' would work pretty much the same (unless C# really requires the comparison to return -1/1 instead of any negative/positive integer, which would be fixed by a sign function)?


Or do:

  private static int CompareKVPByCount(KeyValuePair<string, int> a, KeyValuePair<string, int> b)
  { return a.Value.CompareTo(b.Value); }

Or even:

  kvpList.Sort((a, b) => a.Value.CompareTo(b.Value));
But that would probably be straying into the author's "list of language features I've completely ignored for the last 5 years".


I think the analysis may've been skewed by the outlying points. They are almost certainly "high leverage points"[1] and so possibly exert an undue influence on the final trend lines.

[1]: http://en.wikipedia.org/wiki/Partial_leverage


Here's the Haskell golf

    import qualified Data.Map as M
    import Data.List
    import Data.Ord

    countWords :: String -> [String]
    countWords = map fst . take 10
               . sortBy (flip (comparing snd))
               . M.toList . M.fromListWith (+) . map (\w -> (w, 1)) 
               . words


The biggest benefit I find from using LINQ is readability. It allows the code to express what it is doing rather than how it is doing it. Compare:

    foreach (var item in list)
        if (SomeCondition(item)) return item;
    return null;
vs.

    list.FirstOrDefault(item => SomeCondition(item));


I would have used a weighted SET.

But it's funny: the first time I read the paragraph, I assumed the words were not separated by spaces, and you had to find occurrences of word combinations instead.

I instinctively made the test much harder than it really is. I'm damaged.


Good read. It also shows that interviewing can be a great learning tool. I've experienced that myself: while interviewing takes a lot of time investment on the part of the interviewer, it's also often a source of new insights.


Well, welcome to high-level programming

  count = take 10 . map head . reverse . sortBy (comparing length) . group . sort . words
That's ignoring Unicode rules for word splitting of course


Everyone knows "Hadoop is a distributed system for counting words." ;-)

https://github.com/twitter/scalding


That was a fun exercise :)

    input_string.split(/\W+/).inject(Hash.new(0)) {|acc, w| acc[w] += 1; acc}.sort {|a,b| b.last <=> a.last }[0,10]


Since we're comparing notes, here's the Mathematica version:

    Reverse[SortBy[Tally[StringSplit[#]], #[[2]] &]][[;; 10, 1]] &


The first projection isn't needed, which would eliminate the creation of 10 objects.


    text.Split(' ')
        .Where(x => !string.IsNullOrWhiteSpace(x))
        .GroupBy(x => x)
        .Select(x => new { word = x.Key, count = x.Count() })
        .OrderByDescending(x => x.count)
        .Select(x => x.word)
        .Take(10)
        .ToArray();
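
For comparison, a sketch of the same chain without the intermediate anonymous-type projection (the count is simply recomputed in the ordering step):

    text.Split(' ')
        .Where(x => !string.IsNullOrWhiteSpace(x))
        .GroupBy(x => x)
        .OrderByDescending(g => g.Count())
        .Select(g => g.Key)
        .Take(10)
        .ToArray();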


Perl 6:

    .say for (bag($text.words) ==> sort {-*.value})[^10]


here's a verbose ruby version:

    def topx(str, x)
      c = Hash.new(0)
      str.split(/\s+/).each { |s| c[s] += 1 }
      c.sort_by { |k, v| -v }.take(x)
    end


Isn't .ToList() part of LINQ too? Technically the first solution is also using LINQ. It would be interesting to make them implement the sort algorithm too.


LINQ is my favorite API of all times. It rocks.


I totally agree. Sure, there were similar things in the functional world before, but LINQ had some important pros:

- deferred execution by default, saving memory and time

- step by step syntax, each new operation is at the end, not the beginning

- excellent type inference and IntelliSense (JS? Ruby?...)

- it works with the same syntax on the database!!! Haskell?

- map and filter were there, but group by and join were not so common in previous query-comprehension APIs.

- the most important: it's actually usable in jobs you get paid for, not just experiments you do at home or at university.

There are, however, two things that keep it from being 100% perfect:

- expression-tree lambdas look identical to non-expression ones, making it hard for developers to know whether a given step is going to be translated (e.g. to SQL) or executed in memory. I would have chosen => for non-expression lambdas and -> for expression ones, for example, or something like that.

- having two syntaxes, method chains and query comprehensions, produces a frequent, annoying back and forth, since some operators are better written in one (let, join, group by) while others are only available as method calls (Take, ToDictionary, ...); see the sketch below.
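
A small sketch of that back and forth, reusing the word-count query from upthread (the variable names are mine):

    // let / group by / orderby read naturally in query syntax...
    var query = from w in text.Split(' ')
                group w by w into g
                let count = g.Count()
                orderby count descending
                select g.Key;

    // ...but Take and ToList exist only as extension methods,
    // so the query ends up wrapped in a method chain anyway.
    var top10 = query.Take(10).ToList();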


Why does this "ruin" a question?


Because it spoils the interviewer's ability to feel smug.


So, here are the Clojure, Mathematica, Python, Ruby, Bourne shell, Haskell, and Scala solutions posted in the other comments, all of which are simpler than the C++, C#, and JS solutions, collected here alongside them with some minor cleanups:

    (take 10 (reverse (sort-by (comp first rest) (frequencies (string/split ... #"\s+"))))) ; llambda Clojure

    // haakon Scala
    s.split(' ').groupBy(identity).mapValues(_.size).toList.sortBy(-_._2).take(10).map(_._1)

    (->> (string/split s #"\s+") frequencies (sort-by val) reverse (take 10)) ; aphyr Clojure

    var top = (from w in text.Split(' ')  // louthy C# LINQ
               group w by w into g 
               orderby g.Count() descending 
               select g.Key).Take(10);

    collections.Counter(s1.split()).most_common(10) # shill Python

    d = {}  # shill Python without collections library
    for word in s1.split(): d[word] = d.get(word, 0) + 1
    print [(x, d[x]) for x in sorted(d, key=d.get, reverse=True)][:10]

    words = s.split()    # spenuke and abecedarius probably O(N²) Python
    sorted(set(words), key=words.count, reverse=True)[:10]

    d3.entries((s.split(" ").reduce(function(p, v){  // 1wheel JS with d3
        v in p ? p[v]++ : p[v] = 1;
        return p;}, {})))
      .sort(function(a, b){ return a.value - b.value; })
      .map(function(d){ return d.key;})
      .slice(-10)

    # kenuke O(N²) Ruby:
    str.split.sort_by{|word| str.split.count(word)}.uniq.reverse.take(10)

    counts = Hash.new { 0 } # my Ruby
    str.split.each { |w| counts[w] += 1; }
    counts.keys.sort_by { |w| -counts[w] }.take 10

    # aaronbrethorst ruby
    str.split(/\W+/).inject(Hash.new(0)) {|acc, w| acc[w] += 1; acc}.sort {|a,b| b.last <=> a.last }[0,10]

    Commonest[StringSplit[string], 10]  # carlob Mathematica

    Reverse[SortBy[Tally[StringSplit[#]], #[[2]] &]][[;; 10, 1]] &  # superfx old Mathematica

    $a = array_count_values(preg_split('/\b\s+/', $s)); arsort($a); array_slice($a, 0, 10) // Myrth PHP

    tr -cs a-zA-Z '\n' | sort | uniq -c | sort -nr | head  # mzs and me sh

    -- lelf in Haskell
    take 10 . map head . reverse . sortBy (comparing length) . group . sort . words

    # prakashk Perl6
    .say for (bag($text.words) ==> sort {-*.value})[^10]

    # navinp1912 C++
     string s,f;
     map<string,int> M;
     set<pair<int,string> > S;
     while(cin >> s) {
             M[s]++;
             int x=M[s];
             if(x>1) S.erase(make_pair(x-1,s));
             S.insert(make_pair(x,s));
     }
     set<pair<int,string> >::reverse_iterator it=S.rbegin();
     int topK=10;
     while(topK-- && (it!=S.rend())) {
             cout << it->second<<" "<<it->first<<endl;
             it++;
     }


I thought I'd maybe take a look at Afterquery: http://afterquery.appspot.com/help

Although I haven't tested it, I think the Afterquery program to solve this, assuming you first had something to tokenize your text into one word per row, would be something like

    &group=word;count(*)
    &order=-count(*)
    &limit=10
which, though perhaps less readable, is simpler still, except for Mathematica. More details at http://apenwarr.ca/log/?m=201212.

Perl 5, perhaps surprisingly, is not simpler:

    perl -wle 'local $/; $_ = <>; $, = " "; $w{$_}++ for split; print @{[sort {$w{$b} <=> $w{$a}} keys %w]}[0..9]'
And neither is this, although it uses less code and less RAM:

    perl -wlne '$w{$_}++ for split; END { $, = " "; print @{[sort {$w{$b} <=> $w{$a}} keys %w]}[0..9]}'
I was surprised, attempting to solve this in Common Lisp, that there's no equivalent of string/split in ANSI Common Lisp, and although SPLIT-SEQUENCE is standardized, it's not included in SBCL's default install, at least on Debian; and counting the duplicate words involves an explicit loop. So basically in unvarnished CL you end up doing more or less what you'd do in C, but without writing your own hash table. Lua and Scheme too, I think, except that in Scheme you don't even have hash tables.


Small C++11 improvement:

     string s,f;
     map<string,int> M;
     set<pair<int,string>> S;
     while(cin >> s) {
             M[s]++;
             int x=M[s];
             if(x>1) S.erase(make_pair(x-1,s));
             S.insert(make_pair(x,s));
     }
     auto it=S.rbegin();
     int topK=10;
     while(topK-- && (it!=S.rend())) {
             cout << it->second<<" "<<it->first<<endl;
             it++;
     }

Surely it could be improved even further with help from lambdas and the standard algorithms.


> Perl 5, perhaps surprisingly, is not simpler:

Here's one.

    perl -0777 -nE '$w{$_}++ for split; say for (sort {$w{$b} <=> $w{$a}} keys %w)[0..9]'
It is slightly different from your version in that each word is printed on a separate line. Printing all the words on the same line is just a bit longer:

    perl -0777 -nE '$w{$_}++ for split; $, = $"; say((sort {$w{$b} <=> $w{$a}} keys %w)[0..9])'


(his name is Jon Skeet)

(sorry)


     string s,f;
     map<string,int> M;
     set<pair<int,string> > S;
     while(cin >> s) {
             M[s]++;
             int x=M[s];
             if(x>1) S.erase(make_pair(x-1,s));
             S.insert(make_pair(x,s));
     }
     set<pair<int,string> >::reverse_iterator it=S.rbegin();
     int topK=10;
     while(topK-- && (it!=S.rend())) {
             cout << it->second<<" "<<it->first<<endl;
             it++;
     }


Not bad. I think it would be a little simpler and faster with:

    while (cin >> s) M[s]++;
    for (map<string,int>::iterator i = M.begin(); i != M.end(); i++) {
             S.insert(make_pair(i->second, i->first));
    }
But maybe there's a downside to that approach that isn't obvious to me?


If you move the while (topK--) selection into the map-building loop, it becomes online code for the top K, whereas what you wrote is offline. If offline is fine, pushing the pairs into a priority_queue and then popping them out would be much faster.


He should try adding AsParallel() to the expression. I bet there would be a drastic speedup as long as he has a multi-core machine.
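
Something along these lines, presumably (a sketch, not the post's exact code; whether it actually helps depends on the input size and the cost of the grouping):

    var top = text.Split(' ')
                  .AsParallel()
                  .GroupBy(w => w)
                  .OrderByDescending(g => g.Count())
                  .Select(g => g.Key)
                  .Take(10)
                  .ToList();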



