The above is a Clojure one-liner example that I believe satisfies the original problem. So while LINQ may have simplified things relative to the C-family solutions he had seen, it's clearly possible to take it one step further with the expressivity of modern languages like Clojure...
Edit: remember to sort! (Forgot my coffee this morning...)
Edit again: and aphyr's solution is even more concise and idiomatic, where `s` is the first paragraph of the blog post:
from word in text.Split(' ')
group word by word into g
let count = g.Count()
orderby count descending
select g.Key
Under the covers, when it is compiled it's turned into a select. That way it only does N counts, not potentially N log N (one for each comparison in the orderby), where N is the number of items you enumerate (so if you .Take() just 10 it doesn't really matter).
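For what it's worth, Python's sorted() makes the same optimization: the key function is evaluated once per element (decorate-sort-undecorate), never once per comparison. A quick sketch to confirm:

```python
# sorted() calls the key function N times, not once per comparison,
# mirroring how LINQ's orderby precomputes its sort keys.
calls = []

def counted_key(x):
    calls.append(x)  # record every key evaluation
    return x

result = sorted([5, 3, 1, 4, 2], key=counted_key)
print(len(calls))  # 5 -- one key evaluation per element
```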
While we're golfing, (comp first rest) is equivalent to the stdlib function called second, or, perhaps more idiomatic for map entries, val. We can also flatten deeply nested chains of computation using the -> and ->> macros, like so:
Think you also need a (map key) there to be equivalent to the C# code. At which point it's starting to look very similar to well written C# code solving the same problem.
Much as I love Clojure, it's worth pointing out that Clojure's lazy sequences (including intermediate sequences) get cached, while LINQ evaluates more like Clojure's reducers. (You can even get it to do so in parallel.)
On a side note, I'd like to point out a key difference between this Clojure example and corresponding variants in C#, Python, &c: the Clojure variant has no variables. This isn't just a matter of concision: coming up with descriptive names is hard, and usually means duplicated information, either in the name or the type declaration. Often, those names are re-used over and over again with subtly different meanings, for each stage in a pipeline--forcing the reader to reason carefully about the declaring scope at each use.
Same operation--but with (doto), four uses of a variable and one type declaration have been cleared away. Consider the original post:
var words = s.Split(' ');
var wordCounts = words.GroupBy(x => x).Select(x => new { Name = x.Key, Count = x.Count() }).OrderByDescending(x => x.Count);
var countedWords = wordCounts.Select(x => x.Name).Take(10).ToList();
return ExtractTopTen(countedWords);
Six uses of three formal variables, including the bewilderingly confusable countedWords and wordCounts, plus nine uses of the delightfully generic "x". Even in the more compact C# example from this thread, consider:
var top = (from w in text.Split(' ')
group w by w into g
orderby g.Count() descending
select g.Key).Take(10);
This variant refers to the temporary variables w (for "words") and g (for "groups"?) six times. Both authors felt the need to reduce the repetition of variables, but the best they could do was to choose single-character names.
This is the real power of ->, .., ->>, doto, and friends: eliminating names for things. By thinking about the composition of transformations, instead of the intermediate results, you can make an algorithm easier to understand and change.
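The same point carries over to Python, for comparison: with a couple of stdlib functions the whole pipeline is one expression with no intermediate variables at all (a sketch, not any poster's original code):

```python
from collections import Counter

text = "the cat and the hat and the bat"
# tokenize, count, rank, project -- one expression, zero named intermediates
top = [word for word, _ in Counter(text.split()).most_common(2)]
print(top)  # ['the', 'and']
```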
Yeah, I tend to prefer the lambda notation over the sqlish notation, but either way the lambda form of your solution would be similarly terse. The select call that constructed an anonymous-typed object was unnecessary, as were all the temporary variables... C# coders should never be afraid to embrace the Rubyish leading dot approach.
The thing being that C# has a feature for a whole bunch of idioms, but none of them are extensible and writing new ones requires a new version of the language. The Clojure reducers, doto, ->>, go macro could all have been implemented in 1.0. (And two of them were.)
I speak as someone who codes in C# for work and Clojure as a hobby. There are lots of things that just aren't quite possible in C#, and _everything_ is possible in Clojure.
Pointfree is beautiful to read and grok. It's a little trickier to debug, since pretty much every debugger on earth needs points to inspect values.
> Six uses of three formal variables, including the bewilderingly confusable countedWords and wordCounts, plus nine uses of the delightfully generic "x".
I totally agree with you on this.
>Even in the more compact C# example from this thread, consider:
> var top = (from w in text.Split(' ')
> group w by w into g
> orderby g.Count() descending
> select g.Key).Take(10);
> This variant refers to the temporary variables w (for "words") and g (for "groups"?) six times. Both authors felt the need to reduce the repetition of variables, but the best they could do was to choose single-character names.
> This is the real power of ->, .., ->>, doto, and friends: eliminating names for things. By thinking about the composition of transformations, instead of the intermediate results, you can make an algorithm easier to understand and change.
You lost me somewhere along the way....are you saying the "from w in text"... snippet is bad, and something more along the lines of "This is the real power of ->, .., ->>" is more appropriate?
I ask because that code seems extremely readable to me. Personally, I don't give a shit if it's 30% more verbose or runs 50% slower, for 99% of code (written in the world), optimum performance doesn't matter. Maintainability does matter though. All of this software being written today has to be either maintained by someone, or replaced by something else. And the top ~2% of programmers like you sure as hell aren't going to be taking maintenance jobs any time soon.
I'm curious what the thoughts of a technically smart person such as yourself are on the subject of what companies will be left with 5 to 10 years down the road when consultants have come through and implemented using the currently most optimum platform/language/algorithms?
And I honestly don't mean for this question to be disrespectful. I'm just coming from a situation where I'm a former developer but on a project where I'm not coding, and I ask for features and the developers say they can't do it, or it will be a performance problem. And I know these guys are far more like you than me intelligence/education wise, but the things I ask for I've done tons of times in the past with 10 to 1000 times the data size, without a problem, on far older hardware.
I'm just curious what kind of a support problem you ultra smart people are leaving behind, or if you ever think about the idea that almost no one else is as smart as you?
Over the last ten years, I've read a lot of code. I maintain my own code, and I maintain others' code. Because I am not that smart, I try to simplify everything--to make large problems tractable for my little brain to reason about. In this process, I've come to the conclusion that good, maintainable code is:
1. As simple as humanly possible, so that the algorithm is easy to understand and change. Each component is isolated and can be understood and modified in isolation. Boundaries between component and environment are clearly thought-out.
2. As simply expressed as possible: broken up into distinct, well-organized functions, with descriptive, regular names for functions and variables, in context.
3. Well-documented; each function and each namespace come with contextual docs explaining their motivation, arguments, consequences, invariants, etc., with examples.
4. As short as possible, because humans have trouble holding large amounts of context in their head. Minimize the amount of scrolling or jumping between files necessary to understand the algorithm.
5. Well-tested, so that changes can be made freely. The test suite needs to be fast, so one can get feedback within seconds of making a change to the file; ideally a few milliseconds. A balance of typechecking, logical tests with mocks, integration tests, and full stress tests provides a continuum of safety.
I don't see these goals as particularly constrained to any language, but I will say that I feel best able to achieve them in a Lisp. Dunno whether that helps you project your problems at work onto me personally, though. ;-)
If the aim is golf, it's probably best to keep #" ". If the aim is accuracy, we might get better results by using #"[^\w]+". Though that does nasty things to words like "it's".
Edit: Actually, we can also get rid of the 'reverse' by replacing (sort-by val) with (sort-by (comp - val)). Not going to be shorter in terms of character count, though we win on line count.
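The same trade-off shows up outside Clojure; in Python, for instance, splitting on the regex [^\w]+ instead of a literal space handles punctuation but mangles contractions exactly as described (illustrative sketch):

```python
import re

s = "It's over -- the end."
naive = s.split(' ')
# regex split handles punctuation, but tears "It's" into "It" and "s"
better = [w for w in re.split(r"[^\w]+", s) if w]
print(better)  # ['It', 's', 'over', 'the', 'end']
```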
It looks like collections is part of the standard library: http://docs.python.org/2/library/collections.html. I think this disqualifies your JS solution, as long as doStuff isn't some fancy nodejs core module I missed.
I think the point still stands though, when you think of it as an interview question. The point is to get to the bottom of how the candidate would process such a query themselves, not to test their knowledge of the core libraries of their favourite language.
I feel like that's what the post author meant when he said that LINQ had "ruined" the question, because it kind of does the same thing.
Isn't it kind of axiomatic that any question that provides for a concise and measurable solution will eventually be refactored into a library? It seems like kind of a weakness in these kinds of "programmer" questions, that they all regress to trivia. Why not an open-ended business problem and/or deliverables-type question? I'm sure that would speak more to the daily job, where I've never heard of someone being tasked with "OK, find the top 10 words by frequency in a random Wikipedia article."
My memory of the math homework you describe brings up two images: geometry proofs and story problems. Never did I have to explain multiplication when doing my times tables, nor quadratics.
collections.Counter.most_common() uses heapq.nlargest() which is slightly more efficient than a full n*log n sort.
But implementing a heap correctly from scratch is I guess outside the scope the OP intends for the interview question.
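For reference, the relationship is easy to see from the stdlib: most_common(k) with a given k delegates to heapq.nlargest, which is O(n log k) rather than O(n log n). A small sketch:

```python
from collections import Counter
import heapq

words = "the cat sat on the mat the cat".split()
counts = Counter(words)

# Counter.most_common(k) uses heapq.nlargest under the hood
top2 = counts.most_common(2)
# the equivalent call, spelled out
top2_heap = heapq.nlargest(2, counts.items(), key=lambda kv: kv[1])
print(top2)  # [('the', 3), ('cat', 2)]
```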
I would expect a good candidate to:
1. know that a heap is optimal here (remembering whether Counter uses it is optional)
2. express reluctance to implement it from scratch because surely the stdlib can do it better.
So IMHO using libraries like this ruins the question only in the sense of showing it's not a challenge for the candidate. Something so simple should not take a page of code.
If they're stuck on a version of software that hasn't been supported in 5 years for bugs or security with a set of features that was originally released nearly 10 years ago, it means that there are some culture issues in the workplace that will most likely drive me insane.
Or, use a defaultdict(int), or more in line with yours:
d[word] = d.get(word, 0) + 1
dictionary.get is quite useful.
Also, I'd consider s.split(None), instead of s.split(' '). It will group whitespace, so that any double space or other whitespace is collapsed into one delimiter.
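Putting both suggestions together, a minimal sketch of the counting loop:

```python
def word_counts(s):
    # split(None) -- or just split() -- collapses runs of whitespace,
    # so double spaces, tabs, and newlines all act as one delimiter
    d = {}
    for word in s.split(None):
        d[word] = d.get(word, 0) + 1
    return d

print(word_counts("a  b\ta\na"))  # {'a': 3, 'b': 1}
```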
The question didn't ask for word counts, so I didn't see the need for a dictionary. I'd appreciate any advice on my solution. I'd be thrilled if I'm not too far off from being capable of starting to apply for jobs.
The need for the dictionary shows up when your text gets significantly large. Each call to `words.count` is going to re-examine every word of the text to count 'em up, so, if n is the number of words in the text, and m is the number of distinct words in the text, then this solution is at least O(mn + nlog(n)) whereas the dictionary-based solution is O(nlog(n)). That is, we're re-reading the word list over and over, whereas the dict-based solution only reads it once, so it's likely to be far more efficient.
But I like the readability of this solution and there's a strong argument to be made for it on that basis, especially if the string is short. If this were a job interview, this would be a totally acceptable solution, though it'd be important to be able to discuss why other solutions might be faster and why you prefer this one anyway.
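To make the comparison concrete, here's a hedged sketch of both shapes (not the poster's exact code):

```python
def top_words_list(words, k):
    # list.count rescans all n words for each of the m distinct words: O(m*n)
    return sorted(set(words), key=words.count, reverse=True)[:k]

def top_words_dict(words, k):
    # one O(n) counting pass, then a sort over only the m distinct words
    counts = {}
    for w in words:
        counts[w] = counts.get(w, 0) + 1
    return sorted(counts, key=counts.get, reverse=True)[:k]

words = "the cat the dog the cat".split()
print(top_words_list(words, 2))  # ['the', 'cat']
print(top_words_dict(words, 2))  # ['the', 'cat']
```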
Isn't the difference between O(mn + nlog(n)) vs O(nlog(n)) running time going to get less significant as the value of n gets larger?
I thought the whole point of Big O / asymptotic analysis is that you can ignore lower-order terms and constant factors because they are insignificant for any appreciably large input size. And also because the lower order terms and constant factors vary too much depending on the programming language, the compiler or VM, the hardware, etc.
I'd never heard of Big O until this thread, but from what I can tell it'll only be when a solution is in the O(logn) that "running time will get less significant as n gets larger". This is simply the way you describe logarithmic growth, so maybe that's where you got confused. Also, wouldn't it be fair to say that a logarithm (nlogn) is a lower order term than mn? In which case your definition stands.
At any rate, I wanted to test this out, so I made a naive benchmark for running these functions. The dict solution was ten times faster (0.0011s vs 0.015s) than the list version with ~1350 words. The dict solution ran in 0.13s at ~162,000 words, while I waited a couple minutes before killing the list version on that input.
Oh, you're right, m is not a constant but also varies somewhat independently of n, so you can't exclude it from the big O notation. And whether mn or nlogn is the lower order of the two terms depends highly on the value of m, which is the number of unique words in the text. A really long text that's just the same word over and over will have a big n but a small m.
Hey, thanks much! I'm going to have to spend some time with these logarithmic evaluations to really get what you're saying, but I dig the basic concept. Very helpful.
I think this version and spenuke's version are impractically slow, but yours is actually O(N²). If you give it an 824,000-word input, it will do roughly 0.7 trillion word comparisons, which you will probably not be willing to wait for. A more practical solution:
This is still O(N lg N) instead of O(N lg 10) like the Python version, but it's good enough this time; it still gave me ["the", "and", "of", "to", "And", "that", "in", "shall", "he", "unto"] reasonably quickly.
I'd be interested to see if there's a way to do this in a single expression in Ruby.
I actually was working on a hash-based solution first, using group_by, but I switched to using just array after seeing the Python version. I didn't really think about the time complexity of the count operation inside the sort_by operation, thanks for pointing that out.
You definitely can do a hash-based solution in a single expression in Ruby. Here's a very ugly and kludgy example that you could probably improve on if you wanted to. I don't think it's n^2 because the group_by just counts the occurrences of each word and returns a hash where the count is the key:
I'm also trying to work out a better way to do this using "chunk" because although hashes are fast to access, they are not fundamentally sortable, and sort_by returns a 2d array just like chunk does anyway.
You should always compare your results to what is expected. For a problem like this, use a small set of test data that can easily be counted and sorted in your head or on paper.
You forgot reverse=True and your results show the 10 least common words. ;)
This kind of error happens to all of us. That's why we have unit tests and QA teams. If you made this mistake during an interview I wouldn't give it much importance and we would have a good laugh about it.
Ha, nice! Funny you should mention that. I did exactly that when tooling around in the repl and forgot the reverse kw, saw something fishy, and fixed it. Promptly forgot it again when I typed it in this comment. :)
This is a great approach. Even better because you carefully examined the requirements and found an interpretation which allowed you to make an optimization (words only, not counts, therefore sets are useful).
Agreed, if I was looking for a candidate that will always "reinvent the wheel" then I would want them to implement from scratch. I find it perfectly valid for the interviewee to say "Hey! I wrote a module for <language> for this exact purpose! Want to check out my github page?".
Thinking.. two of these things belong together, two of these things are kind of the same.. but one of these things is doing his own thing...
Seriously, built-in method naming and argument ordering don't follow any consistent convention in PHP; that's always been the most irksome thing to me...
Uh, no. Your solution just uses a bunch of standard library functions (at least, I hope they're not syntactic forms… and why a function as specific as "frequencies" is not in some namespace boggles my mind). I could write that in C with an appropriate standard library.
Ironically, expressing this in something like SQL actually speaks to the expressivity of the language because the solution was produced entirely with syntactic forms.
That may be true, but it's still a pretty idiomatic Clojure solution. Ask 10 Clojure programmers to solve this problem and most of them will come up with something similar. Ask 10 C programmers and they will likely come back with a program much more similar to the one in the post.
The fact that such functions are at a Clojure programmer's fingertips is exceptionally important. If you strip out the core functions (and macros) from any lisp or Clojure you are left with almost nothing (see http://stackoverflow.com/questions/3482389/how-many-primitiv...).
`frequencies` is surprisingly useful, enough to justify its inclusion into core. I once pondered why.
Consider that in Clojure, a common act is transforming one data structure into another, and typically into only a few kinds: maps, sets, and some kind of sequence.
What is a highly common generic transformation of a sequence to a map? `frequencies`. Particularly since you'd only be doing this with pretty finite sequences, given the finite nature of the built-in maps. Finite here means countable. What generic, domain-independent thing would you be counting? Often, the items themselves.
> Ironically, expressing this in something like SQL actually speaks to the expressivity of the language because the solution was produced entirely with syntactic forms.
What? Pure untyped lambda calculus does things "only with syntactic forms". Don't tell me that your solution wouldn't use some basic operators other than rudimentary syntax. Keep in mind that in Lisp-family languages, many things that are a part of syntax in other languages are actually simple functions - arithmetic operations being a case in point - so you really can't avoid using library functions.
I believe that the major point of Clojure etc. is that they provide concise basic operations that you can use everywhere to compose complex operations better than you can do with, say, C#.
I'd say 13 characters isn't really a huge differentiator. They're both using exactly the same algorithm, almost exactly the same built-in library functions, and have exactly the same flexibility for programmers to add their own (C# is using extension methods which lets you add methods that look like they belong to the class, but are actually static methods defined elsewhere).
There's a lot of benefits to Clojure over C#, but the ability to chain functional list operators really isn't one of them.
You can use tr instead of sed and grep, too. Also -k1 and -n 10 are default for sort and head:
$ < bible-pg10.txt tr -cs a-zA-Z '\n' | sort | uniq -c | sort -nr | head
62265 the
38915 and
34588 of
13474 to
12846 And
12589 that
12387 in
9762 shall
9668 he
8942 unto
I have a `bins` script that is just `sort | uniq -c | sort -nr`, so that reduces to `tr -cs a-zA-Z '\n' | bins | head`.
Ooh, that's a clever use of reshape, but wrong sadly. Yet you're right, I could make it simpler; it was just what came first to mind, and I did not even check it. Imagine typical sentences:
Also the link mentioned using wiki articles for testing, so they would have paragraphs and that's where reshape shines for things like emails, but fails here:
I think it goes beyond the original question now. :) If one wants to do proper thing, then definitely contractions, word variations, etc should be considered. For this, I guess something like OpenNLP can be used.
What's interesting though is how UNIX shell, being essentially a symbolic FP language, allows one to solve the problem in a clear and concise way. And if one desires, the program can be easily modified to read sentences, say, from network from ssh tunneled via HTTPS. That kind of flexibility can rarely be achieved in languages like C# with tight coupling between the units of abstraction.
I like how you snuck in "modern languages (like Clojure)", as if C# is not a "modern language". Both languages have their place, but let's not pretend that Clojure is somehow superior for all applications.
You'll need to sort the result of frequencies: it's an unordered map.
In practice, if you use ->> and write neater LINQ, the code is very close. Although LINQ only has group and sum/reduce, no dedicated frequencies function.
No expert, but from the source it looks like it just uses whatever a 'map' uses to match keys:
(defn frequencies
  "Returns a map from distinct items in coll to the number of times
  they appear."
  {:added "1.2"
   :static true}
  [coll]
  (persistent!
   (reduce (fn [counts x]
             (assoc! counts x (inc (get counts x 0))))
           (transient {})
           coll)))
That works as well. On the other hand, I'm not a fan of these huge one-liners (including the one I posted). I split mine up a bit and posted it here: https://coderwall.com/p/ha-c2w
LINQ tends to get thought of as "database syntax sugar", but it's way more than that. It's C#'s version of the lazy collection operations you find in most functional languages, just given friendlier sqlish names (i.e., "Select" instead of "map" and "Where" instead of "filter").
I almost feel a bit gross when I have to write a "foreach" loop at this point, because there's almost always an equivalent way to do it in LINQ (although it's a tradeoff, as the one downside of LINQ is that, given its deferred nature, it's harder to debug).
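The correspondence is direct; in Python terms the same deferred pipeline (a rough sketch) looks like:

```python
words = "the cat sat on the mat".split()

# LINQ: words.Where(w => w.Length == 3).Select(w => w.ToUpper())
pipeline = map(str.upper, filter(lambda w: len(w) == 3, words))

# like a LINQ query, nothing runs until the result is materialized
result = list(pipeline)
print(result)  # ['THE', 'CAT', 'SAT', 'THE', 'MAT']
```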
>It's C#'s version of the lazy collection operations you find in most functional languages, just given friendlier sqlish names (i.e., "Select" instead of "map" and "Where" instead of "filter")
Actually........
It's more generic than that. It's C#'s monadic comprehension syntax, à la Scala's for and Haskell's do.
It works on more than just IEnumerable, it actually works on anything that implements Select, SelectMany, and Where...
This is super obvious if you watch some of Erik Meijer's excellent videos on Channel 9. Recommended viewing for anyone reading this who hasn't. Watching him bounce between Haskell and C# felt like someone explaining etymology by moving between Latin and English.
The most annoying thing about LINQ is that it borrows its names from SQL. So unless you use C# and LINQ all the time, it's hard to figure out which functions translate to the ones used by every functional language in the world.
I wish they would at least alias map, reduce, some, etc.
Many C# users have experience with SQL, but not necessarily with functional programming languages. The number of these users likely far outnumbers the number of C# programmers who know functional programming languages, but not SQL.
It makes sense to use terms that many C# (and SQL) users are already familiar with, rather than terms that they may not know of.
I understand the rationale, but in my experience knowing some SQL didn't help me understand LINQ at all. It was only after picking up some functional experience elsewhere and coming back to C# much later that LINQ made sense to me. I'd be really curious to see how the SQLesque naming's worked for others, but it was a bit of a miss for me personally.
tbh I don't think either are particularly intuitive in a "can guess it" sense. Both are intuitive in a "oh, that's what it's called, I can remember that now" sense.
I do Lazy LINQ, by which I mean I write the looping code with "foreach"es and get it debugged & working, but then allow Resharper to convert it to LINQ if it can.
I think this is mostly because I'm still not comfortable with the syntax, and partly because of the set/iterative impedance mismatch that is there. "It's not you, LINQ. It's me."
You would really benefit from learning a functional language (any) to help you roll your own LINQ. Some of Resharper's refactorings to LINQ are awful and do not make for cleaner code.
> I almost feel a bit gross when I have to write a "foreach" loop at this point
Agreed. I've transitioned to spending most of my time in JS, and wherever I can I use .map(), but the chaining isn't quite the same as LINQ. Someday I intend to write a library of Array addons to provide GroupBy and so on, but I can't imagine it'll be super efficient.
This is the problem with doing functional list stuff in languages that permit side effects: the order of execution for each function has to be strictly defined (in both senses of the word), so the compiler can't optimise the code. In Haskell the solution (see my comment on the original page) would probably (I haven't actually checked the GHC core output) be folded into a single loop.
The problem with this is that it's doing multiple iterations. LINQ on the other hand builds up an expression tree that is "compiled" into a single loop when `ToList()` (or some other method which gets the results) is eventually called.
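Python generators give the same shape of deferral, for comparison (a sketch):

```python
log = []

def source():
    for x in [1, 2, 3]:
        log.append(x)  # record when the 'data source' is actually read
        yield x

doubled = (x * 2 for x in source())      # nothing has executed yet
filtered = (x for x in doubled if x > 2)  # still nothing
assert log == []                          # deferred, like LINQ before ToList()

result = list(filtered)                   # one interleaved pass over the data
print(result)  # [4, 6]
```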
Yes.. I do understand why extending the base Array object in JS is deeply frowned upon but at the same time I often wish that these sort of extensions were on the Array object. It feels clunky to use them otherwise.
I am a big fan of LINQ but it gives and takes away complexity at the same time.
It makes a 'whole class of things that you would have to do with loops' go away.
Once you are comfortable with the syntax, it makes code a lot more readable.
But you lose track of when and where things are getting executed. For example, it is easy to make something that you intend to execute inside of SQL Server run inside of C# code. And then, all of a sudden, string comparison is case-sensitive.
You wind up having to context-switch between procedural and set mentality without the same kind of visual cues you used to get.
It's also really easy to shoot yourself with unless you understand everything and think each problem through.
Four killer issues I've seen so far:
We had a major production performance issue which turned out to be a stray ToList which was causing a massive memory ballooning. Didn't get noticed in test as the test cases passed but it hit prod and 2000 users bashed it and tried to allocate 20Mb each causing our cluster to shit a brick.
Null reference exceptions! There are so many dereferencing operations in an average LINQ expression that you really have no idea which one is blowing if it goes pop in production in release config.
People using .Single(...) and getting more or less than one result back. So frustrating.
If you push an IEnumerable<T> over an interface boundary the performance and memory semantics are not preserved and you end up with a leaky abstraction. These are shits to resolve. Example: queries executing inside the view which is outside the transaction scope.
We try to use ICollection in our APIs instead of IEnumerable, since the latter can have surprising semantics, like being a wrapper for some operation which may not be valid anymore, or being slower than you expect at things like .Count(). IMO it's really not best for transporting across interface boundaries in the most common case; only when you're specifically trying to avoid having the whole collection in memory or something like that.
Another thing that can help is this wonderful May<T> library[1]. It's a great option type[2] for .NET. It helps make operations more composable.
> But you lose track of when and where things are getting executed.
That is the curse of modern software development. But LINQ is a minor offender here compared to some kinds of remoting, code reflection driven by endless XML configs wrapped in a gazillion interfaces, and some inventive ORMs. Sadly, our tools have not kept up with the complexity that is external to the executing code. The debugger in the pre-inversion-of-control/bean-injection days was almighty because you had all of the program state in front of you.
A little known fact is that if you wrap Java/C# code in enough unit tests you get Haskell.
I've run into this a little bit as well, but I feel the benefits far outweigh the disadvantages. Once you understand the basics, your code becomes so much cleaner.
You do have to be careful for hidden pitfalls, like an accidental N+1 select against the database, but even that is usually fixed pretty easily if you simply call .Include() in the original expression.
From the article, I really don't understand the concept of not updating your knowledge of Linq. It's been out for years now. And then the first concern is with performance? That sounds very problematic to me. There are a lot of .NET developers out there that are still stuck in the previous decade. There are a lot of shops that are still on 2.5, which is too bad, because that was when the framework and C# really started to take off.
There is also some trickiness around when things get executed. If you return an IEnumerable<T> that is really a LINQ query, and the caller doesn't immediately enumerate it, they run the risk of getting exceptions when they do enumerate if you mutate the collection the query runs over. You can of course force its evaluation, but that also loses some of the benefits of the laziness. Bring in multi-threading and even if you properly synchronize collection access (and that is hard to do without custom collection types, since LINQ is like the honey badger w.r.t. synchronized access) you can still see the collection modified exceptions talked about above.
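Python's lazy iterators have the same failure mode, which makes the hazard easy to demonstrate (a sketch):

```python
d = {"a": 1, "b": 2}
pending = (k.upper() for k in d)  # a deferred 'query' handed to a caller

d["c"] = 3  # the collection is mutated before the caller enumerates

try:
    list(pending)  # enumeration finally happens here...
except RuntimeError as e:
    print(e)  # ...and blows up: dictionary changed size during iteration
```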
The "execute inside SQL but ends up in C#" is because of the terrible decision of the C# language designers to make reified code (expression trees) have the exact same syntax as lambda functions that are normal code.
If you had to tell the compiler whether you wanted code or a tree, that problem would be solved. It'd also be one step closer to allowing type inference for lambdas assigned to locals.
I don't really agree with your statement, but I did find it perplexing that a C#/.NET technical interviewer would not know the fundamentals of LINQ 5 years after its inception.
Time flies when you're building stuff, I know, but LINQ was actually announced at PDC 2005 - 8 years ago this fall.
That a C# developer can seriously say they're not up to speed with "new stuff" like LINQ (2005) and dynamic (2008) in 2013 is kind of ridiculous. What they mean is that they figured out how to map what they learned about Java in school into C#'s syntax and now they're done learning.
LINQ is a great mechanism for expressing complex operations in a succinct manner. Sure, LINQ may not be a magic bullet and can be abused, but you know the old adage about premature optimisation. Perhaps you could elaborate on "It has some serious pitfalls.".
I didn't write the post you responded to, but I can tell you of the main pitfall I have encountered. By far the worst is evaluating the same enumerable again and again by mistake. It doesn't show up in unit tests because they use small data sets, but as soon as you do any kind of load test or put it into production it's painfully slow. The solution is often very simple, but you really have to get your hands dirty before you recognize where it is a problem.
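The pitfall is easy to reproduce in miniature; an iterable that re-runs its query on every enumeration, like a deferred LINQ IEnumerable, quietly multiplies the work (a hypothetical sketch):

```python
class QueryResult:
    """Mimics a deferred LINQ query: each enumeration re-executes the work."""

    def __init__(self, data):
        self.data = data
        self.evaluations = 0

    def __iter__(self):
        self.evaluations += 1  # imagine a database round-trip here
        return iter(self.data)

rows = QueryResult([1, 2, 3])
sum(rows); max(rows); min(rows)  # three innocuous-looking passes
print(rows.evaluations)           # 3 -- the 'query' ran three times

cached = list(rows)               # materialize once...
sum(cached); max(cached)
print(rows.evaluations)           # 4 -- ...and no further re-evaluation
```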
I disagree, but for different reasons than the other commenter. LINQ is great for transformations that are intuitively described by the verbs LINQ provides. If your transformation is most easily understood as a process of mapping, filtering, grouping, selecting, etc., then by all means. But trying to pigeonhole every iterative process into LINQ is a recipe for code that's hard to decipher. As always, it's a question of how one communicates a process most effectively. Succinctness for its own sake is a fool's errand. Clarity is key, and sometimes that means foregoing LINQ for standard iteration.
I answered a phone interview question once with LINQ -- it was about combining string arrays while avoiding duplicates.
The HR interviewer followed up my answer asking about runtime and memory usage. While these are good questions, I got the feeling they didn't want to receive an answer using a single line of LINQ.
but seriously, if you're writing c# and not using LINQ all the time
You don't sound arrogant, but rather sound naive. I agree that someone who recruits surely should have known about and have experienced LINQ significantly by now, but the notion that you should be using it "all the time" is absolute nonsense.
I avoid LINQ. I encourage others to avoid LINQ. It is almost always a sign of bad code.
LINQ is syntactical sugar over basic set operations. It is perfectly fine if you're doing naive activities, such as the example given -- a contrived example of brute forcing a problem, where two approaches of the same very basic need unsurprisingly yield the same complexity -- but it falls apart in real-world persistent code with considered algorithms and storage. In real long term code, it is usually the canary in the mineshaft telling you that the developers aren't using proper algorithms or storage.
In many large-scale projects it invariably turns into the performance nightmare that ends up causing whole rewrites.
Again, not because of LINQ itself, which of course can do basic operations like grouping and sorting as quickly as you could do "by hand", but that it makes it so easy to do those things that developers start to resort to that as a catch-all magical solution that is costless because how much could a line or two of code cost? -- a master List<stuff> that they just sort and group by and select from all over the code (O(n) * O(n) * O(n)...why not?). The end result is that the data structures and encapsulation that should have happened never did, so while it might seem more concise and obvious on a one-to-one comparison with a loop perspective, neither case should ever have happened.
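The "master list queried from all over the code" anti-pattern described above can be sketched in Python with made-up data: the cost isn't any single query, it's that a cheap-looking one-liner hides a full scan, and scattered call sites multiply those scans where one purpose-built structure would do.

```python
from collections import Counter

# Hypothetical "master list" of the kind the comment describes. Every
# call below that re-scans it costs O(n); sprinkle such calls through a
# codebase and the scans multiply.
stuff = ["pear", "apple", "pear", "plum", "pear", "apple"]

# Anti-pattern: a fresh linear scan at every call site. Each .count()
# walks the whole list again.
naive = [stuff.count(w) for w in ("pear", "apple", "plum")]

# Encapsulation: build the appropriate structure once (one O(n) pass),
# then answer every later question in O(1).
counts = Counter(stuff)
fast = [counts[w] for w in ("pear", "apple", "plum")]
```

The second form is what "proper data structures and encapsulation" buys: the representation matches the access pattern instead of being re-derived on every use.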
LINQ is the gun with which a lot of terrible programmers repeatedly shoot themselves in the foot, all while gloating about their concise code.
(Note: When I say LINQ I am referring to the functional style it encourages, not the query syntax. The query syntax is nice, but it's just a trivial syntactic transformation.)
Correct me if I'm wrong, but the world is moving towards functional programming (i.e. LINQ) not away from it. Personally, I find LINQ far, far easier to read, write, and analyze. (On the other hand, I understand the deferred semantics and watch for warning signs like enumerating a sequence more than once.)
Honestly, a C# company avoiding LINQ sounds to me like the canary in the mineshaft telling you the company has programmers falling behind the times and doing things the hard way.
I'm definitely in agreement with the last point. However, I originally said that good C# code contains a lot of LINQ, not that C# code with a lot of LINQ is necessarily good. Deferred execution is something that confuses people, but frankly, it's a concept they need to learn. It's been in the language since yield return got added in 2.0.
LINQ's deferred execution was the right choice for performance, but it's the hardest one. You've got to know when to call .ToList(). I'm not denying that I've seen people evaluate the same expensive list 100 times, use a join when precomputing a Dictionary would have been much faster, close a connection before the result is actually evaluated. But I've never seen C++ programmers say you can make mistakes with pointers, so don't use them.
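The "precompute a Dictionary" point can be sketched in Python with made-up data: matching two sequences by nested scan is O(n·m), while building a lookup table first makes it O(n + m).

```python
# Hypothetical customer/order data for illustration.
customers = [(1, "Ada"), (2, "Grace")]              # (id, name)
orders = [(1, "apple"), (2, "pear"), (1, "plum")]   # (customer_id, item)

# Nested-loop "join": rescans all of customers for every order.
slow = [(name, item)
        for cid, item in orders
        for cust_id, name in customers
        if cust_id == cid]

# Precompute the lookup once (the Dictionary), then join in one pass.
by_id = dict(customers)
fast = [(by_id[cid], item) for cid, item in orders]
```

Both produce the same pairs; only the asymptotics differ, which is exactly the kind of difference that stays invisible on unit-test-sized data.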
Clojure's lazy sequences are guaranteed to evaluate once, but that comes at the expense of storage. In particular, a query like the one in the original blog post will evaluate multiple intermediate lists that then need to be thrown away. And indeed, they've introduced reducers to address this, which behave more like LINQ.
Well, there's two ways to do that in Clojure: write your own list processing (usually regarded as bad style) or use reducers. Both are appropriate in specific instances, but the fact remains that the lazy seq code above is the idiomatic way of doing it in Clojure. So the _default_ way of doing it is slower than LINQ's default way of doing things.
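The storage trade-off being discussed has a close analogue in Python, where list comprehensions are eager and generator expressions stream (an analogy only, not Clojure's exact caching semantics):

```python
# Eager staging: each step materializes a full throwaway list, akin to
# realized intermediate lazy seqs.
data = range(100)

evens = [x for x in data if x % 2 == 0]      # intermediate list #1
squares = [x * x for x in evens]             # intermediate list #2
eager_total = sum(squares)

# Fused pipeline: the filter and map stages stream one element at a
# time into sum(), holding no intermediate sequence -- closer to how
# LINQ (or Clojure's reducers) evaluates.
lazy_total = sum(x * x for x in data if x % 2 == 0)
```

Same answer either way; the difference is how much intermediate storage the default idiom allocates along the way.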
It wouldn't be hard to make LINQ behave like Clojure, either.
I'm not sure what you're actually arguing for or against. There is nothing at all wrong with functional programming, and such had nothing whatsoever to do with my comment. That LINQ happens to use some functional techniques doesn't make my criticism a criticism of fP.
What are the "concepts I'm criticizing"? I suspect that you have absolutely no idea what the context of my comment was, because your diversion on the topic of functional programming is just grossly out of place.
LINQ encourages the belief that set operations are free, such that you no longer have to concern yourself with concepts like memoization or appropriate structures. This has nothing to do with functional programming. Literally at all.
You've actually been pretty vague about the concepts you're criticizing, except for "LINQ". Now I can see you were alluding to just the deferred execution part (the "lack of memoization").
Basically, you're saying it's too easy to accidentally perform a computation twice by iterating through a sequence twice. Recurse on that problem and you get an exponential blowup.
That is actually a problem mostly unique to LINQ and it is really important that a programmer understand the deferred semantics. Some tools, e.g. ReSharper, will detect multiple enumerations and warn you about it. Judicious use of ToArray/ToList solves most issues with defer-splosion.
In your first post, everything is "linq is bad" in generic terms except for these two bits:
> "it falls apart in real-world persistent code with considered algorithms and storage"
> "it makes it so easy to do [grouping / sorting / etc] that developers start to resort to that as a catch-all magical solution that is costless because how much could a line or two of code cost"
With hindsight I know your issue is with LINQ's deferred non-memoized execution. It's really easy to enumerate a sequence twice, but every place you do that might add a factor of 2 to the runtime and so there's a danger of exponential blowup that must be avoided. But that's not what I see when I read the above.
The above reads as complaints about easy composability and worries about code monkeys failing to consider what the code they're writing will do. Those are standard "functional / high-level programming is bad" complaints. With hindsight I know that's not what you meant, but that's what it reads like.
I get what you're saying, I understand it, and I agree with the pitfalls. But I really really really don't agree that one should avoid LINQ. Sure, if you don't understand how it works and how it impacts performance, you should use it less. But don't try and drag the language down because you find it difficult. LINQ is easily one of the best things that has happened to a mainstream language in a long time, and used right it produces highly maintainable code that is very easy to read and understand. A lot of data manipulation really is done on sets, and LINQ makes it a breeze to do.
But don't try and drag the language down because you find it difficult.
Oh give me a break. My exact complaint about LINQ is the opposite -- that it makes expensive operations seem simple, which with all innovations in language means that mediocre programmers will start abusing it in all codebases.
LINQ has a place. To say that one should be using it "all the time", however, is exactly as I said -- a canary in a coal mine. The only time I've ever come across the extensive use of LINQ it has been in horrific code.
And why would you guess that? Because I don't consider LINQ of Rx observers being a magic cure-all?
In fact, LINQ on Rx is a symptom of the classic LINQ issue -- don't bother with subscriber or source filtering, or coherent, useful events -- simply broadcast everything, and through the magic, free filtering of LINQ, all is solved in the consumer.
Wouldn't it be (O(n) + O(n) + ...) at the worst? And if it's on lazy sequences, which I thought it was, then composing a .Select and an .Aggregate should only involve a single pass through the sequence.
I'm not talking about individual LINQ queries, which themselves can be perfectly fine: Microsoft is pretty smart, so if you're doing basic LINQ for objects grouping and sorting and selecting, they're going to use decent algorithms given the structures used. There is nothing surprising that the submitted usage shows very similar performance, as in the end both cases are doing essentially the same thing.
The problem is that it makes it so conveniently easy to do brute-force tactics that... oh, the horrible things I've seen... code gets littered with LINQ doing naive queries repeatedly over massive sets of data. Of course you need good coders and good code audits, but LINQ, I think, gives an unwarranted sense of comfort that one is writing good code (where if people had to code these as loops, it would become very evident very early on that maybe they should rethink their approach).
Awesome. I was going to post my two line javascript version, but instead I'm going to study this. There are three commands I've never heard of there. And it's definitely time to take a first crack at awk.
You don't need that tr, just sort -f and then uniq -i.
Of course, whoever put -f as the ignore-case flag for sort should be shot methinks, or rather the one who first added -i as "ignore nonprinting characters".
Edit: and while we're on the subject, you can get rid of head and use awk 'NR<=10 {print $2}'
Edit²: I would also like to point out that your version is superior to any short C# program, because sort can sort sequences that exceed the amount of RAM you have.
I should have looked below; I wrote a similar thing above. But you forgot to split lines containing multiple words into separate lines. That's okay, a lot of people above even forgot about case and punctuation!
I had a class where we built a DB engine from scratch, and I ported my code to C# just for LINQ and proper list handling. While it's not the most efficient, it reduced many of the complex sections down to a few lines, like reordering the values being inserted to match the order necessary for a record by comparing the schema to the parameter list in the insert statement.
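The reordering step described above can be sketched in Python with invented names (the schema, columns, and values here are all hypothetical): permute an INSERT statement's values into the table's schema order by matching its column list against the schema.

```python
# Hypothetical table schema and INSERT statement fragments.
schema = ["id", "name", "age"]          # column order the record needs
insert_cols = ["name", "age", "id"]     # column order in the statement
insert_vals = ["Ada", 36, 1]            # values, in statement order

# Map each statement column to its value, then read the values back
# out in schema order.
by_col = dict(zip(insert_cols, insert_vals))
record = [by_col[col] for col in schema]
```

This is the kind of bookkeeping that collapses from a nested loop into a couple of comprehensions, which is presumably what LINQ bought in the C# version.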
If you like LINQ and wish it were available in JS, look at underscore or lo-dash.
LINQ does make C# more readable but it doesn't add much value over normal functional collections which usually end up more concise, simpler and easier to reason about - visible in my Dart port of C#'s 101 LINQ samples (which are also lazy): https://github.com/dartist/101LinqSamples
Performance of LINQ to Objects is not that great either, and it doesn't add much value readability-wise over rewriting the same task in other dynamic languages: https://github.com/dartist/sudoku_solver
My main problem with LINQ is that it seems to perform terribly on mobile. I was looking for map/reduce/etc type functions for C# in Unity, and thought I found it with LINQ. To my dismay, LINQ creates so many crazy intermediate objects to pull off its "laziness" that our GC high water mark was being crossed all the time. I went and just reimplemented everything from underscore.js in C# and got way better performance. I imagine this is probably not an issue on desktops/servers.
If you'd like to share your code, I'm a functional programmer at heart and am always looking for optimization opportunities for functional-style programming in C#. If you send me some of the code you were having trouble with I can see if your use case maps well into some compiler optimization strategies.
What is it with people's obsession over lines of code?
As someone who doesn't do C# or LINQ, that second solution seems to me like someone really wanted to have as few lines of code as possible.
I don't claim to have an impressive programming pedigree, but while I take simplicity and performance into account, I never take "conciseness" into account. Conciseness usually means "this is opaque as shit but at least it's short". And who really cares about short? What's the purpose of "short"? None that I can find, other than impressing interviewers. Anyone who believes otherwise should probably be writing in Clojure or Haskell (and probably is), but I personally just don't see the point.
But that's just my opinion. My favorite language is Python.
Less code is less places to insert bugs and less to read. The majority of time spent "writing" software is actually spent reading the existing code, so "less to read" is really important. "Concise" does not mean "short"; it means "short and clear". Obviously if your short code is opaque or bug-prone then you're defeating the purpose.
1. Iterate through all key,value pairs once, keeping track of the 10 most common
2. Run quickselect 10 times
3. Coolest (and an interview question in its own right) - modify quickselect to return the top 10!
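Approach 1 can be sketched in Python (with a made-up sample text): `heapq.nlargest` keeps a bounded heap of the k best items while streaming the pairs once, O(n log k) overall, instead of sorting all n pairs.

```python
import heapq
from collections import Counter

# Made-up sample text for illustration.
text = "the cat sat on the mat the cat"
counts = Counter(text.split())

# One pass over the (word, count) pairs, keeping only the running top 3.
top3 = heapq.nlargest(3, counts.items(), key=lambda kv: kv[1])
```

For k much smaller than n (top 10 of thousands of distinct words), this beats a full sort, which is exactly the point being made about the orderby-then-Take pipeline.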
The LINQ really is much more succinct, but the original code did not set the bar very high. Why should one write a 12-line comparison function when 'b.value - a.value' would work pretty much the same (unless C# really requires the comparison to return -1/1 instead of any negative/positive integer, which would be fixed by a sign function)?
I think the analysis may've been skewed by the outlying points. They are almost certainly "high leverage points"[1] and so possibly exert an undue influence on the final trend lines.
But it's funny: the first time I read the paragraph, I assumed the words were not separated by spaces, and that you had to find occurrences of combinations instead.
I instinctively made the test much harder than it would be. I'm damaged.
Good read. It also shows that interviewing can be a great learning tool. I've experienced that myself: while interviewing takes a lot of time investment on the part of the interviewer, it's also often a source of new insights.
I totally agree. Sure, there were similar things in the functional world before, but LINQ had some important pros:
- deferred execution by default, saving memory and time
- step-by-step syntax: each new operation goes at the end, not the beginning
- excellent type inference and IntelliSense (js? ruby?...)
- it works with the same syntax on the database!!! Haskell?
- map and filter were there, but groupBy and join were not so common in previous query comprehension APIs.
- the most important: it's actually usable in jobs you get paid for, not experiments you can make at home or at university.
There are however two things that don't make it 100% perfect:
- expression-tree lambdas are identical to non-expression ones, making it hard for developers to know whether a step is going to be translated or executed. I would have chosen => for non-expression and -> for expressions, for example, or something like that.
- having two syntaxes, method chain and query comprehension, produces a frequent annoying back and forth, since some operators are better written in one (let, join, group by) while others are only available in method chain (Take, ToDictionary...)
So, aside from the Clojure, Mathematica, Python, Ruby, Bourne Shell, Haskell, and Scala solutions posted in the other comments, all of which are simpler than the C++, C#, and JS solutions, presented here with some minor cleanups:
(take 10 (reverse (sort-by (comp first rest) (frequencies (string/split ... #"\s+"))))) ; llambda Clojure
// haakon Scala
s.split(' ').groupBy(identity).mapValues(_.size).toList.sortBy(-_._2).take(10).map(_._1)
(->> (string/split s #"\s+") frequencies (sort-by val) reverse (take 10)) ; aphyr Clojure
var top = (from w in text.Split(' ') // louthy C# LINQ
group w by w into g
orderby g.Count() descending
select g.Key).Take(10);
collections.Counter(s1.split()).most_common(10) # shill Python
d = {} # shill Python without collections library
for word in s1.split(): d[word] = d.get(word, 0) + 1
print [(x, d[x]) for x in sorted(d, key=d.get, reverse=True)][:10]
words = s.split() # spenuke and abecedarius probably O(N²) Python
sorted(set(words), key=words.count, reverse=True)[:10]
d3.entries((s.split(" ").reduce(function(p, v){ // 1wheel JS with d3
v in p ? p[v]++ : p[v] = 1;
return p;}, {})))
.sort(function(a, b){ return a.value - b.value; })
.map(function(d){ return d.key;})
.slice(-10)
# kenuke O(N²) Ruby:
str.split.sort_by{|word| str.split.count(word)}.uniq.reverse.take(10)
counts = Hash.new { 0 } # my Ruby
str.split.each { |w| counts[w] += 1; }
counts.keys.sort_by { |w| -counts[w] }.take 10
# aaronbrethorst ruby
str.split(/\W+/).inject(Hash.new(0)) {|acc, w| acc[w] += 1; acc}.sort {|a,b| b.last <=> a.last }[0,10]
Commonest[StringSplit[string], 10] # carlob Mathematica
Reverse[SortBy[Tally[StringSplit[#]], #[[2]] &]][[;; 10, 1]] & # superfx old Mathematica
$a = array_count_values(preg_split('/\b\s+/', $s)); arsort($a); array_slice($a, 0, 10) // Myrth PHP
tr -cs a-zA-Z '\n' | sort | uniq -c | sort -nr | head # mzs and me sh
-- lelf in Haskell
take 10 . map head . reverse . sortBy (comparing length) . group . sort . words
# prakashk Perl6
.say for (bag($text.words) ==> sort {-*.value})[^10]
# navinp1912 C++
string s,f;
map<string,int> M;
set<pair<int,string> > S;
while(cin >> s) {
M[s]++;
int x=M[s];
if(x>1) S.erase(make_pair(x-1,s));
S.insert(make_pair(x,s));
}
set<pair<int,string> >::reverse_iterator it=S.rbegin();
int topK=10;
while(topK-- && (it!=S.rend())) {
cout << it->second<<" "<<it->first<<endl;
it++;
}
Although I haven't tested it, I think the Afterquery program to solve this, assuming you first had something to tokenize your text into one word per row, would be something like
&group=word;count(*)
&order=-count(*)
&limit=10
which, though perhaps less readable, is simpler still, except for Mathematica. More details at http://apenwarr.ca/log/?m=201212.
And neither is this, although it uses less code and less RAM:
perl -wlne '$w{$_}++ for split; END { $, = " "; print @{[sort {$w{$b} <=> $w{$a}} keys %w]}[0..9]}'
I was surprised, attempting to solve this in Common Lisp, that there's no equivalent of string/split in ANSI Common Lisp, and although SPLIT-SEQUENCE is standardized, it's not included in SBCL's default install, at least on Debian; and counting the duplicate words involves an explicit loop. So basically in unvarnished CL you end up doing more or less what you'd do in C, but without writing your own hash table. Lua and Scheme too, I think, except that in Scheme you don't even have hash tables.
perl -0777 -nE '$w{$_}++ for split; say for (sort {$w{$b} <=> $w{$a}} keys %w)[0..9]'
It is slightly different compared to your version in that each word is printed on a separate line. To print all words on the same line, is just a bit longer:
If you move the while (topK--) into the map loop, it becomes an online algorithm for top-K, whereas what you wrote is offline. If you want offline, then pushing into a priority_queue and then popping it out would be much faster.