
NLP in Python vs other Programming Languages - rayvega
http://nltk.googlecode.com/svn/trunk/doc/howto/nlp-python.html
======
jules
His Ruby

    
    
        for line in ARGF
          for word in line.split
            if word.match(/ing$/) then
              puts word
            end
          end
        end
    

I'd write as

    
    
        for line in ARGF
          puts line.split.grep(/ing$/)
        end
    

Or

    
    
        puts ARGF.map{|line| line.split.grep(/ing$/)}

~~~
petercooper
The Python implementation doesn't use ARGF and so appears clearer. Luckily
this is no problem in Ruby either:

    
    
      while line = gets
        puts line.split.select { |word| word =~ /ing$/ }
      end

~~~
swolchok
This is clearly not as bad as Perl, but it's still raving a bit. My key
complaints are the complete lack of clarity as to what kind of thing split is,
as well as what exactly is going on with select. (I'm assuming it's getting
passed a predicate in the form of a lambda or block or whatever the cool kids
are calling it these days, and it's obvious _to the Perl programmer_ what the
block does after that.)

The equivalent Python is something like

    
    
        import sys
        for line in sys.stdin:
          print('\n'.join(filter(lambda word: word.endswith('ing'), line.split())))
    

which I would argue is more readable to a layperson because function calls
look like function calls. (By the way, does Ruby's puts automatically do the
join with newlines, or what?)

~~~
petercooper
_My key complaints are the complete lack of clarity as to what kind of thing
split is_

The post seems to be about language independent readability. Rather than
"functions should be called with () - that is clearly not a function - can not
compute!!", the question is "can this code be understood even ignoring
unfamiliar syntax?"

Most developers familiar with languages like Java, JavaScript, Python, Perl,
C# or Ruby could correctly guess what "split" refers to and that it is
unlikely to be used as a variable name. Further, if a developer is familiar
with the idea of method calls, it will be inferred that
"whatever.split.whatever" is a sequence of method calls, much as if it were,
in some fictional languages, _whatever->split->whatever_ or
_whatever:split:whatever_. Similarly, _a(b(c()))_ and _a[b[c[]]]_ could both
be easily inferred to be a set of nested function calls.

"import sys" is not exactly clear either, but again, the meaning of this could
be accurately inferred by most developers.

    
    
      print('\n'.join(filter(lambda word: word.endswith('ing'), line.split())))
    

I want to join the array's elements together, not join the string. Joining
together an array's elements by calling a method on the delimiter is no less
raving than any Ruby I may cook up. Your example is both longer _and_ has more
syntax.

------
lars512
There are many dimensions on which we can evaluate programming languages, but
the NLTK folk are only really interested in one: readability. Their page
implicitly argues that high-level languages with good string processing are
the most readable, and that amongst those Python is more readable than the
alternatives (for both non-programmers and experts).

NLTK is supposed to be an educational toolkit. It's used by linguists taking
their first steps in programming, and by CS students taking their first steps
in the complexity and mess of human language. They're not looking for the shortest
code, the fastest code, or the most <quality attribute X> code, just the most
readable, insofar as readability can be supported and encouraged by a
language.

~~~
bradleyland
In that case, Python and Ruby must be front runners. I'm not sure why they put
Ruby so close to the bottom, or why they used the obscure ARGF mechanism
instead of the more readable examples elsewhere in this discussion. It seems
as though the author may have an interest in promoting Python as the choice.

~~~
chromatic
_It seems as though the author may have an interest in promoting Python as the
choice._

Clearly the author has made a strong assumption that the peculiarities of
Python syntax and semantics (`import`, `sys.stdin`, `for` ... `in`) are
somehow clear to Python novices.

------
tspiteri
A C++ version:

    
    
        #include <iostream>
        #include <string>
        int main()
        {
            std::string s;
            while (std::cin >> s) {
                if (s.size() >= 3 && s.compare(s.size()-3, 3, "ing") == 0) {
                    std::cout << s << '\n';
                }
            }
        }

~~~
pgbovine
were you trying to make an argument as to how this compares to other
languages? (sorry, i don't mean to sound trollish, i'm actually curious as to
your intent)

~~~
tspiteri
If anything, this particular comment is to show that C and C++ are quite
different although they have quite a bit in common.

------
emef
The entire time I was reading, I hoped that a Haskell solution would be there
(knowing it would be much simpler), and I got my wish :) +1 to Haskell

~~~
megaman821
Except for Prolog, I am familiar with all of the languages presented. Python
and Haskell were the easiest for me to parse, not that any of them were overly
hard.

~~~
eru
Perhaps more idiomatic:

    
    
      import Data.List
      main = putStr . unlines . filter ("ing" `isSuffixOf`) . words =<< getContents
    

(To be read from right to left.)

A Forth example would be interesting.

~~~
Avshalom
Here is a Factor version:

    
    
      : print-ing ( filepath encoding -- )
      [ [ 
          " " split
          [ dup "ing" tail? [ print ] [ drop ] if ] each ] 
        each-line ]
      with-file-reader ;
    

It's kind of ugly because it's designed to print out the values without
leaving anything on the stack. It can be prettier if it produces an array of
matching words

    
    
      : collect-ing ( filepath encoding -- seq ) { } -rot 
      [ [
          " " split 
          [ "ing" tail? ] filter append ] 
        each-line ] 
      with-file-reader ;
    

and then just prints that out

    
    
      : print-ing ( filepath encoding -- )
      collect-ing [ print ] each ;

~~~
eru
Thanks!

------
Jun8
I agree with everyone that Perl syntax can look like random gibberish; however,
their particular Perl example seems quite easy to interpret.

~~~
pyre
Yea. I don't agree with them ragging on the '$' in the Perl example. I could
have made the Python code just as obtuse by using the regex library rather
than word.endswith(). Ragging on a language because a beginner doesn't
understand regular expressions seems a bit misleading.
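
To make that concrete, here is a minimal sketch of the same suffix test written both ways in Python. The helper names `ends_with_ing_regex` and `ends_with_ing_method` are invented for this illustration and do not appear on the NLTK page.

```python
import re

# Hypothetical helper names, for illustration only.
def ends_with_ing_regex(word):
    # Regex version: '$' anchors the pattern at the end of the string.
    return re.search(r'ing$', word) is not None

def ends_with_ing_method(word):
    # String-method version, as in the Python example under discussion.
    return word.endswith('ing')

words = ['running', 'ring', 'singer', 'in']
print([w for w in words if ends_with_ing_regex(w)])
print([w for w in words if ends_with_ing_method(w)])
```

Both versions select the same words here; the regex one simply asks the reader to know what `$` means.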

They are also inconsistent with their coding style. In the while() loop, they
make use of $_ implicitly (even in the split statement in the foreach loop),
but in the foreach loop, they don't use $_, but instead define $word. Then
they go off on split being 'difficult to guess what it represents.'

    
    
      > Having used Perl ourselves in research and teaching
      > since the 1980s, we have found that Perl programs of
      > any size are inordinately difficult to maintain and
      > re-use.
    

Says more about the programmer than the language. You can write maintainable
code in any language. "Too many choices" doesn't cause code to be
unmaintainable. Lack of discipline in your programming practices does (as well
as lack of documentation). Saying that a language is better in this respect is
just to say that "X language took the ability to choose away from me so that
I'm forced to do things a certain way, whether I like it or not."

~~~
sigzero
I agree with you. Bad code isn't the programming language's fault. Bad code is
the programmer's fault. Perl itself and the Perl culture have changed
enormously since the "1980s".

------
kroger
The Lisp code has a few problems. It's using a regex library that's only
available in clisp (I think), it's not using the standard input like the other
examples, and having two functions named has-suffix and has_suffix is no good.
Also, it'll return an error if the string is shorter than the suffix.

In the following example I'm using the portable
<http://www.cliki.net/SPLIT-SEQUENCE> to split the words:

    
    
      (defun endswith (string suffix)
        (let ((size (- (length string) (length suffix))))
          (unless (minusp size)
            (equalp (subseq string size) suffix))))
      
      (loop for line = (read-line *standard-input* nil) while line do
            (loop for word in (split-sequence #\Space line) do
                  (if (endswith word "ing")
                      (write-line word))))
    

It's still wordier than Python, though.

------
ekiru
The C and Prolog examples solve a different problem than the others. The
others split on either any whitespace or on only spaces. The Prolog example
splits on whitespace and punctuation. The C example splits on anything that
isn't alphanumeric.
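
A rough Python sketch of how much the splitting rule matters (the sample sentence and the `ing_words` helper are made up for illustration; the regex only approximates "anything that isn't alphanumeric" for ASCII input):

```python
import re

line = "I was singing, not running-away!"

# Split on any run of whitespace (str.split() with no argument).
by_whitespace = line.split()

# Split on anything that is not an ASCII letter or digit,
# dropping the empty strings left at the edges.
by_non_alnum = [w for w in re.split(r'[^0-9A-Za-z]+', line) if w]

def ing_words(words):
    # Hypothetical helper: keep only the words ending in "ing".
    return [w for w in words if w.endswith('ing')]

print(ing_words(by_whitespace))  # [] because "singing," keeps its comma
print(ing_words(by_non_alnum))   # ['singing', 'running']
```

The same input yields different answers depending on the tokenizer, so the examples are not strictly comparable.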

------
eru
"LISP is a so-called functional programming language, in which all objects are
lists, and all operations are performed by (nested) functions of the form
(function arg1 arg2 ...). "

Reading this hurts.

------
tspiteri
The C version has a buffer overflow if more than 1024 consecutive alphanumeric
characters are input. And a much less serious point,

    
    
        isalnum(c)
    

looks much better than

    
    
        (c >= '0' && c <= '9') || (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')

~~~
pgbovine
yup, agreed, and

    
    
      if word.endswith('ing')
    

looks much better than

    
    
      if word[-3] == 'i' and word[-2] == 'n' and word[-1] == 'g'
    

clearly the authors of that article weren't proficient in the C standard
library :)
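
A small sketch of one more difference between the two checks, assuming they are wrapped in functions (the names `ends_with_ing_manual` and `ends_with_ing` are invented here): the character-by-character version also breaks on words shorter than three characters.

```python
def ends_with_ing_manual(word):
    # Character-by-character check; indexing raises IndexError
    # when the word has fewer than three characters.
    return word[-3] == 'i' and word[-2] == 'n' and word[-1] == 'g'

def ends_with_ing(word):
    # Library version; handles short and empty strings.
    return word.endswith('ing')

print(ends_with_ing('in'))  # False
try:
    ends_with_ing_manual('in')
except IndexError:
    print('manual check fails on short words')
```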

------
zephyrfalcon
For what it's worth, this could be written in one line of Io:

    
    
      File standardInput readToEnd split select(endsWithSeq("ing")) foreach(println)
    

(Given that I don't actually use Io a lot, there might be shorter ways to do
this.)

Anyway, as usual, code samples prove very little. =)

------
10ren
lua (a better version is welcome):

    
    
        for line in io.lines() do
          -- string.gfind was renamed to string.gmatch in Lua 5.1
          for word in line:gmatch("[^%s]+") do
            if word:find("ing$") then
              print( word )
            end
          end
        end

------
pgbovine
i love python more than anything else in the world (well, almost), but i think
that this example is quite superficial ... this line alone gives python its
enormous 'readability edge' over other languages:

    
    
      if word.endswith('ing')
    

of course, it's great standard library design to have a string method called
endswith() rather than making people use a regexp ending in '$', since finding
suffixes is a common operation. but such a simple operation is hardly
indicative of hardcore NLP (which would mostly be hidden in special-purpose
library code anyways)

------
mark_l_watson
I use Ruby, and not Python. That said, I still bought a print copy of this book
a few years ago: nice book, and the NLTK package has a lot of great tools built
in. Definitely "batteries included."

~~~
mark_l_watson
The comments on why "my language is better than yours" could have been toned
down, however.

The book is available at the same site:
<http://nltk.googlecode.com/svn/trunk/doc/book/book.html>

Really nice work on both the book and the NLTK library.

------
urza
C#

    
    
      // Handles one input line; ToList() is needed because ForEach is a List<T> method.
      Console.ReadLine().Split().Where( word => word.EndsWith("ing")).ToList()
        .ForEach( word => Console.WriteLine(word));

