A Pythonic way to do MapReduce using Hadoop

jcsalterego · on Feb 20, 2011

Great overview, from top to bottom.

Just wondering, is it better performance-wise to explicitly check for key membership than to rely on exceptions?

Existing:

  for line in sys.stdin:
      name, marks = line.rstrip().split('\t')
      try:
          if agregatedmarks.get(name):
              agregatedmarks.get(name).append(marks)
          else:
              agregatedmarks[name] = [marks]
      except ValueError:
          pass

Or

  for line in sys.stdin:
      name, marks = line.rstrip().split('\t')
      if name in aggregatedmarks:
          aggregatedmarks[name].append(marks)
      else:
          aggregatedmarks[name] = [marks]

This is a common idiom I find myself using (appending to an existing list, or creating a new one, inside a larger dict).
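
One rough way to answer the performance question is to time both variants with `timeit`. This is a minimal sketch, not a benchmark of the article's code: the sample data and the function names are made up for illustration, and the outcome depends on how often keys repeat.

```python
import timeit

# Hypothetical sample data: 10 distinct names, each repeated many times.
PAIRS = [("student%d" % (i % 10), str(i)) for i in range(1000)]


def with_membership_check(pairs):
    """Check for the key first, then append or create the list."""
    result = {}
    for name, marks in pairs:
        if name in result:
            result[name].append(marks)
        else:
            result[name] = [marks]
    return result


def with_exception(pairs):
    """Try to append, and create the list only when KeyError is raised."""
    result = {}
    for name, marks in pairs:
        try:
            result[name].append(marks)
        except KeyError:
            result[name] = [marks]
    return result


if __name__ == "__main__":
    # Time each variant over the same data; numbers vary by machine.
    for fn in (with_membership_check, with_exception):
        seconds = timeit.timeit(lambda: fn(PAIRS), number=200)
        print("%s: %.4f s" % (fn.__name__, seconds))
```

With many repeated keys the exception is rarely raised, so the two tend to be close; with mostly unique keys the exception path is taken often and is likely to cost more.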

beagle3 · on Feb 21, 2011

  aggregatedmarks.setdefault(name, []).append(marks)

See http://docs.python.org/release/2.5.2/lib/typesmapping.html
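
For context, here is a self-contained sketch of the `setdefault` one-liner inside the loop, next to `collections.defaultdict`, which is an alternative not mentioned in this thread. The input lines and function names are hypothetical stand-ins for `sys.stdin` and the article's script.

```python
from collections import defaultdict

# Hypothetical tab-separated input standing in for sys.stdin.
LINES = ["alice\t90\n", "bob\t75\n", "alice\t85\n"]


def aggregate_setdefault(lines):
    aggregatedmarks = {}
    for line in lines:
        name, marks = line.rstrip().split('\t')
        # setdefault returns the existing list, or stores and returns a new one.
        aggregatedmarks.setdefault(name, []).append(marks)
    return aggregatedmarks


def aggregate_defaultdict(lines):
    # defaultdict(list) creates an empty list the first time a key is accessed.
    aggregatedmarks = defaultdict(list)
    for line in lines:
        name, marks = line.rstrip().split('\t')
        aggregatedmarks[name].append(marks)
    return dict(aggregatedmarks)


if __name__ == "__main__":
    print(aggregate_setdefault(LINES))
    print(aggregate_defaultdict(LINES))
```

One difference worth knowing: `setdefault(name, [])` builds a new empty list on every call, even when the key already exists, while `defaultdict(list)` only calls `list` for missing keys.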

ferdous · on Feb 21, 2011

Neat and minimal! I will update the code. Thanks for your comments and for stopping by :)