

US Patent #7,650,331: System & method for efficient large-scale data processing - mattyb
http://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO1&Sect2=HITOFF&d=PALL&p=1&u=/netahtml/PTO/srchnum.htm&r=1&f=G&l=50&s1=7,650,331.PN.&OS=PN/7,650,331&RS=PN/7,650,331

======
profquail
The software patent situation is getting more and more ridiculous by the day.
This patent is so general, it describes basically any data-mining software
(that may or may not be distributed over multiple computers.)

I think that all patents should have to go through the peer review process,
unless there is some kind of extraordinary reason that they can't be made
public:

<http://www.peertopatent.org/>

~~~
cperciva
_This patent is so general, it describes basically any data-mining software
(that may or may not be distributed over multiple computers.)_

Bullshit. Here are the two independent claims:

1\. A system for large-scale processing of data, comprising: a plurality of
processes executing on a plurality of interconnected processors; the plurality
of processes including a master process, for coordinating a data processing
job for processing a set of input data, and worker processes; the master
process, in response to a request to perform the data processing job,
assigning input data blocks of the set of input data to respective ones of the
worker processes; each of a first plurality of the worker processes including
an application-independent map module for retrieving a respective input data
block assigned to the worker process by the master process and applying an
application-specific map operation to the respective input data block to
produce intermediate data values, wherein at least a subset of the
intermediate data values each comprises a key/value pair, and wherein at least
two of the first plurality of the worker processes operate simultaneously so
as to perform the application-specific map operation in parallel on distinct,
respective input data blocks; a partition operator for processing the produced
intermediate data values to produce a plurality of intermediate data sets,
wherein each respective intermediate data set includes all key/value pairs for
a distinct set of respective keys, and wherein at least one of the respective
intermediate data sets includes respective ones of the key/value pairs
produced by a plurality of the first plurality of the worker processes; and
each of a second plurality of the worker processes including an application-
independent reduce module for retrieving data, the retrieved data comprising
at least a subset of the key/value pairs from a respective intermediate data
set of the plurality of intermediate data sets and applying an application-
specific reduce operation to the retrieved data to produce final output data
corresponding to the distinct set of respective keys in the respective
intermediate data set of the plurality of intermediate data sets, and wherein
at least two of the second plurality of the worker processes operate
simultaneously so as to perform the application-specific reduce operation in
parallel on multiple respective subsets of the produced intermediate data
values.

9\. A method of performing a large-scale data processing job, comprising:
executing a plurality of processes on a plurality of interconnected
processors, the plurality of processes including a master process for
coordinating the large-scale data processing job for processing a set of input
data, and worker processes; in the master process, in response to a request to
perform the large-scale data processing job, assigning input data blocks of
the set of input data to respective ones of the worker processes; in each of a
first plurality of the worker processes, executing an application-independent
map module to retrieve a respective input data block assigned to the worker
process by the master process and to apply an application-specific map
operation to the respective input data block to produce intermediate data
values, wherein at least a subset of the intermediate data values each
comprises a key/value pair, and wherein at least two of the first plurality of
the worker processes operate simultaneously so as to perform the application-
specific map operation in parallel on distinct, respective input data blocks;
using a partition operator to process the produced intermediate data values to
produce a plurality of intermediate data sets, wherein each respective
intermediate data set includes all key/value pairs for a distinct set of
respective keys, and wherein at least one of the respective intermediate data
sets includes respective ones of the key/value pairs produced by a plurality
of the first plurality of the worker processes; and in each of a second
plurality of the worker processes, executing an application-independent reduce
module to retrieve data, the retrieved data comprising at least a subset of
the key/value pairs from a respective intermediate data set of the plurality
of intermediate data sets and applying an application-specific reduce
operation to the retrieved data to produce final output data corresponding to
the distinct set of respective keys in the respective intermediate data set of
the plurality of intermediate data sets, and wherein at least two of the
second plurality of the worker processes operate simultaneously so as to
perform the application-specific reduce operation in parallel on multiple
respective subsets of the produced intermediate data values.

This very clearly only covers computations distributed over multiple systems
(_a plurality of processes executing on a plurality of interconnected
processors_) and is narrow enough that there are lots of parallel systems
which don't fall under those claims.
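
For readers who find the claim language hard to follow, here is a rough sketch of the structure it describes: a master that hands input blocks to map workers, a partition operator that groups key/value pairs by key, and reduce workers that each process one intermediate data set. It is an illustration only, not the patented system or Google's implementation. The word-count job, all function names, the thread pool standing in for worker processes on separate machines, and the toy partitioning rule are assumptions made for the example.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_words(block):
    """Application-specific map operation: emit (word, 1) for each word."""
    return [(word.lower(), 1) for word in block.split()]


def reduce_counts(key, values):
    """Application-specific reduce operation: sum the counts for one key."""
    return key, sum(values)


def partition(pairs, num_partitions):
    """Partition operator: route every pair for a given key to one data set."""
    partitions = [defaultdict(list) for _ in range(num_partitions)]
    for key, value in pairs:
        # Sum of code points is a deterministic stand-in for a real hash.
        index = sum(ord(ch) for ch in key) % num_partitions
        partitions[index][key].append(value)
    return partitions


def run_job(blocks, map_fn, reduce_fn, num_reducers=2):
    """Master: assign blocks to map workers, partition, then run reducers."""
    with ThreadPoolExecutor(max_workers=4) as pool:
        # Map phase: each input block is handled by its own worker task.
        mapped = list(pool.map(map_fn, blocks))
        all_pairs = [pair for block_pairs in mapped for pair in block_pairs]
        partitions = partition(all_pairs, num_reducers)

        # Reduce phase: each worker task handles one intermediate data set.
        def reduce_partition(part):
            return [reduce_fn(key, values) for key, values in part.items()]

        reduced = list(pool.map(reduce_partition, partitions))
    return dict(pair for part in reduced for pair in part)


blocks = ["the quick brown fox", "the lazy dog", "the fox"]
counts = run_job(blocks, map_words, reduce_counts)
print(counts["the"], counts["fox"], counts["dog"])  # 3 2 1
```

The map and reduce functions are the "application-specific" parts; `run_job` and `partition` play the role of the "application-independent" modules. A single-process loop doing the same computation would lack the multiple simultaneous workers that the claims require.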

~~~
ajross
No offense, but shouldn't the standard be a little higher than "there are lots
of parallel systems which don't fall under those claims"? It looks quite
shockingly broad to me.

~~~
cperciva
The standard for patenting is higher than that. I was responding to the claim
that the patent "describes basically any data-mining software", which is
simply untrue.

------
mark_l_watson
Google could make themselves look better right now by quickly granting a free
perpetual nonexclusive patent license to the Apache Hadoop project.

------
cabalamat
I think patent attorneys should be required to write patents in understandable
English, on pain of receiving a plurality of punches in the face.

~~~
jordyhoyt
They are very aware of how they are speaking, and in person, they are the most
precise people you will ever meet. I recently had occasion to speak with a
patent attorney, and not only did he explain the patent in plain English, but
I was floored at how unambiguous he was. We spoke over the phone and he was
able to guide several people through the details of the patent quickly and
clearly.

I do agree though, the way these are written is almost completely unreadable.
It's almost like how we use English words in programming languages: without
the domain knowledge, they are meaningless.

~~~
cabalamat
Patents in theory are meant to balance a public bad (a monopoly) with a public
good (teaching practitioners how to do something). I am a programmer, and I do
not understand software patents. I'm sure I'm not the only one. If a typical
practitioner of an art, when seeing a patent, says it's hard to understand or
confusingly worded, the patent should be void.

~~~
10ren
I think there's a good argument for that approach being applied to the
_description_ of the patent. That's the part that a person skilled in the art
should be able to use to make the invention.

However, it's not so applicable to the _claims_ of the patent. This part is
for lawyers to use to determine the exact extent of the legal protection
conferred by the patent. The description is of just one embodiment of the
invention; it is natural that slight variations from that embodiment should
also be covered - but exactly how much variation is covered? How general (or
how abstract) is the coverage? Especially when you consider that the given
embodiment isn't necessarily the "center" of the inventions - it is not _the_
embodiment, just _an_ embodiment. This extent is difficult to specify, and
special language is needed.

To use jordyhoyt's analogy, it would be like expecting Erlang to be readable
by a layman. Of course, we can do better. The point is that it's hard to serve
many masters.

~~~
cabalamat
> To use jordyhoyt's analogy, it would be like expecting Erlang to be readable
> by a layman.

If lawyerese were a formally-defined language with a compiler and everything,
I'd have fewer problems with it. But from where I'm standing, it just looks
like obfuscated English.

------
siculars
Don't be Evil. Indeed. Well, you know, someone needs to keep IBM from a C&D
against Hadoop.

~~~
lmkg
While I'm normally the first to pile on and complain about Google, in this
case I'm going to hold off unless/until they try to enforce their patent. The
big companies like Google and Microsoft need to preemptively patent everything
they can, in order to stave off the real patent trolls.

Of course, the necessity for defensive patenting itself just goes to show how
silly the current patent system is.

~~~
houseabsolute
Well, it's not a very good defense against patent trolls, because those guys
typically don't exercise their own patents or anyone else's. However, it does
help somewhat against companies like Apple or Nokia who are fine with
litigating against anyone who tries to compete on an even footing. For
example, if Yahoo or Microsoft sued Google for some search related patent,
probably Google could use this patent against them.

------
10ren
You can't patent something that has been publicly disclosed, including
disclosure by yourself. Commercial use counts as disclosure.

This was filed June 18, 2004, so given the one year grace allowed under US
law, if Google made _commercial use_ of this system before June 18, 2003, then
they cannot patent it.
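
The date arithmetic can be sketched as follows. This is a simplified reading of the one-year grace period as described in this comment, not legal advice. The function names and the two sample "first commercial use" dates are hypothetical; only the June 18, 2004 filing date comes from the thread.

```python
from datetime import date

FILING_DATE = date(2004, 6, 18)  # filing date quoted in the comment above


def critical_date(filing_date, grace_years=1):
    """Return the date before which public use or sale would bar the patent."""
    # Note: replace() would fail for a Feb 29 filing date; fine for this one.
    return filing_date.replace(year=filing_date.year - grace_years)


def is_barred(first_commercial_use, filing_date=FILING_DATE):
    """True if the use predates the grace period, under this simplified reading."""
    return first_commercial_use < critical_date(filing_date)


print(critical_date(FILING_DATE))       # 2003-06-18
print(is_barred(date(2003, 1, 1)))      # True  (hypothetical use date)
print(is_barred(date(2003, 12, 1)))     # False (hypothetical use date)
```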

I'm pretty sure Google used distributed map-reduce from the beginning... so
I'm thinking that this patent must cover a refined method that differs in some
way from their initial approach.

It's a lot of work to read and understand a patent (I spent a couple of days
on one this week), so if anyone else wants to do so and let us know what the
patented algorithm actually is...

~~~
WalterGR
> You can't patent something that has been publicly disclosed, including
> disclosure by yourself. Commercial use counts as disclosure.

Incorrect. IANAL, so I will not attempt to correct you. I suspect YANAL
either.

~~~
cromulent
It seems to be true in the US.

"If the invention has been described in a printed publication anywhere, or has
been in public use or on sale in this country more than one year before the
date on which an application for patent is filed in this country, a patent
cannot be obtained."

[http://www.uspto.gov/web/offices/pac/doc/general/index.html#...](http://www.uspto.gov/web/offices/pac/doc/general/index.html#novelty)

~~~
tybris
The original MapReduce paper was published in December 2004, six months after
the patent was filed.

------
alttab
Even if they wanted to enforce it they'd have to prove it.

Otherwise, patenting map reduce - which is practically one of the foundations
of functional data processing - is downright silly.

~~~
neilc
The relevant prior art is not the idea of a "map" or "reduce" function as such
(that isn't what is significant about MR, anyway).

The patent is about parallelizing M and R operators for large-scale data
analysis; I think that the relevant prior art would include 1980s work on
parallel database systems (e.g. Gamma, Bubba).

------
sjunkin
I wonder what AbInitio, Informatica, et al. think about this patent. Is the HN
community aware of their claims to similar IP?

------
fexl
μολον λαβε

~~~
zandorg
Is that APL?

~~~
profquail
Probably.

