
Algorithm Development is Broken - platypii
http://blog.algorithmia.com/post/75680476188/algorithm-development-is-broken
======
elliptic
The "{area/field/business} is broken" template is broken (well, I don't know
if it ever worked). Algorithm development is not broken - it's specialized
work, and it's largely done by people with specialized backgrounds. Maybe
that's not ideal, and maybe you can improve it, but saying "it's broken" is
lazy and false.

~~~
platypii
Based on my experiences in grad school, I'm not sure it is an exaggeration to
say that it's "broken". So much work done solving hard problems ends up in
academic journals, never to be seen again. And there's no real incentive for
that to change. I think that's broken.

~~~
matthewmacleod
_So much work done solving hard problems ends up in academic journals, never
to be seen again_

Is this really the case? When I have to solve a particularly unusual or tricky
problem, I often have a look at relevant academic papers and go from there –
either finding existing solutions, or basing one on the contents of the paper.
I kind of assumed most developers did the same.

~~~
tokenrove
Yeah, this seems bizarre. There is better access right now to cutting-edge
algorithms than ever, as well as building blocks that make it easier than ever
to implement the algorithms described. I've been doing this for years, and
I've noticed it get progressively easier. Honorable mention to
[http://thepaperbay.com/](http://thepaperbay.com/) for making it easier to get
around those occasional papers stuck behind paywalls.

~~~
mikevm
Another honorable mention:
[http://www.reddit.com/r/scholar](http://www.reddit.com/r/scholar)

And:
[http://gen.lib.rus.ec/scimag/index.php](http://gen.lib.rus.ec/scimag/index.php)

------
gfodor
What's the monetization strategy here? It would be a major step backwards for
the industry if it turns out the plan here is to try to make money off of what
is currently a public resource. I'd hate to see researchers having to decide
if they want to publish their results or just make some cash by creating a
closed source API with your service. You guys should make it clear if source
code will always be available or if this is basically going to be a land-rush
for people to provide an API for all the standard CS algorithms with you
monetizing access to them.

~~~
platypii
We are algorithm developers ourselves and our goal is just to make these
awesome algorithms that are being developed more widely available.

We are not trying to monopolize basic algorithms. In fact in a free market I
don't see how that could happen, since anyone else could come along and
implement the same basic CS algorithms and offer a lower price.

The thing that makes us different from, say, Wikipedia or GitHub is that we
don't provide just source code, but live infrastructure delivering the
algorithms as a service. So rather than uploading code to GitHub and putting
the burden on users to clone the algorithm, set up virtual servers,
deployment systems, load balancers, etc., we handle all of that. That way
algorithm developers are free to focus on algorithm code, and the
infrastructure is taken care of. This infrastructure needs to be paid for,
however, so our general plan is to charge for using algorithms through the
API. It will be up to the developers whether to share source code or not,
since they own the IP.

~~~
gfodor
The status quo now is not broken: developers who want to share IP are
encouraged to incorporate their work into open source libraries so they can be
used by everyone and be made robust and evolve over time. In various domains
these libraries are more or less agreed upon, and the patches go through
extensive review to get incorporated.

This type of system will encourage these folks to release their source code in
a form compatible with _your_ API, and 3rd parties who want to leverage their
work will need to either cut-and-paste it to incorporate it into their own
system (a nightmare), or access it through a slow network API, likely paying
for it along the way. Libraries have solved the problem of code re-use of
algorithms for the last several decades. You're basically proposing that
LAPACK and BLAS and Mahout and all these great algorithmic libraries would
have been best served up without any centralized coordination or binary
distribution, but with each individual algorithm behind a separate API with
each author publishing each one individually. If the goal is really to enable
wider distribution of good algorithmic work, I don't see how a heterogeneous,
API-centric approach to this problem (while one that would certainly make you
a pile of cash if it catches on) would be a great step forward for the
industry compared to the centralized open source library model we have now.
Good algorithmic work is _hard_, and the model we have now results in
battle-hardened, robust, agreed-upon implementations of well-understood
algorithms.

If you really want to make me a believer, all algorithms your API supports
that have public source code should also be integrated and built into a
binary library that a user can download instead. At the very least, the path
for a contributor to take what they have written and additionally publish it
as an appropriate binary package themselves should be minimal. For example,
if I implement something in Ruby that uses your service, I should be able to
push it as an uncoupled, reusable gem in a straightforward manner.

~~~
platypii
I agree that the existing libraries are great; if anything, that success is
what we want to encourage even more of. We want to build a community around
algorithm development, and have a central place for developers to share their
work and collaborate with other developers to make a huge library of
algorithms so that everyone can benefit.

The first language we implemented was Java/Scala, largely because of
excellent libraries like Mahout and Weka being easily available in Maven
Central. But even though the code is there in Maven, it is still not trivial
to deploy those algorithms as a service. We are trying to make that easier,
while at the same time supporting the developers of those libraries in the
process.

Being able to run the code locally is in the plans. For the most part, the
algorithms in our system are already simply maven packages, or ruby gems, npm
packages, etc. Being able to distribute them in binary form is definitely
possible. We're still figuring things out right now, but our focus is always
on doing right by the algorithm development community.

~~~
daken
This doesn't make sense to me. You display Dijkstra as an example on your
homepage. Imagine I want to use Dijkstra in an app that needs critical
performance and calls the function 1,000,000 times per minute; making this an
API rather than a library defeats the whole point of centralizing the
algorithms.
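
To put rough numbers on that point: a heap-based Dijkstra over a small
in-memory graph completes in microseconds per call even in plain Python,
while every remote API call adds at least one network round trip on top. A
minimal local sketch using only the standard library (the toy graph is purely
illustrative):

```python
import heapq

def dijkstra(graph, source):
    """Shortest-path distances from source over an adjacency dict:
    graph[u] = [(v, weight), ...] with non-negative weights."""
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry; a shorter path was already found
        for v, w in graph.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

graph = {
    "a": [("b", 1), ("c", 4)],
    "b": [("c", 2), ("d", 5)],
    "c": [("d", 1)],
}
print(dijkstra(graph, "a"))  # {'a': 0, 'b': 1, 'c': 3, 'd': 4}
```

Calling a function like this in-process is essentially free compared to
serializing the graph, shipping it over HTTP, and waiting for a response on
every one of those million calls.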

~~~
dalke
Plus, the choice of Dijkstra algorithm implementation depends on the graph
size, type, and even CPU. See
[http://www.cs.sunysb.edu/~rezaul/papers/TR-07-54.pdf](http://www.cs.sunysb.edu/~rezaul/papers/TR-07-54.pdf)
, with 10 different implementation variations and several different
benchmarks.

The 9th DIMACS Implementation Challenge from 8 years ago (see
[http://www.dis.uniroma1.it/challenge9/](http://www.dis.uniroma1.it/challenge9/)
) served as a way to gather the best algorithms for the shortest path problem.
I don't see how this project can be significantly better than, or even close
to, a traditional effort like that.

~~~
robrenaud
How many of those shortest paths algorithms can you run right now?

It doesn't have to be better at producing good algorithms. It just has to be
better at making those algorithms' implementations easy to use.

But I still don't think the business model is viable.

~~~
dalke
How many do I need to run?

The Boost Graph Library includes three different shortest path
implementations, and I have that already downloaded. That library likely fits
my needs.

In the DIMACS challenge papers at
[http://www.dis.uniroma1.it/challenge9/papers.shtml](http://www.dis.uniroma1.it/challenge9/papers.shtml)
you'll see "Single-Source Shortest Paths with the Parallel Boost Graph
Library".

Indeed, the Parallel Boost Graph Library is easily available, and contains
two parallel shortest-path implementations.

The flip question is, what if it doesn't fit my needs? Well, then the question
is "what do I need"?

The other DIMACS challenges include 1) pre-computation, which is useful for
road-based graphs, 2) supercomputer parallelism to search billions of nodes,
3) out-of-core graph searches, for when you don't have enough RAM, and 4)
k-shortest path search.

It's very unlikely that this proposed algorithm service provider will
implement these alternate algorithms.

------
frik
I first thought this was a wiki for algorithms with sample code in different
languages.

Well, this site is apparently a marketplace for algorithms.

If anyone is more interested in open wikis, here are two sites I found using
Google:

* RosettaCode.org (code in many languages): [http://rosettacode.org/wiki/Sorting_algorithms/Merge_sort](http://rosettacode.org/wiki/Sorting_algorithms/Merge_sort)

* Algorithmist.com: [http://www.algorithmist.com/index.php/Main_Page](http://www.algorithmist.com/index.php/Main_Page)

~~~
Aloisius
Not a wiki, but still incredibly handy is DADS:
[http://xlinux.nist.gov/dads/](http://xlinux.nist.gov/dads/) or
[http://fastar.org/dads/](http://fastar.org/dads/)

I can't tell you how many times I've looked up something there over the years
(complete with links to sample code!).

~~~
dmunoz
Another good resource is Steven Skiena's Stony Brook Algorithm Repository [0].

It's not immediately clear on the linked page, but the "By Language", "By
Problem", "Algorithm Links" targets will drop down with links to a specific
page. The "By Problem" targets link to a page that is very similar to what is
in "The Hitchhiker's Guide to Algorithms" part in Skiena's "The Algorithm
Design Manual" textbook. They each lead with an image representation of the
input and output of the general case of the algorithm, a problem description,
some short text description, and then links to implementations. It's not as
detailed as what is in the textbook, and I'm not sure how up to date it is
these days, but it's a good place to browse around.

[0] [http://www.cs.sunysb.edu/~algorith/](http://www.cs.sunysb.edu/~algorith/)

~~~
justin66
Which is sort of a subpage of [http://algorist.com/](http://algorist.com/) ,
the page for his book.

People really need to be aware of this site. It's fantastic. His book is great
as well.

------
chubot
Hm, sounds interesting but vague. I don't really understand how it will work.

How is data for the algorithms provided? A lot of times it is big. And messy,
and proprietary. There seems to be an implicit assumption that you can just
plug different algorithms to different data sets. But I can't think of a
project where that has been the case.

I also doubt that "algorithms" are the bottleneck in a lot of projects. I'm
not an expert, but I have some personal experience to back up what people say
about "data trumping algorithms" (e.g. Peter Norvig and others have written
about this).

I would like to hear about some more concrete examples / success stories.
"Algorithms" is just too vague. I think if this becomes successful, it will be
by first narrowing it to a particular domain, and then generalizing it again.

~~~
platypii
Sorry for being vague; that wasn't our intention. We wanted to focus on
laying out the problem that we've experienced personally as algorithm
developers. We will be posting more blog posts with much more detail about
the system in the near future.

The data question is huge for us. We have built our system on top of Hadoop +
Spark in order to handle large amounts of data and be able to apply
computations to it. You're absolutely right that data is very heavy, and
often the limiting factor, so we are doing everything we can both to get data
into the system (currently with a focus on streaming data, since that gets
around the heavy-data problem) and to get algorithms to where the data is.

As far as specific algorithms go, there are many that could apply. My
personal experience is in machine learning, so I'm biased toward those
algorithms. There are numerous algorithms for classification, clustering,
optimization, anomaly detection, etc. which are very CPU-intensive. We will
be doing a blog post soon with a demo of live algorithms already in our
system to give more concrete examples.

~~~
heurist
Could you return the actual code of the algorithm, so it could be loaded and
used at runtime (for some languages, if the user wants it)? It seems like
having the user send in the data and running an algorithm as a service would
take a lot more time and effort for everyone involved.

------
dalke
I can't figure out how such a system is supposed to work. Anyone have any
ideas?

For example, a couple of years ago I worked on an algorithm to find the
maximum common subgraph of a set of 2 or more molecular graphs. More
specifically, I wanted the largest subgraph in M of N graphs (M<=N), I wanted
to define the atom match criteria, and I wanted to require that rings not be
broken in the subgraph. (Chemists love rings.)

I did it the old-fashioned way. I read papers, I investigated similar
systems, I worked through various implementation details, and I did a lot of
testing.
How would I find such an algorithm using this system?

There are only a few people who develop this sort of algorithm. Why might I
expect that this system is a better resource than traditional means?

~~~
marcosdumay
I have no idea how this would work, but I disagree with your statement. There
are probably lots of people solving similar problems, but with different
notations.

Lots of domains share the same mathematical abstractions, but we don't know
it because (almost) nobody is trying to discover them. We can tell it's
common because, once in a while, people do look, and they always come back
with new shared abstractions.

If somebody created an index that took different notations into account,
that'd revolutionize research. But I don't think this system does that.

~~~
dalke
Which statement of mine do you disagree with? Do you mean that there are few
people working on the maximum common subgraph problem as it applies to small
molecule chemistry? [1] What forms the basis for your opposition?

Of course there are many different notations. One need only look to Newton and
Leibniz notations for calculus, or Schrodinger's wave equations and
Heisenberg's matrices for quantum mechanics, to know there's a long history of
different mathematical abstractions for the same concept.

In my own field, I recently discovered (by reading the literature and getting
feedback from conferences) that there appears to be a previously unknown
connection between this maximum common subgraph problem and frequent subgraph
mining, so it's not like connections are altogether rare.

Indeed, I'll argue that these occur all the time, and research libraries exist
in part in order to help people find them.

Which is why I asked why this project might be better than traditional means,
and I gave a real-world example to provide a basis for discussion.

[1] There's plenty of MCS research, but since the problem itself is known to
be NP-hard, most of the mathematical work shows that certain limit cases,
like planar graphs or outerplanar graphs, are solvable in polynomial time. As
the structures I deal with don't all fall into those categories, I can't use
them as a general-purpose solution. [2]

[2] I might be able to use them as a special purpose solution, and this
project might help identify codes I can use for this case. I just can't figure
out a way to make it easier than the current methods.

------
3pt14159
Funny he mentions LDA. A company I founded and sold (Algo Anywhere, which
started off as a Generalized Algorithms as a Service company) was a
recommendation engine as a service business built on top of LDA. The papers
out there are dense, but LDA is actually pretty easy to get information on.
Check out Gensim in pythonland.

The thing with algorithms is that you really have to think. 200 lines of code
might take two months to really grok. Especially because the people in the
field make certain assumptions when they start, and sometimes even just one of
those assumptions takes two weeks to understand and research.

You can't just jump into PhD level research and expect to understand it right
away.

I don't know if algorithmia will help solve this problem, but I wish them all
the best of luck. Getting actual code next to research is super important and
useful.

~~~
doppenhe
Thanks for the best wishes. LDA was an interesting one, since when we were
investigating it, it was our first exposure. Although the algorithm itself is
well documented, there were at least 3 or 4 libraries that could be used, and
we really didn't know where to go from there. We want academic papers to be
accompanied by live code, as you mentioned. We strongly believe this will not
only increase the reach of developed algorithms but also significantly lower
the bar of understanding for those who are not experts.

------
w_t_payne
Yes, algorithm development _is_ broken, but not quite in the way that the OP
suggests.

People tend to focus on initial algorithm development, because that is the
academically prestigious, intellectually stimulating bit, but the job is
really about how to turn algorithms into cash: a much broader problem than
the narrow slice that people typically obsess over.

A huge (and frequently overlooked) part of that job is the communication and
coordination role between business development and algorithm development. The
volume of communication and level of detail required cannot be overstated.

Another huge part of the job is actually turning a piece of research into a
functioning product. Whilst a large part of the OP's proposition is intended
to address this problem (kudos), I think that his solution falls short in a
big way: it omits the largest part of the solution, which is where the
business learns about the algorithm and how the behaviour and performance
characteristics of the algorithm interact with the business' problem domain.
I.e., how does the business build sufficient expertise and knowledge of its
product to be able to sell it effectively? All of these are human problems,
mainly oriented around communication and learning.

Having said all that, I would like to encourage the OP in his efforts. I
think that it is something that is worth doing, and I really hope he is able
to build a business around this idea. I think that technology can help
support all of these activities, and this is actually something that I have
wanted to do myself for a very long time, so kudos to the OP for actually
taking the chance, going out, and doing it!

------
eliteraspberrie
This was (partly) solved in mathematical software with TOMS and Netlib:

[http://toms.acm.org/](http://toms.acm.org/)

[http://www.netlib.org/](http://www.netlib.org/)

However, if you want to do something non-trivial and you want it done right,
hire a computer scientist or mathematician. No amount of crowdsourcing will
help if no one in the crowd has a clue what they're doing.

------
morganherlocker
1) Will there be limits on the size of data sets / what size data are you
optimizing for? Some algos focus on hundreds of records, some on billions,
and the ideal system for each is quite different (small data works great by
transferring data back and forth over HTTP; data with hundreds of thousands
of records and up... not so much).

2) Same question as above, but for processing times. Some algos are aimed at
operating on the fly and might take ~1 second or less. Others (many that I
deal with quite often) might run for days, or even weeks. What sort of
processes are you optimizing for, and are long-running procs on the radar?

It is an interesting idea, but there is a high bar when competing against my
language's package manager and the tens of thousands of "algorithms" already
out there. Best of luck!

~~~
platypii
The data question is something we are taking very seriously. I talked a bit
more about this in another comment, but we know that in many cases data is
heavy and is often the limiting factor. We are building a robust data
platform on top of Hadoop + Spark, so as to be able to handle large amounts
of data and deploy code to where the data is.

That said, I think there are many algorithms that will work as standalone
algorithms. Lots of machine learning type algorithms are mostly CPU-bound.
There are companies like [http://www.kooaba.com](http://www.kooaba.com) that
do image recognition as a service over HTTP. Also things like Siri and Google
Now seem to work well enough, despite network latency. Thanks for the
feedback!

------
DennisP
The API is interesting but what I really want is to understand the algorithms.
Clean, clear reference code in multiple languages with good explanations would
be hugely helpful. Is that part of the plan?

~~~
platypii
Yes, definitely. The code is available for people to learn from, and you will
be able to edit the code and see the results immediately. The goal is all
about making algorithms more accessible.

~~~
dmunoz
This is the most interesting part to me. I agree with the other people
commenting in this thread that advanced algorithm design (as seen in journals)
isn't broken. Interesting advanced algorithms do get spun out into open source
implementations.

A non-academic community based around the discovery, discussion, and
implementation of algorithms would be an interesting place to hang around. An
MVP would be, basically, a discussion forum with an index of GitHub repos for
the implementations, but there is other value to be added on top.

Good luck. I'll be paying attention to what becomes of your project.

------
j2kun
Reminds me of the xkcd:

1\. There are 14 libraries for technical algorithms.

2\. We need a meta-library so that we don't need to worry about having so
many libraries!

3\. There are 15 libraries for technical algorithms.

------
joveian
How is the DeepMind purchase in any way a "record sum"? The linked article
does not seem to make this claim. I didn't even get to your main point and
already I don't trust you.

~~~
doppenhe
$450 million for a company that never shipped a product. Their entire value
came from their algorithms and algorithm developers. That part is unheard of.

~~~
joveian
Thanks, that makes more sense.

------
X4
Contradiction at its finest:
[http://i.imgur.com/BwE2jmj.png](http://i.imgur.com/BwE2jmj.png) Good idea,
but remove the god damn login wall.

~~~
platypii
The irony is not lost on us :-) We are a small team actively building this
site. We will be completely open as soon as possible.

------
jroesch
I think it is a little disingenuous to say that algorithms get buried in
academic literature and are _impossible_ to find. By the nature of research,
most of these things are made public when published; oftentimes there is no
implementation, or just a research-quality one (which I can guarantee you is
almost never useful for the "real world").

For example, one day I was interested in implementing HyperLogLog (a set
cardinality estimator that is useful in data analysis). In about 10 minutes I
had all the relevant papers on hand, and after skimming them I had a pretty
good sense of how to implement it.

Similarly, if I want to know how to implement a program dependence graph for
doing program analysis, I can go read a few pages of a paper and get a good
description of the algorithm I would need to construct such a thing. I can
believe the argument that some of these things are poorly indexed, but even a
bare minimum of Google searching usually turns up useful algorithms. I would
argue that oftentimes many of these research algorithms involve a bunch of
different design decisions that are best explored in the academic literature
around them, and an implementation and a few notes are not sufficient
exploration.

For example, I was recently implementing Paxos, and there were _tons_ of
little details to be extracted from the surrounding papers that had a big
impact on the actual implementation we ended up with. The 'Paxos Made Live'
paper from Google had many details that were only relevant/true because of
engineering decisions made by the team. If you were presented with an
implementation derived solely from that paper, there are multiple incorrect
assumptions you could derive.

An instance of this is made apparent in Paxos Made Live. Google essentially
fixes their proposer because they have used Google-specific details about the
number of participants and their availability. The result is that they direct
_all_ traffic to a single node and don't spend much time talking about
leader/proposer selection (which could be useful to _your_ needs).

I also don't buy that an important part is getting the algorithms running as
a service. Most so-called "algorithms" are nothing more than a subroutine
that is needed as a piece of a greater whole. I would venture the most useful
"algorithms" are most likely container data structures and the algorithms
that operate over them. It seems that these are probably most useful to have
as a library. Many libraries have already taken this approach: LLVM
(algorithms for code generation, albeit not always everything you want),
OpenCV for computer vision routines, BLAS for linear algebra, NLopt for
non-linear optimization, and I'm sure one could come up with many more
examples of democratized algorithms.

------
gone35
Two unsolicited suggestions, if I may:

1\. Make open source releasing of algorithms compulsory, not optional.

Think of it from your prospective clients' perspective: the upside of digging
an algorithm out of academic journals and implementing it yourself is that
you get to fully understand the source code (since you end up writing it
yourself); you can attest to its correctness; and you get to stand on the
shoulders of giants by tweaking and extending it later, if you wish.

Admittedly this might not be seen as much of a benefit for business users,
but for many actual users of advanced algorithms in the scientific computing
community, having to use proprietary algorithms with restrictions on their
use has been seen as a significant step backwards, as manifested in the
controversy over "Numerical Recipes" [1,2,3]. Even if the benefits are more a
matter of principle than mere practicality, the palpable distaste for
proprietary algorithms in the scientific computing community is something you
should at least keep in mind, lest you risk alienating a core user base for
your product.

2\. Formally verify the correctness of every algorithm submitted.

This is as crucial as large-scale deployment for many scientific computing
users, and it is one of the banes of (and reasons for) implementing the
algorithm yourself. Here your product could really offer a compelling
proposition to these users.

This would also be beneficial for ensuring the reliability of your API, even
if you formally disclaim liability, pushing it onto the algorithm developers
(as you surely do). Otherwise you might find yourself on the wrong side of
securities regulators over a multi-million-dollar trading glitch caused by
one of your algorithms [4], or something crazy like that.
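
Full formal verification of arbitrary submissions is a tall order; a cheaper
first line of defense is checking each submission against a trusted oracle on
randomized inputs, including edge cases. A hypothetical sketch in Python (the
`submitted_sort` name and the sorting example are purely illustrative, not
part of the actual platform):

```python
import random

def submitted_sort(xs):
    # Stand-in for an algorithm submitted to the platform;
    # here, a simple insertion sort.
    out = []
    for x in xs:
        i = len(out)
        while i > 0 and out[i - 1] > x:
            i -= 1
        out.insert(i, x)
    return out

def check_against_oracle(impl, oracle, trials=200):
    """Property check: impl must agree with a trusted oracle on
    hand-picked edge cases and randomized inputs."""
    cases = [[], [1], [2, 2, 2]]
    cases += [[random.randint(-50, 50) for _ in range(random.randint(0, 30))]
              for _ in range(trials)]
    for xs in cases:
        assert impl(list(xs)) == oracle(list(xs)), f"mismatch on {xs}"
    return True

print(check_against_oracle(submitted_sort, sorted))  # prints True
```

This is no substitute for a proof of correctness, but it catches the most
common failure mode (an implementation that quietly disagrees with the
published algorithm) before anyone builds on it.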

[1] In fact, the Wikipedia article on Numerical Recipes claims that one of
the motivations for the development of the GNU C library was precisely to
come up with a free alternative to them! See:
[http://en.wikipedia.org/wiki/Numerical_Recipes](http://en.wikipedia.org/wiki/Numerical_Recipes)

[2]
[http://aufbix.org/~bolek/download/nr.pdf](http://aufbix.org/~bolek/download/nr.pdf)

[3]
[http://www.astro.umd.edu/~bjw/software/boycottnr.html](http://www.astro.umd.edu/~bjw/software/boycottnr.html)

[4] [http://www.bloomberg.com/news/2013-10-16/knight-capital-
agre...](http://www.bloomberg.com/news/2013-10-16/knight-capital-agrees-to-
pay-12-million-fine-for-2012-errors.html)

------
mmenafra
Nice article! Looks like a very promising platform.

Cheers

~~~
platypii
Thanks! We will be coming out with more details about the platform soon.
Always interested to hear feedback, though.

