
Publish your Python function to PiCloud and invoke via REST - kelkabany
http://blog.picloud.com/2011/09/14/introducing-function-publishing-via-rest/
======
socratic
This seems interesting, but I'm wondering who the target customer is (both for
PiCloud, and for this specific use case with REST publishing).

We're running on Heroku, where workers are expensive (at least, unless you
scale them dynamically). We looked at SimpleWorker, which is basically PiCloud
for Ruby. But latency for enqueueing a job was way too high, and not having
access to database models and such made it inconvenient for that purpose.
(Long story short, we ended up with Resque like everyone else.) Does PiCloud
have the enqueue latency problem beat? (Like under 100ms RTT?)

But these services don't make sense to me for big data or scientific computing
either. If you have tons of data, you're probably going to end up with
something like MapReduce with some specialized HBase-like backend. And most
"scientific computing" I've seen seems to be either MATLAB (if performance is
no issue at all) or custom C++ code linking to specific linear algebra
libraries (if performance is important). That said, I'm not an expert in
either of these use cases, so maybe I'm wrong? Maybe this is for mobile or to
fix some brokenness in GAE?

Put another way, what is the use case where you (a) don't care about high
enqueue latency, (b) don't care about running within your own infrastructure
and with your own code (e.g., for database access), (c) want tons of compute
cycles in Python without having to manage the infrastructure, but (d) don't
want specific control over the actual infrastructure and data persistence?

~~~
kelkabany
Thanks for your input!

From the get-go, we've targeted users with serious compute-intensive needs.
These users from both academia and industry generally do what we call
"scientific computing." A couple examples are oil companies doing immense
amounts of geophysics simulations, and bioinformatics laboratories in
universities doing comparative genomics; we'll be posting a guest blog entry
in the next week on successfully reducing the time of sequence alignment from
20-25 hours to less than an hour. Other examples include neuroscience,
astronomy, seismology, weather analysis, and protein folding.

As time has gone on, we have noticed two sizable classes of users whom we
consider non-scientific. The first are the risk analysis divisions of
financial firms who run Monte Carlo simulations in their never-ending
endeavors to determine their portfolio's risk profile. The second are web
companies doing all sorts of background processing: video encoding, feed
aggregation, social data scraping, and analytics of all sorts.
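Monte Carlo workloads like these are embarrassingly parallel, which is what makes them a natural fit for a map-style cloud API. As a minimal local sketch (plain Python, no PiCloud, with made-up portfolio parameters), here's a value-at-risk estimate where each simulated path is independent:

```python
import random

def simulate_pnl(seed, n_days=250, daily_vol=0.02):
    """One Monte Carlo path: cumulative P&L over n_days of Gaussian returns."""
    rng = random.Random(seed)
    return sum(rng.gauss(0, daily_vol) for _ in range(n_days))

def value_at_risk(n_paths=10_000, confidence=0.95):
    """95% VaR: the loss exceeded in only 5% of simulated paths."""
    # Each path is independent, so this map() is exactly the kind of call
    # that could be farmed out to many remote workers without change.
    pnls = sorted(map(simulate_pnl, range(n_paths)))
    return -pnls[int((1 - confidence) * n_paths)]

var_95 = value_at_risk()
```

Because the paths share no state, scaling out is just a matter of splitting the seed range across workers.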

The REST feature is an experiment on many levels. In the short term, it helps
our existing users utilize PiCloud without Python, which is very important in
companies with large polyglot codebases (read: Java). In the medium term, we
want to see whether there's adequate interest from web and mobile companies
who require serious scalability and are thus interested in how we entirely
divorce their algorithms from their servers. In the long term, we want to see
how far we can take the idea of "publishing" a function; there are many users
out there who would love to share their algorithms in _working form_ but lack
the programming know-how and computing hardware to do it.
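The core idea of publishing a function over REST can be sketched with nothing but the standard library. This is not PiCloud's implementation, just the concept: a Python function (a toy sequence-match scorer here) exposed at a URL, invoked with JSON arguments, and answering with a JSON result:

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import Request, urlopen

def align_score(a, b):
    """Toy 'published' function: count matching positions of two sequences."""
    return sum(x == y for x, y in zip(a, b))

class PublishHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Decode JSON arguments, call the published function, return JSON.
        body = self.rfile.read(int(self.headers["Content-Length"]))
        args = json.loads(body)
        payload = json.dumps({"result": align_score(args["a"], args["b"])}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

    def log_message(self, *args):  # keep the demo quiet
        pass

server = HTTPServer(("127.0.0.1", 0), PublishHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

# Any client in any language can now call the function over HTTP.
req = Request(f"http://127.0.0.1:{server.server_port}/align_score",
              data=json.dumps({"a": "GATTACA", "b": "GACTATA"}).encode(),
              headers={"Content-Type": "application/json"})
with urlopen(req) as resp:
    answer = json.loads(resp.read())["result"]
server.shutdown()
```

The point of the polyglot argument above is visible here: once the function sits behind a URL, the Java (or anything else) side only needs an HTTP client.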

To answer your other questions: we're releasing a MapReduce-style big data
solution for our users in the coming months. Also, Python has a strong
scientific computing community primarily because it interoperates very well
with existing C/C++/Fortran code via extensions. In most of our use cases,
Python is being used as a wrapper for C code, which is doing the heavy
lifting. PiCloud, unlike many of the other background processing services,
allows users to deploy more than just Python or Ruby code. We used to allow
this through a "package upload" system, but we've recently replaced it (in
beta) with the ability to fully customize a file system environment (apt-get
away!).

Our users are able to access their databases from PiCloud, often without any
code modifications, which means their Django models work just fine.
If their database resides on EC2 us-east, then the performance hit is minimal.

We just did a quick, non-scientific check of our enqueue time from an EC2
us-east server to PiCloud. A cold start (first job) takes 900ms (it could take
longer depending on what needs to be transferred), but once the connection has
been made, enqueue takes around 86ms, and as the system warms up it drops to
50ms.
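Measurements like these are easy to reproduce with a small harness: time each successive call and compare the first (which pays connection setup) against the steady state. A sketch, with a no-op lambda standing in for the actual enqueue call:

```python
import time

def time_enqueues(enqueue, n=20):
    """Time n successive calls, in milliseconds.

    Returns (first_call_ms, best_warm_ms): the first sample reflects any
    one-time setup cost; the min of the rest approximates steady state.
    """
    samples = []
    for i in range(n):
        start = time.perf_counter()
        enqueue(i)
        samples.append((time.perf_counter() - start) * 1000)
    return samples[0], min(samples[1:])

cold_ms, warm_ms = time_enqueues(lambda i: None)
```

Swapping the lambda for a real enqueue call against any queueing service gives the cold/warm split described above.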

------
gatlin
My friend and I just wrote something today that I hope will become the bedrock
of a similar service, but for Perl: Oyster
(<http://github.com/gatlin/oyster>), so named because it contains a Perl.

We tied Redis lists to filehandles, and then redirected STDIN/STDOUT in a
local context to the Redis filehandles. Then, we simply eval code. It's a
proof of concept awaiting sophistication / configuration.
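Oyster itself is Perl, but the trick (rebind the standard streams, then eval the submitted code) generalizes. A Python sketch of the same idea, with an in-memory buffer standing in for the Redis-backed filehandles:

```python
import contextlib
import io

def run_published_snippet(code, stdin_text=""):
    """Exec a code snippet with stdout captured into an in-memory buffer,
    a local stand-in for Oyster's Redis-list filehandles."""
    out = io.StringIO()
    with contextlib.redirect_stdout(out):
        # In Oyster the snippet reads STDIN and writes STDOUT; here the
        # input arrives as a variable and anything printed is captured.
        exec(code, {"stdin_text": stdin_text})
    return out.getvalue()

result = run_published_snippet("print(stdin_text.upper())", "hello")
```

The production version would also need sandboxing, which bare `exec`/`eval` emphatically does not provide, hence the "awaiting sophistication" caveat.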

------
illumen
The '90s called. They want their object publishing back ;)

~~~
catwell
Word. Compare this to ftp://catwell.info/code/examples/webservice/ and cry.

~~~
kelkabany
:'( There's nothing that precludes us from exposing a SOAP interface.

------
lsemel
From the perspective of a web app, how would PiCloud be better than, say,
running Celery on my own server and sending all background tasks to it?

~~~
kelkabany
It's best to understand the differences from the top down. Celery is software.
PiCloud is a service. This largely dictates the automation each system is able
to provide.

Using PiCloud, there is no server to set up with your background processing
software, in this case Celery. There is no server to deploy your own codebase
on, where you have to manage the versioning of code and data; we automatically
deploy the correct versions for you. With Celery, if you need more computing
power, you'll have to set up a new server. With PiCloud, we're managing your
infrastructure, so we'll automatically boot new servers--hundreds, if
necessary--or you can manually do it with the click of a button. If there's an
update to Celery, you have to shut down your system and deploy it. With
PiCloud, we handle all the server-side software updates because we control the
stack.

It all boils down to less management, and more automation. A couple more
examples. We've built redundancy across our system so that you don't have to
design a system to handle server failure. You can choose the type of core (CPU
+ RAM combo) you want to use with one keyword argument; no need to change out
all the machines you use.

The final result is that with a simple download of our client library and
three lines of code, you can be leveraging a cluster of hundreds of machines.
That's PaaS at its best.

A much more apt comparison would be with a service-oriented Celery.
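The "three lines" above describe a call/result job interface. A toy local stand-in (not the real cloud client; `ToyCloud` and its thread pool are stand-ins for remote workers) shows the shape of that pattern:

```python
from concurrent.futures import ThreadPoolExecutor

class ToyCloud:
    """Minimal local mimic of a call/result job API. Illustrative only:
    a real service ships the function and its dependencies to remote
    machines instead of local threads."""

    def __init__(self, workers=4):
        self._pool = ThreadPoolExecutor(max_workers=workers)
        self._jobs = {}
        self._next_jid = 0

    def call(self, func, *args):
        """Enqueue func(*args) and return a job id immediately."""
        jid = self._next_jid
        self._next_jid += 1
        self._jobs[jid] = self._pool.submit(func, *args)
        return jid

    def result(self, jid):
        """Block until the job finishes and return its value."""
        return self._jobs[jid].result()

cloud = ToyCloud()
jid = cloud.call(pow, 2, 10)   # returns right away
answer = cloud.result(jid)     # blocks until done
```

The service's pitch is that the same two calls keep working whether the job runs on one core or hundreds, because the worker fleet behind them is someone else's problem.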

------
0x12
That's pretty neat. I hope they have enough sandboxing in place that this will
not be used as a means to ddos innocent bystanders.

~~~
kelkabany
Unfortunately, with our new s1 core type, it's even easier to DDoS services.
;)

In all seriousness, we've given thought to this issue, and the solution we've
found isn't technological. Users who want to use over a couple of hundred
cores simultaneously are capped until we've approved them. It's exactly what
Amazon Web Services does as well.

------
trusko
Very nice, good idea. Cheers.

