This seems interesting, but I'm wondering who the target customer is (both for PiCloud, and for this specific use case with REST publishing).
We're running on Heroku, where workers are expensive (at least, unless you scale them dynamically). We looked at SimpleWorker, which is basically PiCloud for Ruby, but the latency for enqueuing a job was way too high, and not having access to our database models and such made it inconvenient for that purpose. (Long story short, we ended up with Resque like everyone else.) Does PiCloud have the enqueue-latency problem beat? (Like under 100ms RTT?)
But these services don't make sense to me for big data or scientific computing either. If you have tons of data, you're probably going to end up with something like MapReduce with some specialized HBase-like backend. And most "scientific computing" I've seen seems to be either MATLAB (if performance is no issue at all) or custom C++ code linking to specific linear algebra libraries (if performance is important). That said, I'm not an expert in either of these use cases, so maybe I'm wrong? Maybe this is for mobile or to fix some brokenness in GAE?
Put another way, what is the use case where you (a) don't care about high enqueue latency, (b) don't care about running within your own infrastructure and with your own code (e.g., for database access), (c) want tons of compute cycles in Python without having to manage the infrastructure, but (d) don't want specific control over the actual infrastructure and data persistence?
From the get-go, we've targeted users with serious compute-intensive needs. These users, from both academia and industry, generally do what we call "scientific computing." A couple of examples: oil companies running immense geophysics simulations, and bioinformatics laboratories in universities doing comparative genomics; we'll be posting a guest blog entry next week on successfully reducing the time of a sequence alignment from 20-25 hours to less than an hour. Other examples include neuroscience, astronomy, seismology, weather analysis, and protein folding.
As time has gone on, we've noticed two sizable classes of users whom we consider non-scientific. The first is the risk-analysis divisions of financial firms, who run Monte Carlo simulations in their never-ending endeavor to determine their portfolios' risk profiles. The second is web companies doing all sorts of background processing: video encoding, feed aggregation, social-data scraping, and analytics of all sorts.
The REST feature is an experiment on many levels. In the short term, it helps our existing users utilize PiCloud without Python, which is very important in companies with large polyglot codebases (read: Java). In the medium term, we want to see whether there's adequate interest from web and mobile companies that require serious scalability and are thus intrigued by how we entirely divorce their algorithms from their servers. In the long term, we want to see how far we can take the idea of "publishing" a function; there are many users out there who would love to share their algorithms in working form but lack the programming know-how and computing hardware to do it.
To answer your other questions: We're releasing a big-data solution à la MapReduce for our users in the coming months. Also, Python has a strong scientific-computing community primarily because it interoperates very well with existing C/C++/Fortran code via extensions. In most of our use cases, Python is a wrapper around C code that does the heavy lifting. PiCloud, unlike many other background-processing services, lets users deploy more than just Python or Ruby code. We used to allow this through a "package upload" system, but we've just replaced it (in beta) with the ability to fully customize a filesystem environment (apt-get away!).
Our users are able to access their databases from PiCloud, often without any code modifications, which means their Django models work just fine. If their database resides on EC2 us-east, the performance hit is minimal.
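To make the "no code modifications" point concrete, here's a minimal sketch using our documented cloud.call/cloud.result API (myapp and its User model are hypothetical stand-ins; it assumes your Django settings are importable and the database is reachable from our workers):

    import cloud

    def count_active_users():
        # Hypothetical model import; stands in for whatever your app defines.
        from myapp.models import User
        return User.objects.filter(is_active=True).count()

    jid = cloud.call(count_active_users)  # executes on PiCloud, not locally
    print(cloud.result(jid))              # blocks until the job finishes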
We just did a quick, non-scientific check of our enqueue time from an EC2 us-east server to PiCloud. A cold start (first job) takes about 900ms (it can take longer depending on what needs to be transferred), but once the connection has been established, an enqueue takes around 86ms, dropping to roughly 50ms as the system warms up.
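If you want to reproduce that check yourself, a rough sketch (ours was slightly more careful, but the idea is the same): cloud.call returns as soon as the job has been enqueued, so timing it approximates enqueue latency.

    import time
    import cloud

    def noop():
        pass

    for i in range(5):
        start = time.time()
        cloud.call(noop)  # returns once the job has been enqueued
        print("enqueue %d: %.0f ms" % (i, (time.time() - start) * 1000))

The first iteration includes the cold-start cost (connection setup plus transferring the function); later iterations approach the steady-state number.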
My friend and I just wrote something today that I hope will become the bedrock of a similar service, but for Perl: Oyster (http://github.com/gatlin/oyster), so named because it contains a Perl.
We tied Redis lists to filehandles, redirected STDIN/STDOUT to them in a local context, and then simply eval the submitted code. It's a proof of concept awaiting sophistication and configuration.
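In Python terms (Oyster itself is Perl), the mechanism looks roughly like this; the key name is made up, and this is illustrative only: a file-like object backed by a Redis list, swapped in for stdout while the submitted code runs.

    import sys
    import redis

    class RedisStream(object):
        """File-like wrapper that pushes each write onto a Redis list."""
        def __init__(self, conn, key):
            self.conn, self.key = conn, key
        def write(self, data):
            if data:
                self.conn.rpush(self.key, data)
        def flush(self):
            pass

    conn = redis.Redis()
    old_stdout, sys.stdout = sys.stdout, RedisStream(conn, "job:42:stdout")
    try:
        exec("print('hello from submitted code')")  # run the submitted code
    finally:
        sys.stdout = old_stdout  # always restore the real stdout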
It's best to understand the differences from the top down. Celery is software. PiCloud is a service. This largely dictates the automation each system is able to provide.
Using PiCloud, there is no server to set up with your background-processing software (in this case, Celery). There is no server to deploy your own codebase on, where you have to manage the versioning of code and data; we automatically deploy the correct versions for you. With Celery, if you need more computing power, you'll have to set up a new server. With PiCloud, we're managing your infrastructure, so we'll automatically boot new servers (hundreds, if necessary), or you can do it manually with the click of a button. If there's an update to Celery, you have to shut down your system and redeploy. With PiCloud, we handle all the server-side software updates because we control the servers.
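For contrast, here's roughly what the self-managed side looks like with Celery (an illustrative snippet using Celery's standard app API; you provision and run the broker, e.g. Redis, and the worker processes, e.g. "celery -A tasks worker", yourself):

    from celery import Celery

    # You operate the broker/backend this points at.
    app = Celery("tasks",
                 broker="redis://localhost:6379/0",
                 backend="redis://localhost:6379/0")

    @app.task
    def add(x, y):
        return x + y

    result = add.delay(2, 3)       # enqueue; a worker you manage executes it
    print(result.get(timeout=10))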
It all boils down to less management and more automation. A couple more examples: we've built redundancy across our system so that you don't have to design for server failure yourself, and you can choose the type of core (CPU + RAM combo) you want with a single keyword argument; no need to swap out all the machines you use.
The final result is that with a simple download of our client library and three lines of code, you can be leveraging a cluster of hundreds of machines. That's PaaS at its best.
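For concreteness, those three lines look roughly like this (per our documented cloud.call/cloud.result API; the _type keyword is the core-selection argument mentioned above, and "c2" is just an example core name):

    import cloud

    def square(x):
        return x * x

    jid = cloud.call(square, 7, _type="c2")  # enqueue on a c2 core
    print(cloud.result(jid))                 # 49, once the job completes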
A much more apt comparison would be with a service-oriented Celery.
Unfortunately, with our new s1 core type, it's even easier to DDoS services. ;)
In all seriousness, we've given thought to this issue, and the solution we've found isn't technological. Users who want to use over a couple of hundred cores simultaneously are capped until we've approved them. It's exactly what Amazon Web Services does as well.