As an example: you can buy a Dell PowerEdge M915 from eBay with 128 cores for ~$500 USD and a rack costs around the same. Five of them is 640 cores for a total cost of just 3k USD. That's 640 cores that you now own forever. Guess how much it would cost to use that many cores for a month on AWS? Well over 10k... and next month it's another 10k... and so is next month...
With this option you only pay for power and still have the resources for all future experiments. I think the m1000e rack can fit 8 blades in total so you could upgrade to a maximum of 1024 cores in the future! The downside with this particular rack is that it's very, very loud. But I've run the numbers here and it's hard to beat high-core, used rack servers on a $/GHz basis.
It's telling that you've neglected your own labor in this assessment, and that of the other specialists required to get such ad-hoc compute working at scale (dynamic software networks, on-demand compute, provisioning/billing/monitoring apparatuses, etc...)
What plenty have realized in cloud-native approaches is that the TCO is pretty compelling, sometimes because it's easier to distribute procurement/provisioning/billing to teams for their own resource needs, or just in sidestepping the inevitable bureaucracy when you try to centralize that decision-making.
Of course it's different if you're talking hobby kit setups in our own homes, but if you're aiming at real scale for absurd parallelisms then that ship has sailed.
It's looking like for about $50K in hardware I get 10x the ram/compute/storage/(internal) network I get from the cloud at around $10K/mo. However, it is taking me about $25K in labor to set up. Hosting costs about $1K for a rack and a decent pipe to the server.
So more than a hobby but not approaching real scale.
The main barrier to using a lambda model is if you have anything that smacks of a DB. It would take me $100K+ to transition away from SQL. If you can do lambda w/o a big retool cost, then it is probably viable. If you are just running a bunch of VMs on the cloud, it is pretty expensive.
Interestingly, my main client (a very large company) just went the other direction and moved all its compute to a system that is Spark underneath running on Azure. They are trying to decommission some expensive Teradata instances. So far, it is a mixed bag -- it is a big step forward (for them) on anything that is 'batch-oriented' but 'interactive' performance is dismal.
Price appears to be a big motivation. I always forget that most large enterprises run on exotic stuff with crazy service contracts that make the cloud look cheap.
Don't forget the second factor of (not having) the workforce. Physical servers require a bunch of suckers who know physical infrastructure and agree to be on call 24/7 and deal with DELL/HP/colo full time on top of periodic travels to the datacenter.
In practice, when I was responsible for racks in two separate data centres, I spent 3-4 days a year in the data centres (combined); everything else was handled easily via tickets or remotely. Overall I've generally spent less time on devops with hardware in colo's than with cloud setups.
But specifically the amount of on-call work tends to be down to code quality and higher level architecture not whether you host in cloud or colo or rent managed servers - the "low level" problems become part of the noise floor very quickly in any system with reasonable failover.
If we're talking regular maintenance, like a RAID controller or a power supply going bonkers, AWS is always accessible: it lets you realize something is broken and create a new volume or instance in 5 minutes. With dedicated hardware, you might be toast with no remote access and/or no spare parts.
Sure, you know you’re down, you simply can’t do anything about it. Not so with your own hardware, which is why many orgs continue to run their own gear.
I don't buy it. Everywhere I've worked that colo'd or owned their DC had wider-reaching outages (fewer than cloud, but affecting more systems). Usually to do with power delivery (turns out multi tiered backup power systems are hard to get right) or networking misconfiguration (crash carts and lights-out tools are great, but not if 400 systems need manual attention due to a horrible misconfiguration).
I think folks underestimate the non-core competencies of running a data center. Also often underestimated is the value of running in an environment designed to treat tenants as potential attackers; unlike AWS's fault isolation, when running your own gear it's really easy to accidentally make the whole system so interconnected as to be vulnerable--even if you make only good decisions while setting it up
As someone who works for a company with over 150 data centers around the world, I know space and power is always one of our biggest expenses.
Many of them offer a service which allows you to drop off your hardware and they will manage the power, cooling, even repairs and replacement for a monthly fee.
What do you think is ‘the’ data centre? There’s no single universal data centre. And most people don’t have a data centre, or access to one.
It’s called co-location where you buy rack space in one of the core data centre hubs in each city. Been available for many years and it’s very affordable e.g. a few hundred a month.
E.g. people say "you can buy carrots at the grocery store" despite there being multiple such stores.
"At the datacentre" doesn't imply one universal datacentre. It just informs you of which kind of facility you can use.
"you can buy carrots at a grocery store"
"you can buy carrots at the grocery store over there"
The definite article can also be used in English to indicate a specific class among other classes: The cabbage white butterfly lays its eggs on members of the Brassica genus.
And from other sources:
We also use the definite article:
to say something about all the things referred to by a noun:
The wolf is not really a dangerous animal. (= Wolves are not really dangerous animals.)
The kangaroo is found only in Australia. (= Kangaroos are found only in Australia.)
The heart pumps blood around the body. (= Hearts pump blood around bodies.)
We use the definite article in this way to talk about musical instruments:
Joe plays the piano really well.
She is learning the guitar.
to refer to a system or service:
How long does it take on the train?
I heard it on the radio.
You should tell the police.
There are plenty of cloud providers who sell colocation services or even rent bare metal hardware.
Take for instance Hetzner. They rent 32-core AMD EPYC boxes with 128GB of RAM for about 130€/month. If you rent half a dozen of those boxes you get more computational bang for a fraction of the buck.
If you reach a scale where your needs span tens of data centers across the world, then the economies of scale and operational expenditure are quite different and peculiar, but they are still a couple of orders of magnitude cheaper than AWS.
You need a few people who understand a little bit of hardware and IPMI and you are set. That's not a big deal.
The cloud gaslighting is nuts.
Tons of companies now exist to give you insight into your cloud bill because that's the last thing cloud providers want you to have.
So in the end you must decide what kind of pain you want. Cost pain, tech pain, and that all depends on your company's particular circumstances.
Second, 640 cores on an m4.large for 3 years upfront (the equivalent purchase) is $8755/mo, not well over 10k.
Third, you really underestimate electricity costs if you're running a server 24/7 at all cores.
Last, and most importantly, most people simply don't run servers 24/7. They run batch computations for five hours a week, or a day, and then spend a week or two crunching the numbers.
There's some case to make that for certain institutions with very particular compute needs, running on-premise might make short-term financial sense (let's not even get started on labor costs). But it's really inaccurate to call cloud computing a "rip-off" for anything but the niche-est customers.
You can beat that cores per dollar handily if you're willing to use sufficiently odd hardware.
E.g. my Intel cluster here is mostly systems with 4x E7-8880v4 (22 cores per chip). I paid $150/chip + $450 host (thanks ebay) excluding ram.
(I have 8 such hosts now, so 704 cores)
So essentially one month's cost at your AWS price w/ a 3-year commit pays for an equivalent core count in systems for me... Throw in another month for the ram. Round up to another month to account for the (quite) non-trivial power usage.
You do not have to be anywhere near 100% utilization for AWS to be an extremely poor value for large computing jobs compared to thrifty spending on the surplus market.
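For anyone who wants to check that claim, here is a rough sketch of the arithmetic using only the figures quoted in this thread. RAM and power are left out, as in the comment, and the AWS monthly figure is the 640-core, 3-year-commit number from upthread, so it is not an exact core-for-core match.

```python
# Figures quoted in the thread; RAM and power are excluded, as in the comment.
CHIP_PRICE_USD = 150       # per used E7-8880v4
HOST_PRICE_USD = 450       # per quad-socket host
CHIPS_PER_HOST = 4
HOSTS = 8
TOTAL_CORES = 704          # core count stated in the comment
AWS_MONTHLY_USD = 8755     # 640-core, 3-year-commit figure quoted upthread


def cluster_cost(hosts: int = HOSTS) -> int:
    """Purchase price of the hosts with CPUs installed (no RAM, no power)."""
    return hosts * (CHIPS_PER_HOST * CHIP_PRICE_USD + HOST_PRICE_USD)


def months_of_aws(hardware_usd: float, aws_monthly_usd: float = AWS_MONTHLY_USD) -> float:
    """How many months of the quoted AWS bill the hardware purchase equals."""
    return hardware_usd / aws_monthly_usd


total = cluster_cost()
cores_per_chip = TOTAL_CORES // (HOSTS * CHIPS_PER_HOST)
print(f"hardware: ${total}, cores per chip: {cores_per_chip}")
print(f"equals {months_of_aws(total):.2f} months of the quoted AWS bill")
```

On those inputs the hardware comes out slightly under one month of the quoted bill, which matches the "one month's cost" estimate.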
>I paid $150/chip + $450 host (thanks ebay) excluding ram.
E7-8880v4 are $600 used each. Quad-socket mobos are not cheap, ranked ram is not cheap. Even if you pulled these from a vendor going out of business the bit coin mining crews would suck these up in a heartbeat.
Where are you getting your prices?
(similarly, quad socket systems with slow or no cpus go cheap from people that don't know much about them)
> ranked ram is not cheap.
Yes, ram for quad socket hosts adds up, since you need at least 2-8 dimms per socket for full memory bandwidth (which was why I explicitly pointed out that I left it out).
On the plus side, those systems tend to have a lot of dimm sockets so you can use lower capacity dimms.
> bit coin mining crews
nah, not useful for that.
For my use case (infrequent, easily separable jobs) FaaS stuff is a godsend. In the last year I've been doing things that would be orders of magnitude harder and more expensive thanks to it.
not to mention the cooling, or the per square foot cost of renting the space for this hardware. anything close to what's being described here is going to need some sound deadening (be it distance or material) from humans.
Still a world of difference between AWS's 10k/month cost to access that many cores.
Assuming the $600 covers power, rack costs and cooling, your data center bill is 6600/month.
For three years, the AWS bill is $315k, and the self hosted option is $265,100 (best case). I'm not up to speed with the pricing trends to say whether the gap will change over those 3 years based on data center changes or AWS changes, but the difference is ~$50k for three years, or roughly 3 months of an engineer's salary over the course of three years.
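The three-year totals can be reproduced from the numbers quoted in the thread. A minimal sketch follows; the up-front hardware figure is not stated in the comment, so it is inferred here as whatever remains after 36 months of the $6,600 data center bill, and that inference is an assumption.

```python
MONTHS = 36
AWS_MONTHLY_USD = 8755            # 3-year-commit figure quoted upthread
COLO_MONTHLY_USD = 6600           # data center bill from the comment
SELF_HOSTED_TOTAL_USD = 265_100   # three-year total stated in the comment

aws_total = AWS_MONTHLY_USD * MONTHS
colo_total = COLO_MONTHLY_USD * MONTHS

# Assumption: whatever is left of the stated total is up-front hardware.
implied_hardware = SELF_HOSTED_TOTAL_USD - colo_total

gap = aws_total - SELF_HOSTED_TOTAL_USD

print(f"AWS, 3 years:         ${aws_total:,}")
print(f"Colo fees, 3 years:   ${colo_total:,}")
print(f"Implied hardware:     ${implied_hardware:,}")
print(f"Gap in favor of colo: ${gap:,}")
```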
And management of the hardware itself doesn't take an extra employee. It probably averages to less than one workday per month.
My average time spent in data centers when handling racks in two different locations was around 3-4 days a year. Buying on-call, out-of-hours support is going to cost you a tiny fraction of a person's salary per year, either on retainer and/or based on hefty day rates.
Hardware failures simply do not occur often enough for a rack or two worth of hardware to typically require enough maintenance to make it that expensive, and remote hands (people who act based on a ticket) are available in pretty much any data centre at low costs. If you get servers with IPMI etc. you typically pop in only to wire up or retire equipment and deal with the occasional failure.
Source: Have provided that kind of service for years.
To compare apples to apples, that hypothetical pricetag buys you the system's max capacity for the entire period. In AWS, the invoice from using any of their serverless offerings grows wildly even when utilization stays way below the resources of that hardware bundle.
For instance, let's keep in mind that AWS charges:
* API Gateway per million requests,
* Lambda per time × memory used by a single call, including the time wasted on hot and cold starts,
* SQS per queued message,
* WAF per request through the firewall.
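To make the shape of that bill concrete, here is a minimal sketch of a per-request cost model. Everything in it is an assumption for illustration: the `Prices` fields and the numbers in the usage example are placeholders rather than any provider's real price list, and it assumes each request passes once through the gateway, the firewall, the queue and a single function call.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Prices:
    """Hypothetical unit prices in USD; substitute the provider's real price list."""
    gateway_per_million_requests: float
    function_per_gb_second: float
    queue_per_million_messages: float
    waf_per_million_requests: float


def monthly_bill(requests: int, avg_duration_s: float, memory_gb: float,
                 prices: Prices) -> float:
    """Bill for a month in which every request touches each metered service once."""
    millions = requests / 1_000_000
    gateway = millions * prices.gateway_per_million_requests
    # Function time is billed as duration x memory, idle hardware or not.
    compute = requests * avg_duration_s * memory_gb * prices.function_per_gb_second
    queue = millions * prices.queue_per_million_messages
    waf = millions * prices.waf_per_million_requests
    return gateway + compute + queue + waf


# Placeholder prices, chosen only to keep the arithmetic easy to follow.
example = Prices(1.0, 0.00002, 0.5, 0.5)
for monthly_requests in (10_000_000, 100_000_000):
    bill = monthly_bill(monthly_requests, 0.2, 0.5, example)
    print(f"{monthly_requests:>11,} requests -> ${bill:,.2f}")
```

Every term scales with request count, so the bill grows linearly with traffic, whereas the owned hardware costs the same whether it is busy or idle.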
That's about 3 months' worth of AWS to access a similar amount of computational resources.
And you still need someone available for a cloud-hosted system. Over ~25 years of handling this kind of work, hardware failures have made up a vanishingly small proportion of the outages I've had to deal with, be it a colo'd setup or a cloud provider.
Q: Don't you still need at least 3 people on call for around the clock ops, regardless of how and where your servers or instances live?
Yes, this is a good future: who would ever want to touch the computers they operate? Better to rent computers and position the core competency of your business with a third party, right? Because you can shed some of those pesky janitorial salaries? I’ll be waiting by my phone when you’re on the verge of bankruptcy once the clouds have their teeth in your books, and suddenly I get promoted from “janitor” to “computer operator who has my interests in mind, even though I not-so-subtly malign his existence”.
All you’ve done with this mindset is drive the same janitors to work for the clouds instead and contributed to the downfall of computing as a discipline that any player has any semblance of agency within, as the people who have actually touched datacenter equipment all work for them now or sit around in horror watching as a generation sympathetic to FLOSS arguments willingly hands over the reins of owning a computer to massive corporations.
It's not that cloud is not the right answer. But people have started to forget that running your own metal is still an option. Or, with current prices and performance, an even more viable option than it was in the past, because you can do so much more with so much less.
: cranking out useless features nobody ask for while looking down on those (dev)ops people.
Quite. In the middle of lockdown a client needed to spin up some virtual machine instances to demo a product to a potential client. Previous boss had been pushing a cloud-only strategy using Azure and was itching to retire all the physical servers.
Problem: can't spin up any kind of VM due to lack of Azure capacity.
me: "Well, we have that [physical] dev server which we still have, we could spin up the demo stuff on that..."
new boss: "Oh, cool. Great lateral thinking!"
Demo done. Potential client happy. Boss happy.
Where? Was not able to find one. Found a 32 core for $750. It was a pre-Epyc AMD server which has its own problems.
There were only 9 options listed for the exact term you gave 'Dell PowerEdge M915' for the USA before I got to pricey international sellers. The few options make me feel skeptical about this.
This is of course before I even consider security and performance issues of used Intel servers.
1. AWS and other cloud vendors don't have just one data centre per region, they usually have three redundant locations for availability reasons connected via a direct fibre link.
2. Cost of time; time to find hardware, prep hardware, maintain hardware
3. Cost of your agility. If you need more compute capacity, it will take you weeks to get the hardware required and installed, otherwise you need to have unused hardware sitting around "just in case".
4. Cost of your availability. What if you have a sudden spike of traffic within minutes and the current available hardware cannot service it? At least in the cloud you can spin up short lived resources to manage that load before it throttles back down again.
5. Permanent running costs of fixed hardware. A lot of implementations do not need permanent hardware running and can throttle down to a base of almost nothing so the average running cost on a monthly basis is actually very low.
Only if your time is worthless
The issue is 'total cost of ownership' - not 'unit cost of equipment'.
The operational cost and overhead of running your own gear can be prohibitive.
AWS is successful because it is in fact much cheaper than the alternative in many enterprise situations.
Your execs are probably right.
Capital outlays are expensive and risky and imply a structural lock-in - but that's just the tip of the iceberg.
"A 1000USD server is in practice impossible for me to order at work"
And what about hosting? Networking gear? Support? Repairs and upgrades? Networking security? And how will those servers integrate with the rest of your outlay?
AWS is ridiculously cheap compared to the alternatives in most scenarios and that's why it's so successful.
AWS/Cloud enables so much more, far more dynamically - the value is considerable.
In some situations, if you have a fairly big need for computing, and it's predictable over long time-frames, and those services don't need to be tightly integrated with other cloud services, and you have the internal know-how to keep them running, then cost savings can obviously be achieved, but this is an optimization.
Put on your Eng/Ops/Business Hat for a minute and consider why services like AWS are exploding and growing to be one of the biggest segments in tech? Because 'stupid executives'? No - it's because the value add is fairly immense.
It's so successful because a lot of engineers selecting what to rent never see the invoices and don't understand the costs of colo or managed servers. You see it all over this discussion e.g. with people assuming a rack or two requires full time staff when you can get sufficient on-call resources on retainer for a few hundred a month or so depending on complexity of the setup most places.
Having ordered, configured, set up, hired staff for and run both colo setups, managed servers, and AWS setups many times I've yet to see AWS be remotely competitive on price ever, to the point that when I did contracting I used to offer clients to transition their systems to managed hosting (whether managed or colo comes out cheapest depends on scale and location - e.g. I'm in London and real-estate prices here are too high for colo to typically beat managed servers if you can deal with ~8ms round-trip to providers in Germany or France; for others managed is necessary to be close to customers) for a percentage of their first 3-6 months of reduced cost. That was costs including their devops contracts and monitoring etc.
I use AWS. It has lots of great features, and sometimes those features are worth the cost, but it's the expensive luxury option of hosting.
> Put on your Eng/Ops/Business Hat for a minute and consider why services like AWS are exploding and growing to be one of the biggest segments in tech? Because 'stupid executives'? No - it's because the value add is fairly immense.
I've done this for 25 years, including private cloud setups from before AWS was a thing, and I've had to yell at execs that wanted to triple our monthly costs because AWS sales had whispered buzzwords into their ears. That was the all in costs. It took a massive effort to explain it to them even with the numbers in black and white in front of their faces.
So, yeah, a lot of the time (not always), it is "stupid" executives. Sometimes talked into it by engineers working around planning processes that make paying by use an easy end-run around budgeting.
The biggest achievement of AWS is to sell the idea that it is cheap because Amazon.
This is for cost optimized on-prem storage vs ebs or s3. I check the actual workload and space utilization, and give amazon every benefit of the doubt to get the ratio that low.
It probably doesn’t help that a $100 Samsung EVO (500GB) can do 500K IOPS, while the same storage would cost $62.50 per month on a provisioned IOPS EBS volume. At a five year amortization, that’s 37.5 times more expensive than buying. (EBS has poor durability by design, so you end up keeping the same number of copies on prem or in EBS.)
At that point, it’s basically game over, so it doesn’t matter that the IOPS would cost $32,500 per month.
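The ratio above is easy to re-derive. A small sketch using only the figures in this comment; the per-GB and per-IOPS rates it prints are simply what those figures imply, not prices looked up from a current rate card.

```python
SSD_PRICE_USD = 100
CAPACITY_GB = 500
SSD_IOPS = 500_000
EBS_STORAGE_MONTHLY_USD = 62.50    # storage-only charge quoted in the comment
EBS_IOPS_MONTHLY_USD = 32_500      # IOPS charge quoted in the comment
AMORTIZATION_MONTHS = 5 * 12

# Five years of the storage charge versus buying the drive once.
storage_ratio = EBS_STORAGE_MONTHLY_USD * AMORTIZATION_MONTHS / SSD_PRICE_USD

# Unit rates implied by the quoted monthly charges.
implied_per_gb_month = EBS_STORAGE_MONTHLY_USD / CAPACITY_GB
implied_per_iops_month = EBS_IOPS_MONTHLY_USD / SSD_IOPS

print(f"storage alone: {storage_ratio}x the purchase price over five years")
print(f"implied rates: ${implied_per_gb_month}/GB-month, ${implied_per_iops_month}/IOPS-month")
```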
Note: I always include the cost of the scale out infrastructure, etc in the comparisons, but with that, on prem is competitive even if the cluster is 10% full on average and does vintage stuff like triplicating data instead of erasure coding.
If the ops team can’t manage >>10 machines for the price of me managing one, there’s something horribly wrong.
They should just let the developer buy the $1000 machine if it will help productivity.
That doesn’t mean production should run on a pile of machines under someone’s desk. That’s a different scenario.
I recall pointing out to Platform's management that if Google could provide an infrastructure that solved these sorts of problems with massive parallelism that currently required specialized switching fabrics and massive memory sharing we would have something very special. But at the time it was a non-starter, way too much money to be made in search ads to bother with building a system for something like the 200 customers in the world total.
I didn't care one way or the other if Google did it so after running at the wall of "under 2s" a couple of times I just said "fine, your loss."
One time, I wanted to process a lot of images stored on Amazon S3. So I used 3 Xeon quad-core nodes from my render farm together with a highly optimized C++ software. We peaked at 800mbit/s downstream before the entire S3 bucket went offline.
Similarly, when I was doing molecular dynamics simulations, the initial state was 20 GB large, and so were the results.
The issue with these computing workloads is usually IO, not the raw compute power. That's why Hadoop, for example, moves the calculations onto the storage nodes, if possible.
You make a good point about I/O and I actually wanted to comment something along the lines of "why not Hadoop?" since the programming model looks very similar but with less mature tooling.
However, now I think about it, the big win of serverless is that it is not always on. With Hadoop, you build and administer a cluster which will only be efficient if you constantly use it. This Serverless setup would suit jobs that only run occasionally.
Plus, most tasks that only run occasionally tend to be not urgent, so instead of parallelizing to 3000 concurrent executions, like the article suggests, you could just wait an hour instead.
Serverless is only useful if you have high load spikes that are rare but super urgent. In my opinion, that combination almost never happens.
But thanks for the offer! The natural market forces will drive cloud computing prices down the same way they've driven everything else down. But until then, roll-your-own can save loads.
Yeah, I was particularly curious because I was unable to find better public offers than AWS (with their homebrew 100Gbit/s MPI that drops infiniband's hardware-guaranteed-delivery to prevent statically-allocated-buffer issues in many-node setups, allowing them quite impressive scalability) or Azure (with their 200Gbit/s Infiniband clusters), at least for occasional batch-jobs.
I wouldn't ask if I could DIY for less than using AWS, but owning ram is expensive. And for development purposes it would be quite enticing to just co-locate storage with compute, and rent some space on those NVMe drives for the hours/days you're running e.g. individual iterations on a large dataset to do accurate profile-guided optimizations (by hand or by compiler). Iterations only take a few minutes each, but loading what's essentially a good fraction (minus scratch space, and some compression is typically possible) of the ram over network causes setup to take quite a long time (compared to a single iteration).
Still I would not be opposed to such client-server(less) architecture being used where I could have slower devices seamlessly integrating with my personal server for faster processing of compute heavy tasks.
It's not that this hasn't been done before (thin clients anyone? Even X server model is exactly like that), but a similar approach could make a comeback at some point.
For most people, uploading that 4gb file for cloud processing will take an hour. But re-encoding 2h of video with GPU acceleration only takes 15-20 minutes. So no matter how fast serverless is, it'll always need to wait for upload and download, which may be slower than all the computations combined.
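A back-of-the-envelope version of that argument, as a sketch. The uplink speed is not stated in the comment; it is inferred from "4 GB takes an hour", and the result size and downlink speed in the example are made-up placeholders.

```python
def transfer_seconds(size_gb: float, link_mbit_s: float) -> float:
    """Seconds to move size_gb (decimal gigabytes) over a link of link_mbit_s."""
    return size_gb * 8_000 / link_mbit_s


def remote_total_seconds(size_gb: float, uplink_mbit_s: float, compute_s: float,
                         result_gb: float, downlink_mbit_s: float) -> float:
    """Upload, remote compute, then download of the result."""
    return (transfer_seconds(size_gb, uplink_mbit_s)
            + compute_s
            + transfer_seconds(result_gb, downlink_mbit_s))


# Uplink implied by "4 GB takes an hour" (roughly 8.9 Mbit/s).
implied_uplink = 4 * 8_000 / 3_600
local_seconds = 20 * 60   # upper end of the 15-20 minute local re-encode

# Even with instantaneous remote compute, the transfer alone loses.
# The 1 GB result and 50 Mbit/s downlink are placeholders.
remote = remote_total_seconds(4, implied_uplink, 0.0, 1, 50)
print(f"local: {local_seconds / 60:.0f} min, remote: {remote / 60:.0f} min")
```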
As for X server, using it over the internet is a pain. It is optimized for a low latency connection, meaning the opposite of putting calculations in a cloud hundreds of ms of ping away.
Again, we are not there yet, but we are not that far off either.
My mention of X was to highlight how this is just old technology within new constraints (move things that do not need small latency from the thin client onto the fast server), but how it's applied is going to make it or break it.
But my lived reality is that I have to go to the upper floor in my parents' house if I want to have 2G reception. And they only live a 10 minute drive from the town hall.
E.g. suppose you have 100 TB of data files and you want to run some kind of keyword search over the data. If the data can be broken into 1000x100GB chunks then you can do some map-reduce-ish thing where each 100GB chunk is searched independently, then the search results from each of the 1000 chunks are aggregated. 1000x speedup! serverless!
However, if you want to execute this across some fleet of rented "serverless" servers, the key factors that will influence cost and running time are (1) where the 100 TB of data is right now, (2) how you are going to copy each 100 GB chunk of the data to each serverless server, and (3) how much time and money that copy will cost.
I.e. in examples like this where the time required to read the data and send the data over the network is much larger than the time required to compute the data once the data is in memory, it is going to be more efficient to move the code & the compute to where the data already is rather than moving the data and the code to some other physical compute device behind a bunch of abstraction layers and network pipes.
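The "map-reduce-ish thing" itself is simple; a minimal sketch is below. Small in-memory lists stand in for the 100 GB chunks and a thread pool stands in for the fleet of workers, so it deliberately ignores the data-movement cost that the comment says dominates. The function names are made up for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def search_chunk(chunk: list[str], keyword: str) -> list[str]:
    """Map step: scan one chunk independently of all the others."""
    return [line for line in chunk if keyword in line]


def search_all(chunks: list[list[str]], keyword: str, workers: int = 4) -> list[str]:
    """Run the map step on every chunk in parallel, then aggregate the hits."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = pool.map(lambda chunk: search_chunk(chunk, keyword), chunks)
        # Reduce step: concatenate per-chunk results, preserving chunk order.
        return [hit for partial in partials for hit in partial]


chunks = [["alpha beta", "gamma"], ["beta delta"], ["epsilon"]]
print(search_all(chunks, "beta"))
```

The compute part parallelizes trivially; the open question raised above is where each chunk has to travel from before `search_chunk` can even start.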
There's an even smaller subset which is one-shot data access.
> in examples like this where the time required to read the data and send the data over the network is much larger than the time required to compute the data once the data is in memory
The annoying thing about lambda and other functional alternatives is that data-access patterns tend to be repetitive in somewhat predictable ways & there is no way to take advantage of that fact easily.
However, if you don't have that & say you were reading from S3 for every pass, then lambda does look attractive because the container lifetime management is outsourced - but if you do have even temporal stickiness of data, then it helps to do your own container management & direct queries closer to previous access, rather than to entirely cold instances.
If there's a thing that hadoop missed out on building into itself, it was a distributed work queue with functions with slight side-effects (i.e memoization).
Is that not named spark ? :)
Redshift, Bigquery etc implement it this way and then have various schemes for computation on top. Redshift bundles individual compute with storage whereas other implementations scale the compute independent of the distributed storage.
But this has allowed very cheap scale for querying large datasets and in practice, I imagine you very rarely have to worry about implementing the data transport yourself beyond initial loading with tools like those available.
Edit: Also, in most clouds moving data within their networks is free, so really it's just taking time to move data, which indirectly influences cost in terms of run time.
Distributed over what? Putting all the data in one service and transferring it into another is still going to look an awful lot like a big centralized transfer.
If you're permanently storing chunks of data next to chunks of the compute that will run on your data, that sounds an awful lot like a server.
It's literally one of the basic building blocks for modern data warehouses, which have solved this problem. By distributing to different nodes over some key in your dataset when you ingest the data (or, in Google's case, using some intelligent chunking method), you wind up at query time with hopefully evenly distributed chunks already. Doing things this way minimizes the amount of data you need to actually move or copy at query time before the data warehouse essentially runs a map-reduce job (with some really cool query planning) over your query to get you your results.
As to your second point, of course it sounds like a server? Serverless just means I don't have to maintain the hardware resources, which is a nightmare at any real scale. AWS Athena and Spectrum are great examples of not having to scale hardware, though there are tradeoffs. The point I'm arguing against is that it's expensive (it's not), because most clouds allow free data transfer within their networks and because modern data techniques minimize the amount of data transfer that's needed, so run times are limited by how badly you've written a SQL query or chosen to initially distribute your data.
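As a toy illustration of distributing on a key at ingest time, here is a sketch in which plain lists stand in for nodes. It is not how any particular warehouse implements it; the CRC32 hash and the function names are assumptions chosen to keep the example self-contained.

```python
import zlib
from collections import defaultdict


def node_for(key: str, num_nodes: int) -> int:
    """Stable hash, so the same key always lands on the same node."""
    return zlib.crc32(key.encode("utf-8")) % num_nodes


def ingest(rows: list[tuple[str, int]], num_nodes: int) -> list[list[tuple[str, int]]]:
    """Place each row on a node chosen by its distribution key."""
    nodes: list[list[tuple[str, int]]] = [[] for _ in range(num_nodes)]
    for key, value in rows:
        nodes[node_for(key, num_nodes)].append((key, value))
    return nodes


def local_sum(partition: list[tuple[str, int]]) -> dict[str, int]:
    """Per-node aggregation; touches only data already stored on that node."""
    totals: dict[str, int] = defaultdict(int)
    for key, value in partition:
        totals[key] += value
    return dict(totals)


def query_sum_by_key(nodes: list[list[tuple[str, int]]]) -> dict[str, int]:
    """Only the small per-node results move at query time, never the rows."""
    merged: dict[str, int] = {}
    for partition in nodes:
        merged.update(local_sum(partition))  # a key never spans two nodes
    return merged


nodes = ingest([("a", 1), ("b", 2), ("a", 3), ("c", 4)], num_nodes=3)
print(query_sum_by_key(nodes))
```

Because rows sharing a key are co-located at ingest, a group-by on that key needs no shuffle at query time.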
Combine that with instance reservations for more regular workloads and modern Big Data can be pretty cheap.
On the off chance you have to move 100TB through a raw Spark job, if you have an idea of what you'll need the data for, you can take a page or two from systems that were built to solve these problems and organize your data in a way that lends itself to that use.
Storing your 100TB data as one contiguous block and then having to chunk + transfer it at query time in their rudimentary search implementation like the OP suggests is probably about the worst position you could have put yourself in, and would have been a naive thing for someone to do.
Emerging stateful serverless runtimes have been shown to support even big data applications whilst keeping the scalability and multi-tenancy benefits of FaaS. Combined with scalable fast stores, I believe we have here the stateful serverless platforms of tomorrow.
 https://github.com/lsds/faasm (can run on KNative, includes demos)
 https://github.com/hydro-project/anna (KVS)
 https://github.com/stanford-mast/pocket (Multi-tiered storage DRAM+Flash)
"Serverless" is basically equivalent to a supercomputer in that context, but then it goes on to exhibit latency characteristics that would be considered a non-starter for a supercomputer.
Latency is one of the most important aspects of IO and is the ultimate resource underlying all of this. The lower your latency, the faster you can get the work done. When you shard your work in a latency domain measured in milliseconds-to-seconds, you have to operate with far different semantics than when you are working in a domain where a direct method call can be expected to return within nanoseconds-to-microseconds. We are talking 6 orders of magnitude or more difference in latency between local execution and an AWS Lambda. It is literally more than a million times faster to run a method that lives in warm L1 than it is to politely ask Amazon's computer to run the same method over the internet.
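To put a number on "6 orders of magnitude", here is a tiny sketch. The two latencies are illustrative assumptions picked from inside the ranges mentioned above (nanoseconds for a warm local call, milliseconds for a remote invocation), not measurements.

```python
import math


def orders_of_magnitude(local_s: float, remote_s: float) -> float:
    """log10 of how many times slower the remote call is than the local one."""
    return math.log10(remote_s / local_s)


LOCAL_CALL_S = 10e-9    # assumed: 10 ns for a warm local method call
REMOTE_CALL_S = 10e-3   # assumed: 10 ms for a remote function invocation

gap = orders_of_magnitude(LOCAL_CALL_S, REMOTE_CALL_S)
local_calls_per_round_trip = REMOTE_CALL_S / LOCAL_CALL_S

print(f"about {gap:.1f} orders of magnitude")
print(f"about {local_calls_per_round_trip:,.0f} local calls fit in one remote round trip")
```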
This stuff really matters and I feel like no one is paying attention to it anymore. Your CPU can do an incredible amount of work if you stop abusing it and treating it like some worthless thing that is incapable of handling any sizeable work effort. Pay attention to the NUMA model and how cache works. Even high level languages can leverage these aspects if you focus on them. You can process tens of millions of client transactions per second on a single x86 thread if you are careful.
Furthermore, the various cloud vendors have done an exceptional job at making their vanilla compute facility seem like a piece of shit too. These days, a $200/m EC2 instance feels like a bag of sand compared to a very low-end Ryzen 3300G desktop I recently built for basic lab duty. I'm not quite sure how they accomplished this, but something about cloud instances has always felt off to me. I can see how others would develop a perception that simply hosting things on one big EC2 instance would mean their application runs like shit. I am unsurprised that everyone is reaching for other options now. On-prem might be the best option if you have already optimized your stack and are now struggling with the cloud vendors' various layers of hardware indirection. Simply going from EC2 to on-prem could buy you an order of magnitude or more in speedup just by virtue of having current gen bare metal 100% dedicated to the task at hand. Obviously, this brings with it other operational and capital costs which must be justified by the business.
I've cut the number of servers by large factors on several occasions when moving off cloud to managed servers in a data centre as well, because I've been able to configure the right mix of RAM, NVMe and CPU for a given problem instead of picking a closest match that often isn't very close.
It doesn't require a new kind of software engineer. It's just another software architecture to go alongside micro services, containerisation etc.
And it hasn't changed the world because (a) it's the ultimate form of vendor lock-in and (b) it makes even simple apps much more complex to reason about and manage.
I really dislike the local dev experience and deployment for serverless. But otherwise the model is pretty clear: a file is a function, data goes in, data comes out, just like any other server function. If one instance is busy, spin off a new one.
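For anyone who hasn't seen one, "a file is a function" looks roughly like this. The `(event, context)` signature is the one AWS Lambda uses for Python handlers. The `name` field and the response shape (the style used behind an HTTP gateway) are assumptions for illustration.

```python
import json

def handler(event, context):
    """Data goes in as `event`, data comes out as the return value."""
    name = event.get("name", "world")  # hypothetical input field
    return {
        "statusCode": 200,
        "body": json.dumps({"greeting": f"hello, {name}"}),
    }
```

Nothing in the function knows about sockets, ports or other instances; the platform decides when and where to call it.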
What’s hard to reason about?
If you have a /myroute handler defined in express or flask, or even form.php or form.cgi in Apache, you never had to write the code to make user requests trigger your handler anyway, even in the old days. That's the entire point of using a server instead of listening to a socket yourself.
With serverless the same thing still happens with someone else managing the path from a request to a handler and back.
In fact, if you ever used a cpanel host with PHP like in the good old days (110mb.com anyone?), you already used 'serverless'. You just uploaded .php files to a directory and your website just 'magically' worked.
Need aggregation of results? Communication among nodes? Computation subdivision that is not strictly predeterminable? Sorry, not embarrassingly parallel, won't be doable like this.
You may be able to extract some embarrassingly parallel part, like compilation of independent object files, but very often you still have a longish, complex and time-consuming serial step, like linking those object files. This kind of recognising different parts of a program is already state of the art, no need to invent a new field...
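The standard way to quantify that serial step is Amdahl's law (my framing, not something the article uses). A minimal sketch, where the 90% parallel fraction is an assumed example value:

```python
def amdahl_speedup(parallel_fraction, workers):
    """Overall speedup when only `parallel_fraction` of the work scales with workers."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)

if __name__ == "__main__":
    # ASSUMPTION: 90% of the build is independent compilation, 10% is linking.
    for workers in (10, 100, 1000):
        print(workers, round(amdahl_speedup(0.9, workers), 2))
```

With a 10% serial part, the speedup can never reach 10x no matter how many workers you rent.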
Traditional server applications can be rewritten “serverless” so long as your pipeline is pretty functional, ie you’re not saving critical state in memory on the server process itself.
And if you're storing things in a DB, how long does it take for your lambda to start up now? In a project I was on, it could easily take 3s for a new request to be served, because we had a lambda that was doing an auth check with the DB, and then a different lambda that was doing the actual application stuff with another DB. So not only would we incur the cold start cost twice, but, since our lambdas needed a NIC inside a VPC to talk to the DB, the cold start cost was huge. And of course, you pay that cost for each additional concurrent connection, not just once.
Of course, if we had stored everything in some other managed service, and maybe used some managed auth scheme, this would have not been a problem. AWS likes it when they get you hook, line and sinker.
Also, things stored in the global scope (in Node.js functions) are kept between different invocations of the same container. In case that helps...
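The same trick works at module level in Python. A minimal sketch, where `_connect` is a hypothetical stand-in for an expensive DB connection:

```python
_connection = None   # module-level: survives between warm invocations of one container
_connect_calls = 0   # counts how often the expensive setup actually ran

def _connect():
    """Hypothetical stand-in for opening a real database connection."""
    global _connect_calls
    _connect_calls += 1
    return {"id": _connect_calls}

def get_connection():
    """Create the connection on first use, then reuse it."""
    global _connection
    if _connection is None:
        _connection = _connect()
    return _connection

def handler(event, context):
    conn = get_connection()
    return {"connection_id": conn["id"]}
```

This only helps warm invocations. A brand new container starts with `_connection` empty and pays the full setup cost again.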
And yes, different invocations of the same container helped significantly (most importantly, the interface was already deployed in the private network), that is why I said that the problem was concurrent access. You could serve a pretty good number of users/second, but each user that happened to send a request while the already deployed containers were all busy would have to wait for up to 3s before we could give them any data from the backend. And of course, if two new users came in parallel, they would both have to wait, etc.
Sorry I can't be of any more help :/
But for the majority of apps it doesn't save that much. And it still pales in comparison to the cost of the engineers, who often spend significantly more time building, debugging and testing a serverless app than one they just throw on an EC2 server.
In general I ask myself if my new server can run on a lambda and if it can’t then I reach for something else. Most of the time it can.
What you are selling here is nothing other than a new proprietary scheduler/runtime to run embarrassingly parallel jobs (the easiest kind of parallelism).
There were already plenty of solutions to do that; the only difference here is that you run on AWS Lambda.
Why would you need an entirely new type of engineer to do that?
There is nothing new here except buzzwords.
Are engineers nowadays script kiddies bound to one technology for their entire career? (Tip: of course not)
If the author just wanted to fetch pages in parallel, they could have done better than 8 hours even on their own laptop (you can run more than one chromium process at a time). The real benefit they got from using AWS Lambda is that the requests weren't throttled or ghosted by Redfin, probably because the processes were running on enough different machines, with different IP addresses.
Depending how you look at it, I don't think most software is designed to take advantage of multiple cores, let alone multiple machines.
Has anyone benchmarked the speed of running (let's say, on AWS) a lambda function 1000x vs. running the same function on a regular AWS instance?
What about all the overhead (for example, k8s overhead, both in CPU and disk, etc)?
I'm afraid it would be very easy to get a repeat of this https://adamdrake.com/command-line-tools-can-be-235x-faster-...
So, does serverless computing reduce the job completion time? Yes, if the job is somewhat parallelizable. Does it save energy, money, etc.? Definitely no. The question is whether you want to make the tradeoff here: how much more energy can you afford to pay for if you want to cut the job completion time in half? It's like batch processing vs. realtime operation. The former provides higher throughput, while the latter gives users shorter latency. Having better cloud infrastructure (VM, scheduler, etc.) helps make this tradeoff more favorable, but the research community has only just started looking at this problem.
The author seems to think the paradigm is new (it isn’t) and claims that it hasn’t taken off massively (it has) because he incorrectly points to a number of workloads that aren’t embarrassingly parallel. On the other hand, in theory having a common runtime for these operations from a public cloud provider should enable them to keep their utilization of resources extremely high, such that it would be cheaper for us to use AWS/GCP/etc instead of rolling our own on OVH/Hetzner. But if anything, the per compute cost of FaaS is higher than it is for other compute models, which means the economics really only work for small workloads where the fixed overhead of EC2 is larger than the variable overhead of Lambda.
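The fixed-vs-variable point can be sketched as a break-even calculation. Every price below is a made-up placeholder, not a quote from any provider; plug in real numbers before drawing conclusions.

```python
def instance_monthly_cost(hourly_rate, hours=730):
    """Fixed cost: you pay for the instance whether or not it is busy."""
    return hourly_rate * hours

def faas_cost_per_invocation(seconds, gb_memory, price_per_gb_second,
                             price_per_million_requests):
    """Variable cost of a single function invocation."""
    return seconds * gb_memory * price_per_gb_second + price_per_million_requests / 1e6

def faas_monthly_cost(invocations, per_invocation):
    return invocations * per_invocation

def break_even_invocations(instance_cost, per_invocation):
    """Monthly invocation count above which the always-on instance is cheaper."""
    return instance_cost / per_invocation

if __name__ == "__main__":
    # ASSUMPTIONS: all four prices and the workload shape are hypothetical.
    vm = instance_monthly_cost(hourly_rate=0.10)
    each = faas_cost_per_invocation(seconds=0.2, gb_memory=0.5,
                                    price_per_gb_second=0.00002,
                                    price_per_million_requests=0.20)
    print(f"break-even at about {break_even_invocations(vm, each):,.0f} invocations/month")
```

Below the break-even volume the pay-per-call model wins; above it, the fixed-cost box does.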
Datasets that are tens of gigabytes, or maybe 100mil records or so...this really covers most things.
And for every 1 thing it doesn't, there's 20 more claimed that a single machine using simple tools could handle just fine.
Being able to detect when things have been processed, have a way to set dirty flags, prioritize things, have regions of interest, be able to have re-entrant processing, caching parts of the pipeline and having nuanced rules for invalidating it, these in my mind are kinda basic things here.
When they aren't done, sure, someone will need giant resources because they're doing foolish things. But that's literally the only reason. Substituting money for sense is an old hack.
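A minimal sketch of the dirty-flag and caching idea, assuming a content hash is an acceptable way to detect changes. The class and method names are made up for illustration.

```python
import hashlib

class IncrementalProcessor:
    """Re-runs `process` for a key only when that key's input bytes have changed."""

    def __init__(self, process):
        self._process = process
        self._cache = {}   # key -> (input digest, cached result)
        self.runs = 0      # how many times the expensive step actually ran

    def get(self, key, data: bytes):
        digest = hashlib.sha256(data).hexdigest()
        cached = self._cache.get(key)
        if cached is not None and cached[0] == digest:
            return cached[1]          # clean: reuse the earlier result
        self.runs += 1                # dirty or new: recompute and remember
        result = self._process(data)
        self._cache[key] = (digest, result)
        return result

if __name__ == "__main__":
    p = IncrementalProcessor(lambda data: len(data.split()))
    p.get("doc1", b"some text here")
    p.get("doc1", b"some text here")   # served from cache
    p.get("doc1", b"some new text here")
    print(p.runs)
```

A real pipeline would persist the cache and add priorities and finer invalidation rules, but even this much avoids reprocessing unchanged inputs.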
Doing a facial search on it?
Matching a rhythm picked up by the mic to your local music collection?
Hashing and/or encryption of data.
There's plenty of desktop-like use cases that would benefit from massively parallel computation, but network (or even IO) bandwidth is currently going to be the limiting factor.
IOW, we are not there yet!
Currently, we can parallelize tasks which are low on data and high on computation.
So how can we expand the IO bandwidth for everyone, even desktop or mobile users?
Still, as you note yourself with "if auth was easier", we'd need custom applications even for the cloud — it's just that you'd hope they provide unbounded bandwidth for each user, but I am not even sure that's the case for the biggest of players (dropbox, google drive...).
This thing is a 3 minute read
You'd also need built-in support in tooling and compilers, where you can compile specific functions or modules into something that can run separately without actually doing that manually.
If your goal is <0.1 sec startup -- yeah, then you'll need WASM.
If you are OK with 1-5 seconds of startup, you have a ton of options. Apache Spark uses JVM magic to send out the raw bytecode. You can start up a Docker container. If you are willing to rewrite stdio, you can exec machine code under seccomp/pledge.
There are even full-blown VM solutions -- Amazon Firecracker, which claims that: "Firecracker initiates user space or application code in as little as 125 ms and supports microVM creation rates of up to 150 microVMs per second per host."
It only tells you that the number of transistors on a silicon die will double every 18 months.
If we’re still able to add parallel threads of execution at the same rate then Moore’s Law still holds.
Well, no. Software compilation is not massively parallel. Maybe parts of the optimization pipeline are, but compiling a 1000-unit program (assuming your language of choice even has separate compilation) normally requires putting the units into a dependency graph (see OCaml, for instance), or puts most of the effort into inherently serial tasks like preprocessing and linking (C++).
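To illustrate how the dependency graph caps parallelism, here is a sketch that groups units into waves that could compile concurrently. The module names are made up; `graphlib` is in the Python standard library from 3.9.

```python
from graphlib import TopologicalSorter

def build_waves(deps):
    """Group units into waves; each wave only depends on units in earlier waves.

    `deps` maps a unit to the set of units it depends on.
    """
    sorter = TopologicalSorter(deps)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # everything buildable right now
        waves.append(ready)
        sorter.done(*ready)
    return waves

if __name__ == "__main__":
    # Hypothetical project: app needs parser and net, both need util.
    deps = {"app": {"parser", "net"}, "parser": {"util"}, "net": {"util"}, "util": set()}
    for i, wave in enumerate(build_waves(deps), start=1):
        print(i, wave)
```

However many workers you have, the build still takes at least as many sequential steps as there are waves.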
Probably my pedantic side showing through but I find reading text where ampersand is used in place of “and” really jarring (same for capitalised regular nouns). It seems somewhat common now so I guess I’ll have to get used to it.
Although & means “and” they are generally used differently. & is used in places like company names where it’s part of a noun (e.g. B&Q, Smith & Wesson). “And” joins parts of a sentence. I find it jarring to use & because a) it looks like a punctuation mark and I naturally pause when reading, b) I expect to have read a noun, not a join in a sentence, and it takes some cognitive effort to re-parse the sentence using & in a way I didn’t first expect. Reading, especially quickly, relies a lot on expectation and pattern matching and I find this disrupts it. If you don’t, good for you.
Obviously in informal speech people write whatever they want and it’s true that language evolves over time. But I’d argue that using & instead of and isn’t “correct”, at least by current standards — if it was we’d see this used in newspapers, books, and so on.
Scraping is an embarrassingly perfect scenario for coroutines. Most asynchronous frameworks even use scraping as one of the examples.
In short, it would probably be done in 15 minutes, assuming you don’t get throttled quickly. If the tool wasn’t already async capable, another 15 minutes to wrap some scraping in gevent/eventlet.
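Something like the following is what I have in mind, sketched with the standard library's asyncio rather than gevent/eventlet. `fetch` is a hypothetical stand-in that sleeps instead of touching the network, and the concurrency limit is an arbitrary assumed value.

```python
import asyncio

async def fetch(url):
    """Hypothetical stand-in for an HTTP GET; replace with a real async client."""
    await asyncio.sleep(0.01)  # simulated network wait
    return f"<html>{url}</html>"

async def scrape(urls, max_in_flight=100):
    """Fetch all URLs concurrently, with at most `max_in_flight` running at once."""
    limit = asyncio.Semaphore(max_in_flight)

    async def bounded(url):
        async with limit:
            return await fetch(url)

    # gather returns results in the same order as the input URLs
    return await asyncio.gather(*(bounded(u) for u in urls))

def run(urls, max_in_flight=100):
    return asyncio.run(scrape(urls, max_in_flight))

if __name__ == "__main__":
    pages = run([f"https://example.com/{i}" for i in range(1000)])
    print(len(pages))
```

Because the waits overlap, the total time is set by the concurrency limit and the remote site's throttling, not by the number of pages times the per-page latency.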
So is this a workaround for "censorship" by Google etc?
And where would the crawl archives come from?
Also, I wonder how this could be made usable and affordable for random individuals.