It is a cool project. S3 can be cost efficient, but only if you don't touch data :)
Their price calculation doesn't mention cost of S3 requests, which very quickly adds up and is often neglected.
It costs $1 for 2.5M GET requests to S3. They have 180 shards, and in the general case a query seems to fetch all of them. Presumably they don't download the full shard per request, but rather an index plus some relevant ranges. Let's say that is 10 requests per shard. That would be 1,800 S3 GET requests per query, so ~1,400 search queries cost them $1.
Assuming their service is reasonably popular and serves 1 req/second on average, that would be roughly $1,870 per 30 days on top of the advertised $1,000 spent on EC2 and S3 storage.
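For what it's worth, the arithmetic is easy to check (a back-of-the-envelope sketch; the 10-requests-per-shard figure is my assumption from above):

    GET_PRICE = 1.0 / 2_500_000        # $1 per 2.5M GET requests
    SHARDS = 180
    GETS_PER_SHARD = 10                # assumed: index + a few relevant ranges

    gets_per_query = SHARDS * GETS_PER_SHARD          # 1,800 GETs per query
    cost_per_query = gets_per_query * GET_PRICE       # $0.00072
    print(round(1 / cost_per_query))                  # ~1,389 queries per $1

    queries_per_month = 1 * 86_400 * 30               # 1 req/s for 30 days
    print(round(queries_per_month * cost_per_query))  # ~$1,866 per month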
Seems comparable to AWS ElasticSearch service costs:
- 3 nodes m5.2xlarge.elasticsearch = $1,200
- 20TB EBS storage = $1,638
Don't forget S3 includes replication.
Also EBS throughput (even with SSD) is not good at all.
Also, our memory footprint is tiny, which is necessary to make it run on two servers.
Finally, CPU-wise, our search engine is almost 2x faster than Lucene.
If you don't believe us, try to replicate our demo on an Elasticsearch cluster :D.
Chatnoir.eu is the only other Common Crawl cluster we know of. It consists of 120 nodes.
> If we get 1 req/s, even for a dataset of that size, this is not as cost efficient.
How many req/s do you have in mind for your system to be a viable option?
> Also EBS throughput (even with SSD) is not good at all.
It's still not worse than S3, right?
> Chatnoir.eu is the only other Common Crawl cluster we know of. It consists of 120 nodes.
I have no deep ES experience. Are you saying that to host 6TB of indexed data (before replication) you'd need a 120-node ES cluster? If so, then reducing it to just 2 nodes is the real sales pitch, not the S3 usage :)
What about d3en instances? Clustered, and together with MinIO, you might reach similar performance. The only issue is the inter-region traffic; it would need to be inside the same AZ.
I had the same feeling when reading the post. Their remark that they "estimated the cost" to be that low is in my experience a bad signal. Estimating costs on the cloud is really hard, there are so many (hidden) costs you may miss making it a lot more expensive.
For what it's worth, if you want to run ElasticSearch on AWS I would always go with local-NVMe instances from the i3 family, this is also what AWS and Elasticsearch themselves recommend.
4x i3en.2xlarge (64GB / 5TB NVMe) at $449 / month (1yr reserved) is $1796, or $2636 without reservation, but much better performance due to the NVMe drives.
It's easy to put a block cache in front of the index, and I'm sure they'll get to it sooner or later.
The benefit of using S3 in that case is that unlike e.g. Elastic, your block cache servers don't need replication, and you can tear them down when you're done. You can put them in a true autoscaling group as well.
A ridiculous blanket statement, despite the "almost never" cop-out...
It is cost-efficient in a wide array of scenarios. Many companies pay for it because they have calculated the different investment scenarios and AWS comes out on top of alternatives such as owning the hardware or using competing cloud vendors.
I own a consultancy that builds complex web apps, and while I appreciate how occasionally a dev has tried to save costs for me by cramming every piece of the stack (web server, cache, db, queue, etc.) into a single Docker image to host in a droplet, I'd much rather pay for separate services, as I consider it cheaper in the long run.
Yes, I charge $60-$90 per hour of dev time to my customers and the time saved from using simple Elastic Beanstalk deployments pays for itself in saved dev time. The architecture is also infinitely easier to reason about and scale than cramming parts in a single image.
First, we are getting better throughput from S3 than if we were using a SATA SSD (and slower than an NVMe SSD).
This is a bit of a secret.
Of course, single sequential-read throughput on S3 sucks. At the end of the day the data is stored on spinning disks, and we cannot do anything against the laws of physics.
... but we can concurrently read many disks using S3. The network is our only bottleneck.
The theoretical upper bound on our instances is 2GB/s.
On a throughput-intensive 1s query, we observe an average of 1GB/s.
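A minimal sketch of the trick (illustrative Python, not our actual Rust code; bucket, key, and tuning numbers are placeholders):

    import boto3
    from concurrent.futures import ThreadPoolExecutor

    s3 = boto3.client("s3")

    def fetch_range(bucket, key, start, end):
        # Each ranged GET lands (potentially) on a different disk behind S3.
        resp = s3.get_object(Bucket=bucket, Key=key, Range=f"bytes={start}-{end}")
        return resp["Body"].read()

    def fetch_parallel(bucket, key, size, part=8 << 20, workers=64):
        # Split the object into 8 MiB parts and download them concurrently;
        # aggregate throughput then becomes NIC-bound, not single-stream-bound.
        ranges = [(lo, min(lo + part, size) - 1) for lo in range(0, size, part)]
        with ThreadPoolExecutor(max_workers=workers) as pool:
            parts = pool.map(lambda r: fetch_range(bucket, key, *r), ranges)
        return b"".join(parts)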
Also, you are not accounting for replication. S3 costs include battle-tested, multi-DC replication.
Last but not least, S3 trivially decouples compute and storage.
It means that we can host 100 different indices on S3 and use the same pool of search servers to deal with the CPU-bound stuff.
This last bit is really what drives the price down an extra 5x for many use cases.
"S3 costs include battle tested, multi-DC replication."
Sometimes we pay a bit too much for this multi-DC replication, battle-tested stuff. It's not like the probability of losing data is THAT huge. For the 4x extra cost you could easily take a backup every 24h.
"It means that we can host 100 different indices on S3, and use the same pool of search server to deal with the CPU-bound stuff"
You can do that with NFS.
It's amazing how much we are willing to pay for a bunch of computers in the cloud. Leasing a new car costs around $350/month. You could have three new cars at your disposal for the same price as this search implementation.
> For the 4x extra cost you could easily take a backup every 24h.
It's also worth considering the cost to simply regenerate the data for something like this that isn't the source of truth. You'll lose any content that you indexed that has disappeared from the web, but that seems like a feature more than a bug.
> You can do that with NFS.
You're going to be bound by your NIC speed. You can bond them together, but the upper bounds on NFS performance are going to be significantly lower than on S3. Whether that's going to be an issue for them or not, I don't know, but a big part of the reason for separating compute and storage is so that one of them can scale massively without the other.
On AWS you can't get 128GB RAM on anything for less than $300/month (or nearly $500 on-demand). And to get multiple TB of SSD you need significantly larger instances, north of $1000/month.
Similar with DO; the closest equivalent is a 3.52TB SSD, 128GB RAM, 16 vCPU droplet for $1,240/month.
If you need raw power instead of integration into an extensive service ecosystem, dedicated servers are hard to beat (short of colocating your own hardware, which comes with more headache). And Hetzner is among the best in terms of value/money.
The sad story is you can't get anywhere close to this even with rented dedicated servers. As a German I'm happy that we have Hetzner and I use their services extensively. However if I wanted to start deploying things in the US or Asia I'd be forced to go with something like OVH which, while still a lot cheaper than AWS, is still significantly more expensive than Hetzner.
AWS is a scam not because it can’t save you money, but because they actively try to trick you into spending more money. That’s practically the definition of a scam.
Go to the AWS console and try to answer even simple things like: how much did the last hour/day/week cost me? Or how about some notifications if that new service you just added is going to cost vastly more than you were expecting?
I know of a few people getting fired after migrating to AWS and it’s not because the company was suddenly saving money.
I've never seen AWS actively try to trick people into spending more money. I've seen Premium Support, product service teams, solutions architects, and account managers all suggest not to use AWS services if it doesn't fit the customer usecase. I've personally recommended non-AWS options for customers who are trying to fit a square peg into a round hole.
Can the billing console be better? Yes. But AWS isn't trying to trick anyone into anything. The console, while it has its troubles, doesn't have dark patterns, and pricing is transparent. You pay for what you use, and prices have never increased.
Hell, I know of a specific service that was priced poorly (meaning it wasn't profitable for AWS). Instead of raising prices, AWS ate the cost while rewriting the entire service from scratch to give it better offerings and make it cheaper (both for AWS and customers).
I do not support the view that AWS is a scam, but price is something AWS tries to make developers not think about. Every blog post, piece of documentation, or quick start tells you about features, but never about costs.
You read "you can run Lambda in VPC", great, but there is a fine print somewhere on a remote page, that you'd also need NAT gateway if you want said Lambda to access internet, public network wont do.
You read "you can enable SSE on S3", but it is not immediately obvious, that every request then incurs KMS call and billed accordingly (that was before bucket key feture).
Want to enable Control Tower? It creates so many services that it is impossible to predict costs until you enable it and wait to be billed.
If pricing is intended to be transparent, then why is it completely absent from the user interface? Transparent pricing would tell me how much something costs when I order it, not make me use a different tool or dig for it in the documentation.
No, no, you're supposed to use their Cthulhu-inspired pricing tool. I mean, you've got at least a 50/50 chance of figuring out how to use it before you go permanently insane.
If you're so incompetent that you can't estimate your costs, a fixed price microserver built into a dilapidated wooden shanty with no obvious fire protection system is what you should be buying.
In order for a system to be effective at achieving a goal, its owners and operators don't have to sit around a table in a smoke-filled room and toast to evil. The goal, good, bad, or indifferent, merely has to be progressively incentivized by prevailing conditions.
If clarity causes customers to spend less, it is disincentivized, and since clarity is hard and requires active investment to maintain, it decays naturally.
It's easy to see how you can end up with a system that users experience as a dishonest attempt to get more of their money, and that operators, who are necessarily very familiar with the system, experience as merely messy but transparent.
Neither is precisely wrong. However, your users don't have your experience or training, and many are liable to interact with a computer, not you. Your system is then exactly as honest and transparent as your UI as perceived by your average user.
I haven't used AWS in a while, but one trick I recall is that enabling service X also enabled sub-dependencies. Instantly disabling service X didn't stop services Y and Z, which you continued to be billed for. Granted, not that expensive, but it still felt like a trap.
Other stuff was more debatable, but it just felt like dancing in a mine field.
Another example of a darkish pattern is listing ridiculously small prices ($0.0000166667 per GB-second, $0.0004 per 1000 GET requests). It's hard to reason about very small and very big numbers; an order-of-magnitude difference "feels" the same. Showing such small prices is accurate, but deceiving IMHO.
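For example, turning those micro-prices into human-scale monthly numbers takes a calculator (prices as quoted above):

    GB_SECOND = 0.0000166667      # quoted $ per GB-second
    MONTH = 30 * 24 * 3600        # 2,592,000 seconds
    print(GB_SECOND * MONTH)      # ~$43.20/month to keep 1 GB provisioned

    GET_PER_1000 = 0.0004         # quoted $ per 1,000 GET requests
    print(GET_PER_1000 * 1_000)   # $0.40 per million GETs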
AWS is pretty bad at telling you how much something you're not running will cost if you run it, but I've never had any issues knowing what something has cost me in the past.
>Go to the AWS console and try to answer even simple things like: how much did the last hour/day/week cost me?
Click user@account in top right, click My Billing Dashboard, spend this month is on that page in giant font, click Cost Explorer for more granular breakdown (day, service, etc.), click Bill Details for list breakdown of spend by month.
>Or how about some notifications if that new service you just added is going to cost vastly more than you were expecting?
Billing Dashboard and then Budgets.
edit: This assumes you have permissions to see billing details, by default non-root accounts do not which might be why you're confused.
> Click user@account in top right, click My Billing Dashboard, spend this month is on that page in giant font, click Cost Explorer for more granular breakdown (day, service, etc.), click Bill Details for list breakdown of spend by month.
Sure, you see a number, but I was just talking with someone at AWS who said you still can't trust it to be up to date, especially across zone boundaries. That means it's useful when everything is working as expected, but it can be actively misleading when troubleshooting.
Huge fan of Hetzner, but dedicated servers do not invalidate the value proposition of the cloud.
Ordering a server at Hetzner can take anywhere between a few minutes and a few days. Each server has a fixed setup cost of around one month's rent. They only have two datacenters in Europe. They don't have any auxiliary services (databases, queues, scalable object storage, etc.). They are unbeatable for certain use cases, but the cloud is still valuable for lots of other scenarios.
> Ordering a server at Hetzner can take anywhere between a few minutes and a few days
At the start of the pandemic, ordering bare metal servers anywhere was faster than getting a new EC2 VM up and running...
The cloud doesn't even fulfil the value proposition of the cloud. It's significantly more expensive, and when you actually need the flexibility, none of it is available.
Sorry, let's call it "regions" then, they have multiple DCs in different cities in Germany, but for latency purposes I would consider these part of one region.
Also because the 5950X is likely faster than a Zen 2 Epyc for many workloads that do not scale linearly across more cores (since Zen 3 has huge single-thread performance improvements).
I am really starting to feel that co-location will make a big comeback. It seems cloud costs are just becoming too high for the convenience they once offered. For small projects and small scale it probably makes a ton of sense, but at some point the costs to scale aren't worth the up-front developer cost savings.
Shared nothing is the best architecture for e-commerce search for instance.
But if you have one query every minute or so on a 1TB dataset, it feels a bit silly to have a couple of servers dedicated to it, doesn't it?
Imagine this is the case for all the big-data search you can think of: logs, emails, etc. This is a waste of CPU and RAM.
Bare metal hosting is a happy medium between co-lo and cloud. You don't have much control over the network, so it might not be enough if you need faster NICs than they offer, but if you fit in their offerings, it can work well.
Otoh, the bare metal hoster I worked with is now owned by IBM, and a big competitor is owned by private equity; bare metal from cloud providers still has a lot of cloudiness associated too. Maybe colo is the way to go.
I hadn't looked at their offerings before, but they seem at least superficially good. A decent collection of datacenters, network with link aggregation is important to me.
I'd be curious to see any reports about their fire response in their recent incident as well as any changes they made or didn't make to other sites.
Always calculate what it would cost you as a company to actually hire a good sysadmin, keep that person happy, and spend the money to operate an average-quality setup vs. fire-and-forget.
A few grand of infrastructure cost is nothing for any modern company, and an outage costs you much more.
You can also bet that a proper backup is either missing or was paid for expensively enough.
Where they get you is that it very rarely makes financial sense to do both cloud and colo/on-prem (unless you're a massive company). It ends up being way more expensive to use the cloud, but also hire engineers to work on making an on-prem cloud. Most companies have a mixed bag of projects that are either better served by the cloud, or are okay with colo and the savings it can bring.
Assuming you don't want to do a hybrid approach, you either push everyone onto the cloud and accept paying more, or you push everyone into colo and force the small and scaling-out projects to deal with stuff like having to order hardware 3 months in advance.
Then, depending on how nice you want it to be to interact with your infrastructure, you can end up paying a lot to have people build abstractions over it. Do you want developers to be able to create their own database from a merge request or API call? If so, now you're going to have to hire someone with a 6-figure salary to figure out how to do that. It's easy to forget how many things are involved in that.

You're going to have a lot of databases, so you need a system to track them. A lot of these databases are presumably not big enough to warrant a full physical server, so you have to sort out multi-tenancy. If you have multi-tenancy, you need a way to handle RBAC so one user can't bork all the databases on the host. You will also need some way to handle what happens when one user is throwing so much load at the RDBMS that it's impacting other apps on that database. To accomplish that, you're going to need a way to gather metrics sharded per database and a way to monitor them (which is admittedly one of the easier bits).

You also generally just straight up lose a lot of the scaling features. I don't have a way to just give you more IOPS for your database on-prem. The best I can do is add more disks, but your database will be down for a long time if I have to put a disk in, expand the RAID, let it redistribute data, and then power it back up. That's several hours of downtime for you, along with anyone who's on the same database. Of course, we can do replicas and swap the master, but then everyone has to reconfigure their apps, or we need something like Consul to handle that (which means more engineers to manage that stuff).
You're also probably going to need more than one of those expensive infra people, because they presumably need an on-call rotation, and no one is going to agree to be on-call all the time. And every time someone quits, you have to train the new person, which is several months of salary basically wasted.
That's not to say that you don't need infra people on AWS, but you a) need a lot fewer of them, because they only need to manage the systems AWS provides, not build them, and b) can hire cheaper ops people, again because you don't need people capable of building those kinds of systems.
Once you factor in all of that stuff, AWS' prices start looking more reasonable. They're still a little higher, but they're not double the price. If anything more than a tiny, tiny subset of the AWS features are appealing, it's going to cost you almost as much to build your own as it does to just pay Amazon/Google/Microsoft/whoever.
Also, a massive thing people overlook is that AWS is fairly well documented. I can Google exactly how to set up permissions on an S3 bucket, or how to use an S3 bucket as a website. It only takes seconds, the cognitive burden is low, and the low friction doesn't cause anyone stress. In-house systems tend to be poorly documented, and doing anything slightly outside the norm becomes a "set up a meeting with the infra team" kind of thing. It takes forever, but more importantly, it takes a lot of thought and it's frustrating.
You save on specialized engineers (Database, RabbitMQ, Ceph administrators), but you lose elsewhere.
What used to be an Apache server with static files is now an S3 bucket, but it won't be easy: you wanted your own domain, so now you need CloudFront for SSL support. Their tutorial conveniently mentions it only at step 7 ("Test your website endpoint").
You buy into Cognito, great, you saved money on a Keycloak administrator, but at the worst moment, deep into the project, you learn that there is absolutely no way to support multiple regions, even if you are willing to do some legwork for AWS. Or you find that Cognito's email reset flow can't go through your existing customer contact system and must go through SES only, and suddenly you're developing an elaborate log/event-processing tool just so that your customer service agents can see password reset events in their interface.
GCP Cloud SQL, managed RDBMS, great! No upgrade path for you other than SQL dump/restore of your 10TB instance; have fun.
Cloud might still be a net win, but it is very much not as rosy as cloud evangelists want us to think.
I realized how great AWS and co. were when I took over as tech lead at a small startup.
That well-defined setup with load balancer, multiple VMs, snapshotting as backup, VPN/IPsec, and VLANs was something I couldn't have built and maintained by myself, while doing other things, 15 years ago without cloud.
The worst part is that a lot of "specialized" people are actually not experts at all. They are just there as support persons.
Of 3 DB admins, one was really good and the other 2 sucked (at another company again). The one you never saw would come over, start things in an Oracle shell basically blind, and do complex things; the other two, you had to tell what they forgot.
As an admin your job is seldom to optimize queries etc., but more to add and remove users, configure backups, restore backups, upgrade, etc.
At least for me it opened up what I can do as a single expert, faster and with better quality.
I would highly recommend that normal companies who think they are not an IT company (everyone is one today) run in the cloud only.
> Also, a massive thing people overlook is that AWS is fairly well documented. I can Google exactly how to set up permissions on an S3 bucket, or how to use an S3 bucket as a website.
> In-house systems tend to be poorly documented, and doing anything slightly outside the norm becomes a "set up a meeting with the infra team" kind of thing.
I usually wasn't really happy with AWS' documentation. But now, considering the alternative, I find it quite lovely. Thank you for making me realize that.
Interesting! We've built similar support for decoupling compute from storage into Elasticsearch and, as coincidence would have it, just shared some performance numbers today:
It works just as any regular Elasticsearch index (with full Kibana support etc.).
The data being indexed by Lucene allows queries to access index structures and return results orders of magnitude faster than doing a full table scan.
It is complemented with various caching layers to make repeat queries fast.
We expect this new functionality to be used for less frequently queried data (e.g. operational or security investigations, legal discoveries, or historical performance comparisons on older data), trading query speed for cost.
It supports Google Cloud Storage, Azure Blob Storage, Amazon S3 (+ S3 compatible stores), HDFS, and shared file systems.
But are you solving the right problem? This sounds like someone has produced a very good and efficient version of AltaVista. Back in the 1990s, if you wanted to do classic keyword searches of the web, and find all pages that had terms A and B but not C, it would give them to you, in a big unsorted pile. The web was still small enough that this was sometimes useful, but until Google came along with tricks to rank pages that are obvious in retrospect, it just wasn't useful for common search terms.
This is super interesting. I've recently been working on a similar concept: we have a reasonable amount of data (in the terabytes) that's fairly static, which I need to search fairly infrequently (but sometimes in bulk). A solution we came up with was a small, hot, in-memory index that points to the location of the data in a file on S3. Random access to a file on S3 is pretty fast, and running in an EC2 instance means latency to S3 is almost nil. Cheap, fast, and effective.
We're using some custom Python code to build a Marisa trie as our index. I was wondering if there are alternatives to this setup?
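In case it helps others, the setup boils down to something like this (a simplified sketch; the names and the record layout are illustrative):

    import boto3
    import marisa_trie

    # Build once: key -> fixed-size (offset, length) record.
    entries = [("doc-1", (0, 4096)), ("doc-2", (4096, 1024))]
    trie = marisa_trie.RecordTrie("<QI", entries)  # uint64 offset, uint32 length

    s3 = boto3.client("s3")

    def lookup(key, bucket="my-bucket", obj="data.bin"):
        # One in-memory trie hit, then one ranged GET against the big file on S3.
        offset, length = trie[key][0]
        resp = s3.get_object(Bucket=bucket, Key=obj,
                             Range=f"bytes={offset}-{offset + length - 1}")
        return resp["Body"].read()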
You could look at AWS Athena, especially if you only query infrequently and can wait a minute on the search results. There are some data layout patterns in your S3 bucket that you can use to optimize the search. Then you have true pay-per-use querying and don't even have to run any EC2 nodes or code yourself.
> that I need to search fairly infrequently (but sometimes in bulk).
What do you mean by search? Full-text search? Do you need to run custom code on the original data?
> A solution we came up with was a small, hot, in-memory index that points to the location of the data in a file on S3.
Yes, it's like keeping the block index of an sstable (in RocksDB) in memory. The next step is to have a local cache on the EC2 node. And the one after that is a "distributed" cache across your EC2 nodes, so you don't query S3 for a chunk if it's present on any of your other nodes.
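That first step (a local cache on the node) can be as simple as caching fixed-size chunks; a sketch, with chunk and cache sizes picked arbitrarily:

    import boto3
    from functools import lru_cache

    s3 = boto3.client("s3")
    CHUNK = 1 << 20  # cache at 1 MiB granularity

    @lru_cache(maxsize=512)  # keep ~512 MiB of hot chunks per node
    def get_chunk(bucket, key, chunk_no):
        lo = chunk_no * CHUNK
        resp = s3.get_object(Bucket=bucket, Key=key,
                             Range=f"bytes={lo}-{lo + CHUNK - 1}")
        return resp["Body"].read()

    def read(bucket, key, offset, length):
        # Assemble an arbitrary byte range from cached chunks. The
        # "distributed" step would hash (key, chunk_no) to peer nodes first.
        out = bytearray()
        for n in range(offset // CHUNK, (offset + length - 1) // CHUNK + 1):
            out += get_chunk(bucket, key, n)
        skip = offset % CHUNK
        return bytes(out[skip:skip + length])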
Come to think of it, I searched and didn't find a "distributed disk cache with optional replication" that can be used in front of S3 or whatever dataset. You can use nginx/varnish as a reverse-proxy but it doesn't have "distributed". There is Alluxio, but it's single-master.
> Come to think of it, I searched and didn't find a "distributed disk cache with optional replication" that can be used in front of S3 or whatever dataset. You can use nginx/varnish as a reverse-proxy but it doesn't have "distributed". There is Alluxio, but it's single-master.
If you think more about this, it would be like a distributed key-value store supporting both disk and memory access. You could write one using some open-source Raft libraries; a possible candidate is TiKV from PingCAP.
> If you think more about this, it would be like a distributed key-value store supporting both disk and memory access. You could write one using some open-source Raft libraries; a possible candidate is TiKV from PingCAP.
Search maybe is too strong a word - "lookup" is probably more correct. I have a couple of identifiers for each document, from which I want to retrieve the full doc.
I'm not sure what you mean by running custom code on the data. I usually do some kind of transformation afterwards.
I didn't find anything either, which is why I was wondering if I was searching for the wrong thing.
How big is each document? If documents are big, keep each of them as a separate file and store the ids in a database. If documents are small, then you want something like https://github.com/rockset/rocksdb-cloud as a building block.
Combining data at rest with a slim index structure and a common access method (like HTTP) was the idea behind a key-value store for JSON I once wrote: https://github.com/miku/microblob
I first thought of building a custom index structure, but found that I did not need everything in memory all the time. Using an embedded LevelDB works just fine.
You might want to check out Snowflake for something like this, it makes searching pretty easy, especially as it seems your data is semi-static? We use it pretty extensively at work and it's great.
For your use case it'll be very cheap if you don't access it constantly (you can probably get away with the extra-small instances, which are billed per minute).
This is the kind of thing I value in Rails. Active storage [1] has been around for a few years and it solves all of this. All the metadata you care about is in the database - content type, file size, image dimensions, creation date, storage path.
The following is a stupid question, so bear with me.
I have been using search engines for about... 26 years. I have attempted to make really crappy databases and search engines. I have worked for companies that use search products for internal services and customer products. I'm not a search engineer but I have a decent understanding of them and their issues, I think. And I get why people want full-text search. But is it actually a good idea? Should anyone really be using full text search?
I actually work on search products right now. We use Solr as the general full text index. We have separate indexes and algorithms to make context and semantic inferences, and prioritize results based on those, falling back to full text if we don't get anything. The full text sucks. The corpus of relationships of related concepts is what makes the whole thing useful.
Are we (all) only using full-text because some users are demanding that it be there? Or shouldn't we all stop this charade of thinking that full-text search of billions of items of data will ever be useful to a human being? Even when I show my coworkers that I can get something done 10x faster with a curated index of content, they still want a search engine that they know doesn't give them the results they want.
Is full-text search the junk food of information retrieval?
Curious to know if anyone can explain why online storage is so expensive. Most places want $100+/mo for 1TB of storage, while a 1TB drive only costs $50. I understand there are management costs, cooling, electricity, physical space, etc. But those would be per-drive costs, not per-TB, and certainly wouldn't add up to $100/mo.
Meanwhile, there's services like Google Drive, etc which costs about $100/yr per TB. Still exorbitant, but not as much so. 3rd party software can mount it as a drive, but only for a short period of time before the token expires. So they seem happy to sell you space at less than 1/10th the cost, as long as it's harder to use.
There just seems to be a lot of cash on the table for someone to offer much cheaper storage solutions, but no one is actually doing it.
Well, there are also data integrity issues to take into account... in the simplest terms, you'd actually need 3x the available storage to "guarantee" that there won't be any bit or data loss if a single drive (out of three) dies... Add a potential (mandatory?) off-site backup, means of restoring it, and... not even getting into the bandwidth-cost can of worms.
Again, this is in very simple (and low-scale) terms.
It seems very presumptuous to assume there is no market for people who don't need all of those things. I can back up and restore my own data. It might take a while, but that should be my choice. Nor does it explain why web services like Google Drive are 1/10th the cost per TB of the same volume size on a cloud server.
The big clouds have premium storage and absolutely gouge on bandwidth.
For a more basic offering, B2 is $60 a year and there are other services in the same ballpark. They definitely want it to be convenient for you, and make a modest profit percentage.
I haven't experienced any issues with google drive tokens, though I've mostly been using the business tier.
Don't forget the replication. S3 replicates across at least three AZs in the same region. I know this still doesn't account for the difference, but it's worth remembering.
I can see extra features costing more, but a lot of people would be happy with nothing but a network connected drive. If it goes bad, it's bad. Hopefully I have a backup somewhere.
Stateless search engine is something new, for sure.
I'd be super interested to see how it evolves over time. We're [1] indexing over 1,000,000 news articles per day. We're using ElasticSearch to index our data.
Would be interested to see if there's a way to make a cross-demo? Let me know.
This looks really interesting, I wonder how they will monetize it though.
As an aside, projects like these are what keep me wondering whether I should switch from cheaper but "dumb" object stores to AWS since on AWS you can use your object store together with things like Athena etc. and get pay-per-use search / grep and a lot of other things, without the egress fees since it's all within AWS.
We really need to make this clear in our next blog post: this is not grep. We are using the same data structures that are used in Elasticsearch or Google.
We just adapted them to be object storage friendly.
I would not call object storage dumb by any means. It is a very powerful bottom-up abstraction.
We do manage to get SSD-like throughput from them.
The latency is the big issue. We had to redesign our search to reduce the number of random reads in the critical path to the bare minimum.
Appreciate the response. I wasn't trying to say this is grep, I fully understand that this is an inverted index which is way more interesting to build on top of S3.
I merely wanted to say that by using S3 within AWS you always have the fallback option of brute-force "grep" across your semi-structured "data lake" or whatever it's called thanks to the aggregate bandwidth and Athena.
Ah my bad! Yes, Humio (and Loki) are opting for this approach.
This does decouple compute and storage in a trivial manner. There is indeed a realm in which this brute force approach is the best approach.
We could probably make a 4D chart with QPS, data size, latency, and retention period and define regions where the elastic/SOLR approach, Humio, and quickwit are the most relevant.
We have a storage abstraction that boils down to being able to perform Range queries.
Anything that allows us to do range queries is ok.
That includes basically all object storage I know of (Google Cloud Storage, Azure Storage, Minio, you name it), but also HDFS, or even a remote HTTP2 server.
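Conceptually the interface is tiny; a Python sketch for illustration (the real abstraction is in our Rust code):

    from abc import ABC, abstractmethod

    class RangeStorage(ABC):
        """Anything that can serve byte ranges can back the index."""

        @abstractmethod
        def read_range(self, path: str, start: int, end: int) -> bytes:
            """Return bytes [start, end] of the object at `path`."""

    # Implementations map to S3/GCS/Azure ranged GETs, HDFS positional
    # reads, an HTTP/2 server honoring Range headers, or a local file.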
Nice! Maybe at one point you can release a general web search engine for the Common Crawl corpus? It seems even simpler than this proof of concept, but potentially more useful for people looking for a true full text web search.
There isn't an easy way today to explore or search what is contained in the Common Crawl index.
> There isn't an easy way today to explore or search what is contained in the Common Crawl index.
By that you mean searching the full text contents of their crawl, right?
The index is super easy to search nowadays -- in pretty much any language you can slap a few lines of code around a GET request (using range requests [0] if needed), and explore a columnar representation of the index [1].
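For example (a sketch; the crawl ID, field names, and host are from memory and may need updating):

    import gzip, json, requests

    # Ask the CDX index where a page's capture lives.
    resp = requests.get(
        "https://index.commoncrawl.org/CC-MAIN-2021-10-index",
        params={"url": "example.com", "output": "json"},
    )
    record = json.loads(resp.text.splitlines()[0])

    # Range-read just that WARC record out of a multi-GB archive file.
    start = int(record["offset"])
    end = start + int(record["length"]) - 1
    warc = requests.get(
        "https://commoncrawl.s3.amazonaws.com/" + record["filename"],
        headers={"Range": f"bytes={start}-{end}"},
    )
    print(gzip.decompress(warc.content)[:400])  # WARC headers + content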
Cool demo. Searching for phrases like "there was a" and "and there is" takes a really long time. I presume that since the words are common, the lists of document IDs mapped to those individual tokens are long as well, so intersections etc. take longer?
> which is key as each instance issues a lot of parallel requests to Amazon S3 and tends to be bound by the network
I wonder if most of the cost comes from S3, EC2 or the "premium" bandwidth that Amazon charges ridiculously much for. Since it seems to be doing a lot of requests, it wouldn't surprise me if it's the network cost, and if so, I wonder why they would even use AWS at all then.
Could this be adapted for IPFS? Anyone with a stateless client and a link to the index could search and become part of the swarm, speeding up trendy queries with redundancy.
Then update it with git-like diff versioning, and use IPNS to point to the HEAD of the latest chain of the index.
What does your on-S3 storage format look like? Are you storing relatively large blobs and doing HTTP Range requests against them or are you storing lots of tiny objects and fetching the whole object any time you need it?
What we store on S3 is a regular tantivy index and another tiny data structure that we call "turbo index", which makes queries faster on object storages. For this demo, the tantivy indexes are fairly large and we issue HTTP Range requests against them.
Is this reliant on S3, or can it be used on something like MinIO, DigitalOcean Spaces, or Backblaze B2 too? Backblaze-to-Cloudflare data transfer is free, so that can reduce costs a lot, plus B2 is much cheaper than S3.
How are you dealing with the fact common crawl updates their data much less regularly than commercial search engines? And that each update is only a partial refresh?
Edit: And I will say your site design is very nice.
Thank you! We did not plan to regularly update the index.
But as it takes only 24 hours to index 1B pages, the easiest way would be to reindex everything, upload it to S3, and update the metadata so the search engine queries the right segments.
Ah, I understand: you're showcasing the methodology for the underlying index, but you're going to open source the engine. I see, great stuff then, super novel, and honestly the rest of the open source search engines can definitely use some competition. Love it!
It is a web search engine. As explained in the blog post, we made the demo by generating 18k snippets and pushing them through an NLP pipeline that tries to extract adjective phrases.
A normal search experience (displaying a 20-hit search page) requires
num_segments * (1 + num_terms * 2) + 20 GET requests.
We have 180 segments for our commoncrawl index.
So we can consider a generous upper bound of 1000 requests.
GET requests add $0.0004 per Common Crawl search request.
Storage costs us $5 per day, so GET request costs start topping storage costs at >10k requests per day.
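Spelled out (same numbers as above):

    def gets_per_query(num_segments, num_terms, hits=20):
        return num_segments * (1 + num_terms * 2) + hits

    print(gets_per_query(180, 2))              # 920 GETs for a two-term query
    cost_per_query = 1_000 * 0.40 / 1_000_000  # generous 1,000-GET bound: $0.0004
    print(5 / cost_per_query)                  # ~12,500 queries/day matches $5/day storage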
Our search engine is meant for searching large datasets, with a low number of queries: Logs, SIEM, e-discovery, exotic big data datasets, etc.
These use cases typically have a low daily query rate.
For a high request rate (1 query per second), like e-commerce, entirely decoupling storage and compute is actually a bad idea.
For a low request rate (<1,000 per day), using S3 without caring about the GET request cost is perfectly fine.
And in the middle, you would probably want to use another object store with a more favorable pricing model.
Searching the web is a fool's errand. Google doesn't even search the web anymore, they just mind-controlled everyone to submit nightly sitemaps to them. Google is more of an index than a search engine nowadays.
We store the URI of each shard making up the index and, optionally, partition key and value(s). Along with a few flags, we also store the shard size, creation and last modification time. This additional metadata is not required for the query planning phase and is only useful for managing the life cycle of the shards and debugging/troubleshooting.
The high response time is due to the fact that we generate 18k snippets to build the tag cloud. Imagine it's the equivalent of clicking through pages 1 to 900 on Google!
A "barack obama" phrase query generating 20 snippets runs in less than 2seconds on our 2 cheap servers.
I'll set up a "normal 20 results search setting" next week and share it an API to show the latency again.
The poster was referring to the latency of the demo and is absolutely correct. The demo can reach 30s on some queries. Half of it is due to fetching documents and generating the 18k snippets, and half is single-threaded Python code that has nothing to do with our product :).
Article title is "Searching the web for < $1000 / month".
Despite mentioning Rust once, of course it had to be added to the title on HN as "Search 1B pages on AWS S3 for 1000$ / month, made in Rust and tantivy".