For all the talk of needing cloud infra to run even a simple website, Marginalia hits the front page of HN and we can't even bring a single PC sitting in some guy's living room to its knees.
More than the cores and RAM, you have bigger issues with I/O (both throughput and latency) to disk and the network from cloud providers. Physical hardware, even when comparing cores/RAM 1:1, is outrageously faster than cloud services.
Problem is that those volumes are ephemeral and may not provide the reliability guarantees that the EBS volumes do, so they're only really good as cache and not for any persistent data you care about.
I would argue that a rack-mounted chassis with lots of disks is also ephemeral, just less so: most failures in a server can be fixed by swapping some parts, and your data is still there.
But AWS and its competitors don’t have an offering even close to comparable to what you can get in a commodity server. A 1U server with one or two CPU sockets and 8-12 hot-swap NVMe bays is easy to buy and not terribly expensive, and you can easily fill it with 100+ TiB of storage with several hundred Gbps of storage bandwidth and more IOPS than you are likely able to use. EC2 has no comparable offering at any price.
(A Micron 9400 drive supposedly has 7 GB/s (56 Gbps) of usable bandwidth. Ten of them give 560 Gbps, and a modern machine with lots of PCIe 5.0 lanes may actually be able to use a lot of this. As far as I can tell, you literally cannot pay AWS for anywhere near this much bandwidth, but you can buy it for $20k or so.)
> I would argue that a rack-mounted chassis with lots of disks is also ephemeral, just less so: most failures in a server can be fixed by swapping some parts, and your data is still there.
True but this also depends on design decisions AWS made with regards to those volumes.
Indeed it could be that the volume is internally (at the hypervisor level) redundant (maybe with something like ZFS or other proprietary RAID), but there's no way to know.
Furthermore, AWS doesn't allow you to really keep a tab or reservation on the physical machine your VM is on - every time a VM is powered up, it gets assigned a random host machine. If there is a hardware failure they advise you to reboot the instance so it gets rescheduled on another machine, so even though technically your data may still be on that physical host machine, you have no way to get it back.
AWS' intent with these seems to be to act as transient cache/scratchpad so they don't seem to offer much durability or recovery strategies for those volumes. Their hypervisor seems to treat them as disposable which is a fair design decision considering the planned use-case, but it means you can't/shouldn't use it for any persistent data.
Being in control of your own hardware (or at the very least, renting physical hardware from a provider as opposed to a VM like in AWS) will indeed allow you to get reliable direct-attach storage.
I can buy a rather nicer 1U machine with substantially better local storage for something like half the 1-year reserved annual cost of this thing.
If you buy your own servers, you can mix and match CPUs and storage, and you can get a lot of NVMe storage capacity and bandwidth, and cloud providers don’t seem to have comparable products.
Something like €200/mo if you factor in the need for disk space as well. This is also Hetzner we're talking about. They're sort of infamous for horror stories of arbitrarily removed servers and having shitty support. They're the cheapest for a reason.
But with dedicated servers, are we really talking cloud?
I only rent a small server from them, but I've been happy with their support. Talked to a real human who could help me with tech questions, even though I pay them next to nothing.
Most of the problems I read about are during the initial signup stage. They ask for a copy of your passport etc., and even then some people can't sign up because, presumably, their info is triggering something in Hetzner's anti-fraud checks. This sucks for those people of course.
The other common cause of issues is things like crypto which they don't want in their network at all.
This will sound like I am downplaying what people have experienced and/or being apologetic on their behalf, but that is not my intention. I am just a small-time customer of theirs. I've had 1 or 2 dedicated servers with them for many, many years now, upgrading and migrating as necessary. (It used to be that if you waited for a year or two and upgraded, you'd get a better server for cheaper. Those days are gone.)
I've only dealt with support over email, where they have been both capable and helpful, but what I needed was just plugging in a hardware KVM switch (free for a few hours; I never had to pay) or replacing a failing hard drive (they do this with zero friction). Perhaps I am lenient on the tech support staff. After all, they are my people. I've been to a few datacenters and have huge respect for what they do.
On the presales side they seem to reply with a matter-of-fact tone and no flexibility. They are a German company, after all.
I'm a bit wary I'd get lumped in with the crypto gang. A lot of what I'm doing with the search engine is fairly out there in terms of pushing the hardware in unusual ways.
It would also suck if there ever was a problem. The full state of the search engine is about 1 TB of data. It's not easy to just start up somewhere else if it vanished.
In Azure that's roughly 5k per year if you pay for the whole year upfront.
I have the pleasure of playing with 64 cores, 256 GB RAM and 2x V100 for data science projects every now and then. That turns out to be roughly 32k per year.
I share your perspective on pricing. I had a discussion with my team lead about why we haven't taken on the task of running our own machines. The rationale behind it is that while server management may seem easy, ensuring its security can be complex. Even in a sizable company, it's worth considering whether you want to shoulder the responsibility or simply pay a higher cost to have Microsoft handle it. Personally, I love hosting my own infrastructure. It's fun, potentially saves me some cash, allows me to learn a lot, and gives me full control. However, I understand now that on a company scale, others may see it differently.
--edit--
I forgot to add the following: that's 32k if you run the system 24/7. Usually it's up for a few hours per month, so you end up paying maybe 2k for the whole year.
I'm curious about your network bandwidth/load. You only serve text, right? [Edit: No, I see thumbnail images too!] Is the box in a datacenter? If not, what kind of Internet connection does it have?
Average load today has at worst been about 300 Kb/s TX, 200 Kb/s RX. I've got a 1000/100 Mbit/s down/up connection. Seems to be holding up without much trouble.
Most pages with images do lazy loading so I'm not hit with 30 images all at once. They're also webp and cached via cloudflare, softens the blow quite a lot.
IMO it's actually incredibly well-documented and thoughtfully organized for a one-person project! You should be proud of what you've put together here!
It's a Debian server running nginx in front of a bunch of custom Java services that use the Spark microframework[1]. I use a MariaDB server for link data, and I've built a bespoke index in Java.
[1] https://sparkjava.com/ I don't use Spring Boot or anything like that; besides Spark I'm not using frameworks.
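For anyone unfamiliar with Spark: it's a tiny routing DSL rather than a full framework. A minimal, hypothetical sketch of what a route handler looks like (the endpoint, port, and wiring here are made up for illustration, not Marginalia's actual code):

    import static spark.Spark.*;

    public class SearchService {
        public static void main(String[] args) {
            port(8080); // hypothetical port; nginx would proxy to this
            get("/search", (req, res) -> {
                String query = req.queryParams("query");
                // here you'd consult the bespoke index and the MariaDB link data
                return "results for: " + query;
            });
        }
    }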
Really? Even with mmap'ed memory, won't the CPU still register user code waiting on reading pages from disk as iowait? I'm so surprised by that; if it doesn't, it sounds like a bug.
Yeah, it's at least what I've been seeing. Although it could alternatively be that a lot of the I/O activity is predictive reads, and the threads don't actually stall on page faults all that often.
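To make the mmap question concrete: reading index data through a memory-mapped file in Java looks roughly like this (a generic sketch, not the engine's actual code). The "read" is just a memory access, so any disk wait happens inside a page fault rather than an explicit read() syscall, which is why the accounting can look different from ordinary file I/O:

    import java.io.IOException;
    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.file.Path;
    import java.nio.file.StandardOpenOption;

    public class MmapRead {
        public static void main(String[] args) throws IOException {
            Path indexFile = Path.of("index.dat"); // placeholder file name
            try (FileChannel ch = FileChannel.open(indexFile, StandardOpenOption.READ)) {
                MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
                // This access may trigger a major page fault if the page isn't resident;
                // the thread then waits on disk inside the fault handler.
                long firstWord = buf.getLong(0);
                System.out.println(firstWord);
            }
        }
    }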
I remember there is Ksplice or something like that to upgrade even the kernel without complete downtime. Everything else can be upgraded piecemeal, provided that worker processes can be restarted without downtime.
If the hardware itself is the reason for the long startup time, kexec allows you to boot a new kernel from within an existing one and avoids the firmware/HW init.
The people strive to have these problems! Hockey stick growth, servers melting under signup requests, payment systems struggling under the stream of subscription payments! Scale up, up, up! And for that you might want to run your setup under k8s since day one, just in case, even though a single inexpensive server would run the whole thing with a 5x capacity reserve. But that would feel like a side project, not a startup!
I'd argue that a lot of modern web engineering pretends to be built for problems most people won't have. So much resume-driven development is being done on making select, easy parts super scalable while ignoring some elephants in the room such as the datastore.
A good example is the obsession with "fast" web frameworks on your application servers, completely ignoring the fact that your database will be the first to give up even in most "heavy" web frameworks' default configurations without any optimization efforts.
I think HN's stack is the right choice for them and that it fulfills its purpose excellently, but I do seem to recall both of their hard drives failing more or less simultaneously & HN going down for about 8 hours not that long ago.
If that happened at the SaaS company I worked at previously, it would be a bloodbath. The churn would be huge. And our customers' customers would be churning from them. If that happened at a particularly inopportune time, like while we'd been raising money or something, it could potentially endanger the company.
(I'd like to stress again this is not a criticism of HN/dang, but just to illustrate a set of requirements where huge AWS spends do make sense.)
In my experience, simple systems perform better on average because they have fewer interconnected gears.
Much more complex systems do not perform as consistently as simple ones, and they are exponentially harder to debug, introspect and optimize at the end of the day.
Every time I deploy a service it goes down for anything between 30 seconds and 5 minutes. When I switch indices, the entire search engine is down for a day or more. Since the entire project is essentially non-commercial, I think this is fine. I don't need five nines.
If reliability was extremely important, scales would tilt differently, maybe cloud would be a good option. A lot of it is for CYA's sake as well. If I mess up with my server, that's both my problem and my responsibility. If a cloud provider messes up, then that's a SLA violation and maybe damages are due.
> The Random Mode has been overhauled, and is quite entertaining. I encourage you to give it a spin.
Yep, this is a good example of warping the Feeling Lucky pattern into a really neat little discovery tool.
IMO it would even be cool if the site was this part first, oh and hey it's also a search engine.
(While I'm random-ing: The Arch Wiki is in there? Seriously? Just for that, I propose that it either be skinned to max vaporwave, or host a webcam pointed at a Manjaro machine, or both...I'll be waiting over here, downloading 4.1 GB of marginalia for my AUR build of PCManFM)
I haven't got the time to curate this stuff. There's like 10,000 domains in the list. It's some one off SQL script I think that generated the sample based on parameters lost to time.
The domains you get from browse:random are from a small selection, yeah. But if you start traversing with "similar" there is no such limitation; the only limit is that they must have a screenshot.
(There's also explore2.marginalia.nu which is not even limited to websites with a screenshot)
It really could only work around the time it existed I feel. The internet was a lot weirder back then.
One big difference then from now is that you basically need a PhD in the Canvas API (or WebGL or whatever) to accomplish something a 5 year old could do in Flash. Web design was a lot more accessible. You didn't have to worry about responsive designs and fluid layouts. You could just position:absolute everything and that was kinda fine.
I think you might have some nostalgia goggles on at the moment. There's nothing holding people back from making "weird" web pages today, they can even make them nice and responsive. One of the better concepts around HTML and CSS was separation of data and layout.
It's trivial to have a "weird" position:absolute design with a break for mobile that switches to a more fluid layout. Desktop users can have their "weird" layout but I can still read the page on mobile and you can readily crawl and index it.
People moved away from design tools like DreamWeaver that helped make "weird" stuff and instead installed WordPress or some CSS/JavaScript framework that just bakes in all the "boring" fluid layouts.
You're not necessarily wrong about Flash in terms of design or creation but your search engine wouldn't be terribly practical if everyone was still using Flash for everything. Flash allowed content packed inside SWFs but also allowed fetching external resources. You wouldn't be able to index any of that unless your crawler executed the Flash and/or inspected all the URL references for external resources.
Flash created an inaccessible deep web just like today's JavaScript website-is-an-application "sites".
Don't get me wrong, I love the old web with quirky table-based layouts, "unofficial" fansites, and personal homepages hosted on forgotten university servers in a closet. There was a vibrancy that's largely missing from today's web.
I think a big change has been that tools have become more geared toward the boring than the creative, and people treat content on the web as a side hustle. Google et al. haven't helped by favoring recency over other relevance factors.
There are quite a few interesting alternatives these days. Bored Button is one. There are some that are even more like it but I'm away from grep and my notes at the moment.
Custom index software built from scratch in Java. MariaDB link database. The entire search engine runs on a PC in my living room.
You could pretty trivially shard the index by `hash(domain) % numShards`. There's no support for this because I literally only have this single server, but it wouldn't be much work.
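A minimal sketch of that sharding scheme, with illustrative names that aren't from the actual codebase:

    // Route each domain to a shard with hash(domain) % numShards.
    public class DomainSharder {
        private final int numShards;

        public DomainSharder(int numShards) {
            this.numShards = numShards;
        }

        public int shardFor(String domain) {
            // floorMod keeps the result non-negative even when hashCode() is negative
            return Math.floorMod(domain.hashCode(), numShards);
        }

        public static void main(String[] args) {
            DomainSharder sharder = new DomainSharder(4);
            System.out.println(sharder.shardFor("marginalia.nu")); // some shard in [0, 4)
        }
    }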
Marginalia comes up on HN every so often, and I always look at it, think "oh that's neat!", maybe add it to my bookmarks toolbar, and then forget about it. Are there a lot of people who find themselves using it daily?
It's interesting, and I really appreciate that you aren't trying to out-Google Google. This seems like a useful tool in its own right in addition to what else exists.
I don't think that would make sense. If Google is struggling with search, a one man Google clone isn't going to do it better.
I also think that having "a google", one central search engine, is inherently a bad thing for the health of the Internet. It drives a lot of this search engine spam epidemic we're seeing.
A broader and (IMO) more interesting problem is Internet discovery.
A lot of Google's initial quality was due to the fact that the content it indexed was much higher quality.
Even besides the point that the websites they indexed were a lot less adversarial, they put a lot of emphasis on indexing academia, and were outspoken against what came to be their present mixed motives[1].
I gotta say, Marginalia seems like the best search engine I've ever used specifically for food recipes! All the links are to personal, readable, HTML websites with the recipes obviously front and center. Finally I've found out how to escape from corporate recipes on the web!
I don't use it daily, but I have reached for it multiple times in the last few weeks. I like it for finding blog posts, tutorials, comparisons, and hobby projects without getting caught up in fake articles like SEO-heavy wikis of copy-pasted content.
Use it maybe once every week or something like that. I like it.
I use it mostly for tech/programming/FOSS stuff. Especially for programming topics it can be good for filtering out all the ‘w3schools’ type of blog spam that just floods Google’s results.
(just as a reminder, these lists are only to satiate curious readers - there's no reproach for reposting! Reposts are fine on HN after a year or so: https://news.ycombinator.com/newsfaq.html)
Anecdotally, all the people I know who have recreational/pet/indoor chickens are lovely human beings, so I'm wholly in favor of this absurd industry and its success.
My first experience of the internet was telnet from Win 3 box to a X.25 PAD and then telnet to something JANET (UK) then something US based (NSF I think) and fire up Gopher or WAIS.
Later my boss asked me to look at this web thing that he had heard about. I fired up telnet and eventually found an on ramp to CERN. To me it looked rather like everything else but I'm not exactly a rocket scientist!
All seems nice until I get to see the search results, which I cannot "read". It's very difficult to read an output that goes 5 boxes across horizontally, with each such row then going on vertically "forever". It's like a yellow pages book from the 1990s.
Yeah, the contrast is one of the things I'm not entirely happy with. The positioning is also a bit off, especially if you resize the window a bit. As stated, needs polish. But I really like the idea of the search engine being a bit more transparent with how it works.
I remember being so excited about the search engine Neeva because they seemed to be building a full-fledged independent web index with top-notch talent. I was really bought into the idea of a premium new search experience that I could pay for (no ads) and that would revenue-share with some kinds of content. But years later they focused on crypto and AI instead, and I always find myself just googling it because I have ad block and the results are more relevant. Sigh.
It's really bad. For years I've been saying that I want someone to make a search engine where I am the customer and that I'd be happy to pay for. Now Kagi is here and I am too cheap to pay for it. I feel called out.
FWIW, they are not the only one where you can pay to be a customer; they just happen to think that paying per search is the strategy they want to go with. I also just double-checked, and they do actually offer an all-you-can-eat plan for $25/mo, or $255 if you pay annually.
Now, the bad news is there is an associated discussion going on over in the "DDG integrates GPT" thread where the intersection of "I want to pay" and "I don't want to GPT anything" is damn near nil :-(
Hmm, "getting started with react" yields the following as the first match:
"https://frontendmasters.com/courses/complete-react-v5/gettin...
Getting Started with Pure React - Complete Intro to React, v5 | Frontend Masters
The "Getting Started with Pure React" Lesson is part of the full, Complete Intro to React, v5 course featured in this preview video."
It's in part a measure to limit the scope of the project (the entire thing runs off a single PC), but it's also hard to build a good language model for a language you don't speak, and I only speak English and Swedish. But if the project grows, gets more hardware, and contributors that speak other languages, then maybe this will change in the future.
I'm not sure how effective it is at present - just gave it a very quick test on searching for info on roman coins - but the concept is great. This is something that I've often wished existed.
If I'm searching for roman coins I certainly don't want to find commercial sites selling them (I know what those are), or even the well-known online national collections or auction archives... I'd like to be able to find the specialist sites built by collectors (and maybe academics) that are non-commercial and way more interesting.
In the early days of the internet some specialist content/pages were organized into "web rings" each linking to each other, but nowadays we're mostly relying on search to discover new pages, and it seems a lot of the hobbyist content is way harder to find, assuming it's even out there.
I was searching "imp constantinvs", which is part of the legend on many coins of Constantine the Great. I would expect to see these details listed on any hobbyist sites, as well as the commercial ones I'm not interested in.
BTW #1, 5, 6 are all good sites, but those are very mainstream - those will be top links in Google as well. #6 is purely commercial - an auction house. #5 is a coin dealer's commercial site, but has good collector resources (discussion board, Wiki, collectors galleries) as well.
Do you know of any hobbyist sites within this space? I want to check a thing, could be this corner of the internet isn't well indexed. I should be able to tell with explore2.
I'm at work right now, so these are just some examples off the top of my head. I can give more examples later if it's useful. Some of these sites will include links to other collector/hobbyist sites.
Very cool project! How much hardware do I need to run this? E.g. can I run it on an old Xeon server with 12 threads and 48GB RAM with HDDs, or that'd be too slow?
The query processing is fairly crude. For better or worse, it doesn't do much special processing. Which means you basically need a website that repeatedly says "list of CPU architectures" to rank well.
Most of the pages that contain such a title are also actual lists. The index de-prioritizes documents that are mostly lists or tabular data, as they're very often false positives, since they tend to contain repeated words.
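One hypothetical way such a penalty could be computed (a made-up sketch to illustrate the idea, not the engine's real ranking code): scale the score down when a large share of the words sit in list/table elements, or when few distinct words account for most of the text.

    // Hypothetical "mostly lists/tables" penalty, purely illustrative.
    public class ListPenalty {
        /**
         * @param wordsInListsOrTables words found inside list/table-style elements
         * @param totalWords           total words in the document
         * @param distinctWords        number of distinct words in the document
         * @return a multiplier in (0, 1] applied to the document's score
         */
        public static double penalty(int wordsInListsOrTables, int totalWords, int distinctWords) {
            if (totalWords == 0) return 1.0;
            double listFraction = (double) wordsInListsOrTables / totalWords;
            double repetition = 1.0 - (double) distinctWords / totalWords;
            double badness = Math.max(listFraction, repetition);
            // only punish clearly list-like or repetitive documents
            return badness > 0.5 ? 1.0 - 0.5 * badness : 1.0;
        }

        public static void main(String[] args) {
            System.out.println(penalty(800, 1000, 120)); // list-heavy and repetitive -> ~0.56
            System.out.println(penalty(50, 1000, 600));  // ordinary prose -> 1.0
        }
    }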
@marginalia_nu, a few months ago you said you would consider open-sourcing this search engine. Are there any tasks that we GitHub warriors can help with?
Wiby is manually curated, which means their results align much better with the operator's vision. Marginalia has an orders-of-magnitude bigger index, but not all of it is as consistently good as Wiby.
No, it really means "404 nothing found" for your search query. I searched for my company name and got nothing as well. A bit surprising, since it says "search the Internet" ;-)
What advantages does this offer over something like typing "$QUERY -site:*.com" into a mainstream search engine? I think webmasters in general do a pretty good job at self-segregating their sites into commercial and non-commercial entities through the use of different top-level domains.
The advantage is a better set of sites since there are a ton of interesting little .com sites out there, and lots of unwanted sites on .org and country code domains. He also blocks some URL patterns that appear on spammy domains regardless of TLD. Try a few searches or the random results page to see the difference. It’s fun browsing.
What really drives me crazy with Google is that they think that it is ok to not label potentially paywalled articles as ads.
I get tricked so often into clicking a news snippet offered by Google, only to then land on a site which not only presents me with a paywall, but also wants me to accept its cookie policy before it presents me the paywall.
I've always been curious about how search engines seed their crawling and indexing programs. Like, how do you know what domains, IPs, etc. to start scanning, and where is the origin?
I've changed the crawler design a couple of times, but the principle for growing the set of sites to be crawled is to look for sites that are (in some sense) adjacent to domains that were found to be good.
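In rough pseudocode terms, that growth loop might look something like the sketch below; the adjacency and quality checks are stand-ins for whatever the crawler actually uses.

    import java.util.ArrayDeque;
    import java.util.HashSet;
    import java.util.Queue;
    import java.util.Set;

    // Sketch of growing a crawl set from "good" seed domains via adjacency.
    public class CrawlSetGrowth {
        interface DomainGraph {
            Set<String> adjacentDomains(String domain); // e.g. linked or co-linked domains
            boolean looksGood(String domain);           // some quality heuristic
        }

        static Set<String> grow(Set<String> seeds, DomainGraph graph, int maxDomains) {
            Set<String> accepted = new HashSet<>(seeds);
            Queue<String> frontier = new ArrayDeque<>(seeds);
            while (!frontier.isEmpty() && accepted.size() < maxDomains) {
                String domain = frontier.poll();
                for (String neighbor : graph.adjacentDomains(domain)) {
                    if (!accepted.contains(neighbor) && graph.looksGood(neighbor)) {
                        accepted.add(neighbor);
                        frontier.add(neighbor);
                    }
                }
            }
            return accepted;
        }
    }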
This is only possible as long as the index knows about the domain, though. Yours are in there, but if not, anyone can shoot me an email or something and I can poke them into the database.
The limitation for known domains is in place to avoid abuse.
1) How many pages are in your index?
2) How do you do indexing and retrieval? Do you build a word index by document and find documents that match all words in the query?
Is there a domain list if I wanted to crawl the hosts myself? I see you have the raw crawl data, which is appreciated, but a raw domain list would be cool.
I guess technically that could be arranged. Although I don't want everyone to run their own crawler. It would annoy a lot of webmasters and end up with even more hurdles to be able to run a crawler. Better to share the data if possible.
It wouldn't. But such islands are typically not very interesting either. The context of who links to a domain is very important for a search engine for many tasks, not just discovery.
Very cool. The reason I ask is that, at first glance, the header "Search the Internet" implies to me that you are searching the entire Internet. It sounds like a more appropriate header would be "Search the obscure Internet".
To be fair, no search engine lets you search the entire Internet, not even Google does this.
The Internet arguably doesn't even have a size. You can construct a website like n.example.com/m which links to (n+1).example.com/m and n.example.com/(m+1), for each m and n between 0 and 1e308.
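A toy version of such an "infinite" site fits in a few lines with the JDK's built-in HTTP server; everything here (port, URL layout) is made up, purely to illustrate the construction:

    import com.sun.net.httpserver.HttpServer;
    import java.io.OutputStream;
    import java.net.InetSocketAddress;

    // Toy "infinite" website: every page /n/m links to two further pages.
    public class InfiniteSite {
        public static void main(String[] args) throws Exception {
            HttpServer server = HttpServer.create(new InetSocketAddress(8000), 0);
            server.createContext("/", exchange -> {
                String[] parts = exchange.getRequestURI().getPath().split("/");
                long n = parseOrZero(parts, 1);
                long m = parseOrZero(parts, 2);
                String body = "<html><body><p>Page " + n + "/" + m + "</p>"
                        + "<a href=\"/" + (n + 1) + "/" + m + "\">next n</a> "
                        + "<a href=\"/" + n + "/" + (m + 1) + "\">next m</a>"
                        + "</body></html>";
                byte[] bytes = body.getBytes();
                exchange.getResponseHeaders().set("Content-Type", "text/html");
                exchange.sendResponseHeaders(200, bytes.length);
                try (OutputStream os = exchange.getResponseBody()) {
                    os.write(bytes);
                }
            });
            server.start();
        }

        static long parseOrZero(String[] parts, int i) {
            try {
                return i < parts.length ? Long.parseLong(parts[i]) : 0;
            } catch (NumberFormatException e) {
                return 0;
            }
        }
    }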
I did it! For every two numbers, calc.shpakovsky.ru has a static(-looking) webpage showing their sum (or difference, etc.), together with links to several other pages. The only limitation I know of is the 4k URL length. Interestingly enough, major search engines are rather smart about it and cooled down their indexing efforts after some time. Guess I'm not the first one to make such a website.
I remember reading somewhere that Google used dmoz (https://en.wikipedia.org/wiki/DMOZ) as seed page for their crawler. Not sure if it's true though...
I've actually sort of solved this recently. Marginalia's ranking algorithm is a modified PageRank that instead of links uses website adjacencies[1].
It can rank websites even if they aren't indexed, based on who is linking to them.
Vanilla PageRank can't do this very well. Domains that aren't indexed don't have (known) outgoing links, so they end up in the periphery of the rank. There are some tricks to keep them from messing up the algorithm completely, but they basically all rank poorly. That's even without considering all the well-known tricks for manipulating vanilla PageRank. The modified version seems very robust with regard to both problems.
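For reference, the vanilla algorithm being modified is just a power iteration over the link graph. A compact, generic sketch of that baseline (not the adjacency-based variant described above; it assumes every linked domain also appears as a key in the map):

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Textbook power-iteration PageRank over an adjacency map; baseline only.
    public class VanillaPageRank {
        static Map<String, Double> rank(Map<String, List<String>> outLinks, int iterations, double damping) {
            int n = outLinks.size();
            Map<String, Double> ranks = new HashMap<>();
            outLinks.keySet().forEach(k -> ranks.put(k, 1.0 / n));

            for (int it = 0; it < iterations; it++) {
                Map<String, Double> next = new HashMap<>();
                outLinks.keySet().forEach(k -> next.put(k, (1.0 - damping) / n));
                for (var entry : outLinks.entrySet()) {
                    List<String> targets = entry.getValue();
                    if (targets.isEmpty()) continue; // dangling nodes' mass dropped for brevity
                    double share = damping * ranks.get(entry.getKey()) / targets.size();
                    for (String target : targets) {
                        next.merge(target, share, Double::sum);
                    }
                }
                ranks.clear();
                ranks.putAll(next);
            }
            return ranks;
        }
    }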
That doesn't actually sound very different from my experience with major search engines beyond the first page. I've taken it as a bit of a law that the Internet outside of large centralized and/or moderated sites gets very fringe very quickly. Since the whole point of the search engine is to display noncommercial sites, users will inevitably face thousands of self-published blogs of varying beliefs, quality, and truthiness. As these fringe sites take up more domain names by total volume than mainstream platforms (there is only one Twitter, Facebook, et al.), I am not surprised at all that they seem to be even more voluminous here than on commercial search engines.
Isn't 1 just the result of 2? It's because we don't have an answer and because there probably isn't an answer to the second question that we need universal broadcast as you called it.
We could get away from it only if we figure out an answer to the second question but I suspect we'll never get to an answer.