Marginalia: DIY search engine that focuses on non-commercial content (marginalia.nu)
550 points by thunderbong on April 18, 2023 | hide | past | favorite | 193 comments



For all the talk of needing all the cloud infra to run even a simple website, Marginalia hits the frontpage of HN and we can't even bring a single PC sitting in some guy's living room to its knees.


https://www.marginalia.nu/junk/just-a-fleshwound.webp

If anything it's running faster now. All you've done is warm up the caches and give the JVM a chance to optimize the hottest code.

(real talk the SSDs are running pretty near 100% utilization though)


Kudos. [0] is the best URL I have come across in the past few years

[0] https://www.marginalia.nu/junk/just-a-fleshwound.webp


"/junk/just-a-fleshwound.webp" for those on mobile


I love this!! :) How much does 24 real cores + 126GB ram cost on the cloud? A million dollars?


Yeah I wouldn't try to run this in the cloud. Would be broke as a joke in a week.

This is $5000 worth of consumer hardware, give or take.


nice


More than the cores and RAM, you have bigger issues with I/O (both throughput and latency) to disk and the network from cloud providers. Physical hardware, even when comparing cores/RAM 1:1, is outrageously faster than cloud services.


Don't use EBS, use the local SSDs, which are an option for most cloud VMs.


Problem is that those volumes are ephemeral and may not provide the reliability guarantees that the EBS volumes do, so they're only really good as cache and not for any persistent data you care about.


I would argue that a rack-mounted chassis with lots of disks is also ephemeral, just less so: most failures in a server can be fixed by swapping some parts, and your data is still there.

But AWS and its competitors don’t have an offering even close to comparable to what you can get in a commodity server. A 1U server with one or two CPU sockets and 8-12 hot-swap NVMe bays is easy to buy and not terribly expensive, and you can easily fill it with 100+ TiB of storage with several hundred Gbps of storage bandwidth and more IOPS than you are likely able to use. EC2 has no comparable offering at any price.

(A Micron 9400 drive supposedly has 7GBps / 56Gbps of usable bandwidth. 10 of them gives 560Gbps, and a modern machine with lots of PCIe 5.0 lanes may actually be able to use a lot of this. As far as I can tell, you literally cannot pay AWS for anywhere near this much bandwidth, but you can buy it for $20k or so.)


> I would argue that a rack-mounted chassis with lots of disks is also ephemeral, just less so: most failures in a server can be fixed by swapping some parts, and your data is still there.

True but this also depends on design decisions AWS made with regards to those volumes.

Indeed it could be that the volume is internally (at the hypervisor level) redundant (maybe with something like ZFS or other proprietary RAID), but there's no way to know.

Furthermore, AWS doesn't allow you to really keep a tab or reservation on the physical machine your VM is on - every time a VM is powered up, it gets assigned a random host machine. If there is a hardware failure they advise you to reboot the instance so it gets rescheduled on another machine, so even though technically your data may still be on that physical host machine, you have no way to get it back.

AWS's intent with these seems to be a transient cache/scratchpad, so they don't offer much in the way of durability or recovery strategies for those volumes. Their hypervisor seems to treat them as disposable, which is a fair design decision considering the planned use-case, but it means you can't/shouldn't use them for any persistent data.

Being in control of your own hardware (or at the very least, renting physical hardware from a provider as opposed to a VM like in AWS) will indeed allow you to get reliable direct-attach storage.


Like this?

https://instances.vantage.sh/aws/ec2/im4gn.8xlarge?region=us...

I can buy a rather nicer 1U machine with substantially better local storage for something like half the 1-year reserved annual cost of this thing.

If you buy your own servers, you can mix and match CPUs and storage, and you can get a lot of NVMe storage capacity and bandwidth, and cloud providers don’t seem to have comparable products.


About 150€ per month as a dedicated server from Hetzner.


Something like €200/mo if you factor in the need for disk space as well. This is also Hetzner we're talking about. They're sort of infamous for horror stories of arbitrarily removed servers and having shitty support. They're the cheapest for a reason.

But with dedicated servers, are we really talking cloud?


A dedicated server is obviously not “cloud”, but that wasn’t really the point I was making.

My point was rather underlining the absurdity of using cloud for everything.

Hetzner is just an example of a dedicated server provider. There are others, some in the same price range, others a bit more.

As an aside to my point, it is often cheaper and more flexible to use dedicated servers than to buy and colocate your own hardware.


I only rent a small server from them, but I've been happy with their support. Talked to a real human who could help me with tech questions, even though I pay them next to nothing.

(No affiliation.)


Good to hear. You only really hear from the people complaining. With as many customers as they have, it's hard to know how representative it is.


Most of the problems I read about are during the initial signup stage. They ask for a copy of your passport etc., and even then some people can't sign up because presumably their info is triggering something in Hetzner's anti-fraud checks. This sucks for those people of course.

The other common cause of issues is things like crypto which they don't want in their network at all.

This will sound like I am downplaying what people have experienced and/or being apologetic on their behalf, but that is not my intention. I am just a small-time customer of theirs. I've had 1 or 2 dedicated servers with them for many many years now, upgrading and migrating as necessary. (It used to be that if you waited for a year or two and upgraded you'd get a better server for cheaper. Those days are gone.)

I've only dealt with support over email, where they have been both capable and helpful, but what I needed was just plugging in a hardware KVM switch (free for a few hours - I never had to pay) or replacing a failing hard drive (they do this with zero friction). Perhaps I am lenient on the tech support staff. After all, they are my people. I've been to a few datacenters and have huge respect for what they do.

On the presales side they seem to reply with a matter-of-fact tone and no flexibility. They are a German company after all.


Yeah.

I'm a bit wary I'd get lumped in with the crypto gang. A lot of what I'm doing with the search engine is fairly out there in terms of pushing the hardware in unusual ways.

It would also suck if there ever was a problem. The full state of the search engine is about 1 TB of data. It's not easy to just start up somewhere else if it vanished.


So the Cloud is the cheaper option, if you factor in energy cost and hardware depreciation.


The monthly cost of a dedicated server includes everything: bandwidth, power and hardware.

How do you figure a $5k annual cloud spend is cheaper than ~150€ per month?


I think you're confused.

~150€ in cloud costs is cheaper than $5k cost to buy the hardware the guy has in his living room.


Most people when talking about "cloud" refer to the likes of AWS/GCP/Azure rather than old-school bare-metal server hosts.


I think parent is talking about colo.


I do not understand your comment.


In Azure that's roughly 5k per year if you pay for the whole year upfront. I have the pleasure of playing with 64 cores, 256 GB RAM and 2x V100 for data science projects every now and then. That turns out to be roughly 32k per year.


I don't know, 32k per year is pretty steep... I mean ~35k, that's the ballpark number for the entire system (if bought new).


I share your perspective on pricing. I had a discussion with my team lead about why we haven't taken on the task of running our own machines. The rationale behind it is that while server management may seem easy, ensuring its security can be complex. Even in a sizable company, it's worth considering whether you want to shoulder the responsibility or simply pay a higher cost to have Microsoft handle it. Personally, I love hosting my own infrastructure. It's fun, potentially saves me some cash, allows me to learn a lot, and gives me full control. However, I understand now that on a company scale, others may see it differently.

--edit--

I forgot to add the following: that's 32k if you run the system 24/7. Usually it's up for a few hours per month, so you end up paying maybe 2k for the whole year.


I'm curious about your network bandwidth/load. You only serve text, right? [Edit: No, I see thumbnail images too!] Is the box in a datacenter? If not, what kind of Internet connection does it have?


Average load today has at worst been about 300 Kb/s TX, 200 Kb/s RX. I've got a 1000/100 Mbit/s down/up connection. Seems to be holding without much trouble.

Most pages with images do lazy loading so I'm not hit with 30 images all at once. They're also webp and cached via Cloudflare, which softens the blow quite a lot.


Semi-related: do you just run a static IP out of your house?


Yup.


Very well done! On mobile now but will check out the site once home.

Do you happen to have a writeup somewhere of your tech stack?


Not OP but https://github.com/MarginaliaSearch/MarginaliaSearch/blob/ma... has a diagram and component overview!


I'm working on the documentation. It's getting there, but it's still kinda thin in many places.


IMO it's actually incredibly well-documented and thoughtfully organized for a one-person project! You should be proud of what you've put together here!


Yeah I did a huge refactoring effort very recently. I put a lot of effort in making the code easy to poke around in and I feel that works very well.

But besides that, there's still a lot left to be desired when it comes to how it actually works. Not everything is easy to glean from the code alone.


It's a Debian server running nginx into a bunch of custom java services that use the spark microframework[1]. I use a MariaDB server for link data, and I've built a bespoke index in Java.

[1] https://sparkjava.com/ I don't use springboot or anything like that, besides Spark I'm not using frameworks.


If the SSDs were really maxed out you'd see high CPU usage/load as the CPU is blocked by IOWAIT.


Not in this case, it's all mmap.


Really? Even with mmap'ed memory, won't the CPU still register user code waiting on reading pages from disk as iowait? I'm surprised enough by that that if it doesn't, it sounds like a bug.


Yeah it's at least what I've been seeing. Although it could alternatively be that a lot of the I/O activity is predictive reads, and the threads don't actually stall on page faults all that often.


239 days up. That's brave too ;)


Rebooting is like an hour of downtime :-/

FWIW I'm going commando with no ECC ram too.


Love this new definition of ‘going commando’ lol


I remember there is ksplice or something like that to upgrade even the kernel without complete downtime. Everything else can be upgraded piecemeal, provided that worker processes can be restarted without downtime.


If the hardware itself is the reason for the long startup time, kexec allows you to boot a new kernel from within an existing one and avoids the firmware/HW init.


StackOverflow still just runs on a pair of (beefy) SQL servers. Modern web engineering is a joke.


A lot of modern web engineering is built for problems most people won't have.


The people strive to have these problems! Hockey stick growth, servers melting under signup requests, payment systems struggling under the stream of subscription payments! Scale up, up, up! And for that you might want to run your setup under k8s since day one, just in case, even though a single inexpensive server would run the whole thing with a 5x capacity reserve. But that would feel like a side project, not a startup!


I'd argue that a lot of modern web engineering pretends to be built for problems most people won't have. So much resume-driven development is being done on making select, easy parts super scalable while ignoring some elephants in the room such as the datastore.

A good example is the obsession with "fast" web frameworks on your application servers, completely ignoring the fact that your database will be the first to give up even in most "heavy" web frameworks' default configurations without any optimization efforts.


I believe Hacker News itself is a couple of servers too.


I think HN's stack is the right choice for them and that it fulfills its purpose excellently, but I do seem to recall both of their hard drives failing more or less simultaneously & HN going down for about 8 hours not that long ago.

If that happened at the SaaS company I worked at previously, it would be a bloodbath. The churn would be huge. And our customers' customers would be churning from them. If that happened at a particularly inopportune time, like while we'd been raising money or something, it could potentially endanger the company.

(I'd like to stress again this is not a criticism of HN/dang, but just to illustrate a set of requirements where huge AWS spends do make sense.)


One big server with a flat text file DB on NVME drives AFAIK.


I believe the application code is single threaded due to its interpreter too.


it does run pretty slow under load though, and they have acknowledged it's due to this


HN has mutable data though. That's a much harder problem than indexing a large amount of static data like a search engine.


Not only that, but I don't think the custom datastore handles concurrent writes.

One thread FTW. :)


I read once that Reddit is just one big SQL database.


Reddit was using Cassandra in 2010 (https://www.reddit.com/r/programming/comments/bcqhi/reddits_...). The experience was mentioned on HN at https://news.ycombinator.com/item?id=21694461. I expect that they started with an RDBMS and made various moves over the years.


Wikipedia too.



dang? Bueller? Anyone?


24 cores, 128 GB RAM. One could run 10-20 EC2 instances to utilize a box like this, and produce an impression of sprawling backend infrastructure.


From my experience simple systems perform better on average due to having fewer interconnected gears.

Much more complex systems do not perform as consistently as simple ones, and they are exponentially harder to debug, introspect and optimize at the end of the day.


I doubt anyone would be foolish enough to claim that the site NEEDS 'cloud infra' to run.


I think it sort of depends on what you want.

Every time I deploy a service it goes down for anything between 30 seconds and 5 minutes. When I switch indices, the entire search engine is down for a day or more. Since the entire project is essentially non-commercial, I think this is fine. I don't need five nines.

If reliability was extremely important, the scales would tilt differently; maybe cloud would be a good option. A lot of it is for CYA's sake as well. If I mess up with my server, that's both my problem and my responsibility. If a cloud provider messes up, then that's an SLA violation and maybe damages are due.


It depends. Both https://google.com and, say, https://www.medusa.priv.at/ are technically web sites, but the complexity of the tech that makes them work is pretty different.


That is pretty impressive - not only not on its knees, but very responsive atm.


> The Random Mode has been overhauled, and is quite entertaining. I encourage you to give it a spin.

Yep, this is a good example of warping the Feeling Lucky pattern into a really neat little discovery tool.

IMO it would even be cool if the site was this part first, oh and hey it's also a search engine.

(While I'm random-ing: The Arch Wiki is in there? Seriously? Just for that, I propose that it either be skinned to max vaporwave, or host a webcam pointed at a Manjaro machine, or both...I'll be waiting over here, downloading 4.1 GB of marginalia for my AUR build of PCManFM)


I haven't got the time to curate this stuff. There's like 10,000 domains in the list. It's some one-off SQL script, I think, that generated the sample based on parameters lost to time.


are domains whitelisted?


The domains you get from browse:random are from a small selection, yeah. But if you start traversing with "similar" there is no such limitation; the only limit is that they must have a screenshot.

(There's also explore2.marginalia.nu which is not even limited to websites with a screenshot)


StumbleUpon used to do this. It was really cool.


StumbleUpon was amazing. I feel like that was really peak internet, at least for me. I found so many weird, awesome things.


It really could only work around the time it existed I feel. The internet was a lot weirder back then.

One big difference then from now is that you basically need a PhD in the Canvas API (or WebGL or whatever) to accomplish something a 5 year old could do in Flash. Web design was a lot more accessible. You didn't have to worry about responsive designs and fluid layouts. You could just position:absolute everything and that was kinda fine.


I think you might have some nostalgia goggles on at the moment. There's nothing holding people back from making "weird" web pages today, they can even make them nice and responsive. One of the better concepts around HTML and CSS was separation of data and layout.

It's trivial to have a "weird" position:absolute design with a break for mobile that switches to a more fluid layout. Desktop users can have their "weird" layout but I can still read the page on mobile and you can readily crawl and index it.

People moved away from design tools like DreamWeaver that helped make "weird" stuff and instead installed WordPress or some CSS/JavaScript framework that just bakes in all the "boring" fluid layouts.

You're not necessarily wrong about Flash in terms of design or creation but your search engine wouldn't be terribly practical if everyone was still using Flash for everything. Flash allowed content packed inside SWFs but also allowed fetching external resources. You wouldn't be able to index any of that unless your crawler executed the Flash and/or inspected all the URL references for external resources.

Flash created an inaccessible deep web just like today's JavaScript website-is-an-application "sites".

Don't get me wrong, I love the old web with quirky table-based layouts, "unofficial" fansites, and personal homepages hosted on forgotten university servers in a closet. There was a vibrancy that's largely missing from today's web.

I think a big change has been that tools have become more geared for the boring than the creative, and people treat content on the web as a side hustle. Google et al. haven't helped by favoring recency over other relevance factors.


Is it trivial enough that a 5 year old could do it in a point-and-click editor?


It could be. But modern tools don't bother. Then again, Flash's usability by a 5 year old is being a bit oversold here.


There are quite a few interesting alternatives these days. Bored Button is one. There are some that are even more like it but I'm away from grep and my notes at the moment.

Reddit even has some kinda-similar subs.


Reminds me of StumbleUpon. I still miss that.


That period gave rise to the most educational web surfing I've ever experienced.


Watch my computer struggle here: https://www.marginalia.nu/stats/


What is the stack? Can it scale up?


Custom index software built from scratch in Java. MariaDB link database. The entire search engine runs on a PC in my living room.

You could pretty trivially shard the index by `hash(domain) % numShards`. There's no support for this because I literally only have this single server, but it wouldn't be much work.
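For a rough idea, a minimal sketch of that kind of routing (class and method names are my own invention, not the actual code); a query front-end would then fan the query out to every shard and merge the result lists:

    // Hypothetical sketch of routing documents to shards by domain hash.
    // Names here are made up for illustration, not from the Marginalia codebase.
    class ShardRouter {
        private final int numShards;

        ShardRouter(int numShards) {
            this.numShards = numShards;
        }

        // Every document from the same domain lands on the same shard, so
        // per-domain statistics and ranking data stay local to one index node.
        int shardFor(String domain) {
            return Math.floorMod(domain.hashCode(), numShards);
        }
    }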



Marginalia comes up on HN every so often, and I always look at it, think "oh that's neat!", maybe add it to my bookmarks toolbar, and then forget about it. Are there a lot of people who find themselves using it daily?


I don't even use it daily. It's not a Google replacement, and it's not trying to be either. It's more of an on-ramp for the obscure web at this point.

That said, it's gotten way better at finding stuff with the last few releases.


It's interesting, and I really appreciate that you aren't trying to out-Google Google. This seems like a useful tool in its own right in addition to what else exists.


I don't think that would make sense. If Google is struggling with search, a one man Google clone isn't going to do it better.

I also think that having "a google", one central search engine, is inherently a bad thing for the health of the Internet. It drives a lot of this search engine spam epidemic we're seeing.

A broader and (IMO) more interesting problem is Internet discovery.


Without/before commerce one would link to similar websites as much as possible. Now those are called competitors.

I bet one could make a fascinating ranking algo that groups sites by subject then sorts them by the number of links to others in that group.

So the perfect SEO would be to have a blogroll at the top of the left menu with every related website in it.

i.e. 3 stores sell the same item. Nr 1 is the one linking to the other 2. Extra points for linking to that specific product page.


Google used to be a ~one man search engine (and it was definitely better a few years after that than it is today).


A lot of Google's initial quality was due to the fact that the content it indexed was much higher quality.

Even besides the point that the websites they indexed were a lot less adversarial, they put a lot of emphasis on indexing academia, and were outspoken against what came to be their present mixed motives[1].

[1] http://infolab.stanford.edu/~backrub/google.html#a


Well, two-man.


I think it was actually three-man with Scott Hassan.


I gotta say, Marginalia seems like the best search engine I've ever used specifically for food recipes! All the links are to personal, readable, HTML websites with the recipes obviously front and center. Finally I've found out how to escape from corporate recipes on the web!

https://search.marginalia.nu/search?query=spanish+rice+recip...


Yeah I've got special logic in place to help find useful good recipes. I'm quite happy with how well it works, despite being extremely basic[1]

[1] https://github.com/MarginaliaSearch/MarginaliaSearch/blob/ma...
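As a rough illustration of that kind of heuristic (the term list and weights here are invented for the example, not the engine's actual values - see [1] for the real thing):

    // Rough sketch of a keyword-based recipe boost. Terms and weights are
    // placeholders; the kale weight echoes the reply below.
    import java.util.List;
    import java.util.Map;

    class RecipeHeuristic {
        private static final Map<String, Double> INGREDIENT_WEIGHTS = Map.of(
            "flour", 1.0,
            "butter", 1.0,
            "garlic", 1.0,
            "kale", 2.0   // hypothetical double weight
        );

        // Count weighted ingredient mentions as a crude "this looks like a recipe" signal.
        static double recipeScore(List<String> documentTerms) {
            double score = 0;
            for (String term : documentTerms) {
                score += INGREDIENT_WEIGHTS.getOrDefault(term, 0.0);
            }
            return score;
        }
    }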


You weight kale double that of every other ingredient, as far as I can see.

Made me chuckle a bit.


I do like Kale, not gonna lie.


I use it as my daily time-waster, it has a lot more interesting stuff than you'll find on sites like Reddit or Twitter

https://search.marginalia.nu/explore/random


What have you done? I tried to cut back on Reddit, but this seems like I could go down a rabbit hole for a few hours per day.


I don't use it daily, but I have reached for it multiple times in the last few weeks. I like it for finding blog posts, tutorials, comparisons, and hobby projects without getting caught up in fake articles like SEO-heavy wikis of copy-pasted content.


Use it maybe once every week or something like that. I like it.

I use it mostly for tech/programming/FOSS stuff. Especially for programming topics it can be good for filtering out all the ‘w3schools’ type of blog spam that just floods Google’s results.


I use it from time to time when I want to read something interesting. It can be a great source of articles that feel HN-worthy, if that makes sense.


Related. Others?

Marginalia Search has received an NLNet grant - https://news.ycombinator.com/item?id=34945541 - Feb 2023 (17 comments)

A Theoretical Justification (2021) - https://news.ycombinator.com/item?id=32586273 - Aug 2022 (22 comments)

The Evolution of Marginalia's Crawling - https://news.ycombinator.com/item?id=32565052 - Aug 2022 (22 comments)

Botspam apocalypse - https://news.ycombinator.com/item?id=32339314 - Aug 2022 (346 comments)

Marginalia Goes Open Source - https://news.ycombinator.com/item?id=31536626 - May 2022 (72 comments)

Uncertain Future for Marginalia Search - https://news.ycombinator.com/item?id=31200319 - April 2022 (37 comments)

Marginalia Search: 1 Year - https://news.ycombinator.com/item?id=30823481 - March 2022 (29 comments)

Show HN: Marginalia – Exploration Mode - https://news.ycombinator.com/item?id=30047455 - Jan 2022 (53 comments)

A search engine that favors text-heavy sites and punishes modern web design - https://news.ycombinator.com/item?id=28550764 - Sept 2021 (717 comments)

(just as a reminder, these lists are only to satiate curious readers - there's no reproach for reposting! Reposts are fine on HN after a year or so: https://news.ycombinator.com/newsfaq.html)



Anecdotally, all the people I know who have recreational/pet/indoor chickens are lovely human beings, so I'm wholly in favor of this absurd industry and its success.



"Recreational Chicken" is a two word poem if I ever saw one.


Thank you. I needed this in my life.


I'll counter with the top result for "cats": http://diabellalovescats.com/catland.htm


Like HN, Marginalia is a breath of fresh air in comparison to today's SEO-optimized, monolith-dominated web.

Is there a way to donate money?


It's on the front page of their site: https://memex.marginalia.nu/projects/edge/supporting.gmi


Silly me, I assumed the "Support" button at the top of the page was there for users who need... support.

I kept looking for a "Donate" button :-P

Thank you!


@marginalia_nu this is probably actionable feedback for you


The "random" button sent me on an hour long rabbit hole and I learned about (among other things) the gopher protocol. A+, would lose that time again.


My first experience of the internet was telnet from a Win 3 box to an X.25 PAD, then telnet to something JANET (UK), then something US based (NSF I think), and firing up Gopher or WAIS.

Later my boss asked me to look at this web thing that he had heard about. I fired up telnet and eventually found an on ramp to CERN. To me it looked rather like everything else but I'm not exactly a rocket scientist!

https://www.w3.org/History/1992/WWW/FAQ/WAISandGopher.html


All seems nice until I get to see the search results, which I cannot "read". It's very difficult to read output that goes 5 boxes across horizontally, with each such row then going on vertically "forever". It's like a yellow pages book from the 1990s.


Yeah, the "magic: the gathering" layout has some limitations. I want something that makes good use of a large screen though.

I've got some ideas in the pipe, but haven't had the time to give them enough polish that I'm happy with them.

This is an early draft: https://imgur.com/a/vMVO7CK


Oh I see! It displays well on mobile devices.

The draft looks nice. The text colour is a bit hard to distinguish from the surrounding background, and I don't have any eye conditions.


Yeah, the contrast is one of the things I'm not entirely happy with. The positioning is also a bit off, especially if you resize the window a bit. As stated, needs polish. But I really like the idea of the search engine being a bit more transparent with how it works.


That's much better. Would love to use it.


Encouraging to hear. I'm not a big front-end enthusiast so it's useful fuel for slogging through the design process.


I remember being so excited about the search engine Neeva because they seemed to be building a full-fledged independent web index with top-notch talent. I was really bought into the idea of a premium new search experience that I could pay for (no ads) and that would revenue-share with some kinds of content. But years later they focused on crypto and AI instead, and I always find myself just googling it because I have ad block and the results are more relevant, sigh


There's also kagi.com, which should fit all the requirements you described.


Such a bad name though. Try telling your friends about kagi. They'll type "kaggy". Maybe it'll be popular in Japan?


FWIW Googling "kaggy search" has "Did you mean: kagi search" at the top and kagi.com as the second hit.


How would they type 'google' 25 years ago?


Their new pricing is quite unaffordable now :(


It's really bad. For years I've been saying that I want someone to make a search engine where I am the customer and that I am happy to pay. Now kagi is here and I am too cheap to pay for it. I feel called out.


FWIW, they are not the only one where you can pay to be a customer, they just happen to think that paying-per-search is the strategy they want to go with. I also just double-checked and they do actually offer an all-you-can-eat for $25/mo and $255 if you pay annually

Now, the bad news is there is an associated discussion going on over in the "DDG integrates GPT" thread where the intersection of "I want to pay" and "I don't want to GPT anything" is damn near nil :-(


Per their website, they still seem to focus on ad-less search. What am I missing?


Yay! Marginalia considers my site important and good enough to be indexed!

Honestly, this makes me really happy. I would prefer that my traffic be driven by curated search engines, even if I get less traffic.

Also, I use this. I think it's great.


Hmm, "getting started with react" yields the following as the first match:

"https://frontendmasters.com/courses/complete-react-v5/gettin... Getting Started with Pure React - Complete Intro to React, v5 | Frontend Masters The "Getting Started with Pure React" Lesson is part of the full, Complete Intro to React, v5 course featured in this preview video."


Thanks for pointing it out. I blacklisted the domain. I don't mind commercial content, but if they're using SEO like that they're being a nuisance.


Is it limited to English? I made a few queries in French and Italian and got either no results or a couple of irrelevant ones.


Yes.

It's in part a measure to limit the scope of the project (the entire thing runs off a single PC), but it's also hard to build a good language model for a language you don't speak, and I only speak English and Swedish. But if the project grows, gets more hardware, and contributors that speak other languages, then maybe this will change in the future.


I'm not sure how effective it is at present - just gave it a very quick test on searching for info on roman coins - but the concept is great. This is something that I've often wished existed.

If I'm searching for roman coins I certainly don't want to find commercial sites selling them (I know what those are), or even the well-known online national collections or auction archives... I'd like to be able to find the specialist sites built by collectors (and maybe academics) that are non-commercial and way more interesting.

In the early days of the internet some specialist content/pages were organized into "web rings" each linking to each other, but nowadays we're mostly relying on search to discover new pages, and it seems a lot of the hobbyist content is way harder to find, assuming it's even out there.


What did you search? Try just 'roman coins'

#1: http://www.romancoins.info/Content.html

#2-4: were not very good

#5: https://www.forumancientcoins.com/dougsmith/voc1.html

#6: https://www.cngcoins.com/Greek+and+Roman+Coins.aspx

#7: https://www.crystalinks.com/romecoins.html

If you search for specifically the 'as' it may be eaten as a stop word :-/


I was searching "imp constantinvs" which is part of the legend on many coins of Constantine the Great. Would expect to see these details listed on any hobbyist sites, as well as the commercial ones I'm not interested in.

BTW #1, 5, 6 are all good sites, but those are very mainstream - those will be top links in Google as well. #6 is purely commercial - an auction house. #5 is a coin dealer's commercial site, but has good collector resources (discussion board, Wiki, collectors galleries) as well.


Do you know of any hobbyist sites within this space? I want to check a thing, could be this corner of the internet isn't well indexed. I should be able to tell with explore2.


constantinethegreatcoins.com is one - the owner is also a dealer, but this is his private hobbyist site.

Some examples of other non-commercial roman coin hobbyist sites (that will also rank fairly highly with Google) are:

augustuscoins.com, wildwinds.com, beastcoins.com, www.notinric.lechstepniewski.info, https://www.nummus-bibleii.com/

I'm at work right now, so these are just some examples off the top of my head. I can give more examples later if it's useful. Some of these site will include links to other collector/hobbyist sites.


Hmm, several of those weren't indexed, I added them to the crawl queue. Seems like the numismatics corner of the web isn't well indexed by marginalia.

constantinethegreatcoins does show up for 'imp constantinvs' though.


Wow! Relevance looks good for the queries I tried, and I like the square interface too.


Very cool project! How much hardware do I need to run this? E.g. can I run it on an old Xeon server with 12 threads and 48GB RAM with HDDs, or that'd be too slow?


It's mostly hungry for RAM, and needs fast SSDs.

I've got 128 GB RAM and more would be better. I run a test instance on 32 GB though.


It's quite hard to search for lists of any kind. For example:

list of Italian generals

list of CPU architectures

list of positive rights

etc...

Return no relevant results. Perhaps it's not giving enough weight to the 'list' aspect?


I think there's two reasons for this.

The query processing is fairly crude. For better or worse, it doesn't do much special processing. Which means you basically need a website that repeatedly says "list of CPU architectures" to rank well.

Most of the pages that contain such a title are also actual lists. The index de-prioritizes documents that are mostly lists or tabular data, as they're very often false positives since they contain a lot of repeated words.


It doesn't seem like an issue of ranking since there are no results with lists of any sort, even at the bottom.

Even if the word 'list of ...' only appears once, it wouldn't be filtered out, right?

Out of 100 million pages, it seems like there could easily be a few hundred thousand with lists.


@marginalia_nu, a few months ago you said you would consider open-sourcing this search engine. Are there any tasks that us GitHub warriors can help with?



Sorry, I was not aware of that. Thx


Thanks for this! It absolutely touched me. The www the way it was back in the 90s :) Those memories came back when getting results from Neocities.


What a cracking resource! When you need to get away from the beige web, a few clicks on Random Mode is all you need.


How does it compare to https://wiby.me/ ?


Wiby is manually curated, which means their results align much better with the operator's vision. Marginalia has an orders-of-magnitude bigger index, but not all of it is as consistently good as Wiby.


I get no results at all for this seemingly simple https://search.marginalia.nu/search?query=how+to+draw+a+3d+b..., presume it's been HN hugged to death?


It doesn't do semantic search or synonyms. Think keywords, not questions.

Search for "draw 3d box" or "draw a cube" and it starts giving results.


No, it really means "404 nothing found" for your search query. I searched for my company name and got nothing as well. A bit surprising since it says "search the Internet" ;-)


nah, it just has a tiny index.

search for just “3d box” or something like that.


What advantages does this offer over something like typing "$QUERY -site:*.com" into a mainstream search engine? I think webmasters in general do a pretty good job at self-segregating their sites into commercial and non-commercial entities through the use of different top-level domains.


The advantage is a better set of sites since there are a ton of interesting little .com sites out there, and lots of unwanted sites on .org and country code domains. He also blocks some URL patterns that appear on spammy domains regardless of TLD. Try a few searches or the random results page to see the difference. It’s fun browsing.


What really drives me crazy with Google is that they think that it is ok to not label potentially paywalled articles as ads.

I get tricked so often into clicking a news snippet offered by Google only to then land on a site which not only presents me a paywall, but also does want me to accept their cookie policy before they present me the paywall.

It makes me angry anew every time.


I've always been curious about how search engines seed their crawling and indexing programs. Like, how do you know what domains, IPs, etc. to start scanning, and where is the origin?


It's basically seeded with my personal bookmark list. Like a few dozen links.

Not exactly this, but close enough: https://memex.marginalia.nu/links/bookmarks.gmi

I've changed the crawler design a couple of times, but the principle for growing the set of sites to be crawled is to look for sites that are (in some sense) adjacent to domains that were found to be good.


May I submit my sites to your index? I think they'd be a good fit for the index.

https://www.thran.uk and https://wmw.thran.uk


You can add them yourself :-)

https://search.marginalia.nu/site/www.thran.uk

https://search.marginalia.nu/site/wmw.thran.uk

This is only possible as long as the index knows about the domain, though. Yours are, but if not, anyone can shoot me an email or something and I can poke them into the database.

The limitation for known domains is in place to avoid abuse.


Thanks!


1) How many pages are in your index? 2) How do you do indexing and retrieval? Do you build a word index by document and find documents that match all words in the query?


1) At this moment about 70 million documents. I've had it at about 110 million, dunno what the actual limit is.

2) Yes. Everything is in-house.

> Do you build a word index by document and find documents that match all words in the query?

Yeah. It's actually got three indices:

* One is a forward index with `document id -> document metadata`

* One is a priority term index with `term -> document id`.

* One is a full index with `term -> (document, term metadata)`

They're all based on static b-trees.
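As a rough sketch of those three mappings (types and field names are placeholders; the real structures are custom on-disk b-trees, not Java interfaces like these):

    // Sketch of the three lookups described above, with made-up types.
    interface ForwardIndex {
        DocumentMetadata metadataFor(long documentId);   // document id -> document metadata
    }

    interface PriorityTermIndex {
        long[] documentsFor(long termId);                // term -> document ids
    }

    interface FullIndex {
        TermPosting[] postingsFor(long termId);          // term -> (document id, term metadata)
    }

    record DocumentMetadata(long featureFlags, int quality) {}   // fields are placeholders
    record TermPosting(long documentId, long termMetadata) {}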


Is there a domain list if I wanted to crawl the hosts myself? I see you have the raw crawl data, which is appreciated, but a raw domain list would be cool.


I guess technically that could be arranged. Although I don't want everyone to run their own crawler. It would annoy a lot of webmasters and end up with even more hurdles to be able to run a crawler. Better to share the data if possible.


So if there was a new domain, unlinked by anything - this wouldn't find it?


It wouldn't. But such islands are typically not very interesting either. The context of who links to a domain is very important for a search engine for many tasks, not just discovery.


Very cool. The reason I ask is that at first glance the header "Search the Internet", to me, implies you are searching the entire internet. It sounds like a more appropriate header would be "Search the obscure Internet".


To be fair, no search engine lets you search the entire Internet, not even Google does this.

The Internet arguably doesn't even have a size. You can construct a website that's like n.example.com/m which links to '(n+1).example.com/m' and 'n.example.com/(m+1)', for each m and n between 0 and 1e308.
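A toy version of that construction, purely for illustration:

    // Every page links to two more pages, so a naive crawler never runs out
    // of URLs. Hypothetical code, example.com stands in for any domain.
    class CrawlerTrap {
        static String page(long n, long m) {
            return "<html><body>"
                 + "<a href=\"https://" + (n + 1) + ".example.com/" + m + "\">next subdomain</a> "
                 + "<a href=\"https://" + n + ".example.com/" + (m + 1) + "\">next page</a>"
                 + "</body></html>";
        }
    }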


I did it! For every two numbers, calc.shpakovsky.ru has a static(-looking) webpage showing their sum (or difference, etc.), together with links to several other pages. The only limitation I know of is 4k URL length. Interestingly enough, major search engines are rather smart about it and cooled down their indexing efforts after some time. Guess I'm not the first one to make such a website.


Haha, nice! Crawler traps are quite an old phenomenon. Been around since before Google.

Dunno about the others, but my crawler has a set depth it will crawl. It'll BFS for like 1000-10000 documents depending on some factors.
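A minimal sketch of what such a size-bounded BFS might look like (names, the Fetcher interface, and limits are illustrative, not the actual crawler code):

    import java.util.ArrayDeque;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Queue;
    import java.util.Set;

    class BoundedCrawler {
        interface Fetcher {
            List<String> fetchLinks(String url);
        }

        private final int maxDocuments;   // e.g. somewhere in the 1,000-10,000 range

        BoundedCrawler(int maxDocuments) {
            this.maxDocuments = maxDocuments;
        }

        Set<String> crawl(String seedUrl, Fetcher fetcher) {
            Set<String> visited = new HashSet<>();
            Queue<String> frontier = new ArrayDeque<>();
            frontier.add(seedUrl);
            // Stopping after maxDocuments is what defeats infinitely deep crawler traps.
            while (!frontier.isEmpty() && visited.size() < maxDocuments) {
                String url = frontier.poll();
                if (!visited.add(url)) continue;
                for (String link : fetcher.fetchLinks(url)) {
                    if (!visited.contains(link)) frontier.add(link);
                }
            }
            return visited;
        }
    }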


I remember reading somewhere that Google used dmoz (https://en.wikipedia.org/wiki/DMOZ) as a seed page for their crawler. Not sure if it's true though...


That may be a much easier question to answer than discovery.

How do you discover relevant new domains?


I've actually sort of solved this recently. Marginalia's ranking algorithm is a modified PageRank that instead of links uses website adjacencies[1].

It can rank websites even if they aren't indexed, based on who is linking to them.

Vanilla PageRank can't do this very well. Domains that aren't indexed don't have (known) outgoing links and sit in the periphery of the rank. There are some tricks to get these to not mess up the algorithm completely, but they basically all rank poorly. That's even without considering all the well known tricks for manipulating vanilla PageRank. The modified version seems very robust with regards to both problems.

[1] https://memex.marginalia.nu/log/73-new-approach-to-ranking.g...
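A hand-wavy sketch of the general idea, assuming a dense adjacency matrix for simplicity (the matrix construction, damping and other details here are assumptions; the actual algorithm is described in [1]):

    // PageRank-style iteration over a website adjacency/similarity matrix instead
    // of raw links, so a domain can accumulate rank from its neighbours even when
    // it has no crawled outgoing links.
    class AdjacencyRank {
        static double[] rank(double[][] adjacency, int iterations, double damping) {
            int n = adjacency.length;
            double[] rank = new double[n];
            java.util.Arrays.fill(rank, 1.0 / n);
            for (int it = 0; it < iterations; it++) {
                double[] next = new double[n];
                for (int i = 0; i < n; i++) {
                    double sum = 0;
                    for (int j = 0; j < n; j++) {
                        double rowWeight = rowSum(adjacency[j]);
                        if (rowWeight > 0) {
                            sum += adjacency[j][i] / rowWeight * rank[j];
                        }
                    }
                    next[i] = (1 - damping) / n + damping * sum;
                }
                rank = next;
            }
            return rank;
        }

        private static double rowSum(double[] row) {
            double s = 0;
            for (double v : row) s += v;
            return s;
        }
    }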


Start with Common Crawl and go from there.


My first search for a commodity item returned pages full of conspiracy theories and then drifted into anti-vax territory.


The unfiltered Internet in all its glory!


That doesn't actually sound very different from my experience with major search engines beyond the first page. I've taken it as a bit of a law that the Internet outside of large centralized and/or moderated sites gets very fringe very quickly. Since the whole point of the search engine is to display noncommercial sites, users will inevitably face thousands of self-published blogs of varying beliefs, quality, and truthiness. As these fringe sites take up more domain names by total volume than mainstream platforms (there is only one Twitter, Facebook, et al.), I am not surprised at all that they seem to be even more voluminous here than on commercial search engines.


Interesting, what did you search for?


I yearn for an exclusionary Internet. Most voices don't matter.


Would you mind explaining 1) why you want that and 2) who decides which voices do matter?


"1) why you want that"

Universal broadcast does not work (beneficially for society) in an industry built to monetize reach.

Everyone is entitled to their opinions, but voices are not equal in utility or worth.

"2) who decides which voices do matter?"

This is always the problem, isn't it? I don't have an answer for you.


Isn't 1 just the result of 2? It's because we don't have an answer and because there probably isn't an answer to the second question that we need universal broadcast as you called it.

We could get away from it only if we figure out an answer to the second question but I suspect we'll never get to an answer.


"Isn't 1 just the result of 2?"

Only in that reach will continue to be monetized regardless of its impact on society.


Isn't it already?



