Marginalia Goes Open Source (marginalia.nu)
358 points by georgehill on May 28, 2022 | 72 comments



Well done and thank you. Not just for the work, but for the attitude and courage of your convictions that come across clearly in your comments here on HN.

Search is presently a neighbourhood of technology that's fallen into a downtrodden ghetto. Maybe you've sowed some seeds that can grow between the cracks. But I think there are gangs and pushers roaming the streets that would be happy for it to remain derelict.

Thinking about the dynamics of open source, with respect to the many different interpretations of search that create tension within such a project: It occurs to me that whenever a somewhat mature project enters the ecosystem a robust strategy is to immediately clone-fork multiple identical copies of a project. This is like a "backup strategy". From there, form several loosely coordinated project groups that can spread risk, share threat intelligence, but each work with different models of crawling, filtering, funding, presentation and so on. Keep the original as an upstream parent. One of the children may survive.


Open source has a wonderful property: it only ever gets better as time goes on.

Thank you for permanently improving the FOSS search state of the art.


> I'm currently looking for hosting for a large term frequency data file that is necessary for several of the search engine's core functions. I really don't have the bandwidth to serve it myself. It's only a couple of hundred megabytes so it'll probably be solvable somehow.

I have a bit of (rented) space available. Most of it on Uberspace.

What are the expected traffic requirements?

Contact me if you like. Will update info in profile immediately.


Wouldn't this be a good match for BitTorrent distribution? The data will be updated regularly and then, all participants will be online and able to share their bandwidth while they download the new frequencies.


I'm not sure if it is appropriate for this use case, but wouldn't it be possible to host it on Backblaze, distribute through Cloudflare for free traffic? Or even just a torrent.


GitHub Releases can host files up to 2 GB, for free.


The search engine discussed on HN:

https://news.ycombinator.com/item?id=28550764 (September 2021, 725 comments)


> I feel GitHub has taken an incredibly toxic turn with its emphasis on social features

This is the first time I've seen this view of GitHub, and I'm struggling to figure out what specifically he's referring to.


This is a great gift to the community, but I really wish it would be hosted where the community is active. It seems like a lot of people will miss out because they won't be able to discover this.

I find a lot of projects because I follow people on github that star or fork really neat projects. I download projects from bitbucket, gitlab, and other self-hosted instances.

I don't know that I've ever contributed a PR back to any of those, as it's just a more friction-prone process. However, I've opened hundreds of PRs and issues on github because it's already a part of my daily process.

Thanks again


Thinking that a community lives in a git provider is a bad idea. It shouldn't matter who hosts a git instance, and I would argue it's better to fight for diversity here.


It might be better to fight for diversity, but I was pointing out the actuality of it all. Network effects are real, and this isn't the project I want to see die because of them.


I don't really understand this. Is there much more to it than a new login in terms of network effects?


So, github is the daily workflow for many people. You're in it, you're around it, you're using it. It's kind of the same as out-of-sight, out-of-mind.

If I want to, I can sign up for yet another site and do git-things with this codebase, but I have to change my daily process to include a one-off change to a new location. There are real costs with this mental flow change.


I don't really see the downside. The providers seem pretty similar, and I can have more than one tab open on my browser.

But thanks for the explanation; I don't feel it myself but it's good to know!


I would guess that the chan board members and other asshats raided their projects in the past, because they regularly attack / harass these kinds of projects when they're not fully open source.

And let's put it this way: the moderation tools at GitHub are a bad joke. Can't even disallow abusive language, so they're 100% ineffective.


The author of the article was saying that Github was toxic and I personally do not see an issue tracker as a social feature.


Comment threads everywhere (including on commits and lines of code), these comment threads then having emoji reactions, and the inability to disable any of that (same for the pull request feature, which also leads to more comment threads) can get somewhat annoying when a group has decided to target you for arbitrary reasons, and such potential 'needing to see harassment' being deeply embedded in one's code workflow (i.e. in your own repository) can get tedious.

Not everything needs to be turned into an embedded discussion forum.


Yeah, this is largely my objection to GitHub. I feel the design of GH encourages a performative communication style, especially from outsiders with no actual interest in furthering the project.

I also object to how the platform encourages low-effort drive-by participation from bots and the like.


"disallow abusive language"? Do you think chan board users are too stupid to know how to bypass an "abusive language" filter?


> chan board users

You mean "image board user", right?


Sneedacity debacle.


Funny you'd mention this. I installed Audacity a few hours ago and much to my dismay the application informed me upon starting it for the first time that update checks were enabled by default - in other words, Audacity had already called home and I couldn't have done anything to stop it from doing so.

So yes, they deserved the abuse.


First off, nobody deserves any sort of abuse.

Secondly, the Sneedacity incident was abuse targeted at a fork of the Audacity project, not at the original 'hostile takeover' project itself. The fork was intended to remove said abuse you feel is deserving of more abuse.


Seconded. GitHub's social features are undeniable (why distribute a VCS if not to make it more social?), but I'm curious about the toxicity.


The GitHub UI forces their view of the world and their development flow on you.

It doesn't bother me personally, but perhaps I've already been indoctrinated over the past ten years.


I think things like the personalized timeline and comments reactions may be considered toxic by some people.


(Disclaimer: github employee speaking for myself)

Reactions are not really about fostering “social” aspects of code review and issue triage as much as they are to prevent the need for a build up of low effort “+1” style comments. I view them as a community management tool that OSS maintainers would be much worse off without.


Distributed version control is about making code and patches easier to share; if anything, the point is to reduce the socialization. It's a more convenient and development-centric form of mailing lists, not a more social version.


> why distribute a VCS if not to make it more social?

Because it is easier to set up?


Thanks for making this open source. What about the index? Are you open sourcing that also?


If I can solve the logistics of publishing that data, then sure. In its most compressed form it's still on the order of 100 GB.

The intermediate goal is to have some standardized testing dataset of a couple of hundred megabytes to a gigabyte or so.


Like another commenter suggested, torrents might be a good solution once it's seeded


Cool. Looking forward to seeing the intermediate dataset.

I think you should post a ToDo list on the git repo. People can then contribute their skills.


Yeah, that's a good idea. I'm looking at a bunch of ideas for reducing the friction to contributing, still a bit of work that needs doing in that area.


TLDR; There are lots of interesting tangents here, even if you don't want to run a web search engine.

I'm in search of kindred spirits seeking to flesh out the full power of the memex. I found the notes quite interesting[1]

I've gone through quite a bit of the related content, up to watching a bit about the Gemini protocol, which is interesting.

1 - https://memex.marginalia.nu/log/08-whatever-happened-to-the-...


Great job with Marginalia. Do you plan to open source your data as well or only the code?


If I can solve the logistics, maybe. I don't have the needed bandwidth and off-site storage at this point.


What sort of sizes are we talking about? I'm wondering if it would be possible to "crowdfund" the storage costs for a requester-pays S3 bucket for it.


I'm probably producing around 250-500 GB of data per month at this point.
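
For anyone weighing the crowdfunding idea, here is a rough back-of-the-envelope sketch of how that rate adds up. It assumes the figure is in gigabytes and that every month's output is kept forever, which may well not be the case. The function names and the price per GB are made up for illustration; the price is a placeholder, not a quote from any provider.

```python
def cumulative_storage_gb(monthly_gb: float, months: int) -> float:
    """Total data stored after `months`, assuming nothing is ever deleted."""
    return monthly_gb * months


def monthly_bill(stored_gb: float, price_per_gb_month: float) -> float:
    """Storage cost for one month at a flat per-GB price (transfer fees ignored)."""
    return stored_gb * price_per_gb_month


# Placeholder price, NOT a real quote from any storage provider.
HYPOTHETICAL_PRICE = 0.02  # currency units per GB per month

for rate in (250, 500):  # the low and high ends of the stated range
    stored = cumulative_storage_gb(rate, months=12)
    bill = monthly_bill(stored, HYPOTHETICAL_PRICE)
    print(f"{rate} GB/month -> {stored:.0f} GB after a year, "
          f"{bill:.2f} per month to keep it")
```

The point is only that the bill keeps growing every month as long as the data piles up, so a one-off donation doesn't cover it.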


What's the cumulative size for the index to date? I'm not rich by any measure, but if it's within reach I'd probably fund the storage costs.


Reach out to Jason Scott -- textfiles.com -- and see if he knows anyone who would be interested.

He might know some folks.


Any idea how compressible that is?

If it's something that compresses really well (eg text data in a database), then live compression filesystems (eg ZFS, likely others) could potentially help make that workable.


The data is either already compressed or dense binary soup, so no luck.
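
A quick way to see why, as a minimal sketch rather than anything measured on the actual index: compress some repetitive text and some random bytes with zlib and compare. Random bytes stand in here for "already compressed or dense binary" data, and the sample text and the `compression_ratio` helper are invented for the illustration.

```python
import os
import zlib


def compression_ratio(data: bytes, level: int = 6) -> float:
    """Compressed size divided by original size (lower means it compressed better)."""
    return len(zlib.compress(data, level)) / len(data)


# Repetitive text stands in for "text data in a database".
text_like = b"the quick brown fox jumps over the lazy dog\n" * 20_000

# Random bytes stand in for already-compressed or dense binary data.
dense_like = os.urandom(len(text_like))

text_ratio = compression_ratio(text_like)
dense_ratio = compression_ratio(dense_like)

print(f"text-like data: {text_ratio:.4f} of original size")
print(f"dense data:     {dense_ratio:.4f} of original size")
```

The text shrinks to a small fraction of its size while the random bytes stay about the same size or get slightly bigger, which is what a compressing filesystem would run into with this kind of data.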


Idea: Host datasets as one or more torrents. Thoughts?


AGPLv3+, nice!


Do you have any high level architecture docs? Would love to hack on it.


I'm working on it. Still in the process of setting up the project, so it will be a while before any sort of useful level of documentation is in place.


Alright, this seems great. Going to test it in daily use.

I am ready to start paying money for a search engine. So fed up with google SEO spam.


Probably not going to replace Google, it's at best aiming to pick up the slack where it's struggling. But dunno, maybe down the line it will have grown into a real open source alternative.


Thank god, a search engine that doesn't try to sell you stuff. Glad it is open sourced. Are they from Scandinavia?


I'm based in Sweden, yeah.


Haven't there been estimates that running a globally indexed search engine product costs a billion dollars per year? As always, thank you for marginalia!


This only does a fraction of what the big search engines do, though. Still interesting to find out what is possible given those constraints.


I think the helpful thing is that if you don’t intend to keep all blogspam, you are in a relatively good position.


Very cool! And brave! Good on you.


What is Marginalia? Feels like a links database.


It's a grab-bag of projects geared toward internet discovery and information access.

It's got an internet search engine, a lo-fi wikipedia mirror, this thing <https://search.marginalia.nu/explore/random>, and more.


Actual announcement: https://memex.marginalia.nu/log/58-marginalia-open-source.gm...

My gitea instance is on a poor Raspberry Pi, probably won't survive HN front page :P


Thanks for choosing the GNU AGPLv3 license! That is the best license for services like search engines, especially if they intend to cement the license by having a diverse set of copyright holders that ensure the license doesn't change, rather than going the open core route and requiring copyright assignment via a CLA and selling proprietary versions.

What is your plan for sustaining the project in terms of contributions, funding and hardware?


Yeah, that was my analysis as well.

> What is your plan for sustaining the project in terms of contributions, funding and hardware?

Still much to figure out. The project is still fairly immature in general.

As far as hardware and funding goes, it's sort of sustainable now through not demanding very much in either, but more of both would no doubt be necessary to grow much farther. It's a bit claustrophobic not really having a proper CI machine or test instance for example.


You might want to consider forming a non-profit for non-technical ownership/governance of the project (that can receive donations etc), or alternatively moving the new open source project under a fiscal sponsor like SPI or Software Freedom Conservancy.

https://www.spi-inc.org/ https://sfconservancy.org/


> If you feel the need to complain about how something doesn't align with your personal philosophical convictions and fails to satisfy your criteria for ideological purity, please write a really long and angry essay about this topic, and send it to <kontakt@marginalia.nu>.

> Don't forget to press caps lock as you begin typing to save your pinky fingers, I wouldn't want to be responsible for nasty RSI.

i like your attitude. i think you are off to a good start.

that data file, could you host that on github (or gitlab maybe)?


Reminds me of this classic:

http://bash.org/?835030


Wow, love this new discovery


ALERT! WE HAVE A LIVE ONE!

listen, friend, be careful. you don't know what you are dealing with here. put down your device and back away slowly. this stuff is very potent. it's dangerous. if you keep reading, your productivity will go to zero. you may lose your job and your friends. your family may disown you.

if you continue, you are on your own. don't say i didn't warn you!


We changed the URL to that from https://git.marginalia.nu/marginalia/marginalia.nu. Thanks!


If you want, come mirror it on my gitea instance: gitea.slowb.ro. I'd be happy to take the load ^-^


Seems to be up and running fine. Gitea seems to be quite robust and reliable! I hope you have regular backups, I had a number of unfortunate IO corruptions with the PIs.

Thanks for the open sourcing


A brief "what is marginalia.nu" section would be awesome!


i found these pages:

about the search engine: https://memex.marginalia.nu/projects/edge/about.gmi

about the encyclopedia: https://encyclopedia.marginalia.nu/wiki-clean.html


I am not surprised you seem compelled to do the right thing


That's one small step for man, but one giant leap for mankind.



