Well done and thank you. Not just for the work, but for the attitude
and courage of your convictions that come across clearly in your
comments here on HN.
Search is presently a neighbourhood of technology that's fallen into a
downtrodden ghetto. Maybe you've sowed some seeds that can grow
between the cracks. But I think there are gangs and pushers roaming
the streets that would be happy for it to remain derelict.
Thinking about the dynamics of open source, with respect to the many
different interpretations of search that create tension within such a
project: It occurs to me that whenever a somewhat mature project
enters the ecosystem a robust strategy is to immediately clone-fork
multiple identical copies of a project. This is like a "backup
strategy". From there, form several loosely coordinated project groups
that can spread risk, share threat intelligence, but each work with
different models of crawling, filtering, funding, presentation and so
on. Keep the original as an upstream parent. One of the children may
survive.
> I'm currently looking for hosting for a large term frequency data file that is necessary for several of the search engine's core functions. I really don't have the bandwidth to serve it myself. It's only a couple of hundred megabytes so it'll probably be solvable somehow.
I have a bit of (rented) space available. Most of it on Uberspace.
What are the expected traffic requirements?
Contact me if you like. Will update info in profile immediately.
Wouldn't this be a good match for BitTorrent distribution? The data will be updated regularly, and then all participants will be online and able to share their bandwidth while they download the new frequencies.
I'm not sure if it is appropriate for this use case, but wouldn't it be possible to host it on Backblaze and distribute it through Cloudflare for free traffic? Or even just a torrent.
This is a great gift to the community, but I really wish it would be hosted where the community is active. It seems like a lot of people will miss out because they won't be able to discover this.
I find a lot of projects because I follow people on github that star or fork really neat projects. I download projects from bitbucket, gitlab, and other self-hosted instances.
I don't know that I've ever contributed a PR back as it's just a more friction-prone process. However, I've opened hundreds of PRs and Issues on github because it's already a part of my daily process.
Thinking that a community lives in a git provider is a bad idea. It shouldn't matter who hosts a git instance, and I would argue it's better to fight for diversity here.
It might be better to fight for diversity, but I was pointing out the actuality of it all. Network effects are real, and this isn't the project I want to see die because of it.
So, github is the daily workflow for many people. You're in it, you're around it, you're using it. It's kind of the same as out-of-sight, out-of-mind.
If I want to, I can sign up for yet another site and do git-things with this codebase, but I have to change my daily process to include a one-off change to a new location. There are real costs with this mental flow change.
I would guess that the chan board members and other asshats raided their projects in the past, because they regularly attack / harass these kinds of projects when they're not fully open source.
And let's put it this way: the moderation tools at GitHub are a bad joke. Can't even disallow abusive language, so they're 100% ineffective.
Comment threads everywhere (including on commits and lines of code), these comment threads then having emoji reactions, and the inability to disable any of that (same for the pull request feature, which also leads to more comment threads) can get somewhat annoying when a group has decided to target you for arbitrary reasons, and such potential 'needing to see harassment' being deeply embedded in one's code workflow (i.e. in your own repository) can get tedious.
Not everything needs to be turned into an embedded discussion forum.
Yeah, this is largely my objection to GitHub. I feel the design of GH encourages a performative communication style, especially from outsiders with no actual interest in furthering the project.
I also object to how the platform encourages low-effort drive-by participation from bots and the like.
Funny you'd mention this. I installed Audacity a few hours ago and much to my dismay the application informed me upon starting it for the first time that update checks were enabled by default - in other words, Audacity had already called home and I couldn't have done anything to stop it from doing so.
Secondly, the Sneedacity incident was abuse targeted at a fork of the Audacity project, not at the original 'hostile takeover' project itself. The fork was intended to remove said abuse you feel is deserving of more abuse.
Reactions are not really about fostering “social” aspects of code review and issue triage as much as they are to prevent the need for a build up of low effort “+1” style comments. I view them as a community management tool that OSS maintainers would be much worse off without.
Distributed version control is about making code and patches easier to share, if anything the point is to reduce the socialization. It's a more convenient and development-centric form of mailing lists, not a more social version.
Yeah, that's a good idea. I'm looking at a bunch of ideas for reducing the friction to contributing, still a bit of work that needs doing in that area.
If it's something that compresses really well (eg text data in a database), then live compression filesystems (eg ZFS, likely others) could potentially help make that workable.
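A quick way to check whether a given file falls into the "compresses really well" category before relying on filesystem-level compression is to measure a ratio on a sample. Below is a minimal sketch using Python's standard `zlib` as a stand-in; ZFS would typically use a different codec (lz4, gzip or zstd), so the number is only a rough indication. The function name `compression_ratio` and the sample data are made up for illustration and have nothing to do with the project's actual file format.

```python
import zlib


def compression_ratio(data: bytes, level: int = 6) -> float:
    """Return original size divided by compressed size (higher = more compressible)."""
    if not data:
        return 1.0
    return len(data) / len(zlib.compress(data, level))


# Hypothetical sample: repetitive tab-separated text rows, purely illustrative.
sample = b"".join(b"term%d\t%d\n" % (i % 50, i % 7) for i in range(10_000))

print(f"ratio: {compression_ratio(sample):.1f}x")
```

Repetitive text like this should come out well above 1x, while data that is already compressed will sit near 1x, in which case a compressing filesystem is unlikely to help much.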
Probably not going to replace Google, it's at best aiming to pick up the slack where it's struggling. But dunno, maybe down the line it will have grown into a real open source alternative.
Haven't there been estimates that running a globally indexed search engine product costs a billion dollars per year? As always, thank you for marginalia!
Thanks for choosing the GNU AGPLv3 license! That is the best license for services like search engines, especially if they intend to cement the license by having a diverse set of copyright holders that ensure the license doesn't change, rather than going the open core route and requiring copyright assignment via a CLA and selling proprietary versions.
What is your plan for sustaining the project in terms of contributions, funding and hardware?
> What is your plan for sustaining the project in terms of contributions, funding and hardware?
Still much to figure out. The project is still fairly immature in general.
As far as hardware and funding goes, it's sort of sustainable now through not demanding very much in either, but more of both would no doubt be necessary to grow much farther. It's a bit claustrophobic not really having a proper CI machine or test instance for example.
You might want to consider forming a non-profit for non-technical ownership/governance of the project (that can receive donations etc), or alternatively moving the new open source project under a fiscal sponsor like SPI or Software Freedom Conservancy.
If you feel the need to complain about how something doesn't align with your personal philosophical convictions and fails to satisfy your criteria for ideological purity, please write a really long and angry essay about this topic, and send it to <kontakt@marginalia.nu>.
Don't forget to press caps lock as you begin typing to save your pinky fingers, I wouldn't want to be responsible for nasty RSI.
i like your attitude. i think you are off to a good start.
that data file, could you host that on github (or gitlab maybe)?
listen, friend, be careful. you don't know what you are dealing with here.
put down your device and back away slowly. this stuff is very potent. it's dangerous. if you keep reading, your productivity will go to zero. you may lose your job and your friends. your family may disown you.
if you continue, you are on your own. don't say i didn't warn you!
Seems to be up and running fine. Gitea seems to be quite robust and reliable! I hope you have regular backups, I had a number of unfortunate IO corruptions with the PIs.