Show HN: Search Engine for Blogs (blogsurf.io)
389 points by dbrereton on March 29, 2022 | 93 comments
Hey HN,

Blog discovery is a problem [0] due to the decentralized nature of online writing. Everyone writes on their own site or platform, and there’s no central place that brings everything together. Google results prioritize large media publications over blogs, so we need something else.

Blog Surf is an attempt to organize all of the great online writing done by individuals. I launched this project last year as a directory of personal blogs [1], but have now rebuilt it from scratch into a full-text search engine for blog posts.

You can search for blog posts, and filter by publish date and reading time. Blogs are manually reviewed before being added.

Posts are sorted by MarketRank [2], which is a measure of popularity across various online communities. Most projects that have attempted to organize blogs lack any way to measure the quality of a post, reducing their utility. With MarketRank, you can expect the top results for any query to be something you’d want to read.

The mental model for searching Blog Surf is “I want to see the best essays on X.”

There’s also a directory so you can browse blogs by category, if you want a throwback to the Yahoo days.

If you’re a blogger yourself, you can check out the rankings page to see how your blog compares to others.

If you want to play around with things, we have a search API, and the full post dataset is also available for download.

[0] https://news.ycombinator.com/item?id=28591880

[1] https://news.ycombinator.com/item?id=26506126

[2] https://dkb.io/post/market-rank




I dig this! Since you're full-text indexing blogs with an eye towards content discovery, can I tempt you towards building out a reverse link index function to enable me to browse by thread?

My use case: oftentimes blogs will respond to other blogs, linking to the original post in the process. The nature of linking means it's very easy to follow threads backwards in time, but given the original post it's often hard to discover the responses and ongoing conversation to follow things downstream. I'd like that downstream browsing to be easier.

My hope would be that such a tool could unlock higher-quality discourse. As a reader, this would let me hijack my natural tendency to follow comment threads, and redirect that attention towards slower-paced, more nuanced, more focused writing.

Edit: hmmm... though looking further, maybe this goes against your MarketRank philosophy.
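A rough sketch of the reverse link index idea, assuming each indexed post already comes with the list of URLs it links to. The example URLs and the `build_reverse_index` and `responses_to` helpers are hypothetical and not part of Blog Surf:

```python
from collections import defaultdict


def build_reverse_index(outbound_links):
    """Map each linked-to URL to the sorted list of posts that link to it.

    `outbound_links` is assumed to be {post_url: [urls the post links to]}.
    """
    reverse = defaultdict(set)
    for source, targets in outbound_links.items():
        for target in targets:
            if target != source:  # ignore self-links
                reverse[target].add(source)
    return {target: sorted(sources) for target, sources in reverse.items()}


def responses_to(url, reverse_index):
    """Return the posts that link to `url`, i.e. the downstream conversation."""
    return reverse_index.get(url, [])


# Hypothetical data: two posts respond to an original essay.
links = {
    "https://a.example/original": [],
    "https://b.example/reply": ["https://a.example/original"],
    "https://c.example/rebuttal": [
        "https://a.example/original",
        "https://b.example/reply",
    ],
}
index = build_reverse_index(links)
print(responses_to("https://a.example/original", index))
```

Following a thread downstream is then just a repeated lookup: start from the original post and look up each response in turn.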


I totally agree with you on this. Being able to follow the links easily in either direction would make it easier to fall into interesting rabbit holes.

> Edit: hmmm... though looking further, maybe this goes against your MarketRank philosophy.

It doesn't at all, but curious as to why you'd think that. We're not talking about using backlinks to rank pages after all, just as a discovery tool, which I think is great.


Ah, glad to hear it's not in conflict after all! Thanks for clarifying.

As to the source of my confusion... I think I was skimming and about to turn back to work and not thinking too deeply.

At the time I wrote that edit I had just gotten as far as "pagerank values links whereas marketrank values upvotes" (heavily paraphrased interpretation of 2.1 here: https://dkb.io/post/market-rank). My reaction was "oh, maybe that means links are bad... that's an interesting perspective but I don't have time to digest it right this moment. I'll put a pin in it and return to this article later to better understand, and hey I should add an edit to my comment to signal reading comprehension."

That last thought seems pretty ironic in retrospect.


I agree with everything you said, except the popularity ranking. The value of content should lie in the content itself; popularity is only a flawed measurement. Worse, popularity has a very strong positive feedback loop that contributes to the great polarization of opinions.


> The value of content should lie in the content itself; popularity is only a flawed measurement.

Popularity is certainly a flawed measurement, but it's hard to come up with a scalable way to determine quality that isn't flawed in some way.

Instead of being flawed by encouraging people to get tons of backlinks, this is flawed by encouraging people to do stuff that gets lots of upvotes.

Very open to more ideas on how to measure quality.


"a scalable way to determine quality" is the billion dollar question so I have no idea. I'd just use a plain old text index with no algorithmic ranking.


This is great! Tangentially related - and maybe an interesting way to view the network you have built up - I put together a quick d3.js force visualization of the "blogrolls" for ~300 linked blogs visible here: https://jacobwood27.github.io/035_blog_graph/


This is really cool, it would be awesome to make a graph of the blogosphere like this.

I'd love to chat about this more, will send you an email.


We, sphere.com, did this starting in 2006. After a year or so, we realized the only people using the service were looking to stroke their egos.

IceRocket and something else (I can’t remember the name) tried it at the same time and failed.

We pivoted, which ended up leading to some unspeakable horrors.

At any rate, good luck, hope it works better for you.


The blogosphere was certainly very different in 2006 than it is now - perhaps it's worth another go?


My understanding is that the early days of blogging were mostly people sharing personal things, and now that has moved to social media.

A lot of the online writing happening now is essays on some topic, or people trying to share notes on things they learn. And I think this type of writing is more conducive to a project like this.


You might want to consider using OpenSearch [1] to make it easier to add Blog Surf to browsers as a search engine that can be accessed from the location bar. I added it manually in Firefox but it would have been handy to just be able to right-click the search field and choose "Add a Keyword for this Search".

[1] https://developer.mozilla.org/en-US/docs/Web/OpenSearch
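For anyone curious what that involves, here is a minimal sketch that generates an OpenSearch description document in Python. The `?query=` URL shape is taken from the Blog Surf search links people have posted in this thread; the short name, description text, and the `opensearch_description` helper are my own assumptions, not what the site actually serves:

```python
import xml.etree.ElementTree as ET

OPENSEARCH_NS = "http://a9.com/-/spec/opensearch/1.1/"


def opensearch_description(short_name, description, template):
    """Build a minimal OpenSearch description document as an XML string."""
    # Serialize with the OpenSearch namespace as the default (no prefix).
    ET.register_namespace("", OPENSEARCH_NS)
    root = ET.Element(f"{{{OPENSEARCH_NS}}}OpenSearchDescription")
    ET.SubElement(root, f"{{{OPENSEARCH_NS}}}ShortName").text = short_name
    ET.SubElement(root, f"{{{OPENSEARCH_NS}}}Description").text = description
    # {searchTerms} is the placeholder the browser replaces with the query.
    ET.SubElement(
        root,
        f"{{{OPENSEARCH_NS}}}Url",
        {"type": "text/html", "template": template},
    )
    return ET.tostring(root, encoding="unicode")


xml_doc = opensearch_description(
    "Blog Surf",                                   # assumed short name
    "Search blog posts",                           # assumed description
    "https://blogsurf.io/?query={searchTerms}",
)
print(xml_doc)
```

The page then advertises the file with a `<link rel="search" type="application/opensearchdescription+xml" ...>` tag in its head so browsers can discover it.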


Thanks for the tip, just added an OpenSearch XML file.


Google has been giving me a very hard time for a while. It's time for SEO to die. We need stuff like this.


Really like it, good job! Nice color scheme, font choice, and elegant layout.

One little thing though: changing a search phrase or word and doing a new search, I notice the results do change, but there's no way to know if it really happened. Changing a Google search, the whole page flashes empty, that way I see/sense there's something new. In your case, a change is subtle, very subtle, too subtle. In one instance I had to look carefully to see the change in results.

Maybe adding a "you searched for X" is good enough, but I guess you can come up with a better way.


Thank you!

> changing a search phrase or word and doing a new search, I notice the results do change, but there's no way to know if it really happened. Changing a Google search, the whole page flashes empty, that way I see/sense there's something new. In your case, a change is subtle, very subtle, too subtle. In one instance I had to look carefully to see the change in results.

That is a good point. I did try to remove any loading indicators because I thought it would be smoother, but maybe it's a bit too smooth for people to realize their search went through. Will think more on how to fix this.


This is a super cool product! If there was some way on blogsurf to have RSS feeds per category I'm sure that would make my RSS feed curation much easier, and more random and interesting. E.g., subscribe to all the blogs labeled cybersecurity, Linux, etc. Or maybe this functionality is already present and I didn't see it (I saw the RSS feeds per blog).

Unrelated: it was interesting to see my blog listed on the site. Kind of surreal that someone finds my content useful and/or interesting. Very motivating and humbling.


This is cool, and the quality of content is great too, especially for finding the best-known blogs on a topic (and I feel like the quality of content is better than all the blog search engines I have seen).

But I don't feel like manual curation by one person is easily compatible with a search engine. To me the content of your website is more suited to a weekly newsletter or something like that. Because after trying a few searches ("getting a job in vc", "best computer chair", "learning erlang"), I'm not confident this gives better results than Google.

You've got a content size problem as you are manually curating, and this will lead to people not using your search as a default, and probably not using it as a search engine, but instead as a discovery system.

You can also try to get more blogs on your search engine, and create a community around it. If you want more, you can follow this newsletter [1] and you will probably get 5 new blogs per day.

Congrats on the job, this is very cool.

[1] https://hnblogs.substack.com


> But I don't feel like manual curation by one person is easily compatible with a search engine.

I see the manual curation as more of a temporary measure in the beginning. There are various ways blog detection can be automated and scaled, but manual curation for now gives me a better understanding of the data, and ensures I don't run into random edge cases.

> Because after trying a few searches ("getting a job in vc", "best computer chair", "learning erlang"), I'm not confident this gives better results than Google.

Right now it's more useful for very broad queries like "inflation" or "covid". The index is pretty small at the moment, but the more posts that get added to the index, the more specific queries we'll be able to find good results for.

> You've got a content size problem as you are manually curating, and this will lead to people not using your search as a default, and probably not using it as a search engine, but instead as a discovery system.

That's actually what I want! This is not a search engine to replace Google, it's a discovery tool for blog posts.

Thanks for all the feedback here, and will definitely check out the newsletter.


Many search-engine posts recently. When will someone make the Search Engine for Search Engines?


It's called a metasearch engine [0]. There was a project launched 2 years ago called Runnaroo [1] that kind of did this, but they aren't online anymore.

I think most of us in this space are willing to collaborate so something interesting could happen.

[0] https://en.wikipedia.org/wiki/Metasearch_engine

[1] https://news.ycombinator.com/item?id=23771131


I do think some form of collaboration between small search engines would be very beneficial. I've been thinking about how to make that happen. So far I've added a public API to my search engine, and published some data.

Not sure what is a good way of creating a space for collaboration...


I created a metasearch for myself based on the idea of "continuation searches". One obvious point of collaboration between search engines could be a uniform API and SERP format. Currently, there are slight variations between search engines in terms of the submission URL syntax/parameters and the HTML used to display results, not to mention HTTP method, limits on number of results and sometimes additional, optional URL parameters. The differences are generally small^1 and this makes it relatively easy to create a personal metasearch. However it could be much less cumbersome if these differences were eliminated.

1. Exceptions are, e.g., ones that require two HTTP requests per query, such as Gigablast or ones that have strange limitations, e.g., Startpage, which has become unusable for me without JavaScript. Contacting their "customer support" yielded no response.

Even better would be if search engines all shared their indexes and made these available for download. This would facilitate people building new search engines without needing to have their own index. In theory it would also bring a stop to the problem of people who submit large numbers of queries since all the bulk data they need would be available for download. www indexes that comprise public information could be freely shared as public data.
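To illustrate the uniform-format point, here is a small sketch of the adapter approach a personal metasearch ends up needing today. Both engine response shapes, the `Result` type, and the merge strategy are hypothetical; they stand in for the small per-engine differences described above:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    """A common result shape that every engine adapter converts into."""
    title: str
    url: str
    engine: str


def from_engine_a(payload):
    # Hypothetical engine A returns {"hits": [{"title": ..., "url": ...}]}.
    return [Result(h["title"], h["url"], "a") for h in payload["hits"]]


def from_engine_b(payload):
    # Hypothetical engine B returns {"results": [{"name": ..., "link": ...}]}.
    return [Result(r["name"], r["link"], "b") for r in payload["results"]]


def merge(*result_lists):
    """Interleave results from each engine, keeping the first copy of a URL."""
    seen, merged = set(), []
    longest = max((len(rs) for rs in result_lists), default=0)
    for i in range(longest):
        for results in result_lists:
            if i < len(results) and results[i].url not in seen:
                seen.add(results[i].url)
                merged.append(results[i])
    return merged


a = from_engine_a({"hits": [
    {"title": "Post 1", "url": "https://x.example/1"},
    {"title": "Post 2", "url": "https://x.example/2"},
]})
b = from_engine_b({"results": [
    {"name": "Post 1 again", "link": "https://x.example/1"},
    {"name": "Post 3", "link": "https://y.example/3"},
]})
combined = merge(a, b)
print([r.url for r in combined])
```

With a shared API and result format, the per-engine adapter functions would disappear and only the merge step would remain.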


An index is, lowballing it, hundreds of gigabytes of dense binary soup; probably in some custom format specific to that search engine (sometimes there's some form of hash table going on, sometimes a B-tree), almost certainly with its own idiosyncrasies concerning keyword extraction. I think reconciling API differences is probably a lot easier than making use of index data.


I still quite like the idea of having a number of independent search engines each indexing their own specialist subjects, and one or more federated search front-ends which can pull these together.

Doing it with APIs is a little tricky to make work in a usable way though. There have been various attempts at standardised APIs, e.g. OpenSearch[0], and metasearch engines like searX[1] have what are essentially pluggable scrapers, but there are still fundamental issues like getting different results at different times and having different ranking mechanisms.

Integrating at the index level could make a more usable search, but there are lots of other issues with this approach, e.g. those experienced with Apache Solr's Cross Data Centre Replication[2]. And yes, the volumes of data may also be an issue, given a search index will typically be slightly larger than the compressed data size, e.g. the 16M wikipedia docs are approx 32Gb compressed and approx 40.75Gb in a search index.

[0] https://github.com/dewitt/opensearch , unrelated to Amazon's Elasticsearch fork

[1] https://github.com/searx/searx

[2] https://solr.apache.org/guide/8_11/cross-data-center-replica...


No doubt.


I searched using a keyword (kava), and got a list of random blog posts with absolutely nothing to do with kava. https://blogsurf.io/?query=kava


The simple and unfortunate explanation is that the index is just not that big right now (only 900 blogs).

Working on increasing it significantly, but will take some time. Try it again in a month and you may find it more useful. Right now it's mostly filled with tech, business, and politics.


Thanks. I figured that, but perhaps a message saying that no results were found would be more useful than showing random blog posts. Or maybe show that message, along with "... here are some interesting posts from the past month..."


This is dope! Wondering how you determine the number of blog points a certain post gets? Is there a blog quality score that's programmatically determined? How so?

It seems you are planning to introduce automation instead of manually reviewing every submission, would be interesting if you could crawl via links from blog to blog.

I also feel you could likely just type in a url and postfix /blog to it to get a niche blog on some topic. Not sure if that's too simple but seems like it might work as a v0 for your automation.


This is really awesome! I love that you're curating the sites included, and it shows in the quality of the results. The world needs more specialized searches like this, and you've done a brilliant job with implementation. I also really love that you have a directory.

I'm going to play with the API, and that's awesome you've made that available.

[Disclaimer: also working on a new search engine, and would love to include results from this!]


This is great. What is the policy on accepting blogs to be indexed?


Interesting.

How do you figure out which are blogs and which aren't?


This is definitely very cool as I've been looking for something like this since technorati (which was originally a blog search engine).

Would love to hear details about how you created the database, the infrastructure, etc if it's not a trade secret. Kudos on the launch!


> This is definitely very cool as I've been looking for something like this since technorati (which was originally a blog search engine).

Technorati was one of the inspirations here so that's great to hear.

> Would love to hear details about how you created the database, the infrastructure, etc if it's not a trade secret. Kudos on the launch!

Sure, it's actually fairly simple! The search backend itself is running on Typesense [0], which was very quick and easy to set up.

Due to the way ranking is calculated, I can actually avoid doing any real web crawling (though, I may add that in soon to help increase the index size). Ranking is based on submission to online communities, so all I really need is those submissions.

Using the Reddit, HN and Twitter APIs, I search for any submissions related to any blogs in the database, then those submissions give me the post URLs.

Once I have the post URLs, I just need to request those specific URLs to get the post data.

Then there are scripts for things like content extraction, inflation calculation, currency conversion, etc.

All of those scripts are in Python.

The frontend is a simple React app built with Next. All pages are statically generated.

Let me know if there are any more questions!

[0] https://typesense.org/
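As a rough illustration of the "submissions give me the post URLs" step, here is a sketch for the HN side only. It assumes the public HN Algolia search API as the source; the comment above does not say which HN API is used. The `hn_search_url` and `post_urls` helpers and the point values in the sample are made up for the example:

```python
from urllib.parse import urlencode, urlparse

HN_SEARCH = "https://hn.algolia.com/api/v1/search"


def hn_search_url(blog_domain):
    """Build a search URL for stories whose submitted URL mentions the domain."""
    params = {
        "query": blog_domain,
        "restrictSearchableAttributes": "url",
        "tags": "story",
    }
    return f"{HN_SEARCH}?{urlencode(params)}"


def post_urls(payload, blog_domain):
    """Collect {post_url: total points} from a search response payload."""
    totals = {}
    for hit in payload.get("hits", []):
        url = hit.get("url")
        if not url:
            continue  # text-only submissions have no URL
        host = urlparse(url).netloc.lower()
        # Keep only submissions that really point at the blog's domain.
        if host == blog_domain or host.endswith("." + blog_domain):
            totals[url] = totals.get(url, 0) + (hit.get("points") or 0)
    return totals


# Hand-written sample shaped like a search response; no network call is made.
sample = {"hits": [
    {"url": "https://dkb.io/post/market-rank", "points": 10},
    {"url": "https://dkb.io/post/market-rank", "points": 5},
    {"url": "https://other.example/post", "points": 99},
    {"url": None, "points": 3},
]}
print(hn_search_url("dkb.io"))
print(post_urls(sample, "dkb.io"))
```

The resulting URLs are the ones that would then be requested individually for content extraction, which is why no general crawl is needed.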


Any plans on open-sourcing the code? I'm not sure if your intention is to build a business using it (or, if you were, using AGPLv3 might help prevent third-parties from unfairly competing with you), but I'm sure a number of people would be interested in trying to run this on their own hardware, building their own personal index, hacking on it to add features they find interesting for themselves, or otherwise just learning something by taking a look under the hood (I'm probably in this category myself).


This is not a business, and I would like to open source it, but it would probably be better for everyone if I wait until I clean up my garbage code, which will take some time.


I tried searching "Will Smith" and was expecting hundreds of blog posts about his Oscar thing, but all the results are about programming jobs and Joe Biden. I even changed the date range to past week but still the same.

Any idea why?


Only 900 blogs are currently indexed, and they're mostly tech, business or politics, so I wouldn't be surprised if none of them have written about the Will Smith situation.

I am working on increasing the number of blogs significantly, but please bear with my modest index in the meantime.


This is neat, but it's quietly doing some kind of fuzzy matching with my query.

https://blogsurf.io/?query=furries

I actually have no idea what it's really searching for. Only one of the first ten results contains my query. Compare to HN's search which highlights the matched words so it's at least clear when it's going off-script.

https://hn.algolia.com/?dateRange=all&page=0&prefix=true&que...

Quotes provide an exact match here, but not on your search engine.


Yeah I think the bigger issue here is not highlighting the matched text, which makes it look like it's doing something totally random.

There is some typo tolerance, so it's hard to tell exactly what is matching on those pages.

I'm working on adding query operators so you can specify things like exact matching, and in general working on improving the results.

This type of feedback is very helpful, so thank you!


Great to have full-text search option!

I used to browse latest posts using https://blogsurf.io/posts link, is that functionality not available anymore?


Oops, yeah I meant to re-add that. Will get a recent posts page up soon.


I love the idea, but I couldn’t find any results that didn’t look mostly random. I thought I’d look for posts about shooting or editing video. Couldn’t find anything close, no matter how I ordered my query.


Ah yeah, there's probably not any blogs that talk about video editing. The index is fairly small right now and does better for tech/business/politics queries at the moment. Will work on increasing the index.


The tags are pretty limited and have some duplicate entries. Submissions also require a tag for them to be posted, so if they don't fit what's already there, a wrong one will have to be supplied.


Yeah tagging is definitely a weak point as the current focus is on search. I just removed the tag requirement.

If you're submitting a blog and we don't already have a tag that fits, you can add it to the "Notes" section.


Thank you for making that change so quickly


I love the idea, and I cannot stress enough how many times I have regretted not finding undistracting results from knowledgeable people who put time and energy into producing high-quality material instead of dull, search-engine-optimized content! That said, a mere full-text search lacks an understanding of context. I tried getting some results about gardening and plants, but the returned results were about anything from bitcoin to power plants, or even DNA, and not vegetation.


That's an awesome implementation and it works really neatly. I have been thinking of adding this capability to https://refined.blog/ . Also, if you need tagged blog sites, you can use our blog list. I also previously posted it on HN, so there are some good blogs in here ( https://news.ycombinator.com/item?id=27973836 )


Thanks! And that's awesome, I could definitely use more blogs, will check it out.


The search algorithm seems to be extremely poor. I searched for the exact blog post title of some very good blog posts which in other search engines end up in the top 5 if you search for their topic and on blogsurf it didn’t look like it appeared in the results at all. That’s very very weak, especially the first 10 results had nothing to do with the topic I searched for. It was just super high profile websites which had remotely mentioned a few of the keywords.


I think the problem might be that the blogs aren't added yet. I also couldn't find some quite popular blogs, but I submitted them, so hopefully this will change in the future.


This is great. As a frequent Google user, I breathed a sigh of relief seeing ad-free search results from individuals.

Curious - how do you know whether a site is a blog versus something else?


He’s shared that in the post - manual review of all blogs


Potentially quite useful. But I ran into one snag. I searched on my friend and frequent blogger Ben Nadel. But at the top all the posts were about Angular.

What I wanted were all his posts that weren't about Angular. So I tried adding -angular which works in Google. It pulled up one non-angular post and all the rest were the original ones that are there when you load the page. Add that one feature and I will probably use it a lot.


Currently don't support any query operators, but yeah it would be very useful. Will add that functionality soon.
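For reference, a minimal sketch of what Google-style operators could look like: quoted phrases for exact matches and a leading minus for exclusion. This is my own illustration with naive substring matching, not how Blog Surf or its search backend handles queries, and the post titles are invented:

```python
import re

# Either a (possibly negated) "quoted phrase" or a (possibly negated) word.
TOKEN = re.compile(r'(-?)"([^"]+)"|(-?)(\S+)')


def parse_query(query):
    """Split a query into required and excluded terms (lowercased).

    Supports "quoted phrases" and a leading minus sign for exclusion.
    """
    required, excluded = [], []
    for m in TOKEN.finditer(query):
        minus = m.group(1) or m.group(3)
        term = (m.group(2) or m.group(4)).lower()
        (excluded if minus else required).append(term)
    return required, excluded


def matches(text, required, excluded):
    """True if every required term is in the text and no excluded term is."""
    haystack = text.lower()
    return (all(t in haystack for t in required)
            and not any(t in haystack for t in excluded))


required, excluded = parse_query('"ben nadel" -angular')
posts = [
    "Ben Nadel on Angular directives",
    "Ben Nadel on ColdFusion closures",
]
print([p for p in posts if matches(p, required, excluded)])
```

A real engine would apply the same parsed terms as index filters rather than scanning text, but the parsing step stays the same.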


Cool idea, I love search engines with content made by real people. I’m not sure how many of these you have, but you might be able to pull some more blogs from https://bloggingfordevs.com/trends/ or https://blogdb.org/blogs


Some good sources, will check them out!


This is awesome! I see some blog posts of mine already on here but using an outdated URL.

Are there any plans to check for redirects and update the URL or to recrawl?


Google used to have a good blog search. Biggest problem was SEO spam blogs - but even these used to fall down the rankings.

Sadly it died of Google deprecating an API. https://en.wikipedia.org/wiki/Google_Blog_Search


For business news articles: https://yup.is


Dmitri showed this to me a couple of weeks ago, and I was super-impressed, enough so that even though he sent me a note about it at the end of the night, I stayed up to respond to him. This makes me feel like the spirit of Technorati has a chance of making a comeback someday.


I love this. I am always on the lookout for material written by individuals; but, it's surprisingly hard on the modern web.

Tbh I'll probably use the random bit more than search but definitely going to keep checking back to pad my RSS feeds with interesting content.


> Tbh I'll probably use the random bit more than search

That's interesting to hear, and fits well with the goals of the site. I want it to be more of a "discovery engine" than a "search engine". Search is one path to discovery, random posts are another, there are probably more.

One thing I'm thinking of adding is the ability to easily see the blog posts that any given post links to. If you see an interesting post, you could pull up everything that may be related.

> definitely going to keep checking back to pad my RSS feeds with interesting content.

Sadly not every blog has RSS, and many RSS feeds are incomplete. Another thing I would like to build is auto-generated RSS feeds for all blogs, which would also make it easy for people to programmatically parse any blog and do interesting things.


What year is it?

Don't hate blogs and happy for resurgence, but repeating an uphill battle with indexing like it's 2007.

Also, random interesting posts on front page are like from 2009, 2011, 2015...... What? That's the freshest more relevant content?


> Also, random interesting posts on front page are like from 2009, 2011, 2015...... What? That's the freshest more relevant content?

Why would you expect a section titled "Random Interesting Posts" to have the freshest more relevant(?) content?


Right now the "random interesting posts" are a random selection from the top 1000 of all time.

However, if you want fresher content, you can use the date range selector and set it to "Past Week" for the best posts of the week.


Dunno, I typically find fresh and interesting mutually exclusive.


Freshest as in more recent and not outdated, relevant content. An index of blogs being built now doesn't need to be sharing old essays from 2009, if they're even indexed for whatever weird reason. Newer, updated thoughts (that may reference prior materials, sure), fresher. Instead of the same old posts being shared over and over from forever ago (esp. in tech, where things change and evolve so much).


Yeah, reminds me of working at Technorati in 2005.


Lovely! I agree with others that this is promising. Thanks for sharing.

One point of feedback: Searching for "C#" seems to bring up C articles and no C# ones, so I suspect perhaps the "#" isn't being included.
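A small sketch of how that could happen. I don't know how the site's backend actually tokenizes text, so both tokenizers below are illustrative assumptions: the first drops symbols the way a simple alphanumeric splitter would, the second keeps trailing "#" and "+" so that language names survive:

```python
import re

# Letters/digits, optionally followed by "#" or "+" characters (C#, F#, C++).
WORD = re.compile(r"[A-Za-z0-9]+[#+]*")


def naive_tokenize(text):
    """Split on anything that is not a letter or digit; symbols are lost."""
    return [t.lower() for t in re.findall(r"[A-Za-z0-9]+", text)]


def tokenize(text):
    """Keep trailing '#' and '+' so 'C#' and 'C++' stay distinct from 'C'."""
    return [t.lower() for t in WORD.findall(text)]


print(naive_tokenize("Learning C# and C++"))
print(tokenize("Learning C# and C++"))
```

With the naive version, "C#", "C++" and "C" all collapse to the same token, which would explain C articles showing up for a "C#" query.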


I encourage all of you to submit the blogs you are following. Only one of the ten blogs I'm following was supported. Even quite well-known ones like oldnewthings or lemir's blog were missing.


Bookmarked. Going to use it.

Now if someone would just make a better Reddit search.

And then another one for high value properties like Mayo Clinic, Wikipedia, GitHub etc, and I will not need to use Google anymore.


I tried searching for "saashub", and none of the results had a single mention of that term. Do you know what the reason is? Some stemming?


This is a great idea! It's very tech centric, though (at least judging from your directory).


Would you mind sharing some stats on the index? Are you populating it with manual curation?


There are currently 900 blogs in the index. Every blog is manually reviewed, so this number will grow slowly and steadily over time, until I maybe automate things.

https://blogsurf.io/about


I would advise against any automation for the foreseeable future of your project. A few volunteers to help with manual review shouldn't be so hard to find.


Do not automate, not in this early stage, yet.


Excellent idea, I was thinking of creating something similar. My new homepage, for sure.


I love seeing new search engines and this appears to be very well done. The web needs better visibility into its blogs. Bravo!

I wonder about the use of MarketRank. For instance, search for "COVID" and your top hit will be from Alex Berenson, a well-known purveyor of outright COVID misinformation. Is this post "interesting"? Yup. Is it trustworthy? Absolutely not.


Love the idea, but I can't search for exact matches using quotes :(


query operators coming soon


This is cool! How are you thinking about attracting users (besides HN)?


Yes! Death to SEO! Really great tool; thank you for sharing.


I love the flaming comic sans Directory header, lol


Great idea mate. Keep up the good work.


Where can I see an index of all posts?


I love this, it would be cool if we could submit blogs to be indexed. Of course you can review them first to avoid spam.


There's a link to exactly that in the header of the site: https://blogsurf.io/submit.


Very excited to see this! I noticed some of my favourite bloggers don't appear to be indexed for whatever reason: Josh Comeau, Julia Evans, Amelia Wattenberger. Any idea why these aren't indexed and if you plan to add them? I wonder if you could get a list of some of the most popular blogs on HN (perhaps the maintainer of upvotetracker can help) to add to the index.


this is very cool, thank you



