
Show HN: A Simple Search Engine - shakna
https://kuurio.com
======
MR4D
You have a bug somewhere. Need to convert to lowercase for index and searches.

For instance:

- searching "houston" gives me a page full of results;

- searching "Houston" gives me zero results;

- searching "HOUSTON" gives me zero results.

Also, just a recommendation - if you are going to do a "Show HN", then you
should have a good writeup about the tech used somewhere. We are a curious
bunch, and love to not only see what others are doing, but learn a bit about
it.

Finally, your results are fast, so that's a good start.

~~~
shakna
Well. Forgetting lowercase but doing the rest of the search parsing was a bit
of a mistake. That should now be fixed, though it might take a little while to
propagate through the caches.
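
As a rough illustration of the fix (a minimal sketch, not the site's actual code; `normalize`, `build_index`, `search`, and the sample documents are all hypothetical), the idea is to push both the indexed text and the query through the same normalization step:

```python
from collections import defaultdict


def normalize(text):
    # casefold() is a stricter lower() that also handles non-ASCII letters.
    return text.casefold().split()


def build_index(docs):
    """Map each normalized token to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in normalize(text):
            index[token].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents that contain every token of the query."""
    tokens = normalize(query)
    if not tokens:
        return set()
    # .get() avoids creating empty entries in the defaultdict on a miss.
    results = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        results &= index.get(token, set())
    return results


# Hypothetical sample data, only for illustration.
docs = {1: "Houston traffic report", 2: "Weather in Austin"}
index = build_index(docs)
```

Because the same `normalize` runs on both sides, "houston", "Houston", and "HOUSTON" all resolve to the same index entry.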

---

There's really, really, not much to the tech behind it. Nothing surprising,
and nothing really special.

The website itself is 446 lines of a Python bottle app, behind an aggressive
nginx cache.

For example, the homepage is:

    
    
        @bottle.route("/")
        @bottle.view('home')
        def home():
            # Get some random articles
            random_articles = []
            for url in list(set(get_recent_urls(N=9))):
                random_articles.append(record(url))
    
            random.shuffle(random_articles)
            random_articles = random_articles[:3]
    
            # Get the top tags
            tags = top_keywords(N=10)
    
            count = url_count()
            update = datetime.now()
    
            return {"articles": random_articles, "tags": tags, "count": count, "update": update, "tag_kind": 'Trending'}
    

There's no... "magic" there.

The link-fetcher is slightly larger at around 700 lines of Python, but most of
it is taking edge-cases into account. Like finding where the timedata is
hidden on a BBC page (comes as an epoch), or on CNN's lite site where it
doesn't have a tag (and has to be manually parsed), etc.

The biggest single part, about half of the fetcher, is a series of fallbacks
that keep trying until one finds something, before it finally gives up.
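
A fallback chain like that could look roughly like the sketch below. This is not the real fetcher: the extractor functions, the `data-epoch` attribute, and the regexes are hypothetical stand-ins for the per-site edge cases, and real HTML would want a proper parser rather than regexes.

```python
import re
from datetime import datetime, timezone


def from_meta_tag(html):
    # Hypothetical extractor: ISO 8601 timestamp in a published_time meta tag.
    m = re.search(r'property="article:published_time"\s+content="([^"]+)"', html)
    return datetime.fromisoformat(m.group(1)) if m else None


def from_epoch_attribute(html):
    # Hypothetical extractor: Unix epoch seconds stored in a data attribute.
    m = re.search(r'data-epoch="(\d+)"', html)
    return datetime.fromtimestamp(int(m.group(1)), tz=timezone.utc) if m else None


# Tried in order; the first extractor that yields a value wins.
EXTRACTORS = [from_meta_tag, from_epoch_attribute]


def published_at(html):
    """Return the first publish time any extractor finds, else None."""
    for extract in EXTRACTORS:
        try:
            found = extract(html)
        except (ValueError, OverflowError, OSError):
            # A malformed value just means this strategy failed; try the next.
            continue
        if found is not None:
            return found
    return None  # every fallback gave up
```

Adding support for another awkward site then means appending one more function to `EXTRACTORS` rather than touching the control flow.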

~~~
MR4D
Thanks for sharing!

How do you retrieve paywalled links (or do you skip those)?

~~~
shakna
I've tried to somewhat curate where the links are retrieved from, but the
fetcher isn't javascript aware.

In cases such as paywalls or heavy JS usage, you'll probably see an empty
summary.

------
jszymborski
The cool: Super lightweight website, and the world certainly needs alternative
search engines that aren't owned by ad companies, so kudos.

The critical: It'll be very interesting to see this website when there are far
more than 3K urls in the index.

btw, it's working just fine for me on firefox.

Wishing you the very best!

~~~
shakna
The index will be interesting as it gets bigger.

I've put a preference on timedata. So any three sites with a recent article
can hit the frontpage. Because I want hackaday appearing as much as I want the
BBC, etc.

Sites that I can't grab the publishing time for will still appear, but they
get ranked down and you'll need to find them by searching.
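
In sketch form (assuming each article is a dict with a `published` field that is `None` when no time could be found; the `rank` function and field names are hypothetical, not the site's code), that ordering is something like:

```python
from datetime import datetime


def rank(articles):
    """Dated articles first, newest first; undated ones sink to the bottom."""
    dated = [a for a in articles if a["published"] is not None]
    undated = [a for a in articles if a["published"] is None]
    dated.sort(key=lambda a: a["published"], reverse=True)
    return dated + undated


# Hypothetical sample data, only for illustration.
articles = [
    {"url": "https://example.com/old", "published": datetime(2020, 3, 1)},
    {"url": "https://example.com/undated", "published": None},
    {"url": "https://example.com/new", "published": datetime(2020, 4, 1)},
]
ranked = rank(articles)
```

Undated articles stay in the result, so they remain searchable; they just never outrank anything with a known publishing time.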

------
BlackLotus89
Like everyone already said, it's not usable in Firefox. When testing in a
Chromium-based browser I get no results. Tried test, foo, image, covid19 and pulse.

So I think it's probably broken...

~~~
shakna
You won't see a difference in results between browsers. There's 0 Javascript
on the site.

If you're seeing something funky with the CSS in Firefox... Then you probably
have a dark mode addon. And the addons and Chrome disagree with how dark mode
should work. Try disabling it.

I'm seeing 40+ responses for each of those words when searching, across
Firefox, Chrome, Firefox Android, Chrome Android, and even Lynx.

~~~
gelatocar
The problem in firefox is from this bit of code:

    
    
        /*
            Darkmode
        */
        @media (prefers-color-scheme: dark) {
            html {
              background-color: white;
              filter: invert(100%);
            }
    
            input {
              filter: invert(20%);
            }
        }
    

It seems in firefox the invert doesn't apply to the background color of the
html element so you end up with yellow on white.

~~~
shakna
Yep. However Chrome doesn't invert the background unless I specify it.

And as far as I'm aware, Firefox doesn't actually have a dark mode preference
outside of Preview as of yet. There are plenty of addons - but they all have
wildly varying behaviour.

~~~
detaro
Firefox has had support for it for several versions.

~~~
shakna
You're right. My bad. Tripped up by privacy.resistFingerprinting forcing it to
`light`.

Edit: And I believe dark mode should now be working across all the major
browsers.

------
sangupta
Simple searches (non-cached) like "hello world" or "i love you" throw a bad
gateway. It took 10+ seconds for "hello world" when it worked.

For my name "sandeep" it threw 0 results, however a fine line says 2870 URLs
in Index. I assume 2870 is not the entire corpus of the index - if yes, then
search is extremely slow.

~~~
shakna
That is the entire current corpus.

It sounds like you were hitting the site whilst it was under a bit of heavy
load. From my logs, about the time you posted here on HN, someone was tossing
`siege` at the site, and it is not a heavyweight server.

Running certain searches does seem to be able to trigger a pathological
response, however, so I'll need to look into that a bit more. Likely to do
with some of the nltk stuff it uses when it tries to handle searching
summaries.

~~~
sangupta
Will try it again tomorrow. I could not find any information on the site as to
what will make this search engine different from what we already have.

Also, the corpus is now 3162, roughly 250 links in the last 2 hours - which is
way too slow for a real world scenario. I do like the page, though, for its
simplicity and the experiment.

~~~
shakna
I don't imagine that I, as a lone person, can actually build an engine to ever
compete with Google or Bing, etc.

It is not link crawling the entire web. Instead it is trying to find
information that is current. You'll find relevant and current information from
the BBC, CNN, and NPR for example. As well as stuff from places like LowTech
magazine, the CCC, and FSF, etc.

I haven't really put any information on the page at all about it. But mostly
it scratches my own itch. But I wouldn't see it becoming a competitor if
you're looking for "anything & everything", ever.

(Though that 2 hours is actually you watching the nginx cache expire. The
database updates hourly.)

~~~
sangupta
Got it. So it sources data from a whitelisted set of sites and updates every
hour. If the curated list can scratch my itch, I would love to come back.

You mention you used some sort of NLP (mention of nltk before) - is it for
summaries or reducing noise, or for bringing context to search terms?

~~~
shakna
Yeah, pretty much.

I'm using nltk for generating most (not all) of the summaries, and for
building most of the tags that get attached to each article. It's also being
used when searching the text of a summary.

------
nickreese
Page doesn't appear to be usable for me. I get a white screen with yellow text
and lines in Firefox.

~~~
shakna
I only use Firefox. So it's probably not just that in itself.

It sounds like you're using something that enables a dark mode preference
to Firefox - and currently some of the Firefox addons and Chrome disagree
about some of the finer details of how that should process.

Try it without asking for dark mode.

~~~
shakna
I think this should now be resolved.

