
What I'm thinking is a little far out but it came up recently on a project where I'm using Postgres FTS (it's slow but I guess I might as well link it for now[0] -- please do not HN hug it).

Basically, I read on the internet (and was surprised to learn) that setweight can be applied to separate tsvectors, and that those tsvectors can then be concatenated with || while keeping their weights.

Some code from that project to illustrate:

    -- Weight labels run from 'A' (most important) to 'D' (least, the default).
    UPDATE podcasts
    SET fts_doc = setweight(to_tsvector(COALESCE(title, ' ')), 'A')
                  || setweight(to_tsvector(COALESCE(homepage_url, ' ')), 'A')
                  || setweight(to_tsvector(COALESCE(podcast_idx_itunes_author, ' ')), 'A')
                  || setweight(to_tsvector(COALESCE(podcast_idx_itunes_ownername, ' ')), 'A')
                  || setweight(to_tsvector(COALESCE(podcast_idx_host, ' ')), 'A')
                  -- array_to_string returns NULL for a NULL array, and one NULL operand
                  -- makes the whole || chain NULL, so guard it like the other columns.
                  || setweight(to_tsvector(COALESCE(array_to_string(categories, ' '), ' ')), 'B')
                  || setweight(to_tsvector(COALESCE(description_html, ' ')), 'D');

Basically, I'm making tsvectors out of chunks of the document, weighting them differently, then recombining them with the other vectors without losing the weightings. I'm thinking the same trick could be applied to words identified by the corpus-level algos.

So my simplistic thinking here is that once you've done the corpus-level processing, you could build an intermediate data structure and re-evaluate each search document with the appropriate weighting. It would likely be quite a lengthy stored procedure, but it seems like setweight could support the use case? Maybe I'm being a bit optimistic.
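To make the "intermediate data structure" idea a bit more concrete, here is a rough Python sketch (not Postgres code) of one way corpus-level statistics could be squashed into the four weight labels that setweight understands. The function name, the smoothed IDF formula, and the threshold values are all assumptions chosen for illustration; nothing here comes from the project above.

```python
import math
from collections import Counter


def idf_weight_labels(docs, thresholds=(0.75, 0.5, 0.25)):
    """Map each term to a weight label 'A'..'D' based on its IDF.

    docs: list of token lists, assumed to be already normalized.
    thresholds: hypothetical cut points on IDF rescaled to [0, 1];
                rarer terms land in 'A', very common ones in 'D'.
    """
    n_docs = len(docs)
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for doc in docs for term in set(doc))
    if not df:
        return {}
    # Smoothed IDF (an assumed variant); always positive.
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1 for t, c in df.items()}
    lo, hi = min(idf.values()), max(idf.values())
    span = (hi - lo) or 1.0  # avoid dividing by zero when all IDFs match

    labels = {}
    for term, score in idf.items():
        scaled = (score - lo) / span
        if scaled >= thresholds[0]:
            labels[term] = "A"
        elif scaled >= thresholds[1]:
            labels[term] = "B"
        elif scaled >= thresholds[2]:
            labels[term] = "C"
        else:
            labels[term] = "D"
    return labels


if __name__ == "__main__":
    corpus = [
        ["rust", "podcast"],
        ["go", "podcast"],
        ["python", "podcast"],
        ["rust", "news"],
    ]
    for term, label in sorted(idf_weight_labels(corpus).items()):
        print(term, label)
```

On the Postgres side, the resulting term lists could then, in principle, be fed to the three-argument form setweight(vector, weight, lexemes), which relabels only the listed lexemes. The terms would have to be the same normalized lexemes that to_tsvector produces, and four buckets is a very coarse stand-in for a real TF/IDF score.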

[0]: https://podcastsaver.com




If you could figure that out, it would be an awesome plugin.

PS podcastsaver looks neat!

some quick feedback:

1) your "switch back to light mode" icon looks a LOT like a gear for a settings menu. I turned on dark mode, did a search, saw the "back to light mode" icon and thought "huh, the dark mode toggle is settings now? Weird choice, let's see what's there..."

2) the show notes seem truncated. It would be helpful for me to be able to search the show notes for a defined set of podcasts. Sometimes I remember that a podcast mentioned a product or service that I wanted to check out, but I can't remember the name of the product or the overall episode, and it's painful to find the right one by scrolling back through everything in my pod catcher.

3) are you tracking Podcasts 2.0? Some interesting additional stuff to index there. https://origin.fm/blog/podcasting-2point0/


Sorry, I only just got around to implementing some of your feedback, and I didn't realize that Podcasting 2.0 was the Podcast Index. That is the main data source!


Thanks for the detailed feedback!

On (1) I can definitely see that — will fix!

(2) yeah, I need to go to the source for that. I think the Podcast Index data might be why they're truncated? I’m going to double-check.

(3) no I’m not! Thank you for the pointer!

I’m going to work on all of this (and tackle the speed issue)


Hm, I'm not totally following, but... would you have to recalculate all row values every time the corpus changes? I guess that could work for a seldom-changing corpus, not sure how popular a use case that is. I suspect most people would not be interested in such an approach, instead either making do without TF/IDF, or moving to a non-pg solution.


> would you have to recalculate all row values every time the corpus changes

Yep, I mean this is always the case for corpus-level algos right?

No reason you can’t do it iteratively, though: Postgres has triggers…

Oh but actually thinking about it, it could be a function! You’d just need access to that intermediate representation.
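As a rough illustration of the iterative approach, here is a Python sketch (again, not Postgres code) of an intermediate representation that is kept current one document at a time, roughly the bookkeeping an insert/update/delete trigger would have to do. The class and method names are hypothetical, and the smoothed IDF formula is an assumption.

```python
import math
from collections import Counter


class DocFrequencyIndex:
    """Document frequencies maintained incrementally, one document at a time."""

    def __init__(self):
        self.n_docs = 0
        self.df = Counter()        # term -> number of documents containing it
        self._terms_by_doc = {}    # doc_id -> frozenset of its terms

    def upsert(self, doc_id, tokens):
        """Insert a document, or replace it if doc_id already exists."""
        self.remove(doc_id)        # no-op for unknown ids
        terms = frozenset(tokens)
        self._terms_by_doc[doc_id] = terms
        self.df.update(terms)      # +1 for each distinct term
        self.n_docs += 1

    def remove(self, doc_id):
        """Delete a document and undo its contribution to the counts."""
        terms = self._terms_by_doc.pop(doc_id, None)
        if terms is None:
            return
        self.df.subtract(terms)    # -1 for each distinct term
        for term in terms:
            if self.df[term] <= 0:
                del self.df[term]  # drop terms no document contains
        self.n_docs -= 1

    def idf(self, term):
        """Smoothed IDF computed on demand from the current counts."""
        return math.log((1 + self.n_docs) / (1 + self.df.get(term, 0))) + 1


if __name__ == "__main__":
    index = DocFrequencyIndex()
    index.upsert(1, ["rust", "podcast"])
    index.upsert(2, ["go", "podcast"])
    index.upsert(2, ["go", "news"])   # an edit to document 2
    print(index.n_docs, dict(index.df), index.idf("podcast"))
```

The catch is that the counts update cheaply, but every stored per-row weighting derived from them goes stale as soon as they change. That is the argument for computing the score in a function at query time rather than baking it into each row.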

> I suspect most people would not be interested in such an approach, instead either making do without TF/IDF, or moving to a non-pg solution.

Well people would be happy if it was there at all, I think. Then they could at least make the choice or have a decent option.

It probably won’t be as performant as other solutions which can make more drastic architecture changes but… might still be worth having


> Yep, I mean this is always the case for corpus-level algos right?

I am not sure which parts of which calculations Lucene (Elasticsearch and Solr) does on the fly versus precalculates after any change to the corpus, because it's more or less transparent. Well, I guess that's not entirely true: there are definitely index rebuilds that happen after updates, and at larger scale they can be resource-intensive enough that you have to account for them (at very small scale you can more or less ignore them). Maybe it's just that Solr/ES have architectures built around accounting for that and give you tools to deal with it in various ways.



