What I'm thinking is a little far out, but it came up recently on a project where I'm using Postgres FTS (it's slow, but I figure I might as well link it for now[0] -- please don't HN-hug it).
Basically, I read on the internet (and was surprised by) the fact that setweight can be applied to individual tsvectors, and that those tsvectors can then be combined while keeping their weightings.
Basically, I'm making tsvectors out of chunks of the document, weighting them differently, then recombining them with other vectors without losing the weightings -- I'm thinking this could be applied to words identified by the corpus-level algos.
So my simplistic thinking here is that once you've done the corpus-level processing, you could build an intermediate data structure and re-evaluate each search document with the appropriate weighting. It would likely be quite a lengthy stored procedure, but it seems like setweight could support the use case? Maybe I'm being a bit optimistic.
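A minimal sketch of the setweight behavior I mean (illustrative only; the text snippets are made up):

```sql
-- Weight each chunk of a document separately, then concatenate with ||;
-- the combined tsvector keeps the per-chunk weight labels (A > B > C > D).
SELECT setweight(to_tsvector('english', 'rare corpus-identified terms'), 'A')
    || setweight(to_tsvector('english', 'the main body of the document'), 'B')
    || setweight(to_tsvector('english', 'boilerplate footer text'), 'D');
```

If you store the combined vector in a column, ts_rank then honors those weights at query time, so the corpus-level importance survives into ranking.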
If you could figure that out, it would be an awesome plugin.
PS podcastsaver looks neat!
some quick feedback:
1) your "switch back to light mode" icon looks a LOT like a gear for a settings menu. I turned on dark mode, did a search, saw the "back to light mode" icon and thought "huh, the dark mode toggle is settings now? Weird choice, let's see what's there..."
2) the show notes seem truncated. It would be helpful for me to be able to search the show notes for a defined set of podcasts. Sometimes I remember that a podcast mentioned a product or service that I wanted to check out, but I can't remember the name of the product or the overall episode, and it's painful to find the right one by scrolling back through everything in my pod catcher.
Sorry, I only just got around to implementing some of your feedback, and I didn't realize that Podcasting 2.0 was the Podcast Index -- that is the main data source!
Hm, I'm not totally following, but... wouldn't you have to recalculate all row values every time the corpus changes? I guess that could work for a seldom-changing corpus; I'm not sure how popular a use case that is. I suspect most people wouldn't be interested in such an approach, and would instead either make do without TF/IDF or move to a non-pg solution.
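To make the cost concrete, here's roughly what I'd picture the re-evaluation looking like (hypothetical schema: a `term_weights(lexeme, bucket)` table rebuilt by the corpus-level pass) -- note it has to touch every row:

```sql
-- Hypothetical: the three-argument form of setweight assigns the given
-- weight only to the listed lexemes, leaving the rest at the default 'D'.
-- After each corpus-level recompute of term_weights, every document's
-- search_vector must be rebuilt to pick up the new weights.
UPDATE documents d
SET search_vector = setweight(
        to_tsvector('english', d.body),
        'A',
        (SELECT array_agg(tw.lexeme) FROM term_weights tw WHERE tw.bucket = 'A')
    );
```

That full-table UPDATE is exactly the part that seems painful if the corpus changes often.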
> Yep, I mean this is always the case for corpus-level algos right?
I'm not sure which parts of which calculations Lucene (Elasticsearch and Solr) does on the fly vs. pre-calculates after any change to the corpus, because it's more or less transparent. I guess that's not entirely true -- there are definitely index rebuilds that happen after updates, and for larger-scale deployments they can be resource-intensive enough that you have to account for them (at very small scale you can more or less ignore them). Maybe it's just that Solr/ES have architectures built around accounting for that and give you tools to deal with it in various ways.
[0]: https://podcastsaver.com