
Show HN: CX_DB8 A word-level extractive summarizer powered by text embeddings - Der_Einzige
https://github.com/Hellisotherpeople/CX_DB8
======
Der_Einzige
Reposting because even though I got 0 upvotes on the last post, GitHub
insights shows that a TON of people saw this via HN - and I got an additional
10 GitHub stars after I posted it. Ugh, lurkers.

Hi all, I wrote this in my spare time, originally aimed at the American
Competitive Policy Debate community (though it's applicable to anyone). Some
notes:

1\. No one seems to have made anything like this (queryable, word-level
extractive summarization), and I was pissed because when I was doing
competitive debate I really wanted this tool (I spent days on this stupid
underlining and highlighting of evidence, days that I'd like back).

2\. This is also equivalent to a "Semantic Search" engine. I've seriously
thought of trying to write a word2vec-powered ctrl-F replacement for a web
browser. I'm convinced that CX_DB8 is proof that this could be neat.
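
A rough sketch of what a semantic ctrl-F could look like. This is not the CX_DB8 code: the 3-d vectors in `TOY_VECTORS` are made up to stand in for real word2vec embeddings, and `semantic_find` and its `threshold` are hypothetical names chosen for the example.

```python
import math

# Made-up 3-d vectors standing in for real word2vec embeddings (assumption).
TOY_VECTORS = {
    "car": (0.9, 0.1, 0.0),
    "automobile": (0.85, 0.15, 0.05),
    "truck": (0.8, 0.2, 0.1),
    "banana": (0.0, 0.1, 0.9),
    "fruit": (0.05, 0.2, 0.85),
    "the": (0.3, 0.3, 0.3),
}


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def semantic_find(query, text, threshold=0.95):
    """Return (position, word) pairs whose vector is close to the query's."""
    q = TOY_VECTORS[query]
    hits = []
    for i, word in enumerate(text.lower().split()):
        vec = TOY_VECTORS.get(word)  # out-of-vocabulary words are skipped
        if vec is not None and cosine(q, vec) >= threshold:
            hits.append((i, word))
    return hits


hits = semantic_find("car", "the automobile hit the banana truck")
print(hits)
```

Searching for "car" matches "automobile" and "truck" even though the literal string never appears, which is the whole point of replacing exact-match ctrl-F.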

3\. While my idea of making a tool that can make any document "say what I want
it to say" was cool, the reality is that it's kinda hard to make it look like
a document says the opposite of what it naturally says. To a word vector
model, "X is bad" and "X is good" are usually very similar, meaning that the
summaries given for each query are very similar.
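
To see why, here is a toy calculation (one-hot vectors I made up, not real embeddings). Even in the best case, where "good" and "bad" are completely orthogonal, mean pooling leaves the two sentences sharing two of their three words, so the cosine similarity is already 2/3. Real embeddings tend to place antonyms close together, which only pushes the number higher.

```python
import math


def mean_pool(vectors):
    """Average a list of equal-length vectors component-wise."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# One-hot toy vectors (assumption): every word is orthogonal to every other,
# the most favorable case possible for telling "good" apart from "bad".
TOY = {
    "x": [1, 0, 0, 0],
    "is": [0, 1, 0, 0],
    "good": [0, 0, 1, 0],
    "bad": [0, 0, 0, 1],
}


def embed(sentence):
    """Mean-pool the toy word vectors of a whitespace-tokenized sentence."""
    return mean_pool([TOY[w] for w in sentence.lower().split()])


sim = cosine(embed("X is good"), embed("X is bad"))
print(round(sim, 3))  # 0.667
```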

4\. Is there anything I could improve? I feel like there's gotta be something
better than cosine similarity and mean pooling for the creation of document
vectors out of word vectors. I saw something implying that TF-IDF weighting
doesn't help - which sucks - so I'm skeptical.
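
For reference, this is roughly what IDF-weighted pooling looks like next to plain mean pooling. It is a sketch with a made-up corpus and made-up 2-d vectors, and `idf_weights` and `weighted_pool` are hypothetical helper names; whether this actually beats mean pooling on real data is exactly the part I'm unsure about. Repeated tokens are counted once per occurrence, so term frequency comes along for free.

```python
import math
from collections import Counter


def idf_weights(tokenized_docs):
    """Smoothed IDF per word: log((1 + N) / (1 + df)) + 1."""
    n_docs = len(tokenized_docs)
    df = Counter(word for doc in tokenized_docs for word in set(doc))
    return {w: math.log((1 + n_docs) / (1 + c)) + 1 for w, c in df.items()}


def weighted_pool(tokens, vectors, weights):
    """Weighted average of word vectors; unknown words are skipped."""
    dim = len(next(iter(vectors.values())))
    total = [0.0] * dim
    weight_sum = 0.0
    for tok in tokens:
        if tok in vectors:
            w = weights.get(tok, 1.0)  # words without an IDF get weight 1
            total = [t + w * v for t, v in zip(total, vectors[tok])]
            weight_sum += w
    return [t / weight_sum for t in total] if weight_sum else total


# Made-up corpus and vectors, purely for illustration (assumption).
docs = [
    ["the", "plan", "fails"],
    ["the", "plan", "works"],
    ["the", "economy", "collapses"],
]
vectors = {"the": [1.0, 0.0], "economy": [0.0, 1.0]}

weights = idf_weights(docs)
pooled = weighted_pool(["the", "economy"], vectors, weights)
print(weights["the"], weights["economy"])
print(pooled)
```

With uniform weights the result would sit at [0.5, 0.5]; with IDF the rarer word "economy" pulls the document vector toward itself.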

Thank you to all the people who contributed the packages that make CX_DB8
possible.

