
Show HN: Fist – Persistent Full Text Search and Indexing Server Written in C - max0563
https://github.com/f-prime/fist
======
necovek
Calling this a "full-text search" engine seems like a bit of a stretch: I
only casually browsed the code from my phone so I might have missed it, but I
don't see any support for stemming, which is a critical piece of any FTS
engine (and supporting many languages complicates things further).

This seems to be a simple word indexer instead, which is quite a different
proposition to FTS.

~~~
taneq
I'm not overly familiar with search as a field - why is stemming "critical to
FTS" rather than a nice enhancement? I would have thought the defining
criterion for "full text search" was that it searched the full text of the
database?

~~~
ComodoHacker
The key is "with support for many languages". For many languages (other than
English) with heavy use of suffixes and/or prefixes, FTS without stemming is
kind of useless.

~~~
ianai
I didn’t quite understand what you meant. I saw your comma and thought you
were developing another sentence and thought entirely. I understood you meant
FTS search needs stemming to be useful after skipping the comma. Just FYI.

~~~
ComodoHacker
Yes, you understood correctly. English is not my native language, sorry.

~~~
ianai
I used to be a heavy comma user. Still am. But try being ‘radical’ and not
using a comma more often. (It’s not actually radical at all)

------
max0563
Hey all, Fist project creator here. Just wanted to thank everyone for the
support and the comments. I’ve been working on this project for a few months
now mainly as a passion project. It’s awesome to see some people taking an
interest.

I could definitely use some help on this. Whether it be giving advice or
writing actual code.

If anyone would like to chat more about this, I have set up a Slack channel for
the project. Here is the invite link:

[https://join.slack.com/t/fist-global/shared_invite/enQtNjcyN...](https://join.slack.com/t/fist-global/shared_invite/enQtNjcyNzY4MTUwMDg0LTRiYzM5ZWNkOTMwODYzODRjNDQzNThiYjdhNjgzZDUxZGYxODRjOTI4NTcwYmYzYmI5MTViYjFiNGFlNWEwYjY)

Thank you again for all the support, I’m excited to continue development.

~~~
michelpp
As others have pointed out there are some features considered essential in
this space that would be good additions to the code:

Stemming: removing prefixes/suffixes (walked, walking, walker -> walk). The
Snowball library contains stemmers for many languages:
[https://snowballstem.org](https://snowballstem.org)

Stopping: removing many common words like the/and that almost every document
will surely match anyway. Again, per language.

Relevance ranking: Generally some variant of TF/IDF like BM25. The book
"Managing Gigabytes" is an excellent intro to the subject and information
retrieval in general:
[https://people.eng.unimelb.edu.au/ammoffat/mg/](https://people.eng.unimelb.edu.au/ammoffat/mg/)

A Document/Term/Value model: this is how Lucene, Xapian, and other IR systems
model the store. It's worth sticking with that same pattern.
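
To make the stemming, stopping, and ranking points concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: `toy_stem` is a crude English-only suffix stripper standing in for a real Snowball stemmer, `STOP_WORDS` is a tiny hand-picked list, and `bm25_scores` uses the common `k1`/`b` defaults. None of it reflects Fist's actual code.

```python
import math
from collections import Counter

# Tiny illustrative stop list (assumption); real engines ship per-language lists.
STOP_WORDS = {"the", "and", "a", "of", "to", "in", "is"}


def toy_stem(word):
    """Strip a few English suffixes. A crude stand-in for a Snowball stemmer."""
    for suffix in ("ing", "ed", "er", "s"):
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def analyze(text):
    """Lowercase, split on whitespace, drop stop words, stem what is left."""
    return [toy_stem(w) for w in text.lower().split() if w not in STOP_WORDS]


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Return one BM25 score per document in `docs` (a list of strings)."""
    if not docs:
        return []
    analyzed = [analyze(d) for d in docs]
    n_docs = len(analyzed)
    avg_len = sum(len(d) for d in analyzed) / n_docs
    # Document frequency: number of documents each term appears in.
    df = Counter(term for d in analyzed for term in set(d))
    query_terms = analyze(query)

    scores = []
    for terms in analyzed:
        tf = Counter(terms)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Length normalization: long documents get less credit per hit.
            norm = tf[term] + k1 * (1 - b + b * len(terms) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores
```

Calling `bm25_scores("walking", docs)` gives one score per document, so sorting document ids by score yields a ranked result list. Because the query goes through the same `analyze` step as the documents, "walking" also matches "walked" and "walker".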

Good luck!

~~~
max0563
This is very helpful. I am working on all of this, but summarizing it all here
helped me a lot. Thank you.

------
karterk
Looks like the project is still under development, but I think that having an
HTTP-based interface would make this much easier to use from multiple
languages.

Disclosure: I work on a similar open source search engine
([https://github.com/typesense/typesense](https://github.com/typesense/typesense))

~~~
madhadron
This is an interesting assertion. Aside from JavaScript from the browser,
which is tied to HTTP, what language can't use a TCP interface but can use an
HTTP one?

~~~
karterk
I was merely highlighting the convenience of an HTTP-based RESTish API. It
would be a more familiar pattern than a custom text-based protocol over TCP.

------
pas
[https://github.com/tantivy-search/tantivy](https://github.com/tantivy-search/tantivy) - Lucene in Rust (I have no idea about API or feature parity)

~~~
fulmicoton
Thanks for the publicity :)

------
busymom0
If the acronym is Fist, shouldn’t you change the order of the words to:

Fist - (F)ull - (i)ndex (s)erver for (t)ext

~~~
Lowkeyloki
I look forward to the day I can stop googling and start fisting.

~~~
max0563
This is one hell of a comment. Thank you, sir.

------
mitjam
For simple full text search I like to use SQLite FTS3 and Xapian, which is a
feature-rich and mature full text search engine.
[https://xapian.org/](https://xapian.org/)

------
latenightcoding
Similar: [https://github.com/apache/attic-lucy](https://github.com/apache/attic-lucy) (recently archived)

------
simonhamp
Surely “sift” would’ve been a better anagram?

“Search (and) index full text”

~~~
dewey
[https://sift-tool.org](https://sift-tool.org)

------
levidurfee
Would using an event handler make this faster?

[http://libevent.org/](http://libevent.org/)
[http://www.kegel.com/c10k.html](http://www.kegel.com/c10k.html)

------
lamchob
Very cool project. Which algorithm do you use for indexing/retrieval?

~~~
max0563
Thanks for this!

Right now it is not very complicated. An inverted index of the text that is
sent over is created and added to the database. The DB is just a big hash map.
Searching right now is just an O(1) lookup of the search text in this hash
map.
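
For readers who have not met the idea, a hash-map-backed inverted index can be sketched in a few lines of Python. This is only an illustration of the general technique under my own assumptions (whitespace tokenization, lowercase matching, the hypothetical `InvertedIndex` class and its method names); it does not mirror Fist's C implementation or its wire protocol.

```python
from collections import defaultdict


class InvertedIndex:
    """Maps each word to the set of document ids that contain it."""

    def __init__(self):
        # word -> set of document ids (the "big hash map")
        self.postings = defaultdict(set)

    def index(self, doc_id, text):
        """Add every whitespace-separated word of `text` under `doc_id`."""
        for word in text.lower().split():
            self.postings[word].add(doc_id)

    def search(self, word):
        """Single-word lookup: one hash-map probe, O(1) on average."""
        return set(self.postings.get(word.lower(), set()))

    def search_all(self, query):
        """AND query: documents that contain every word of `query`."""
        words = query.lower().split()
        if not words:
            return set()
        result = self.search(words[0])
        for word in words[1:]:
            result &= self.search(word)
        return result
```

The lookup itself is constant time on average, but intersecting posting sets for a multi-word query costs time proportional to the size of those sets, which is one place where scoring and smarter posting-list layouts start to matter.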

I have plans to improve this of course. There is a lot of room for
improvement. I am also planning on adding scoring using TF-IDF, but it's not
done yet.

Any help with this would be more than appreciated!

------
actionowl
There is also Groonga:

[https://github.com/groonga/groonga](https://github.com/groonga/groonga)

But this looks like it intends to be a lot simpler.

------
lokl
I've been using [http://sphinxsearch.com](http://sphinxsearch.com) for many
years. Why should I consider Fist?

~~~
rurban
I'd rather keep using Xapian. That's a proper indexing and search engine (C++),
much better than Sphinx or Elasticsearch. Elasticsearch just has better client
integration for things like MS Word or PDF, if you want to jump to the
highlighted line in the document.

~~~
lokl
I haven't looked at Xapian in many years and I can't remember the reason I
originally picked Sphinx instead of it. In what ways do you think it is "much
better"?

~~~
rurban
That was my conclusion when I implemented our company search engine.
Lucene/Solr was nice and had better frontends, but a horrible backend in Java
needing too much memory. Sphinx was small and fast, but lacked all the
important additional functionality I needed, e.g. customized ranking, aliases,
language detection and stemming, and a good Google-like query parser. I
probably forgot most of it; this was 20 years ago.

------
sagichmal
It is irresponsible to the point of professional negligence to write new
software in C.

~~~
drocer88
Isn't C the fastest and most energy-efficient language, with a very small code
footprint?

See: [https://thenewstack.io/which-programming-languages-use-the-l...](https://thenewstack.io/which-programming-languages-use-the-least-electricity/)

You'd think that something like a text search would benefit from being as fast
as possible.

Aren't most of the other languages written in C because it is fast? (Except
for Rust, of course, which now appears to be a very reasonable alternative ).

~~~
sagichmal
C is fast, small, and energy efficient. It is also impossible to write safe,
exploit-free code in C by hand. The juice ain’t worth the squeeze.

------
VvR-Ox
Haha here you have the right title song for this thread:
[https://www.youtube.com/watch?v=4dXc4ilXOLA](https://www.youtube.com/watch?v=4dXc4ilXOLA)

* sorry stupid word-play

~~~
max0563
Hahaha, I love this! Got me fired up xD

------
singron
I find it hard to believe that this is "fast" as advertised when it isn't even
compiled with optimizations:
[https://github.com/f-prime/fist/blob/9049066e8d49ff41f2272ec...](https://github.com/f-prime/fist/blob/9049066e8d49ff41f2272ec74a78d46313bcd66e/fist/Makefile#L3)

~~~
max0563
Project dev here. If you'd be willing to help me with that part I would be
grateful. When I talk about "fast" I mean all searches are O(1) due to the
magic of hash maps.

~~~
rurban
Fast FTS engines use reverse indices with trigrams, not primitive hash maps.
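
As a rough illustration of what a trigram index means here, the Python sketch below maps every three-character substring to the documents containing it, then answers substring queries by intersecting those sets and verifying the candidates. The class and function names are hypothetical, and this is a toy under my own assumptions, not how any particular engine implements it.

```python
from collections import defaultdict


def trigrams(text):
    """Return the set of all three-character substrings of `text`, lowercased."""
    text = text.lower()
    return {text[i:i + 3] for i in range(len(text) - 2)}


class TrigramIndex:
    def __init__(self):
        self.docs = {}                    # doc id -> original text
        self.postings = defaultdict(set)  # trigram -> set of doc ids

    def add(self, doc_id, text):
        self.docs[doc_id] = text
        for gram in trigrams(text):
            self.postings[gram].add(doc_id)

    def search(self, needle):
        """Return ids of documents containing `needle` as a substring."""
        grams = trigrams(needle)
        if grams:
            # Only documents containing every trigram of the needle can match.
            candidates = set.intersection(
                *(self.postings.get(g, set()) for g in grams)
            )
        else:
            # Needle shorter than three characters: fall back to a full scan.
            candidates = set(self.docs)
        # Sharing all trigrams is necessary but not sufficient, so verify.
        needle = needle.lower()
        return {d for d in candidates if needle in self.docs[d].lower()}
```

Unlike a whole-word hash map, this finds matches inside words (a search for "text" also hits "context"), at the cost of a larger index and a verification pass over the candidates.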

~~~
vram22
Any links to papers or docs about that? Interested.

------
qalmakka
Using plain makefiles with "gcc" hardcoded in is atrocious; at least use
$(CC). This also makes me suspect that it hasn't been tested with any
compiler other than GCC at whatever version the developer had installed on his
machine, which is bad.

~~~
max0563
Project developer here: Would gladly accept some help with this if you'd be
willing to offer it.

~~~
asymptotically2
I just submitted PR #4, which changes your Makefile to work a little better.
Hopefully it makes the parent commenter stop crying too :)

