
Toshi: An Elasticsearch competitor written in Rust - whoisnnamdi
https://github.com/toshi-search/Toshi
======
georgecalm
> What is a Toshi?

> Toshi is a three year old Shiba Inu. He is a very good boy and is the
> official mascot of this project. Toshi personally reviews all code before it
> is committed to this repository and is dedicated to only accepting the
> highest quality contributions from his human. He will though accept treats
> for easier code reviews.

------
arminiusreturns
Does anyone know of any other prod ready elasticsearch alternatives? I'm
working on a logging infrastructure project for shipping syslog, and it seems
no one these days just uses plain central syslog and ES is the standard, but
it seems bloated.

I've been tempted to just ship straight to a DB and skip all these crazy
shippers and parsers and all the other middle men in the equation.

Also, why has no product unified monitoring and logging? AFAIK that's why
Splunk is worth it if you have the budget (I don't).

~~~
noahl
Back when I worked at Google, the standard log processing tool was Dremel. You
could get exactly the same thing by shipping your logs to BigQuery.

I haven't checked, but I bet it's cheaper than ES for data that's mostly cold,
like logs. You will need a separate monitoring solution though.

~~~
mentat
If you want streaming inserts to BQ, that becomes the biggest cost. Dataflow
could be used to turn inserts into batch and gather interesting metrics that
you don't want to hit BQ for, but I don't think anyone's open sourced anything
in this space. I've implemented streaming inserts to BQ for logs "at scale"
and it was at least an order of magnitude cheaper than splunk still. Happy to
talk via email.

~~~
dominotw
what was the monitoring solution at the end, where were BQ results going to?

~~~
mentat
It was generic log aggregation used mostly for incident response and forensics
as well as some offline metrics. There were a bunch of metrics that were being
created (like 3 different ways) on box with parsing that we were looking at
moving into the log processing stream. We had a chat bot that people could use
to interact with common queries as well as standard SQL interaction via UI and
API auth'd by Google IAM.

------
nathcd
Neat! I hope this goes far, it'd be great to have a faster/lighter-weight
Elasticsearch.

Something similar I'm really hoping to see is Tantivy in a Postgres extension,
so I can stop playing the game of trying to keep my search engine and database
in sync. Seeing pg-extend-rs
([https://github.com/bluejekyll/pg-extend-rs](https://github.com/bluejekyll/pg-extend-rs))
on HN the other week got me
thinking about it again. Does anyone know whether this is feasible or if
anyone is working on something in this vein?

~~~
profquail
Out of curiosity, have you looked at using PostgreSQL's full text search
functionality to implement your search engine (e.g. [1])? If so, what does
the combination of Postgres + Elasticsearch give you that made you choose it
over Postgres full text search alone?

[1] [http://rachbelaid.com/postgres-full-text-search-is-good-
enou...](http://rachbelaid.com/postgres-full-text-search-is-good-enough/)

~~~
Ralfp
A major problem with Postgres full-text search that those articles don't delve
into much is that unless your documents are in one of the "chosen
languages", you are more likely to find support for your language in a search
engine (like Elasticsearch) than in PostgreSQL.

You can convert existing dictionaries to a format Postgres understands,
but this is an annoying pain point if you happen to be an open source project
like a CMS or communication platform.

~~~
hashhar
I don't get the hype about elasticsearch at all. Elasticsearch is more suited
to searching logs. It doesn't have powerful sort functions, doesn't allow you
to use multiple sort parameters etc.

Apache Solr is more suited to search. Lots of document filters, query filters,
the index itself is highly configurable and the ability to sort on multiple
parameters is great. LTR is also something too good to miss out on.

~~~
rpedela
Can you explain what you mean by "multiple sort parameters"? It looks to me
like you can do that in ES [1]. There is also a well maintained LTR plugin for ES.
Honestly Solr and ES are more similar than different. There are a few things
Solr has which ES doesn't and the reverse is true too.

[1]
[https://www.elastic.co/guide/en/elasticsearch/reference/curr...](https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-sort.html)

~~~
hashhar
Thanks for letting me know about Elasticsearch LTR.

------
pornel
There are some real gems among Rust crates. I'm using tantivy[1]. It has been
super easy to set up and it's _faaast_.

[1]: [https://crates.rs/crates/tantivy](https://crates.rs/crates/tantivy)

~~~
coldtea
Toshi is built on top of tantivy if I'm not mistaken...

~~~
kevsim
Yep. Very first line of the README

------
latenightcoding
I love these types of projects. I am a big fan of Elasticsearch, but sometimes
it feels overly complex, bloated and memory-hungry. I hope someday a decent
Rust/C++ alternative will take over. I was following the development of Apache
Lucy ([https://lucy.apache.org/](https://lucy.apache.org/)) but the project
has been retired now.

~~~
bluejekyll
> I hope someday a decent Rust/C++ alternative will take over.

I found this part of your comment interesting, given that Rust is in some ways
being offered as an alternative to C++ for similar use cases.

What would you like to see different in a language that you see as the common
issue with C++ and Rust?

edit: I misread the parent comment, the question should be disregarded.

~~~
jazoom
I'm pretty sure he/she was saying they want to see something like
Elasticsearch written in Rust/C++ rather than Java and made to be less bloated
and less complex.

I would also be very interested in that. Elasticsearch usually works very well
for me, especially the latest version, but it feels very heavy, and seems to
only be getting worse in that regard.

~~~
bluejekyll
Oh! I completely misread that. Thanks for clarifying.

------
CSDude
Tantivy does not allow schema evolution, but Lucene does, this is a major
blocker for dynamic indices

~~~
fulmicoton
So technically, it is supposed to allow you to add fields... but it is not well
tested, so I try to keep that a secret.

I should probably work on a proper scenario for schema evolution.

------
markhenderson
It would be great if this achieved API parity with ES. Being able to swap out
parts of the ELK stack would make tools like kibana even more powerful.

~~~
mdaniel
I hear you, and I know why you'd say that, but _wow_ the API surface-area of
ES is ginormous. Maybe the 80-20 rule goes a long way here, but I wouldn't
expect API parity to be a simple matter of exposing the same REST endpoints --
it's the payload that'll be the headache

I actually strongly considered that just with Solr, which has the extreme
benefit of using the same query language under the hood, but the more I
scratched the more I found it would be a horrific amount of work

~~~
jazoom
Plus the Elasticsearch API isn't especially nice to use. I haven't tried their
new SQL, since it requires an enterprise licence or something.

~~~
mdaniel
Do you mean encoding the lucene queries as JSON objects into the ES endpoints,
or do you mean the actual lucene syntax (as would be surfaced by kibana et
al)?

~~~
jazoom
I mean the Elasticsearch API. Kinda what you were referring to in the first
part of your sentence, but I don't know why you'd say it like that, especially
since the Elasticsearch API covers other things, such as mapping indexes and
other cluster administration.

------
badrabbit
Love the idea, saw another one in 2018
([https://www.gravwell.io/](https://www.gravwell.io/))

At least in my experience the web interface is a huge gap. Kibana is ok but
Splunk and its query language and visuals are much better. Anything that
competes with Splunk is great imo.

~~~
remasis
Hey, Corey from Gravwell here. Huuuge updates coming to the web interface this
month! You can see teaser screenshots on our website and a blog post coming
this week using the new UI.

Subscribe to the blog (we are non-spammy) to get the email announcement.

~~~
fulmicoton
Hi! Tantivy main dev here. Do you have an engineering dev blog that discloses
information about your backend?

~~~
cthuen
We haven't published a lot on the backend architecture, but there's some info
in the docs. I think it would be interesting to chat. Can you hit me up at
info@gravwell.io?

------
amelius
Also check out the roadmap:

> [https://github.com/toshi-search/Toshi/blob/master/roadmap.md](https://github.com/toshi-search/Toshi/blob/master/roadmap.md)

It seems that the main added functionality of ES, namely clustering, will
still take a while to be implemented.

------
lettergram
Nice! I’m going to have to check it out.

At the same time, I’m curious about the performance differences with Postgres.
I’ve been able to get very quick queries from Postgres:

[https://austingwalters.com/fast-full-text-search-in-
postgres...](https://austingwalters.com/fast-full-text-search-in-postgresql/)

The only time it’s slower to the point of using elasticsearch (for my
usecases) is something like log search (so far).

~~~
ilikehurdles
Well that’s a big use case that ES is meant for. There’s also the fact that ES
is easy to shard. At my previous gig we easily had over a terabyte of data
written into ES indices every day, with hundreds of TBs worth of documents
searchable in various indices, and none of this is even counting the logging
use case. Used it extensively for calculating various aggregations (for
reporting/analytics) of events.

------
ivoras
Cool project!

But there's an attitude attached which amuses me:

> Toshi will always target stable Rust and will try our best to never make any
> use of unsafe Rust. While underlying libraries may make some use of unsafe,
> Toshi will make a concerted effort to vet these libraries in an effort to be
> completely free of unsafe Rust usage. The reason I chose this was because I
> felt that for this to actually become an attractive option for people to
> consider it would have to be safe, stable and consistent. This was why
> stable Rust was chosen because of the guarantees and safety it provides.

It's an admirable goal, though the fact that it's stated prominently as one of
the first things on the front page gives off somewhat of a "doth protest too
much" vibe. It's not like "safety" is rare these days. Any project written in
Go, Python, Java, C#, Erlang, JS, and a myriad of others will be "safe" as far
as memory access is concerned, and in many cases this safety will be easier to
achieve than in Rust. As far as error handling safety, so far exceptions seem
to be more expressive, though the jury is out there.

Basically, if a project stays away from C and C++ and libraries written in
them, it's more likely it will be hit by a hardware problem than an inherent
language safety / security issue. Luckily, "safety" is for the largest part
the default for modern projects.

~~~
troutwine
There's an inside baseball discussion happening here with the larger Rust
community that I think, maybe, you're missing. You are absolutely correct in
saying that Go, Python et al offer memory safety in a way that low-level
languages do not. These languages use techniques that incur some kind of
runtime penalty. Erlang, for instance, copies memory like it's going out of
style. Now, with Rust you get a similar kind of memory safety but, somewhat
uniquely, without the same category of runtime penalty. There's still some,
sometimes, and that's where the community discussion around 'unsafe' happens
in Rust.

So, you're writing in Rust and you'd like to write code which has the absolute
most aggressive performance possible. In Rust parlance this _maybe_ means you
need to do unsafe things: turn off bounds checks, fiddle with the raw memory
of a structure, allow multiple threads to access the same memory without
synchronization, interact with mutable globals and so on. That's great, but,
you've potentially opened up your program to crashes or security issues. If
you're a solo author that's a trade-off you can make based on your needs.
Here's the rub: if you use similar techniques in a crate -- a shared library
for all to use -- then you've opted everyone using your crate into the same
trade-off, one which they might not have otherwise chosen for themselves. What
the Toshi project is saying here is that its design preference is to avoid
opting into this trade-off, preserving all the guarantees that Rust can
provide at _possibly_ the expense of absolute performance.

There's a safety-focused subset of the Rust community that takes the presence
of an 'unsafe' in a body of Rust code very seriously and this project is
participating in that conversation.
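
A minimal sketch of the trade-off described above, using bounds checks as the example. The function names `sum_checked` and `sum_unchecked` are hypothetical and chosen only for illustration; nothing here is taken from Toshi or tantivy. `get_unchecked` is the standard-library slice method that skips the bounds check, and calling it requires `unsafe`.

```rust
/// Sums the first `n` elements using ordinary indexing.
/// Every `data[i]` is bounds-checked: an out-of-range `n` panics
/// instead of reading past the end of the slice.
fn sum_checked(data: &[u64], n: usize) -> u64 {
    let mut total = 0;
    for i in 0..n {
        total += data[i];
    }
    total
}

/// The same loop with the bounds check removed.
///
/// Safety: the caller must guarantee `n <= data.len()`.
/// If that is violated, the behavior is undefined.
unsafe fn sum_unchecked(data: &[u64], n: usize) -> u64 {
    let mut total = 0;
    for i in 0..n {
        // No check here: correctness now depends entirely on the caller.
        total += unsafe { *data.get_unchecked(i) };
    }
    total
}

fn main() {
    let data = [1u64, 2, 3, 4];

    // Safe version: the language enforces the invariant.
    assert_eq!(sum_checked(&data, 4), 10);

    // Unsafe version: this call site takes on the proof obligation.
    let fast = unsafe { sum_unchecked(&data, 4) };
    assert_eq!(fast, 10);

    // `get` is the safe alternative to panicking: it returns None when out of range.
    assert_eq!(data.get(10), None);
}
```

A crate that wants the compiler to reject any `unsafe` in its own code can put `#![forbid(unsafe_code)]` at the top of its crate root. Whether Toshi does so is not stated in this thread.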

~~~
bluejekyll
This is very well stated. I would clarify one thing in particular. I think
(it's hard to speak for everyone) most people have settled on the idea that
high-level libraries like this one should avoid unsafe and rely on lower-level
libraries for anything that needs to work around safety restrictions.

This allows the “unsafe” libraries to have fewer lines of code and more
isolated testing, with greater coverage.

~~~
nickpsecurity
You can also, if they're written in C, verify the absence of everything you'd
worry about using the automated and lightweight tooling available for C. Then,
fuzz it to be sure. Then, port it back to equivalent rust maybe with C2Rust.
Then, run equivalence testing to make sure the two have the same output. That's
my current recommendation for medium-assurance apps in Rust that have to use
unsafe code. Oh yeah, gotta turn overflow checks on in the safe Rust for best
effect, though that might carry a performance hit.
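
On the overflow-check point: by default, integer overflow panics in debug builds and wraps in release builds, and setting `overflow-checks = true` under `[profile.release]` in `Cargo.toml` turns the check on for release builds too. Independently of build flags, the standard library's explicit arithmetic methods make the behavior visible at the call site. A small sketch, where `total_len` is a hypothetical helper used only for illustration:

```rust
/// Adds up a list of lengths, returning None if the sum would overflow u32.
/// `checked_add` reports overflow explicitly regardless of build profile.
fn total_len(parts: &[u32]) -> Option<u32> {
    parts
        .iter()
        .try_fold(0u32, |acc, &p| acc.checked_add(p))
}

fn main() {
    // Normal case: the sum fits.
    assert_eq!(total_len(&[1, 2, 3]), Some(6));

    // Overflow is reported as None instead of silently wrapping.
    assert_eq!(total_len(&[u32::MAX, 1]), None);

    // The other explicit flavors, for comparison.
    assert_eq!(u32::MAX.wrapping_add(1), 0);
    assert_eq!(u32::MAX.saturating_add(1), u32::MAX);
    assert_eq!(u32::MAX.overflowing_add(1), (0, true));
}
```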

I'm still thinking about the rest of concurrency problems and side channels
that Rust's type system doesn't cover. Trying to find out what's as easy as
above. For concurrency, I'm eyeballing Eiffel's SCOOP, DTHREADS, and
eventually will study whatever Pony is doing.

~~~
bluejekyll
> You can also, if they're written in C, verify the absence of everything
> you'd worry about using the automated and lightweight tooling available for
> C. Then, fuzz it to be sure. Then, port it back to equivalent rust maybe
> with C2Rust.

Not sure I totally understand (or if I do, totally agree). For all FFI in
Rust, we have to drop down to unsafe interfaces. I like the model that's
generally happening in this area where there is an auto-generated sys crate
(with bindgen), then an FFI crate that does all the Rust <-> C interop. This
tends to work pretty well.

A lot of C/C++ (native library) validation tools just work with Rust
artifacts, so I don't personally see a lot of value for writing in C as well
as Rust, unless we're talking about a rewrite from C to Rust.

> still thinking about the rest of concurrency problems and side channels that
> Rust's type system doesn't cover.

What things are you thinking about beyond the Send/Sync traits? I've found
those to be very expressive, and appropriately restrictive.

~~~
nickpsecurity
"A lot of C/C++ (native library) validation tools just work with Rust
artifacts,"

Tools like RV-Match and Astree Analyzer can prove absence of entire classes of
errors using static analysis. Frama-C and SPARK Ada can do that with
annotations, with a high amount of the proof automated. There are optional
runtime checks for stuff not proven if you don't want to or can't do it by hand. C
also has lots of open-source tools for static/dynamic analysis, test
generation in many forms, and so on. In my Brute-Force Assurance concept, you
convert a program into C and/or Java to throw all the automated tooling you
can at them, fixing whatever real issues are found. Then, the last benefit is
C has a formally-verified compiler should someone want to eliminate compilers
as an error source. Rust-to-C is still valuable for all these reasons.

So, is the Rust tooling at the level that you can do all that without
converting it to C?

"What things are you thinking about beyond the Send/Sync traits? I've found
those to be very expressive, and appropriately restrictive. "

I don't use Rust yet. I just know what I learn from folks like you who do.
When I studied it, the docs said their type system blocked some but not all
concurrency problems. I don't know where it's at currently on various types of
races, deadlocks, and livelocks. Those are the main problems improved models
or analyzers should try to solve.

~~~
bluejekyll
KRust looks like the rv-match for Rust:
[https://news.ycombinator.com/item?id=16970050](https://news.ycombinator.com/item?id=16970050)

But comparing Rust to C is difficult. On the one hand they compile down to the
same thing, on the other when not using unsafe, the type system itself allows
you to express strong proofs, especially with state machines.

I don’t generally need to make use of these tools, I would point you to the
ring project, where they are very interested in formal proofs:
[https://github.com/briansmith/ring](https://github.com/briansmith/ring) they
might have some interesting options and a few of the people over there are
very capable in answering these questions.

> I don't know where it's at currently on various types of races, deadlocks,
> and livelocks.

For dataraces, there is a very strong story. In terms of deadlocks, I’m not
aware of anything here. For livelocks, I think it’s generally possible to define
the state of a system such that you can make sure you aren’t in conflict with
other threads, so better, but not fundamentally different than other threaded
languages. In other words, if you define cross thread state in an appropriate
way, you can prove that you can’t get into a livelock situation.
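
To make the data-race part concrete, here is a minimal sketch using only the standard library. `parallel_count` is a hypothetical function written for illustration. The shared counter has to be wrapped in `Arc<Mutex<_>>` before `thread::spawn` will accept the closure; swapping `Arc` for `Rc` fails to compile because `Rc` is not `Send`.

```rust
use std::sync::{Arc, Mutex};
use std::thread;

/// Spawns `threads` workers that each increment a shared counter
/// `per_thread` times, then returns the final value.
fn parallel_count(threads: usize, per_thread: u64) -> u64 {
    // Arc provides shared ownership across threads; Mutex provides
    // exclusive access to the value inside.
    let counter = Arc::new(Mutex::new(0u64));

    let handles: Vec<_> = (0..threads)
        .map(|_| {
            let counter = Arc::clone(&counter);
            thread::spawn(move || {
                for _ in 0..per_thread {
                    // The lock guard is released at the end of this statement.
                    *counter.lock().unwrap() += 1;
                }
            })
        })
        .collect();

    for handle in handles {
        handle.join().unwrap();
    }

    let total = *counter.lock().unwrap();
    total
}

fn main() {
    // No increments are lost, because unsynchronized access never compiled.
    assert_eq!(parallel_count(4, 1000), 4000);
}
```

What the type system does not check is lock ordering: two threads that take two mutexes in opposite orders will still compile and can still deadlock, which matches the "not aware of anything here" answer above.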

~~~
nickpsecurity
"KRust looks like the rv-match for Rust"

I was in that comment section bringing up RV-Match. The difference is that RV-
Match is a bunch of static-analysis functionality built on a comprehensive
semantics for C. KRust is a tiny subset of Rust with RV-Match-style analysis.
I did bookmark it in case it could be useful for someone trying to build that.

"where they are very interested in formal proofs"

The only thing I mentioned that would be doing a formal proof was the
lightweight stuff like Frama-C and SPARK Ada that don't require proof so much
as annotation (eg like borrow checker) that run through automated provers. The
rest was all push-button or no-proof-needed tools that say something has no
errors, specific errors, or a mix of them and false positives. RV-Match and
Astree Analyzer will straight-up tell you that specific errors don't exist
with few false positives. The test generators that work on the structure of your
code get deep into lots of errors from different combinations of inputs and
control flow. None of these require proof. People building these usually test
them on FOSS projects, sometimes the same ones, often finding new errors in
them.

"In terms of deadlocks, I’m not aware of anything here. "

Good to know. That'll be the focus area for now since you said livelocks are in a
good situation. I'll still keep an eye on the side for anything checking that.

"they might have some interesting options and a few of the people over there
are very capable in answering these questions."

Thank ya very much. I'll hit them up, too.

------
supernintendo
Cute!

~~~
fxfan
dang?

------
zozbot123
Yet another great piece of work by the Rewrite-It-In-Rust Task Force (a sister
organization to the Rust Evangelism Strike Force)!

