
Search at Slack - isabellat
https://slack.engineering/search-at-slack-431f8c80619e#.cqkhzbv5d
======
subpixel
Slack is great, but its adoption by popular opensource communities is
problematic.

Why? Because opensource communities are on the free plan, which limits search
once you have 10k messages. I've had experiences where I wanted to revisit a
question I had asked in a Slack channel the previous week, and been unable to
find it.

As a result, everyone burns out faster b/c the same questions get asked and
answered over and over.

Couple this with the fact that channels are not indexed by Google and you get
a black box where valuable Q&A content and discussion goes to die.

~~~
jakebasile
I agree. I am dismayed when I see open source projects using Slack in lieu of
IRC or a mailing list. It means I'd be forced to use their awful client (which
is slow, buggy, and far too resource intensive for a chat application) or use
their awful IRC integration. This is all in addition to the issue you raise of
Slack being a black hole beholden to a profit motivated entity.

Just use IRC. It's practically impossible to avoid Slack at any startup now,
but I'd love to be able to avoid it in FOSS.

~~~
scrollaway
I've been a die-hard defender of IRC and yet, I have been using it less and
less since I started using Discord.

I can finally have a single platform for communication. Voice chat, text chat,
group chats, friends list, async communication, unlimited logs (no 10k max msg
nonsense), webhooks/integrations that let me do far more than IRC bots ever
did. All of it under one account. Oh and the client doesn't suck, unlike
Slack's. It's fast. The voice quality is superb.

As far as productivity goes, I get far more done with it than I ever did with
IRC. The addition of being able to hop on voice very quickly is insanely good.
Screensharing and video chat coming this year as well, I'm pretty excited.

It's to the point that I bought Discord Nitro (their premium offering) the day
it was released, for no other reason than to give them money.

I hope the question of protocol openness gets resolved; until then, IRC just
doesn't cut it for me anymore. IRCCloud.com helps, but their interface is
super slow with lots of channels and IRC itself simply has no support for the
thousands of improvements that have been made in communications the past
30-something years.

~~~
subpixel
Discord looks amazing but its marketing materials say it's aimed at gamers
and compare the product to Skype, not Slack. I actually joined Reactiflux on
Discord today and was pretty confused as to what I had gotten myself into and
why the messages were being read aloud in a computer voice.

~~~
scrollaway
Hehe yeah they sure are playing the "targeting gamers" card really well. I
think it's kind of obvious they're shooting for far more than that. They're
competing with Slack without competing with Slack, it's very clever.

Short of video calls though, Discord is essentially a drop-in replacement to
Slack. We've been using it at my company, it works so damn well. I moved to it
for our open source community as well. I use Matterbridge for a three-way
mirror between IRC and Gitter as well:
[https://github.com/42wim/matterbridge/](https://github.com/42wim/matterbridge/)

~~~
mahyarm
I do wonder when they will make a business edition.

------
ankit-singh
Looks like solr is being used to rank messages using a few features and then
re-ranking the top n at application layer using another set of features. This
would constrain the search quality as 1) you have no control as to how these
two sets of features interact and 2) messages ranked low based on the first set
of features could be highly relevant according to the second set. Another
possible approach could have been using a custom scorer to influence scoring
at lucene level, thereby combining all the features at a single point. Was
this approach evaluated? If so, any insights as to what could be a limitation?

~~~
isabellat
Great question. We rank in two stages for a number of different reasons.
First, it would be too expensive from a performance perspective to rank all of
the messages in your corpus. Second, some of the features that we use to rank
are much more easily accessed at the application layer. It would require more
of an engineering effort to make these signals accessible in SOLR. The first
pass which is done in SOLR is a high recall, low precision pass. The second
pass through our custom ranker is a high precision pass. It is possible that
we would lose some messages that might end up being important in the first
pass but it's a tradeoff between performance and accuracy. Hope this helps
answer your question.
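
For readers who want to see the shape of the two-stage ranking described above, here is a minimal sketch. It is not Slack's implementation: the term-overlap first pass stands in for the SOLR query, and the `Message` fields, the `author_affinity` signal, and the scoring formula are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Message:
    text: str
    age_days: float
    author_affinity: float  # hypothetical signal only known to the application


def first_pass(query, corpus, n):
    """Cheap, high-recall pass: rank by term overlap and keep the top n."""
    terms = set(query.lower().split())
    scored = [(len(terms & set(m.text.lower().split())), m) for m in corpus]
    candidates = [(s, m) for s, m in scored if s > 0]
    candidates.sort(key=lambda sm: sm[0], reverse=True)
    return [m for _, m in candidates[:n]]


def second_pass(candidates):
    """Costlier, high-precision pass over the small candidate set."""
    def score(m):
        # Hypothetical formula: favor close collaborators and recent messages.
        return m.author_affinity + 1.0 / (1.0 + m.age_days)
    return sorted(candidates, key=score, reverse=True)


def search(query, corpus, n=2):
    return second_pass(first_pass(query, corpus, n))


m1 = Message("deploy script failed on staging", 30, 0.1)
m2 = Message("deploy script docs", 10, 0.2)
m3 = Message("ask Ana about deploy", 1, 0.9)
corpus = [m1, m2, m3]

narrow = search("deploy script", corpus, n=2)  # m3 is cut before re-ranking
wide = search("deploy script", corpus, n=3)    # m3 survives and ranks first
```

With `n=2` the message that the second pass would have ranked highest never reaches it, which is the performance versus accuracy tradeoff mentioned in the reply.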

------
dgreensp
The subject of how bad Slack's search is comes up all the time in talking to
friends and co-workers, and I wonder if the described ranking changes are
enough.

Usually when I'm searching, I'm looking for a particular message, possibly
even one I read earlier that day, and I may know a few things about it, like
who sent it and that it had an important link, but I still can't necessarily
find it! The results are also presented in a giant cartoony way that makes me
page through many pages. Tokenizing my search into "keywords" means that even
if I know a substring of what I'm looking for, it doesn't come up as relevant,
or the tokenizer tokenized the text differently. This is also why GitHub
search can't find a lot of things.

What I would want in a search experience is the equivalent of Control-F over
the list of messages I've actually seen.
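
To make the difference concrete, here is a small sketch of keyword matching versus a Control-F style substring filter. The tokenizer (a plain `\w+` split) and the `(timestamp, text)` message shape are assumptions for illustration, not how Slack or GitHub actually tokenize.

```python
import re


def keyword_match(query, message):
    """True only if every query token equals a whole token in the message."""
    tokens = set(re.findall(r"\w+", message.lower()))
    return all(t in tokens for t in re.findall(r"\w+", query.lower()))


def substring_match(query, message):
    """Control-F behavior: case-insensitive substring test."""
    return query.lower() in message.lower()


def ctrl_f(query, seen_messages):
    """Filter (timestamp, text) pairs by substring, newest first."""
    hits = [(ts, text) for ts, text in seen_messages
            if substring_match(query, text)]
    return sorted(hits, key=lambda h: h[0], reverse=True)


msg = "see https://example.com/deploy-runbook for steps"
seen = [(1, "old runbook link"), (3, "lunch?"), (2, "new runbook link")]
```

Here `keyword_match("runb", msg)` fails because "runb" is not a whole token, while `substring_match("runb", msg)` succeeds, and `ctrl_f("runb", seen)` returns the two runbook messages ordered by time.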

~~~
jliszka
Stay tuned! We're working on improving phrase matching as we speak.

As for the Control-F thing... stay tuned on that too :)

------
leothekim
This sounds like a step towards making Slack more of a knowledge repository
and possibly a wiki replacement. One of my (many) qualms with knowledge
repository tools like Google Sites is that searching them is basically
useless. Another is that knowledge in these repos becomes stale really
quickly. If you can put meaningful information in a Slack post, you can take
advantage of the recency-focused nature of chat and smarter searching
algorithms like the one described here, and essentially make an internal
Google for your organization. Kudos to the Slack team, very interested to see
how this evolves.

~~~
trafficlight
I've been searching for a solution to this as well. Our Slack holds an amazing
amount of information, but it's really difficult to curate that information
efficiently.

It'd be cool to highlight a piece of information and insert it into a wiki-
style site.

~~~
andygcook
We're working on this at Tettra.co. Building a wiki that works in tandem with
Slack. Would love for you to check it out [http://tettra.co](http://tettra.co)

Feel free to email me with any questions too - andy@tettra.co

~~~
trafficlight
Very interesting. We'll definitely check it out.

------
joe_fro
This is a very informative article. If you're interested in getting started
with search relevancy I would also suggest the book:
[https://www.manning.com/books/relevant-search](https://www.manning.com/books/relevant-search)

Which was very helpful to me.

~~~
donretag
I enjoy reading Doug Turnbull's blog and most of his writing, but I found this
book tedious. I purchased it despite reading that terrible first chapter which
is available as a free sample. Perhaps it could be that I am already too well-
versed in the subject.

Any book where you learn at least one thing new is always worth it, so I do
not regret having this book in my library.

~~~
softwaredoug
Thanks for the kind words and the useful critique. I found it tricky to strike
a balance between new budding search devs and people more advanced. I may
write a more advanced book at some point.

~~~
donretag
And thank you for considering my reply as "kind words". :) I would not have
been so harsh if I knew you were going to read the reply! That first chapter
though. So many points repeated.

For those that want to read more:
[http://opensourceconnections.com/about-us/doug-turnbull/](http://opensourceconnections.com/about-us/doug-turnbull/)
or really anything at
[http://opensourceconnections.com/blog/](http://opensourceconnections.com/blog/)

~~~
softwaredoug
Haha. That chapter was revised a lot and ultimately shortened a ton in the
final version. I agree it had a lot of repetition. Hopefully it's better in
the final printing.

~~~
donretag
I read the advance version of the first chapter and never the one in the
actual book when it finally was released. Will see if I have that original
version and do a comparison.

I tried to see your talk at last year's Elasticon, but it was packed due to the
small room. Not surprising since Elasticsearch tends to minimize search in
favor of analytics/logging. So few talks regarding pure search.

------
amelius
Open-source desperately needs more search-tool projects.

Lucene/Solr/Elasticsearch are nice, but they need competition, especially
outside the Java world.

~~~
scaryclam
I'm not really sure what you're trying to say in your comment. Sure, there's a
lot of Java going on there, but does that really matter? They're tools and you
can interact with them from the language of your choice.

I get the competition part, but none of the above are exactly stagnant, so I'm
wondering what you'd like to see more competition achieve.

Not trying to be difficult, just curious in case I missed something from your
comment :)

~~~
amelius
> so I'm wondering what you'd like to see more competition achieve

There's a lot that competition can help improve, for example in the areas of
performance, robustness, and also in functionality, e.g. better NLP for better
understanding of queries and translating them into results, image/audio
search, etc. And competition can also come up with surprising new features
that we can't even think of right now.

This, plus it is (imho) quite weird that we have only one source of code for
one of the basic branches of Computer Science.

------
dilap
I thought this was going to be an announcement about how they finally fixed
search...

A suggestion: When I search, what I want 99% of the time to happen is that the
current window I'm looking at _quickly_ gets filtered to my search query.
Ranking doesn't matter, just show exact matches ranked by time.

1% of the time, I want something else.

~~~
jliszka
Stay tuned :)

------
isabellat
Hi, I'm one of the authors on the post, happy to answer any questions.

~~~
throwthisawayt
This sounds like a really interesting problem to work on. Are you hiring
anyone for this team (esp those new to search)?

~~~
donretag
My experience as an experienced search developer is that there are very few
search specific positions, if any, outside of the Bay Area. My current job is
working remotely for a SF based company, where I do everything but search. :(

~~~
chimeracoder
Slack's search team is based in NYC: [https://medium.com/@noah_weiss/starting-
up-slack-s-search-le...](https://medium.com/@noah_weiss/starting-up-slack-s-
search-learning-intelligence-group-in-the-new-nyc-office-
af6523090789#.ghwbqvzgb)

------
danpalmer
Today I learned Slack has a "relevant" option for search terms. Maybe I should
try it again - I had stopped using search entirely because of the results
being fairly irrelevant.

~~~
samcrawford
My experience is similar. Slack is great, but the whole search experience
remains terrible for me. Results are largely irrelevant, jumping back and
forth into conversations takes a long time, the UI sidebar has very little
space, and if I'm looking for a file it's hard to remember if it was sent as a
URL or as a file in slack. Perhaps I'm spoilt by Gmail search, but this is the
one area of slack that I think is sorely lacking.

------
rajhans
Great work! A few questions for the author(s): In the article, you have listed
9 feature extractors/templates. In the final model, what's the total number
(or rough magnitude) of features? How much data (or ballpark estimate) did you
train this on? Did you try to deal with potential difference in distribution
across your data sampling sources?

------
kusmi
I've never used Slack because last I remember they don't allow local
installation, therefore no access to any documents that were dropped in, is
this still the case? I have been using mattermost instead, and wrote a bot
which processes all documents uploaded into mattermost, extracts metadata,
creates tags and summaries for each, archives the documents in an ECM
-categorizing them by their tags, and opens them up for full text solr search.
My understanding is that a custom solution like this wouldn't be possible with
Slack, or would otherwise require more hacky solutions?

~~~
paulcole
> they don't allow local installation

This is like saying Ford sells cars they don't allow you to fly.

If you want something that flies, buy a plane.

------
bearcobra
I'd love it if they made it easier to search just the channel or conversation
you have in focus. Having the filter autocomplete based on what channel you're
on seems like a good middle ground.

~~~
isabellat
If you use the shortcut 'Command-F' it will search the current channel or
conversation by auto-populating the search bar with 'in:channel_name'. Hope
that's what you are looking for! Here's a list of all shortcuts:
[https://get.slack.help/hc/en-us/articles/201374536-Slack-
key...](https://get.slack.help/hc/en-us/articles/201374536-Slack-keyboard-
shortcuts).

~~~
bearcobra
Thank you! This is what I get for navigating with my mouse.

------
softwaredoug
Neat. We've been building a Learning to Rank plugin for Elasticsearch.
Feedback and contributions very welcome

[https://github.com/o19s/elasticsearch-learning-to-rank](https://github.com/o19s/elasticsearch-learning-to-rank)

~~~
lacksconfidence
I've also been building a learning to rank plugin for elasticsearch, we might
want to sync up.

I hadn't set it up to sync to GitHub as it was just internal development, but
I've started the sync and it will show up at
[https://github.com/wikimedia/search-ltr](https://github.com/wikimedia/search-ltr)
soon.

It's got a bit more of the integration with elasticsearch put together,
including storing models in cluster state and a rest interface for managing
them. It's a bit more of a direct port of the solr plugin rather than a
rewrite from the ground up so there are also some oddities that don't yet make
sense. Refactors will certainly be done. It's also tied a little less directly
to RankLib, such that i can convert and load in MART models trained by
lightgbm or xgboost which have done pretty well in my offline tests and are
able to utilize resources on my training machine much more efficiently than
ranklib's LambdaMART (although in terms of results, the ranklib implementation
is pretty good).

~~~
softwaredoug
Neat! Want to email me at dturnbull AT o19s.com

We store models as a custom scripting language which takes care of
distributing the model around the cluster, caching and basic CRUD operations.
This was the hard thing to figure out, at first we looked at a REST plugin but
it seemed cumbersome and hard to integrate with the query DSL. But I'm curious
how you guys got around those pain points:)

------
jtoberon
Question for the author: how do you actually deploy your model? Do you have a
dependency on Spark in your production system?

~~~
lacksconfidence
Being that this is an SVM, which is typically evaluated as a simple linear sum
of weights, I imagine they reimplemented that in the application layer. Would
be curious how they handled the normalization steps (reimplement that as
well?)

~~~
jliszka
Yep. We normalize our features as part of training, and the stdevs of each
feature are part of the resulting model, along with the weights. (The means
are always 0 because of the way we construct our training set.) The weights we
use in production are actually normalized_weight / stdev.
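
A rough sketch of why dividing the weights by the stdevs works, assuming a linear model and zero-mean features as described above. The function names and numbers are invented for illustration and are not Slack's code.

```python
import math


def feature_stdevs(rows):
    """Per-feature standard deviation, assuming every feature has mean 0."""
    n = len(rows)
    dims = len(rows[0])
    return [math.sqrt(sum(r[j] ** 2 for r in rows) / n) for j in range(dims)]


def fold_weights(normalized_weights, stdevs):
    """Production weights: normalized_weight / stdev, per feature."""
    return [w / s for w, s in zip(normalized_weights, stdevs)]


def score(weights, features):
    """Linear model: a plain weighted sum."""
    return sum(w * x for w, x in zip(weights, features))


# Toy training rows whose columns already have mean 0.
rows = [[2.0, -10.0], [-2.0, 10.0]]
stdevs = feature_stdevs(rows)

normalized_weights = [0.5, -1.5]  # pretend these came out of training
production_weights = fold_weights(normalized_weights, stdevs)

raw = [4.0, 5.0]
normalized = [x / s for x, s in zip(raw, stdevs)]

# Scoring normalized features with the trained weights ...
trained_score = score(normalized_weights, normalized)
# ... equals scoring raw features with the folded weights.
production_score = score(production_weights, raw)
```

Since `w * (x / s)` equals `(w / s) * x` term by term, the production path can skip normalization at query time and feed raw features straight into the weighted sum.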

------
notforgot
Hey, Slackers, you can't find what isn't there.

Most of the time, I search for information because I don't know
anything about that part of the software. I have never found this kind of
information in Slack.

Parse the company docs, or our rep, and now we're talking.

