Search at Slack (slack.engineering)
197 points by isabellat on Feb 8, 2017 | 91 comments



Slack is great, but its adoption by popular open source communities is problematic.

Why? Because opensource communities are on the free plan, which limits search once you have 10k messages. I've had experiences where I wanted to revisit a question I had asked in a Slack channel the previous week, and been unable to find it.

As a result, everyone burns out faster b/c the same questions get asked and answered over, and over.

Couple this with the fact that channels are not indexed by Google and you get a black box where valuable Q&A content and discussion goes to die.


I agree. I am dismayed when I see open source projects using Slack in lieu of IRC or a mailing list. It means I'd be forced to use their awful client (which is slow, buggy, and far too resource intensive for a chat application) or use their awful IRC integration. This is all in addition to the issue you raise of Slack being a black hole beholden to a profit motivated entity.

Just use IRC. It's practically impossible to avoid Slack at any startup now, but I'd love to be able to avoid it in FOSS.


I've been a die-hard defender of IRC and yet, I have been using it less and less since I started using Discord.

I can finally have a single platform for communication. Voice chat, text chat, group chats, friends list, async communication, unlimited logs (no 10k max msg nonsense), webhooks/integrations that let me do far more than IRC bots ever did. All of it under one account. Oh and the client doesn't suck, unlike Slack's. It's fast. The voice quality is superb.

As far as productivity goes, I get far more done with it than I ever did with IRC. The addition of being able to hop on voice very quickly is insanely good. Screensharing and video chat coming this year as well, I'm pretty excited.

It's to the point that I bought Discord Nitro (their premium offering) the day it was released, for no other reason than to give them money.

I hope the question of protocol openness gets resolved; until then, IRC just doesn't cut it for me anymore. IRCCloud.com helps, but their interface is super slow with lots of channels and IRC itself simply has no support for the thousands of improvements that have been made in communications the past 30-something years.


Discord looks amazing, but its marketing materials say it's aimed at gamers and compare the product to Skype, not Slack. I actually joined Reactiflux on Discord today and was pretty confused as to what I had gotten myself into and why the messages were being read aloud in a computer voice.


Hehe yeah they sure are playing the "targeting gamers" card really well. I think it's kind of obvious they're shooting for far more than that. They're competing with Slack without competing with Slack, it's very clever.

Short of video calls though, Discord is essentially a drop-in replacement to Slack. We've been using it at my company, it works so damn well. I moved to it for our open source community as well. I use Matterbridge for a three-way mirror between IRC and Gitter as well: https://github.com/42wim/matterbridge/


I do wonder when they will make a business edition.


They're marketed towards casual users but Discord is probably the best chat and collaboration software I've used so far. I definitely prefer it to Slack for professional-type stuff.

You can use Discord any way you want. If you join some popular Discord server, odds are it'll be full of spam and Internet humor, but obviously you can do whatever you like on your own server(s).


I like Discord a lot, it's far and away better than TeamSpeak and Vent. It's also better than Slack for private group conversations. But it is still a closed system owned by a for profit company.


I agree with your concerns of open source projects using slack, but:

> I'd be forced to use their awful client

Who forces you? You can use Slack on the web, can't you? You don't need to have yet another browser engine running on your computer.

> Just use IRC.

Please don't … IRC is the opposite of user friendly: it has no good web interface, so the casual user won't come in because he doesn't want to install and learn new software (an IRC client). But Slack isn't the only option here, and it's not even the best option by far: Gitter[1], Mattermost[2] and Discord[3] are alternatives to IRC which aren't Slack.

[1]: https://gitter.im/

[2]: https://about.mattermost.com/, they don't provide chat hosting but several organisations do host Mattermost servers (Framasoft for instance https://framateam.org/)

[3]: https://discordapp.com/ targeted at gamers, which is a good sign of quality, but fine for general purpose use.


> Who forces you ? You can use slack on the web can't you

Slack's web client is still slow. Their "native" client is just a Chrome wrapper over their web app, with some glue code to hook up notifications.

> it has no good web interface so the casual user won't come in

https://webchat.freenode.net works fine. If you need or want help with a project and can't be asked to spend a few moments opening up an IRC client (native or web), then I don't know what to tell you.

> Gitter, Mattermost, and Discord

Gitter and Discord are both closed source, proprietary systems made by companies who want to make money. Open source projects shouldn't rely on them if possible.

Mattermost is OK, but requires running your own server. IRC is free, and there are public servers specifically made for open source projects.


The consensus on our team (40% macOS/60% Linux) is that the Slack client is too slow. Usability around search can be improved as well. It has reached the point where we've started looking at alternatives.


We use Windows and have found that version of the client to hang sporadically at least once a day. Pretty frustrating. I've just given up on it and solely use the web app.


> Usability around search can be improved as well.

Curious if this statement is about the most recent experience post improvements mentioned in tfa?


The client might be a legitimate issue, but being an information black-hole is no different than IRC. If you wanted a history, you should have made your client or a bot log that history.


If IRC was as simple and elegant and extensible as some would suggest, there wouldn't be a posse of persistent chat companies worth tens or hundreds of millions of dollars.


IRC is simple enough for developers of and those using open source projects. Slack and their ilk have more of a place handling internal communications where non-technical people need to be involved.


Or Discord, despite it being intended for something else.


Have you considered mattermost ?


Please also consider Rocket.Chat, you can test our SaaS version at https://rocket.chat/deploy for free.


That's an easy problem to fix http://slackarchive.io/


Yup. We have a 2800-person community Slack for an open source project. Slack Archive set us up for free, and it works well.

But we still direct people to Stack Overflow since the Q&A is more discoverable there.


Why are people using a chat app like a wiki to begin with


Because codifying, organizing, and keeping up-to-date general knowledge on a wiki is hard work, whereas banging out a quick answer to a specific question on slack is comparatively easy.

Which maybe should be a challenge to anybody looking to build the next generation of knowledge repositories.


Because a bad Wiki people use is better than a good Wiki people don't.


Why aren't they? One of the great strengths of the internet is that information doesn't usually go away; most of the time, you can refer back to something said 10 years ago with little extra effort. Slack breaks that.


Honestly, I'd expect these "public" Slack groups of FOSS projects to have search-engine-accessible archives, just like Usenet and mailing lists do through Google Groups. Why do I need to know what Slack even is, to be able to find+read the answer someone gave someone else to a question over Slack?


Isn't that the norm for chat apps though? It's not like they're breaking anything... people are just expecting something that doesn't really exist?


Every chat app I've ever used going back to ICQ keeps permanent logs, except for Slack. (Technically Slack does too, it just keeps them on its own server and won't let you look unless you pay.)


Actually no: you can look all you like without paying, it's just kind of clumsy. If you "export your data from Slack" (yourteam.slack.com/services/export), you get the whole history, even on the free plan.

I've repeatedly considered writing a bot that would—on a regular schedule—poke this page with a headless browser to generate a dump, download said dump, and ingest it into ElasticSearch (which I'd then expose through a web search, or maybe just spit out batched archive pages into a static-site S3 bucket and let Google index them.) Such a bot would be a good companion to https://github.com/rauchg/slackin for FOSS teams.

But I haven't done any of that yet, because I get the feeling that putting enough attention on this little "feature" would get it quickly locked down.
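
For the "static archive pages" half of that idea, here is a minimal sketch in Python. It assumes the export has already been downloaded and unpacked, and that it has one folder per channel holding `YYYY-MM-DD.json` files, each a JSON array of messages with `user`, `text` and `ts` fields. That layout, and the names `render_day` and `build_archive`, are assumptions for illustration; the headless-browser download step and any user-ID lookup are left out.

```python
import html
import json
from datetime import datetime, timezone
from pathlib import Path


def render_day(messages):
    """Render one day's messages (a list of dicts) as an HTML fragment."""
    rows = []
    # Sort by the message timestamp so the page reads top to bottom.
    for msg in sorted(messages, key=lambda m: float(m["ts"])):
        when = datetime.fromtimestamp(float(msg["ts"]), tz=timezone.utc)
        # Escape user-supplied content before embedding it in HTML.
        user = html.escape(msg.get("user", "unknown"))
        text = html.escape(msg.get("text", ""))
        rows.append(f"<p><b>{user}</b> [{when:%H:%M} UTC] {text}</p>")
    return "\n".join(rows)


def build_archive(export_dir, out_dir):
    """Write one HTML page per channel and day found under export_dir."""
    written = []
    # Assumed layout: <export_dir>/<channel>/<YYYY-MM-DD>.json
    for day_file in sorted(Path(export_dir).glob("*/*.json")):
        messages = json.loads(day_file.read_text(encoding="utf-8"))
        page = Path(out_dir) / day_file.parent.name / f"{day_file.stem}.html"
        page.parent.mkdir(parents=True, exist_ok=True)
        page.write_text(render_day(messages), encoding="utf-8")
        written.append(page)
    return written
```

The resulting directory could be synced to any static host for a search engine to crawl; whether the export keeps working this way is up to Slack.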


Dunno, I have 10 year old irc chat logs sitting on my computer.


It's also problematic in that it chisels away at what it means for a project to be open source. The focus seems to be purely on the software licence, and increasingly less about the wider project tooling.

GitHub and Slack provide a huge amount of utility. But they also feel hollow to me. It feels harder and harder to opt-out of using closed tools.


What is Matrix / Riot missing today to get these open source communities to start using an open source federated chat room protocol?


The 2 links for UI/UX comparison ... https://support.discordapp.com/hc/en-us/articles/11500046858... https://get.slack.help/hc/en-us/articles/202528808-Searching... ... I think on Discord it is based on the context of what the secondary thing is you are doing outside of chat (gaming - it will show you what game the others are playing - or any app based on file process - which is pretty slick) and on Slack it will be primarily thinking you are at work so it's all about quicker access to files/presentations/etc.


IRC isn't that hard. Just register a channel on freenode (which exists for the purpose of facilitating open source!), there's already a web interface [1]. You can stick a nice link on your project's webpage. It's even less work for users than Slack, you don't have to register, just punch in a nickname.

Someone in your project can manage to set up a logbot that dumps logs onto a webserver, which will be indexed by Google. I suspect there are services that will do it for you, so you might not even have to set up the bot yourself. If there isn't one I'd have half a mind to build one, if it gets more projects using IRC.

[1] https://webchat.freenode.net/
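
The core of such a logbot is small. A rough sketch of the logging step in Python, assuming the bot already holds a connection and receives raw server lines of the usual `:nick!user@host PRIVMSG #channel :text` shape (the function name and the log format are made up for illustration; connecting, PING handling and writing files are omitted):

```python
import re
from datetime import datetime, timezone

# Shape of a channel message as relayed by an IRC server:
#   :nick!user@host PRIVMSG #channel :message text
PRIVMSG = re.compile(
    r"^:(?P<nick>[^!\s]+)!\S+ PRIVMSG (?P<target>#\S+) :(?P<text>.*)$"
)


def format_log_line(raw_line, now=None):
    """Return a plain-text log line for a channel message, else None."""
    match = PRIVMSG.match(raw_line.rstrip("\r\n"))
    if match is None:
        # Not a channel message (PING, JOIN, numerics, ...): nothing to log.
        return None
    now = now or datetime.now(timezone.utc)
    return f"[{now:%Y-%m-%d %H:%M:%S}] <{match['nick']}> {match['text']}"
```

Appending these lines to one text file per channel per day, served from a plain web directory, is already enough for a search engine to index.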


I have experienced this as well with the Chef Community Slack channel. It is a wonderful resource and it's super convenient to have easy access from my phone, etc but there is so much useful information in there that won't be accessible by others in the future.


Looks like Solr is being used to rank messages using a few features, and the top n are then re-ranked at the application layer using another set of features. This would constrain the search quality, as 1) you have no control over how these two sets of features interact and 2) messages ranked low based on the first set of features could be highly relevant according to the second set. Another possible approach could have been using a custom scorer to influence scoring at the Lucene level, thereby combining all the features at a single point. Was this approach evaluated? If so, any insights as to what could be a limitation?


Great question. We rank in two stages for a number of different reasons. First, it would be too expensive from a performance perspective to rank all of the messages in your corpus. Second, some of the features that we use to rank are much more easily accessed at the application layer. It would require more of an engineering effort to make these signals accessible in SOLR. The first pass which is done in SOLR is a high recall, low precision pass. The second pass through our custom ranker is a high precision pass. It is possible that we would lose some messages that might end up being important in the first pass but it's a tradeoff between performance and accuracy. Hope this helps answer your question.
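
To make the tradeoff concrete, here is a toy two-stage ranker in Python. It is only a sketch of the general pattern described above, not Slack's implementation: the first pass stands in for the Solr query (any shared term, newest first, truncated), and the `score` function passed to the second pass is a placeholder for the learned ranker.

```python
def first_pass(query_terms, messages, limit):
    """Cheap, high-recall stage: keep messages sharing any query term."""
    terms = {t.lower() for t in query_terms}
    hits = [m for m in messages if terms & set(m["text"].lower().split())]
    # Newest first, then truncate: anything cut here is never rescored.
    hits.sort(key=lambda m: m["ts"], reverse=True)
    return hits[:limit]


def second_pass(candidates, score, top_k):
    """Costlier, high-precision stage: rescore only the survivors."""
    return sorted(candidates, key=score, reverse=True)[:top_k]


messages = [
    {"text": "deploy failed again", "ts": 1, "has_link": True},
    {"text": "deploy docs here", "ts": 2, "has_link": False},
    {"text": "lunch?", "ts": 3, "has_link": False},
    {"text": "deploy deploy", "ts": 4, "has_link": False},
]

# With a tight limit the linked message (ts=1) never reaches the ranker.
narrow = first_pass(["deploy"], messages, limit=2)
# With a wider limit it survives and the second pass can promote it.
wide = first_pass(["deploy"], messages, limit=3)
best = second_pass(wide, score=lambda m: m["has_link"], top_k=1)
```

The `limit` on the first pass is the knob being discussed: raising it recovers messages like the one dropped from `narrow`, at the cost of scoring more candidates in the expensive stage.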


Besides the technical answer already given, this is a pretty standard architecture for search ranking and other problems where your fine grained decisions are too expensive to run on everything.


The subject of how bad Slack's search is comes up all the time in talking to friends and co-workers, and I wonder if the described ranking changes are enough.

Usually when I'm searching, I'm looking for a particular message, possibly even one I read earlier that day, and I may know a few things about it, like who sent it and that it had an important link, but I still can't necessarily find it! The results are also presented in a giant cartoony way that makes me page through many pages. Tokenizing my search into "keywords" means that even if I know a substring of what I'm looking for, it doesn't come up as relevant, or the tokenizer tokenized the text differently. This is also why GitHub search can't find a lot of things.

What I would want in a search experience is the equivalent of Control-F over the list of messages I've actually seen.
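
A small Python sketch of the difference being described, with a deliberately naive tokenizer (real analyzers differ, and this is not how Slack or GitHub tokenize): keyword matching only succeeds on whole tokens, while a Control-F-style search matches any substring.

```python
import re


def token_match(query, text):
    """Keyword-style match: every query token must equal a text token."""
    tokens = set(re.findall(r"\w+", text.lower()))
    return all(q in tokens for q in re.findall(r"\w+", query.lower()))


def substring_match(query, text):
    """Control-F-style match: the raw query appears anywhere in the text."""
    return query.lower() in text.lower()


message = "see https://example.com/deploy-runbook for details"

# A partial word is a substring of the message but not one of its tokens.
partial_as_tokens = token_match("runbo", message)
partial_as_substring = substring_match("runbo", message)
```

Here `partial_as_tokens` is false and `partial_as_substring` is true, which is the "I know a substring but it doesn't come up" case.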


Stay tuned! We're working on improving phrase matching as we speak.

As for the Control-F thing... stay tuned on that too :)


This sounds like a step towards making Slack more of a knowledge repository and possibly a wiki replacement. One of my (many) qualms with knowledge repository tools like Google Sites is that searching them is basically useless. Another is that knowledge in these repos becomes stale really quickly. If you can put meaningful information in a Slack post, you can take advantage of the recency-focused nature of chat and smarter searching algorithms like the one described here, and essentially make an internal Google for your organization. Kudos to the Slack team, very interested to see how this evolves.


I've been searching for a solution to this as well. Our Slack holds an amazing amount of information, but it's really difficult to curate that information efficiently.

It'd be cool to highlight a piece of information and insert it into a wiki-style site.


We're working on this at Tettra.co. Building a wiki that works in tandem with Slack. Would love for you to check it out http://tettra.co

Feel free to email me with any questions too - andy@tettra.co


Very interesting. We'll definitely check it out.


This is a very informative article. If you're interested in getting started with search relevancy I would also suggest the book: https://www.manning.com/books/relevant-search

Which was very helpful to me.


I second this recommendation, I used Relevant Search to build https://www.findlectures.com


Thanks! (This is Doug Turnbull :-p).


I enjoy reading Doug Turnbull's blog and most of his writing, but I found this book tedious. I purchased it despite reading that terrible first chapter which is available as a free sample. Perhaps it could be that I am already too well-versed in the subject.

Any book where you learn at least one thing new is always worth it, so I do not regret having this book in my library.


Thanks for the kind words and the useful critique. I found it tricky to strike a balance between new budding search devs and people more advanced. I may write a more advanced book at some point.


And thank you for considering my reply as "kind words". :) I would not have been so harsh if I knew you were going to read the reply! That first chapter though. So many points repeated.

For those that want to read more: http://opensourceconnections.com/about-us/doug-turnbull/ or really anything at http://opensourceconnections.com/blog/


Haha. That chapter was revised a lot and ultimately shortened a ton in the final version. I agree it had a lot of repetition. Hopefully it's better in the final printing.


I read the advance version of the first chapter and never the one in the actual book when it finally was released. Will see if I have that original version and do a comparison.

I tried to see your talk at last year's Elasticon, but it was packed due to the small room. Not surprising since Elasticsearch tends to downplay search in favor of analytics/logging. So few talks regarding pure search.


Thank you for the book suggestion. Happy you found the article informative!


Open-source desperately needs more search-tool projects.

Lucene/Solr/Elasticsearch are nice, but they need competition, especially outside the Java world.


Shameless plug: I'm working on RediSearch, an open-source, in-memory Redis module written in C that does search. Not nearly as full-featured as Lucene and friends, of course, but it's a very young project. http://redisearch.io


I'm not really sure what you're trying to say in your comment. Sure, there's a lot of Java going on there, but does that really matter? They're tools and you can interact with them from the language of your choice.

I get the competition part, but none of the above are exactly stagnant, so I'm wondering what you'd like to see more competition achieve.

Not trying to be difficult, just curious in case I missed something from your comment :)


> so I'm wondering what you'd like to see more competition achieve

There's a lot that competition can help improve, for example in the areas of performance, robustness, and also in functionality, e.g. better NLP for better understanding of queries and translating them into results, image/audio search, etc. And competition can also come up with surprising new features that we can't even think of right now.

This, plus it is (imho) quite weird that we have only one source of code for one of the basic branches of Computer Science.


Take a look at http://bitfunnel.org/, an open-source version of some of the algorithms from Bing. It's not ready for production use but you may enjoy following along in the implementation of a large search engine. The design notes and blog are also worth reading.


IMO this is like saying Hadoop or Spark or Kafka needs more competition outside the Java (JVM) world. Elasticsearch, at least, has a dead simple web API which makes it accessible from any platform, along with a slew of clients and integrations with other tools/platforms.


This. ElasticSearch is one of only two pieces of Java (the other one being Flyway) that's so good we use it anyway. Not too bad anymore with Docker, it's a REST API after all.

In the end, ES really proved to be the least bad search server out there. The real crux isn't search, it's language. And as Lucene is made by technical linguists (a really rather special bunch) and Java is still universities' darling, it's unlikely their effort can be redone in a non-JVM language anytime soon.


Have you looked at OkLog?


I've used Sphinx (http://sphinxsearch.com) in a previous job. If it fits your needs then it's pretty nice.


Why do you care that those black boxes are implemented in Java?


There are two main reasons I can see: performance is affected by the implementation language, and officially supported SDKs/client libraries are usually written, and get updates, in the implementation's language first, e.g. Elasticsearch and Java.


I thought this was going to be an announcement about how they finally fixed search...

A suggestion: When I search, what I want 99% of the time to happen is that the current window I'm looking at quickly gets filtered to my search query. Ranking doesn't matter, just show exact matches ranked by time.

1% of the time, I want something else.


Stay tuned :)


Hi, I'm one of the authors on the post, happy to answer any questions.


This sounds like a really interesting problem to work on. Are you hiring anyone for this team (esp those new to search)?


My experience as an experienced search developer is that there are very few search specific positions, if any, outside of the Bay Area. My current job is working remotely for a SF based company, where I do everything but search. :(




Awesome article! The article mentions the signals that the model found were most significant for a message. Curious to know if they're listed by order of significance in the article? And if not was wondering whether one or two signals were predominantly more significant than the rest.


Our top signals are age of the message, whether the message contains a link and whether the message is from the channel that you are currently viewing. We have several different models in production right now but those signals are generally the strongest.
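
As a rough illustration of what extracting those three signals might look like, here is a Python sketch. The message fields (`ts`, `text`, `channel`), the feature names and the link regex are assumptions made for the example; the article does not spell out the actual feature definitions.

```python
import re
import time

# Crude link detector, good enough for an illustration.
LINK = re.compile(r"https?://\S+")


def extract_features(message, current_channel, now=None):
    """Turn one message into the three signals named above."""
    now = time.time() if now is None else now
    return {
        # Age of the message, in days.
        "age_days": (now - message["ts"]) / 86400.0,
        # Whether the message contains a link.
        "has_link": 1.0 if LINK.search(message["text"]) else 0.0,
        # Whether it comes from the channel currently being viewed.
        "in_current_channel": (
            1.0 if message["channel"] == current_channel else 0.0
        ),
    }
```

A ranker would then combine these values with learned weights; the weights themselves are not given in the thread.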


Is there any possibility in the future users will be able to search ALL messages from a channel/private message, ever? It seems like Slack search cuts off after a certain point, and doesn't index into archived messages.


The free version includes search up to 10k of your team's most recent messages. The standard version includes unlimited searchable message archives. https://tinyspeck.slack.com/pricing/slack-for-teams


Today I learned Slack has a "relevant" option for search terms. Maybe I should try it again - I had stopped using search entirely because of the results being fairly irrelevant.


My experience is similar. Slack is great, but the whole search experience remains terrible for me. Results are largely irrelevant, jumping back and forth into conversations takes a long time, the UI sidebar has very little space, and if I'm looking for a file it's hard to remember if it was sent as a URL or as a file in slack. Perhaps I'm spoilt by Gmail search, but this is the one area of slack that I think is sorely lacking.


Great work! A few questions for the author(s): In the article, you have listed 9 feature extractors/templates. In the final model, what's the total number (or rough magnitude) of features? How much data (or ballpark estimate) did you train this on? Did you try to deal with potential difference in distribution across your data sampling sources?


I've never used Slack because last I remember they don't allow local installation, therefore no access to any documents that were dropped in, is this still the case? I have been using mattermost instead, and wrote a bot which processes all documents uploaded into mattermost, extracts metadata, creates tags and summaries for each, archives the documents in an ECM -categorizing them by their tags, and opens them up for full text solr search. My understanding is that a custom solution like this wouldn't be possible with Slack, or would otherwise require more hacky solutions?


> they don't allow local installation

This is like saying Ford sells cars that they don't allow you to fly.

If you want something that flies, buy a plane.


I'd love it if they made it easier to search just the channel or conversation you have in focus. Having the filter autocomplete based on what channel you're on seems like a good middle ground.


If you use the shortcut 'Command-F' it will search the current channel or conversation by auto-populating the search bar with 'in:channel_name'. Hope that's what you are looking for! Here's a list of all shortcuts: https://get.slack.help/hc/en-us/articles/201374536-Slack-key....


Thank you! This is what I get for navigating with my mouse.


Neat. We've been building a Learning to Rank plugin for Elasticsearch. Feedback and contributions very welcome

https://github.com/o19s/elasticsearch-learning-to-rank


I've also been building a learning to rank plugin for elasticsearch, we might want to sync up.

I hadn't set it up to sync to GitHub as it was just internal development, but I've started the sync and it will show up at https://github.com/wikimedia/search-ltr soon.

It's got a bit more of the integration with Elasticsearch put together, including storing models in cluster state and a REST interface for managing them. It's a bit more of a direct port of the Solr plugin rather than a rewrite from the ground up, so there are also some oddities that don't yet make sense. Refactors will certainly be done. It's also tied a little less directly to RankLib, such that I can convert and load in MART models trained by LightGBM or XGBoost, which have done pretty well in my offline tests and are able to utilize resources on my training machine much more efficiently than RankLib's LambdaMART (although in terms of results, the RankLib implementation is pretty good).


Neat! Want to email me at dturnbull AT o19s.com

We store models as a custom scripting language which takes care of distributing the model around the cluster, caching and basic CRUD operations. This was the hard thing to figure out, at first we looked at a REST plugin but it seemed cumbersome and hard to integrate with the query DSL. But I'm curious how you guys got around those pain points:)


Question for the author: how do you actually deploy your model? Do you have a dependency on Spark in your production system?


Since it's just a dot product between the learned weights and the feature vector, we do this in the application layer as lacksconfidence surmised.


Being that this is an SVM, which is typically evaluated as a simple linear sum of weights, I imagine they reimplemented that in the application layer. Would be curious how they handled the normalization steps (reimplement that as well?)


Yep. We normalize our features as part of training, and the stdevs of each feature are part of the resulting model, along with the weights. (The means are always 0 because of the way we construct our training set.) The weights we use in production are actually normalized_weight / stdev.
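
The reason this works: with zero means, a normalized feature is just x / stdev, so w * (x / stdev) equals (w / stdev) * x and the division can be folded into the weights once, ahead of time. A small Python check of that identity, with made-up numbers and function names (not the production code):

```python
def fold_weights(normalized_weights, stdevs):
    """Pre-divide each learned weight by its feature's standard deviation."""
    return [w / s for w, s in zip(normalized_weights, stdevs)]


def score(weights, features):
    """Linear model score: a plain dot product."""
    return sum(w * x for w, x in zip(weights, features))


# Illustrative values only.
normalized_weights = [0.5, -2.0]
stdevs = [2.0, 4.0]
raw_features = [4.0, 8.0]

# Option A: normalize the features at query time, then score.
normalized_features = [x / s for x, s in zip(raw_features, stdevs)]
score_a = score(normalized_weights, normalized_features)

# Option B: fold the stdevs into the weights once, score raw features.
score_b = score(fold_weights(normalized_weights, stdevs), raw_features)
```

Both options give the same score, so the application layer only needs the folded weights and a dot product. If the means were not zero, the mean term would add a constant offset to every score, which does not change the ordering within a result set.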


Hey, Slackers, you can't find what isn't there.

Most of the time, I search for information because I don't know anything about that part of the software. I never found this kind of information in Slack.

Parse the company docs, or our rep, and now we're talking.



