It does need to be said that when Lucene had a feature set this small, it was also pretty tiny. And if those are your needs, you could still download it today and it would probably run on a modern JVM:
lucene-1.4.3.jar 2004-11-29 14:13 316K
It is a bit bigger, of course, but that's because it already had stemmers, multilingual support, multiple query parsers (including phrase queries), RAM and disk directory implementations, and a bunch of other advanced features you can easily see by unzipping the jar file.
And you could always have used Lucene directly as an embedded search library. Amazon does, for example, for their customer-facing search, because their scalability patterns do not align with either Solr or Elasticsearch.
And if you do use Lucene directly, you can pick just the libraries that apply to your use case. I think at minimum you can get away with maybe three jars (core, queryparser, analyzers-common), which comes to just over 5 MB for the latest Lucene and includes things like: http://blog.mikemccandless.com/2021/03/open-source-collabora...
Sometimes there is a disconnect between an implementation that is very flexible and messaging that only shows the "use everything" garden path.
Can you explain this to a non-programmer?
But basically, I suspect it works like this.
The vast majority of the actual search magic (vector analysis, heavy math, optimized index management) is in Lucene. But Lucene does not care much about getting data into the system (JSON, CSV, XML, etc.) or getting it out; it just has an internal representation of documents and fields. It also does not do any sort of multi-index management, which is needed for scale.
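The "documents and fields, plus an index" idea can be sketched as a toy inverted index in Python. This is purely illustrative (the data here is made up, and Lucene's real data structures are vastly more sophisticated), but it shows the core shape: documents are just bags of named fields, and the index maps each term to the documents containing it.

```python
from collections import defaultdict

# Each "document" is a dict of named fields, loosely mirroring
# Lucene's internal document/field representation.
docs = {
    1: {"title": "green tea", "body": "a cup of green tea"},
    2: {"title": "evergreen trees", "body": "pine and fir stay green"},
}

# Build an inverted index: term -> set of doc ids containing it.
index = defaultdict(set)
for doc_id, fields in docs.items():
    for field_value in fields.values():
        for term in field_value.lower().split():
            index[term].add(doc_id)

def search(term):
    """Exact-term lookup: searching 'green' will not match 'evergreen'."""
    return sorted(index.get(term.lower(), set()))

print(search("green"))      # -> [1, 2]
print(search("evergreen"))  # -> [2]
```

Everything around this core (parsing input formats, schemas, distribution) is the part Solr and Elasticsearch add.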
So both Solr and Elasticsearch build on top of Lucene to provide the user-friendly and scaling parts:
- Solr lets you send data in multiple formats; Elasticsearch sticks to JSON.
- Solr has multiple ways to build search queries (URL params, XML, JSON, changing over time); Elasticsearch sticks to JSON.
- Solr is a bit rigid about how the schema (field collection and type definitions) is done; Elasticsearch is a bit more hands-off.
- Solr has composable pre-processor chains and explicit field analyzer chains; Elasticsearch focuses more on logs and has external pre-processors and a custom scripting language.
- Solr uses Apache ZooKeeper to coordinate distributed state; Elasticsearch rolled their own.
- Solr has one particular way to split data into replicas and shards and move them from one server to another; Elasticsearch has another.
- Solr is perfectly happy to run as a single node = server = collection = core; Elasticsearch starts with a fully distributed cloud setup.
All of these are trade-offs on top of the actual search. That's why most consultants work with both Solr and Elasticsearch: the most difficult search optimization concepts are the same for both.
Bloomberg, I think, uses Solr directly and builds on top of that. They contributed machine-learned ranking to Solr, for example. They also use Elasticsearch for log analysis, I believe.
Amazon's needs are different again. Their data comes in through different routes (directly from databases?), their multi-tier replication and sharding strategies have different needs, they don't need a user-facing interface, etc. So they take Lucene directly and build the rest on top, then contribute back to Lucene, which makes those improvements available to both Solr and Elasticsearch. Everybody benefits in the end.
And some companies convince themselves that search isn't essential to them. They end up with awful search, not much better than an SQL LIKE query, and when they do UX evaluations of how people use their site, they say: look, nobody uses our crappy search, so let's not put any time into fixing it!
You can use single-node Solr if you want a UI, want to send data in multiple formats, and may eventually need to scale.
You can use Elasticsearch if you know you are starting big and want to deal with scaling from the start.
You have the choice. It is not just clustered Elasticsearch vs code your own.
It may vary depending on whether you have a lot of data, but if so, you probably have people to handle lots-of-data problems anyway.
It's short and to the point. And then I implemented all that ... in PHP and MySQL :)
It feels daunting at first, but once you understand what it wants you to do, it's actually not that hard (for this particular paper, and this particular approach).
However, you do want to employ a stemming library to normalize word forms.
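To illustrate what stemming buys you, here is a deliberately naive suffix-stripping sketch in Python. A real stemming library (e.g. the Porter algorithm, available via NLTK or Snowball) handles far more cases correctly; this toy version just shows the idea of collapsing word forms before indexing.

```python
# Naive suffix stripping: NOT a real stemmer, just an illustration.
# The suffix list and minimum-stem length of 3 are arbitrary choices.
SUFFIXES = ("ing", "ies", "es", "s", "ed")

def naive_stem(word):
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print(naive_stem("queries"))  # -> "quer"
print(naive_stem("indexes"))  # -> "index"
print(naive_stem("running"))  # -> "runn"
```

The point is that "queries" and "query"-like forms normalize toward a common stem, so a search for one form can match documents containing another.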
SQLite is no heavyweight dependency à la Apache Lucene. It also helps that it is bundled in billions of Android and iOS devices.
You would likely be called out if you tried to hide that fact in order to seem more accomplished than you are. If, for instance, your dad is the owner of the company you work at, you'd rightfully be called out if you tried to hide the fact that your accomplishment is lessened by your having copied your father's DNA.
No one is claiming the author did anything wrong in drawing on the source material for his article. However, it is wrong not to provide proper attribution, especially when it's as easy as an "inspired by" link at the bottom.
I recently built a program in Go that takes a Wikipedia article and gets all its dependencies, then ranks concepts in order of "importance" using tfidf*count. It seems quite good for math articles, producing a list of the more basic concepts to understand first.
As a gimmick, I created 36 groups, with group 1 containing the most important concepts:
Spoiler: the top ten concepts were:
Function · Set · Number · Integer · Real Number · Point · Property · Finite · Ring · Relation Theory
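The tfidf*count ranking described above can be sketched roughly like this. The original program is in Go; this is a hypothetical Python illustration with made-up data, and the exact scoring formula is an assumption on my part.

```python
import math
from collections import Counter

# Toy corpus: each "article" is a list of concept terms it mentions.
docs = [
    ["function", "set", "number", "function"],
    ["set", "integer", "number"],
    ["function", "ring"],
]

n_docs = len(docs)
df = Counter(term for doc in docs for term in set(doc))   # document frequency
counts = Counter(term for doc in docs for term in doc)    # total occurrences
total = sum(counts.values())

def score(term):
    # tf * idf weighted again by raw count, one plausible reading
    # of "tfidf*count".
    tf = counts[term] / total
    idf = math.log(n_docs / df[term])
    return tf * idf * counts[term]

ranked = sorted(counts, key=score, reverse=True)
```

On this toy data, "function" comes out on top because it is both frequent and not spread across every document.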
BTW: Glad the archive has a copy, because I do not.
A suggestion, if you want to make the code more friendly. Either:
1. Write a note in README.md that once you exit the program, you lose the indexed data.
2. Save the indices to disk. :)
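For suggestion 2, a minimal sketch, assuming the index is a plain dict of term -> document ids (the real index shape in the code may differ):

```python
import json

# Hypothetical in-memory index: term -> list of document ids.
index = {"green": [1, 2], "evergreen": [2]}

# Persist it on shutdown...
with open("index.json", "w") as f:
    json.dump(index, f)

# ...and on the next run, load it back instead of reindexing from scratch.
with open("index.json") as f:
    index = json.load(f)
```

For anything non-trivial you'd want a more compact format than JSON, but even this removes the "exit and lose everything" surprise.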
thanks for the article
return [token for token in tokens if token]
It iterates over a list (tokens), binding each element to a temporary variable (token). It tests the truthiness of each one (if token), which means None and '' evaluate to False and are excluded, and then it returns a new list of the remaining tokens.
list(filter(lambda x: bool(x), tokens))
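Both forms drop falsy items and produce the same list; a quick check (with made-up tokens):

```python
tokens = ["green", "", "tea", None, "leaf"]

comprehension = [token for token in tokens if token]
filtered = list(filter(lambda x: bool(x), tokens))

print(comprehension)  # -> ['green', 'tea', 'leaf']
assert comprehension == filtered
```

As a side note, `filter(None, tokens)` is the common shorthand for the lambda version.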
Wait... this seems unreasonable?
Isn't this how it works with Elasticsearch or Solr? Or even Google, for that matter. A search for "green" won't return "evergreen".
I really hate Confluence nowadays (and I really liked it in the past, when I only used it occasionally):
- Counterproductive syntax that isn't even the same as Jira's
- Search functionality that cannot find anything (I spend HOURS trying to find my own content, let alone somebody else's). Some days I just wish I had the DB at hand, or a folder with all the pages exported, so I could grep it myself.
- Proprietary file format - good luck migrating away from it
I'm really sad to say these things. I like Atlassian as a company, but the more I use Atlassian products, the more I realize they're bloated and have _broken_ features that were implemented quickly just to tick a few more boxes on their feature sheet.
As part of the search team, I worked on a project where we deliberately rewrote the whole product search engine in Python and Cython, including our own algorithms manipulating documents for deletion, low latency reindexing after edits, and more.
We did this because Solr was too slow for us, and the process of defining custom sort orders (for example, new sort orders defined by machine learning ranking algorithms, which needed to be A/B tested regularly) was awful.
It was a really fun project. One of the slogans for our group at the time was “rewriting search in Python for speed.”
The ultimate system we deployed was insanely fast. I doubt you could have made it faster even writing the entire thing directly in C or C++. It became a reference project in the company to help avoid various flavors of arrogant dismissal of Python as a backend service language for performance critical systems.
There certainly are use cases where Lucene-based solutions aren't the best fit. But I think the claim that you couldn't make something faster by moving away from Python is outlandish.
I read that as a statement that they implemented a proper and bespoke algorithm, not that the speed of Python is greater than C. I am surprised that you read it that way. Who in their right mind would say Python speed is faster than C speed?
>I doubt you could have made it faster even writing the entire thing directly in C or C++.
> a statement that they implemented a proper and bespoke algorithm, not that the speed of Python is greater than C.
Many extension-module implementations in Python are literally as fast as pure C (not just nearly as fast with minor extra CPython overhead, but literally as fast, by deliberately bypassing the CPython VM loop and data model).
But because you're writing regular Python for a production service, and not artificially optimized examples, you will occasionally have to pay extra costs.
Are you perhaps a bit too invested in your own narrative?
I imagine Python performance would be terrible.