
Elasticsearch for Beginners: Indexing Your GMail Inbox - SuperKlaus
https://github.com/oliver006/elasticsearch-gmail
======
gecko
I've been doing a whole blog series on this as well:
[http://bitquabit.com/post/having-fun-python-and-
elasticsearc...](http://bitquabit.com/post/having-fun-python-and-
elasticsearch-part-1/). It's interesting to see a different take on it.

~~~
fheisler
I recently posted something similar focused on NLP analysis of Gmail messages
in Python/pandas, with some notes on storing in Elasticsearch as well; glad to
see others are covering that side of it, as I'd love to come back to this
project and take it a bit further!
[http://engineroom.trackmaven.com/blog/monthly-challenge-
natu...](http://engineroom.trackmaven.com/blog/monthly-challenge-natural-
language-processing/)

------
jptoto
This is a totally shameless plug but if you'd like to learn Elasticsearch from
scratch, I've got an introductory course up on Pluralsight.
[http://www.pluralsight.com/courses/elasticsearch-for-
dotnet-...](http://www.pluralsight.com/courses/elasticsearch-for-dotnet-
developers)

------
jrgnsd
This is the first time I've seen GitHub READMEs used as a blogging tool. Is
this common? I've started linking to a Vagrant/Ansible repo for my setup /
code-intensive posts, but having the code and the text encapsulated in one
repo is quite novel.

~~~
saurik
[https://github.com/raganwald-
deprecated/homoiconic/blob/mast...](https://github.com/raganwald-
deprecated/homoiconic/blob/master/README.markdown)

------
chdir
There are a couple of libraries listed below. Would using any of them make
life easier with ElasticSearch + Python?

\- [https://github.com/elasticsearch/elasticsearch-
py](https://github.com/elasticsearch/elasticsearch-py) (low level lib, from
ES)

\- [https://github.com/elasticsearch/elasticsearch-dsl-
py](https://github.com/elasticsearch/elasticsearch-dsl-py) (high level lib,
from ES)

\-
[https://github.com/mozilla/elasticutils](https://github.com/mozilla/elasticutils)
(high level lib from Mozilla)

There are a few more, but they are either obsolete or don't have much
traction. There's also django-haystack, but that's specific to django.

~~~
mrmondo
We (Infoxchange) use the official elasticsearch-py. It's not without its
frustrations, but we've integrated Elasticsearch with part of our large
database of health services in Australia as part of a complete rewrite
(Django) of a very old application (Perl).

You can try it out here:
[https://www2.hsnet.nsw.gov.au](https://www2.hsnet.nsw.gov.au)

Try searching for something like: psychiatrists near sydney cbd

The site itself is mostly just a front-end (as designed by / for a client)
running across several Docker containers, with the database (PostgreSQL)
indexed into Elasticsearch and queried via the Elasticsearch API / Python
ES libraries.

If you're interested in Elasticsearch with Python / Django check out our
(pretty crappy at the moment) tech blog: [https://ixa.io](https://ixa.io) or
our github: [https://github.com/infoxchange](https://github.com/infoxchange)

------
Fritsdehacker
I've been thinking about making my own email searchable with elasticsearch.
The main thing holding me back is security. With elasticsearch listening on
localhost:9200, anyone with local access can read all your mail. Even if you
did this on a computer over which you have full control, even a tiny
breach would leak all your mail.

I realize this tutorial is just meant to get started with elasticsearch and
not meant as a tool to make your email searchable. Still would be interesting
to take this to the next level.

------
spaceman10
Not sure if people are still here. I tried moving through this and it appears
to be failing on the import... I am running a vagrant and get everything
installed just fine.

I don't know how to invoke the script properly...

I've tried so many ways. This seems like it would give results... though it
does nothing much.

python index_emails.py test.mbox

Any help or tips are appreciated! This has been a fun project so far.
Stumbling at the end. Thanks!

~~~
spaceman10
Trial and error, plus throwing -vv after python2.7, eventually produced some
usage directions on standard out.

python2.7 index_emails.py --infile=test.mbox

above is working
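
For anyone hitting the same wall: the invocation that works passes the mbox
path as a named `--infile` option, not as a bare positional argument. Below is
a minimal sketch of that style of option handling. It uses the standard
library's `argparse` purely for illustration; the script's real option parsing
may differ, and `parse_cli` is a hypothetical helper, not part of the repo.

```python
import argparse


def parse_cli(argv):
    """Parse a named --infile option (hypothetical stand-in for the script's options)."""
    parser = argparse.ArgumentParser(prog="index_emails.py")
    parser.add_argument("--infile", default=None, help="path to the mbox file to index")
    # parse_known_args collects unrecognized tokens instead of exiting on them.
    options, leftovers = parser.parse_known_args(argv)
    return options, leftovers


if __name__ == "__main__":
    # Named option: the path is bound to options.infile.
    print(parse_cli(["--infile=test.mbox"]))
    # Bare positional: nothing is bound to infile, so there is no input file to index.
    print(parse_cli(["test.mbox"]))
```

With this kind of parser, a bare `test.mbox` is never bound to the input-file
option, which would be consistent with the script appearing to do nothing.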

------
pp19dd
Just a word of caution: elasticsearch allows everyone access to the indexed
data, by default. If you're doing this on a world-reachable machine with
sensitive data, you should probably lock it down or make sure it's locked
down.

There are a number of authentication solutions, but they require additional
configuration: plugins like jetty and elasticsearch-http-basic.

------
Animats
The whole point of GMail was supposed to be that it was searchable. Did Google
break that, or what?

If there's a demand for this, it might be worthwhile to build IMAP servers
with more indexing. It's easy to request searches with IMAP, but the
performance can be a problem for IMAP servers that aren't real databases.

~~~
mrweasel
If you want to get started with something like ElasticSearch it helps to have
a large dataset to play with. Using your gmail/mail archive gives you plenty
of weirdly shaped data to have fun with.

I don't think this has much to do with the search in gmail not being
sufficient or broken.

------
superasn
Very interesting. This is a very useful and practical way of learning new
things instead of reading an article about it. I don't know python programming
but I was able to understand each and every bit of it and I will be coming
back to this if I ever need to incorporate Elasticsearch.

------
gcr
The 'notmuch' mail indexing system uses Xapian. I can grep through my 200k
messages in seconds.

[http://notmuchmail.org/](http://notmuchmail.org/)

Since it's implemented as a "library" of sorts, there are interfaces for
emacs, command line, GTK, mutt, ...

------
ladzoppelin
Wow, little tutorials like this with easily attainable data are so helpful.
Thanks for posting.

------
bluefox
Analysing the "Turn mbox into JSON" section

[http://paste.lisp.org/display/145050](http://paste.lisp.org/display/145050)

------
tterrace
What was the performance like for those queries?

~~~
SuperKlaus
I have > 110k messages indexed and responses came back within ~300ms, less
than 100ms with a warm cache.

------
thrownaway2424
Couldn't this be "Indexing your mbox files"? It seems applicable to any
mailbox that is in or can be in that format. Except for the x-gmail-labels
part, of course.

Anyway if you do feel like you want to accomplish the stated purpose of
finding which emails are taking up space, you can search in gmail with the
word "larger", as in "larger:20MB".
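
On the point that this applies to any mbox file: here is a minimal sketch of
reading an mbox with Python's standard `mailbox` module and turning each
message's headers into a JSON-ready dict. It is not the tutorial's own code.
`SAMPLE_MBOX`, `mbox_to_dicts`, and the `labels` key are hypothetical names
chosen for illustration, and only the `X-Gmail-Labels` handling is
Gmail-specific.

```python
import json
import mailbox
import os
import tempfile

# A tiny hand-written mbox with a single message (illustrative data only).
SAMPLE_MBOX = (
    "From alice@example.com Thu Jan  1 00:00:00 2015\n"
    "From: alice@example.com\n"
    "To: bob@example.com\n"
    "Subject: Hello\n"
    "X-Gmail-Labels: Inbox,Important\n"
    "\n"
    "Just testing.\n"
    "\n"
)


def mbox_to_dicts(path):
    """Return one dict of lower-cased headers per message in the mbox at `path`."""
    box = mailbox.mbox(path)
    try:
        docs = []
        for message in box:
            # Repeated headers (e.g. Received) collapse to the last value here.
            doc = {name.lower(): str(value) for name, value in message.items()}
            # Only Gmail exports carry this header; other mbox files simply lack it.
            if "x-gmail-labels" in doc:
                raw = doc.pop("x-gmail-labels")
                doc["labels"] = [label.strip() for label in raw.split(",")]
            docs.append(doc)
        return docs
    finally:
        box.close()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "sample.mbox")
        with open(path, "w", newline="\n") as handle:
            handle.write(SAMPLE_MBOX)
        print(json.dumps(mbox_to_dicts(path), indent=2))
```

Message bodies, attachments, and encoded headers are left out of this sketch.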

------
curiously
so when should you use elasticsearch? can't you get away with doing

    
    
        SELECT id FROM pages WHERE title LIKE "%elastic"

~~~
frankwiles
Those sorts of SQL queries aren't very fast compared to a dedicated
index/search system like ElasticSearch. Especially when millions of rows are
concerned.

Think of search engines like ElasticSearch and Solr as being purpose built for
"search" rather than ad hoc querying.

They offer more advanced searching features like faceting and synonyms. If
your example had been "SELECT id FROM pages WHERE title LIKE '%dog'" you could
set things up so that matches for 'dog', 'dogs', 'doggie', 'puppy', 'pup',
'canine', and 'man's best friend' all returned the same results.
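
As a rough illustration of "setting things up" for synonyms, here is a sketch
of index settings that define a synonym token filter and an analyzer that uses
it, written as a Python dict to be sent as JSON when creating an index. The
names `dog_synonyms` and `title_with_synonyms` are made up for this example.
The field mapping that points a `title` field at the analyzer is omitted
because its shape depends on the Elasticsearch version. Multi-word synonyms
are left out to keep the example simple.

```python
import json

# Hypothetical names: "dog_synonyms" and "title_with_synonyms" are illustrative only.
index_settings = {
    "settings": {
        "analysis": {
            "filter": {
                "dog_synonyms": {
                    "type": "synonym",
                    # Comma-separated terms within one rule are treated as equivalent.
                    "synonyms": ["dog, dogs, doggie, puppy, pup, canine"],
                }
            },
            "analyzer": {
                "title_with_synonyms": {
                    "type": "custom",
                    "tokenizer": "standard",
                    # Lowercase first so the synonym rule matches regardless of case.
                    "filter": ["lowercase", "dog_synonyms"],
                }
            },
        }
    }
}

if __name__ == "__main__":
    # This JSON would be the request body when creating the index.
    print(json.dumps(index_settings, indent=2))
```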

~~~
jobposter1234
While you are absolutely correct that this won't scale, and doesn't have a lot
of advanced features like faceting and synonyms, it's still a useful
technique. I use the SQL LIKE operator all the time, especially when I just
need a simple way of searching my data.

If you have under a million rows, LIKE is reasonably performant.
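
To make the trade-off concrete, here is a small sketch using Python's built-in
`sqlite3` module. It assumes a toy `pages` table; the helper names and sample
titles are made up, and other databases will behave differently in the
details. It runs the LIKE query and asks SQLite for its query plan, which for
a leading-wildcard pattern is a scan over the rows and not an index lookup.

```python
import sqlite3


def build_pages(titles):
    """Create an in-memory pages table holding the given titles (ids start at 1)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE pages (id INTEGER PRIMARY KEY, title TEXT)")
    conn.executemany("INSERT INTO pages (title) VALUES (?)", [(t,) for t in titles])
    return conn


def ids_matching(conn, pattern):
    """Return ids of pages whose title matches the LIKE pattern."""
    rows = conn.execute(
        "SELECT id FROM pages WHERE title LIKE ? ORDER BY id", (pattern,)
    )
    return [row[0] for row in rows]


def query_plan(conn, pattern):
    """Return the human-readable detail column of SQLite's query plan."""
    rows = conn.execute(
        "EXPLAIN QUERY PLAN SELECT id FROM pages WHERE title LIKE ?", (pattern,)
    )
    return [row[-1] for row in rows]


if __name__ == "__main__":
    conn = build_pages([
        "Getting started with Elastic",
        "PostgreSQL tips",
        "Why we went elastic",
        "Elasticsearch for beginners",
    ])
    # '%elastic' only matches titles that END in "elastic" (case-insensitive in SQLite).
    print(ids_matching(conn, "%elastic"))
    # '%elastic%' matches the word anywhere in the title.
    print(ids_matching(conn, "%elastic%"))
    print(query_plan(conn, "%elastic"))
```

Note that the original `LIKE "%elastic"` pattern only matches titles ending in
that word. Matching anywhere in the title needs `%elastic%`.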

------
piratebroadcast
Would LOVE to see this in Ruby rather than Python. My boss wants me to learn
ElasticSearch.

~~~
andyl
me too...

