
Analyzing Your Browser History Using Python and Pandas - applecrazy
https://applecrazy.github.io/blog/posts/analyzing-browser-hist-using-python/
======
drej
It seems a bit odd to export the data on the command line only to then work with
it in pandas, given that it's a stock sqlite3 database. I'd just use
sqlite3.connect and pd.read_sql_query directly. (I know, the db will be locked
while Chrome is running, but you can always shutil.copy it to a temp location.)
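
A minimal sketch of that approach, assuming the default macOS profile path (adjust for your OS) and Chrome's usual `urls` table schema:

```python
import os
import shutil
import sqlite3
import tempfile

import pandas as pd

def load_chrome_history(history_path):
    # Copy the database first: Chrome locks the live file while running.
    tmp = os.path.join(tempfile.mkdtemp(), "History")
    shutil.copy(history_path, tmp)
    with sqlite3.connect(tmp) as conn:
        # Chrome timestamps are microseconds since 1601-01-01 (the WebKit
        # epoch), hence the 11644473600-second offset to the Unix epoch.
        return pd.read_sql_query(
            "SELECT datetime(last_visit_time/1000000 - 11644473600,"
            " 'unixepoch') AS visited_at, url"
            " FROM urls ORDER BY last_visit_time DESC",
            conn)

# Assumed default profile path on macOS:
# df = load_chrome_history(os.path.expanduser(
#     "~/Library/Application Support/Google/Chrome/Default/History"))
```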

~~~
applecrazy
Hmm something to keep in mind for next week’s post.

~~~
paprikawuerzung
Had the same idea of issuing the sqlite query directly from Python.
Unfortunately the browser has to be closed in that case, so I just copied the
sqlite "History" file to a temp location:
[https://github.com/davidgengenbach/chrome_history](https://github.com/davidgengenbach/chrome_history)

------
rahiel
These two loops:

    
    
      raw_data = [line.split('|', 1) for line in [x.strip() for x in content]]
    

can be simplified to a single loop:

    
    
      raw_data = [line.strip().split('|', 1) for line in content]
    

Using str.replace here is also non-idiomatic:

    
    
      plt.title('Top $n Sites Visited'.replace('$n', str(topN)))
    

How about using str.format instead:

    
    
      plt.title('Top {n} Sites Visited'.format(n=topN))

~~~
hultner
You could even use the new formatted string literals (f-strings) from Python 3.6.

    
    
      f"Top {topN} Sites Visited"

------
whym
Neat. For fellow Firefox users, it appears that this does the trick:

    
    
        sqlite3 ~/.mozilla/firefox/$YOUR_PROFILE_ID/places.sqlite "SELECT datetime(visit_date/1000000,'unixepoch'),url FROM moz_historyvisits JOIN moz_places ON place_id=moz_places.id ORDER BY visit_date DESC" > hist.txt
    

(Looks like the timestamp is in UTC here.)
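
The same query can be read straight into pandas, along the lines suggested elsewhere in the thread (copy places.sqlite to a temp location first if Firefox is running; the function name here is just for illustration):

```python
import sqlite3

import pandas as pd

def load_firefox_history(places_path):
    with sqlite3.connect(places_path) as conn:
        # Firefox stores visit_date as microseconds since the Unix epoch (UTC).
        return pd.read_sql_query(
            "SELECT datetime(visit_date/1000000, 'unixepoch') AS visited_at,"
            " url FROM moz_historyvisits"
            " JOIN moz_places ON place_id = moz_places.id"
            " ORDER BY visit_date DESC",
            conn)
```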

----

Does the original post's code perhaps count two visits to the same URL as one?
Something like

    
    
        SELECT datetime(visit_time/1000000-11644473600,'unixepoch'),urls.url 
        FROM visits JOIN urls ON visits.url=urls.id
        ORDER BY visit_time DESC;
    

might be worth a try.

~~~
nerdponx
Now if only Buku [0] used this to poll and automatically synchronize my
bookmarks...

[0]: [https://github.com/jarun/Buku](https://github.com/jarun/Buku)

------
linux2647
You can skip exporting the SQLite output to a file entirely by reading directly
from the database:
[http://pandas.pydata.org/pandas-docs/version/0.20/generated/pandas.read_sql.html](http://pandas.pydata.org/pandas-docs/version/0.20/generated/pandas.read_sql.html)

------
7ewis
Just tried this, but am getting an error with the datetime. How could I fix
this?

    
    
      Traceback (most recent call last):
        File "hist.py", line 14, in <module>
          data.datetime = pd.to_datetime(data.datetime)
        File "/usr/local/lib/python2.7/site-packages/pandas/core/tools/datetimes.py", line 373, in to_datetime
          values = _convert_listlike(arg._values, True, format)
        File "/usr/local/lib/python2.7/site-packages/pandas/core/tools/datetimes.py", line 306, in _convert_listlike
          raise e
      pandas._libs.tslib.OutOfBoundsDatetime: Out of bounds 
      nanosecond timestamp: 1601-01-01 00:00:00
    

The time on my Mac is correct, and datetime returns the correct time, so what's going on?

~~~
applecrazy
Aah, I got this strange issue too. Search for that 1601-01-01 timestamp in the
exported text file and change it to something else that isn't before the Unix
epoch.
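
An alternative that avoids hand-editing the export: pandas can coerce out-of-range values to NaT instead of raising. (Chrome writes 1601-01-01, the WebKit epoch, for missing timestamps, which is below the minimum a nanosecond Timestamp can represent.) A sketch with made-up values:

```python
import pandas as pd

# errors="coerce" turns unparseable or out-of-range timestamps into NaT
# instead of raising OutOfBoundsDatetime; dropna() then discards them.
raw = pd.Series(["1601-01-01 00:00:00", "2017-10-28 12:34:56"])
parsed = pd.to_datetime(raw, errors="coerce")  # the 1601 entry becomes NaT
clean = parsed.dropna()
```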

~~~
7ewis
That sorted it, cheers!

How long does Chrome keep its history database? I use Chrome Canary and standard
Chrome, but only had a few hundred results in each, a lot fewer than I was
expecting. I can't remember the last time I cleared my history either.

Cool graph though anyway!

~~~
applecrazy
I’m not sure how long the history is stored, but I had over 2k entries for six
months of history.

And, by the way, I’m not done with this series. Next week I’ll try to predict
browsing patterns using this dataset.

------
andythemoron
Awesome first content post, man! I like yours better than mine:
[https://andythemoron.com/blog/2017-03-12/Scraping-For-Fun-And-Profit](https://andythemoron.com/blog/2017-03-12/Scraping-For-Fun-And-Profit)

One note you may want to add (or step you may want to tweak): the initial
sqlite3 command won't work if Chrome is currently open. I just copied the
History folder somewhere else and did the same thing, as per the superuser
thread you helpfully linked to.

------
leblancfg
There's a video (now pay-walled) from Cameron Davidson-Pilon of Bayesian
Methods for Hackers fame, where he takes this approach to build a basic Markov
chain model to predict patterns in his browsing.

[https://dataorigami.net/collections/all/products/create-markov-chains-using-your-chrome-browsing-history](https://dataorigami.net/collections/all/products/create-markov-chains-using-your-chrome-browsing-history)

------
lowpro
Great post for a quick read! You could maybe have taken your Google searches
from within the URLs, plus Stack Overflow question titles, and done
word-frequency analysis to learn which languages you code in most often and
what types of problems you have most. Good intro, though.

------
daKexx
I don't think that the pulled data is *extremely messy*. You can also read
`hist.txt` with:

    
    
      pd.read_csv('hist.txt', sep='|', header=None)

~~~
applecrazy
Not quite. I tried that, but some of the entries had vertical bars in their
URLs (Amazon URLs), making pandas complain and refuse to import the data.

~~~
daKexx
Ahh, I see. Then you should maybe use the Amazon example in your post to clarify
why you're using this approach. Nice post, by the way.

~~~
applecrazy
Thanks for the feedback. Will edit my post soon.

------
schemathings
I have a far more mundane use case: you have friends over and play songs off
YouTube all night; after the fact, you'd like to publish a YouTube playlist,
but that's not simple using browser history. I always figured I'd have to
write a plugin and find clusters of short YouTube sessions and go from there,
but as a backend guy I might play around with this stack instead.

------
aportnoy
> Since Pandas probably doesn’t do this out of the box

It does, use df[column_name].str.split(expand=True), where df is your
DataFrame.

~~~
applecrazy
I had pipe symbols in URLs. Will it split only on the first pipe symbol?

Edit: nevermind, it does do that
([https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.split.html](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.split.html))
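
A quick check with a made-up row, using the n parameter to split on the first pipe only:

```python
import pandas as pd

# Assumed sample row in the hist.txt shape: "timestamp|url", where the URL
# itself may contain '|'. With n=1 the split stops after the first pipe.
df = pd.DataFrame({"raw": ["2017-10-28 12:00:00|https://example.com/a|b=c"]})
parts = df["raw"].str.split("|", n=1, expand=True)
parts.columns = ["datetime", "url"]
```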

~~~
cstuder
Why do you mind? You throw away the rest of the URL in the next step anyway.
:-)

------
spondyl
I'm learning Python currently and this is the sorta stuff that's neat to see!

I haven't had any reason to use pandas yet so it'll be a good excuse.

It's a project that's not very long, requires data everyone has and gives a
cool insight at the end.

Nice!

~~~
cdubzzz
Also learning Python and intending to learn about pandas and other data tools
with this:
[https://github.com/jakevdp/PythonDataScienceHandbook](https://github.com/jakevdp/PythonDataScienceHandbook)

I haven’t made it very far yet, but so far it seems worth recommending.

------
JoshMnem
I think the location on Linux (Ubuntu 16.04) is:

    
    
        ~/.config/google-chrome/Default
    

or:

    
    
        ~/.config/chromium/Default

~~~
applecrazy
Thanks. Will fix soon.

------
rasz
This type of history stat is built in, out of the box, in the Vivaldi browser.

------
lucasverra
What about the cmd+shift+N stuff ;)

~~~
applecrazy
I’m not sure; there might be a way to obtain the incognito cache and determine
browser history from that, if that’s what you’re after.

------
paprikawuerzung
Thanks a lot for this code! If someone is interested and does not want to copy
the code themselves, I copied it into a Jupyter notebook:
[https://github.com/davidgengenbach/chrome_history](https://github.com/davidgengenbach/chrome_history)

------
ReverseCold
Vivaldi does this for me in the history page itself. Pretty weird to use
Python just to see a distribution of what sites you visited.

IIRC, stock Chrome also has a way to view this.

~~~
applecrazy
Author of the post here.

The main purpose of the post was to demonstrate collection and cleaning of
data and give an overview of it through a basic visualization.

In the future, I'd like to show how this data is a gold mine of information,
using it to predict browsing trends, create a profile of interests, and more.

Mainly to show why ad tracking/selling of user browser history is bad, but
also to teach some data science techniques along the way.

~~~
delta1
OT: which Hugo theme are you using? Custom?

~~~
applecrazy
I'm using the Minos Hugo theme.

