Hacker News new | comments | show | ask | jobs | submit login
Analyzing Your Browser History Using Python and Pandas (applecrazy.github.io)
164 points by applecrazy 31 days ago | hide | past | web | favorite | 38 comments



It seems a bit odd to export the data in the command line to then work with it in pandas, given that it's a stock sqlite3 database. I'd just use sqlite3.connect and pd.read_sql_query directly. (I know, the db will be locked when Chrome is running, but you can always shutil.copy it to a temp location.)


Hmm something to keep in mind for next week’s post.


These two loops:

  raw_data = [line.split('|', 1) for line in [x.strip() for x in content]]
can be simplified to a single loop:

  raw_data = [line.strip().split('|', 1) for line in content]
Using str.replace here is also non-idiomatic:

  plt.title('Top $n Sites Visited'.replace('$n', str(topN)))
How about using str.format instead:

  plt.title('Top {n} Sites Visited'.format(n=topN))


You could even use the new format strings from Python3.6.

  f"Top {topN} Sites Visited"


thanks! The code in my post is mainly more of a POC than anything else, but in my next post I’ll try to be more Pythonic.


Neat. As a Firefox user, it appears that this does the trick:

    sqlite3 ~/.mozilla/firefox/$YOUR_PROFILE_ID/places.sqlite "SELECT datetime(visit_date/1000000,'unixepoch'),url FROM moz_historyvisits JOIN moz_places ON place_id=moz_places.id ORDER BY visit_date DESC" > hist.txt
(Looks like the timestamp is in UTC here.)

----

Does the original post's code perhaps count two visits to the same URL as one? Something like

    SELECT datetime(visit_time/1000000-11644473600,'unixepoch'),urls.url 
    FROM visits JOIN urls ON visits.url=urls.id
    ORDER BY visit_time DESC;
might be worth a try.


Now if only Buku [0] used this to poll and automatically synchronize my bookmarks...

[0]: https://github.com/jarun/Buku


Thanks. As a not so savvy Firefox user this was just what I needed.


You can skip the entire SQLite output to file by reading directly from the database: http://pandas.pydata.org/pandas-docs/version/0.20/generated/...


Just tried this, but am getting an error with the datetime. How could I fix this?

  Traceback (most recent call last):
    File "hist.py", line 14, in <module>
      data.datetime = pd.to_datetime(data.datetime)
    File "/usr/local/lib/python2.7/site-packages/pandas/core/tools/datetimes.py", line 373, in to_datetime
      values = _convert_listlike(arg._values, True, format)
    File "/usr/local/lib/python2.7/site-packages/pandas/core/tools/datetimes.py", line 306, in _convert_listlike
      raise e
  pandas._libs.tslib.OutOfBoundsDatetime: Out of bounds 
  nanosecond timestamp: 1601-01-01 00:00:00
The time on my Mac is correct, and datetime gets the correct time?


Aah I got this strange issue too. Search for that 1601-01-01 time stamp in the exported text file and change it to something else that isn’t before the Unix epoch


You can automate this by changing the query line to : sqlite3 data/ChromeHistory "SELECT datetime(last_visit_time/1000000-11644473600,'unixepoch','localtime'),url FROM urls WHERE last_visit_time > 0 ORDER BY last_visit_time desc" > data/hist.txt


That sorted it, cheers!

How long is the Chrome database kept for? I use Chrome Canary and the standard Chrome, but only had a few hundred results in each, a lot less than I was expecting. I can't remember the last time I cleared my history either.

Cool graph though anyway!


I’m not sure how long the history is stored, but I had over 2k entries for six months of history.

And, by the way, I’m not done with this series. Next week I’ll try to predict browsing patterns using this dataset.


e.g.:

  data.loc[data.datetime == '1601-01-01 00:00:00', 'datetime'] = '1970-01-01 00:00:00'


Awesome first content post, man! I like yours better than mine: https://andythemoron.com/blog/2017-03-12/Scraping-For-Fun-An...

One note you may want to add (or step you may want to tweak) is that the initial sqlite3 command won’t work if chrome is currently open. I just copied the History folder to somewhere else and did the same thing, as per the superuser thread you helpfully linked to.


There's a video (now pay-walled) from Cameron Davidson Pilon of Bayesian Methods for Hackers fame, where he takes this approach to build a basic Markov Chain model to predict patterns in his browsing.

https://dataorigami.net/collections/all/products/create-mark...


Great post for a quick read! You could maybe have taken your google searches from within the URL and Stack Overflow question titles and done word frequency analysis and maybe learned what languages you code in most often/what type of problems you have most as well. Good intro for a quick read though.


I don't think that the pulled data is *extremely messy(. You can also read `hist.txt` with:

  pd.read_csv('hist.txt', sep='|', header=None)


Not quite. I tried that but some of the entries had vertical bars in URLs (Amazon urls), making Pandas complain and refuse to import the data.


Ahh I see. Then you should maybe us the Amazon example in your post to clarify why you are using this approach. Nice post by the way.


Thanks for the feedback. Will edit my post soon.


I have a far more mundane use case - you have friends over, play songs off of youtube all night, after the fact, would like to publish a playlist on youtube, but it's not simple using browser history. I always figured I'd have to write a plugin and find clusters of short youtube sessions and go from there, but as a backend guy, I might play around with this stack instead.


> Since Pandas probably doesn’t do this out of the box

It does, use df[column_name].str.split(expand=True), where df is your DataFrame.


I had pipe symbols in URLs. Will it split only on the first pipe symbol?

Edit: nevermind, it does do that (https://pandas.pydata.org/pandas-docs/stable/generated/panda...)


Why do you mind? You throw away the rest of the URL in the next step anyway. :-)


I'm learning Python currently and this is the sorta stuff that's neat to see!

I haven't had any reason to use pandas yet so it'll be a good excuse.

It's a project that's not very long, requires data everyone has and gives a cool insight at the end.

Nice!


Also learning Python and intending to learn about pandas and other data tools with this: https://github.com/jakevdp/PythonDataScienceHandbook

I haven’t made it very far yet, but so far it seems worth recommending.


Thanks for the kind words! Be sure to stay tuned next week, where we’ll be using the same data to predict browsing patterns


I think the location on Linux (Ubuntu 16.04) is:

    ~/.config/google-chrome/Default
or:

    ~/.config/chromium/Default


Thanks. Will fix soon.


This type of history stats is buildin out of the box in Vivaldi browser.


What about the cmd+shift+N stuff ;)


I’m not sure, there might a way to obtain the incognito cache and determine browser history from that, if that’s what you’re after.


Vivaldi does this for me in the history page itself. Pretty weird to use Python just to see a distribution of what sites you visited.

Iirc stock chrome also has a way to view this.


Author of the post here.

The main purpose of the post was to demonstrate collection and cleaning of data and give an overview of it through a basic visualization.

In the future, I'd like to show how this data is a gold mine of information, using it to predict browsing trends, create a profile of interests, and more.

Mainly to show why ad tracking/selling of user browser history is bad, but also to teach some data science techniques along the way.


OT: which Hugo theme are you using? Custom?


I'm using the Minos Hugo theme.




Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | DMCA | Apply to YC | Contact

Search: