
Extracting 10 years of USAspending.gov data into CouchDB - bbgm
http://www.full360.com/blog/Extracting-Government-Spending-Data-Talend-Stored-CouchDB
======
chaosmachine
off topic, but that logo sure looks familiar...

<http://images.google.com/images?q=xbox+360+logo>

~~~
ks
And the domain is "www.full360.com" as well. Perhaps they originally wanted to
be a games blog :-)

~~~
ramarnat
ha! No, we were never meant to be a games blog - Full 360 was about touching
all points of analytics in the enterprise. There are only so many ways to
represent 360 degrees, so it ended up that way unintentionally.

------
bbgm
Disclosure: Part of my responsibilities at AWS include the AWS Public Data
Sets program.

~~~
smoody
Nice. As an aside, would it be possible to share your thoughts on how CouchDB
performs when loaded up with that much data?

~~~
ramarnat
Loading the data, including parsing the XML and converting it to JSON, ran at
about 50,000 records per hour on a c1.medium AWS instance.

Just transforming the data from one JSON format to another and loading it into
a new CouchDB is much faster - about 200,000 records per hour. The server does
trip over sometimes on the bulk load and requires a restart. This happens
about once every 600-700k records.

Reading the data is extremely quick. While creating the views on an existing
database is slow, once they are built, accessing the data through the view
keys is very fast.
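
The bulk-load-with-restart pattern described above can be sketched roughly as
follows. This is a hedged illustration, not the author's actual loader: the
database name `usaspending`, the batch size, and the retry/back-off numbers
are all assumptions; only the `_bulk_docs` endpoint itself is standard
CouchDB API.

```python
# Sketch: batched loading into CouchDB via _bulk_docs, retrying a batch
# after a pause when the server trips over (as described in the comment).
import json
import time
import urllib.request

COUCH_URL = "http://localhost:5984/usaspending"  # assumed local CouchDB


def chunked(docs, size=1000):
    """Yield successive batches of `size` docs (batch size is an assumption)."""
    for i in range(0, len(docs), size):
        yield docs[i:i + size]


def bulk_load(docs, retries=3):
    """POST each batch to CouchDB's _bulk_docs, backing off and retrying on failure."""
    for batch in chunked(docs):
        body = json.dumps({"docs": batch}).encode()
        req = urllib.request.Request(
            COUCH_URL + "/_bulk_docs",
            data=body,
            headers={"Content-Type": "application/json"},
        )
        for attempt in range(retries):
            try:
                urllib.request.urlopen(req)
                break  # batch accepted, move on to the next one
            except OSError:
                time.sleep(5 * (attempt + 1))  # back off, then retry the batch


if __name__ == "__main__":
    # Hypothetical documents, purely for illustration.
    bulk_load([{"_id": "award-%d" % i, "amount": i * 100} for i in range(5000)])
```

Retrying the whole batch is safe here only because `_bulk_docs` with fresh
`_id`s will report conflicts rather than duplicate data on a partial re-send.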

~~~
smoody
Thanks!

------
ashot
This would be a great addition to the public data sets, though I imagine for
that to happen it would need some sort of viable plan to keep the data in sync.

~~~
ramarnat
Now that I have this up, I was hoping to work with the usaspending.gov team to
get a feed or an extension to the API that gives me the records changed since
the last upload, then update the AWS snapshot with those, on the same monthly
timeline that usaspending.gov follows.

