Ask HN: How does Mint.com work?
35 points by grep on July 22, 2010 | 39 comments
Mint.com connects to your bank account, credit card account, etc., and downloads your data. How do they do that? Do they need any special authorization from the banks, or is it an open API? How about security? Can you please explain their back-end system?

The same question applies to Blippy and inDinero. Anyone know what these guys are doing on the back end to get the transaction data?




This is Jessica from inDinero. We use the same technology that Mint uses -- namely, integrating with a third party service called Yodlee. They take care of aggregating financial data through various means. Screen scraping, direct OFX feeds, etc...
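Since I mentioned direct OFX feeds: OFX is a request/response protocol where the client POSTs a tagged payload over HTTPS to the bank's OFX endpoint and gets the statement back as structured data, no HTML parsing involved. A rough sketch in PHP follows; to be clear, this is a toy illustration rather than our or Yodlee's production code, and the endpoint URL, ORG/FID, and account numbers are all invented placeholders.

<?php
// Toy OFX 1.x statement request. Every bank-specific value here
// (URL, ORG, FID, BANKID, ACCTID) is a made-up placeholder.
$ofxRequest = <<<OFX
OFXHEADER:100
DATA:OFXSGML
VERSION:102
SECURITY:NONE
ENCODING:USASCII
CHARSET:1252
COMPRESSION:NONE
OLDFILEUID:NONE
NEWFILEUID:NONE

<OFX>
<SIGNONMSGSRQV1><SONRQ>
<DTCLIENT>20100722120000
<USERID>username
<USERPASS>password
<LANGUAGE>ENG
<FI><ORG>ExampleBank<FID>9999</FI>
<APPID>QWIN
<APPVER>1800
</SONRQ></SIGNONMSGSRQV1>
<BANKMSGSRQV1><STMTTRNRQ>
<TRNUID>1001
<STMTRQ>
<BANKACCTFROM><BANKID>123456789<ACCTID>0000001<ACCTTYPE>CHECKING</BANKACCTFROM>
<INCTRAN><INCLUDE>Y</INCTRAN>
</STMTRQ>
</STMTTRNRQ></BANKMSGSRQV1>
</OFX>
OFX;

// POST it to the bank's OFX endpoint; the response comes back in the
// same tagged format, ready to parse without any scraping.
$ch = curl_init('https://ofx.examplebank.com/ofx');
curl_setopt_array($ch, array(
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_POST           => true,
    CURLOPT_POSTFIELDS     => $ofxRequest,
    CURLOPT_HTTPHEADER     => array('Content-Type: application/x-ofx'),
));
$statement = curl_exec($ch);
curl_close($ch);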

The typical question is, why do Mint and inDinero use Yodlee instead of building the solution out themselves?

1) Security Liability. No startup should ever have to deal with the problems that go with storing passwords to financial accounts. Yodlee is in the business of security; they have direct feeds with major banks, which makes it much easier (and safer) to just integrate with them.

2) Sheer quantity of banks. Screen scraping from so many banks is a pain in the *. It isn't standard, either -- compare the bank website of Wells Fargo to that of a local credit union that asks 5 security questions upon login. In short, it's a brutal nightmare.

3) None of our businesses are in the business of screen-scraping. If Mint had to spend the first year of business integrating with banks, they wouldn't be successful. And even once the integrations are done, you have to maintain them in the event that the bank changes their login page or interface. In short, it's not worth any startup's time to do manual screen scraping themselves.

Would be happy to discuss further if you DM me.


Would you consider working with iPhone app developers to make money tracking and spending actionable? I've wanted this tool for a few years now. If I set my expenses, say utilities, rent, insurance, and car payments, in a system that also tracks my purchases and income, I could see my daily budget for other, more spontaneous expenses. I could open up my app to find what my drinking/nightlife expenses can be today, or tally up the day's spending to see if I'm still on track with that utility bill.

The thing I've found frustrating about financial planning tools is that they seldom give me any useful information about the purchase I want to make right now. Instead, they tell me about the malleable resource called money, and offer me the chance to look at it from a 3rd person perspective.


Feel free to ignore this if it's something you can't talk about... but can you confirm if Yodlee works with startups concerning licensing costs? Do they do something where you pay based on the number of accounts? Or do they ask for equity?

I've seen a number of startups use Yodlee or CashEdge and I was always curious how they could afford this or how the setup worked. I can't imagine they pay what Mint was reported to have paid (about $2mil/yr) to license the software.


Straight screen scrapin', yo. I worked for a similar startup that collected more detailed information than Yodlee/Mint; it was a product for financial managers instead of consumers. We collected over a million transactions per night from over 3,000 financial institutions. It was no joke. You might think screen scraping is silly, but the bottom line is that even when a bank had an API (OFX, and very few offer it) or formatted data downloads (CSV, XLS), the data tended to be stale or incorrect. The reasoning behind that is that more eyeballs are on the web pages, so bugs and inconsistencies get noticed quicker; there was more of an expectation for the web pages to be accurate.


I work for a fairly successful start-up in the advertising industry. We've done a MASSIVE amount of screen-scraping. phpQuery ftw.

I had a few months solid of nothing but scraper-building.

Besides, what's Google if not a screen scraper? :D

If anybody thinks it's "silly" they probably have a characterization in their mind that's not really how it works in the wild when done skillfully.


Would you please share some code samples or favorite blog posts on how you're using phpQuery as part of a scrape app? What set of tools/libraries are you using?

I'd really like to hear more about the current state of the art (without you telling any company secrets). I have quite a lot of experience scraping utility and government websites (no javascript) in perl + LWP... but I'm getting a little tired of perl and am looking to give a new toolset a try. Preferably one that can handle a broader range of modern websites.


I will throw you 1 bone, but past that, I'm careful because my work here is not my own, it's my employer's.

libxml_use_internal_errors(true);

This little function call is the secret to solving what seem at first to be intractable memory leaks. The trouble is that the scraper uses libxml, and libxml issues a notice/warning every time the HTML is malformed. Without this call, those errors will bubble up to the PHP error handler and murder performance and memory usage.

One more, I suppose...

If you scrape inside a loop (and unless you're using a distributed job queue, if you're scraping more than one URL at a time you almost certainly are), missing an unloadDocument() call is going to cost you each time you iterate. The objects it creates, IIRC, have some circular reference issues, and if you don't explicitly unloadDocument you'll run into trouble. (Suppose it should be OK tho if you've enabled that GC feature in PHP 5.3.)

And, generally, a tip... sometimes it's tempting to write a simple regex instead of a chain like, say, pq($this->node->find('a')->get(0))->attr('href')

Write the chain. Regex is just too brittle.
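Putting those tips together, the skeleton of such a loop looks something like this. It's a contrived sketch with placeholder URLs and selectors, not anything from my employer's codebase:

<?php
// Contrived phpQuery scraping loop; URLs and selectors are placeholders.
require_once 'phpQuery/phpQuery.php';

// Tip 1: keep libxml's malformed-HTML notices out of the PHP error
// handler, or performance and memory usage suffer badly.
libxml_use_internal_errors(true);

$urls = array('http://example.com/page1', 'http://example.com/page2');

foreach ($urls as $url) {
    $html = file_get_contents($url); // a real scraper would use cURL here
    $doc  = phpQuery::newDocument($html);

    // Tip 3: traverse with a selector chain, not a regex.
    foreach ($doc->find('table.results a') as $link) {
        echo pq($link)->attr('href'), "\n";
    }

    // Tip 2: free the parsed document each iteration, or the
    // circular references pile up across the loop.
    $doc->unloadDocument();
}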


Thanks for the tips. You most likely saved me a bunch of headache and I appreciate it.

I'm pretty intrigued by a library that can apparently handle ajax/json updates and content creation. Heh... I thought PHP was only for page generation and had no idea you could purpose it for something like web scraping.

So it should be fun playing with it.

My email's in my profile if you ever want to talk shop.


I have been using quite a bit of python and beautifulsoup.

http://www.crummy.com/software/BeautifulSoup/


php|architect recently released a book entirely on this topic: http://www.phparch.com/books/phparchitects-guide-to-web-scra...


I have been using spynner, written in python. It uses PyQt's QtWebKit, plus jQuery. Use the trunk version at http://code.google.com/p/spynner/


What if the bank is a very innovative one and makes frequent changes to its web pages?


Best joke I heard all week!


Changes happen, and that's why I was employed. But this is mitigated by two facts: most big banks probably have miles of red tape just to deploy a fix for a typo, and the smaller banks used off-the-shelf products that were rarely upgraded. When they were, we identified these products and could group banks together that used similar software, so the gathering of data was essentially the same.

edit: we also built and deployed to production every night, so we had no problem keeping up. Sometimes we'd even deploy midday if we felt a fix needed to go out immediately.


Can you elaborate on how "screen scraping" works?


Most "screen scraping" these days is just extracting content from web pages.

1) Write a program that can load webpages as if it was a user of the site.

2) Have it save everything it loads.

3) Write a program that can extract the data you care about out of the html and put it into a more useful format (or into a database or something).

Many languages have libraries for this or you can use a tool like cURL or wget. I do this a lot with perl and the LWP family of modules but the sites I work on don't use javascript or dom manipulation... There's so much javascript and ajax out there now though that I'm not sure if you can scrape those kinds of sites with perl.
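To make those three steps concrete, here's a minimal sketch (in PHP, to match the examples elsewhere in the thread) using the cURL extension and DOMXPath. The URL, cookie file, and XPath query are invented placeholders, and a real bank would require a login step first:

<?php
// 1) Load the page as if you were a user (the session cookies make the
//    request look like a logged-in browser).
$ch = curl_init('https://bank.example.com/accounts');
curl_setopt_array($ch, array(
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_COOKIEJAR      => '/tmp/session.cookies',
    CURLOPT_COOKIEFILE     => '/tmp/session.cookies',
    CURLOPT_USERAGENT      => 'Mozilla/5.0',
));
$html = curl_exec($ch);
curl_close($ch);

// 2) Save everything you load, so you can re-parse later without
//    hammering the site again.
file_put_contents('/tmp/accounts-' . date('Ymd') . '.html', $html);

// 3) Extract the data you care about from the html.
libxml_use_internal_errors(true); // tolerate real-world malformed HTML
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);
foreach ($xpath->query('//table[@id="transactions"]//tr') as $row) {
    $cells = $row->getElementsByTagName('td');
    if ($cells->length >= 2) {
        printf("%s  %s\n",
            trim($cells->item(0)->nodeValue),
            trim($cells->item(1)->nodeValue));
    }
}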


Screen scraping means: you write a web crawler which loads up the web page (in this case, takes your bank login user name and password, puts them in the login form on the bank website, pretends to be you and loads up the relevant web pages). Then you write an html parser which grabs the relevant bits from the bank's web page (account balance, number, name, etc.) and stores those bits somewhere useful in the local database.
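The login step is the fiddly part. As a hypothetical sketch in PHP (the form field names and URL are invented, and real banks add redirects, hidden form tokens, and security questions on top of this):

<?php
// Hypothetical login step: POST the user's credentials to the bank's
// login form and keep the session cookie for the pages that follow.
$username = 'user-supplied-login';    // credentials the user handed over
$password = 'user-supplied-password';

$ch = curl_init('https://bank.example.com/login');
curl_setopt_array($ch, array(
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_POST           => true,
    CURLOPT_POSTFIELDS     => http_build_query(array(
        'username' => $username,
        'password' => $password,
    )),
    CURLOPT_COOKIEJAR      => '/tmp/session.cookies', // session lives here
    CURLOPT_FOLLOWLOCATION => true,                   // banks love redirects
));
curl_exec($ch);
curl_close($ch);
// Any later request that reuses the cookie jar is now "logged in" and
// can fetch the balance/transaction pages for the html parser.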


My bank (all of them where I live, actually) uses a challenge code that I have to enter on a separate hardware card reader. So only human logins are permitted :-(


"Screen Scraping" seems to be a sort of misnomer here. Essentially they are just loading a URL and extracting the information from whatever is returned. Whether that happens intelligently, or if they are just making specific scrapers for each bank, I have no idea.

But basically, 'look for the number in this div region; this is the account balance', etc.



You load the contents of a web page into a string and use a regex to grab the data.

HA! Just kidding about the regex.


djb_hackernews, what set of tools did you use for your scrapes?


At least prior to their acquisition by Intuit, Mint's backend was powered by Yodlee. This TechCrunch article provides a little background: http://techcrunch.com/2009/09/18/mint-is-yodlees-youtube/


It says that Yodlee received $2mil/year; isn't that a bit expensive for startups like inDinero to pay? (I'm assuming inDinero uses the same system.)


I wouldn't be surprised if it is priced based on number of accounts.


Yodlee are generally amenable to negotiating with startups. After the Mint situation they might be willing to look at some form of equity share if you could demonstrate traction.


And Blippy? I assume they either use Yodlee or have created their own system?


"How do they get the data" doesn't seem mysterious once you give them your logins. "Is this safe and why or why not" is the question I'd be much rather have answered.


Don't bank user T&Cs specifically disallow sharing of your username and password, though? Why would the bank be happy with a third party interrogating their systems in this way and allow the massive flow of traffic to a [successful] company's IP addresses? Surely there has to be some oversight from the bank?

I think this would be illegal in the UK under Computer Misuse Act, or somesuch, as it's unauthorised access to a computer system; the user doesn't have the authority to grant others access.


Once you give them the login info, how can they get the data from the bank?


See the answers under screen scraping; basically they load the web pages up, log in as you, and parse the html to find what they need...


ironically, you can just grep it :)

curl -d "username=foo&password=bar" https://yourbank.com | grep balance


Last I looked into this, Yodlee's security pages were a good place to start because they have a lot of the key words to look up.

http://www.yodlee.com/security_overview.shtml

A lot of the "how" is meeting security wickets (physical, application, transport, audit, examination).


They do use Yodlee but you still raise some interesting questions. Should the bank really be allowing access by a third party using stored two-factor credentials?


I don't think Blippy uses Yodlee. Maybe they are using some kind of bank API?


It's not screen-scraping. They use Yodlee and CashEdge.


Think it should be "How mint.com works?" or "How does mint.com work?"


Maybe someone from inDinero could join the conversation?


Did Quicken back then also use screen scraping? Most banks at the time didn't even have websites.





