Mint.com connects to your bank account, credit card account, etc., and downloads your data. How do they do that? Do they need special authorization from the banks, or is there an open API? What about security? Can someone explain their back-end system?
The same question applies to Blippy and inDinero. Anyone know what these guys are doing on the back end to get the transaction data?
This is Jessica from inDinero. We use the same technology that Mint uses -- namely, integrating with a third-party service called Yodlee. They take care of aggregating financial data through various means: screen scraping, direct OFX feeds, etc.
The typical follow-up question is: why do Mint and inDinero use Yodlee instead of building the solution out themselves?
1) Security liability. No startup should ever have to deal with the problems that come with storing passwords to financial accounts. Yodlee is in the business of security, and they have direct feeds with major banks, which makes it much easier (and safer) to just integrate with them.
2) Sheer number of banks. Screen scraping from that many banks is a pain in the *. It isn't standardized, either -- compare the website of Wells Fargo to that of a local credit union that asks five security questions at login. In short, it's a brutal nightmare.
3) None of our companies are in the business of screen scraping. If Mint had spent its first year of business integrating with banks, it wouldn't have been successful. And even once the integrations are done, you have to maintain them whenever a bank changes its login page or interface. In short, it's not worth any startup's time to do manual screen scraping themselves.
Would you consider working with iPhone app developers to make money tracking and spending actionable? I've wanted this tool for a few years now. If I set my fixed expenses, say utilities, rent, insurance, and car payments, in a system that also tracks my purchases and income, I could see my daily budget for other, more spontaneous expenses. I could open the app to find out what my drinking/nightlife budget can be today, or tally up the day's spending to see if I'm still on track for that utility bill.
The thing I've found frustrating about financial planning tools is that they seldom give me any useful information about the purchase I want to make right now. Instead, they tell me about the malleable resource called money and offer me the chance to look at it from a third-person perspective.
Feel free to ignore this if it's something you can't talk about... but can you confirm whether Yodlee works with startups on licensing costs? Do they charge based on the number of accounts, or do they ask for equity?
I've seen a number of startups use Yodlee or CashEdge, and I was always curious how they could afford this or how the setup worked. I can't imagine they pay what Mint was reported to have paid (about $2mil/yr) to license the software.
Straight screen scrapin', yo. I worked for a similar startup that collected more detailed information than Yodlee/Mint; it was a product for financial managers rather than consumers. We collected over a million transactions per night from more than 3,000 financial institutions. It was no joke. You might think screen scraping is silly, but the bottom line is that even when a bank had an API (OFX, and very few do offer OFX) or formatted data downloads (CSV, XLS), that data tended to be stale or incorrect. The reasoning behind that: more eyeballs are on the web pages, so bugs and inconsistencies get noticed quicker there. There was more of an expectation for the web pages to be accurate.
Would you please share some code samples or favorite blog posts on how you're using phpQuery as part of a scrape app? What set of tools/libraries are you using?
I will throw you one bone, but past that I'm careful, because my work here is not my own; it's my employer's.
This little function call is the secret to solving what at first seem to be intractable memory leaks. The trouble is that the scraper uses libxml, and libxml issues a notice/warning every time the HTML is malformed. Without this call, those errors bubble up to the PHP error handler, and that'll murder performance and memory usage.
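The call itself isn't shown above, but given the description (routing libxml's malformed-HTML diagnostics away from PHP's error handler), it is almost certainly PHP's `libxml_use_internal_errors()`. A minimal sketch of the pattern, with the malformed markup invented for illustration:

```php
<?php
// Without this call, libxml raises a PHP warning for every piece of
// malformed or unknown markup, which floods the error handler and,
// as described above, murders performance on real-world pages.
libxml_use_internal_errors(true);

// Malformed on purpose: an unclosed <b> and a made-up tag.
$html = '<p>Hello <b>world</p><customtag>oops</customtag>';

$doc = new DOMDocument();
$doc->loadHTML($html);            // parses quietly; no warnings emitted

// The diagnostics are still buffered if you ever want to inspect them:
$errors = libxml_get_errors();
libxml_clear_errors();            // free the buffer so it doesn't grow
```

Note that the errors aren't discarded, just buffered; clearing the buffer after each parse keeps it from accumulating across pages.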
One more, I suppose...
If you scrape inside a loop (and if you're scraping more than one URL at a time, you almost certainly are, unless you're using a distributed job queue), a missing unloadDocument() call is going to cost you on each iteration. The objects it creates, IIRC, have some circular dependency issues, and if you don't explicitly unloadDocument you'll run into trouble. (It should be OK, though, if you've enabled the circular-reference garbage collector in PHP 5.3.)
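Since phpQuery wraps libxml's DOM under the hood, the same leak-avoidance pattern can be shown with plain DOMDocument (used here so the sketch stays self-contained; the page markup is invented). The key move is the explicit teardown at the bottom of each iteration, which is the role unloadDocument() plays in phpQuery:

```php
<?php
libxml_use_internal_errors(true);

// Stand-ins for pages fetched inside the scrape loop.
$pages = [
    '<html><body><a href="/a">A</a></body></html>',
    '<html><body><a href="/b">B</a></body></html>',
];

$hrefs = [];
foreach ($pages as $html) {
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    $xpath = new DOMXPath($doc);
    foreach ($xpath->query('//a') as $a) {
        $hrefs[] = $a->getAttribute('href');
    }
    // Drop every reference to the document before the next iteration.
    // phpQuery's unloadDocument() serves the same purpose: without it,
    // each parsed page lingers (circular refs) and memory climbs.
    unset($xpath, $doc);
    libxml_clear_errors();
}
```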
And, generally, a tip... sometimes it's tempting to write a simple regex instead of a chain like, say, pq($this->node->find('a')->get(0))->attr('href'). Resist the temptation: the regex breaks the first time the markup shifts, while the DOM chain keeps working.
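To illustrate why the chain wins, here is a comparison using stdlib DOMXPath rather than phpQuery (so it runs without the library); the markup is hypothetical but shows the kind of variation real pages have:

```php
<?php
libxml_use_internal_errors(true);

// Attribute order and quoting vary in the wild; this is valid HTML.
$html = "<div id='nav'><a class='btn' href='/account/42'>Account</a></div>";

// Fragile: a naive regex assumes double quotes and misses this link.
preg_match('/href="([^"]*)"/', $html, $m);
$viaRegex = $m[1] ?? null;

// Robust: parse the document and ask for the attribute.
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);
$viaDom = $xpath->query('//div[@id="nav"]//a')->item(0)->getAttribute('href');
```

The regex comes back empty while the DOM query returns the href regardless of how the bank quotes or orders its attributes.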
Thanks for the tips. You most likely saved me a bunch of headache and I appreciate it.
I'm pretty intrigued by a library that can apparently handle AJAX/JSON updates and content creation. Heh... I thought PHP was only for page generation and had no idea you could use it for something like web scraping.
So it should be fun playing with it.
My email's in my profile if you ever want to talk shop.
Changes happen; that's why I was employed. But this is mitigated by the fact that most big banks probably have miles of red tape just to deploy a fix for a typo, and the smaller banks used off-the-shelf products that were rarely upgraded. When they were, we identified those products and grouped together banks that used similar software, so gathering the data was essentially the same across them.
edit: we also built and deployed to production every night, so we had no problem keeping up. Sometimes we'd even deploy midday if we felt like a fix needed to go out immediately.
Most "screen scraping" these days is just extracting content from web pages.
1) Write a program that can load web pages as if it were a user of the site.
2) Have it save everything it loads.
3) Write a program that extracts the data you care about from the HTML and puts it into a more useful format (or into a database or something).
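The three steps above can be sketched roughly as follows. This is a toy, not any real aggregator's code: the fetch is stubbed with a canned response, and the account markup, URL, and field names are all invented for illustration.

```php
<?php
libxml_use_internal_errors(true);

// Step 1: load the page as a logged-in user would. Stubbed here with a
// canned response; in practice this is an HTTP client holding a session
// cookie. The markup below is hypothetical, not any real bank's.
function fetchPage(string $url): string {
    return '<html><body>
        <div class="account"><span class="name">Checking</span>
        <span class="balance">$1,234.56</span></div>
    </body></html>';
}

// Step 2: save the raw HTML so you can re-parse later without
// re-fetching (and debug when the bank changes its markup).
$raw = fetchPage('https://bank.example/accounts');
file_put_contents(sys_get_temp_dir() . '/page.html', $raw);

// Step 3: extract the fields you care about into a useful shape.
function extractAccounts(string $html): array {
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    $xpath = new DOMXPath($doc);
    $out = [];
    foreach ($xpath->query('//div[@class="account"]') as $div) {
        $name = $xpath->query('.//span[@class="name"]', $div)->item(0)->textContent;
        $bal  = $xpath->query('.//span[@class="balance"]', $div)->item(0)->textContent;
        // Normalize "$1,234.56" to a number for storage.
        $out[$name] = (float) str_replace(['$', ','], '', $bal);
    }
    return $out;
}

$accounts = extractAccounts($raw);
```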
Screen scraping means you write a web crawler that loads the web page (in this case, it takes your bank login username and password, puts them into the login form on the bank's website, pretends to be you, and loads the relevant pages). Then you write an HTML parser that grabs the relevant bits from the bank's page (account balance, number, name, etc.) and stores those bits somewhere useful in a local database.
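The "pretend to be you" step is ordinary HTTP with a cookie jar. A sketch with PHP's curl extension; the login URL and form field names here are hypothetical, and every bank's form differs, which is exactly why each institution needs its own scraper:

```php
<?php
// Hypothetical field names; a real scraper reads them from the
// bank's actual login form.
function buildLoginPost(string $user, string $pass): array {
    return ['userid' => $user, 'password' => $pass, 'submit' => 'Log In'];
}

// POST the credentials, keep the session cookie, then request the
// pages behind the login with the same handle.
function fetchStatement(string $user, string $pass): string {
    $jar = tempnam(sys_get_temp_dir(), 'cookies');
    $ch = curl_init('https://bank.example/login');   // hypothetical URL
    curl_setopt_array($ch, [
        CURLOPT_POST           => true,
        CURLOPT_POSTFIELDS     => buildLoginPost($user, $pass),
        CURLOPT_COOKIEJAR      => $jar,   // persist the session cookie
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => true,
    ]);
    curl_exec($ch);
    // Reuse the handle (and its cookies) for the authenticated page.
    curl_setopt($ch, CURLOPT_URL, 'https://bank.example/statement');
    curl_setopt($ch, CURLOPT_POST, false);
    $html = curl_exec($ch);
    curl_close($ch);
    return $html;
}
```

The returned HTML then goes to the parsing step described elsewhere in the thread.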
"Screen scraping" seems to be a bit of a misnomer here. Essentially they are just loading a URL and extracting the information from whatever is returned. Whether that happens intelligently, or whether they write a specific scraper for each bank, I have no idea.
But basically: "look for the number in this div region, that's the account balance," etc.
Don't banks' user T&Cs specifically disallow sharing your username and password, though? Why would a bank be happy with a third party interrogating its systems in this way and allow the massive flow of traffic to a [successful] company's IP addresses? Surely there has to be some oversight from the bank?
I think this would be illegal in the UK under the Computer Misuse Act, or somesuch, as it's unauthorised access to a computer system; the user doesn't have the authority to grant others access.