I worked at Placester for a couple of years and built the system that imports data from real estate agencies. When I left last year, we had coverage of around 90% of the MLSs in the US. Most of what you say is right, but some clarifications and context:
You don't need to be a brokerage to get access to an MLS feed. Each MLS has its own policies, though, for how you can display the data and what logos, sizes, and text need to be shown on the page with their listings. That makes it unrealistic to build a Zillow/Trulia-style site off of MLS data. Placester builds sites for individual real estate agents, which is significantly easier for keeping the MLSs happy.
Some MLSs are great and will give you access to the data without much hassle; others are not, and you have to pay a lot of money. Even once you get access, you will get almost no technical help or useful documentation on integrating with them. Since MLSs are almost never related, you still need to talk to 300+ different companies in order to get coverage of the US.
There is a standard that most MLSs follow for their data, called RETS. I would say about 80% of MLSs use RETS; the problem is that it's a standard in the same sense that CSS was a standard 10 years ago. The original library I wrote for RETS is open sourced, and it is littered with examples of inconsistencies across RETS servers.
If you can work through all of that, you're golden. It took us about 1.5-2 years of experience with how MLSs work to simplify the integration process down to 1-2 days, with RETS typically requiring no (or minimal) engineering work.
Thanks for all your work on ruby-rets! It works like a champ for pulling in listings from MLSPin and CCIAOR. As you mentioned, dealing with all the "certified" RETS vendors is a nightmare. For the uninitiated (and fortunate), RETS has its own query language called DMQL, which is inconsistent across versions and MLS vendors. Even trivial tasks like importing photos are handled in vastly different ways across vendors.
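To give the uninitiated a taste, here's a sketch of what a DMQL2 query looks like as it might be sent to a RETS Search transaction. The field names ("ListPrice", "Status"), the class name, and even the accepted syntax vary from one MLS vendor to the next, which is exactly the pain point:

```python
# Building a hypothetical RETS Search request with a DMQL2 query.
# Field/class names here are illustrative; every MLS uses its own.
from urllib.parse import urlencode

# Listings priced $300k-$500k with an "Active" status.
query = "(ListPrice=300000-500000),(Status=|A)"

params = urlencode({
    "SearchType": "Property",
    "Class": "RES",            # resource class; differs per MLS
    "QueryType": "DMQL2",
    "Query": query,
    "Format": "COMPACT-DECODED",
})
print(params)
```

The same logical search against a different vendor's server can require different field names, a different class, or slightly different operator syntax, so none of this is portable in practice.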
Despite all the technical hurdles, given their resources, I would be shocked if Zillow and Trulia DIDN'T import the majority of their MLS data via RETS. Most MLS providers allow third-party access to the data feeds. There is no way all the listing data is re-entered by agents.
Seconded! RETS is a goddamn nightmare. For years we've used librets and it's the worst. When we found your Ruby-only library we got it working with our feeds within the day, and I can't tell you how relieved we are not to have to deal with librets compilation.
I become more and more amazed at how cheap storage gets, but in the same breath I am still horrified at the price of data at scale.
Right now I have a VPS at Carat Networks to throw crap on and I pay $15/mo. For that I get 50GB of space and 500GB of transfer. I understand the speed and reliability are greatly improved with S3, but as a simple file host, it still makes sense for me to throw it on a VPS or low-end dedicated server at 1/10-1/3 the cost of ^this^ projection.
" … I am still horrified at the price of data at scale."
This isn't "the price of data at scale", this is "the price of flexible, reliable, available data at scale"
I think what some people don't understand is that Amazon _aren't_ trying to compete on price.
With Amazon, you're paying a premium for the ability to scale, both up and down, very quickly.
Rackspace, Linode, and some-guy-subletting-racks-in-some-local-datacenter can easily beat EC2 prices for "general purpose servers". What Amazon does differently is let you quickly and easily go from 1 "server" to 10 or 100 servers, then switch most of them back off again 4 hours later. I deal with a great local hosting guy, who can (and does) fast-track provisioning for me at times, but if I called him and said "Ummm, the CEO is on Oprah tonight, I need 100 additional webservers, a load balancer or two, and a dozen database slaves, to keep my not-architected-for-scale-but-suddenly-in-need-of-it web app alive at 8:30pm tonight", there's no way he'd be able to do it. And even if he _could_, there's no way he'd agree to it if I said "and I only want to pay for it all until midnight, then shut all the extra down and go back to charging me for my single instance".
"$1000 per terabyte per year" might seem crazy expensive if a sensible alternative for your data storage requirements is to go to BestBuy and grab a 2TB external drive for ~$100. But that's a _very_ different thing to what Amazon are selling...
I would rather pay 1/5th the price of Amazon for all the other days when I am not on Oprah though.
Our current CDN provides for $4k per month what Amazon would charge $18k for.
Yes, that $4k is on a 12-month contract that we had to negotiate. We are paying for about 4 times the bandwidth per month that we are actually consuming at the moment, but it's just so much cheaper overall, and the bandwidth we don't consume each month rolls over to the next. (We plan to consume it all one day!)
I firmly believe that the vast majority of AWS customers are paying for flexibility that they are not actually using 99% of the time.
This is 100% accurate. My point was that S3 is good at scaling infinitely, but you can use low-end hardware to scale storage up to a point. That point comes fast for media companies, but for most web apps that need an image host or CDN, low-end hardware will go a long way at a fraction of what AWS charges. My problem, I guess, is that I see younger companies looking at AWS, Linode, and Rackspace as the _only_ solution, and I think that's unwise.
I realize I'm slowly going off topic, sorry! AWS still rocks and I wish I could use my Amazon gift cards there.
Comparing a VPS to S3 is apples to oranges. The redundancy, backups, and scale S3 provides over a single VPS are very valuable to people, so the argument is silly on a post about S3. Yes, if you want a cheap webserver to dump things on, you can get that. You can also get an EC2 instance with plenty of space pretty cheap to just dump things on as well.
Very true, and that's why S3 is excellent, but I still feel there's a lot of value in using low-end servers until you're running at a large enough scale where redundancy actually matters (not just we should use this cause it's what everyone else is doing).
This is the reason companies like your host can offer 500GB, or terabytes, or even "unlimited bandwidth" for such a low price: they sell to a lot of people and pray that 90% of them won't even come close to using their full bandwidth allotment.
If everyone that was paying for the 500GB was using anywhere close to 500GB at that price that company would go bankrupt very quickly.
Testing the math "at break even": $15 / $2.34 per Mbps per month ≈ 6.4 megabits/second, and 6.4 megabits/second sustained for 30 days ≈ 2,073.6 gigabytes of transfer.
So, there's enough margin for everyone to use the full 500GB without that ISP going bankrupt. (Yes, I realize they have costs for servers, cooling, real estate, diesel, staff, security, etc., but this shows we're in the right ballpark, with about a 4x margin.)
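The arithmetic above can be sanity-checked directly, assuming (as the division implies) that the $2.34 figure is a per-Mbps-per-month bandwidth price:

```python
# Back-of-envelope check of the break-even numbers, assuming
# $2.34/Mbps/month as the underlying bandwidth price.
price_per_month = 15.00   # VPS price, $/month
price_per_mbps = 2.34     # assumed bandwidth cost, $/Mbps/month

mbps = price_per_month / price_per_mbps        # sustained rate the $15 buys
seconds_per_month = 30 * 24 * 60 * 60
# Mbit/s -> MB/s -> MB over 30 days -> GB (decimal units)
gigabytes = mbps / 8 * seconds_per_month / 1000

print(round(mbps, 1), round(gigabytes, 1))     # ~6.4 Mbps, ~2,075 GB
```

Roughly 2,000 GB of deliverable transfer against a 500GB allotment is where the ~4x margin comes from.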
Right, at 500GB it's definitely a reasonable cost -- I think VPS providers like the parent's host are much more sensible with what they advertise.
Virtual hosts like Bluehost ("UNLIMITED Domain Hosting, UNLIMITED GB Hosting Space, UNLIMITED GB File Transfer") and Dreamhost ("Disk Storage Unlimited TB + 50GB Backups, Monthly Bandwidth, Unlimited TB"), however, are the ones who are especially bad with their advertised offers (all for around $5-7 a month). You start using even a couple hundred GB of bandwidth or a few GB of storage and they're happy to kick you off for "abusing resources".
Started using Stripe recently for a project, and have used Braintree extensively at work. Your comparison is spot on.
The tradeoff with Stripe is that while you get a much simpler API, it will likely cost you more than Braintree, depending on scale and which cards are commonly used.
The nice part of sending the data to your servers with client-side encryption is that you can run validation before sending anything to the payment gateway. For example, if you want to ensure everyone enters a cardholder name, you can validate the non-encrypted fields before eating the cost of a payment gateway call.
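A minimal sketch of that idea, with a hypothetical gateway client: the plaintext fields (cardholder name, expiry) are checked server-side, and the gateway is only called when they pass:

```python
# Validate the non-encrypted fields before spending an API call on the
# payment gateway. Field names and the gateway object are hypothetical.
def validate_order(form):
    """Return a list of errors found in the plaintext fields."""
    errors = []
    if not form.get("cardholder_name", "").strip():
        errors.append("Cardholder name is required")
    month = form.get("exp_month", "")
    if not (month.isdigit() and 1 <= int(month) <= 12):
        errors.append("Expiration month is invalid")
    return errors

def charge(form, gateway):
    errors = validate_order(form)
    if errors:
        return {"ok": False, "errors": errors}  # no gateway call made
    # Only the encrypted blob is forwarded; we never see the raw card number.
    return gateway.charge(form["encrypted_card"], form["amount"])
```

The encrypted card data stays opaque to your servers, so this keeps you out of the heaviest PCI scope while still letting you reject obviously bad submissions for free.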
The aggregation framework is meant to fill the gap between SQL's SUM, COUNT, AVG, etc. and a full map/reduce. The Hadoop integrations are unrelated and are just a nice little bonus that they added.
I'm debating buying an iPad, and not being able to play all media formats is a big concern. Can I drag 720p MKV files to my iPad, and will VLC play them if I can get it installed via the above? Also, how's the battery life when playing a 720p or 1080p file?
It was crypt-MD5; calling it "MD5 with salt" is generous at best. They seem to have made the decision to stick with crypt-MD5. I don't really have any faith in their ability to secure the servers.
> Even with the iteration count, SHA512 is not exactly meant to be slow.
Increasing the iteration count is synonymous with intending something to be slow. bcrypt itself uses a default of 2^10 rounds in most bindings. PBKDF2 with an NIST-studied hashing algorithm like SHA512 is a perfectly valid method.
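The point about iteration counts can be shown with Python's standard library: PBKDF2-HMAC-SHA512 with a high iteration count is deliberately expensive to compute, which is the property that matters for password storage. The iteration count below is illustrative, not a recommendation:

```python
# PBKDF2-HMAC-SHA512 via the standard library. The cost of each guess
# scales linearly with the iteration count.
import hashlib
import os

salt = os.urandom(16)
iterations = 200_000  # illustrative; tune to your hardware's budget

digest = hashlib.pbkdf2_hmac("sha512", b"hunter2", salt, iterations)

# Verification: same password + same salt + same count -> same digest.
check = hashlib.pbkdf2_hmac("sha512", b"hunter2", salt, iterations)
assert digest == check
print(len(digest))  # 64 bytes, the SHA-512 output size
```

An attacker brute-forcing a dump has to pay those 200,000 HMAC rounds per guess, versus a single round for a bare SHA-512 hash.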
I can't see how it could mean anything at all. Your password is either salted or it isn't; a hash can't really be said to have multiple salts. Maybe they're using different salts in their various rounds of hashing, but I can't see how that would provide any more security.
The reason you're being downvoted is that this has been explained a fair number of times on HN. The problem with using SHA-* or MD5 for password hashing is that those algorithms are designed to be fast. That means it's relatively easy for a cracker with a dump of the database to brute-force passwords, since they can try gazillions of combinations very quickly. Hell, they can even parallelise the task on EC2 and get it all done in an hour.
By contrast, computing bcrypt takes a significant amount of time and CPU. It's slow. It's designed to be slow. It's designed so that you will need a LOT of CPU power to bruteforce it.
So, no, SHA-512 is not much better than MD5. It's still a fail.
Many are forced to use insecure hashing for compatibility with outside vendors. Google email for orgs/colleges has two options for hash exchange (or used to... it may be different now): MD5 and SHA1. So you could not migrate user accounts unless the hashes were MD5 or SHA1.