Disclaimer: I'm a former employee of Zillow Group, and still hold stock.
These online real estate listing sites are using machine learning to try and predict sale price. The main difficulties are:
1. 'Machine Learning' is a broad term, with lots of potential approaches, some of which could give more or less accuracy.
2. If you put too much effort into getting the prediction accurate, you're prone to overfitting.
3. The data is messy. Really messy. Really really messy. The major players in the field often have a vested interest in not rectifying the situation. Even when they do want to fix it, the software that runs most MLS data exchanges is, well, old. To a certain extent, it doesn't matter if one particular house has accurate data, if all the surrounding data is inaccurate.
4. On the scale of things, home sales are actually pretty rare. Any particular home might only change hands half a dozen times, while the market demand for that home might swing wildly. Any given home sale has a pretty big influence on the valuations of nearby homes, and if that sale is 'inaccurate', i.e., the buyer over- or under-paid relative to what others would consider the fair price, it can have an outsized effect on algorithmic estimates of value.
5. It's much easier to tell when a machine-learned valuation is wrong, and by how much, compared with other common applications. If Netflix is telling you a movie is 4 stars, when you think it's more of a 3.5, it doesn't feel that "off" to you. If speech-to-text mistakes a word for another similar-sounding word, it feels understandable. When Zillow says you have $30k less than you think you do, it's more quantifiable, and has more emotional impact.
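To make the point about rare sales concrete, here's a rough sketch of how one odd sale moves a neighborhood number when there are only a handful of comps. The prices are made up, and comparing a mean against a median is just my illustration, not a description of how any listing site actually computes estimates.

```python
from statistics import mean, median

# Hypothetical recent sale prices per square foot for five nearby homes.
comps_ppsf = [300.0, 310.0, 295.0, 305.0, 290.0]

# One more sale where the buyer badly overpaid (made-up figure).
outlier_ppsf = 450.0
with_outlier = comps_ppsf + [outlier_ppsf]

def shift(estimator, before, after):
    """Relative change in a neighborhood estimate after adding new sales."""
    return (estimator(after) - estimator(before)) / estimator(before)

mean_shift = shift(mean, comps_ppsf, with_outlier)
median_shift = shift(median, comps_ppsf, with_outlier)
print(f"mean shifts by {mean_shift:.1%}, median by {median_shift:.1%}")
```

With only six sales, the single outlier drags the mean up by several percent while the median barely moves. With hundreds of comps the same outlier would hardly register in either.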
Tangentially, fun fact: while there are hundreds or even thousands of potential features, if you do a PCA on them, basically 80% of a home's price is just "price per square foot in the nearby area". One feature. Of course, if you end up 20% over or 20% under based on that logic, you get sued.
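For what it's worth, that one-feature baseline is about this much code. This is a minimal sketch, not the PCA itself and not anyone's production model: `baseline_estimate` is a hypothetical helper, the comps are invented, and using the median of nearby price per square foot is my assumption.

```python
from statistics import median

def baseline_estimate(sqft, nearby_sales):
    """Estimate price as (median nearby price per sq ft) * (this home's sq ft).

    nearby_sales: list of (sale_price, sqft) tuples for comparable homes.
    """
    ppsf = [price / area for price, area in nearby_sales]
    return median(ppsf) * sqft

# Made-up comparable sales: (sale price in dollars, square feet).
nearby = [(600_000, 2000), (450_000, 1500), (775_000, 2500)]
estimate = baseline_estimate(1800, nearby)

# The "20% over or 20% under" band around the baseline.
low, high = estimate * 0.8, estimate * 1.2
print(f"baseline ${estimate:,.0f}, 20% band ${low:,.0f} to ${high:,.0f}")
```

On a baseline around half a million dollars, that 20% band is over $200k wide, which is why the single feature isn't good enough on its own.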
Was the model aware of its own error rate? For instance, could you say that this home is worth $500-650 thousand?
Obviously, this is harder to convey to a homebuyer, and for various reasons "hard" numbers are preferred, not least because they create the illusion that you know what you are talking about.
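One simple way to get a range like that, sketched under my own assumptions (I don't know what any of these sites actually do): look at how far off past estimates were and scale the point estimate by those historical ratios. `price_range` and the ratios below are hypothetical.

```python
from statistics import quantiles

def price_range(point_estimate, past_ratios, n=10):
    """Turn a point estimate into a range using historical sale/estimate ratios.

    past_ratios: actual_sale_price / model_estimate for previously sold homes.
    With n=10 this returns roughly the 10th-to-90th percentile range.
    """
    cuts = quantiles(past_ratios, n=n)  # n - 1 cut points
    return point_estimate * cuts[0], point_estimate * cuts[-1]

# Made-up history: how actual sale prices compared to the model's estimates.
ratios = [0.88, 0.93, 0.95, 0.97, 0.99, 1.00, 1.02, 1.04, 1.08, 1.15]
low, high = price_range(575_000, ratios)
print(f"likely range: ${low:,.0f} to ${high:,.0f}")
```

The catch is that this gives one range width for every home, when a model is presumably much less sure about an unusual property than about a tract house with a dozen recent comps next door.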