Semantics3 (YC W13) Is A Massive Consumer Products Database To Rule Them All (techcrunch.com)
83 points by relation on Feb 25, 2013 | 34 comments


I haven't tested their data, but one reason this is hard is the SKUs at Walmart and Best Buy. The same Samsung 42-inch TV may exist at Amazon, Walmart and Best Buy, but with slightly modified specs for each of those mega retailers... and on top of this, Best Buy has a lot of SKUs that simply die. So once Semantics3 has "reconciled" that Samsung 42-inch TV across retailers, they'll have to continuously check whether any of the retailers have changed the SKU and/or URL on it.

source: I do this at dealzon.com for a very limited set of data where it's practical
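
A minimal sketch of the kind of re-check loop this implies (the record layout and status checks here are made up for illustration, not anything these sites actually run):

  # Hypothetical sketch: re-verify that a reconciled product's retailer
  # SKU/URL pairs still resolve, flagging dead or remapped listings.
  import requests

  def find_stale_listings(product):
      # product: {"name": ..., "listings": [{"retailer": ..., "sku": ..., "url": ...}, ...]}
      stale = []
      for listing in product["listings"]:
          try:
              resp = requests.head(listing["url"], allow_redirects=True, timeout=10)
              # A 404/410, or a redirect away from the stored URL, suggests the
              # retailer killed or remapped the SKU and the match needs redoing.
              if resp.status_code in (404, 410) or resp.url != listing["url"]:
                  stale.append(listing)
          except requests.RequestException:
              stale.append(listing)
      return stale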


Yes, this is a big problem, and one we have put a lot of effort into tackling.

We try to calculate a 'hash' for the product that is independent of the SKU, factoring in all the structured metadata available - normalized dimensions (height, length, width), weight, model, manufacturer, etc. We also account for small variations in the numerical data points.

So even if SKUs change with slightly different specs, the 'hash' remains the same and we can identify and reassign them.
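
For illustration only, a fingerprint along those lines might look something like this - the attribute names, units and tolerances are assumptions, not the actual scheme:

  # Illustrative only: bucket numeric specs so small retailer-to-retailer
  # variations collapse to the same fingerprint, independent of the SKU.
  import hashlib

  def product_fingerprint(meta, dim_tol_cm=1.0, weight_tol_kg=0.1):
      def bucket(value, tolerance):
          # e.g. 106.7 cm and 107.0 cm both bucket to 107 with a 1 cm tolerance
          return str(round(float(value) / tolerance))

      parts = [
          meta["manufacturer"].strip().lower(),
          meta["model"].strip().lower(),
          bucket(meta["height_cm"], dim_tol_cm),
          bucket(meta["width_cm"], dim_tol_cm),
          bucket(meta["length_cm"], dim_tol_cm),
          bucket(meta["weight_kg"], weight_tol_kg),
      ]
      return hashlib.sha1("|".join(parts).encode("utf-8")).hexdigest()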

Drop me a note at varun [at] semantics3.com - we could swap notes :)

[Edit: Added extra information]


We're working on this problem at Datafiniti (https://www.datafiniti.net). Since we index hundreds of sources for a similar service, we can leverage some basic string comparison techniques to normalize records from different sources and fill in attributes that are missing from any one source.
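
As a toy illustration of that idea (not the actual Datafiniti pipeline), fuzzy title matching plus attribute back-filling can be as simple as:

  # Toy example: merge product records from multiple sources when their
  # normalized titles are similar enough, filling in missing attributes.
  from difflib import SequenceMatcher

  def title_similarity(a, b):
      norm = lambda s: " ".join(s.lower().split())
      return SequenceMatcher(None, norm(a), norm(b)).ratio()

  def merge_records(records, threshold=0.85):
      merged = []
      for rec in records:
          for group in merged:
              if title_similarity(rec["title"], group["title"]) >= threshold:
                  for key, value in rec.items():
                      group.setdefault(key, value)  # fill attributes missing from this group
                  break
          else:
              merged.append(dict(rec))
      return merged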


Very cool! This could be really useful for something I'm working on.

I noticed in the example queries on your site, the data that is returned looks like it is being piped pretty much exclusively from Amazon. For example, every image URL is an Amazon URL. Have you licensed this data from Amazon? Otherwise, any customers who use this info on their website are effectively ripping off Amazon's product data and could be exposed to legal issues. For example, would it be kosher to download an image and use it on your own site?

I'm trying to understand the terms under which this data is being provided and whether it is worth it for me to buy a subscription to your service.


>I noticed in the example queries on your site, the data that is returned looks like it is being piped pretty much exclusively from Amazon.

Damn, I really am hoping that's not the case. I got all excited because I really don't like the Amazon Product Advertising API. In fact, I have a project that I've basically given up on pursuing further because it depends on product lookup and I didn't think I could depend on Amazon, for various reasons, especially:

- You can't even use the API unless you participate in their affiliate program, which you can't do in certain states because of constantly shifting tax laws (the fact that I have to do this to hit an API makes me facepalm)

- There are pretty low hourly usage limits. You can only get those increased by driving traffic and thus sales to Amazon. I'd rather pay.

- Technically, you can't write an app unless the exclusive purpose of the app is driving sales to Amazon

I'd love to pay for this thing if it is the real deal. If anyone has other alternatives, I'd like to hear them too.


Thanks for the comment! Amazon.com storefronts cover more products than the rest, hence the skewed results - this is especially true for long-tail queries. As we add more sources, the dominance will be less pronounced :)

We did hear that feedback too. Hopefully our API helps mitigate that issue. Hit me up at vinoth at semantics3.com if you'd like to talk about how our API could work for you!


The discussion of the TOS kind of obscured the real question I was trying to ask, which is "What are users of this API allowed to do with the image data?"

The API provides a link to the image, which is probably perfectly legal, but a site that displays that image in, say, an ad for a product on a non-Amazon website may be infringing. I'd like more clarity on this particular aspect of this API.


So the terms of service answer this question in a vague way. Basically, they say you aren't allowed to use this data for any commercial purpose. But who is going to want to use this data for anything other than commercial purposes?

How You Can Use the Service

Semantics3 grants you only a limited license to access and make use of the Service for your own personal, non-commercial purposes. This license is the only license to the Service granted by Semantics3 unless you have a separate written agreement with Semantics3 or there are Additional Terms applicable to your use of a particular feature, resource or other portion of the Service. This license does not include any resale or commercial use of the Service, its contents or any derivatives. You agree not to reproduce, duplicate, copy, sell, resell or exploit for any commercial purposes any portion of, or access to, the Service without the express prior written consent of Semantics3.

You may not download any data or other resource made available as part of the Service, except for temporary system caching necessary for your own personal, non-commercial use of the Service in accordance with the TOS and applicable Additional Terms, if any, and not for further distribution or transfer to others, without the express prior written consent of Semantics3. You may not--and you agree not to--modify, reformat, copy, display, distribute, transmit, publish, license, create derivative works from, transfer or sell any information, products or services obtained from the Service, except in accordance with the TOS and as set forth in applicable Additional Terms, if any, or otherwise agreed in writing with Semantics3. Additionally, except as set forth in applicable Additional Terms, if any, or otherwise agreed in writing with Semantics3, you may not: frame or mirror any part of the Service without the express prior written consent of Semantics3; forge headers or otherwise manipulate identifiers in order to disguise the origin of any content transmitted through the Service; create a database or dataset by systematically downloading and storing content from the Service; intentionally or unintentionally violate any applicable local, state, national or international law and any regulations having the force of law; use any robot, spider, site search/retrieval application or other manual or automatic device to retrieve, download, extract, copy, index, "scrape," "data mine" or in any way gather information, content or other materials from the Service or reproduce or circumvent the navigational structure or presentation of the Service. Any unauthorized actions taken by you or by any third party on your behalf automatically terminates any and all permissions or licenses granted to you by Semantics3.


Well, that doesn't make any sense. All our use cases would be violating our own ToS then!

What we meant was you cannot resell the data. We shall fix the wording of that ASAP. Thanks for pointing it out.


This is an incredible achievement, would love to hear more about the backstory and the "curation" process.


By "curation", it's machine "curation" of course :) All the product data is categorized and disambiguated with the rest. At this point we spit out a confidence interval, and only those that pass a high threshold are inserted into our master database. The rest are discarded - because they were missing some attribute or we didn't know where to fit it in our category tree, etc..

In terms of the backstory - we started off as a data marketplace (like Infochimps) where folks could buy and sell data sets. But we soon realized most of the demand was in the ecommerce vertical, so we decided to focus purely on that segment. Based on customer feedback, we scrapped the downloadable data sets model and switched to an API model for delivering the data.


Some retailers get their own SKUs for common products (to prevent price comparisons?).


Some retailers go so far as to invent different names for uncommon brands and use generated SKUs to prevent even the poaching of new manufacturers. Misinformation is used a lot to keep a competitive edge. One of the many challenges this service faces is trying to figure out if mfr A is actually mfr B.


I always thought it would be nice to instantly learn a product's provenance.

An app that tells you where some product was made would probably go over really well in the USA. There are a lot of people who would pay more for a product if they knew it was made in the USA while the product next to it was made in China. It's probably quite difficult to get this information, though.


This service would be excellent for insurance claims. I used to work with a third-party insurance claims specialist that had a manually updated database of products for valuing insurance claims. They would either do all the hard work of the valuations themselves, sell access to their reporting app, or sell access to their product DB... 90% of their sales came from the product DB.


All payments are handled through PayPal? These guys are really putting all of their eggs into one basket with that move.


Why do you say so? PayPal allows payments through a PayPal account and credit card payments without an account too.


Because PayPal has a history of shutting down accounts for no apparent reason. Then it takes months to get your money out.


We started out in Singapore, and the only payment processor we could use there was PayPal. If we were starting now, we would have used something easier and better like Stripe :) That being said, we should be switching over to them soon.


Fair enough. General PayPal safety tips:

- When PayPal raises a dispute, they really want to see a shipping confirmation for a product, or something really closely equivalent. Make sure you have very clear records prepared to show them, and hope that keeps them happy.

- Don't leave any money in an account PayPal controls. Get your money out of PayPal into your bank, and get your money out of any bank account linked to PayPal into an account PayPal has no access to, as quickly as possible after you receive it. When the inevitable dispute happens, you'll feel much happier if they can't sit on a pile of your money.

- Have a kill-switch plan to shut down new orders going into PayPal. PayPal will often happily lock your account while continuing to take money from customers, who will then want to receive their product even though PayPal won't give you their money. Worst case, that produces more disputes; best case, PayPal has increasing amounts of your money to sit on. Whether this means another payment provider or just a "we're sorry, we can't take orders right now, can we put you on a waiting list?" page, make sure you can stop getting money via PayPal the moment PayPal screws you.


Check out balancedpayments.com. It's processing, escrow, and payouts in one API. Also, the founding team were early employees at Milo.com and worked on this specific issue of SKU-level data cataloging & matching.

disclosure: I'm a co-founder of Balanced and this is a shameless plug for my product. :)

I was also the head of data acquisition at Milo.com and am fascinated by this problem. I have to admit I'm skeptical this scale of product matching and metadata cleaning can be done, but I'm very happy for you if you've succeeded where I failed. In either case, best of luck.


Sure I'll check out balancedpayments.com :) We should definitely talk! Hit me up at vinoth at semantics3.com



Seems complementary to a service like EDITD (www.editd.com). I'm curious how Semantics3 determines which ecommerce sites to crawl. I remember reading somewhere about how even Amazon only stocks something like 2% of all ecommerce products.


We started off adding most of the major e-commerce stores. As to how we decide which sites to add, well, that depends. We usually add sites based on what our users seem to be asking for. More sites mean more ability to do interesting queries, like comparing prices between store A and store B. Since we have the products disambiguated, that is much easier.

Or if we find that the existing data that we gathered from our existing sources for a certain vertical could be improved, we add new niche sites to augment the data that we already have.


If it's being used for competitive price analysis, I wonder if any retail sites will simply block their crawler? I'm assuming that they are (correctly) announcing their crawler via its user agent, so it could be blocked via robots.txt.


Some percentage of online retailers will block their crawler. It's a situation where party A wants party B's data and there is really no reason party B would want party A to have it. Even small and medium-sized online retailers think to block pricing crawlers. Yahoo Store devs are irked that you still can't upload a custom robots.txt file to the web root; forbidding unwanted crawlers from poking around is frequently the reason they want to (this applies to stores using the store editor).

Maybe in exchange for the privilege of obtaining pricing data, this service could offer to automatically clean up the retailer's product data. Win-win for everyone.


Yeah, exactly what I was thinking. Once Amazon and the big e-tailers figure out they're being crawled for profit, or to the detriment of their sales, you can bet they will restrict access and charge a fee. Seems like a good idea, and I'm sure many of us have thought of the same thing, but any of the big guys can easily block their crawler whenever they wish. Crowdsourcing works to a certain extent, but that's what deal sites are for.

I remember a real estate startup crawling listing prices from many real estate sites for market analysis. Needless to say, that startup was shut down quicker than you can say "doh!"


Not to mention, many ecommerce sites explicitly forbid this sort of thing. I'd be interested to know how they got around it.


We currently get the pricing data via RSS feeds, crawling, data dumps and, in some cases, crowdsourcing. In the long run, we also hope to establish merchant relationships and get the data directly.

To the original question on crawling (I had replied to a similar question previously on HN): "Some great advice here on crawling at scale, which has inspired our crawlers a lot: http://news.ycombinator.com/item?id=4367933 Basically it boils down to three things: 1. If the site is slow, crawl slooowly. 2. If you see non-200 HTTP error codes, stop! 3. Obey robots.txt and speed restrictions."
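
Those three rules translate fairly directly into code. A rough sketch, with a made-up user agent and delay values:

  # Rough sketch of a "polite" fetch: obey robots.txt, stop on non-200
  # responses, and slow down in proportion to how slow the site is.
  import time
  import urllib.robotparser
  import requests

  USER_AGENT = "ExampleProductBot/0.1"  # made-up name; a real crawler should identify itself

  def polite_fetch(url, robots_url, min_delay=2.0):
      rp = urllib.robotparser.RobotFileParser(robots_url)
      rp.read()
      if not rp.can_fetch(USER_AGENT, url):
          return None  # rule 3: robots.txt says no

      start = time.time()
      resp = requests.get(url, headers={"User-Agent": USER_AGENT}, timeout=30)
      elapsed = time.time() - start

      if resp.status_code != 200:
          return None  # rule 2: back off when you see non-200 responses

      # rule 1: if the site is slow, crawl slowly -- wait before the next request
      time.sleep(max(min_delay, 2 * elapsed))
      return resp.text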


With UPCs/EANs. Finally. No idea why Freebase never had them.


Is this US-centric, or applicable to EU audiences as well?


We are currently US-centric only. We hope to cover other regions (starting with the EU) in the next few months. Drop me a mail at varun [at] semantics3.com - I will keep you posted!


I want this for food!



