

Semantics3 (YC W13) Is A Massive Consumer Products Database To Rule Them All - relation
http://techcrunch.com/2013/02/25/yc-backed-semantics3-is-a-massive-consumer-products-database-to-rule-them-all/

======
jotto
I haven't tested their data, but one reason this is hard is the SKUs
at Walmart and Best Buy. There may be a Samsung 42-inch TV that exists at
Amazon, Walmart and Best Buy but has slightly modified specs for each of
those mega retailers... and with this, Best Buy has a lot of SKUs that simply
die. So once semantics3 has "reconciled" that Samsung 42-inch TV across
retailers, they'll have to continuously check whether any of the retailers
have changed the SKU and/or URL on them.

source: I do this at dealzon.com for a very limited set of data where it's
practical

~~~
netvarun
Yes, this is a big problem, which we have put in a lot of effort to tackle.

We try to calculate a 'hash' for the product, which is independent of the SKU,
factoring in all the structured metadata available - normalized dimensions
(height, length, width), weight, model, manufacturer, etc. We also account
for small variations in the numerical data points.

So even if SKUs change with slightly different specs, the 'hash' remains the
same and we can identify and reassign them.
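
A rough sketch of what such a SKU-independent fingerprint could look like, for illustration only. This is not Semantics3's actual algorithm: the function names, field choices, units and bucket sizes are all assumptions. Numeric attributes are snapped to coarse buckets so that small spec differences between retailers produce the same key.

```python
import hashlib


def _bucket(value, step):
    """Snap a measurement to the nearest multiple of `step` (hypothetical tolerance)."""
    return round(value / step) * step


def product_fingerprint(manufacturer, model, dims_cm, weight_kg,
                        dim_step=1.0, weight_step=0.5):
    """Build a retailer-independent key from structured metadata only.

    The SKU and URL are deliberately left out, so the key survives
    a retailer renumbering or relisting the product.
    """
    # Sort dimensions so that height/length/width ordering does not matter.
    dims = sorted(_bucket(d, dim_step) for d in dims_cm)
    parts = [
        manufacturer.strip().lower(),
        model.strip().lower().replace("-", "").replace(" ", ""),
        "x".join(f"{d:.1f}" for d in dims),
        f"{_bucket(weight_kg, weight_step):.1f}",
    ]
    return hashlib.sha1("|".join(parts).encode("utf-8")).hexdigest()


if __name__ == "__main__":
    a = product_fingerprint("Samsung", "UN42-F5000", (95.2, 56.1, 8.0), 9.9)
    b = product_fingerprint(" samsung ", "un42f5000", (56.3, 95.4, 7.9), 10.1)
    print(a == b)
```

One known weakness of plain bucketing: two values that sit on either side of a bucket boundary still hash differently, so a real system would likely need fuzzier matching on top of this.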

Drop me a note at varun [at] semantics3.com - we could swap notes :)

[Edit: Added extra information]

------
rosenjon
Very cool! This could be really useful for something I'm working on.

I noticed in the example queries on your site, the data that is returned looks
like it is being piped pretty much exclusively from Amazon. For example, every
image url is an Amazon url. Have you licensed this data from Amazon?
Otherwise, any customers who use this info on their website are effectively
ripping off Amazon's product data, and could be exposed to legal issues. For
example, would it be kosher to download an image and use it on your own site?

I'm trying to understand the terms under which this data is being provided and
whether it is worth it for me to buy a subscription to your service.

~~~
rosenjon
So the terms of service answer this question in a vague way. Basically, they
say you aren't allowed to use this data for any commercial purpose. But who is
going to want to use this data for anything other than commercial purposes?

How You Can Use the Service

Semantics3 grants you only a limited license to
access and make use of the Service for your own personal, non-commercial
purposes. This license is the only license to the Service granted by
Semantics3 unless you have a separate written agreement with Semantics3 or
there are Additional Terms applicable to your use of a particular feature,
resource or other portion of the Service. This license does not include any
resale or commercial use of the Service, its contents or any derivatives. You
agree not to reproduce, duplicate, copy, sell, resell or exploit for any
commercial purposes any portion of, or access to, the Service without the
express prior written consent of Semantics3.

You may not download any data or other resource made available as part of the
Service, except for temporary system caching necessary for your own personal,
non-commercial use of the Service in accordance with the TOS and applicable
Additional Terms, if any, and not for further distribution or transfer to
others, without the express prior written consent of Semantics3. You may not--
and you agree not to--modify, reformat, copy, display, distribute, transmit,
publish, license, create derivative works from, transfer or sell any
information, products or services obtained from the Service, except in
accordance with the TOS and as set forth in applicable Additional Terms, if
any, or otherwise agreed in writing with Semantics3. Additionally, except as
set forth in applicable Additional Terms, if any, or otherwise agreed in
writing with Semantics3, you may not: frame or mirror any part of the Service
without the express prior written consent of Semantics3; forge headers or
otherwise manipulate identifiers in order to disguise the origin of any
content transmitted through the Service; create a database or dataset by
systematically downloading and storing content from the Service; intentionally
or unintentionally violate any applicable local, state, national or
international law and any regulations having the force of law; use any robot,
spider, site search/retrieval application or other manual or automatic device
to retrieve, download, extract, copy, index, "scrape," "data mine" or in any
way gather information, content or other materials from the Service or
reproduce or circumvent the navigational structure or presentation of the
Service. Any unauthorized actions taken by you or by any third party on your
behalf automatically terminates any and all permissions or licenses granted to
you by Semantics3.

~~~
vinothgopi
Well, that doesn't make any sense. All our use cases would be violating our
own ToS then!

What we meant was you cannot resell the data. We shall fix the wording of that
ASAP. Thanks for pointing it out.

------
swohns
This is an incredible achievement, would love to hear more about the backstory
and the "curation" process.

~~~
netvarun
By "curation", it's machine "curation" of course :) All the product data is
categorized and disambiguated with the rest. At this point we spit out a
confidence score, and only those that pass a high threshold are inserted
into our master database. The rest are discarded - because they were missing
some attribute or we didn't know where to fit them in our category tree, etc.
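
The gating step described above might look roughly like this. It is a minimal sketch under stated assumptions: the `Candidate` type, the `REQUIRED` attribute list and the 0.9 threshold are hypothetical and do not come from Semantics3.

```python
from dataclasses import dataclass

# Hypothetical set of attributes a record must carry to be usable.
REQUIRED = ("name", "category", "manufacturer")


@dataclass
class Candidate:
    attrs: dict
    confidence: float  # score from the categorize/disambiguate step


def accept(candidate, threshold=0.9):
    """Keep a record only if it is complete and confidently matched."""
    has_required = all(candidate.attrs.get(key) for key in REQUIRED)
    return has_required and candidate.confidence >= threshold


def partition(candidates, threshold=0.9):
    """Split candidates into (kept, discarded) lists."""
    kept, discarded = [], []
    for candidate in candidates:
        (kept if accept(candidate, threshold) else discarded).append(candidate)
    return kept, discarded
```

A record with a high score but no category is still discarded here, which mirrors the "didn't know where to fit it" case.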

In terms of the backstory - We started off as a data marketplace (like
infochimps) where folks could buy and sell data sets. But we soon realized
most of the demand was in the ecommerce vertical and we decided to focus
purely on that segment. Based on customer feedback, we scrapped the
downloadable data sets model and switched to the API model for the delivery of
data.

------
fnordfnordfnord
Some retailers get their own SKUs for common products, (to prevent price
comparisons?)

~~~
coderdude
Some retailers go so far as to invent different names for uncommon brands and
use generated SKUs to prevent even the poaching of new manufacturers.
Misinformation is used a lot to keep a competitive edge. One of the many
challenges this service faces is trying to figure out if mfr A is actually mfr
B.

------
buss
I always thought it would be nice to instantly learn a product's provenance.

An app that tells you where some product was made would probably go over
really well in the USA. There are a lot of people who would pay more for a
product if they knew it was made in the USA while the product next to it was
made in China. It's probably quite difficult to get this information, though.

------
patternpaul
This service would be excellent for insurance claims. I used to work with a
third party insurance claims specialist where they had a manually updated
database of products to valuate insurance claims. They would either do all the
hard work of doing the valuations, sell access to their reporting app or sell
access to their product DB... 90% of their sales was the product DB.

------
instakill
All payments are handled through PayPal? These guys are really putting all of
their eggs into one basket with that move.

~~~
vinothgopi
Why do you say so? PayPal allows payments through their account and payments
with credit cards without an account too.

~~~
vaksel
Because PayPal has a history of shutting down accounts for no apparent
reason. Then it takes months to get your money out.

~~~
vinothgopi
We started out in Singapore and the only payment processor we could use there
was PayPal. If we were starting now, we would use something easier and better
like Stripe :) That being said, we should be switching over to them soon.

~~~
jareau
Check out balancedpayments.com. It's processing, escrow, and payouts in one
API. Also the founding team were early employees at Milo.com and worked on
this specific issue of SKU level data cataloging & matching.

disclosure: I'm a co-founder of Balanced and this is a shameless plug for my
product. :)

I was also the head of data acquisition at Milo.com and am fascinated by this
problem. I have to admit I'm skeptical this scale of product matching and
metadata cleaning can be done, but I'm very happy for you if you've succeeded
where I failed. In either case, best of luck.

~~~
vinothgopi
Sure I'll check out balancedpayments.com :) We should definitely talk! Hit me
up at vinoth at semantics3.com

------
prawn
Direct URL: <https://www.semantics3.com>

------
aviswanathan
Seems complementary to a service like EDITD (www.editd.com). I'm curious how
Semantics3 determines which ecommerce sites to crawl. I remember reading
somewhere about how even Amazon only stocks something like 2% of all ecommerce
products.

~~~
vinothgopi
We started off adding most of the major e-commerce stores. As to how we decide
which sites to add, well that depends. We usually add sites based on what our
users seem to be asking for. More sites means more ability to do interesting
queries like comparing the prices between store A and B. Since we have the
products disambiguated, it is much easier.

Or if we find that the existing data that we gathered from our existing
sources for a certain vertical could be improved, we add new niche sites to
augment the data that we already have.

------
al_james
If it's being used for competitive price analysis, I wonder if any retail sites
will simply block their crawler? I am assuming that they are (correctly)
announcing their crawler by its user agent, so could be blocked via
robots.txt.

~~~
rozap
Not to mention, many ecommerce sites explicitly forbid this sort of thing. I'd
be interested to know how they got around it.

~~~
netvarun
We currently get the pricing data via RSS feeds, crawling, data dumps and for
some cases also crowdsourcing. In the long run, we also hope to establish
merchant relationships and get the data directly.

To the original question on crawling - (I had replied to a similar question
previously on HN): "Some great advice here on crawling at scale, which has
inspired our crawlers a lot : <http://news.ycombinator.com/item?id=4367933>
Basically it boils down to three things: 1. If the site is slow, crawl
slooowly. 2. If you see non-200 HTTP error codes, stop! 3. Obey robots.txt and
speed restrictions."
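
Those three rules can be sketched with Python's standard `urllib.robotparser`. This is an illustration, not Semantics3's crawler: `polite_crawl`, the `fetch` callback (assumed to return a status code, a body and the elapsed seconds), the `example-bot` user agent and the delay values are all hypothetical.

```python
import time
import urllib.robotparser


def polite_crawl(urls, fetch, robots, user_agent="example-bot",
                 base_delay=1.0, slow_after=2.0, sleep=time.sleep):
    """Fetch `urls` in order while following the three rules above.

    fetch(url) must return (status_code, body, elapsed_seconds).
    `robots` is a urllib.robotparser.RobotFileParser that has already
    been loaded with the site's robots.txt.
    """
    pages = {}
    # Rule 3: honor the Crawl-delay directive when the site sets one.
    delay = robots.crawl_delay(user_agent) or base_delay
    for url in urls:
        # Rule 3: skip anything robots.txt disallows for this agent.
        if not robots.can_fetch(user_agent, url):
            continue
        status, body, elapsed = fetch(url)
        # Rule 2: any non-200 response ends the crawl.
        if status != 200:
            break
        pages[url] = body
        # Rule 1: a slow response means back off further.
        if elapsed > slow_after:
            delay *= 2
        sleep(delay)
    return pages
```

In real use, `robots` would be built with `RobotFileParser(url)` followed by `read()`, and `fetch` would wrap an HTTP client. Passing `sleep` in as a parameter just makes the pacing easy to test.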

------
sachinag
With UPCs/EANs. Finally. No idea why Freebase never had them.

------
hnwh
Is this US-centric, or applicable to EU audiences as well?

~~~
netvarun
We are currently US-centric only. We hope to cover the other regions (starting
off with the EU zone) in the next few months. Drop me a mail at varun [at]
semantics3.com - I will keep you posted!

------
samstave
I want this for food!

