
Tech giants let the Web's metadata schemas and infrastructure languish - timhigins
https://threadreaderapp.com/thread/1291509746000855040.html
======
the_duke
Actual title: Google and other tech giants are happy to have control over the
Web's metadata schemas, but they let its infrastructure languish

I know that hating on Google is fashionable, but that's a bit too much
editorializing. Especially considering the content of the post, and Google
just being a small side note.

---

On-topic: I recently looked into using schema.org types as the basis for an
information-capturing system, but many of the types are somewhat outdated, of
questionable quality or just missing. Development indeed seems slow, while
changes that are needed by one of the larger involved companies get pushed
through quickly.

I think a big part of that stagnation is a lack of interest though. The whole
semantic web domain has been pretty much inactive.

It's a real shame: having canonical types for most things in existence, and
having those actually be supported as import/export formats or for cross-app
integrations, would be immensely valuable! But there is absolutely no business
incentive there - rather the opposite. Easy portability of data is not
something most companies would want.

~~~
Vinnl
> But there is absolutely no business incentive there - rather the opposite.
> Easy portability of data is not something most companies would want.

That depends on what kind of data it is. For example, your home address is not
part of your bank's primary business model, but keeping it up-to-date is
important for it. If data portability in and out of the bank makes it more
likely that you'll keep it up-to-date, that's useful for your bank as well.

Legislation and customer demand are also making it more and more palatable. If
some data is not critical to your business model, but being the sole guardian
of it is a legal/reputational liability, then handing control of that data
over to someone else and re-using it is very useful.

~~~
riffraff
That's interest on the side of the data consumer, not the data provider, for
lack of better words.

If the bank was the one owning the information they would not want it to be
shared with others as that would allow their client to easily migrate to
another bank which they definitely do not want.

But as the one receiving the data, sure, it would be nice to have others share
it with me, they'd say.

I'm afraid without legislation data sharing is never going to be a thing.

~~~
Vinnl
At this moment that bank _is_ the entity that keeps this data. Their challenge
is, however, that the data gets outdated. But if they give other parties the
ability to access that data, then the consumer will have more motivation to
keep it up-to-date, and the bank will now have access to more accurate address
data.

(Note that the bank is an example - it could be another party.)

~~~
mschuster91
The solution for this would be for banks to use the government as single
source of truth - in Germany we have the Melderegister anyway, it's mandatory
to register your primary address.

Unfortunately it's not allowed by law that a consumer gives "push access" to
e.g. banks, health insurance or employers.

------
frou_dh
This reminds me of the tragic situation where if you process XHTML locally
using XML tools that incidentally fetch the DTD, then things block and become
absolutely dirt slow, because the W3C sysadmins are permanently pissed off by
that:
[https://stackoverflow.com/a/13865692/82](https://stackoverflow.com/a/13865692/82)

~~~
habitue
This is great. It's a perfect solution too, because it's not super long, but
it's long enough that nobody is going to go into production without figuring
out how to cache it
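
A minimal sketch of that kind of caching, assuming Python's standard
`xml.sax` interfaces. `CachingResolver` and the `fetch` callable are
hypothetical names introduced here for illustration; only `EntityResolver`
and its `resolveEntity(publicId, systemId)` method come from the standard
library. The idea is that each DTD is downloaded at most once and served from
disk afterwards.

```python
import hashlib
import tempfile
from pathlib import Path
from xml.sax.handler import EntityResolver


class CachingResolver(EntityResolver):
    """Resolve external entities (such as DTDs) to files in a local cache."""

    def __init__(self, cache_dir, fetch):
        self.cache_dir = Path(cache_dir)
        self.fetch = fetch  # hypothetical callable: URL string -> bytes

    def resolveEntity(self, publicId, systemId):
        # One cache file per system identifier (for a DTD, its URL).
        name = hashlib.sha256(systemId.encode("utf-8")).hexdigest() + ".dtd"
        path = self.cache_dir / name
        if not path.exists():
            # Cache miss: the only point at which the remote server is contacted.
            path.write_bytes(self.fetch(systemId))
        # Returning a string tells the parser to read from that location.
        return str(path)


if __name__ == "__main__":
    requests_made = []

    def fake_fetch(url):  # stand-in for a real HTTP download
        requests_made.append(url)
        return b'<!ENTITY nbsp "&#160;">'

    with tempfile.TemporaryDirectory() as cache_dir:
        resolver = CachingResolver(cache_dir, fake_fetch)
        url = "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"
        resolver.resolveEntity(None, url)
        resolver.resolveEntity(None, url)
        print(len(requests_made))  # the second lookup is served from disk
```

A parser obtained from `xml.sax.make_parser()` accepts such an object through
`setEntityResolver()`. Recent Python versions leave external entity
processing off by default, so `feature_external_ges` would also have to be
enabled before the resolver is consulted at all.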

------
jacques_chester
It was once put to me that Google's promotion system creates this dynamic.

Starting a new project that garners widespread attention looks good in a
package, but replacing lightbulbs and scrubbing floors doesn't. Folks create a
splash, get promoted, then move on and are not replaced.

I've never worked at Google, so I do not know if this dynamic is real. I would
be interested to hear from Googlers about incentives to work or not work on
something.

~~~
ocdtrekkie
Former Googler, 16 hours ago, stated:
[https://news.ycombinator.com/item?id=24077692](https://news.ycombinator.com/item?id=24077692)

~~~
jacques_chester
He says incentives and recognition began to shift by 2010, which is good to
hear.

But that leaves me wondering how the linked situation occurs. It's a cliché
that Google shutters projects or loses interest. I don't know whether that's
particular to Google or if it's due to an availability heuristic (Google is
well-known, so its wanderings-off are widely publicised).

But if there _are_ such dynamics and incentives, they are worthy of attention.
Google exerts enormous gravity on the fabric of the technology industry; it
would be helpful to avoid hurtful externalities arising from internal
incentives.

~~~
acdha
My theory is that Google is notably worse at this because their core
advertising business is so profitable. Most companies would have been forced
to become good at managing projects by necessity but until ad revenues
substantially decline Google can subsidize a ton of inefficiency and still
report good numbers to Wall Street, just like Microsoft's various write-offs
in the 90s and 2000s.

~~~
jacques_chester
It certainly doesn't help. I refer to this as the "Mississippi of Money"
problem. If a river is deep, wide and fast-flowing, you can do pretty much
anything and still get somewhere.

The rest of us have to make do with leaky canoes, going up a certain creek,
often sans a paddle.

------
Santosh83
Isn't schema.org supposed to be an "industry wide" collaborative effort? In
which case we must also remark on the disinterest shown by players like
Microsoft, Apple or Google, or even Facebook, Twitter and so on, all of whom
benefit by this semantic markup.

~~~
dbish
My opinion is the disinterest from bigger players is because of the lack of
traction/interest from the broader community. Maybe there's a chicken and egg
problem that comes with bootstrapping any new standard

~~~
reaperducer
IMO, the reason for the lack of interest from the broader community is because
it's unnecessarily complicated, and somehow simultaneously incomplete to the
point of being almost unusable for some projects.

It was clearly designed by bureaucrats who enjoy making rules and sub-rules
and sub-sub-rules. It doesn't matter if it works, or is useful, as long as
there are plenty of rules.

There's a reason nobody wants to play with the jerk dungeonmaster.

~~~
dbish
That's fair. I can see that being the issue. Speed and ease of use don't go
hand in hand with having a well defined ontology and processes for updating it

------
wrnr
Google is dropping the ball here, as they stand to benefit the most from a
single central ontology for the web. It does illustrate that this approach
doesn't work if you are looking to innovate quickly and not be dependent on
the goodwill of a single institution that doesn't even know who you are.

Maybe we can finally stop using ontologies for the semantic web and start
solving the hard problem of language pragmatics.

~~~
zo1
A single ontology would level the search/structured-web playing field. Right
now Google has a huge advantage because they leverage their ML/NN knowledge
and funding to "extract" structured information from the unstructured web. New
players just can't do that without a lot of time & funding, which wouldn't be
the case if a lot of the data was in a structured and agreed-upon
format/schema.

~~~
zozbot234
ML/NN methods and knowledge only become _more_ effective and powerful if
accurate input data is available. You can not only "extract" information, but
also perform complex queries, inference etc. over structured data.

------
sawaruna
I'd love to see schema.org updated and used more. As someone still doing
linked data work, albeit in academia, I mainly use it simply to provide more
context to self-created, domain specific properties within ontologies using
things like rdfs:seeAlso, skos:related, etc.
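
A rough sketch of what such a link can look like, written as a small Python
helper that emits Turtle. The `ex:` namespace, the `fieldSite` property and
the `describe_property` function are made up for illustration; the `rdfs:`,
`skos:` and `schema:` prefixes are the usual ones.

```python
PREFIXES = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    "skos": "http://www.w3.org/2004/02/skos/core#",
    "schema": "https://schema.org/",
    "ex": "https://example.org/vocab#",  # hypothetical project namespace
}


def describe_property(name, label, see_also, related):
    """Return a Turtle document declaring one custom property.

    Assumes `label` contains no quotes or backslashes (no escaping is done).
    """
    lines = [f"@prefix {prefix}: <{iri}> ." for prefix, iri in PREFIXES.items()]
    lines += [
        "",
        f"ex:{name} a rdf:Property ;",
        f'    rdfs:label "{label}" ;',
        # Point readers and tools at the closest schema.org terms.
        f"    rdfs:seeAlso schema:{see_also} ;",
        f"    skos:related schema:{related} .",
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    print(describe_property("fieldSite", "field site", "contentLocation", "Place"))
```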

Ideally it'd be nice (imo) if schema.org had more domain specific extensions,
similar to the bib[0] one which allows for things like comic book properties
to be described.

[0]
[https://schema.org/docs/bib.home.html](https://schema.org/docs/bib.home.html)

------
stefan_
I don't understand. This is a tool for Google to extract information from
websites and keep potential visitors on Google instead. Every use case for and
future progress on it will be measured on that metric.

They don't care to address any of the issues or "fix the infrastructure"
because this isn't an "organize all the information in the world!" project at
all. The guys that take Google visitor retention stats into their next
performance meeting are probably poking fun at all the ontology nerds that
have descended on their metric-driven scheme.

~~~
danShumway
From the official website:

> Schema.org is a collaborative, community activity with a mission to create,
> maintain, and promote schemas for structured data on the Internet, on web
> pages, in email messages, and beyond.

> A shared vocabulary makes it easier for webmasters and developers to decide
> on a schema and get the maximum benefit for their efforts. It is in this
> spirit that the founders, together with the larger community have come
> together - to provide a shared collection of schemas.

If this isn't an "organize all the information in the world" project, then
Google and the other companies involved are branding it in a horribly
dishonest way. In which case, they should be criticized for presenting a
company-specific visitor retention strategy like it's some kind of altruistic
gift to the world.

Sites like Facebook and Twitter have their own 'lite' metadata schemas that
they use to help identify and render links. Hardly anyone criticizes them over
it, because they haven't registered a generic domain like 'schema.org' and
presented their work like it's some kind of community-driven collaboration.
They're upfront that it's just a simple API for their website.

------
hn-cmt
The Schema.org vision certainly is not dead within Google. See the
Google-backed DataCommons project at
[http://datacommons.org/](http://datacommons.org/) which heavily relies on the
schemas defined by schema.org. Headed by the creator of schema.org.

------
tomcam
My solution was to reverse engineer highly ranked web pages. I used a subset
of the schema that seemed to be universal to those pages. Schema.org just gave
me the proper file formats.

~~~
evolve2k
Extrapolating, is this possibly a “JavaScript - The good parts” sort of
problem?

Care to elaborate further on which parts of the schema you kept/found most
useful?

------
techntoke
I think their choice of JSON-LD as the recommended format and not being
transparent in how it affects results is the biggest issue. JSON-LD requires
duplication of content, whereas microdata is inline with existing content.

~~~
modernerd
JSON-LD is also much easier to generate in scenarios where you can access post
meta data and output scripts but can't necessarily filter HTML markup output
(WordPress, corporate CMS, etc.).

And you can generate it dynamically:
[https://developers.google.com/search/docs/guides/generate-st...](https://developers.google.com/search/docs/guides/generate-structured-data-with-javascript)
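
As a rough illustration of generating it from post metadata, here is a
minimal Python sketch. The `json_ld_script` function and the `title`,
`published` and `author` keys are assumptions made up for this example, not
any CMS's actual API; `BlogPosting`, `headline`, `datePublished` and `author`
are schema.org terms.

```python
import json


def json_ld_script(post):
    """Render post metadata as a JSON-LD <script> element."""
    data = {
        "@context": "https://schema.org",
        "@type": "BlogPosting",
        "headline": post["title"],
        "datePublished": post["published"],
        "author": {"@type": "Person", "name": post["author"]},
    }
    # Escape "<" so that text such as "</script>" inside a value cannot
    # close the element early; "\u003c" decodes back to "<" in JSON.
    payload = json.dumps(data, indent=2).replace("<", "\\u003c")
    return '<script type="application/ld+json">\n' + payload + "\n</script>"


if __name__ == "__main__":
    print(json_ld_script(
        {"title": "Hello", "published": "2020-08-07", "author": "Ada"}
    ))
```

The output is a self-contained block that can be appended to a page without
touching the template's existing markup.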

~~~
techntoke
A lot of websites are now generated using static site generators and it is
much easier to do so inline than to duplicate the content, which also makes
the pages much bigger. Like I said, the issue is more about lack of
transparency about how it may affect ranking.
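
For comparison, a minimal sketch of the inline approach as a static site
generator template helper might write it. The `microdata_article` function
and the `title`, `published` and `author` keys are hypothetical; `itemscope`,
`itemtype` and `itemprop` are the standard microdata attributes. Each value
appears in the output once, as visible text.

```python
from html import escape


def microdata_article(post):
    """Render a post as HTML whose visible text carries the microdata."""
    return (
        '<article itemscope itemtype="https://schema.org/BlogPosting">\n'
        f'  <h1 itemprop="headline">{escape(post["title"])}</h1>\n'
        '  <p>By <span itemprop="author" itemscope '
        'itemtype="https://schema.org/Person">'
        f'<span itemprop="name">{escape(post["author"])}</span></span> on '
        f'<time itemprop="datePublished">{escape(post["published"])}</time></p>\n'
        "</article>"
    )


if __name__ == "__main__":
    print(microdata_article(
        {"title": "Hello", "published": "2020-08-07", "author": "Ada"}
    ))
```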

------
acdha
Actual thread for anyone wanting to look at the images which Thread Reader
stripped:

[https://twitter.com/alkreidler/status/1291509746000855040](https://twitter.com/alkreidler/status/1291509746000855040)

~~~
efreak
The images show up just fine for me? If you're like me and don't like Twitter,
use nitter instead; it works just fine without javascript (the entire thread
is there on one page)

~~~
acdha
Their lazy loader appeared to fail - the first image rendered but none of the
rest did.

------
bawolff
So fork? It's not the big G's responsibility to solve all the internet's
problems and honestly most other web metadata standards have failed, only
difference is that this one has a big name attached we can all blame.

------
Nasrudith
There is one question I always have about the semantic web schemes: what if it
finally catches on and the end sites just immediately start lying their ass
off for selfish purposes? Like many of the earlier search engine optimizations
to try to land common hits on a massive page that doesn't actually provide
what you are looking for.

The only way around that is for somebody to do the processing of the real data
to validate that it isn't just bullshit for a nefarious purpose. From what
I've heard, the Semantic Web seems a bit skeuomorphic as a concept.

~~~
zozbot234
The nice thing about the newer semantic web standards is that they're a lot
more detailed than the "description" and "keywords" standards of old. It's
more obvious if a website is providing misleading information for self-serving
purposes.

------
zelly
All the information is already out there. Ontology is a crutch.

------
pokoleo
@dang there's a typo in the name: should say "infrastructure", not
"infrastrucure"

It's missing the T, as-in: _infrastrucTure_

------
valuearb
How do you avoid the Bike-Shedding problem?

Would forcing the proposer to quantify costs and benefits help?

~~~
pas
How does the IETF manage it? It operates by seeking broad consensus. That
might work. Also getting a good compromise done should be in the interest of
everyone. Plus there could be iterations every few years. (Like there are with
JS/ECMAScript via TC39.)

------
westurner
It's "languishing" and they should do it for us? It's flourishing and they're
doing it for us and they have lots of open issues and I want more for free
without any work.

Wow! Nobody else does _anything_ to collaboratively, inclusively develop
schema and the problem is that search engines aren't just doing it for us?

1) Search engines do not owe us anything. They are not obligated to dominate
us or the schema that we may voluntarily decide to include on our pages.

We've paid them nothing. They have no contract for service or agreement with
us which compels them to please us or contribute greater resources to an open
standard that hundreds of people are contributing to.

2) You people don't know anything about linked data and structured data.

Here's a list of schema:
[https://lov.linkeddata.es/dataset/lov/](https://lov.linkeddata.es/dataset/lov/)
.

Here's the Linked Open Data Cloud:
[https://lod-cloud.net/](https://lod-cloud.net/)

Does your or this publisher's domain include any linked data?

Does this article include any linked data?

Do data quality issues pervade promising, comparatively-expensive, redundant
approaches to natural-language comprehension, reasoning, and summarization?

Here, in contributing this example PR adding RDFa to the codeforantarctica web
page, I probably made a mistake.
[https://github.com/CodeForAntarctica/codeforantarctica.githu...](https://github.com/CodeForAntarctica/codeforantarctica.github.io/pull/3)
. Can you spot the mistake?

There should have been review.

[https://schema.org/ClaimReview](https://schema.org/ClaimReview), W3C
Verifiable Claims / Credentials, ld-signatures, and lds-merkleproof2017.

Which brings us to reification, truth values, property graphs, and the new
RDF* and SPARQL* and JSON-LD* (which don't yet have repos with ongoing issues
to tend to).

3) Get to work. This article does nothing to teach people how to contribute to
slow, collaborative schema standards work.

Here's the link to the GitHub Issues so that you can contribute to schema.org:
[https://github.com/schemaorg/schemaorg](https://github.com/schemaorg/schemaorg)

...

"Standards should be better and they should pay for it"

Who are the major contributors to the (W3C) open standard in question?

Is telling them to put up more money or step down going to result in getting
what we want? Why or why not?

Who would merge PRs and close issues?

Have you misunderstood the scope of the project? What do the editors of the
schema feel in regards to more specific domain vocabularies? Is it feasible or
even advisable to attempt to out-schema domain experts who know how to develop
_and revise_ an ontology or even just a vocabulary with Protégé?

To give you a sense of how much work goes into creating a few classes and
properties defined with RDFS in RDFa in HTML: here's the
[https://schema.org/Course](https://schema.org/Course) ,
[https://schema.org/CourseInstance](https://schema.org/CourseInstance) , and
[https://schema.org/EducationEvent](https://schema.org/EducationEvent) issue:
[https://github.com/schemaorg/schemaorg/issues/195](https://github.com/schemaorg/schemaorg/issues/195)

Can you find the link to the Use Cases wiki (which was the real work)? What
strategy did you use to find it?

...

"Well, Google just does what's good for Google."

Are you arguing that Google.org should make charitable contributions to this
project? Is that an advisable or effective way to influence a W3C open
standard (where conflicts of interest by people _just donating time_ are
disclosed)?

Anyone can use something like extruct or OSDS to extract RDFa, Microdata,
and/or JSON-LD from a page.
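
To give a sense of how little is needed for the JSON-LD case, here is a
minimal sketch that uses only Python's standard library. It is not extruct's
or OSDS's API; `JsonLdExtractor` and `extract_json_ld` are names made up for
this example, and RDFa and Microdata are not handled.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collect the contents of <script type="application/ld+json"> elements."""

    def __init__(self):
        super().__init__()
        self.items = []
        self._buffer = None  # text chunks while inside a JSON-LD script

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._buffer is not None:
            # json.loads raises ValueError if the page's JSON-LD is malformed.
            self.items.append(json.loads("".join(self._buffer)))
            self._buffer = None


def extract_json_ld(html_text):
    """Return the parsed JSON-LD blocks found in an HTML string."""
    parser = JsonLdExtractor()
    parser.feed(html_text)
    parser.close()
    return parser.items


if __name__ == "__main__":
    page = (
        '<html><head><script type="application/ld+json">'
        '{"@context": "https://schema.org", "@type": "Person", "name": "Ada"}'
        "</script></head><body></body></html>"
    )
    print(extract_json_ld(page))
```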

Everyone can include structured data and linked data in their pages.

There are surveys quantifying how many people have included which types in
their pages. Some of that data is included on schema.org types pages.

...

Some written interview questions:

> _Which issues have you contributed to? Which issues have you seen all the
> way to closed? Have you contributed a pull request to the project? Have you
> published linked data? What is the URL to the docs which explain how to
> contribute resources? How would you improve them?_

[https://twitter.com/westurner/status/1291903926007209984](https://twitter.com/westurner/status/1291903926007209984)

...

After all that's happened here, I think Dan (who built FOAF, which all
profitable companies could use instead of
[https://schema.org/Person](https://schema.org/Person) ) deserves a week off
to add more linked data to the internet now please.

~~~
InfiniteRand
I think that might be fair, but when schema.org came out it pitched itself
as "trust us, we will take care of things", so yeah, they don’t owe us
anything, but track record matters for trust in future ventures by these
search engine orgs.

~~~
westurner
schemaorg/schemaorg/CONTRIBUTING.md
[https://github.com/schemaorg/schemaorg/blob/main/CONTRIBUTIN...](https://github.com/schemaorg/schemaorg/blob/main/CONTRIBUTING.md)
explains how you and your organization can contribute resources to the
Schema.org W3C project.

If you or your organization can justify contributing one or more people at
full or part time due to ROI or goodwill, by all means start sending Pull
Requests and/or commenting on Issues.

"Give us more for free or step down". Wow. What PRs have you contributed to
justify such demands?

[https://schema.org/docs/documents.html](https://schema.org/docs/documents.html)
links to the releases.

------
rondennis
What's in it for the tech giants? Google is merely interested in peddling its
ads to its Chrome users. Use Brave instead.

------
ProAm
Google neglect a project? No...I don't believe it.

Isn't this SOP for Google?

