
Mozilla Fathom: Find meaning in the web - guifortaine
https://github.com/mozilla/fathom
======
joelg
Tim Berners-Lee loves to wax poetic on the evolution of our mental model of
what's important and what we consider to be the nodes in the computing graph:

TCP/IP: "It's not the wires, it's the computers!"

HTTP/HTML: "It's not the computers, it's the documents!"

Semantic Web (/Fathom): "It's not the documents, it's the things that they are
about!"

[https://www.w3.org/DesignIssues/Abstractions.html](https://www.w3.org/DesignIssues/Abstractions.html)

I look forward to the day when the idea of discrete, atomic documents is
finally abstracted away in the same manner that we abstracted away the
physical machines that host them.

~~~
ZenoArrow
I agree that our uses for the web will evolve again once we embrace the
semantic web. To repurpose some marketing-speak, I see the semantic web as
'Web 3.0' (as it goes beyond the possibilities of the web apps of 'Web 2.0').

There does seem to be some overloading of the term 'semantic web' though.
Sometimes the term is used to refer to semantic markup, such as using the <em>
HTML tag instead of using the <i> HTML tag...

[https://en.m.wikipedia.org/wiki/Semantic_HTML](https://en.m.wikipedia.org/wiki/Semantic_HTML)

Other times, the term semantic web is used to refer to structured meaning
transmitted through specialised metadata...

[https://en.m.wikipedia.org/wiki/Semantic_Web](https://en.m.wikipedia.org/wiki/Semantic_Web)

Fathom seems to sit across the two, though I'd suggest it's closer to the
first, unless it's used as a tool to analyse the design of existing web pages
for the benefit of new website designs (combined with site usage data, e.g.
bounce rates, etc...).

~~~
janober
Totally agree! The semantic web (Web 3.0) will change a lot about how we
interact with websites and will finally make the information on the web
usable. Currently every website lives in its own little world. As long as I
stay on one page it kind of works: I can sort and filter by properties, for
example. But as soon as I want to combine information from different pages it
breaks down (say, if I want to see the hotel I'm staying in alongside the best
restaurants in the city from Yelp).

That is why we are currently working on
[http://link.fish](http://link.fish), a smart bookmark manager that lets
people work with the information behind the URLs. We are currently in a closed
beta, but I'm happy to hear honest feedback from anybody.

Here is a short 2 min (low quality) demo video which shows how it works:
[https://youtu.be/Chfy3le5gY0](https://youtu.be/Chfy3le5gY0)

~~~
ZenoArrow
Link.fish looks promising, I can see it being a useful tool. Best of luck with
it.

~~~
janober
Thanks & great to hear!

------
sugarfactory
Extracting machine-understandable meaning from web pages is analogous to
extracting text from images.

Fortunately, we usually don't need to process web pages with fancy yet barely
accurate algorithms in order to extract machine-readable text from them. Why?
Because we agreed to use character codes to encode letters, and most of the
time text is stored using some character encoding, which makes it unnecessary
to OCR pictures of handwritten letters to process text from web pages
programmatically.

These kinds of programs wouldn't be needed if only the same thing had happened
for page structure, i.e. if HTTP included page semantics.

~~~
firasd
The issue with creating tech for such semantics is whether authors put in the
effort to provide metadata. For example, rel=next/previous has been around
forever, but most webpages don't use it because it isn't exposed in browsers
or other clients. Other data mentioned in the examples, like title and Open
Graph tags, is provided for search engines, Facebook previews, and such.
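
For concreteness, both kinds of hints can be read with standard DOM calls when
a page actually provides them. A minimal sketch (nothing Fathom-specific, and
the helper name is made up):

```
// Minimal sketch: read pagination and Open Graph hints straight from the DOM.
// Each value is null when the page author never provided it, which is the
// adoption problem described above.
function readBasicHints(doc) {
  const attr = (selector, name) => {
    const el = doc.querySelector(selector);
    return el ? el.getAttribute(name) : null;
  };
  return {
    next: attr('link[rel="next"]', 'href'),
    prev: attr('link[rel="prev"]', 'href'),
    ogTitle: attr('meta[property="og:title"]', 'content'),
    ogImage: attr('meta[property="og:image"]', 'content'),
  };
}

// Usage, e.g. in a browser console or a content script:
// console.log(readBasicHints(document));
```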

~~~
astrobe_
I suspect it's against the website's interests. If you provide semantic
markup, it makes it easier to crawl your website, extract the actual content,
and leave the ads behind.

~~~
hyperdunc
The <article> tag already makes it pretty easy.

------
jaredkerim
One of the first projects we're building on top of Fathom is a collection of
'rules' for extracting a consistent metadata representation of web pages. For
instance, many pages expose Open Graph tags to identify semantic metadata
such as title, description, icon, preview image, canonical URL, etc. However,
not all pages use Open Graph: some expose Facebook tags, some Twitter tags,
some only generic HTML meta tags, and some expose none of those at all!

We want to use Fathom as the engine for applying a series of rules to look for
various forms of metadata in pages and collect them in a consistent fashion.
This can be used for storing/querying rich data about pages, presenting nice
previews of pages to users, and other applications.
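
The 'consistent representation' idea boils down to a priority-ordered fallback
chain per field: try the richest source first and fall back to plainer ones. A
rough sketch of that shape in plain DOM code (illustrative only; the actual
project expresses this as Fathom rules, not hand-written chains like this):

```
// Per-field fallback chain: Open Graph, then Twitter cards, then generic HTML.
// The first rule that yields a value wins.
const titleRules = [
  doc => doc.querySelector('meta[property="og:title"]')?.content,
  doc => doc.querySelector('meta[name="twitter:title"]')?.content,
  doc => doc.querySelector('title')?.textContent,
];

function extractTitle(doc) {
  for (const rule of titleRules) {
    const value = rule(doc);
    if (value) return value.trim();
  }
  return null;
}
```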

My hope is that this repository acts as a collection point for people to
continue to contribute rules for various forms of metadata, from basic page
representation, to more domain specific things like product data, media data,
location data, etc.

The project can be found here:

[https://github.com/mozilla/page-metadata-parser](https://github.com/mozilla/page-metadata-parser)

This library is designed to be used either as a node package within a
server-side Node ecosystem, or client side through npm and
webpack/browserify/etc. It's currently tested against Node and Firefox.
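
For illustration, a minimal usage sketch, assuming the package exposes a
`getMetadata(document, url)`-style entry point and a Node version with a
global `fetch`; the authoritative export names and output fields are whatever
the repo's README documents:

```
// Node-side sketch. `getMetadata` and its output shape are assumptions here;
// check the page-metadata-parser README for the actual API.
const { JSDOM } = require('jsdom');
const { getMetadata } = require('page-metadata-parser');

const url = 'https://example.com/article';
fetch(url)
  .then(res => res.text())
  .then(html => {
    const doc = new JSDOM(html).window.document;
    console.log(getMetadata(doc, url)); // title, description, icon, image, ...
  });
```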

Soon I hope to wrap it in a RESTful API with a Docker container, which will be
found here (still needs docs/tests):

[https://github.com/mozilla/page-metadata-service](https://github.com/mozilla/page-metadata-service)

I recently found a very similar project called `Manifestation` which does
almost the exact same thing, so I hope to collaborate with Patrick and
integrate the projects if possible.

[https://github.com/patrickkettner/manifestation](https://github.com/patrickkettner/manifestation)

------
sr3d
This is an interesting library to watch for sure. Personally I have built many
scrapers and extractors for in-house use, and I have spent many hours tweaking
Readability JS, so I know how complicated and hard to test that code is.
Seeing how Fathom does its job is cool: it takes care of a lot of the
low-level bookkeeping so that all you need to focus on is tweaking the ranking
formula. I wouldn't be surprised if in the future we have a shared repo of
"recipes" for parsing pages; slap on a nice UI with DOM traversal and we'd
have a Kimono-like app for extracting content.
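
"Tweaking the ranking formula" usually means scoring candidate container
elements with a few cheap signals and keeping the winner. A toy scorer just to
make that concrete (this is neither Readability's nor Fathom's actual
formula):

```
// Toy content scorer: reward text mass, punish link-heavy containers.
// Real extractors use richer signals, but the tuning loop looks like this.
function scoreCandidate(el) {
  const text = el.textContent || '';
  const linkText = [...el.querySelectorAll('a')]
    .map(a => a.textContent.length)
    .reduce((sum, len) => sum + len, 0);
  const linkDensity = text.length ? linkText / text.length : 1;
  return text.length * (1 - linkDensity);
}

function findMainContent(doc) {
  const candidates = [...doc.querySelectorAll('article, main, section, div')];
  return candidates.reduce(
    (best, el) => (scoreCandidate(el) > scoreCandidate(best) ? el : best),
    doc.body
  );
}
```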

~~~
jaredkerim
Yes, you are exactly right; that is what we are planning with:

[https://github.com/mozilla/page-metadata-parser](https://github.com/mozilla/page-metadata-parser)

This repo is meant to be exactly what you describe: a collection of 'recipes'
or 'rules' for extracting various forms of metadata from pages. It's still in
its infancy, but we are close to deploying a first version to users via Test
Pilot:

[https://testpilot.firefox.com/](https://testpilot.firefox.com/)

I would love feedback or contributions!

------
kmike84
It looks rule-based. There are Python libraries which try to solve similar
tasks using machine learning:
[https://pypi.python.org/pypi/autopager](https://pypi.python.org/pypi/autopager),
[http://formasaurus.readthedocs.io/en/latest/](http://formasaurus.readthedocs.io/en/latest/),
[https://github.com/scrapinghub/page_finder](https://github.com/scrapinghub/page_finder).
I wonder how the quality compares. It is hard to make rules work reliably on
thousands of unseen websites.

------
fitzwatermellow
_HTMLElement.dataset is string-typed, so storing arbitrary intermediate data
on nodes is clumsy_

I use dataset extensively. Once each DOM element is assigned a unique key id,
many things get simplified: DOM manipulation, client/server state sync, event
handling, etc. Now, if an additional data layer could resolve a key id to
semantic metadata, the only problem left would be exposing that data so third
parties can read it. So the problem could be solved with three data points:
URL, DOM key, and the resolved metadata value.
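
The quoted limitation is easy to demonstrate: `dataset` coerces every value to
a string, so anything richer than an id has to live in a side table keyed by
the element, such as a WeakMap. A minimal sketch:

```
const el = document.createElement('div');

// dataset values are always strings, so structured data gets flattened:
el.dataset.score = { clicks: 3, dwell: 12.5 };
console.log(el.dataset.score); // "[object Object]"

// Workaround: keep only an id (or nothing) on the node and store the real,
// typed data in a WeakMap keyed by the element itself.
const nodeFacts = new WeakMap();
nodeFacts.set(el, { clicks: 3, dwell: 12.5, tags: ['hotel', 'review'] });
console.log(nodeFacts.get(el).dwell); // 12.5
```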

What's really needed isn't a new protocol or set of rules and conditions
(OpenGraph probably works just fine), but a dedicated global database of
metadata: a web white-pages directory. As OpenGraph is perhaps the dominant
format, it may make sense for someone like Facebook to provide this.
Historically, third-party non-profit services have not been popular.

All of this is of course contingent on the assumption that content hosts
provide that metadata to begin with ;)

------
reitanqild
Happy to see another really interesting project from Mozilla.

Still worried about the extensions ecosystem though.

~~~
Manishearth
There's been a lot of misinformation about the extensions changes out there;
what are you worried about? Most of the worries about the upcoming Firefox
changes to addons seem to come out of this misinformation :)

~~~
reitanqild
> what are you worried about?

It is said to be similar to Chrome's. And Chrome doesn't support real
extensions like Tree Style Tabs etc.

Happy if you can tell me I am misinformed.

~~~
pauljohncleary
Extensions are a form of vendor lock-in, and Google was killing Mozilla in
that space because:

\- network effects (Chrome is more popular than FF)

\- Chrome extensions are so much easier to write than XUL FF extensions

Now that WebExtensions are implemented in FF, it's incredibly easy to port
extensions across (I ported one earlier this week; it took about 20 minutes!)

Tree Style Tabs and all that jazz aren't going away yet.

It's a killer move from Mozilla and makes FF a first class citizen of the web
again.

~~~
Sir_Cmpwn
The primary concerns I have, which I left Firefox over, are:

\- Dropping XUL breaks backwards compatibility, and it seems Mozilla is
willing to break it before they provide adequate replacement APIs

\- The design of the new add-on signature requirements turns AMO into a walled
garden, and I very much do not appreciate that

~~~
rhelmer
> Dropping XUL breaks backwards compatibility, and it seems Mozilla is willing
> to break it before they provide adequate replacement APIs

The problem is not XUL. Modifying the HTML or XUL DOM via JS is roughly
equivalent (setting aside XUL-only features like XBL; does that matter for
many extensions?).

The problem is all the internal JS APIs that add-ons can call right now:
there are too many to secure and keep backwards compatible. This is why
extensions break so often between releases. Those APIs are also pretty hard to
program against, so there are a lot of common bugs, and it's very difficult to
ensure any level of security, since Firefox extensions can do anything.

WebExtensions are intended to be a superset of the APIs Chrome exposes, and
new APIs are being added all the time. It must be possible to implement them
securely and maintain them over time, unlike the current situation with
internal-only APIs.
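
In practice, "a superset of the APIs Chrome exposes" means the same namespaced
calls work, with Firefox additionally offering a promise-based `browser.*`
flavour alongside Chrome's callback-based `chrome.*`. A small example (assumes
an extension with the `tabs` permission):

```
// Chrome-style, callback-based; this also works in Firefox:
chrome.tabs.query({ active: true, currentWindow: true }, tabs => {
  console.log(tabs[0].url);
});

// Firefox's promise-based variant of the same API:
browser.tabs.query({ active: true, currentWindow: true })
  .then(tabs => console.log(tabs[0].url));
```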

> The design of the new add-on signature requirements turns AMO into a walled
> garden, and I very much do not appreciate that

I disagree. Signing is required to make it more difficult for malicious
extensions to persist in the wild. There's no requirement to host on AMO, just
a requirement to sign if you want your extension to run in Firefox release
builds.

If the extension is later found to be malicious, it can be revoked without
having to depend on the ID (which is set by the add-on and trivial to
circumvent).

> The primary concerns I have, which I left Firefox over

Which browser did you switch to? There's still time to participate and
influence outcomes; old-style extensions are still supported today...

------
niftich
Seems like a tool about to be used in the Mozilla Context Graph effort,
discussed on HN last week [1].

[1]
[https://news.ycombinator.com/item?id=12044212](https://news.ycombinator.com/item?id=12044212)

------
ComodoHacker
Is this the tech behind Firefox Reader View?

~~~
ianbicking
Not exactly, though it was inspired by (and is a reaction to) that approach.
Reader View is based on this:
[https://github.com/mozilla/readability](https://github.com/mozilla/readability)

~~~
ComodoHacker
Which itself turns out to have been contributed by the community in 2010. Open
source in action.

------
fiatjaf
I don't quite understand why this is useful, but someone could turn it into a
browser extension.

