
Fathom: a framework for understanding web pages - nachtigall
https://hacks.mozilla.org/2017/04/fathom-a-framework-for-understanding-web-pages/
======
unabst
> The browser could recognize a Log In link, follow it in the background, and
> log you in,

As a web developer, this is exactly what I don't want. Just the other day I
got bit by font boosting on Mobile Chrome. Couldn't figure out for the life of
me why my h1 was bigger than I had _explicitly_ specified. No traces of
anything going on on the desktop either. Turns out, my page was being tampered
with because someone at Chrome had a "brilliant" idea for mobile. The fix is
to set max-height to some inconsequential value like 1000000px. My page
just had a header, but font boosting destroys navigation menus and tiles too.
Thank you but no thank you.
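
For reference, the workaround described above looks something like this (a
sketch; the selectors are illustrative stand-ins for whatever elements get
boosted on your page):

```css
/* Hypothetical example of the max-height workaround for Chrome's
   mobile font boosting: an explicit max-height opts an element out
   of the boosting heuristic. The value just has to be larger than
   the element will ever actually be. */
h1,
.site-nav,
.tile {
  max-height: 1000000px;
}
```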

There cannot be any conflict between browser and web developer. If we have to
fight and hack one another, we're both failing.

Font boosting needs to be explicit. And nothing should be done without the
explicit intent of the author of the page.

If the browser wants to provide automatic login, then great. Please outline
exactly how to enable it, make it as easy and automated as possible, and help
decrease my workload. Thank you and thank you.

~~~
flukus
> There cannot be any conflict between browser and web developer. If we have
> to fight and hack one another, we're both failing.

It sounds like your understanding is wrong. Your CSS is a polite suggestion
to the browser, nothing more.

If you want pixel-perfect rendering, then publish PDFs.

~~~
unabst
Code should be explicit. There should be no "in between the lines" and the
browser should not be guessing what the developer or consumer wants, let alone
override what was explicitly declared.

In that case, having to "politely suggest" max-height: 1000000px is not the
right conversation.

~~~
derefr
What does it mean to "be explicit" about how to render a page, when the UA is,
say, a screen-reader for a blind person? Or Siri? Or an even more novel
client-type with no directly-applicable CSS rules? They want to do a
completely different thing with the page than what you think of as "rendering"
it. They might take some of your CSS as hints on how to do that job, but
they're not _obeying_ that CSS; they're _inferring_ their own rule set for
their own rendering algorithm _from_ CSS.

Perhaps there needs to be a distinction made between "standard web browsers in
Standards Mode"—programs that nominally "obey" CSS, and should be chastised
for deviating from it—and all other clients, on which you can place no such
expectation.

(But even then, it's perfectly within the rules of CSS to apply UA styles at
the _beginning_ of the cascade. Usually those are per-browser, but there's no
reason there couldn't be per-document ones generated from heuristics. Want to
turn them off? Use a CSS Reset.)

~~~
unabst
> when the UA is, say, a screen-reader for a blind person?

The blind person explicitly turns on blind mode. This is not a difficult
problem. Trying to guess if the user is blind is a hard problem.

If the user wants bigger fonts, they can zoom. If font boosting is a feature
they would like on, then have them turn it on. If the author makes a crappy
page it should be on them. The best thing a browser could do is let them know.
And to visitors, provide options. But instead, they automatically enable these
secret brilliant features that break random things. These are not solutions.
These are the cause of many problems.

Automatic login? Great. Have a button. Have a feature. Don't do it in the
background. Just let me tell you what I want. And let the page authors embrace
those features and tailor them, and not wind up hacking them to tame their
functionality.

~~~
derefr
I'm not talking about option-switches; I'm talking about purpose-built
browsers or browser extensions, where _using_ the browser was the choice you
made to get these effects.

If someone wants to develop what's essentially "automatic login, even for
sites that would hate you if you did that: the browser", is that wrong?

Hell, this is essentially the same argument as the one behind ad-blocking. If
someone wants to build a browser that—by default—alters your page to remove
ads (surprise: someone does!), and people want to use that browser (probably:
everyone), I don't think you have the moral high ground to tell them to stop.

Sure, you might be able to at least insist that they do some sort of feature-
negotiation, where they tell you (maybe with feature-headers? a piece-wise
replacement for UA-string heuristics, how lovely) what they're going to do to
the page, and your server can then _choose_ to do things like just not serving
pages to people who have {font boosting, ad-blockers, etc.}; or redirecting to
a page telling them their browser is bad and they should feel bad.

~~~
unabst
> feature-negotiation

It's words like these. We should not be looking to negotiate with anyone --
not as an author, and not as a consumer.

If someone builds an auto-login browser, or an auto-login feature, then fine.
But when Mozilla decides to put this in by default without telling everyone,
then not fine.

If font boosting were part of some cross-browser mobile web standard, then
fine. But if you're Google and it's a feature specific to your browser, and
you turn it on by default, then not fine.

Of course, the premise always seems to be that your feature can do no harm.
But in practice it's usually the opposite. There are always edge cases where
something breaks or isn't right, with these features as the cause.

Login is actually extremely important. To tamper with the behavior of the
browser when it comes to logins, and to do it automatically, seems extremely
dangerous.

But either way, it's about communication, not negotiation. If the user wants
to turn it on, great. If the author turns it on, great. If no one turns it on,
but the browser developer just "thinks it's a good idea" for _every_ site ever
built, then not great.

------
gavinpc
We've been seeing more Datalog-inspired DSLs around here, and that's a good
thing. Fathom surely has uses beyond those the OP mentioned.

But as for those use cases... well, it just makes me sad. Obviously people
have perverse incentives to make the kind of noise that Mozilla is bemoaning,
and those people will find a way to game any system --- especially a highly
readable one!

~~~
grincho
> Fathom surely has uses beyond those the OP mentioned.

(Author here.) In fact, Fathom isn't particularly coupled to the DOM, apart
from the dom() call that acts as the initial source of data and some of its
optional utility procedures. With a few tweaks, you could use it for any
score-and-rank problem.
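
As a rough illustration of what a generic score-and-rank pipeline means here
(plain JavaScript with made-up names, not Fathom's actual API; the
multiplicative scoring mirrors how the blog post describes scores combining):

```javascript
// A minimal score-and-rank sketch: each rule multiplies a candidate's
// score, and candidates are sorted so the highest score wins.
// All names here are illustrative, not part of Fathom.
function rank(candidates, rules) {
  return candidates
    .map(candidate => ({
      candidate,
      score: rules.reduce((total, rule) => total * rule(candidate), 1),
    }))
    .sort((a, b) => b.score - a.score);
}

// Example: rank strings by "paragraph-ness".
const rules = [
  text => (text.length > 20 ? 2 : 1),      // longer text scores higher
  text => (text.includes('.') ? 1.5 : 1),  // full sentences score higher
];

const ranked = rank(['Hi', 'A long sentence about something.'], rules);
console.log(ranked[0].candidate); // the long sentence ranks first
```

The same shape works for DOM nodes, log lines, or anything else you can
write scoring functions over, which is presumably what "any score-and-rank
problem" means.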

------
throwaway2016a
I like this post, but I disagree with the premise... the whole "web pages
don't implement microformats and RDF, so we're going to take away all their
control" doesn't sit right with me.

If stores using microformats and semantic markup is important, then give
added value to the places that support them and people will start using them.
If readability is important, start penalizing sites with bad readability
indexes. But please don't take over my UX. I can't think of a faster way to
stifle innovation.[1]

Try the documentation, which is much less political:
[https://mozilla.github.io/fathom/intro.html#why](https://mozilla.github.io/fathom/intro.html#why)

[1] That's an exaggeration. I can think of lots of better ways.

~~~
r3bl
I agree with you, on the one hand, that a browser shouldn't mess with the
layout, but it could provide an alternative layout of the content (kind of
like reader mode does) at the click of a button, and I'd be more than happy
to use it (the same way I got used to reading articles in a single, clean
layout).

But, how do you think that a browser could theoretically penalize a website?

~~~
throwaway2016a
> But, how do you think that a browser could theoretically penalize a website?

Easier to do on a search engine. But browsers already do this in some ways:
many browsers call out, or plan to call out, sites that don't use HTTPS.

Or as another example: the warning that happens when you try to access a page
that uses a self-signed certificate.

Granted, the penalty needs to be proportional to the crime. So maybe the
penalty and value add go together. If the value add is a useful feature that
everyone wants and demands then not having it becomes a penalty to the sites
that don't offer it.

------
hnruss
I looked at mozilla/activity-stream a bit to try to find some examples of
fathom usage, but didn't find any. Then I figured out that it doesn't depend
directly on fathom-web, it depends on page-metadata-parser (which then depends
on fathom-web).

Here's the code in that project which uses Fathom:
[https://github.com/mozilla/page-metadata-parser/blob/master/parser.js](https://github.com/mozilla/page-metadata-parser/blob/master/parser.js)

------
dlwdlw
Oftentimes, trying to reduce illegibility kills the ecosystem. Centralized
economies, for example, though easier to understand and control, often
destroyed the very economies they were meant to manage.

Certain choices cause divergence: they increase your optionality in the
future. Other choices cause convergence to one thing. Convergence is
desirable only if that one thing is truly the one true thing.

So neither divergent nor convergent thinking is good in itself unless you
apply another filter on how you see the future. You either see messiness as a
sign that the future holds possibility, or you see the present as a broken
world in need of fixing (or as isolated areas of perfection needing
protection).

------
untangle
Prev HN Comments:

[https://news.ycombinator.com/item?id=12060787](https://news.ycombinator.com/item?id=12060787)

------
andy_ppp
This looks incredible for starting to understand web content in useful ways.
Will definitely give this a try for a product I'm building.

------
vinceguidry
Would using something like this run afoul of the CFAA's rules prohibiting web
scraping?

~~~
throwaway2016a
I hadn't read about the CFAA and scraping, so I had to do some research, and
it is fascinating. I just read through
[http://www.sociallyawareblog.com/2014/07/21/data-for-the-taking-using-the-cfaa-to-combat-web-scraping/](http://www.sociallyawareblog.com/2014/07/21/data-for-the-taking-using-the-cfaa-to-combat-web-scraping/)

It seems that even though people keep trying to use the CFAA against
scraping, it almost always gets thrown out as long as the source is publicly
accessible, even if the ToS explicitly prohibits scraping.

Seems people have had more luck with just plain old copyright.

But back to this tool in particular... it depends entirely on what you are
using the data for. Don't use it to gain access to data you wouldn't normally
have, or to give other people access to data they wouldn't normally have, and
I can't see why the CFAA would apply.

------
acdha
Previous discussion:
[https://github.com/mozilla/fathom](https://github.com/mozilla/fathom)

------
nebabyte
> That scores within 7% of Readability’s output on a selection of its own test
> cases

"on a selection of" its own test cases = lol

> Fathom is a data-flow language like Prolog, so data conveniently “turns up”
> when there are applicable rules that haven’t yet seen it

Why are you trying to explain the concept of declarative programming without
just telling people it's declarative programming? Just because it's not
common in a given scene doesn't make it some proprietary concept you've just
introduced.

> The best part is that Fathom rulesets are data

That's actually kinda nice

> In 70 lines,

Insert that comic about "just a few lines" masking the function calls that do
the actual work; they don't "replace" a system, they just change where the
work is done when flexibility isn't needed.

~~~
grincho
You might prefer the more technical introduction at
[https://mozilla.github.io/fathom/intro.html#specific-areas-we-address](https://mozilla.github.io/fathom/intro.html#specific-areas-we-address).
It talks about Fathom in terms of declarativeness.

> "on a selection of"

One has to start somewhere! At this early stage, my aim is to demonstrate that
Fathom has value for simplifying the implementation of recognizers, not to
claim a polished, production-ready Readability alternative.

Though, frankly, getting to the latter would be a fun project for someone:
just write more features (in the ML sense), and add more tuning data. There
are lots of low-hanging TODOs in the code around
[https://github.com/mozilla/fathom/blob/master/examples/readability.js#L52](https://github.com/mozilla/fathom/blob/master/examples/readability.js#L52).

