The really interesting parts were
(1) trying to make sure that rendering was deterministic (so that identical pages always looked identical to Google for duplicate elimination purposes), (2) detecting when we deviated significantly from real browser behavior (so we didn't generate too many nonsense URLs for the crawler or too many bogus redirects), and (3) making the emulated browser look a bit like IE and Firefox (and later Chrome) at the same time, so we didn't get tons of pages that said "come back using IE" or "please download Firefox".
I ended up modifying SpiderMonkey's bytecode dispatch to help detect when the simulated browser had gone off into the weeds and was likely generating nonsense.
Most useful aside of all time.
That tends to be my fallback whenever I'm specifically fussed about the "freshness" of a page. That, or curl.
I don't get why the rendering had to be deterministic. Server-side rendered HTML documents can also contain random data and it doesn't seem to prevent Google from doing "duplicate elimination".
Obviously, there's a lot I know you can't say, but I'd love to know your general thoughts on how far off we were:
Or maybe they were trying to get past the great firewall of China?
(I'm not thrilled about rendering this way, but it makes development a lot easier.)
If you aren't heavily reliant on conversions from search traffic, you can probably get away with being JS driven. I'd suggest sticking with anchor tags for direct navigation with JS overrides, assuming you are supporting full URL changes... otherwise you need to support the hashbang alternate paths, which is/was a pain when I did it 3-4 years ago.
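A minimal sketch of what I mean by anchor tags with JS overrides, assuming full URL support (the data-internal attribute and loadContentFor function are made-up placeholders, not any particular framework's API):

    // Plain anchors keep working for crawlers and no-JS users; JS users get
    // pushState navigation instead of a full page load.
    document.addEventListener('click', function (e) {
      var link = e.target.closest('a[data-internal]'); // hypothetical marker attribute
      if (!link) return;
      e.preventDefault();
      var url = link.getAttribute('href');
      history.pushState({}, '', url); // real URL change, no hashbang needed
      loadContentFor(url);            // hypothetical app-specific render function
    });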
Towards the end of my time there, there was someone in Mountain View working on a heavier, higher-fidelity system that sandboxed much more of a browser, and they were trying to improve performance so they could use it on a higher percentage of the index.
Part of the reason is that Google themselves are doing this in many of their products... some of their employees probably disagree with "JS everything", but they're in the minority.
I don't agree with this. Browsers are more user-targeted than ever.
We're moving from browsers being viewers for simple HTML documents (which can be copied, shared, and linked via simple means)
Browsers still allow this.
I have rarely seen a web API that uses anything other than straightforward JSON.
The "open by default" nature of plain HTML has become the "closed by default" of the data processed by web apps
Almost always just as open as any HTML you would previously have received.
Native app platforms (e.g. mobile) are also gradually becoming more "closed by default"; I'm not sure if that's a related trend.
What do you mean by this?
> Browsers still allow this.
Yes, they allow it. But consider that user stylesheets have been dropped, and hardly any improvements have been made in presenting standard HTML (unless you count giving in to IE and moving from less-stark black-on-grey to too-stark black-on-white as an improvement).
Still, no browser does a half-decent job of avoiding ragged-right text, giving you decent margins on un-styled content, etc. There's no real reason for this. You could claim "backwards compatibility" -- but if there was genuine interest, there'd be nothing stopping the introduction of a <sane-default-render-html6-whatever> content type.
It's ironic that browsing a plain HTML site in w3m in the console is a better reading experience than opening the same page in a desktop browser. So of course people need to supply a crap-load of stuff just to get decent, basic text layout that flows well across various screen sizes. There's no reason a basic, unstyled HTML document couldn't look much better than a TeX/LaTeX document published in the 80s, with the added bonus of re-flowing in a sane way for various window/screen sizes -- but they all look awful, to the point that plain HTML is actually not usable.
You need to wrap a document in js to get sensible layout, and in css to get sensible presentation. Even if the document doesn't contain any other media than text. Add an image or two, and things keep going downhill. It's absurd.
I don't think we can point to this reason to explain the rise of web apps.
Many devs use frontend JS in places where it's absolutely not needed. If you're building an app that updates in real time and shows information while it's being created, I'm fine with frontend JS, but it's overkill for most content pages.
Sure, it depends on your implementation details, but as I said, it's just my opinion.
I think it's perfectly fine to enhance stuff with JS, as long as important content is visible without it.
Some people go head over heels down the JS route (since this seems to be the way people do it today) and build things that can be built way cheaper (measured in hours) with traditional HTML. Since the outcome is the same (static content), it's just not necessary.
Note: I'm focusing on static content here. Pages that mainly show text and images (blogs, news sites, et al.). (Web)apps are another topic and present good reasons to use frontend JS.
Sometimes doing it all in react/angular is easier than bolting on jquery extension after extension... bloating everything up. Also, if you're using more modern techniques, you're going through a build/minify step anyways which makes it even easier still to be more JS based than static.
My experience with JS-only apps is that they're often less usable, more brittle, and frequently don't work at all in IE.
Sometimes you need additional functionality, maybe a realtime graph of share prices that has to be JS. So progressively enhance just that component, or gracefully degrade if you have to (e.g. put a sign up saying "switch on JS to get this specific functionality") but don't use it as an excuse to turn everyone away. It might be that you will reach people who can live without the stock ticker.
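A tiny sketch of what enhancing just that one component might look like (the element id and message are made up):

    // Progressive enhancement: the page ships a static notice, and JS replaces
    // it with the live component only when JS is actually running.
    var placeholder = document.getElementById('ticker-placeholder'); // hypothetical id
    if (placeholder) {
      placeholder.textContent = 'Loading live prices...';
      // start the real-time updates (polling, WebSockets, etc.) here
    }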
Sometimes you just can't do without JS. I wrote a desktop publishing app on the browser once. Obviously I wrote it in JS - users were forced to use a modern standards compliant browser (this was an internal app) - but if I'm doing an ecommerce site, or really any public site, I'm always challenging the devs who want to "build it in angular" to reconsider that option before ploughing ahead.
It isn't 2010 anymore.
React (just to name an example; there are many others) completely avoids this issue - you get server-side and client-side rendering out of the box.
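As a rough sketch of the server half (the component name and paths are placeholders, not anyone's actual setup):

    // Render the same component to an HTML string on the server...
    var React = require('react');
    var ReactDOMServer = require('react-dom/server');
    var App = require('./App'); // hypothetical component shared with the client

    var html = ReactDOMServer.renderToString(React.createElement(App));
    // ...embed `html` in the page you serve; on the client, ReactDOM.render
    // attaches to that same markup instead of rebuilding it from scratch.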
reddit is almost a chat service.
Your example is also very poignant, as neither of those would have come close to existing without proper URLs.
Not to mention that when we demand a functional JS parser and DOM tree just to get at the content, things like text-to-speech and many other things (including building a search engine!) become much harder. For very little (I'd say no) gain.
Not to sound completely apathetic, but so what? Most of us aren't building sites that we expect to be around in 10 years, much less 1000 years. The ephemeral nature of what we're building isn't lost on us - we're trading that guaranteed longevity for an improved development process (though some obviously disagree).
Sure, 50 years down the road if these sites still exist they'll probably be nigh-unusable without some sort of "ES6 emulator mode", but so what? I don't think we'll go wanting for any historical artifacts from this time period. If we do, it'll be because future generations have no interest in our generation - not because we didn't produce enough relics.
This isn't so much about plain-jane HTML pages (useful as they are, since they have a simple interaction model that many understand and enjoy). It's more about using and exposing data in a visible manner (known formats and semantics) and hyperlinks, rather than a single-page app with opaque data. This gives you network effects.
Think about the minor uproar over hash-bang URLs around 5 years ago, Twitter being the primary offender. That was single-page-application oriented rather than hyperlink oriented. There is a reason they've moved away from that.
In the 90s, Google or Yahoo was just something students did with the links that were out there - and that eventually generated hundreds of billions in value because of network effects and the visibility of the information in HTML (i.e., they could apply algorithms like PageRank to it).
The point of the web architecture is that it enables serendipity. Most anyone who has had massive success in business will explain the role of luck, serendipity, and network effects in their rise.
20 years ago a webpage was just text, but it has evolved into so much more.
I'd be ok with a data site rendering everything from a set of json files. There is more legitimacy in having the presentation done in static html.
The same would go for sites mixing different information sources (Twitter, RSS, etc.). You can do the data fetching server side, but the user might prefer having it done client side for one reason or another (transparency, for instance).
These kinds of sites would still be purely informative, and yet having them heavily use JS makes sense.
I could think of many more situations where generating HTML from a different format on the client side is the right way to go. Horses for courses.
The answer to these questions will depend on your priorities and use case, and the choice can easily be between no site at all and a JS-rendered site.
"I understand that everything in HTML could also be represented in another format. But if your master data is in HTML, does it always make sense to convert the data to JSON just for the sake of it? Would you build a separate client-side rendering component only for that conversion?"
There are two non-circular considerations that favor HTML:
1. Web pages still render via the DOM. JSON data has to be transformed first (see the sketch after the list); HTML does not.
2. HTML has semantic capabilities. JSON does not.
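To illustrate the first point (the data here is made up):

    // A JSON payload needs an explicit rendering step before it reaches the DOM...
    var data = { title: 'Hello world' };
    var heading = document.createElement('h2');
    heading.textContent = data.title;
    document.body.appendChild(heading);

    // ...whereas an HTML fragment is already in the form the browser renders.
    document.body.insertAdjacentHTML('beforeend', '<h2>Hello world</h2>');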
Contrary to that, you are positing that everything should always be based on or converted to plain HTML before being served to a browser. That's the point where we disagree.
Perhaps it is because the type of content I consume is just reading material, listening material or viewing material. Each of which I can usually download and view with a dedicated application, if I so choose.
Do you think 1000 years from now there will still be one group of people working to make the www more "interactive" and another group of people working to make the www more "machine readable", the latter undoing the work of the former?
Impressive work by Google to do that at scale, of course, but they'd be dead in the water if they didn't.
pip install robotframework-selenium2library
*** Settings ***
Library    Selenium2Library

*** Test Cases ***
My First Test
    [Setup]    Open Browser    http://google.com    firefox
    Input Text    name=q    Robot Framework
    Click Button    name=btnG
    Wait Until Page Contains Element    ires
    Click Link    Robot Framework
    [Teardown]    Close All Browsers
Finding a problem with code is useful, but it's extremely limited. You might find 100 bugs, but if there are 101 bugs your product has the potential to fail completely. It's so much more useful to define a framework of things that the code has to do properly and make sure it does do them all properly. To that end, testing should come first - define what the code needs to do, write tests to make sure it does those things (automated unit tests where possible, but at the very least well-defined processes for how you make sure it works), and then write the code to actually do it. Any developer who isn't interested in proving their code works, and will continue to work as it becomes more complex, is a terrible developer.
tl;dr If you want to fix testing don't write any code until you know how you're going to test it works.
As for pure html front-ends, I understand the attraction, but when a single js-based implementation gets you consistent behavior and presentation across all browsers and mobile devices the advantages are pretty huge.
My first guess would be that they snapshot the DOM in the JS tick immediately after window.onload completes. Maybe they have a short pause to let any fast timeouts or callbacks complete, but there's got to be a cutoff at some point (e.g. to stop an infinite wait for pages that continuously update a relative date). Of course, with their own JS engine, I bet they can get really fancy with the heuristics to determine when to take that snapshot.
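In page terms, the heuristic I'm guessing at would look roughly like this (the grace period is a made-up number, not anything Google has published):

    // Guess at the snapshot heuristic: wait for window.onload, give fast
    // timeouts and callbacks a short grace period, then freeze the DOM.
    window.addEventListener('load', function () {
      var GRACE_MS = 250; // assumed cutoff; the real value (if any) is unknown
      setTimeout(function () {
        var snapshot = document.documentElement.outerHTML;
        // hand `snapshot` off to the indexer here
      }, GRACE_MS);
    });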
If they're smart, they actually make the exact timeout a function of a HMAC of the loaded source, to make it very difficult to experiment around, find the exact limits, and fool the indexing system. Back in 2010, it was still a fixed time limit.
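Something along those lines, purely as an illustration (the key, hash choice, and bounds are all invented):

    // Derive the render cutoff from an HMAC of the page source, so the exact
    // limit can't easily be probed from outside.
    var crypto = require('crypto');

    function renderTimeoutMs(pageSource, secretKey) {
      var mac = crypto.createHmac('sha256', secretKey).update(pageSource).digest();
      // Map the first two bytes of the MAC onto a 500-1500 ms window.
      return 500 + (mac.readUInt16BE(0) % 1000);
    }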
Think of a long page where you need to read a few things as you scroll. You could game Google by timing it so that most humans would see content x but any script that ran at an unnatural speed would see y.
Imagine you use setInterval to load a new paragraph from a server, and the server only provides a new paragraph 1200 ms after the first poll?
For example this URL:
is performing the drawing of the molecule using RaphaelJS, then pulling the corresponding molecule from the database using Ajax and updating the page. Googlebot performs all of these steps perfectly well and, at the end, indexes the page.
It is very annoying because this is not important in our case; what we want is good indexing of the main data pages, not these pages... I do not want to block the bot yet, but I need to figure out a way to have the main page ranked better.
"Property Prediction for Butane" site:https://www.chemeo.com
Certainly the right combination of kernel zero-days and JS interpreter exploits could be used to take over the machine, but it would be non-trivial.
You were trying to sandbox the JS engine rather than using disposable VMs?
Not everywhere. The ones in the USA already have free access to anything in Google. In fact, Google is a part of their network.
I think there is no way they are going to scrape the websites, as there are millions of them, each with its own structure.
So the bot loads your webpage into a headless browser and sends it a series of events to simulate a user interacting with it, and waits for navigation requests.
There is probably a whitelist of simulation behaviors (a rough sketch of the anchor case follows the list):
* mouseover, then click each <a> node
* mouseover every pixel
* mouseover, then change every <select> node
* mouseover, then click every <button>
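Not Google's actual code, just the shape of the first item:

    // Fire mouseover then click on every <a> node, and let the embedding
    // crawler watch for the navigation requests that result.
    Array.prototype.forEach.call(document.querySelectorAll('a'), function (a) {
      ['mouseover', 'click'].forEach(function (type) {
        a.dispatchEvent(new MouseEvent(type, { bubbles: true, cancelable: true }));
      });
    });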
Also, I wonder how Google handles security while executing random JS code. It's one thing to hack into a single browser. It's another thing to hack into a crawler. Think of all the possibilities.