Understanding HTML with Large Language Models

ankrgyl · on Oct 11, 2022

There is a visual demo here: https://sites.google.com/view/llm4html/home.

This work is very exciting to me for a few reasons:

- HTML is an incredibly rich source of visually structured information, with a semi-structured representation. This is as opposed to PDFs, which are usually fed into models with a "flat" representation (words + bounding boxes). Intuitively, this offers the model a more direct way to learn about nested structure, over an almost unlimited source of unsupervised pre-training data.

- Many projects (e.g. Pix2Struct https://arxiv.org/pdf/2210.03347.pdf, also from Google) operate on pixels, which are expensive (both to render and process in the transformer). Operating on HTML directly means smaller, faster, more efficient models.

- (If open sourced) it will be the first (AFAIK) open pre-trained ready-to-go model for the RPA/automation space (there are several closed projects). They claim they plan to open source the dataset at least, which is very exciting.
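
To make the first point above concrete, here is a minimal sketch (not the paper's preprocessing, just an illustration) of the difference between a "flat" list of words and a representation that keeps the nesting. It uses only Python's standard-library `html.parser`; the class name `PathCollector`, the helper `text_with_paths`, and the sample snippet are all made up for this example.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag; they must not be pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class PathCollector(HTMLParser):
    """Record each text node together with the chain of tags enclosing it."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags, outermost first
        self.items = []   # (tag path, text) pairs in document order

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed children.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.items.append(("/".join(self.stack), text))


def text_with_paths(html):
    """Return (tag path, text) pairs for every non-empty text node."""
    parser = PathCollector()
    parser.feed(html)
    parser.close()
    return parser.items


# Hypothetical sample page fragment.
SNIPPET = "<div><h1>Plans</h1><ul><li>Free</li><li>Pro</li></ul></div>"

nested = text_with_paths(SNIPPET)
flat = [text for _, text in nested]   # what a words-only view would keep
```

In the flat view only `Plans`, `Free`, `Pro` survive; the nested view additionally says that `Free` and `Pro` are sibling list items under the same `ul`, which is the kind of structure a model reading HTML gets to see directly.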

I'm particularly excited to extend this and similar work (https://arxiv.org/abs/2110.08518) to HTML question answering and web scraping.

Disclaimer: I'm the CEO of Impira, which creates OSS (https://github.com/impira/docquery) and proprietary (http://impira.com/) tools for analyzing business documents. I am not affiliated with this project.

ShamelessC · on Oct 11, 2022

Exciting/scary stuff! A sophisticated enough version could carry out a wide range of the tasks that a typical computer user/browser could, from just a few sentences, with a somewhat high chance of success.

We will overuse this tech, forgetting important processes that are perhaps wise to keep a "human backup" for redundancy. Then again, RPA is already a case where a "proper" rewrite of some multi-program pipeline is impossible.

ankrgyl · on Oct 11, 2022

This is a "classic" tension. Having worked in the (broader) RPA space for a while, I would say that the true north star of most processes is (a) rewriting the internal procedures to be transformations on data (not UIs) and (b) standardizing communication across companies.

There is a lot of momentum to solve (a) with no code, but it's slow because processes are impossibly complex. I think AI will accelerate this and could result in the "human backup" dystopia. On the other hand, AI can also be used to generate code, and I'm optimistic that technology like this can accelerate humans' ability to encode complex processes robustly (as transformations of data) and would be 10 or 100x less work than no/low code.

ShamelessC · on Oct 11, 2022

> On the other hand, AI can also be used to generate code, and I'm optimistic that technology like this can accelerate humans' ability to encode complex processes robustly (as transformations of data) and would be 10 or 100x less work than no/low code.

Ah right, lots of angles to consider! A hybrid system would certainly be interesting. Let the AI runtime generate and evaluate code to perform tasks (e.g. selenium/puppeteer in python/java). Upon failure, "escalate permissions" to enable DOM control, or full mouse/keyboard to complete the task (probably best not to let the thing open up a code-editor with M/KB controls though heh)
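
A rough sketch of that escalation idea, purely as an illustration: try the least-privileged strategy first and only move up when it fails. Everything here is hypothetical (the function `run_with_escalation`, the strategy names, and the stub strategies); a real system would plug in generated Selenium/Puppeteer scripts, DOM control, and mouse/keyboard control where the stubs are.

```python
from typing import Callable, List, Optional, Tuple

# A strategy is a (name, callable) pair; the callable returns True on success.
Strategy = Tuple[str, Callable[[str], bool]]


def run_with_escalation(task: str, strategies: List[Strategy]) -> Optional[str]:
    """Try strategies from least to most privileged.

    Returns the name of the first strategy that succeeds, or None if all fail.
    """
    for name, attempt in strategies:
        try:
            if attempt(task):
                return name
        except Exception:
            # For this sketch, any error counts as a failure and we escalate.
            continue
    return None


# Stub strategies standing in for the real thing (assumptions, not real APIs).
def generated_script(task: str) -> bool:
    raise RuntimeError("generated code did not find the element")


def dom_control(task: str) -> bool:
    return True


def mouse_keyboard(task: str) -> bool:
    return True


LADDER: List[Strategy] = [
    ("generated_script", generated_script),  # least privileged
    ("dom_control", dom_control),
    ("mouse_keyboard", mouse_keyboard),      # most privileged
]

used = run_with_escalation("cancel my subscription", LADDER)
```

With these stubs the generated script fails and the run stops at `dom_control`, so full mouse/keyboard control is never granted unless the cheaper options have already failed.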

yoquan · on Oct 12, 2022

The model used in this research is T5, which is already open source. So I think once the dataset is released, we'll see an open version of the pre-trained model very soon.

hwers · on Oct 11, 2022

This is Google, they for sure aren’t releasing the weights.

ankrgyl · on Oct 12, 2022

They release a lot of weights open source, including T5 (the underlying model they used in this work). They also indicated their intent here: https://twitter.com/aleksandrafaust/status/15799326368934420....

nl · on Oct 12, 2022

Google Research releases the vast majority of their interesting model weights. See e.g.:

https://tfhub.dev/google

https://tfhub.dev/tensorflow

https://tfhub.dev/deepmind

https://tfhub.dev/mediapipe

https://tfhub.dev/ml-kit/collections/image-classification/1

yellsatclouds · on Oct 11, 2022

could also jump straight into the code that generates so much of that 'unlimited' html (them web frameworks)

drothlis · on Oct 11, 2022

Related, "natbot" uses the stock GPT-3 model (no fine-tuning apart from the examples in the prompt) to drive a browser:

https://github.com/nat/natbot

bootcat · on Oct 11, 2022

Looking forward to the code and the model.