Understanding HTML with Large Language Models (arxiv.org)
63 points by PaulHoule on Oct 11, 2022 | 11 comments



There is a visual demo here: https://sites.google.com/view/llm4html/home.

This work is very exciting to me for a few reasons:

- HTML is an incredibly rich source of visually structured information, with a semi-structured representation. This is as opposed to PDFs, which are usually fed into models with a "flat" representation (words + bounding boxes). Intuitively, this offers the model a more direct way to learn about nested structure, over an almost unlimited source of unsupervised pre-training data.

- Many projects (e.g. Pix2Struct https://arxiv.org/pdf/2210.03347.pdf, also from Google) operate on pixels, which are expensive (both to render and process in the transformer). Operating on HTML directly means smaller, faster, more efficient models.

- (If open sourced) it will be the first (AFAIK) open pre-trained ready-to-go model for the RPA/automation space (there are several closed projects). They claim they plan to open source the dataset at least, which is very exciting.
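
To make the first point concrete, here is a minimal sketch (my own illustration, not the paper's preprocessing) of the difference between a nested view of HTML and a "flat" bag of words. It uses only Python's standard `html.parser`; the names `OutlineParser`, `nested_outline`, and `flat_words` are hypothetical, and the void-tag list is deliberately partial.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag; a partial list, enough for a sketch.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class OutlineParser(HTMLParser):
    """Records (depth, kind, value) triples so the nesting stays visible."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.outline = []

    def handle_starttag(self, tag, attrs):
        self.outline.append((self.depth, "tag", tag))
        if tag not in VOID_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS:
            self.depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.outline.append((self.depth, "text", text))


def nested_outline(html):
    """Structured view: every tag and text node keeps its nesting depth."""
    parser = OutlineParser()
    parser.feed(html)
    parser.close()
    return parser.outline


def flat_words(html):
    """Flat view: just the words, with all structure thrown away."""
    return [
        word
        for _, kind, value in nested_outline(html)
        if kind == "text"
        for word in value.split()
    ]


if __name__ == "__main__":
    page = "<div><p>Hello <b>world</b></p></div>"
    for depth, kind, value in nested_outline(page):
        print("  " * depth + f"{kind}: {value}")
    print(flat_words(page))
```

The flat view cannot tell you that "world" sits inside a `<b>` inside a `<p>`; the nested view can, and that is the signal a model reading raw HTML gets for free.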

I'm particularly excited to extend this and similar work (https://arxiv.org/abs/2110.08518) to HTML question answering and web scraping.

Disclaimer: I'm the CEO of Impira, which creates OSS (https://github.com/impira/docquery) and proprietary (http://impira.com/) tools for analyzing business documents. I am not affiliated with this project.


Exciting/scary stuff! A sophisticated enough version could carry out the whole range of tasks that a typical computer user could perform in a browser, from just a few sentences of instruction, with a fairly high chance of success.

We will overuse this tech, forgetting important processes that are perhaps wise to keep a "human backup" for redundancy. Then again, RPA is already a case where a "proper" rewrite of some multi-program pipeline is impossible.


This is a "classic" tension. Having worked in the (broader) RPA space for a while, I would say that the true north star of most processes is (a) rewriting the internal procedures to be transformations on data (not UIs) and (b) standardizing communication across companies.

There is a lot of momentum to solve (a) with no code, but it's slow because processes are impossibly complex. I think AI will accelerate this and could result in the "human backup" dystopia. On the other hand, AI can also be used to generate code, and I'm optimistic that technology like this can accelerate humans' ability to encode complex processes robustly (as transformations of data) and would be 10 or 100x less work than no/low code.


> On the other hand, AI can also be used to generate code, and I'm optimistic that technology like this can accelerate humans' ability to encode complex processes robustly (as transformations of data) and would be 10 or 100x less work than no/low code.

Ah right, lots of angles to consider! A hybrid system would certainly be interesting. Let the AI runtime generate and evaluate code to perform tasks (e.g. selenium/puppeteer in python/java). Upon failure, "escalate permissions" to enable DOM control, or full mouse/keyboard to complete the task (probably best not to let the thing open up a code-editor with M/KB controls though heh)
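
A rough sketch of that escalation idea, in case it helps. Everything here is hypothetical: the `Strategy` type, `run_with_escalation`, and the strategy names are made up for illustration, and the two stub functions stand in for real browser automation (a generated script, then direct DOM control).

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Strategy:
    name: str                  # hypothetical label for the permission level
    run: Callable[[str], str]  # takes a task description, returns a result or raises


def run_with_escalation(
    task: str, strategies: List[Strategy]
) -> Tuple[str, str, List[Tuple[str, str]]]:
    """Try strategies from least to most privileged.

    Returns (name of the strategy that worked, its result, earlier failures).
    Raises RuntimeError if every strategy fails.
    """
    failures: List[Tuple[str, str]] = []
    for strategy in strategies:
        try:
            result = strategy.run(task)
        except Exception as exc:  # any failure means: escalate to the next level
            failures.append((strategy.name, str(exc)))
            continue
        return strategy.name, result, failures
    raise RuntimeError(f"all strategies failed: {failures}")


# Stubs standing in for real automation back ends (assumptions, not real APIs).
def generated_script(task: str) -> str:
    raise ValueError("selector not found")


def dom_control(task: str) -> str:
    return f"done via DOM: {task}"


if __name__ == "__main__":
    ladder = [
        Strategy("generated_script", generated_script),
        Strategy("dom_control", dom_control),
    ]
    print(run_with_escalation("submit the signup form", ladder))
```

Keeping the ladder as an explicit ordered list makes it easy to leave the riskiest level (full mouse/keyboard) out entirely, and the failure log gives a human something to review afterwards.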


The model used in this research is T5, which is already open source. So I think once the dataset is released, we'll see an open version of the pre-trained model very soon.


This is Google; they for sure aren't releasing the weights.


They release a lot of weights open source, including T5 (the underlying model they used in this work). They also indicated their intent here: https://twitter.com/aleksandrafaust/status/15799326368934420....



Could also jump straight into the code that generates so much of that "unlimited" HTML (the web frameworks).


Related, "natbot" uses the stock GPT-3 model (no fine-tuning apart from the examples in the prompt) to drive a browser:

https://github.com/nat/natbot
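
For anyone unfamiliar with the "examples in the prompt" approach, here is a minimal sketch of how such a few-shot prompt could be assembled. This is not natbot's actual prompt or code; the function `build_prompt`, the field labels, and the command strings are all assumptions made up for illustration, and the call to the language model itself is left out.

```python
def build_prompt(examples, page_summary, objective):
    """Assemble a few-shot prompt from worked examples plus the current page.

    examples: list of (page_summary, objective, command) tuples.
    The model is expected to continue the text after the final "COMMAND:".
    """
    parts = ["You control a web browser. Reply with a single command."]
    for ex_page, ex_objective, ex_command in examples:
        parts.append(
            f"PAGE:\n{ex_page}\nOBJECTIVE: {ex_objective}\nCOMMAND: {ex_command}"
        )
    # The final block has no command; the model fills it in.
    parts.append(f"PAGE:\n{page_summary}\nOBJECTIVE: {objective}\nCOMMAND:")
    return "\n\n".join(parts)


if __name__ == "__main__":
    shots = [
        ('<input id=1 name="q"/>\n<button id=2>Search</button>',
         "search for cats",
         'TYPE 1 "cats"'),
    ]
    current_page = '<link id=1>Sign in</link>\n<link id=2>Pricing</link>'
    print(build_prompt(shots, current_page, "open the pricing page"))
```

The interesting engineering is mostly in shrinking the page into a short summary that fits the context window, which is also why operating on HTML rather than pixels is attractive.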


Looking forward to the code and the model.



