It is messy and overly ambitious, but promises something like a return to the "view source" mindset of the old web - where data was in plain sight and anyone curious and a little tenacious could reshape the web for their own needs.
I have gone partway down this path for a related concept, and browser extensions are really the only way to go. The biggest risk and hassle is a reliance on brittle, site-specific logic to make things work well. I haven't dug into this project yet to see how automated any of this is or might become, but if there is an element of community sourcing (like a ruleset for scraping AirBnB effectively) it opens up a potential attack vector like any Greasemonkey-type script, especially if passed routinely to less technical users. Not a huge issue on day 1, but not an easily solvable one either.
Thanks! "View source mindset" is a nice word for what we're trying to promote with this project.
Brittle site-specific logic is indeed a challenge. So far we've started with the simplest thing possible: programmers manually writing scraping code, so we can focus on how the system works once you have the data available. That has been enough to test the system out and build lots of useful modifications ourselves.
I think eventually some degree of automation will be an important way to help end users use this tool with any website. The "wrapper induction" problem has been well studied and there are lots of working solutions for end-user web scraping, so I expect to be able to integrate some of that work.
We're also interested in a community of shared scrapers, but as you point out, there are security considerations. I'm considering trying central code review from the project to approve new site adapters and make sure they aren't doing anything obviously malicious. Another solution could be carefully restricting the expressivity of our scraping system (e.g. only specifying CSS selectors, no arbitrary code), but I doubt that would be sufficient for all cases. Would appreciate any suggestions here.
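To make the restricted option concrete, here's a rough sketch of what a purely declarative, selectors-only adapter could look like (TypeScript; the format and field names are hypothetical, not Wildcard's actual adapter API):

    // Hypothetical declarative site adapter: pure data, no executable code,
    // so a reviewer (or a static checker) only has to vet CSS selectors.
    interface DeclarativeAdapter {
      site: string;          // URL pattern the adapter applies to
      rowSelector: string;   // CSS selector matching one element per row
      columns: Array<{
        name: string;        // column name shown in the spreadsheet
        selector: string;    // evaluated relative to each row element
        attribute?: string;  // read an attribute instead of textContent
      }>;
    }

    // Illustrative example; the selectors are made up.
    const exampleAdapter: DeclarativeAdapter = {
      site: "https://www.airbnb.com/s/*",
      rowSelector: "[itemprop='itemListElement']",
      columns: [
        { name: "name", selector: "[itemprop='name']" },
        { name: "price", selector: "._price" },
        { name: "url", selector: "a[href]", attribute: "href" },
      ],
    };

Something like this can't exfiltrate data on its own, but as I said, I doubt pure selectors cover every site.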
I really like the term "view source mindset." I think it aptly applies to certain systems where you can intuit what the underlying programming is doing just by interacting with them. Definitely stealing that phrase.
It seems to rely on the willingness of the company owning the data to disclose its full data set to you. Currently, with things like GraphQL, we are moving in the opposite direction, in that the server only sends you those columns that are absolutely required to fill the fields in your GUI.
Since they used it as the example, I don't see any incentive for AirBnb to let random people on the internet download their full internal data tables. Quite to the contrary, AirBnb will block you from accessing their servers if they believe that you are scraping.
So this is a new way for users to toy around with the limited, incomplete data set that the website operator was willing to give them. But it won't empower users. What if AirBnb implements server-side pagination, so that your client doesn't even receive the data for the cheapest apartment, because it's on a different page?
Tools like this would be perfect in theory to enhance social networks like LinkedIn with an export and batch processing capabilities. But the company claiming ownership of your contacts will surely prevent you from actually getting a useful export.
Plus there are cases where the data is on a server because it's impractically large. For example, try this to improve your Google search results. Downloading a 100-million-row spreadsheet as the first step?
You're absolutely right that limited data access and pagination exclude certain types of modifications.
So far, we've decided to defer thinking about that limitation, and first focus on other questions like getting the spreadsheet interactions right. We're making new site adapters every week and finding that we can build lots of useful modifications for ourselves which work even with only one page of a paginated list. For one example, see my demo of modifying the HN front page [1], which I find useful even though it only loads the current front-page articles.
At some point, we're considering adding more features around fetching subsequent pages of a table (as explored in Sifter [2], which sorts an entire list of search results across pagination boundaries) or scraping each detail page from a table (as explored in Helena [3]).
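To sketch the direction (TypeScript; every name here is made up for illustration, and this is far simpler than what Sifter or Helena actually do):

    // Hypothetical sketch: accumulate rows across pagination boundaries so
    // sorting/filtering can operate on the whole list, not just one page.
    async function scrapeAllPages(
      scrapePage: (doc: Document) => object[],        // per-site row extractor
      nextPageUrl: (doc: Document) => string | null,  // per-site "next" link
      maxPages = 10                                   // be polite to the server
    ): Promise<object[]> {
      const rows: object[] = [];
      let doc: Document = document;                   // start from the live page
      for (let i = 0; i < maxPages; i++) {
        rows.push(...scrapePage(doc));
        const next = nextPageUrl(doc);
        if (!next) break;
        const html = await (await fetch(next)).text();
        doc = new DOMParser().parseFromString(html, "text/html");
      }
      return rows;
    }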
Tell the websites what is being done with the data/spreadsheet. If Hacker News is being filtered to exclude domains, or people are searching for all things LISP, the admins could use that information to change the website. Try making a sharing website (like Greasemonkey scripts) -- users post scripts and discuss what they're trying to do, and website admins can comment and post changes or scripts, etc...
I have another comment on this thread that discusses the difference between research and engineering. The goal of this project is not to improve your Google search results via the provided framework. That argument is a fine use of reductio ad absurdum, but it assumes a different premise than the one the paper is addressing. The paper is an inquiry into where we are building systems that could empower user modification but, for one reason or another, do not. I encourage you to read the Related Work section of the paper to perhaps pattern-match on other, more fleshed-out systems that might demonstrate the end goal in a way you've seen before.
Wow, this is the real deal. First time I've heard of this work, and I've got some more digging to do. It has all the right language and some great references, with pretty awesome related work on digital tools should anyone want to keep digging.
Low floor, high ceiling is the best case for that framework, and should be every toolmaker's ideal.
The Airbnb story reads like a sign of the times. Platforms can do as they like; users just have to conform. PCs and the internet promised the kind of programmatic control described here (I wonder if there is a better term than "programmatic" control?): end users should be able to come up with arbitrary representations of the data they query on the fly and realize them as quickly as possible.
Web UIs are stupidly underpowered; table-based queries for flights as presented here seem much more usable. Michel Beaudouin-Lafon has a few great ideas to explore here: "One is Not Enough," which he described in a different context but which I think applies to the desire for composability between multiple tools here (Airbnb + walkability), and "software is not soft," describing the boundaries placed on software users. I have many tools for manipulating strings or sorting numbers; why can't I use them on the Airbnb table listings, served up on my computer?
This is the most inspiring implementation of live web scraping that I have seen. However, I think it will only work well on semantic HTML. I don't know about AirBnb, used in the paper, but I can say many good things about GitHub. GitHub is an awesome example of a customizable web app thanks to its solid, semantic HTML structure. You can see hundreds of web extensions and Tampermonkey user scripts for GitHub that work consistently. I wrote a few of my own.
As a co-founder of Handsontable, I am proud to see it used in this paper. Handsontable is a commercial spreadsheet component; however, it is free for non-commercial purposes such as education, research, study, personal use, testing, and demonstration: https://github.com/handsontable/handsontable/blob/master/han...
Thanks for building Handsontable! It's been essential for quickly prototyping this project and I'm a fan of the API design.
re: scraping, it's true that semantic HTML makes things easier, but we've also been building site adapters for a variety of modern sites that use frontend frameworks, "utility CSS", etc. The most promising solution so far is something I'm calling "AJAX scraping" -- observe the JSON requests made by the client and directly extract structured data from there.
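A minimal sketch of the idea, assuming a script injected into the page that wraps window.fetch (the URL pattern and field names are site-specific guesses, not real Wildcard code):

    // Hypothetical "AJAX scraping": read the site's own API responses
    // instead of parsing an obfuscated DOM.
    const originalFetch = window.fetch.bind(window);
    window.fetch = async (input: RequestInfo | URL, init?: RequestInit) => {
      const response = await originalFetch(input, init);
      const url = typeof input === "string" ? input
                : input instanceof URL ? input.href
                : input.url;
      if (url.includes("/api/v2/search")) {             // site-specific guess
        // Clone so the page's own code can still consume the body.
        response.clone().json().then((data) => {
          const rows = (data.results ?? []).map((r: any) => ({
            name: r.name,                               // field names assumed
            price: r.price,
          }));
          // ...hand rows off to the spreadsheet layer here
        });
      }
      return response;
    };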
I've used Handsontable and will gladly recommend it! The only library that can beat Handsontable is ag-grid, which is in a league of its own and also very expensive.
I know this is front page, but I’m surprised more people aren’t chiming in. Maybe the web used to be this way where you can easily manipulate views to your liking, but as a 20 something I’ve never even envisioned end users crafting their own views of pages. It honestly makes a lot of sense to represent pages how you see fit as a user, and no, inspecting each page and changing the source isn’t practical at all in Web 2.0 div spaghetti. It seems pretty practical to have a spreadsheet formatted UI for your most popular sites.
I've found this dream is not about age; people just think about it differently, and some are jaded, saying we will never get there.
As a 50-something it has been one of my ultimate dreams, but it has proven hard all through my very short history with computers. Letting the user modify their view in a GUI is always a hard task to solve.
The curl trick worked for so long[1]; it's nice to see that you can get a better experience with Wildcard despite today's div/JS spaghetti.
I've been following this lab's work for a while and actually suggested to them that the implementation for this be based on an RDF-style data model. Ontology languages are one level of abstraction up from a spreadsheet and are an atomic unit in semantic web technologies. It looks like the way this would fit into the existing architecture is that the site adapters would extract data as RDF triples.
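Concretely, the adapter would emit subject-predicate-object triples instead of cells, and the spreadsheet view becomes just one projection of that graph; it also makes cross-site joins like the Airbnb + walkability mashup natural. A sketch of the shape (TypeScript; the vocabulary is made up for illustration):

    // A triple is the atomic unit: subject, predicate, object.
    type Triple = [subject: string, predicate: string, object: string | number];

    // One scraped listing as triples; the table is one possible projection.
    const listing: Triple[] = [
      ["listing:123", "rdf:type", "schema:Accommodation"],
      ["listing:123", "schema:name", "Cozy studio near downtown"],
      ["listing:123", "schema:price", 87],
      ["listing:123", "walkscore:score", 92],   // joined from another source
    ];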
Professor Daniel Jackson runs this lab. His book, Design by Concept, is a phenomenal read. It made me understand why software can be so unintuitive for people who haven't grown accustomed to the idiosyncrasies that I've come to internalize.
Please don't bring RDF out of its coffin. It has been tried and failed because it's overly complex and verbose. It's terrible, and technologies which still use it are terrible to interact with to this day.
Interestingly enough, I think the core of RDF (the subject-predicate-object triple) is quite an elegant abstraction for knowledge graph representation.
I do agree that the layered system of different ontology languages present in current semantic web standards is not beginner-friendly, but that doesn't mean they can't be improved on.
I think you might be throwing the baby out with the bath water.
Thanks for the recommendation, I just picked it up from Amazon! I was expecting an expensive textbook, but it's surprisingly cheap: only $6 for a paperback copy.
This looks really excellent, and is the future (meaning, these sorts of worse-is-better tools scraping loosely structured messes into very simple standard structures are the future).
Something that's conceptually related but pretty different is Workbench from the Columbia School of Journalism (although glancing at their page they may be some kind of dumb startup now).
I've said it a few times here on HN that I think the best UX for many web apps (particularly business apps) would be a spreadsheet connected to an API (or better yet, multiple APIs).
Of course most web apps don't expose an API, so here we are.
It reminds me of useful extensions like [1] Honey (auto coupon code finder), which are generalized enough to automatically detect coupon code input fields in eCommerce checkouts that it's never seen before.
"Wildcard" however either needs; AI to detect and classify unknown HTML as rows in a table. OR tonnes and tonnes of integration code (glue code) for all the popular websites used... which seems to be the plan
Yes, you're right! In practice, though, we're finding that many useful customizations can fit into that framework. For example, the Expedia demo in the paper shows a "1 row table" to represent an input form. It's worth thinking about how many different things people use spreadsheets for...
I think another useful analogy for thinking about abstract data representations is text streams in UNIX. It turns out many types of data can be represented as newline-delimited text, which enables you to use a suite of generic tools with that data. Inevitably, some data doesn't fit into that format, but it's perhaps surprising how much does.
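As a toy illustration of the analogy: once everything is "a table of rows," one generic helper serves both a many-row listings table and the one-row form table (TypeScript, hypothetical data):

    type Row = Record<string, string | number>;

    // A generic tool that neither knows nor cares where the table came from.
    const sortBy = (rows: Row[], col: string): Row[] =>
      [...rows].sort((a, b) => (a[col] < b[col] ? -1 : a[col] > b[col] ? 1 : 0));

    const listings: Row[] = [
      { name: "Loft", price: 120 },
      { name: "Studio", price: 87 },
    ];
    const searchForm: Row[] = [{ destination: "Boston", guests: 2 }]; // a "1 row table"

    sortBy(listings, "price");    // works on scraped results...
    sortBy(searchForm, "guests"); // ...and, trivially, on a form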
This is a very relevant engineering critique, but note that this is a research project. The first step is to ask, "what if it worked this way?" After a prototype has been developed to more accurately identify the actual problem being solved, problems like this can be addressed.
I had many interesting conversations in my undergraduate research lab trying to find the right place to draw the line between engineering and research. Problems can be more aptly classified as engineering when there is high consensus on what the actual problem is. Research often addresses what question we should be asking to determine the problem that may then be solved. Most often there is a series of research and engineering iterations intertwined with each other.
My favorite example from the (ecommerce) domain in which I work is https://www.cobby.io/ - I know the team behind it, and while it perfectly solves the problem of product data for shops of a certain size, the live-editable cell idea always sparks conversation about the broader applications. Years later and we still see the genius of spreadsheets.
I've done something similar with Google Sheets and a sprinkling of JS automation. This works well because Google Sheets is pretty good and I can embed Google Sheets in an iframe. A server-to-server POST message sends the (relevant) cells to my running application using a secret key (it's like 5 lines of JS).
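Roughly this shape on the Sheets side, as an installable onEdit trigger in Apps Script (the endpoint and secret are placeholders; note simple triggers can't call UrlFetchApp):

    // Google Apps Script: push edited cells to the running application.
    function onSheetEdit(e: GoogleAppsScript.Events.SheetsOnEdit) {
      UrlFetchApp.fetch("https://my-app.example.com/sheet-hook", { // placeholder
        method: "post",
        contentType: "application/json",
        payload: JSON.stringify({
          range: e.range.getA1Notation(),
          value: e.value,
          secret: "MY_SECRET_KEY", // shared secret the receiving app verifies
        }),
      });
    }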
Why should I invest in Wildcard's API instead of Google's? Am I missing something?
I don't get it. We have custom web frontends because we feel our problems can be solved more efficiently in UIs different from a spreadsheet, don't we?
Granted, not every custom UI is better than its spreadsheet version would be. But that's a different problem.
Otherwise, there are a lot of react datagrid and spreadsheet components to use if you feel that would be the best UI solution for your app.
It’s not really true that we have custom front-ends because they are best for the user. Consider the example of AirBnB in the article and the fact that in 2012 they stopped allowing ranking by price: one has to assume that if a feature has been absent for eight years there’s no intention to reimplement it (presumably because it behooves the company to have it absent). I’m guessing that AirBnB knows/believes that the absence of this feature leads users to choose slightly more expensive properties and this generates more income. The spreadsheet intercept allows the user to regain control.
This is for the end user to use when they need the web app to work differently or to filter data from the web app while it is live. If software devs implemented everything that everyone needed, this wouldn’t be needed, but that is obviously an impossible goal (and even if you could do it, would you want to?)
The goals of the site builder and the end user are unfortunately not always the same. Airbnb doesn't want the user to have too much control (sorting by dollars), Facebook and Google also come to mind (being able to easily filter out ads would be great).
This browser extension is targeted at the end user.
My company, frame.ai, uses a lightweight version of this pattern, and we have gotten a lot of value from it!
One of our products helps teams meet response-time expectations on shared Slack channels. On some teams, the duty schedule of who should respond in these channels evolves in complicated ways - for example, complex business hours and holidays, account managers with backup reps, and so on.
Rather than attempt a one size fits all interface, we expose configuration via an Airtable base that we prepare. Airtable makes it much more convenient to enforce structure and give a nice interface to the configuration - plus an API. Highly recommended.
I like the demos a lot; it is easy to understand the idea from them! How hard was it to write the browser extension, and how well does it work across the different sites?
Obviously the main challenge, as others mentioned, is that not all of the data is present on the frontend. Also, the user cannot permanently change the app, since only the DOM is changed and that is not persisted anywhere, am I right?
But the whole idea of being able to peek "under the hood" of an app and customise/edit it sounds very appealing to me! I am actually working on an open source project that has that aim: to "understand" the web app from within.
But of course for that we had to go with a bottom-up approach, so we are building a DSL for describing how a web app behaves: https://github.com/wasp-lang/wasp
This is essentially the early Filemaker Pro all-in-one desktop app (pre-Apple), maybe even Microsoft Access, but in the browser. I always liked the flexibility of Filemaker, being able to add a new field and pull data into it. I could never find anything comparable. Good to see a revival of the concept.
Spreadsheet-Driven Customization is a great way to enable non-technical users to customize and configure software.
I used that technique for a one-off Java application back in 2011. The Java application did not do any live synchronization with the spreadsheet like Wildcard does. It just read the spreadsheet at application start-up to get configuration data needed to drive the application logic.
Spreadsheet-driven customization allowed the application's users to edit the spreadsheet to grow and maintain the dataset that drove the application logic.
I would not be surprised if others have done something similar before me.
Yup, we're doing the same for a frontend we're currently building. Basically, some of the core features expose warnings (on purpose) to the user once they take an action that they might want to do in a different way. These errors just have error codes assigned to them on the backend side, and the frontend loads the real messages from Airtable on boot, which are then used to show the user-friendly title and description. We're doing the same thing for a couple of other features, and it has really cut down on development time, as the product team can now change the frontend themselves by just editing cells in Airtable instead of creating tasks for the development team.
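The loading side of the pattern is tiny; it looks roughly like this (base ID, table, and field names are placeholders, not our real setup):

    // Load user-facing error copy from Airtable once at boot.
    declare const AIRTABLE_API_KEY: string;   // injected at build time
    interface ErrorCopy { title: string; description: string; }

    async function loadErrorMessages(): Promise<Map<string, ErrorCopy>> {
      const res = await fetch(
        "https://api.airtable.com/v0/appXXXXXXXXXXXXXX/ErrorMessages",
        { headers: { Authorization: `Bearer ${AIRTABLE_API_KEY}` } }
      );
      const { records } = await res.json();
      const byCode = new Map<string, ErrorCopy>();
      for (const r of records) {
        byCode.set(r.fields.code, {
          title: r.fields.title,
          description: r.fields.description,
        });
      }
      return byCode;
    }
    // Later: errorMessages.get(apiError.code) instead of hardcoded copy.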
I think a lot is possible if you could use spreadsheets to create all software (I even have an idea on the back burner to try in that direction).
That being said, $95/month isn't an accessible price point for anyone who isn't trying to integrate this into a business of some sort. That might be fine for your market, but it doesn't work for most hobbyists or people wanting to write personal projects. Heck, Adobe Creative Cloud can be had for half that price.
I realize your market probably doesn't include those people right now, which is a valid business decision. I've got plans to try to make a scriptable web spreadsheet application that allows people to make their own websites with that sort of thing.
That being said, if you had a $5/month or $10/month tier, or an option to self-host without some of the fancier functionality, I'd be all over trying those out.
Some ten years ago there was an online spreadsheet startup called Hypernumbers which, I believe, at one point pivoted to letting people build websites using their spreadsheet. Intriguingly close to this, but without the mashup angle, which is important.
Brilliant! This really resonates for me, as someone who's used to dev tools for centralized state management (a la Redux or especially Mobx) ... thanks for sharing and good luck! :)
Nice, this would be very useful in enterprise/internal-facing apps, where the users are 'power users' who already know and use the app on a day-to-day basis.