The semantic web is not a technical problem, it's an incentive problem.
RSS can be considered a primitive separation of data and UI, yet was killed everywhere. When you hand over your data to the world, you lose all control of it. Monetization becomes impossible and you leave the door wide open for any competitor to destroy you.
That pretty much limits the idea to the "common goods" like Wikipedia and perhaps the academic world.
Even something as silly as a semantic recipe for cooking is controversial. Somebody built a recipe-scraping app and got a massive backlash from food bloggers. Their ad-infested 7,000-word lectures intermixed with a recipe are their business model.
Unfortunately, we have very little common good data, that is free from personal or commercial interests. You can think of a million formats and databases but it won't take off without the right incentives.
> The semantic web is not a technical problem, it's an incentive problem.
True. Demonstrable in the health-care IT world. Think of electronic health records. My personal portable electronic health record would either be a bunch of images of scrawled notes and maybe some nice medical images (= non-semantic web), or it would be in a highly wrought format, I dunno, XML or something, with carefully worked-out schemata for everything from flu-shot records to heart transplants (= semantic web).
Back in 2007-2010, "electronic health records" (EHR) were spottily and sloppily implemented by some providers. But, in the US, a federal law pushed more widespread implementation. Now my online EHR, and yours, is decidedly app-mediated and non-semantic, on a web site portal. Export to JSON? Hah. No.
The hospitals and health care systems only did it because of incentives.
I happened to work at a B2B SaaS company focused on making connections between hospitals and rehab/skilled nursing providers. A rehab outfit can't decide to accept a patient without seeing her medical records and doctors' orders. So our customers had a real incentive to be able to share records. It worked. But the data we had access to (go read about HL7) was not even close to semantic. And our SQL database schemas were, umm, quinquiremes of Nineveh, really intricate, somewhat brittle. Let's leave privacy issues out of the conversation for a moment. Publishing the schema and accepting random queries would help NOBODY except some partner outfit willing to develop and test useful stuff.
Hey, I got an idea! Let's give them an API! Oh, wait, nobody wants to bother with an API? OK, how about a nice web site! And we're back where we started.
With a universal semantic web, the same problems would crop up everywhere.
> Even something as silly as a semantic recipe for cooking is controversial. Somebody built a recipe-scraping app and got a massive backlash from food bloggers. Their ad-infested 7,000-word lectures intermixed with a recipe are their business model.
Taking someone else's content and republishing it without permission isn't cool, even if you wrap it in a nice machine readable format.
I fully agree, and that's one of the problems I was describing. There's very little content free of commercial interests. If this is true, it blocks a lot of potential use cases of a semantic web.
I have a similar idea with PDF documents. Instead of having the royal PITA of parsing generated PDFs (e.g. invoices), things would be much simpler if every generated PDF came with a built-in SQLite or JSON that contains the structured data of that PDF.
One day I will do it.
Speaking more broadly, whether we talk about HTML or PDF it's the same problem: documents should have two representations - human-friendly and machine-friendly until AI gets so good that only having the human-friendly representation is enough.
> ... documents should have two representations - human-friendly and machine-friendly until AI gets so good ...
When I download papers from arXiv I sometimes choose the LaTeX version because it often comes with commented-out ideas that didn't make it into the paper. The author's thought process becomes clear. The metadata helps me understand the whole thing quicker, the same way the semantic web helps the machine.
Perhaps there is a clever philosophical analogy in there somewhere about "us becoming the machine" or "the map becoming the territory", but I can't put my finger on it.
> Perhaps there is a clever philosophical analogy in there somewhere about "us becoming the machine" or "the map becoming the territory", but I can't put my finger on it.
I personally believe the key is in "information architecture". We have been conveying information as a linear sequence of words for so long that we don't know yet how to best exploit non-linear formats.
Programming languages harness the relation between specific instructions and the structure in which they are embedded; but this structure is oriented towards building a single executable block that controls a machine step by step. We have yet to build tools, equivalent to IDEs, that exploit the overall structure of knowledge for the goal of understanding a topic at all levels of detail, both in the local flow of ideas and in the overall relation between its subtopics.
The first widespread step in that direction was the 1.0 static World Wide Web, and we have learnt a lot thanks to it so we can now improve upon it. I have great hopes in online notebooks and no-code spreadsheet-like tools as the basis of such information-processing environments.
I have a similar, "One day I will do it" project. The idea is that somebody cares enough to make the semantic web work--it's just not the people with write access to the data.
I think we can use CTPH (context-triggered piecewise hashing, i.e. fuzzy hashing) algorithms to fingerprint the data independently of whatever names are used, and then we can use that to find representations of the "same data" submitted by other users. Probably there would be some reputation stuff involved, a web of trust, etc. The flow would go something like this:
COGNIZE:
0. Encounter messy data in the wild (has pagination, timestamps of access, etc), need other representation (human/computer/whatever)
1. Calculate CTPH fingerprints, use them to search for link: miss
2. Clean data the hard way and publish canonical representation (ipfs?)
3. Generate missing representation the hard way, publish that too
4. Calculate the fingerprints common to both the missing representation and the canonical representation, and publish it as a "link" between the two. Unlike traditional web links, this one is bidirectional.
RECOGNIZE:
1. Different user encounters "same" data in the wild
2. Calculate fingerprints, use them to search for canonical representation: hit
3. Find further links to see what other representations of the "same data" are available, download them if desired.
The fingerprint stuff works, but there's a lot of work left to be done re: mapping fuzzy hashes of "in the wild" data to cryptographic ones of "canonical representations" and finding ways to incentivize users to go through the hassle of the "cognize" step so that other users can benefit from the "recognize" step.
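To make the fingerprinting step concrete, here's a minimal sketch using the python-ssdeep bindings; the library choice and the similarity threshold are my own assumptions, not part of the flow above:

    import ssdeep  # python-ssdeep: CTPH / fuzzy-hashing bindings (assumed dependency)

    def fingerprint(raw: bytes) -> str:
        # Fuzzy hash: small differences (pagination, access timestamps) only
        # nudge the similarity score instead of changing the hash completely.
        return ssdeep.hash(raw)

    def looks_like_same_data(fp_a: str, fp_b: str, threshold: int = 60) -> bool:
        # ssdeep.compare returns 0..100; the threshold of 60 is a made-up default.
        return ssdeep.compare(fp_a, fp_b) >= threshold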
Sorry for talking your ear off, it just feels good to know I'm not the only one working on something like this, even if our approaches are quite different (mine works on PDFs only because it works on arbitrary bytes). Good luck with yours.
Manual curation was why Berners-Lee's Semantic Web failed. Unlike ARIA annotations, which are required to serve certain clientele, the semantic web offers no inherent value to the developer. There is an entire field of machine learning dedicated to automatically generating knowledge graphs; I would start there rather than trying to manually curate and annotate.
That's a good point. Once the tooling is there for the manual workflow, I expect that an ML-driven approach will plug in to the same hooks without fuss.
But I don't think it changes the main problem to be solved, which is that it's data consumers who want the semantic web, but so far it's been up to data providers to implement it. We need to be able to create links between data without the participation of whoever hosts it.
Hmm, it looks like the use of "target" as a uri means that if the file gets rehosted elsewhere under a different name, the annotations will fail to adhere. I'm going for more of a virus-scanner approach, except I'm looking for the good stuff instead of the bad.
But otherwise yes. I'll have to give that a closer look and use as much of it as I can.
You can use a magnet or ipfs uri or add a triple that specifies other properties of the file that let you find it.
Semantic web uses IRIs in triples. They serve as identifiers, not locators (URL). This is because semantic web talks about more than just files.
The Web Annotation spec talks about web resources which should be dereferencable. That is a naive assumption that makes the approach brittle. But one can annotate 'sourceDate' and 'cached' which indicate the date for the annotated version and a location for a copy.
When you're annotating a resource it's best to also store a copy. The Web Annotations specification allows this.
It's preferable to use the ni:// URI scheme since it is standardized, unlike magnet links. There is a draft standard https://datatracker.ietf.org/doc/html/draft-hallambaker-deca... defining common extensions that enable much the same feature set as magnet.
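For reference, building an RFC 6920-style ni URI is just a sha-256 digest plus unpadded base64url; a quick sketch (the authority-less ni:/// form is one of the layouts the RFC allows):

    import base64, hashlib

    def ni_uri(data: bytes) -> str:
        # RFC 6920: sha-256 digest, base64url encoded, padding stripped.
        digest = base64.urlsafe_b64encode(hashlib.sha256(data).digest()).rstrip(b"=")
        return "ni:///sha-256;" + digest.decode("ascii")

    print(ni_uri(b"Hello World!"))
    # ni:///sha-256;f4OxZX_x_FO5LcGBSKHWXfwtSx-j1ncoSt3SABJtkGk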
> When you're annotating a resource it's best to also store a copy
Do you know if there's any precedent re: people getting upset about that copy? If I annotate an article that was originally served with ads, or behind a paywall, I'll probably not include that stuff in my copy. If people then start linking to my copy instead of the original, I could imagine that feathers would be ruffled.
If you're not permitted to publish a copy, you can keep it private and/or link to a version at an institution, such as archive.org, that is allowed to keep a public copy.
Ah well I guess I'm looking to make freelance librarian into something you can be.
I figure if you configure the client to keep track of whose annotations you have a history of using, you've got a real granular view of which content providers to pay. I'm imagining a game where we all put $5 in at the start of the month and we all pay each other based on whose content we use the most. Some users will pay more than they make, others will make more than they pay.
I think we need to reframe the problem. If we think of it as JSON data that comes with a PDF (similar to what @xmprt suggests "PDFs as checksum") then we have the benefit of machine-readable data that is transportable but also the attached human-readable PDF version of the data.
This is exactly what we are trying to achieve at Anvil.
1. Provide the no-code tools to make it easy to convert existing PDF forms into web forms.
2. Share the web forms with prospective customers instead of PDF forms as email attachments
3. PDFs are generated as part of the workflow once the data is captured and represented in structured JSON.
4. (optional) request certification of the PDF via e-signatures
The end result is a JSON payload that can be shared via API as well as a static PDF that is stored for human consumption. In most cases, we find that our customers actually just use the PDF as an interface with legacy systems (IRS, Banks, Insurance Companies) that haven't yet figured out how to modernize to a data-first business model.
Of course this really only addresses PDFs that are used for information capture and transfer between two parties. But most PDFs that are not "standardized-forms" are made for consumption by humans not by machines (think ebooks, journal articles, graphics etc), and therefore having a JSON payload of the data attached doesn't really matter.
I have an application that converts Word documents to RDF conformant with the SPAR ontologies (mainly DoCO http://www.sparontologies.net/ontologies/doco), so it contains things like headers, numbering, and contains/within relationships explicit in the RDF. I've used it successfully with PDFs by converting to DOCX first. Is this the sort of thing you had in mind? Not here to sell it! I think this is a genuinely interesting, unexplored area.
The PDF format supports attachments (embedded files). I'm thinking about a set of libraries and/or a command-line utility that would make it trivially easy to attach a SQLite|JSON file to a PDF or extract one from a PDF. This won't fix existing files, of course, but at least for those apps that generate PDFs it will be easier to embed a SQLite/JSON into a generated PDF.
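As a rough sketch of the attachment half, using the pypdf library (the file names and the payload here are purely illustrative):

    import json
    from pypdf import PdfReader, PdfWriter  # assumed dependency: pypdf

    payload = {"invoice": "2022-001", "total": 123.45, "currency": "EUR"}  # illustrative data

    reader = PdfReader("invoice.pdf")
    writer = PdfWriter()
    for page in reader.pages:
        writer.add_page(page)

    # Embed the structured data as a PDF attachment (an embedded file stream).
    writer.add_attachment("invoice.json", json.dumps(payload).encode("utf-8"))

    with open("invoice-with-data.pdf", "wb") as f:
        writer.write(f)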
This looks awesome! The decision to combine structural and rhetorical ontologies seems to strike the right balance between cost and availability, hitting the sweet spot of users' actual requirements when working with research and academic documents.
I like this, and of course you could also embed the text of the document. Nothing stops us from doing this right now.
But: don't we need some way to prove that the data matches what is visually rendered in the PDF reader?
And if we can prove that the embedded data matches the rendered document, couldn't that same logic just be used in reverse to generate the structured data from the renderable PDF?
That's not necessary. If you think of the PDF as a checksum then it's possible to have a one way function that generates the PDF (checksum) but that you can't retrieve the original JSON from.
I do really like the idea of having a checksum of some sort if we end up embedding metadata like this.
That's a good idea: the tool that processes the data can just run your function and if the file doesn't match the result then it's rejected.
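Something like this, assuming the publisher's deterministic JSON-to-PDF function is available to the validator (render below is a stand-in for that hypothetical function):

    import hashlib

    def verify(pdf_bytes: bytes, data: dict, render) -> bool:
        # Re-run the one-way function on the embedded data and reject the file
        # if it doesn't reproduce the PDF byte-for-byte.
        return hashlib.sha256(render(data)).digest() == hashlib.sha256(pdf_bytes).digest()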
But in the real world people are going to want to annotate the PDFs, there will be open-and-save cycles that add metadata and break the checksum, etc and so on, even without considering malicious actors. Restricting all that is maybe easy -- just reject anything that doesn't match the checksum, done -- but communicating that restriction to the users without making a mess of it is probably hard.
Long ago I wrote some PDF generating programs and it was a lot of fun, but the spec has evolved in ways I imagine would make it less fun today. Still, could be a cool thing, and I'd be surprised if someone hasn't already done a version of it somewhere.
[Edit: plus, whoever creates the one-way function is deciding what all the PDFs are going to look like, which means you will end up with many such functions to accommodate the different rendering goals, and then each validator needs to know each one and someone decides which ones to trust, and so on...]
If you want to start with the most controllable representation of a piece of paper, consider that OpenOffice (word processing and drawing) embeds its structured file format into PDFs. Maybe those are the PDFs we scrape first, leaving the bag-of-jpegs for later.
SQLite is a relational database. Wouldn't it be a better fit to use a graph-database as the backend for anything "web"?
The idea is good, that a web page should be generated from some data somewhere. But the "web" is not so much about a single document as about the links between documents, which is what lets you represent a "semantic net". The data should be about the links between them. Now where is such a database? And how can it be "sharded" into multiple databases running in thousands of locations on the internet?
You can reasonably model graphs in a column-oriented database, but traditional SQL models tend to be horrible performance-wise, because most graph algorithms need fast traversal of edges, and doing a lot of recursive lookups in SQL is impractically slow. It's not that you can't model it, it's just that performance is terrible. For a graph database to be efficient, you need a high degree of locality for edge information (ideally a vector you can simply read out).
(Note: I've actually written a graph database from scratch, for exactly these reasons.)
Actually a few RDBMSes (Oracle, Postgres, MS SQL) have graph extensions. I've never used any of them, but I assume they work around some of the basic unsuitability of traditional tables for storing graphs.
Modelling transitive edges can be done in SQLite with recursive common table expressions. The performance for that is probably lower than for graph databases, but SQLite has many other advantages over graph databases.
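For example, transitive reachability over an edge table is a few lines of SQLite (a toy sketch; the table and column names are made up):

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE edges (src INTEGER, dst INTEGER);
        INSERT INTO edges VALUES (1, 2), (2, 3), (3, 4);
    """)

    # Everything reachable from node 1, via a recursive common table expression.
    reachable = conn.execute("""
        WITH RECURSIVE reach(node) AS (
            VALUES (1)
            UNION
            SELECT e.dst FROM edges e JOIN reach r ON e.src = r.node
        )
        SELECT node FROM reach
    """).fetchall()
    print(reachable)  # [(1,), (2,), (3,), (4,)]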
Uhm, that's basically what I said? Aside from the "other advantages" -- which, yeah, sure. Socket sets have other advantages than hammers, but they're not very good at pounding in nails. You can do it, but it'll suck at it.
Also, in this case, "probably less" is "multiple orders of magnitude slower".
Why do you keep making the same argument again? Yes, SQLite has the parts needed to build a (very poor) graph DB. But it will never be as performant as something dedicated, because the data structures used in SQLite have terrible characteristics for the algorithms used in graph operations. I like SQLite a lot, but it targets a niche, and graph operations are simply not that niche.
Using SQLite for the semantic web is the topic of the post. And while SQLite is not suited for billions of triples, it can be adequate for small graphs. The advantage is that SQLite is widely deployed.
That's not the same as just being able to access those via an API; locality in this case means that you shouldn't need extra seeks for every single value, nor have to make a bunch of round trips through SQL.
That is the primary difference between traditional relational databases and column-oriented databases. Normal relational databases have row-based locality; column-oriented databases have column-based locality.
If your typical query visits all adjacent edges of a few nodes then the locality you're talking about is great. If your queries typically aggregate over a few types of edges across many nodes then this locality is the worst case.
So the question is, do we actually want to run graph algorithms or is the data graph structured for other reasons? You're implying that choosing a graph representation means we want to perform graph analyses. I disagree with that if we're still talking about the semantic web.
RDF is a general purpose knowledge representation model. The triple structure lends itself well to combining data from different sources with little coordination. It happens to form a graph, but running graph algorithms is just one of many special purpose problems.
Yes, you're mostly correct. (The one caveat being that if you know your access patterns in advance, you can choose if you store the edge type along with the edge, or if you store separate edge types in separate columns.) I'm not even arguing against storing RDF in SQLite. There are times that would make sense, and times that it wouldn't.
I'm primarily replying to:
> What is a graph database? A miserable little pile of joins.
> Though to be serious: what do you expect a graph database to provide that sqlite cannot / does not do efficiently?
That seemed like a general question of "What are graph databases for, and why would someone use them?" And I'm trying to answer that question.
It was intended as both a "what significant benefits would it bring for this problem" (for implied exposed DBs with relatively small datasets... though "relatively" is relative of course), and "I don't see how using SQLite would somehow prevent this from happening".
If there was some kind of "normal" query on thousands-to-millions of items that was prohibitively terrible on SQLite but not on graph-database-X, yea - I'm interested :) And I totally buy that graph DBs are better at graph queries in general. I just have yet to hit these kinds of limits in my use of SQLite (a fair number of instances with tens of gigabytes, a few with billions of rows) - a sprinkling of reasonable database design addresses almost all issues.
The main one I can see is that, with longer-term use, SQLite's lack of any way to force locality would be fairly crippling. You'd need to make a reasonable sort order and periodically re-insert data in that order to optimize / vacuum. That's... technically achievable, but is a big downside compared to something that can dynamically organize / optimize it based on [some heuristic].
SQLite does indeed not have an array type for columns. The overhead of (de)serializing binary or JSON values would have to be offset against the advantage of cache coherence. Or one could implement a virtual table that stores arrays of integers.
A whole pile of joins that naturally arises if you want to combine data from a ton of domains.
Triple stores are essentially relational databases in 6th normal form. But relational databases like SQLite don't have good join algorithms to deal with this (they do pairwise joins instead of worst-case-optimal ones like Leapfrog Triejoin or Tetris). They also lack good interfaces for so many joins; you want something more declarative like Datalog/SPARQL/GraphQL rather than explicitly writing out every join.
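To illustrate, with a triple table in SQLite even a two-hop question already needs a self-join per predicate (a toy sketch; the predicates are invented):

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE triples (s TEXT, p TEXT, o TEXT);
        CREATE INDEX spo ON triples (s, p, o);
        CREATE INDEX pos ON triples (p, o, s);
    """)

    # "People who work at something located in Berlin": one self-join per hop.
    rows = conn.execute("""
        SELECT t1.s
        FROM triples t1
        JOIN triples t2 ON t2.s = t1.o
        WHERE t1.p = 'worksAt'
          AND t2.p = 'locatedIn' AND t2.o = 'Berlin'
    """).fetchall()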
Treating links as an entity. In an RDBMS, it is not possible to map a bidirectional relation semantically. It needs to have a direction (you can define it in both directions, but then you have 2 relations). You then also need to duplicate link properties (in the join table) or normalize to yet another join table.
That has been my pet peeve for a while, and can make it hard to define navigation.
Totally not a deal-breaker though. I'd still use sqlite because I ### love it :)
Custom indexes or insert triggers can both guarantee that, in a declarative way (in that they prevent normal "blind" inserts; I'm not entirely sure what you mean by declarative here, tbh).
Also I have no idea why a separate table is a negative / constraint of some kind. It's a natural way to model it.
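For what it's worth, a separate table with a canonical ordering handles the directionless case without duplicating rows or link properties (a minimal sketch; the table name is made up):

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        -- One row per undirected link; the CHECK keeps a canonical order so the
        -- same pair can't appear twice as (x, y) and (y, x).
        CREATE TABLE friendship (
            a INTEGER NOT NULL,
            b INTEGER NOT NULL,
            since TEXT,                -- link property stored once, on the row
            CHECK (a < b),
            UNIQUE (a, b)
        );
    """)

    def befriend(x, y, since=None):
        lo, hi = sorted((x, y))
        conn.execute("INSERT INTO friendship (a, b, since) VALUES (?, ?, ?)", (lo, hi, since))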
When you are following a chain of relations (a vector), you end up losing a lot of performance.
> not entirely sure what you mean by declarative
Guaranteed by the database semantics, not imperative code even when it runs on the DB.
I don't want to come off sounding like I think it's a bad idea to map graphs on RDBMSs, it's just inconvenient sometimes. Usually not enough to complicate your setup, adding another database, but rarely it is.
I'm not sure though about the suggested use case here, I really don't know it enough to make a suggestion. In such cases I always pick sqlite or postgres (depending on the client model) though. You can't be too wrong with those, and you are most of the times right :)
Not to mention the crazy extensibility of pg and the stuff you can do with it.
Imperative code run on the DB (i.e. triggers) seems entirely in-scope. It's an incredibly powerful tool provided natively by the DB, intended to be used for exactly this kind of custom constraint. Whether it's begin/end with imperative code or modeled some other way declarative seems absolutely, 100% irrelevant to me - it's the perfect implementation detail, it could have been done with `create unique index max 2 (x, y) on mytable` and you'd never know.
If you meant to prevent "all inserts are now [run this stored procedure / multi-statement operation / transaction with N steps]", then yeah - I agree that runs counter to "normal" use in most cases. It's certainly dramatically more error prone. Triggers don't need this though.
re chain of relations: I haven't poked too hard in this direction in SQLite, but I do totally believe its query planner isn't too sophisticated. And though you can get pretty good data locality with fancy design around the DB, you certainly won't get good locality just by inserting a bunch of data and running queries.
A directionless link or relation does not require exactly 2 rows here; that's overconstraining the definition.
A simple example would be linking 3 brothers with the same relation: the worst solution would be dividing it into 3 different binary relations; the reality is that each brother is part of the "brother relation" set by way of sharing parents.
Brothers yes. But what about children? child_of is a directional relation, right?
The semantic web, I assume, is full of such directional relations. Bidirectional relations are the exception. Things like "causes", "indicates", "enables", "prevents", ...
Well, that is what I actually use most of the time, but to nitpick, it still is not declarative, and semantically a and b have no meaning. Just to be clear, I'm saying these are in 99% of the cases not deal breakers but they can be if all you want to do is map graphs (and follow chains through many members), as there is also a significant performance cost to all this abstraction.
What you described is, IIRC, also what some file-based graph DBs do in the background with sqlite and abstract away from you (I've seen a couple custom ones in some projects). Left with a RDBMS, you just need to do it yourself and analyze how much it costs.
It would be interesting to see it implemented on an object database like Realm, rather than on being based on SQL. Seems like it would be a much better fit.
The thing it's kinda missing for me is the ability to compose multiple SQLite databases, possibly provided by different domains.
It'd be nice to join together different public datasets. In a weird personal example, if Strava exposed SQLite, I'd love to do a join to weather.com and see when the last time I biked in the rain was.
It'd be cool if one half of some table was at foo.com and I could add a few rows to it on my bar.com domain, and then the combined dataset was queryable as a single unit.
So one option is to download the database you want to join against and run the joins locally. Datasette encourages making the raw SQLite database file available, so if it's less than about 100MB this may be a good way to do it.
If you're willing to do the joins in your client-side code, Datasette's default JSON API can help. You can write an application (including a client-side JavaScript application) which fetches and combines data from multiple different Datasette JSON instances by hitting their APIs.
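A rough sketch of that client-side join, assuming two hypothetical Datasette instances and the ?_shape=array JSON shape (the URLs and column names here are invented):

    import json
    import urllib.request

    def rows(url):
        # Datasette's ?_shape=array returns a plain JSON list of row objects.
        with urllib.request.urlopen(url) as resp:
            return json.load(resp)

    rides = rows("https://rides.example.com/strava/rides.json?_shape=array")
    weather = {w["date"]: w for w in rows("https://weather.example.com/daily/observations.json?_shape=array")}

    # "When did I last bike in the rain?" -- join on date in client code.
    rainy = [r for r in rides if weather.get(r["date"], {}).get("precip_mm", 0) > 0]
    print(max(rainy, key=lambda r: r["date"]) if rainy else "never, apparently")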
My last idea is the most out-of-left-field: since Datasette lets you define custom SQL functions using Python code, it would be feasible to create a Python function which itself makes a query via the JSON API against another Datasette instance! You could then use that to simulate joins in SQL queries that you run against a single Datasette instance.
I've not built a prototype of this yet, and to be honest I think combining data fetched from multiple JSON APIs (which is possible today) will provide just-as-good results, but it's an interesting potential option.
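If anyone wants to experiment, a sketch of that last idea as a plugin could look something like this, using the prepare_connection() plugin hook (the function name and the blocking urllib call are just for illustration; I haven't built this):

    import json
    import urllib.parse
    import urllib.request

    from datasette import hookimpl

    def remote_first(base_json_url, sql):
        # Run a read-only SQL query against another Datasette instance via its
        # JSON API and return the first value of the first row (or None).
        url = base_json_url + "?" + urllib.parse.urlencode({"sql": sql, "_shape": "arrayfirst"})
        with urllib.request.urlopen(url) as resp:
            values = json.load(resp)
        return values[0] if values else None

    @hookimpl
    def prepare_connection(conn):
        # SQL queries on this instance can then call, e.g.:
        # select remote_first('https://other.example.com/db.json', 'select count(*) from plants')
        conn.create_function("remote_first", 2, remote_first)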
What is the difference between this idea and exposing a read-only MongoDB (JSON in, JSON out) via an HTTP endpoint? In the end, 50 ranged HTTP requests are not great on an unstable connection; 1 request is way better, and you need a server anyway.
Okay, offline first, what does that mean? Should I download the entire 600 MB SQLite database? Should I do it every time it changes? Who will pay for the bandwidth? We cannot employ standard HTTP proxy and caching mechanisms here; they are not meant for 600 MB files.
Making it available via a static file host dramatically lowers the barrier. If you have an interesting dataset, you can even throw it on github pages and pay nothing; that likely is not true for your mongo db server.
A typical request using indexes will be less than 10 separate 1KB GET requests, not 50. But yeah, more work needs to be done on performance.
Whether it makes sense to fully download the dataset depends on the project; maybe it does not. But it doesn't have to be a monolithic file. You can use SQLite's multiplex VFS to split the SQLite file into many smaller pieces (and still update the db later!).
I think this can be carefully thought through and become something interesting. I don't fancy SQL, to be honest; despite having "structured" in its name, it is too chaotic for the XXI century. Rows and columns!
Schema-enforced document databases, on the other hand, are neat and mostly people- and machine-readable at the same time.
Another idea: downloading only indexes may greatly reduce number of requests needed to query the data.
That’s what those tiny 1KB requests are doing mostly, downloading indexes. With http pipelining taken into account, it’s way faster than trying to preload the full index data.
Despite the number of requests seeming excessive to us, the performance of this setup is already in the ballpark of your usual underpowered MySQL going through an app server.
Terrible idea. Why would anyone want to deal with interfacing a bunch of randomly structured databases whose tables can change at any time without warning? Nightmare.
Yes, it's still terrible for the consumer of the data. But that's not why I like it.
The positive thing I feel when reading about this is that it dramatically lowers the barrier for the producer of the data to expose it in a meaningful way. Previously it was necessary to think about the format and write code to expose the data, while now it's possible to just throw the data over a wall.
You could use a framework to automate the first thing, but this would be specific to one programming language, while the second approach works with all languages. So it lowers the total effort to get to the goal, effectively side-stepping the "have to implement framework or serialization code" issue.
Warning: heavy speculation below
So if more people would build sites using this technique, the pressure for better tools (at a higher level than right now) for consumers would increase, so these would be built by someone. As you have a proper standard (there is only one SQLite) you would have a new "ecosystem" growing. This would lower the pain for the consumers of said data. You'd still have to implement it in every programming language that wants to access the data, but this is another problem.
As opposed to a bunch of websites serving an archaic, poorly formatted blob of text, the "correct" parsing of which has now become _so complicated_ that it's basically infeasible for anyone not willing to build a whole web browser?
> Terrible idea. Why would anyone want to deal with interfacing a bunch of randomly structured databases whose tables can change at any time without warning? Nightmare.
It isn't quite so bad. You can have wiki-esque volunteer-driven cooperative authoring, linking to known good versions, etc. to keep it from becoming a complete free-for-all.
I disagree, the alternative is either no access to website data or access through a tightly controlled API (which can come with the same problems if API compatibility is not guaranteed).
Because the backend data is exposed to the world, with all its original semantic structure (relational model) intact, before it is flattened into a document view.
The real world is messy: some companies have different key-value pairs on the same kind of document (invoice, purchase order, utility bill, etc.). I counted 20K different keys, some semantically synonymous, in a few thousand invoices. Even the table part can have different columns. What do you do when schemas don't quite match?
I agree. Semantic data would mean that others can easily understand the meaning of the data. In the semantic web this would be done using ontologies, which define the types of things and the relations between those types of things. But just having your schema visible doesn't mean anyone understands it straight away; they would still need to make an effort to understand the schema of that specific application. And the schema is probably unique to that application. The end result is pretty much the same as, for example, exposing your database as a GraphQL endpoint. Take "The Graph", a web3 project exposing data of many different blockchain projects as GraphQL endpoints. It's nice, but I still need to make an effort to understand the meaning of each property in each endpoint. And a "transaction" in one project is not linked to the meaning of a "transaction" in another. A bit off topic, but ironically I therefore don't find the name 'The Graph' to be all that accurate.
Case in point: YES to at least remembering that web3 is (also) the Semantic Web. But no, this solution is not semantic data.
Among the 30-odd technologies that make up the Semantic Web[1] (it never died, it's just a collection of tech, lots of organizations use it daily) are graph databases[2]. Graph databases are necessary to implement semantic web databases.
SQLite is not a graph database. Even if you used SQLite to implement a graph database, it would not solve any significant problems of the semantic web, such as access to data, taxonomies, ontologies, lexicons, tagging, user interfaces to semantic data management, etc.
It's a really odd suggestion that you would just copy around a database or leave it on the internet for people to copy from. For the BBS mentioned here, that might actually be illegal, as it might contain PII, and on other sites possibly PHI. Many countries now have laws that require user data to remain in-country. Besides the challenges of just organizing data semantically, there still needs to be work done on data security controls to prevent leaking sensitive information.
The funny thing is, that isn't even hard to do with the semantic web. You classify the data that needs protecting and build functions and queries to match. You can tie that data to a unique ID so that people can "own" their data wherever it goes, and sign it with a user's digital certificate which can also expire.
But all of that (afaik) doesn't exist yet. Everyone is more concerned with blockchains and SQL, either because the fancy new tech is sexier, or the old boring tech doesn't require any work to implement. The Semantic Web never caught on because it's really fucking hard to get right. No companies are investing in making it easier. Maybe in 20 years somebody will get bored enough over a holiday to make a simple website creation tool that implicitly creates semantic web sites that are easy to reason about. It'll probably be a WordPress plugin.
> Graph databases are necessary to implement semantic web databases.
This just isn't true, on multiple levels. RDF is an interoperability standard that does not per se depend on a 'graph-like' data model - you can very much expose plain old relational data via RDF, and this is quite intended. Additionally, modern general-purpose RDBMS's support graph-focused data models quite well, despite being built on 'relational' principles - there's no need for special tech when working with general-purpose graph models, unless you're doing some sort of heavy-duty network analytics.
You're talking about extending a database design created 50 years ago to work with models and methods that involve significantly different operations and concepts. Let the RDBMS die so we can make something that is much more powerful and requires less fidgeting and squinting to work the way we want.
RDBMS were a niche research project for a decade before they started to catch on in business apps. They've stayed around forever because they're just functional enough to be dangerous. But we've already hit the upper limits of both reliability and performance years ago (remember NoSQL?) and we just keep bolting on features because nobody wants to leave them. The old designs and implementations are holding us back.
You and I work on very different kinds of projects. I find myself fidgeting and squinting at the database far more when it fails to enforce a schema or provide ACID transactions.
RDF is a labeled multigraph data model with URI-based predicates as edge labels, where each triple represents an edge. You are right that relational data can be exposed in RDF, just like CSV can be loaded into a graph DB.
> Graph databases are necessary to implement semantic web databases.
The online docs (and TBL himself) rarely mention graph databases, but obviously the idea is tied tightly to RDF. Separating it from that implementation detail is part of the point, though. Getting people to represent their data via an additional format was never going to work.
> For the BBS mentioned here, that might actually be illegal, as it might contain PII
Can't imagine the purpose you had in even making this point. In theory, any arbitrary database exposed publicly could be illegal to replicate due to copyright, PII laws, etc. But that has nothing at all to do with a technical discussion of a technique for exposing data. What a bizarre point to make.
As an aside, I'm glad you removed the "Uh........." from the beginning of your post. We're all making an effort to reduce the typical HN snark in the comments, and there's always room for improvement :D
SQLite could actually make a really good basis for building a graph database, thanks to "Many Small Queries Are Efficient In SQLite": https://www.sqlite.org/np1queryprob.html
I took advantage of that for my datasette-graphql plugin - it's not a graph database, but it does allow deeply nested graph-like queries that take advantage of SQLite's fast small query performance: https://datasette.io/plugins/datasette-graphql
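The pattern that makes this work is the one the np1queryprob page describes: issuing one tiny query per node is fine when a "query" is an in-process function call rather than a network round trip. A toy sketch (the table name is invented):

    import sqlite3

    conn = sqlite3.connect("graph.db")

    def children(node_id):
        # One small indexed query per node; cheap because there is no round trip.
        return [row[0] for row in conn.execute(
            "SELECT child FROM edges WHERE parent = ?", (node_id,))]

    def walk(node_id, depth=0):
        print("  " * depth + str(node_id))
        for child in children(node_id):
            walk(child, depth + 1)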
The semantic web failed to become widely popular because:
1. Graph databases on top of triple stores are a lot less scalable than relational databases or key-value stores, and this is how semantic data is meant to be stored/queried.
2. Data is valuable. Handing out data for free in a machine-consumable way is both expensive (machines can request data much more quickly than a human) and a recipe for copycats. The incentives just aren't there.
TBL's Solid project is about trying to separate semantic data providers from the presentation layer and opening up the possibility of payment from these data providers to try to improve the incentives around semantic data sharing.
> The Semantic Web never caught on because it's really fucking hard to get right. No companies are investing in making it easier.
I really appreciate this point. I had the opportunity to work on an exploratory project with an experienced ontologist (yes, you really need one of those, I think). The tools were fascinating (reasoners quickly became necessary) but I had the feeling that many of these tools were at a comparatively early stage of maturity.
Trying to explain to people how the system would work was a challenge as it required a primer on theory and application -- we glazed many eyes. The CTO wanted to know if we could use blockchain somehow. Another group addressed a slice of the problem with technologies already in use and that decided the matter.
Ouch. Most uses of reasoners/inference are quite computationally-intensive, to the point of making "reasoning" quickly infeasible. But if you really want, you can do all this stuff in traditional databases by defining appropriate 'views' and having your application query them. You could even use custom database triggers to enable inserts/updates on views.
Really depends on the reasoner in use. I'd really take most current public benchmarks on reasoner performance with a grain of salt, as most implementations out there are mostly academic-grade non-production systems.
E.g. Stardog does most of their reasoning via query rewriting (and also lean on some restrictions). That way you can leverage DBs to do what they are good at. If you can then on top of that build some clever caching or incremental computation, you should be fine for even pretty huge dataset sizes.
Well, then again, the original idea is taking off with https://solidproject.org/ - Tim Berners-Lee's Inrupt has millions of pods set to go online starting this spring.
Yes, the entire country of Flanders is getting pods for every citizen in March. Then there are patient pods for UK NHS, and then pods for BBC content users...
SQL is not better than XML or JSON for representing data. They are all mappings of much richer data structures onto a limited data model. But even setting aside these problems, there are some problems with a distributed semantic web that are barely ever mentioned: the step of going from data to 'semantic' facts, how to deal with identifying sources, and versioning/updates. I think it is very important to record who (person or institution) is the source of a certain fact or of the 'linking' of facts between multiple sources. Cryptographic keys, just as in blockchains, could help to link data of distributed sources such that it is possible to verify the source of a fact against sources/authorities and correct errors or deal with updates in case they occur.
There's one exception to the equivalence of SQL on the one hand and XML or JSON on the other. The point of SQL (and other DBMS paradigms) is to give access to data that's orders of magnitude bigger than the RAM in which the app runs. That has stayed true for at least a quarter century, during which RAM and database sizes both did the Moore's-law exponential expansion.
A relational database maps everything to unordered relations. Representing or manipulating a tree-like structure is complex. Just representing an ordered list is complex. In XML and JSON everything is ordered, and querying it as a relational database is cumbersome. Graph databases and OO databases are somewhere in the middle.
But as I wanted to point out, which data models are used, is not the major obstacle to the semantic web. It is these other problems that are not addressed.
Wikidata is interesting. But it is a centralized approach. Is there an interface which gives the full breakdown of sources and where you can, through a chain of certificates (like those used for ssh and https), verify the sources?
Wikidata gives you a "full breakdown" of machine-accessible external identifiers for any entity. So, if you know how to query an external source, you can use it to verify any claim you care about.
This may be more likely to happen if there was a compromise between the two: query the database, but maintain a database that is queryable using SPARQL and can export to TTL files. Then the linked data revolution can continue and we don't have to maintain finnicky webpages but rather a relatively static database.
One problem is that it's the one hosting the data that pays for the bandwidth. Yes, when you download a video from YouTube, Google has to pay your ISP! (Google will strong-arm peering agreements, though, but that doesn't take away from my point.)
Someone has to pay for the infrastructure. Right now the host (not the consumer) pays for the infrastructure. So there isn't really any incentive to host data for free - it's like a kiosk offering free goods and services.
The problem is that if you got paid for sending stuff, everyone would be spamming data everywhere. Imagine if you had to pay 10c every time someone sent you an e-mail.
Something that I think would help is micro-transactions, with support built into browsers so that you could easily make a micro-transaction.
We already have Bitcoin and other cryptocurrencies, but they are too big to run inside the browser of a mobile phone; and if it weren't for the high transaction costs, the ledger/blockchain would be even bigger...
Today publishers - those that publish stuff on the web - earn money by showing ads. And ads initially worked very well for a few years around 2000, before people started cheating with bots.
But you can still make individual deals with webmasters and choose to trust them.
Also a lot of "the web" has moved to videos and Youtube. The average web user choose to watch a video rather then reading a text article covering the topic of interest.
This is the standard ISP argument that I never understood. Google pays their ISP and any other ISPs they are using for peering, but they don't HAVE to pay your ISP. You pay for your usage.
Is there really that much web safely exposable data in sqlite for this to make sense? I'm not really seeing how this is obviously better than the metadata ideas that preceded it.
Some: weather, ratings, topography, dictionaries and encyclopedias, sports scores, market prices, some other stuff. All public knowledge, but not necessarily publicly available (easily) in raw form.
I get that, but his argument is that you could do it with no additional work.
"The beauty of this technique is that you are already using SQLite because it's such a powerful database; with no additional work, you can throw it on a static file server and others can easily query it over HTTP."
I doubt very many of those already use SQLite. As soon as the additional work is added, it's probably easier to just expose it as XML or JSON.
I do not understand how that would work. The clients don't have any way to synchronize with server changes, so they can read data in an inconsistent state.
I think the way you'd have to do it is to effectively publish new database versions to their own path. Symlinked as much as possible, so they can back onto the same database, but you'd do something like this:
http://host/db/1.1.0 is what you create when you add a new column. It's backwards compatible with /1.0.*, so you can either leave those paths working or you can redirect them to /1.1.0, depending on what guarantees you want to be making to your clients.
http://host/db/2.0.0 is the version with an old column deleted, but you'd want to check in your access logs that nobody was still requesting any 1.0 version before publishing it. Either way, when this gets published you want to stop serving the /1.0.* path because 1.0 and 2.0 now can't come from the same backing file. But 1.1 and 2.0 can come from the same file if you've given everyone time to stop using the deleted column.
It's not a great scheme, but it does give you a way to get new client connections onto the right version.
For clients that have a session which lives across database upgrades, I think what you'd need is a `schema_version` table the client could check however often makes sense, and let themselves get reset if they find there's a new version available.
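The client side of that could be as small as this (the table and column names are whatever you standardize on; these are placeholders):

    import sqlite3

    EXPECTED_SCHEMA = "2.0.0"

    def check_schema(conn: sqlite3.Connection) -> None:
        (version,) = conn.execute("SELECT version FROM schema_version").fetchone()
        if version != EXPECTED_SCHEMA:
            # Tear down and re-sync against the newly published path.
            raise RuntimeError(f"server schema is now {version}; reset and reconnect")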
What do you mean "works fine"? AFAIK to read consistent data you need to be reading from a database snapshot. So with every data update on the server you need to make a new snapshot (to include new data), and publish it under new version (otherwise clients might be in the middle of some query reading data from the old snapshot, and combine metadata from the old snapshot with data from the new one).
One of us does not understand how that works. According to me you need to acquire read/write locks in order to ensure consistency in presence of a writer. You can't do that by reading ranges of a static file with independent read requests.
if "exposing and parsing metadata" never took off as the meaning for web 3.0, by the author's own admission, why try to resurrect it with this title? clickbait?
we love web3 labeled articles here
fortunately the more popular variant of web 3.0 doesn't even need the developer to make a database or anything on the backend. just frontend development, and deploying code once to the nearest node. frontend is optional depending on your userbase.
API Platform is a popular and easy to use semantic web framework:
1. You design your data model as a set of PHP classes, or you generate the class from any RDF vocabulary such as Schema.org
2. API Platform uses the classes to expose a JSON-LD API with all the typical features (sorting, filtering, pagination…)
3. You use the provided "smart clients" to build a dynamic admin interface or to scaffold Next, Nuxt or React Native apps (these tools rely on the Hydra API description vocabulary, and work with any Hydra-enabled API)
In addition to RDF/JSON-LD/Hydra, API Platform also supports ActivityPub.
Aren't you the same Dunglas that is head of this project?
Great idea, by the way! I had the pleasure of working with API Platform in some Symfony applications at a past gig. I can vouch it's easy enough to use, but the GraphQL integration (at least at that time) was really slow. I have not found PHP to be the ideal runtime for GraphQL.
Isn't this Web 1.0 instead? You are only reading data - yeah, okay, with SQL - but you still can't modify it. And there are already very good standards like RDF, OWL 2, and SPARQL, which are more expressive than SQL for consuming the info.
The server is a source of data, its filesystem the database, and the client has to make sense of it. There is no API but GET requests. Works wonders for all but big data queries, naturally.
So you publish raw data (TimBL, you want it that way) plus a recipe for a visual representation and the browser shows a sensible view to begin with.
Well yes and no. I can see this working in theory, but in reality semantic means standardised as much as it means accessible.
In a world where my blog-post object has the same information as your blog-post object, this works without a problem.
In a world where I actually want to open up my database to you, we could agree on a format.
Both of these cases, from where I stand, seem very unlikely, and we have not even talked about the people who would clone your data 1:1 just to host an ad-filled alternative of your site in real time.
I've been exploring the idea of using SQLite to publish data online via my Datasette project for a few years now: https://datasette.io/
Similar to the OP, one of the things I've realized is that while the dream of getting everyone to use the exact same standards for their data has proved almost impossible to achieve, having a SQL-powered API actually provides a really useful alternative.
The great thing about SQL APIs is that you can use them to alter the shape of the data you are querying.
Let's say there's a database with power plants in it. You need them as "name, lat, lng" - but the database you are querying has "latitude" and "longitude" columns.
If you can query it with a SQL query, you can do this:
select name, latitude as lat, longitude as lng from [global-power-plants]
But what if you need some other format, like Atom or ICS or RDF?
Datasette supports plugins which let you do that. I'm running the https://datasette.io/plugins/datasette-atom datasette-atom plugin on this other site. That plugin lets you define atom feeds using a SQL query like this one:
select
issues.updated_at as atom_updated,
issues.id as atom_id,
issues.title as atom_title,
issues.body as atom_content,
repos.html_url || '/issues/' || number as atom_link
from
issues join repos on issues.repo = repos.id
order by
issues.updated_at desc
limit
30
The plugin notices that columns with those names are returned, and adds a link to the .atom feed. Here's that URL - you can subscribe to that in your feed reader to get a feed of new GitHub issues across all of the projects I'm tracking in that Datasette instance: https://github-to-sqlite.dogsheep.net/github.atom?sql=select...
As you can see, there's a LOT of power in being able to use SQL as an API language to reshape data into the format that you need to consume.
I also have a project exploring this alternative way of peer communication, but I have a different answer to this: I think it's better as a network of peers that expose APIs.
It's badly documented as I have just published it to GitHub, but I hope it gives a clue of how it is supposed to work.
I'm putting the final touches on this project, and the main concept is already working, as is 90% of it. But I think exposing SQL is too raw, and it may not offer the whole picture. For instance, what matters is sometimes not data but pure computation. E.g. suppose you offer deep-learning inference where you receive and give back tensors; in the middle of it is a different sort of computation that doesn't have anything to do with databases.
Or suppose you need to access something in a third party before giving an answer, or you want to do it in a distributed fashion without your API consumer even noticing.
APIs are a good answer to that, and in my opinion are superior interfaces. Whatever the semantic web of the future will be, it will need this network of API peers as a floor to build on.
For instance, you can design a graph API on top of it. Exposing your data layer directly is bad engineering, as there are a lot of problems you won't be able to solve that way, whereas letting clients talk to "you" over a well-defined API will.
To put it simply, in my view the direction the semantic web is pointing in is cool, but the answer is not the right one. This idea of exposing SQLite directly, while cooler, has the same flaws; otherwise something like GraphQL would have taken over the world, since it's not much of a different answer than the one presented here.
I've thought a bit about the problem of exposing your underlying database - that's obviously a problem for creating a stable API, because it means you may be unable to change your internal database schema without breaking all of your existing API clients!
With Datasette, my solution is to specifically publish the subset of your data in the schema that you think is suitable for exposing to the outside world. You might have an internal PostgreSQL database, then use my db-to-sqlite tool - https://datasette.io/tools/db-to-sqlite - to extract just a small portion of that into a SQLite database which you periodically publish using Datasette.
The other idea I have is to use views. Imagine having a PostgreSQL database with a couple of documented SQL views that you expose to the outside world. Now you can change your schema any time you like, provided you then update the definition of those views to expose the same shape of data that your external, documented API requires.
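A tiny sketch of that views idea in SQLite terms (the names are invented; the same pattern works in PostgreSQL):

    import sqlite3

    conn = sqlite3.connect("published.db")
    conn.executescript("""
        -- Internal table: free to change shape at any time.
        CREATE TABLE IF NOT EXISTS plants_raw (
            plant_name TEXT, latitude REAL, longitude REAL, internal_notes TEXT
        );

        -- Documented, stable API surface: only this view is promised to clients.
        CREATE VIEW IF NOT EXISTS power_plants_v1 AS
            SELECT plant_name AS name, latitude AS lat, longitude AS lng
            FROM plants_raw;
    """)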
As with all APIs of this sort, adding new columns is fine - it's only removing columns or changing the behaviour of existing columns that will cause breakages for clients.
I wonder if database engines will ever have versioning, such that it would always be possible to see the database as it was at different points in time.
I'm not a fan of SQL, but I do think exposing your original source data in its original form is valuable (though it has little to do with being semantic). I carefully set up my blog to expose the raw markdown that is the source form of my blog posts in the source HTML itself, with the minimum necessary cruft around it to render it as a viewable webpage.
What a lot of folks don't realize is that the Semantic Web was poised to be a P2P and distributed web. Your forum post would be marked up in a schema that other client-side "forum software" could import and understand. You could sign your comments, share them, grow your network in a distributed fashion. For all kinds of applications. Save recipes in a catalog, aggregate contacts, you name it.
Ontologies were centrally published (and had URLs when not - "URIs/URNs are cool"), so it was easy to understand data models. The entity name was the location was the definition. Ridiculously clever.
Furthermore, HTML was headed back to its "markup" / "document" roots. It focused around meaning and information conveyance, where applications could be layered on top. Almost more like JSON, but universally accessible and non-proprietary, and with a built in UI for structured traversal.
Remember CSS Zen Garden? That was from a time where documents were treated as information, not thick web applications, and the CSS and Javascript were an ethereal cloak. The Semantic Web folks concurrently worked on making it so that HTML wasn't just "a soup of tags for layout", so that it wasn't just browsers that would understand and present it. RSS was one such first step. People were starting to mark up a lot of other things. Authorship and consumption tools were starting to arise.
The reason this grand utopia didn't happen was that this wave of innovation coincided with the rise of VC-fueled tech startups. Google, Facebook. The walled gardens. As more people got on the internet (it was previously just us nerds running Linux, IRC, and Bittorrent), focus shifted and concentrated into the platforms. Due to the ease of Facebook and the fact that your non-tech friends were there, people not only stopped publishing, but they stopped innovating in this space entirely. There are a few holdouts, but it's nothing like it once was. (No claims of "you can still do this" will bring back the palpable energy of that day.)
Google later delivered HTML5, which "saved us" from XHTML's strictness. Unfortunately this also strongly deemphasized the semantic layer and made people think of HTML as more of a GUI / Application design language. If we'd exchanged schemas and semantic data instead, we could have written desktop apps and sharable browser extensions to parse the documents. Natively save, bookmark, index, and share. But now we have SPAs and React.
It's also worth mentioning that semantic data would have made the search problem easier and more accessible. If you could trust the author (through signing), then you could quickly build a searchable database of facts and articles. There was benefit for Google in having this problem remain hard. Only they had the infrastructure and wherewithal to deal with the unstructured mess and web of spammers. And there's a lot of money in that moat.
In abandoning the Semantic Web, we found a local optimum. It worked out great for a handful of billionaires and many, many shareholders and early engineers. It was indeed faster and easier to build for the more constrained sandboxiness of platforms, and it probably got more people online faster. But it's a far less robust system that falls well short of the vision we once had.
At one point twitter seemed to want to be a relatively general protocol, where users could build their own UI, use 3rd party apps and maybe even interoperate or extend with other social networks & such.
Pg/yc even wrote about it, inviting startups to start writing apps for this exciting new protocol. The early app ecosystem was pretty slimy, with a lot of spam-ish clients for promoting snake oil. More importantly, it became clear that controlling the UI means control over users: the data, the rights, and often the ability to decide what goes into people's feeds. That's where the (financial) value is, and they're not going to give that up.
TBL's ideas were perhaps naive, but he had his finger on the right problem. Something like the semantic web was necessary in order to avoid the centralisation that did end up happening.
RSS, via podcasting, did catch on. Today it's one of the only "free" media forms. There's no company moderating podcasts the way Twitter, FB, YouTube, etc. moderate their platforms.
There's a standard XML serialization of HTML5 that supports all the features previously associated with XHTML. Additionally, RDF data can be exchanged as JSON via JSON-LD. There's no reason why a typical SPA app could not be built to query RDF-serving endpoints.
"Marking up forum posts" is something that's getting quite a bit of traction nowadays via specifications like ActivityStreams (with its "push" extension ActivityPub now powering the 'Fediverse') and WebMention.
> The entity name was the location was the definition.
While that concept sounds cool in theory, in practice it was and is a disaster. Combined with the high degree of centralization and the lack of versioning mechanisms, you have to trust the publisher not to alter the semantics, and also hope that they stay online forever, or your semantics vanish.
When I first learned about the semantic web, I was very hyped on it, but that quickly subsided once I tried actually querying the ontologies and saw that most of them yield a 404.
I'm still very hopeful for semantic data (and happy to be able to work on a product leveraging it), but I think for an open semantic web there is a lot of work that needs to go into tooling to make it succeed.
I agree with pretty much everything you said, except the part about the "VC-fueled startups". Google and fb were once startups, they were just earlier and Google in particular was smart enough to see the future. As part of a multi-faceted effort (including for instance, Chrome and gmail), they saw the need to head off the Web 3.0 standards, delivering us instead the web we have today. I wish I could have seen things as clearly then.
In the end though I'm not sure it ever would have been any different. People want it "now" and they want it "convenient".
The author seems to assume that everybody is using SQLite, but SQLite for a production database is an extremely niche choice. Attempting to expose more popular options like PostgreSQL or MySQL as SQLite would be extremely difficult because SQLite only supports a subset of SQL, whereas PostgreSQL and MySQL each implement (for the most part) their own superset of SQL.
But it doesn't matter. The API doesn't matter. Web 3.0 was never about APIs, it was about data. A standardized API is only useful if it outputs standardized data. Having a bunch of bespoke SQLite tables scattered across the web gets us no closer to the ideal of Web 3.0.
- Extremely comprehensive geospatial capabilities thanks to the SpatiaLite extension - this has a huge amount of functionality, which I think is better than MySQL though not yet as good as PostGIS: https://www.gaia-gis.it/fossil/libspatialite/index
I'm not saying it's as "good" as PostgreSQL, but I don't think your argument that PostgreSQL and MySQL implement a substantially larger portion of SQL holds up particularly well.
As luhn said, it’s more about a standard data format than a db choice. If every client has to figure out what schema a website uses for a recipe, let’s say, then Web 3.0 is still unrealistic.
Schema.org exists, but all websites adopting it seems unlikely.
That being said, I can maybe see a world in which one company adopts schema.org schemas and the rest have to follow suit to be competitive in that particular domain.
> Schema.org exists, but all websites adopting it seems unlikely.
Schema.org has the backing of major search engines and other reusers of Web-served content. It's way more likely to be adopted compared to anything else in its general domain.
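For what it's worth, the schema.org vocabulary is concrete enough to sketch here. A minimal Recipe marked up as JSON-LD (built as a Python dict purely for illustration; all field values are made up), which in a real page would be embedded in a script tag of type application/ld+json:

```python
import json

# A minimal schema.org Recipe, serialized as JSON-LD.
recipe = {
    "@context": "https://schema.org",
    "@type": "Recipe",
    "name": "Weeknight Tomato Soup",          # made-up example data
    "recipeYield": "4 servings",
    "recipeIngredient": ["1 kg tomatoes", "1 onion", "salt"],
    "recipeInstructions": [
        {"@type": "HowToStep", "text": "Roast the tomatoes and onion."},
        {"@type": "HowToStep", "text": "Blend and season."},
    ],
}

print(json.dumps(recipe, indent=2))
```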
SQLite is the most used database engine in the world, so I wouldn't call it niche. In fact, by some estimates, it is probably used more than all other database engines combined.
The only difference is that it is usually run locally (compared to Postgres and your other examples), but something doesn't have to run remotely to be considered running in production :)
> Yes, when I said "production database" I meant a database for a web application
I'm not sure that's what the author of the article means though, at least in my interpretation.
When they say "everyone is already using it", I assumed they meant literally everyone is using it on their phones and PCs every day, not that everyone is using it to develop production web applications (because very few people develop production web applications in the grand scheme of things!).
I presume they mean that it is one of (if not the) most common databases in existence in the wild, and it's interesting that it has this property of being able to be remotely read with surprisingly little overhead (without the need to implement an entirely bespoke database to be read in this way).
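To make the "surprisingly little overhead" point concrete, here's a rough sketch of the underlying range-request idea (my own illustration; the URL is a placeholder): an SQLite file is page-addressed, so a remote reader can pull the 100-byte header, learn the page size, and then fetch only the pages a query actually touches instead of downloading the whole file.

```python
import urllib.request

DB_URL = "https://example.com/data.sqlite3"  # placeholder: any statically hosted SQLite file

def fetch_range(url: str, start: int, length: int) -> bytes:
    """Fetch `length` bytes starting at `start` via an HTTP Range request."""
    req = urllib.request.Request(url, headers={"Range": f"bytes={start}-{start + length - 1}"})
    with urllib.request.urlopen(req) as resp:
        return resp.read()

# The first 100 bytes are the SQLite header; bytes 16-17 hold the page size (big-endian).
header = fetch_range(DB_URL, 0, 100)
page_size = int.from_bytes(header[16:18], "big")

# A full client would now fetch individual B-tree pages on demand
# as the query planner asks for them.
first_page = fetch_range(DB_URL, 0, page_size)
print("page size:", page_size, "fetched", len(first_page), "bytes")
```

Real implementations wrap this in a virtual filesystem layer underneath SQLite, so ordinary SQL queries drive the range requests.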
"The data needs to be exposed in its original form; any additional translation step will ensure that most people won't bother. The beauty of this technique is that you are already using SQLite because it's such a powerful database; with no additional work, you can throw it on a static file server and others can easily query it over HTTP."
The author believes (IMO wrongly) that there's lots of web app data that can be exposed via SQLite-over-HTTP without translating it into SQLite, because it's already in SQLite.
The author is saying that since lots of web apps use SQLite for their production database, they can easily "throw" their SQLite DB onto the web. But, in that case, you're out of luck if you use Postgres, MySQL, Oracle, MS SQL Server, or any of the popular key-value datastores like Mongo, Redis, or Elasticsearch.
I think he does mean web apps/sites, at least in large part. He is talking about implementing web 3.0, after all. OTOH, I suppose there's no reason why "web" has to apply primarily to things that are currently webstuff. You both make good points.
sqlite is actually quite robust in a production web application environment; I have used it as a database in several production applications over the years, including one that serviced 600k MAUs without issue. if your application is very write heavy or you're FAANG scale it could present a problem, but IMO sqlite is probably the best bang for your buck solution for the workload of most websites and applications.
tbh I wish it were used more. it's much cheaper to run and just as fast as mysql (sometimes much faster) on your average wordpress blog or equivalent. you don't need to handle thousands of concurrent writes, only maybe 5 max at peak... and queueing them for tens of milliseconds is totally fine. as long as you're not writing horrifically inefficient insert operations, for most sites you absolutely won't notice until you're under ridiculously high load.
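For the "queue the writes for tens of milliseconds" part, the usual knobs are WAL mode plus a busy timeout, so a second writer briefly waits for the lock instead of erroring out. A minimal sketch (the specific values are arbitrary):

```python
import sqlite3

def open_db(path: str) -> sqlite3.Connection:
    con = sqlite3.connect(path, timeout=5.0)  # driver-level wait for a locked database
    con.execute("PRAGMA journal_mode=WAL;")   # readers no longer block the single writer
    con.execute("PRAGMA busy_timeout=50;")    # writers retry for up to 50 ms instead of failing
    con.execute("PRAGMA synchronous=NORMAL;") # common WAL-mode durability/speed trade-off
    return con

con = open_db("app.db")
con.execute("CREATE TABLE IF NOT EXISTS hits (ts TEXT DEFAULT CURRENT_TIMESTAMP)")
con.execute("INSERT INTO hits DEFAULT VALUES")
con.commit()
```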
Yeah, I've heard the argument before that SQLite is just fine as a production database. Not arguing against that, just saying that it's not a common choice.
SQLite is not uncommon in web applications. It isn't just found in phone apps and OSs.
It works pretty well for tasks that are primarily read heavy, and is a lot lighter doing so than the common non-nosql alternatives (SQL Server, postgres, mariadb/mysql, ...). For applications that do a lot of writing, particularly where concurrent writes are desirable, it is far less ideal though.
"nosql" tends to mean more than simply not SQL. Often not any sort of relational structure and related querying methods, for instance, so "not nosql" is not the same set of things as "SQL".
And yet, you know that isn't the point and are just trying to have the last word.
SQL is not the only structured query language, and SQL-queried databases are not the only relational databases. It just happens that the database services that are not nosql and commonly “compete” with sqlite as a relational database mostly use some variant of SQL.
This whole architecture flips a lot of assumptions on their heads. In particular here, the assumption that all user queries end up in the same database instance, which then might not handle the load, is thrown out. Queries here are primarily local to, and potentially defined by, the user. If that's useful to you, it's not something you'd want to use a traditional MySQL or Postgresql setup for.
GraphQL is not a standard, it's just a technology for building custom APIs which are far from "agnostic" in practice. You can use SPARQL if interoperability is your goal.
If you have a Postgres database you can expose it as GraphQL by putting Hasura in front of it. A piece of software used by Walmart and Atlassian among others, despite being just 5 years old.
Nothing like that exists for SPARQL, RDF and Postgres.
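For anyone who hasn't seen SPARQL in the wild, here's a minimal sketch of querying one of the public endpoints that does exist (Wikidata's), using only the standard library; the query itself is a throwaway example:

```python
import json
import urllib.parse
import urllib.request

ENDPOINT = "https://query.wikidata.org/sparql"

# Ten arbitrary triples, just to show the shape of a query and its JSON results.
QUERY = "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 10"

url = ENDPOINT + "?" + urllib.parse.urlencode({"query": QUERY, "format": "json"})
req = urllib.request.Request(url, headers={"User-Agent": "sparql-example/0.1"})
with urllib.request.urlopen(req) as resp:
    results = json.load(resp)

for row in results["results"]["bindings"]:
    print(row["s"]["value"], row["p"]["value"], row["o"]["value"])
```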
My point was not that people are using SQLite in prod everywhere; read that paragraph in more of a speculative voice, not a statement of fact about the present. At any rate, I do think the range request technique makes SQLite more practical to use in database-driven apps that normally would've opted for a traditional db like postgres (though there is more work to be done to make this technique fast when doing complex queries... lots of joins are no bueno right now).
I enjoyed the speculative bias there, as it's prompting me to pick up the sqlite bundle that's been sitting in my downloads while I use postgres-like-the-rest.
Exactly. And there’s really no reason those queries couldn’t be made public / collaboratively maintained.
You could probably take it one step further and define an OpenAPI spec which is populated via those queries. Tho that would require an intermediary / post-processing, likely with a cache.
Regardless, the capability to determine how and what to consume sits with the consumer (developer) from the outset. Rather than having to scrape the data, normalise it into some form of schema, and then build an api / interface around it. And then worry about keeping it up to date.
People are talking that way now. Before what is currently being called "Web3", Web 3.0 was to be the "semantic web" (just as Web 2.0 was the "interactive web", where the focus of technological enhancement was many things becoming read/write instead of read-only and greater interactivity, both social interactivity and individual interactive apps moving to the web).
This article is talking about that earlier definition, and a way it might once again be the definition, perhaps relegating Web3 to Web 4.0 (or web we-worked-out-it-was-a-ponzi-scheme-so-just-stopped with something else being Web 4.0, if you take the more cynical view).
"Before the term was hijacked by crypto-grifters (and, admittedly, a few genuinely neat projects), web 3 (point oh) referred to Tim Berners-Lee's project to promote a standard way to expose and parse metadata on the web."
> Data on the web will only be "semantic" if that is the default, and with this technique it will be.
Not going to work unless imposed by some external force. The semantics of the web can more practically be extracted with neural nets, but it's a long tail and there are errors. Lots of good work recently in parsing tables, document layouts and key-value extraction. LayoutLM and its kin come to mind.[1]
Very nice indeed! I am sorry I did not notice earlier the discussion about the previous blogpost on the subject [0], “Using the SQLite-over-HTTP "hack" to make backend-less, offline-friendly apps”.
Are there more than 2 blogposts? Cannot find a posts page.
Humans, as of now (and as far as I'm aware, being outside the AI labs at the big tech companies and DARPA) have agency, and so are in a unique position to take advantage of the uniform interface of REST/the web in a flexible manner. I wrote an article about this on the intercooler.js blog, entitled "HATEOAS is for Humans":
The idea that metadata can be provided and utilized in a similar manner doesn't strike me as realistic. If it is code consuming the metadata, the flexibility of the uniform interface is wasted. If it is a human consuming the metadata, they want something nice like HTML.
For code, why not just a structured and standardized JSON API?
This appears to be what we have settled on, and I don't see any big advantage extending REST-ful web concepts on top of it. The machines just ignore all that meta-data crap.
>> why not just a structured and standardized JSON API?
So in this version of the idea... because structuring data requires work. Unstandardized data exists already. Some of it is already SQLite. A lot of the rest is in other SQL databases, and that might be a smaller bridge.
The author claims (if I'm understanding correctly) that a static website could easily query SQLite databases over HTTP, and bam, web 3.0.
Honestly, it's hard for me to think/discuss these ideas without examples, even if contrived. What kind of websites would be built this way? What data will they be querying?
A web app that uses photos and address books on the users phone? An alternative UI for news.yc?
> why not just a structured and standardized JSON API?
The word "just" is doing a lot of work there! Getting everyone to use the same standardized JSON API format turns out to be incredibly difficult.
This is why I'm a big fan of the idea of using SQL as an API language to redefine the data into the output format that you need, see my comment here: https://news.ycombinator.com/item?id=29900403
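A concrete, if contrived, sketch of that idea (assuming a SQLite build with the JSON1 functions available, which is the norm these days): the query itself becomes the contract, reshaping whatever schema the publisher has into the JSON the consumer actually wants.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE recipes (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE ingredients (recipe_id INTEGER, item TEXT);
    INSERT INTO recipes VALUES (1, 'Tomato Soup');
    INSERT INTO ingredients VALUES (1, 'tomatoes'), (1, 'onion'), (1, 'salt');
""")

# The "API" is just a query: the caller decides the output shape.
row = con.execute("""
    SELECT json_object(
        'name', r.name,
        'ingredients', json_group_array(i.item)
    )
    FROM recipes r JOIN ingredients i ON i.recipe_id = r.id
    WHERE r.id = ?
    GROUP BY r.id
""", (1,)).fetchone()

print(row[0])  # {"name":"Tomato Soup","ingredients":["tomatoes","onion","salt"]}
```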
> The semantic web will never happen if it requires additional manual labor.
Is manual labor the reason things turned out the way they did, with google spending whatever it took to index and monetise the whole web the way it did?
I think the author missed the "semantic" part. If you push your own SQLite, then no, I don't have the semantic meaning of the website. Only a standardized semantic file format ala RDF can achieve that.
Out of context, you won't necessarily be able to glean meaning from either an arbitrary SQLite database or arbitrary RDF tuples. Both are equally meaningful or meaningless depending on the observer...at the end of the day, they are just structured data with labels that (hopefully) the observer understands. One doesn't have inherently more semantic meaning than the other.
Not going to happen. The reasons for the Semantic Web never taking off were never technical. Websites already spend a lot of money on technical SEO and would happily add all sorts of metadata if only it helped them rank better. Of course, many sites’ metadata would blatantly “lie” and hence the likes of Google would never trust it.
Re exposing an entire database of static content: again, reality gets in the way. Websites want to keep control over how they present their data. Not to mention that many news sites segregate their content into public and paywalled. Making raw content available as a structured and queryable database may work for the likes of Wikipedia or arxiv.org. But it's not likely to be adopted by commercial sites.
I wonder if combining this idea with some kind of microtransactional currency such as the bitcoin Lightning Network or even a simple Chaumian e-cash system (1) would help to get around the issue of requiring clickbait, advertising and SEO with every single piece of data.
Would be great if providers could offer data in raw form without the overhead of all the gunk that gets them paid.
I think both the capital S Semantic Web and the lowercase semantic web (microformats) kind of just fizzled out towards the end of last decade without changing much at all on the actual web.
The lower case variety kind of survives as a smart thing to do to "help" search engines a little, but otherwise has very little real world relevance. All talk of doing anything with on-page information in browsers evaporated a long time ago. E.g. MS had some plans for this in early versions of Edge, and there were some nice extensions for Chrome and Firefox as well. Not a thing any more. Most of that got unceremoniously ripped out of browsers a long time ago. At this point it's basically just good SEO practice to use microformats, as search engines can use all the help they can get to figure out what is what on a page. Other than that, whether you render your data to a canvas, a table, or nice semantic HTML has very little relevance for anyone. It's all just pixels that hit your eyeballs in the end. There's nothing else that looks at that information. With the exception of search engines. And they were part of web 1.0 already.
The capital S Semantic Web with ontologies, triple databases, etc. never really got out of the gates and is perpetually stuck in people doing very academic stuff or specialist niche stuff that largely does not matter to anyone else. The exception is graph databases, which are still used in some data/backend teams for some stuff. And of course a few of those also pay lip service to some of the Semantic Web W3C standards from the early 2000s even though that is not the main thing they do anymore. Either way, too much of a specialist thing to call it a semantic web (capital or lower case). Most of the web uses exactly none of this stuff. But nice tools to have if you need them. You could argue a lot of the people involved moved their focus to AI and machine learning, which certainly looks like it is having a very large impact on e.g. search engines.
I guess web3 has that in common with web 3.0 (other than the number 3). There are a few people who desperately (and loudly) want the web to go their way and insist it must be the future. But most people couldn't care less. In the end people just vote with their feet and gravitate to technologies that work for them or solve a problem they have and ignore things that don't do anything useful for them. In the case of Semantic Web, there was nothing there that you could coherently explain (i.e. without using all sorts of abstractions, complex stuff, and simplistic hyperbole). There were a few startups and lots of hype. They did a bunch of stuff. Most of those startups no longer exist or have faded into irrelevance. And the few that survived carved out a few interesting niches but did not end up producing any mainstream, must have technology. Certainly no unicorns there. Wolfram Alpha probably is one of the more well-known ones that actually shipped something useful. But it's a destination and not the web.
Web3 has the same issues. Most threads on HN on web3 devolve into people talking about what it is, ought to be, or isn't and why that is or isn't important. That seems to be impossible to do without using a lot of hyperbole and BS. Very little substance in terms of widely adopted technology or even in terms of what that technology looks like or should look like. It's Web 1.0 all over again. Step 1: Blockchain, Step 2: ????, Step 3: Profit (or not).
Most of the web is just a slightly slicker version of what we had 15 years ago (web 2.0). AJAX definitely became common place. We now have mature versions of HTML, SVG, CSS, etc. that actually work. And with WASM we can finally engineer some proper software without having to worry about polyfills and other crazy hacks to make javascript do stuff it clearly is not very good at. I'm looking forward to the next 15 years. It's going to be interesting and possibly a wild ride.