Hacker News
SQLITE: JSON1 Extension (sqlite.org)
311 points by ashish01 on Feb 3, 2017 | 70 comments

It should be made clear that this functionality is not equivalent to the Postgres JSON support where you can create indexes based on the structure of the JSON. This is an extension to the query language that adds support for encoding/decoding within the context of a query:

    SELECT DISTINCT user.name
    FROM user, json_each(user.phone)
    WHERE json_each.value LIKE '704-%';
It's pretty neat considering they rolled it all on their own: https://www.sqlite.org/src/artifact/552a7d730863419e
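For a concrete sense of what that looks like in practice, here is a minimal sketch of the same kind of json_each() query driven from Python's bundled sqlite3 module (this assumes a SQLite build with the JSON1 functions available; the sample data is invented):

```python
import sqlite3

# Minimal sketch: a json_each() query like the one above, run through
# Python's bundled sqlite3 module. Assumes a SQLite build with JSON1
# available; the table contents are invented.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE user (name TEXT, phone TEXT)")
conn.execute(
    "INSERT INTO user VALUES (?, ?)",
    ("tester", '["704-555-1234", "919-555-9999"]'),
)
# json_each() expands the JSON array into one row per element:
rows = conn.execute(
    "SELECT DISTINCT user.name "
    "FROM user, json_each(user.phone) "
    "WHERE json_each.value LIKE '704-%'"
).fetchall()
print(rows)  # [('tester',)]
```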

PS: If you haven't looked at the SQLite source before, check out their tests.

On the other hand the same version supports indices on expressions that could be used to create an index on json values. Not exactly the same but could be used for similar use cases.
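A rough sketch of that combination, assuming a build with both JSON1 and indexes on expressions (SQLite >= 3.9); the table and index names here are invented:

```python
import sqlite3

# Rough sketch of an index on a JSON expression. Requires a SQLite
# build with both JSON1 and expression indexes (>= 3.9); the table
# and index names are invented.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, body TEXT)")
conn.execute(
    "CREATE INDEX idx_events_user ON events(json_extract(body, '$.user'))"
)
conn.execute("""INSERT INTO events (body) VALUES ('{"user": "alice"}')""")
# The query planner can now use idx_events_user for this lookup:
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT id FROM events "
    "WHERE json_extract(body, '$.user') = 'alice'"
).fetchall()
print(plan)
```

Note that the planner only considers the index when the query's expression matches the indexed expression exactly.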

Anything specific about their tests? I found this doc: http://www.sqlite.org/testing.html

SQLite is frequently cited for having incredible robustness, and this is certainly related to incredible test coverage.

Many of the stats on that page are impressive, but the one that always gets me is that for 122 thousand lines of production code, the project has 90 million lines of tests.

Though I am a huge fan of SQLite, I am not sure the incredible test coverage necessarily leads to success:

"Trying to improve software quality by increasing the amount of testing is like trying to lose weight by weighing yourself more often. What you eat before you step onto the scale determines how much you will weigh, and the software-development techniques you use determine how many errors testing will find. If you want to lose weight, don't buy a new scale; change your diet. If you want to improve your software, don't just test more; develop better."(McConnell, Steve (2009-11-30). Code Complete (Kindle Location 16276). Microsoft Press. Kindle Edition)

Another unique aspect of the SQLite code base is its total adherence to KISS (http://www.jarchitect.com/Blog/?p=2392).

That is not an apt analogy. Unit tests improve code quality: code that is easily testable has high cohesion and low coupling. Tests can also serve as valuable documentation.

I have to work in a few codebases at work that require 100% line and branch coverage.

There are many cases where someone will write tests that hit an endpoint directly and then assert on the whole response, which in this case is quite huge. They'll then do so for all branches.

Their library / service / etc. code is technically exercised, sure, but doTheFooThing() isn't directly tested, so it could have a bug that is only exposed from another caller with different parameters that would be caught with direct testing. Extreme coupling happens all the time.

Now I'm slowly converting it to sanity, and my teammates are copying me.

To be fair, it was one of those "Get this out now because we're dying" kinds of codebases, not due to lack of skill.

But once code is written, it's hard to undo. Then the bad pattern becomes "keeping the same style"

I am in favor of testing the way you complain about. The advantage is primarily that you can refactor code without getting bogged down in having to change tests. An API is a) more stable, so you are maintaining compatibility for that surface anyway, and b) uses your code under exactly the preconditions that really matter.

If doTheFooThing() is called from somewhere else, then that somewhere else should also have tests. So I find that more an argument from "purity" than a practical consideration about bug probability.

Also, if you only test doTheFooThing but not the API then you could accidentally refactor yourself into breaking the API in a backwards-incompatible way (or not be bug-compatible, which is sometimes required, or at least you should detect it and check logs and warn consumers). So the API tests are needed anyway.

There's a balance of course: if doTheFooThing() is an important internal crossroad, or if it is algorithmically non-trivial, or important for other reasons, then it should be tested in separation. But between only semi-integration tests (hitting endpoints and checking responses), and lots of small unit tests that break or need rewriting for the simplest refactorings but don't catch subtle API breakage, I'd want to work with the former any day. The units of code are often trivial and not where mistakes are made; the mistakes come when stringing them together, and then it is harder to trust the human capacity to figure out failure scenarios than to just run the real handlers.

I've gone through various stages on my feelings about full coverage unit testing vs functional testing.

Having full unit coverage with full path coverage is a great ideal. The reality is that for most companies the overhead of having and maintaining these tests is impossible from a business POV. Getting buy-in to spend more than 50% of your time maintaining (writing, updating, verifying) tests is a very hard sell. For companies with under-staffed/over-worked development teams, it just doesn't happen.

At this point in my career I'm firmly in the camp that functional testing is really what matters in most cases. It is a compromise between the realities of business and the needs of engineering. I can test the end product of the development work as a whole and identify that everything works as expected and has no known strange side-effects. This also serves as a contract with your business users as to functionality of the product. If bugs are discovered you can add specific use-case tests to attempt to trigger the bug and prevent regression. All of this does not preclude limited unit testing as well for critical code paths. I find this to be a much more pragmatic approach.

High cohesion and low coupling are good, and easily testable correlates with that, but having lots of unit tests doesn't imply that the code is easily testable, and not having unit tests doesn't imply that code has low cohesion or high coupling. It's even possible code needs many lines of testing partly because it lacks in ease of testing.

In the case of SQLite I think it mostly is because of hard work to fulfill the ambition to deliver a robust project and because of the existence of fuzzing, which can automate test generation.

That is an analogy I copied from McConnell, Steve (2009-11-30). Code Complete (Kindle Location 16276). Microsoft Press. Kindle Edition. I tend to rely on what is written in this book.

That's fine. It's a good book. That doesn't mean the analogy works in every situation. I happen to think it doesn't in this one. Dr. Hipp is a great coder. Telling him to "develop better" from that analogy just falls apart. Tests are there for a reason and Dr. Hipp uses them to great effect on the quality of SQLite.

Yeah, I agree. Actually the whole point of the original comment was that SQLite's robustness is the result of multiple development practices they use (including testing, of course), but test coverage can't lead to success by itself. That's it)

More unit tests would be like weighing how much bone, muscle, fat, organ, water, etc. you have in your body, instead of taking just one combined weight.

You can test known edge case input along with the general input, and it really does give you more robustness.

It is pretty well demonstrated that testing your software more reduces the number of bugs. That's why all safety critical software is required by industry standards to have complete test coverage.

Above a certain amount of tests it is maybe not the most cost-effective way to reduce defects, but arguing that testing is not the way to go seems misguided to me.

Well, no one disputes that testing is imperative. It's just that the numbers SQLite exposes might suggest that their test coverage is beyond any reasonable limit )

> the SQLite library consists of approximately 122.9 KSLOC of C code. (KSLOC means thousands of "Source Lines Of Code" or, in other words, lines of code excluding blank lines and comments.) By comparison, the project has 745 times as much test code and test scripts - 91596.1 KSLOC.

Though again, I admire SQLite and use it in my projects all the time.

Yes. Despite all their tests, they still have monthly bugfix releases, often fixing years-old bugs.

Do you think they would have fewer bugs if they had worse test coverage?

I was wondering why this comment received multiple downvotes.

If you want to lose weight, you need to be able to actually measure your weight. Codebases with inadequate tests can't do this accurately, so they don't even know what changes to their diet/development process help.

Also, the quote does not say that you have to avoid testing, it says that trying to improve software quality by increasing the amount of testing does not make much sense.

> If you want to lose weight, you need to be able to actually measure your weight.

That's not true. There are multiple ways to track weight-loss progress, from simply looking in the mirror to measuring body fat percentage.

Tests are like guard rails. Nobody's saying they shouldn't be there, but they're safety nets.

[Rich Hickey's _amazing_ talk on simple vs easy](https://github.com/matthiasn/talk-transcripts/blob/master/Hi...)

The `sqllogictest` suite (https://www.sqlite.org/sqllogictest/doc/trunk/about.wiki) is a great resource that is not SQLite-specific. It's a huge collection of queries and expected responses (in an easily-parseable format) that mostly sticks to the SQL standard so it can be run against any database backend. We're using it in CockroachDB.

I'm pretty sure one can create indices based on JSON expressions (see, for example, http://sqlite.1065341.n5.nabble.com/Partial-indexes-on-JSON-...). I don't know how it compares to Postgres' JSON support.

That sounds great. Could anyone explain what the differences between the two would be?

Kinda like SQL Server. Postgres has the superior implementation for now.

> It should be made clear that this functionality is not equivalent to the Postgres JSON support

I found that pretty clear by way of it being SQLite :P

How does the speed compare to MongoDB?

Wait, what? Isn't storing JSON data as text in a relational DB against all kinds of normalisation rules? Under what circumstances should one do this?

When you have arbitrary semi-structured data of limited scope to store. For example, the stuff I was working on today is pluggable payment infrastructure. The vendor response is stored as JSON (it comes down the wire as XML or JSON; JSON is easier, and I refuse to anticipate the structure of the data for future providers, but I want the whole response retained for debug purposes), and the extra data it requires for transaction resolution is also JSON. Again, I have no idea what this will look like for future payment providers, and this data will not have consequences for other parts of the database.

Yep, I've done something very similar for payment processing (from multiple providers) recently. Have a well defined schema containing the columns that I _know_ I need now for handling a transaction, but also include a jsonb column that stores all of the data that the payment provider provides for the transaction.

For one, this makes debugging easier, and it also means that should business needs change in the future and some field that we've been receiving becomes important for payment processing, it can be extracted from the json field and promoted to a column in the table, without having to _now_ define a load of columns for every possible field that every payment provider can ever supply.
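A minimal sketch of that pattern, done here with SQLite rather than the Postgres jsonb the comment describes, since that's the thread's topic; all table, column, and JSON field names are invented:

```python
import sqlite3

# Sketch of the "known columns plus raw JSON payload" pattern.
# All table, column, and JSON field names are invented; assumes a
# SQLite build with JSON1 available.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE payment (
        id INTEGER PRIMARY KEY,
        amount_cents INTEGER NOT NULL,  -- fields we know we need now
        provider TEXT NOT NULL,
        raw_response TEXT               -- everything the provider sent
    )
""")
conn.execute(
    "INSERT INTO payment (amount_cents, provider, raw_response) "
    "VALUES (?, ?, ?)",
    (1999, "acme", '{"status": "ok", "fee_cents": 58, "trace": "abc123"}'),
)
# Later, if a buried field becomes important, promote it to a real
# column and backfill it from the stored JSON:
conn.execute("ALTER TABLE payment ADD COLUMN fee_cents INTEGER")
conn.execute(
    "UPDATE payment SET fee_cents = json_extract(raw_response, '$.fee_cents')"
)
fee = conn.execute("SELECT fee_cents FROM payment").fetchone()[0]
print(fee)  # 58
```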

Another example from my space is product catalog data. There is a large amount of data that is common, and that is normalized properly. There is also a large amount of per-product-type variability that needs to be captured. One way to capture it is to store that specific data as JSON. Being able to index and query the JSON-structured data is a huge boon there.

Simple example, settings.

Stuff you want to make configurable at runtime.

> as text

That's the issue he was referencing, I think.

SQLite is a great database for microservices and other minimalist architectures. You'll never get a "TCP socket failed/timeout/reset" from SQLite - that's for sure.

I always seem to be the black sheep in a group of people when I say that I love sqlite. It's seriously so handy. If I need to aggregate or parse big CSV sheets, I just toss them in a sqlite database and run actual SQL queries against the data and it makes the process much faster and I can give someone their data in relational form on a thumbdrive or in a tar.gz.
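That workflow can be sketched in a few lines with the standard csv and sqlite3 modules (the CSV content here is invented; in practice you'd read an actual file):

```python
import csv
import io
import sqlite3

# Sketch of the "toss a CSV into SQLite and run real SQL" workflow.
# The CSV content is invented for illustration.
csv_text = "name,amount\nalice,10\nbob,5\nalice,7\n"
records = list(csv.DictReader(io.StringIO(csv_text)))

conn = sqlite3.connect(":memory:")  # or a file, to hand over on a thumbdrive
conn.execute("CREATE TABLE sales (name TEXT, amount INTEGER)")
conn.executemany("INSERT INTO sales VALUES (:name, :amount)", records)

# Aggregate with plain SQL instead of hand-rolled loops:
totals = conn.execute(
    "SELECT name, SUM(amount) FROM sales GROUP BY name ORDER BY name"
).fetchall()
print(totals)  # [('alice', 17), ('bob', 5)]
```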

I thought everyone loved SQLite! It's tiny, no-fuss but full-featured, performs well, and works great as a data-interchange format. I use it in all my simple ad-hoc personal apps, and where would mobile be without it?

Not to mention browsers. Many people are surprised to learn iMessage, Chrome and Firefox all use SQLite.

More people would probably be surprised to discover Adobe Lightroom catalogues are sqlite databases (which means you can pull all sorts of info out of your catalogue, should you so desire, or fiddle with it in unsupported ways).

It's a shame I don't see Adobe on the list of sponsors/contributors, I have to say. Lightroom makes a huge amount of money for them.

I think there could be some instances where people suggest or build stuff with SQLite when a more robust database solution is called for. Or they start testing/developing with SQLite when the final target is, say, PostgreSQL, and delay too long getting tests and rigor in place on the real database (or avoid doing so outright). I don't know that you fall into any of those camps (I don't know you, after all), but if you did I could understand why you'd find yourself on the wrong side of opinion.

However, SQLite is great for ad hoc data manipulation as you say, and there are application niches, particularly on the client side, where it can be great. By itself, everything I know about it says it is an excellent piece of software, and the tests are practically a case study in rigor (again, from what I've heard)... so how could you not love that?

> I think there could be some instances where people suggest or build stuff with SQLite when a more robust database solution is called for.

They do, unfortunately. I inherited a web2py application using SQLite as the database. It was maybe OK when the original developer wrote it, but it's pretty clear now that we should move to a full-fledged database.

There already were "fossil" columns in the database because there is no DROP COLUMN and no RENAME COLUMN. The lack of those basic features greatly hinders development because they must be implemented by creating a new table with the new schema and copying all the data in there. Luckily we can stop production to perform those operations, but the extra developer time means that any of those operations costs an unreasonable amount of money to the customer (compared to ALTER TABLE DROP COLUMN). It's not a surprise that they decided to keep those fossils into the database for now and "we'll see what to do".
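The rebuild dance looks roughly like this, sketched with invented table and column names (newer SQLite releases have since gained ALTER TABLE ... DROP COLUMN, but at the time copying into a new table was the only option):

```python
import sqlite3

# The table-rebuild dance described above, sketched with invented
# names. "Drop" a column by copying everything else into a new table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT, fossil TEXT);
    INSERT INTO customer (name, fossil) VALUES ('alice', 'unused');

    -- Rebuild the table without the fossil column:
    CREATE TABLE customer_new (id INTEGER PRIMARY KEY, name TEXT);
    INSERT INTO customer_new (id, name) SELECT id, name FROM customer;
    DROP TABLE customer;
    ALTER TABLE customer_new RENAME TO customer;
""")
cols = [row[1] for row in conn.execute("PRAGMA table_info(customer)")]
print(cols)  # ['id', 'name']
```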

Another problem I ran into was the typelessness of the storage. Basically, SQLite has storage classes and maps SQL types to them (Edit: see the note at the end). I quote https://www.sqlite.org/datatype3.html: "The important idea here is that the type is recommended, not required. Any column can still store any type of data" and "If numerical data is inserted into a column with TEXT affinity it is converted into text form before being stored." I won't go into details, but I had some fun with dates, stored in different formats in different TEXT fields and even in the same column of the same table, because different versions of a controller had different ideas about how to parse the dates received from the browser.
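The flexible typing quoted there is easy to demonstrate with a made-up sketch: a TEXT column accepts whatever each writer hands it, with numbers coerced to text by the column's affinity:

```python
import sqlite3

# Demonstration of SQLite's flexible typing: a TEXT column accepts
# whatever each writer hands it, coercing numbers to text.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE log (created TEXT)")
# Three "versions of the controller" with different ideas about dates:
conn.execute("INSERT INTO log VALUES ('2017-02-03')")  # ISO string
conn.execute("INSERT INTO log VALUES ('03/02/2017')")  # locale string
conn.execute("INSERT INTO log VALUES (1486080000)")    # unix timestamp
rows = conn.execute("SELECT created, typeof(created) FROM log").fetchall()
print(rows)
# All three formats coexist in one column, each stored as text:
# [('2017-02-03', 'text'), ('03/02/2017', 'text'), ('1486080000', 'text')]
```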

For these and other reasons we should be migrating to PostgreSQL (even MySQL would do). That would mean rewriting parts of the application and testing it all (guess whether the original developer wrote any tests). That costs too much upfront, so we're sticking with SQLite for the time being, until we hit a showstopper.

At least a problem this application won't ever have is scaling. There will always be one server and one process writing to the database. SQLite is OK in this scenario. A different web app for a different customer (also inherited) is using SQLite and maybe it will have to scale horizontally in the next months. That means we'll have to replace SQLite with PostgreSQL but again, it costs and will do it only when necessary.

TLDR, my suggestion is to never start a web application with SQLite. It's not its natural environment and there is little to gain because setting up a traditional database is not that difficult anyway. Use SQLite only as embedded db for desktop or mobile applications. I wish we had it in the browser too. I'm sorry that https://www.w3.org/TR/webdatabase/ has died. IMHO it was much better than https://www.w3.org/TR/IndexedDB/

Edit: I read once again the page I linked and concluded that this is because of the choice of the developer. He used varchar for some date fields and numerical fields for others. The problems were in the varchar fields. No blame to SQLite here. Still, automatic data conversion is not a good idea for a database IMHO.

I'm a recent convert to sqlite - started using it in CLI Go apps as a more robust way of handling large data sets than building my own bespoke in memory databases. It's not often I come across software that impresses and amazes me as much as sqlite does. It's tiny, ridiculously easy to implement, writes 1 flat file, uses comparatively little memory, and is /fast/. I don't even know how they managed to make it so fast given all the other constraints.

Seriously impressive database.

I've been a fan of SQLite for nearly two decades now! I use it all the time in my Win32 apps for recording things like local log files and other things that don't require multi-user access. I recall even building a simple local-network full-text search engine for legal documents for a law firm once, using SQLite with their FTS3 extension.

Built a few iOS apps with it as the data store too. Love the 'zero configuration' install. From memory, I've even used DOS batch files to manipulate data in SQLite...

Me too, nearly two decades now. Started using it for Win32 apps also. Now we're using it for mobile apps.

I'm a fan of sqlite for similar reasons.

I also use it frequently to do some data massaging before loading into a larger database since it has a standard Python module and queries are quick and easy.
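A sketch of that scratch-pad workflow, with invented data: an in-memory database as a staging area where plain SQL does the cleanup before the data moves on:

```python
import sqlite3

# Sketch of using an in-memory SQLite database as a scratch pad to
# massage data before loading it into a larger database. The raw
# rows here are invented.
raw = [("  Widget ", "3"), ("WIDGET", "4"), ("gadget", "2")]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE staging (name TEXT, amount TEXT)")
conn.executemany("INSERT INTO staging VALUES (?, ?)", raw)

# Plain SQL normalizes and aggregates the messy input:
cleaned = conn.execute("""
    SELECT lower(trim(name)) AS name,
           SUM(CAST(amount AS INTEGER)) AS total
    FROM staging
    GROUP BY lower(trim(name))
    ORDER BY name
""").fetchall()
print(cleaned)  # [('gadget', 2), ('widget', 7)]
```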

I love sqlite, and I also love MySQL/MongoDB/PG/HBase/TiDB...just use the right tool for different job.

Most people I interact with love sqlite.

sqlite 'all the things!' Seriously. One of the best tools ever. So many data and performance challenges in almost any app can be solved efficiently with this (for what it does) tiny little library.

Oooh! I just arrived at work and things are calm, this will be interesting reading! We use sqlite for mostly every tool we develop.

Not clear if you can compose these functions. Flatbuffers over rocksdb should be considered an alternative. You can then use iterlib to do similar things.

Plug: https://github.com/facebookincubator/iterlib

Thank you for plugging iterlib here, I hadn't heard of it.

Have you heard of storehaus? https://github.com/twitter/storehaus

In terms of the fundamental abstraction offered, it seems comparable to iterlib to me, but I'd love to hear your opinion.

(See also: https://upscaledb.com/)

I use this for a custom JSON query tool and browser I wrote for our company (the C# client can load extensions). It's been available for a while. Is this post to spread awareness, or have they added something new to the extension?

This is very interesting, I wonder what drove this extension and how things will shape with SQLite. I always loved that in Python I can 'just use' SQLite quite easily.

> Experiments have been unable to find a binary encoding that is significantly smaller or faster than a plain text encoding. (The present implementation parses JSON text at over 300 MB/s.)

I understand that JSONB in Postgres is useful primarily for sorts and indexes. Does SQLite work around this somehow, or is that just not included in their experiments?

Looks like you can't index based on JSON in SQLite, so they might be optimizing for different metrics?

My understanding is that yes you can create an index using this JSON data. You just have to create an index from an expression. http://sqlite.org/expridx.html

It's possible this is just the first step in that direction.

Here are instructions for building the extension as a shared library on OS X https://burrows.svbtle.com/build-sqlite-json1-extension-as-s....

I wasn't able to get a working build on Windows.

Thanks for the procedure. Exactly what I needed!

What happened on Windows?

I was able to build the dll and load it into python, but using any of the json1 functions caused my python to crash.

Is there a Ubuntu PPA or docker image of SQLite + JSON extension enabled?

I can't get the compiled json1.so to load on Ubuntu 14.04 lts with stock SQLite.

I think the version on 14.04 is too old. I found a statement that you need 3.9 for that (is that documented properly somewhere?).

For what it's worth, the sqlite3 version in Ubuntu 16.04 has the json1 extension enabled by default:

    sqlite> CREATE TABLE user (name TEXT, phone TEXT);
    sqlite> INSERT INTO user VALUES('tester', '{"phone":"704-56"}');
    sqlite> SELECT DISTINCT user.name FROM user, json_each(user.phone) WHERE json_each.value LIKE '704-%';
I'm now even happier that I upgraded my server to that version yesterday; the server for that project was, like yours, on 14.04. That upgrade was also because of SQLite: I wanted a version that supports the WITH clause. Upgrading to 16.04 (actually, I made a fresh install and moved the data over) seemed like the easiest way to get that.

How does this perform on larger tables?

Is there any plan to support jsonb (or similar) which would speed processing by eliminating the need to reparse?

The embedded version of mongodb?

> The json_object() function currently allows duplicate labels without complaint, though this might change in a future enhancement.

Just like Mongo. I was surprised to learn that Mongo can actually store an object with two identical keys. Most drivers will prevent you from doing so, and will fail to display the record fully if it does happen, but it is possible.

More like embedded postgres.
