
Checking the datatype is not the same as validating. There is lots of data out there that is invalid, and yet still has the correct type. In fact, that is the common case.

I dare say you will be hard pressed to find a dataset of significant size that doesn't have at least one invalid entry somewhere. Increasingly strict type rules will not fix that.


Dr. Hipp,

> I dare say you will be hard pressed to find a dataset of significant size that doesn't have at least one invalid entry somewhere

I agree. In my experience, and you've forgotten more than I have learned, the mark of a good data engineer is how they account for invalid entries, or whether those entries simply get `/dev/null`ed.

> Checking the datatype is not the same as validating. There is lots of data out there that is invalid, and yet still has the correct type. In fact, that is the common case. Increasingly strict type rules will not fix that.

I am having a hard time letting go of this opportunity to learn from you, so in case you have time and read this again - When you say "There is lots of data out there that is invalid, and yet still has the correct type", I read "type" as "shape" or "memory layout" and "invalid" as "semantically wrong".

So, is a good example of this a value of `-1` for a person's age? The database sees a perfectly valid integer (the correct shape), but the business logic knows a person cannot be negative one years old (semantically invalid).

In that case, to be explicit: `0` or `14` has a valid type for an age (usually integer), but is completely invalid data if it's sitting in an invoicing application for an adults-only business?
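A tiny sketch of exactly this distinction, using Python's stdlib `sqlite3` module (the table and column names here are made up for illustration): the database happily accepts a well-typed but semantically invalid age, and a `CHECK` constraint is one way to push the business rule into the schema.

```python
import sqlite3

con = sqlite3.connect(":memory:")

# Without a constraint, SQLite happily stores a semantically invalid age:
# the *type* is a perfectly good integer.
con.execute("CREATE TABLE people (name TEXT, age INTEGER)")
con.execute("INSERT INTO people VALUES ('alice', -1)")
print(con.execute("SELECT age FROM people").fetchone()[0])  # -1

# A CHECK constraint moves the business rule into the schema, so the
# database itself rejects "valid type, invalid data".
con.execute("CREATE TABLE patrons (name TEXT, age INTEGER CHECK (age >= 18))")
try:
    con.execute("INSERT INTO patrons VALUES ('bob', 14)")
except sqlite3.IntegrityError as e:
    print("rejected:", e)
```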

Again, thank you for your time and attention, these interactions are very valuable.

PS: I'm reminded of a friend complaining that their perfectly valid email address kept getting rejected by a certain bank. The bank was likely using an incomplete regex to validate emails, and it refused their perfectly valid address.


No, I think that people can use SQLite any way they want. I'm glad people find it useful.

I do remain perplexed, though, about how people continue to think that rigid typing helps reliability in a scripting language (like SQL or JSON) where all values are subclasses of a single superclass. I have never seen that in my own practice. I don't know of any objective research that supports the idea that rigid typing is helpful in that context. Maybe I missed something...


> where all values are subclasses of a single superclass

I don't understand this. By values do you mean a row (in database terms)? I don't understand what that has to do with rigid typing.

Lack of rigid typing has two issues, in my opinion: First, when two or more applications have to read data from a single database, lack of an agreed-upon-and-enforced schema is a limitation. Second, when you use generic tools to process data, the tools have no idea what type of data to expect in a column, if they can't rely on the table schema.


First off, I am so glad the famous "HN conjure" actually worked! My "if Dr. Hipp was reading this thread" was tongue in cheek because on HN it was extremely likely that's precisely what would happen. Thank you for chiming in, Dr. Hipp - this is why I love HN!

So, in case you missed it, you're responding to Dr. Hipp himself :)

> I don't understand what that has to do with rigid typing.

Now I would like to learn a bit from Dr. Hipp himself, so here's my take on it:

Scripting languages (like my fav, Python) have duck or dynamic typing (a variation of what I believe you, Dr. Hipp, specifically call manifest typing). Dr. Hipp's take is that the datatype of a value is associated with the value itself, not with the container that holds it (the "column"). (I must say I chose the word "container" here to jibe with Dr. Hipp's "manifest". Curious whether he chose that word for typing for the same reason!)

- In Python, everything is fundamentally a `PyObject`.

- In SQLite, every piece of data is (or was?) stored internally as a `sqlite3_value` struct.

As a result, a stack that uses Python and SQLite is extremely dynamic and, if implemented correctly, is type-agnostic: it doesn't actually care. The only time it blows up is when a consumer has a bug and fails to account for a value's type.
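You can see this "type travels with the value, not the column" behavior directly from Python (a minimal sketch; the table name `t` is made up):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (x)")  # no declared type at all
con.executemany("INSERT INTO t VALUES (?)", [(1,), (2.5,), ("three",), (None,)])

# Each *value* carries its own datatype; the column is just a container.
types = [row[0] for row in con.execute("SELECT typeof(x) FROM t")]
print(types)  # ['integer', 'real', 'text', 'null']
```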

Hence, because this possibility exists, and because no objective research has shown that strict typing improves reliability in scripting environments, it's entirely possible our love for strict types is just mental gymnastics for problems that could have been addressed, equally well, without strict typing.

I could reattempt the "HN conjure" on Wes McKinney and ask whether similar reasoning led him to compromise toward dynamic typing in the pandas 1.x DataFrame (NumPy enforces static typing), because, as both he and Dr. Hipp are likely to say, real datasets of significant size rarely have all "valid" data. That design is precisely what lets pandas handle invalid and missing fields (even if it affects performance).

A good dynamic design should work with both ("valid" and "invalid") present. For example: layer additional "views" on top of the "real life" database that enforce your business rules while you still get to keep all the real world, messy data.
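One hedged sketch of that "views on top of messy data" idea (the `people_raw` table and the `adults` business rule are invented for illustration): the raw table keeps every real-world entry, and the view enforces the rule for consumers that want clean data.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE people_raw (name TEXT, age)")  # messy real-world data
con.executemany("INSERT INTO people_raw VALUES (?, ?)",
                [("alice", 30), ("bob", -1), ("carol", "unknown")])

# The view enforces the business rule; the raw table keeps all the mess.
con.execute("""
    CREATE VIEW adults AS
    SELECT name, age FROM people_raw
    WHERE typeof(age) = 'integer' AND age >= 18
""")
print(con.execute("SELECT name FROM adults").fetchall())  # [('alice',)]
```

Nothing is `/dev/null`ed: `people_raw` still holds bob's `-1` and carol's `"unknown"` for auditing or later repair.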

OTOH, if you don't like that design and absolutely need strict types, use Rust/C++/PostgreSQL/Arrow, etc. They are built from the ground up on strict typing.

With this in mind, if you still want to delve into the "Lack of rigid typing has two issues" portion, I am very happy to engage (and hope Dr. Hipp addresses it so I learn and improve!)

The real world is noisy and has surprises in store for us, and as much as engineers like us would like to say we understand it, we don't! So instead of being so cocksure about things, we should be humble, acknowledge our ignorance, and build resilient, well-engineered software.

Again, Dr. Hipp, Thank you for chiming in and I would be much obliged to learn more from you.


Thank you for the great explanation. But SQL isn't as dynamically typed as you suggest. If a column is defined as DECIMAL(8, 2), it would be surprising for some values in that column to be strings. RDBMSs are expected to provide data integrity guarantees, and one of those guarantees is that only values matching the declared column type can be stored.

Relaxing that guarantee has benefits. For example, it can make application evolution easier--being able to store strings in a column originally intended for numbers is convenient. But that convenience can become a liability when multiple applications read from and write to the same database. In those cases, you want applications to adhere to a shared schema contract, and the RDBMS is typically expected to enforce that contract.

It also creates problems for generic tools such as reporting systems, which rely on stable data types--for example, to determine whether a column can be aggregated or how it should be formatted for display.
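For anyone who hasn't seen the surprise firsthand, here is a small demonstration of the relaxed guarantee being discussed (assuming a default, non-STRICT table): SQLite's type affinity tries to coerce the value, but when coercion fails it stores the value as-is, with no error.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE flex (n INTEGER)")
# Affinity would convert '42' to the integer 42, but a string that cannot
# be coerced is simply stored as text -- no error is raised.
con.execute("INSERT INTO flex VALUES ('not a number')")
print(con.execute("SELECT typeof(n) FROM flex").fetchone()[0])  # 'text'
```

Worth noting: SQLite 3.37 added `STRICT` tables (`CREATE TABLE ... STRICT`), which restore the rigid guarantee and reject that insert, so applications that want the shared-schema contract can opt into it.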


Flexible typing works really well with JSON, which is also flexibly typed. Are you familiar with the ->> operator that extracts a value from JSON object or array? If jjj is a column that holds a JSON object, then jjj->>'xyz' is the value of the "xyz" field of that object.

I copied the idea for the ->> operator from PostgreSQL. But in PostgreSQL, the ->> operator always returns a text rendering of the value from the JSON, even if the value is really an integer or floating point number. PG is rigidly typed, so that's all it can do. But SQLite is flexibly typed, so the ->> operator can return anything - text, integer, floating-point, NULL - whatever value it finds in the JSON.


Minor correction: SQLite is not closed to contributions. It just has an unusually high bar to accepting contributions. The project does not commonly accept pull requests from random passers-by on the internet. But SQLite does accept outside contributed code from time to time. Key gates include that paperwork is in place to verify that the contributed code is in the public domain and that the code meets certain high quality standards.

I was about to try to make this point: there have always been projects that attract more potential contributors than there are competent contributors.

And there have always been techniques for identifying quality contributions from new contributors.


Thank you for the correction, I should have said "not open contribution" rather than "closed contribution".

As long as you are not using indexes on expressions where the expression value is a floating point number that is computed using one or more text->binary conversions, then you should be fine.


You do not recall correctly. There is more than 500K SLOC of test code in the public source tree. If you "make releasetest" from the public source tarball on Linux, it runs more than 15 million test cases.

It is true that the half-million lines of test code found in the public source tree are not the entirety of the SQLite test suite. There are other parts that are not open-source. But the part that is public is a big chunk of the total.


Out of curiosity, why aren't all tests open source?


One set of proprietary tests is used in their specialist testing service, which is a paid service.


What is that service used for besides SQLite?


It's still SQLite, they just need to make money: https://sqlite.org/prosupport.html

Edit: also this:

> TH3 Testing Support. The TH3 test harness is an aviation-grade test suite for SQLite. SQLite developers can run TH3 on specialized hardware and/or using specialized compile-time options, according to customer specification, either remotely or on customer premises. Pricing for this services is on a case-by-case basis depending on requirements.


That's interesting. Here is more information https://sqlite.org/th3.html

The roots of SQLite are in defence industry projects for the US Navy and General Dynamics. It seems like TH3 might be of interest for those sorts of users.


One could assume also for Fossil.


The Common Table Expression feature of SQL is very good at walking graphs. See, for example <https://sqlite.org/lang_with.html#queries_against_a_graph>.
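As a minimal sketch of that page's idea (the `edge` table and node numbers here are invented), a recursive CTE walks everything reachable from a starting node:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE edge (src, dst)")
con.executemany("INSERT INTO edge VALUES (?, ?)",
                [(1, 2), (2, 3), (2, 4), (5, 6)])

# Recursive CTE: the set of nodes reachable from node 1.
# UNION (not UNION ALL) de-duplicates, so cycles cannot loop forever.
reachable = [row[0] for row in con.execute("""
    WITH RECURSIVE reach(node) AS (
        VALUES (1)
        UNION
        SELECT dst FROM edge JOIN reach ON edge.src = reach.node
    )
    SELECT node FROM reach ORDER BY node
""")]
print(reachable)  # [1, 2, 3, 4]
```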


If an I/O error happens with read()/write(), you get back an error code, which SQLite can deal with and pass back up to the application, perhaps accompanied by a reasonable error message. But if you get an I/O error with mmap, you get a signal. SQLite itself ought not be setting signal handlers, as that is the domain of the application and SQLite is just a lowly library. And even if SQLite could set signal handlers, it would be difficult to associate a signal with a particular I/O operation. So there isn't a good way to deal with I/O errors when using mmap(). With mmap(), you just have to assume that the filesystem/mass-storage works flawlessly and never runs out of space.

SQLite can use mmap(). That is a tested and supported capability. But we don't advocate it because of the inability to precisely identify I/O errors and report them back up into the application.


Thanks for the response. I am more worried about losing already committed data due to an error.

https://www.sqlite.org/mmap.html

> The operating system must have a unified buffer cache in order for the memory-mapped I/O extension to work correctly, especially in situations where two processes are accessing the same database file and one process is using memory-mapped I/O while the other is not. Not all operating systems have a unified buffer cache. In some operating systems that claim to have a unified buffer cache, the implementation is buggy and can lead to corrupt databases.

What are those OSes with buggy unified buffer caches? More importantly, is there a list of platforms where the use of mmap in sqlite can lead to data loss?


That depends on the query. SQLite tries to use LIMIT to restrict the amount of reading that it does. It is often successful at that. But some queries, by their very nature, logically require reading the whole input in order to compute the correct answer, regardless of whether or not there is a LIMIT clause.
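You can observe both behaviors from Python by registering a probe function that counts how many rows SQLite actually visits (the `seen` function and the `t` table are made up for this sketch):

```python
import sqlite3

calls = {"n": 0}

def seen(x):
    """Probe: count every row SQLite evaluates this on."""
    calls["n"] += 1
    return x

con = sqlite3.connect(":memory:")
con.create_function("seen", 1, seen)
con.execute("CREATE TABLE t (x)")
con.executemany("INSERT INTO t VALUES (?)", [(i,) for i in range(100)])

# LIMIT lets SQLite stop scanning as soon as 3 rows are produced...
calls["n"] = 0
con.execute("SELECT x FROM t WHERE seen(x) >= 0 LIMIT 3").fetchall()
print(calls["n"])  # 3

# ...but an aggregate logically requires every row, LIMIT or not.
calls["n"] = 0
con.execute("SELECT avg(x) FROM t WHERE seen(x) >= 0 LIMIT 3").fetchall()
print(calls["n"])  # 100
```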


No, Simon, we don't "refuse". We are just very selective and there is a lot of paperwork involved to confirm the contribution is in the public domain and does not contaminate the SQLite core with licensed code. Please put the false narrative that "SQLite refuses outside contributions" to rest. The bar is high to get there, but the SQLite code base does contain contributed code.


Dr. Hipp, I love SQLite but also had simonw's misapprehension that the project did not accept contributions. The SQLite copyright page [0] says:

> Contributed Code

> In order to keep SQLite completely free and unencumbered by copyright, the project does not accept patches. If you would like to suggest a change and you include a patch as a proof-of-concept, that would be great. However, please do not be offended if we rewrite your patch from scratch.

I realize that the section, "Open-Source, not Open-Contribution" says that the project accepts contributions, but I'm having trouble understanding how that section and the "Contributed Code" section can both be accurate. Is there a distinction between accepting a "patch" vs. accepting a "contribution?"

If you're planning to update this page to reduce confusion about the contribution policy, I humbly suggest a rewrite of this sentence to eliminate the single and double negatives, which make it harder to understand:

> In order to keep SQLite in the public domain and ensure that the code does not become contaminated with proprietary or licensed content, the project does not accept patches from people who have not submitted an affidavit dedicating their contribution into the public domain.

Could be rewritten as:

> In order to keep SQLite in the public domain and prevent contamination of the code from proprietary or licensed content, the project only accepts patches from people who have submitted an affidavit dedicating their contribution into the public domain.

[0] https://sqlite.org/copyright.html


Yes, that "does not accept patches" line must have been where I picked up my incorrect mental model.


Website updated with more precise wording. Sorry for the confusion.


Thanks for the correction, and sorry for getting that wrong. I genuinely didn't know that.

Found that paperwork here: https://www.sqlite.org/copyright-release.html

I will make sure not to spread that misinformation further in the future!

Update: I had a look in fossil and counted 38 contributors:

  brew install fossil
  fossil clone https://www.sqlite.org/src sqlite.fossil
  fossil sql -R sqlite.fossil "
    SELECT user, COUNT(*) as commits
    FROM event WHERE type='ci'
    GROUP BY user ORDER BY commits DESC
  "
Blogged about this (since it feels important to help spread the correction about this): https://simonwillison.net/2025/Dec/29/copyright-release/


Man, your behavior when you realize you got something wrong is something the rest of us can aspire to. This is one of the things I like the best about you.


It's a devious credibility hack!

I learned it from newspapers: papers that publish prompt and clear corrections when they publish mistakes are more credible than papers that don't acknowledge their errors.

