I love that sqlite article. It seems like "everyone" is certain that sqlite can only be used for up to a single query per second, anything more and you need to spin up a triple sharded postgres or Hadoop cluster because it 'needs to scale'.
I love being able to point to that study: if you properly architect your SQLite system and are willing to purchase hardware, you can go a long, long way, much further than almost all companies ever go, with your data access code needing nothing more than the equivalent of System.Data.Sqlite
SQLite is incredible. If you are struggling to beat the "one query per second" meme, try the following 2 things:
1. Only use a single connection for all access. Open the database one time at startup. SQLite operates in serialized mode by default, so the only time you need to lock is when you are trying to obtain the LastInsertRowId or perform explicit transactions across multiple rows. Trying to use the one connection per query approach with SQLite is going to end very badly.
2. Execute one time against a fresh DB: PRAGMA journal_mode=WAL
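Both points together look something like this minimal sketch, using Python's stdlib sqlite3 purely for illustration (the file name, table and lock discipline are made up; the same shape applies to System.Data.Sqlite or any other binding):

    import sqlite3
    import threading

    # One connection, opened once at startup and shared by the whole process.
    # isolation_level=None puts the module in autocommit mode, so transactions
    # only open when you explicitly BEGIN.
    conn = sqlite3.connect("app.db", check_same_thread=False, isolation_level=None)

    # Harmless to run at every startup; the WAL setting is persistent per database file.
    conn.execute("PRAGMA journal_mode=WAL")

    # Serialize multi-statement operations (e.g. insert + last row id) yourself.
    write_lock = threading.Lock()

    def add_order(customer_id):
        with write_lock:
            cur = conn.execute("INSERT INTO orders (customer_id) VALUES (?)", (customer_id,))
            return cur.lastrowid   # safe: nothing else wrote between the INSERT and this read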
If at this point you are still finding that SQLite is not as fast as or faster than SQL Server, MySQL, et al., then I would be very surprised.
I do not think you can persist a row to disk faster with any other traditional SQL technology. SQLite has the lowest access latency that I am aware of. Something about it living in the same process as your business application seems to help a lot.
We support hundreds of simultaneous users in production with 1-10 megs of business state tracked per user in a single SQLite database. It runs fantastically.
You've just moved a bunch of problems to whatever is accessing SQLite. How do you scale out the application server compute if it needs to make transactionally conditional updates?
E.g.:
transaction {
  if (expensiveFunction(query()))
    update();
}
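With a single shared SQLite connection, that pattern boils down to something like the sketch below (Python's sqlite3 for illustration; expensive_function and the table are hypothetical). The write lock is held for the entire read-compute-write sequence, which is exactly where scaling out the compute gets awkward:

    import sqlite3

    conn = sqlite3.connect("app.db", isolation_level=None)   # autocommit; transactions managed explicitly

    def expensive_function(rows):
        return len(rows) > 0          # stand-in for whatever the app server computes

    conn.execute("BEGIN IMMEDIATE")   # take the write lock before reading
    rows = conn.execute("SELECT * FROM work_items WHERE done = 0").fetchall()
    if expensive_function(rows):
        conn.execute("UPDATE work_items SET done = 1 WHERE done = 0")
    conn.execute("COMMIT")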
(My applications always get much faster when I move them from SQLite to Postgres. But then I do run the odd sort and GROUP BY, not for OLAP, but because they express the computation I want, and Postgres is simply much better at that.)
We aren't running any reports on our databases like this. I would argue it is a bad practice in general to mix OLTP and OLAP workloads on a single database instance, regardless of the specific technology involved.
If we wanted to run an aggregate that could potentially impact live transactions, we would just copy the SQLite db to another server and perform the analysis there. We have some telemetry services which operate in this fashion. They go out to all of the SQLite databases, make a copy and then run analysis in another process (or on another machine).
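That flow is only a few lines; here is a sketch with Python's stdlib sqlite3 (file and table names are invented), using the online backup API so the live database never stops serving writes:

    import sqlite3

    live = sqlite3.connect("business.db")
    snapshot = sqlite3.connect("business-copy.db")
    live.backup(snapshot)       # consistent point-in-time copy of the whole database
    snapshot.close()

    # Ship business-copy.db to another process or machine and run the heavy
    # aggregates there without touching the production file.
    reporting = sqlite3.connect("business-copy.db")
    print(reporting.execute("SELECT count(*) FROM events").fetchone())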
I am not aware of any hosted SQL technology which is capable of magically interleaving large aggregate queries with live transactions and not having one or both impacted in some way. At the end of the day, you still have to go to disk on writes, and this must be serialized against reads for basic consistency reasons. After a certain point, this is kinda like trying to beat basic information theory with ever-more-complex compression schemes. I'd rather just accept the fundamental truth of the hardware/OS and have the least amount of overhead possible when engaging with it.
> At the end of the day, you still have to go to disk on writes, and this must be serialized against reads for basic consistency reasons.
No, absolutely not.
That's why modern databases use a thing called multi-version concurrency control (MVCC). You can run (multiple) queries on the same table that is updated by multiple transactions at the same time without one blocking the others (assuming the write transactions don't block each other). Of course they are fighting for I/O, but there is no need to serialize anything.
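The reader-keeps-a-stable-snapshot behaviour is easy to demonstrate; the sketch below uses SQLite's WAL mode as a self-contained stand-in for a full MVCC engine (file and table names are invented), since the effect is the same: the open reader is never blocked and keeps its snapshot while the writer commits.

    import sqlite3

    path = "mvcc-demo.db"
    setup = sqlite3.connect(path, isolation_level=None)
    setup.execute("PRAGMA journal_mode=WAL")
    setup.execute("CREATE TABLE IF NOT EXISTS t (v INTEGER)")
    setup.execute("DELETE FROM t")
    setup.execute("INSERT INTO t VALUES (1)")

    reader = sqlite3.connect(path, isolation_level=None)
    writer = sqlite3.connect(path, isolation_level=None)

    reader.execute("BEGIN")
    print(reader.execute("SELECT v FROM t").fetchall())   # [(1,)]: snapshot taken here

    writer.execute("UPDATE t SET v = 2")                  # not blocked by the open reader

    print(reader.execute("SELECT v FROM t").fetchall())   # still [(1,)]: reader keeps its snapshot
    reader.execute("COMMIT")
    print(reader.execute("SELECT v FROM t").fetchall())   # [(2,)]: a new read sees the committed value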
Mixing OLTP and OLAP is becoming increasingly "normal" these days as the capabilities of the database products and the hardware improve. With modern high-end hardware (hundreds of CPUs, lots of SSDs, large RAM) this actually scales quite nicely.
OLTP databases are optimized for mutable data. OLAP databases are optimized for immutable data. There's a big difference between appropriate data structures for each use case that has little to do with hardware capabilities.
OLAP databases tend to write columns in large blocks and apply sort orders to improve compression. This type of structure works well if you write the data once and read it many times. It's horrendously inefficient for concurrent updates to things like user session contexts. (Or even reading them for that matter.) You are better off using a row store with ACID transactions and relatively small pages.
The dichotomy has been visible for decades and shows no sign of disappearing, because the difference is mostly how you arrange and access data, not so much the hardware used.
"serialized" here doesn't really mean processed in serial, it means "serializable" in the context of database information theory. Databases have special concurrency control requirements in order to create hard guarantees on database consistency. You can process queries in parallel and still have a serializable result, because of transaction coordination. Doing this on one server is much easier than doing this across a cluster of servers.
So in your case, MVCC is what you're talking about, which is not the same level of consistency guarantee as serializable; rather, it is based on snapshot isolation. Some database vendors consider them effectively the same isolation level because the anomalies associated with other common non-serializable isolation levels aren't typically present in most MVCC implementations, but there's a lot more complexity here than you are acknowledging (write skew being the classic anomaly that snapshot isolation permits but serializability forbids).
Mixing OLTP and OLAP workloads on the same database is pretty much always a bad idea. This is why it's common practice to use ETL jobs to move data from an OLTP optimized database like Postgres or MySQL to a separate database for OLAP (which could be another MySQL or PG instance, or could be something like ClickHouse or another columnar database optimized for OLAP). Just because you /can/ do something, doesn't mean you /should/ do something...
There are MVCC systems with serializable, strictly serializable, and even externally consistent transactions. FoundationDB and Spanner are both externally consistent (with geo-replication in Spanner’s case). CockroachDB is serializable, though not strictly serializable. Single-master Postgres can do serializable transactions as well.
I'd be very interested in you providing an example of an MVCC system which is fully serializable. Conventional wisdom is that, while possible, it is prohibitively expensive to ensure a snapshot isolation system like MVCC is fully serializable, and it is explicitly most expensive for analytics/OLAP workloads because you must keep track of the read set of every transaction. It is possible for such a system to exist, then, but it would cost such a performance penalty as to be a painfully bad choice for a workload where OLAP and OLTP would be mixed, bringing us back to the point I originally made.
In most cases, snapshot isolation is sufficient, and some database vendors even conflate snapshot isolation with serializable, but they're not the same thing. I'd be hesitant to believe any vendor claims that they implement serializable MVCC without validating it via testing. As we've been shown by Jepsen, database vendors make many claims, some of which are unsubstantiated. Spanner is very cool technology, however I have personally heard some very interesting claims from folks on the Spanner team that would violate the laws of physics, so again, without a demonstration of their claims, I'd take them with a grain of salt.
Both the FoundationDB docs and Spanner whitepapers are very clear that their definitions of strict serializability match with the conventional academic one. FoundationDB does keep track of the read set of every transaction: there are nodes dedicated to doing that in their architecture. You can even add things to the read set of each transaction without actually reading them to get additional guarantees about ordering (e.g. to implement table-level locks). FoundationDB (and obviously Spanner as well) don’t have any massive penalties for this; FoundationDB clusters can handle millions of operations per second with low single digit read and commit latencies: https://apple.github.io/foundationdb/performance.html.
If your condition for believing these claims is approval from Jepsen (i.e. Kyle Kingsbury), he apparently didn’t bother testing FoundationDB because their test suite is “waaaay more rigorous”: https://twitter.com/aphyr/status/405017101804396546. In particular, their test suite is able to produce reproducible tests of any variation in messaging (dropped messages, reordered messages) across their single-threaded nodes, which is extremely useful in narrowing down places where serialization violations can hide. He also seems to believe Spanner’s claims: https://www.youtube.com/watch?v=w_zYYF3-iSo
I’m not sure where this “conventional wisdom” about serializability having an unavoidable large performance hit is coming from; the databases I mentioned are very well known in the distributed systems field and not some random vendors making outlandish claims.
There's an enormous leap between something which is slow on SQLite and something which requires ETL into a data warehouse or similar tech (columnar store, etc.).
I mean, at least three orders of magnitude, minimum.
It's just a ludicrous argument. SQLite is fine as a file format, and in very specific dumb CRUD scenarios it's just about OK. But it's not worth sticking with if you need to do anything interesting over any real volume of data, and that point comes far, far below what would warrant moving from an RDBMS to a different DB tech.
Ironically, you're more representative of the "we'll need Hadoop" crowd.
You dismiss SQLite on unsuitability-to-OLAP grounds and dive straight to append-only, ETL data pipeline approaches, when many aggregate and analytic functions can be performed perfectly acceptably on RDBMSes like Postgres (and even MySQL, with care).
> Mixing OLTP and OLAP becomes increasingly "normal"
Just because it's "normal" doesn't mean it's correct. Just because you can doesn't mean you should.
All hail bob1029:
> We aren't running any reports on our databases like this. I would argue it is a bad practice in general to mix OLTP and OLAP workloads on a single database instance, regardless of the specific technology involved.
I think this is specifically because random reads are scaling much better than writes, even though you won't see it on standard "that many MB/s" benchmarks where read is 'just' a few multiples of the write performance.
Persisting a transaction to the database is still (and especially in MVCC): "send the data write", "wait for the write to be flushed", "toggle the metadata bit to mark the write as completed", "wait for that bit to be flushed". That still serialises transaction commits, while reads can complete in parallel as fast as the device can handle.
Especially now that the reads and writes don't have to share the disk head, it makes sense for random reads to keep on scaling better than writes.
This is actually the opposite of current limitations. Stick a capacitor and some DRAM on the NVMe and you can "instantly" flush to disk, but there's no way to anticipate where the next read will come from and therefore no way to accelerate it.
You'll see modern NVMe disks with sustained writes greatly outpacing reads until the write cache is saturated, at which point what you say is true and reads will greatly outpace writes. But you don't want your disks to ever be in that threshold.
I think we're seeing the difference between someone who is a programmer and someone who plays a programmer at work. :) Arguing that some modern, complex feature is "better than" a simpler system is crazy talk. Any time I can simplify a system vs. add complexity I will go simplicity. Adding in plug-ins and features goes a long way toward vendor lock-in that prevents your org from being agile when you have to swap systems because you run into limits or have catastrophic failures.
> I am not aware of any hosted SQL technology which is capable of magically interleaving large aggregate queries with live transactions and not having one or both impacted in some way. At the end of the day, you still have to go to disk on writes, and this must be serialized against reads for basic consistency reasons.
I do sometimes wonder if dirty reads are what the business folks actually want.
Not necessarily unconstrained dirty reads. But if it were possible to say, "The statistics in your reports may only be accurate to (say) +/- x%," would that be good enough?
Going really philosophical, might they even make better decisions if they had less precision to work with? There are certainly plenty of studies that suggest that that's basically how it works when people are managing their investments.
My experience is that customers/executives will accept latency but not inaccuracy. In practice this may be the same thing, it just depends how you position it. “Reports are accurate but may be up to 5 minutes out of date” is a very easy sell to a corporate worker who logs in to check a dashboard once a month.
Primary/replica is probably the correct way to solve this. In some places, I have also shunted writes through SQS queues, which in practice protects us from a locking operation in one place impacting other operations in a customer-facing way. I don’t think this is strictly necessary but it is a nice technical guard against the sociological problem of a contractor merging code like that. They don’t feel the pain of locks gone bad because they (per contract) can’t be on call.
You're absolutely right. But I also find that what customers/executives actually want, and what they say they want, turn out to be different things once you start unpacking them.
And sometimes it's just a matter of framing. Don't say accurate, say, "Accurate to within x%" or "Rounded to the nearest $x" type of thing. But I certainly would never actually pick an argument over it. Sometimes they do know what they want. Other times they really don't, but you still don't get to decide for yourself how the problem is going to be solved.
We effectively do dirty reads when we just go around and copy the databases for purposes of offline analysis. Our use cases for aggregation queries are all tolerant to incomplete recent data. Most of the time we are running analysis 24 hours after whatever event, so we never get into a situation where we can't answer important business questions because data hasn't been flushed to disk yet.
The fact that most of the stuff we care about is time-domain & causal means that we can typically leverage this basic ideology. Very rarely does a time-series aggregate query need to be consistent with the OLTP workloads in order to satisfy a business question.
I can assure you that I live on the same planet as everyone else posting here.
Whether or not I could perform this miracle depends entirely on your specific use cases. Many people who have this sort of reaction are coming from a place where there is heavy use of the vendor lock-in features such as SSIS and stored procedures.
If you are ultimately just trying to get structured business data to/from disk in a consistent manner and are seeking the lowest latency and highest throughput per request, then SQLite might be what you are looking for.
The specific core counts or other specifications are meaningless. SQLite scales perfectly on a single box, and if you have some good engineers you might even be able to build a clustering protocol at the application layer in order to tie multiple together. At a certain point, writing your own will get cheaper than paying Microsoft for the privilege of using SQL Server.
This is a great answer. The details REALLY matter. One of my best early tech success stories was rewriting a SQL query that took 27 hours into one that took ~5 seconds. This was running on a very large Oracle cluster. They had poured more and more money into hardware and licensing trying to solve this. In the end, it was a matter of turning a cursor-based query into a set-based query.
In some respects, I think the constraints of something like SQLite can focus people's attention on making things work properly rather than throwing hardware at the problem.
I can think of a couple of places I've worked where they had simple problems that could have been solved by some thinking and coding but instead were solved* by more expensive hardware.
This is precisely my favorite part of SQLite. The constraints (aka lack of features) are what make it so compelling. We experienced some serious revelations going down this path. The biggest thing from a devops perspective is that you can push very consolidated releases. There is no need to install anything for SQLite to function.
For instance, we use .NET Core Self-Contained Deployments combined with SQLite. As a result, we ship a zip file containing the dotnet build artifacts to a blank windows or linux host and have a working application node within a matter of seconds. The databases will be created and migrated automatically by our application logic, and the host is automatically configured using OS interop mechanisms.
So, when you really look at it, the constraints imposed upon us by SQLite encouraged us to completely sidestep the containerization game. Our "container" is just a single platform-specific binary path that has all of its dependencies & data contained within.
Without SQLite, we would have to have some additional process for every environment that we stand up. This is where the container game starts to come in, and I feel like it's a band-aid to a problem that could have been avoided so much further down in the stack (aka SQLite vs MySQL/SQLServer/Postgres). Sure, there are applications where you absolutely must use a hosted solution for one reason or another, but for many (most) others where you do not, it's really the only thing stopping you from having a single process/binary application distribution that is absolutely trivial to install and troubleshoot. You literally just zip up the prod bin path and load it on a developer workstation to review the damage. 100% of the information you need will be there every time. No trying to log into a SQL server or losing track of which trace folder is for what issue # and SQL dump. It keeps things very well organized at all levels. We can just launch the production app with --console to see its exact state at the time of the copy operation and then attach debuggers, etc.
The last time I reduced database server load by 70% by just spending a week tuning indexing strategies and ill-performing queries, nobody thanked me. Sure, a new database server costs way more than a week of my time, but that is completely beside the point.
The point is that I robbed someone of the chance to buy a shiny new computer.
> The last time I reduced database server load by 70% [...] nobody thanked me.
It's crazy. I could maybe understand if there's a time crunch where it's quicker and easier to get more hardware in order to make a sale that'll keep the company alive (which I have experienced once) but that's maybe 1% of the cases.
Anyway, in lieu of their gratitude, I offer my thanks because I appreciate the effort.
That's a fantastic point. Most people in technology these days look at a problem and immediately think things like "more hardware" or "cloud deployment" all to get scalability. Scalability can come in forms other than throwing lots of money at an issue... Oftentimes, money can be saved if one throws more intelligence at the problem. :)
Is SQLite likely to be faster than Postgres? In terms of ease of use / admin overhead I consider them mostly equivalent. I thought the main problem with SQLite was that it was slow with concurrent writers, whereas the "bigger" SQL databases have code that allows concurrent writes.
The issue with SQLite and concurrent writers isn't that it's slow, it's that it just can't do it. WAL mode lets you have as many readers as you want concurrent with a single writer, but it doesn't give you multiple concurrent writers. If you really need concurrent writes, use PostgreSQL or another RDBMS.
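You can see the single-writer limitation directly with a small sketch using Python's sqlite3 (file name invented); the second writer here is configured to give up immediately instead of waiting, while readers stay unaffected throughout thanks to WAL:

    import sqlite3

    a = sqlite3.connect("writers.db", isolation_level=None)
    b = sqlite3.connect("writers.db", isolation_level=None, timeout=0)  # don't wait on a busy lock
    a.execute("PRAGMA journal_mode=WAL")
    a.execute("CREATE TABLE IF NOT EXISTS t (v INTEGER)")

    a.execute("BEGIN IMMEDIATE")            # first writer takes the write lock
    a.execute("INSERT INTO t VALUES (1)")
    try:
        b.execute("BEGIN IMMEDIATE")        # second concurrent writer...
    except sqlite3.OperationalError as err:
        print(err)                          # ...fails: "database is locked"
    a.execute("COMMIT")

    b.execute("BEGIN IMMEDIATE")            # fine once the first writer has committed
    b.execute("INSERT INTO t VALUES (2)")
    b.execute("COMMIT")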
In my experience, SQLite is likely to be faster when you have lots of reading. Being in-process gives SQLite a natural advantage in read-heavy situations.
"Write transactions are very fast since they only involve writing the content once (versus twice for rollback-journal transactions) and because the writes are all sequential. Further, syncing the content to the disk is not required, as long as the application is willing to sacrifice durability following a power loss or hard reboot."
I think there are a lot of places where SQLite can outperform Postgres: read-heavy and latency-critical apps, for example, where the additional network hop is costly.
> come to my company and replace our 96-core SqlServer boxes with SQLite I'll pay you any salary you ask for.
I also had a server with 96 cores, until we realized a developer had inadvertently made a query happen every time a scroll event fired... it was a nice chunk of change saved.
Did you look at the article in question? Expensify uses (built on) SQLite as the RDBMS for their application.
What people are conveniently leaving out is they wrote a serious wrapper around it that makes it very similar to other conventional large scale systems like MSSQL or MySQL: https://bedrockdb.com/
How do you handle access from multiple worker processes? Some languages/frameworks have poor multi-threading performance and must be deployed in a multi-process setup (e.g. Python webapps). Or is that not a good fit for SQLite?
Multiprocess can work, but you may want to develop an intermediary process that is exclusive owner of the database for performance reasons and then delegate to it. I.e. you could have:
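one process that is the exclusive owner of the SQLite connection, and worker processes that talk to it over a small JSON-per-line protocol. The sketch below is purely illustrative (the port, op names, JSON fields and table are invented, not a description of an actual setup):

    import json, sqlite3, socketserver

    conn = sqlite3.connect("business.db", isolation_level=None)   # sole owner of the file
    conn.execute("PRAGMA journal_mode=WAL")

    class Handler(socketserver.StreamRequestHandler):
        def handle(self):
            # request:  {"op": "get_user", "args": {"id": 42}}
            # response: {"ok": true, "result": [...]}
            req = json.loads(self.rfile.readline())
            if req["op"] == "get_user":
                # the 'users' table is assumed to already exist in business.db
                row = conn.execute("SELECT name, state FROM users WHERE id = ?",
                                   (req["args"]["id"],)).fetchone()
                resp = {"ok": True, "result": row}
            else:
                resp = {"ok": False, "error": "unknown op"}
            self.wfile.write((json.dumps(resp) + "\n").encode())

    # Single-threaded server: requests from all worker processes are serialized here.
    socketserver.TCPServer(("127.0.0.1", 8077), Handler).serve_forever()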
This exercise would also encourage development of a concise and generic schema for storing your business data (presumably because changes to the above JSON contract would be time consuming).
Where are these databases persisted? HA (multi-process)? Failover? Backup? How do you handle containerisation? Some details would be seriously welcome.
To me, this is the most important line in the article:
> SQLite scales almost perfectly for parallel read performance (with a little work)
They aren't using stock SQLite, they're using SQLite wrapped in Bedrock[1], and their use case is primarily read-only.
SQLite is fantastic at read-only, or read-mostly, use cases. You start to run into trouble when you want to do concurrent writes, however. I tried to use SQLite as the backend of a service a couple of years ago, and it locked up at somewhere around tens of writes per second.
(a) built their own transaction/caching/replication layer using Blockchain no less.
(b) paid SQLite team to add a number of custom modifications.
(c) used expensive, custom, non-ephemeral hardware.
Now, you could do all of this, or just use an off-the-shelf database that you don't have to write custom code to use; and if you choose a distributed one, e.g. Cassandra, it will be able to run on cheap, ephemeral hardware.
(a) They implemented a very boring transaction/caching/replication layer that is like any other DB except they borrowed the idea that "longest chain" should be used for conflict resolution.
(b) They worked with upstream to get a few patches that were unique to their use-case. Once you're in deep with any DB this really isn't that uncommon.
(c) They used a dedicated (lol non-ephemeral) white-box server that has a lower amortized cost than EC2.
(d) Bedrock isn't bound to the hardware. You could run it on EC2 and reap the benefits just the same except you'd pay more.
>They implemented a very boring transaction/caching/replication layer that is like any other DB except they borrowed the idea that "longest chain" should be used for conflict resolution.
Handwaving this layer away as "very boring" isn't exactly fair, either. What does boring even mean here?
I mean, this layer solves problems that are both essential to performance scaling of RDBMS and have been proven time and again to be hard to reliably solve in a general case. And it has furthermore been built from the ground up tailored towards the specific needs/use cases of the company.
By the aforementioned handwaving the presented successes are implicitly attributed to SQLite to a degree that isn't justified IMO.
I don't know where they have it, but colocation isn't exactly free.
1U of colo with 100 Mbps in a cheap Eastern European DC will cost a few hundred euro per month... and my info is a few years old. It's more expensive now.
It is a very specific case that we should not extrapolate into a general use case.
AWS/GCP/Azure are still better places to start for most people.
As a casual user of random software that I test and then immediately forget: I really wish more applications supported SQLite databases. I've set up everything from IRC bots to server automation software to log analysis to forums and WordPress, just to test them out, and the first thing that makes me drop something is a dependency on MySQL, pgsql, etc. If your software can otherwise work out of a single directory but can't work with a local database in the same directory, then I'm not going to use it.
I'd certainly believe sqlite can be taken very far. Never done it so far.
But what does "properly architect your sqlite system" mean and how does this compare to just spinning up a postgres service (nothing sharded or fancy otherwise)?
I've seen so many solutions that could be easily and reliably implemented on a single or small SQL database cluster of various types turn into these complex systems just to avoid the costs of scaling up the RDBMS.
SQLite's resource usage is a lot lower because the database is just a flat file on the OS. Its only real resource is storage.
MySQL is an application that not only requires configuration, tweaking, tuning and tender loving care, but also consumes resources constantly, utilising the processor, memory and storage.
More standard without all of the correctness foot-guns, less to configure and operate, and it’s usually faster. MySQL has to do a lot of work running as a separate process, handling connections, etc. whereas SQLite is just doing file I/O in your current process.
I don’t use SQLite on a server for the simple reason that I’m too lazy to look after it. There are cloud-managed SQL Server or Postgres databases. AWS backs me up by default. Why mess around with SQLite?
Not knocking SQLite - great for desktop apps or maybe local dev environments.
It depends on whether your data is confidential as well... I have been known to send a full SQLite dump to a small app that needed a lot of local data.
You can store it in localStorage and read it from there, making your reload a lot smaller. If it's not in localStorage, you can just request a fresh copy of the SQLite dump from us.
I read this and thought, "oh, the author is calling out formal verification as overhyped? Hillel Wayne (https://hillelwayne.com/) is going to be angry! Wait, who wrote this..."
Overhyped is not the same thing as bad. I program in common lisp, it is by far my favorite environment for developing. I also think just about every blog post and reddit comment by someone who has just discovered lisp greatly overstates its advantages.
I recall that F Scott Fitzgerald said "the test of a first-rate intelligence is the ability to hold two opposed ideas in mind at the same time and still retain the ability to function."
In Hillel's defense, maybe they're not opposed ideas... this article talks about problems with formal verification, but I think his thing is more formal modeling (with TLA+).
I think the best takeaway from this is that the software industry makes lots of claims about development processes, but so little actual research is done in trying to validate those processes. It's all mostly based on opinion.
It's hard to do objective research. Some studies try A/B tests on student volunteers. But then this setting is clearly different to professional teams working on a project for a long time.
The value of new methodologies, languages, and techniques is partly that the enthusiastic proponents of them are given a chance to prove out that there is value, and so become motivated to go the extra distance to achieve the project specific outcome.
This value is destroyed if people are forced to use the technique, instead of championing its introduction. So measurement is made even harder!
That’s the interesting take for me, too: we love to talk about it as engineering or science but there’s a fairly good argument that this is more aspirational than real in many cases.
I reckon the same general phenomenon extends to most any human endeavor on the planet at the macro and micro level (and everything in between), and hardly anyone even notices they are doing it unless someone happens to point it out, which typically (in my experiences) then results in some form of rationalization. It's like (or actually is) humanity is living in a fantasy world at all times, but we're unable to realize it, or even consider it.
It's really hard to point at studies to evaluate these types of hyped development paradigms. Some thoughts, as someone who loves static typing and microservices:
My favorite thing about static typing is that it makes code more self-documenting. The reason I love Go specifically is because if you have 10 people write the same thing in Go, it's all going to come out relatively similar and use mostly built-in packages. Any API requests are going to be self-documenting, because you have to write a struct to decode them into. Any function has clear inputs and outputs; I can hover a parameter in my IDE and know exactly what's going on. You can't just throw errors away; you always are aware of them, and any functions you write should bubble them up.
Typescript addresses this somewhat, but basically offsets that complexity with more configuration files. I like Typescript in use, but I can't stand the fact that Javascript requires configuration files, transpilers, a million dependencies. Same for Python and mypy.
Yes, I could just look at class members in a dynamic language, but there's nothing that formally verifies the shape of data. It's much more annoying to piece apart. I don't use static analyzers, but my guess is that languages like Go and Rust are the most compatible with them. Go programs are the closest thing to a declarative solution to a software problem, of any modern language IMO. As we continue experimenting with GPT-generated programs, I think we're going to see much more success with opinionated languages that have fewer features and more consistency in how finished programs look.
Microservices are also great at making large applications more maintainable, but add additional devops complexity. It's harder to keep track of what's running where and requires some sort of centralized logging for requests and runtime.
You certainly can throw errors away in Go -- in various ways. It's one of the notable flaws in a largely cohesive, sensible language. (Which I use daily.)
I suppose my point, more specifically, is that functions return errors instead of opaquely throwing them. Most languages expect you to know where they're going to occur and catch them as close to their occurrence as possible, in Go you are explicitly carrying them along the whole way. Some people don't like this, I prefer it.
True, if you completely ignore the function's return values, you can throw errors away, but then you wouldn't be using the language in the way that makes it powerful to me; that there are simple and clear interfaces that you interact with.
I used to think this, and then I interviewed people who did both traditional and software engineering professionally, and now I'm not so sure. I did a first draft of what I learned here: https://www.youtube.com/watch?v=3018ABlET1Y
I'm hoping to have a written version by the end of September.
This is great. People romanticize construction, mechanical and other engineering as if there were no failures in those disciplines. Buildings collapse, machines break down in unforeseen circumstances. My pet theory is that in software it is just a lot easier to create a lot of stuff, so it is also a lot easier to create issues.
You can add that in Eastern Europe you can get an engineering degree, a "technical bachelor", from a technical university, so I am a software engineer, as printed on my diploma.
It's not about the failures, it's about the modes of failure. I assume that the modes of failure of a bridge, or of a building, are pretty well understood.
Software has far more distinct pieces than any other product you can find anywhere (maybe the human body?) so it's impossible to completely check the modes of failure. I was just reading before about a hardware corruption bug due to a kernel feature [1] and it's hard to imagine the same chain reaction in other engineering areas.
In software it's also really hard to model behavior. In engineering you'll get tolerances, strength and other features of the pieces you use. In software, you can't even benchmark something and expect the same benchmark to translate to a different computer.
Yes, software is immaterial and thus not constrained by the laws of physics (except the speed of light). It is comparatively easy to change, but also comparatively hard to specify and model in advance.
I'm a mechanical engineer who writes software for mechanical engineers. I find that the work of MechE moves slower due to operational issues. If they could move fast and break things cheaply to get to market faster, they would. All that matters is that the final product is tested and hardened, which is something that software shops mostly do anyway.
Not to mention that things like generative design and process automation are getting us to that point.
I suspect many developers would find software a lot less appealing if it were closer to a traditional engineering discipline (slow, unforgiving, dramatically reduced expressive power and creative potential, etc).
I'm somewhat ADD and get bored easily, so not only do I need to do something more like software but I also have to stay as broad and high-level as possible within the discipline, to stave off ennui. NOTE: very much not arguing this is a good way to go through life.
I don't know, my partner works at a civil engineering firm and the day to day work there sounds pretty similar to what I do as a software dev. Sometimes they have to do complicated calculations and research, but by far most of the work is copying templates and tweaking them as needed.
The hardest part of most projects is taking unrealistic and ever changing client demands and trying to turn them into something that will actually work in reality; a process which is probably all too familiar to many software developers.
I've always considered "engineer" to be more of a personality trait than a formal qualification. Most engineers I've met (whether Software, Bio, Civil or whatever) have a similar mindset to my own, although that might be selection bias at play. There's a sense of curiosity and wanting to understand, not in an academic way, but by virtue of doing.
I think software 'engineering' is uniquely ambiguous in this regard, because software development as a discipline is in equal parts design and construction, and the design part bleeds into the 'construction' part, corrupting it (for want of a better word) in a way you would imagine 'pure' engineering would not.
> "...but I do want to underscore a really important point: almost everything in software is a belief - it is something we have experience about, it is something we have opinions on, but it's not something we have hard data on. In most cases, we just don't know. But we can find out. We find out through..."
This seems applicable to everything, almost, in this whole experiment humans have going on here on planet earth, it just doesn't seem like it. To see it, you have to have (at least) the ability and willingness to look.
Hillel (the editor of this list) is one of the people in this industry that is going to make a tremendous difference to the world. His ability to make formal verification understandable, and therefore useful in practice, is unparalleled.
I think the biggest issue with formal verification is that you need to rewrite the important parts of your code in (for example) TLA+. If it's integrated into the language, like SPARK is for Ada, you don't need to learn as much additional syntax or rewrite parts of your codebase in a language you rarely use (given that you already work in Ada).
Well, you can't rewrite anything in TLA+. It's a formal specification language, not a verification language. So you usually use it to catch spec- or algorithm-level logic and concurrency bugs, then manually write the code to correspond to your TLA+ spec. People get really hung up on this last step, but I can tell you as a professional programmer that implementing code to follow a TLA+ spec is extremely easy - all the intellectual heavy lifting has already been done! - and it avoids the extremely costly effort required to fully formally verify computer code. It's a great cost/benefit ratio.
My biggest issue with formal verification after doing it a couple of times was how absurdly complex the specification needed to be for it to work.
If the spec is 5x more complicated than the code would be then I'm not sure I see much of a point coz you're just creating different spaces for bugs to hide in.
The aim is to have a spec that is much LESS complex than the code, written at a higher level, abstracting away details. If the spec is 5x more complex than the code then indeed there’s no point.
My summary would be: The spec must cover all possible implementations so it is usually larger than the most simple one.
An example from there:
> The authors of SibylFS tried to write down an exact description of the `open` interface. Their annotated version of the POSIX standard is over 3000 words. Not counting basic machinery, it took them over 200 lines to write down the properties of `open` in higher-order logic, and another 70 to give the interactions between open and close.
> For comparison, while it’s difficult to do the accounting for the size of a feature, their model implementation is a mere 40 lines.
How much of that is open/close being a poor abstraction, or overloading a bunch of semi-related functionality, versus a more general consideration for systems designed with verification in mind?
"Scalability! but at what COST?" Is a very good example on how frustrating it can be.
We are throwing a lot of resources against a problem because we are not able to educate people good enough to understand basic performance optimizations.
You are a Data Scientist/anyone else and you don't understand your tooling? You are doing your job wrong.
I wish it would be possible to have better studies for that. I believe that static typing has huge benefits as software scales. I also believe that the type system of TypeScript is actually stronger in practice than the Java or C# one (despite theoretical weaknesses). It has the right tradeoffs (e.g. structural equivalence, being able to type strings, being able to check that all cases are handled, etc.)
It would be nice to have proper studies, but it's difficult to control the other variables...
I don’t get the cost claims. The time it takes to note which type I intend something to be is mostly either so low that I recover it via improved hints and such very quickly, or larger but only because I’m documenting something complex enough that I should have documented it anyway, whether or not I was using static types, because it’ll be hell for other people or future-me to figure out otherwise. It seems like a large time savings to me—throw in faster and more confident refactoring and stuff like that, and it’s not even close.
I just don’t get how people are working that it represents a time cost rather than a large time savings. I don’t mean that as a dig, I just mean I genuinely don’t know what that must look like. And I’ve written a lot more code in dynamic languages, and got my start there, so it’s not like I “grew up” writing Java or something like that.
I think the general feeling is that there are some code patterns that are safe and easy to do with dynamic typing, but impossible with simple type systems, or more complex with more advanced type systems.
An example would be Common Lisp's `map` function [0] (it takes a number of sequences and a function that has as many parameters as there are sequences). It would be hard to come up with a type for this in Java, and it would be a pretty complicated type in Haskell.
Another example of many people's experience with static typing is the Go style of language, where you can't write any code that works for both a list of strings and a list of numbers. This is no longer common, but it used to be very common ~10-15 years ago and many may have not looked back.
I replied to the parent as well, but not only is the solution the parent showed significantly more complex than the CL version, I'm not even sure it actually does what I asked.
More explicitly, the expression there seems to rely on knowing the arity of the function and the number of lists at compile time. Basically, I was asking for a function cl_map such that:
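cl_map :: Integer N => (a_0 -> ... -> a_N -> r) -> [a_0] -> ... -> [a_N] -> [r]   -- fictitious syntax; N is not fixed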
Sure it's possible in Haskell. I'm not sure where in that paper you got the impression it isn't. Of course one can't define variadic functions in Haskell, but that's a more fundamental difference from Clojure, not a "code pattern that [is] safe and easy to do with dynamic typing, but impossible with simple type systems or more complex with more advanced type system."
As far as I can tell, your example calls a unary function on each element of a list of lists. It's solving the variadic part of map, but not the part where I can call an N-ary function with each element of N lists.
Basically, instead of your example I would like to do something like this:
> cl_map (+) [ZipList [1,2,3], ZipList [4,5,6]]
[5,7,9]
> cl_map (+ 3) [ZipList [1,2,3]]
[4,5,6]
> cl_map max3 [ZipList [1,2], ZipList [3,4], ZipList [5,6]] where max3 x y z = max x (max y z)
[5, 6]
Can this be done? What is the type of cl_map?
Note: If this doesn't work with ZipList, that's ok - the important part is being able to supply the function at runtime. Also, please don't assume that the function is associative or anything like that - it's an arbitrary function of N parameters.
The functions in those examples have fixed numbers of arguments,
so one would use the original formulation shown by Tyr42.
> (+) <$> ZipList [1,2,3] <*> ZipList [4,5,6]
ZipList {getZipList = [5,7,9]}
> (+3) <$> ZipList [1,2,3]
ZipList {getZipList = [4,5,6]}
> let max3 x y z = max x (max y z)
> max3 <$> ZipList [1,2] <*> ZipList [3,4] <*> ZipList [5,6]
ZipList {getZipList = [5,6]}
If you want to use "functions unknown at runtime that could take
any number of arguments" then you'll have to pass the arguments
in a list. Of course these can crash at runtime, which
Haskellers wouldn't be happy with given an alternative, but
hey-ho, let's see where we get.
> let unsafePlus [x, y] = x + y
> fmap unsafePlus (sequenceA [ZipList [1,2,3], ZipList [4,5,6]])
ZipList {getZipList = [5,7,9]}
> let unsafePlus3 [x] = x + 3
> fmap unsafePlus3 (sequenceA [ZipList [1,2,3]])
ZipList {getZipList = [4,5,6]}
> unsafeMax3 [x, y, z] = x `max` y `max` z
> fmap unsafeMax3 (sequenceA [ZipList [1,2], ZipList [3,4], ZipList [5,6]])
ZipList {getZipList = [5,6]}
So the answer to your question is that
cl_map :: ([a] -> b) -> [ZipList a] -> ZipList b
cl_map f = fmap f . sequenceA
except you don't actually want all the elements of the list to be
of the same type, you want them to be of dynamic type, so let's
just make them Dynamic.
> let unwrap x = fromDyn x (error "Type error")
>
> let unsafeGreeting [name, authorized] =
> if unwrap authorized then "Welcome, " ++ unwrap name
> else "UNAUTHORIZED!"
>
> fmap unsafeGreeting (sequenceA [ZipList [toDyn "tome", toDyn "simiones", toDyn "pg"]
> , ZipList [toDyn True, toDyn True, toDyn False]])
ZipList {getZipList = ["Welcome, tome","Welcome, simiones","UNAUTHORIZED!"]}
and the type of cl_map becomes
cl_map :: ([Dynamic] -> b) -> [ZipList Dynamic] -> ZipList b
cl_map f = fmap f . sequenceA
One could polish this up a bit and make a coherent ecosystem out
of it, but Haskell programmers hardly ever use Dynamic. We just
don't come across the situations where Clojurists seem to think
it's necessary.
So in the end, as I claimed initially, this function can't be written in a simple, safe way in Haskell; and as the article I linked claims, Haskell's type system can't encode the type of the cl_map function.
It's nice that Haskell does offer a way to circumvent the type system to write somewhat dynamic code, but it's a shame that in order to write a relatively simple function we need to resort to that.
Note that the type of cl_map is perfectly static. It would be `Integer N => (a_0 ->... a_N -> r) -> [a_0] ->... [a_N] -> [r]` assuming some fictitious syntax.
> So in the end, as I claimed initially, this function can't be
> written in a simple, safe way in Haskell
Steady on! You posed a question and I gave an answer. You weren't
happy with that answer. I think it's a bit premature to conclude that
"this function can't be written in a simple, safe way in Haskell".
> as the article I linked claims, Haskell's type system can't encode the type of the cl_map function.
Could you say where you see that claim in the article? I can see
three mentions of "Haskell" in the body, two of them mentioning that
one researcher's particular implementation doesn't handle this case,
but not a claim that it can't be done.
> Note that the type of cl_map is perfectly static. It would be `Integer
N => (a_0 ->... a_N -> r) -> [a_0] ->... [a_N] -> [r]` assuming some fictitious syntax.
OK, fine, it's a bit clearer now what you are looking for. How about this:
> cl_map (uncurry (+)) ([1,2,3], [4,5,6])
[5,7,9]
> cl_map (+3) [1,2,3]
[4,5,6]
> let max3 (x, y, z) = x `max` y `max` z
> cl_map max3 ([1,2], [3,4], [5,6])
[5,6]
Notice that the function arguments have different,
statically-known types! The type of this miracle function?
cl_map :: Default Zipper a b => (b -> r) -> a -> [r]
And the implementation?
-- Type definition
newtype Zipper a b = Zipper { unZipper :: a -> ZipList b } deriving Functor
-- Instance definition
instance a ~ b => D.Default Zipper [a] b where def = Zipper ZipList
-- These three instances are in principle derivable
instance P.Profunctor Zipper where
dimap f g = Zipper . P.dimap f (fmap g) . unZipper
instance Applicative (Zipper a) where
pure = Zipper . pure . pure
f <*> x = Zipper (liftA2 (<*>) (unZipper f) (unZipper x))
instance PP.ProductProfunctor Zipper where
purePP = pure
(****) = (<*>)
Given that the only two lines that actually matter are
newtype Zipper a b = Zipper { unZipper :: a -> ZipList b } deriving Functor
instance a ~ b => D.Default Zipper [a] b where def = Zipper ZipList
and the rest are boiler plate that could be auto-derived, I think this
is pretty satisfactory. What do you think?
First of all, thank you for bearing with me this long!
Still, you haven't written exactly the function I was asking for. You require a manual, compile-time step of transforming the N-ary function to a unary function taking a tuple. Still, it's impressive that this can define variable-length, variable-type tuples. Unfortunately I am not able at all to follow your solution, as it's using too many types that I'm not familiar with, and it seems to require some external packages, so I can't easily try it out in an online compiler to understand it better (as I have been doing so far).
Either way, I would say we are well outside the limits of an easy to understand way of specifying this kind of function - even if you are only showing 2 lines of code, it seems that your definition requires, outside of lists and functions (the objects we intended to work with): ZipList, Default, Functor, Profunctor, ProductProfunctor, Applicative, and a helper type. Even if these were derivable, someone seeking to write this function would still need to be aware of all of these types, some of which are not even part of the standard library; and of the way they work together to magically produce the relatively simple task they had set out to do.
> Could you say where you see that claim in the article?
The claim is presented implicitly: for one, they conjecture that, were Haskell or SML to "pragmatically support" such a feature, it would be used more often (offering as argument the observation that both Haskell's and SML's standard libraries define functions that differ only in the arity of their arguments, such as zipWith/zipWith3 in Haskell). This implies that, to their knowledge, it is not pragmatically possible to implement this in Haskell.
Similarly, given that in their "Related Works" section they don't identify any complete implementation of variadic polymorphism, it can be assumed that they claim at least not to have found one.
> Still, you haven't written exactly the function I was asking for
I'm afraid I'm now completely stumped about what you're asking for. If you have a function with a known arity and want to apply it to a known number of arguments then you can use the original formulation:
f <$> args1 <*> args2 <*> ... <*> argsN
You then asked what happens for unknown numbers of arguments, so I produced a solution that works with lists, which isn't very Haskelly, but does the job. After that you said you wanted something with a more specific type, so I came up with the answer that works generally over tuples (or indeed any type that contains a sequence of arguments). That's not satisfactory either! It seems you literally want a function with type `Integer N => (a_0 ->... a_N -> r) -> [a_0] ->... [a_N] -> [r]`. Well, I don't know how to do that in Haskell -- maybe my most recent solution extends to that -- but nor do I know why you'd want to do that! If you have a known number of arguments the first solution works fine. If you have an unknown number of arguments then you must have them all together in one datastructure, so the most recent solution works fine. Haskellers would be very happy with either of those and I don't see how we're missing out on programming convenience because of that. Maybe you could elaborate?
> I can't easily try it out in an online compiler to understand it better
Try this. It's a full working program. The packages it depends on are "profunctors" and "product-profunctors".
{-# LANGUAGE FlexibleInstances #-}
{-# LANGUAGE DeriveFunctor #-}
{-# LANGUAGE FlexibleContexts #-}
{-# LANGUAGE MultiParamTypeClasses #-}
{-# LANGUAGE TypeFamilies #-}
module MyExample where
import qualified Data.Profunctor as P
import qualified Data.Profunctor.Product as PP
import qualified Data.Profunctor.Product.Default as D
newtype Zipper a b = Zipper { unZipper :: Traverse ZipList a b }
deriving Functor
instance a ~ b => D.Default Zipper [a] b where
def = Zipper (P.dimap ZipList id D.def)
instance P.Profunctor Zipper where
dimap f g = Zipper . P.dimap f g . unZipper
instance Applicative (Zipper a) where
pure = Zipper . pure
f <*> x = Zipper ((<*>) (unZipper f) (unZipper x))
instance PP.ProductProfunctor Zipper where
purePP = pure
(****) = (<*>)
cl_map :: D.Default Zipper a b => (b -> r) -> a -> [r]
cl_map f = getZipList . fmap f . runTraverse (unZipper D.def)
I will start by saying it took me a while to even parse the expression you provided. Whoever thought that inventing new operators is a way to write readable code should really be kept far away from programming languages. The article you provided didn't even bother to give a name to <*> and <$> so I could at least read them out to myself.
Anyway, bitter syntax sugar aside, the way you wrote the function I proposed was... a completely different function with similar results, which does not have the type I was asking for, and you only had to introduce 2 or 3 helper functions and one helper type to do it. I wanted to work with functions and lists, but now I get to learn about applicatives and ZipLists as well... no extra complication required!
Edit to ask: could this method be applied if you didn't know the number of lists and the function at compile time? CL's map would be the equivalent of a function that produces the expression you have showed me, but it's not clear to me that you could write this function in Haskell.
In my opinion, with few exceptions, the kind of programs advocates of dynamic typing want to write that static typing would have trouble dealing with, are artificial and not the common case. (Not "map" though, I need to review that case, but "map" is definitely a common and useful function!)
> Another example of many people's experience with static typing is the Go style of language
Remember that a lot of backlash against Go's type system comes from static typing advocates used to more expressive static type systems :) It'd be a shame if, after all we complained about Go's limitations, newcomers held Go as an example of why static typing is a roadblock...
> In my opinion, with few exceptions, the kind of programs advocates of dynamic typing want to write that static typing would have trouble dealing with, are artificial and not the common case. (Not "map" though, I need to review that case, but "map" is definitely a common and useful function!)
I mostly agree, don't get me wrong. And it's important to note that Common Lisp's `map` functions do more than what people traditionally associate with `map` - they basically do `map(foo, zip(zip(list1, list2), list3)...)`.
Still, this is a pretty useful property, and it is very natural and safe to use or implement, while being impossible to give a type to in most languages.
C++ can do it with the template system, as can Rust with macros (so, using dynamic typing at compile time).
Haskell can make it look pretty decent (if you can stand operator soup) by relying on auto-currying, inline operators and a few helper functions. I would also note that the Haskell creators thought that this functionality is useful, so they implemented some of the required boilerplate in the standard lib already.
In most languages, you can implement it with lambdas and zips (or reflection, of course).
So I think that this is a nice example of a function that is not invented out of thin air, is useful, is perfectly safe and static in principle, but nevertheless is impossible to write "directly" in most statically typed languages.
Just to show the full comparison, here is how using this would look in CL, Haskell and C#:
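CL:      (map 'list #'+ '(1 2 3) '(4 5 6))                          ;; => (5 7 9)
Haskell: getZipList ((+) <$> ZipList [1,2,3] <*> ZipList [4,5,6])    -- [5,7,9]
C#:      list1.Zip(list2, (x, y) => x + y)                           // {5, 7, 9}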
Note only the CL version, out of all these languages, can work for a function known at runtime instead of compile-time. None of the static type systems in common use can specify the type of this function, as they can't abstract over function arity.
IMO, Go is never a good example in static vs dynamic type system discussions (I mean, for this case: parametric polymorphism has been around since the 70s...).
The language developers themselves have repeatedly stated that its type system being very limited is intentional.
Sure, I know Go is a low blow to static typing. But in this particular regard, Java or C# don't fare much better either.
This is not a question of just supporting parametric polymorphism, but of abstracting over the number of arguments of a function, which is not supported in almost any type system I know of; and then of matching the number of arguments received with the type of function you specified initially.
As an example of this, I've been working through Crafting Interpreters off and on. Chapter 5 consists mostly of discussion of the visitor pattern (is this the same thing as double dispatch?). The author notices that the amount of code that must be written to implement the design is so large that it's best to write a program to generate all of that code. I followed along as best I could, and at the end I wrote the equivalent code in my preferred language, which I've included in this comment:
It's not that difficult to do in Scala, which is probably the closest thing to a mainstream language with a type system powerful enough to express this.
There are better languages for expressing this more naturally (such as Idris), but in the end, the fallacy seems to lie in your claim that this would be "safe and easy to do with dynamic typing". That's what you think until you find out that your solution only works in 99% of the cases, failing in some special cases, because the compiler didn't have your back.
Examples are the standard sort functions in Java and Python, which were bugged for a very long time.
I just checked the documentation of lisps implementation and it is different from my code. If the input lists have a different size, the shortest list decides the result length and everything else is discarded.
This is of course possible to implement in Scala too, but I think it is a very bad thing to do that which can lead to bugs quite easy. I prefer my solution in that case.
Replying to add: actually, not only is the type obscure, it also relies on knowing the lists at compile time, while the CL function can do this at runtime (note that there is no dynamic behavior, it's simply that C++'s type system can't abstract over function arity).
How many hundreds of LOC would you like to write to support serializing and deserializing JSON for an endpoint that has a schema with around 20 fields, some of which are nested? If you are using Spring and Jackson, you will get to write around 300 LOC across 8 files before you get your hands on a single deserialized object. In any sane language you would use a library that enforces an arbitrary JSON schema to get the same validation guarantees provided by Jackson while writing maybe 25 LOC across maybe 2 files (if we generously count the JSON schema as code for this language but not for Java).
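As a hedged sketch of that schema-library approach (ajv in TypeScript here; the endpoint and its field names are invented for the example):

    import Ajv from "ajv";

    // Invented payload shape with one nested object.
    interface CreateUser {
      name: string;
      email: string;
      address: { city: string; zip: string };
    }

    // The JSON schema is the only "boilerplate", and it doubles as documentation.
    const schema = {
      type: "object",
      properties: {
        name: { type: "string" },
        email: { type: "string" },
        address: {
          type: "object",
          properties: { city: { type: "string" }, zip: { type: "string" } },
          required: ["city", "zip"],
        },
      },
      required: ["name", "email", "address"],
      additionalProperties: false,
    };

    const validate = new Ajv().compile<CreateUser>(schema);

    function parseBody(body: unknown): CreateUser {
      if (!validate(body)) throw new Error(JSON.stringify(validate.errors));
      return body as CreateUser; // validated against the schema above
    }

The specific library doesn't matter; the point is that the schema plus a couple of lines of glue replaces the pile of DTOs, mappers and annotations.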
It seems like the more common approach among people who use Java is to write the 300 LOC across the 8 files then use the library to generate JSON schema, rather than the other way around. I wrote it myself because I did not want to tell my team that they had been doing things wrong for years before trying their approach once.
I would like to be able to explain the cost part better. It may just be personal bias of course.
1. There's no guarantee the correct theoretical model of your program fits the type system of your programming language.
2. Sometimes there are multiple correct models for different purposes in the same program, similar to how sometimes you need multiple views onto the same database tables.
3. Sometimes you just need the ability to bodge things.
> 2. Sometimes there are multiple correct models for different purposes in the same program, similar to how sometimes you need multiple views onto the same database tables.
Just wanted to point out that even though you can have multiple views of your database tables, they all still adhere to the same type system.
I guess it's a problem that can be overcome with type inference then? (I don't have to declare types on queries, updates, or views, just on the base tables.)
I think it gives people a sense of satisfaction in modeling the real world in the relations between classes. The assertion seems to be that to solve a problem, it has to be correctly modelled in the type system of the language; once the modelling is done correctly, the solution will arise by itself.
On the other hand, people who prefer weakly typed languages see problems primarily as matters of data transformation: for example, from HTML Form to Http Request to SQL Db to CSV File and so on.
Both approaches are differentiated by the perspective on the problem.
Please don't use weak/strong to denote type systems. Those terms are highly subjective and even non-technical people would quickly form an opinion about which is better. (Strong is good, weak is bad.) Static/dynamic is more accurate and less opinionated terminology.
Python is considered strongly typed, but usually it's placed in opposition with PHP or JavaScript. That's why I don't understand why the previous poster used strongly typed. The conversation was about static and dynamic programming languages.
I hated static typing until I used Rust. Rust has sum types (super-powered enums), which provide what I missed from dynamically typed languages when working in languages like Java, C#, etc.: namely the ability to have an "or" type (e.g. this is an integer or a string, and I want to be able to branch on that at runtime).
That, plus type inference makes the static typing pretty painless.
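For readers who haven't used sum types, here is the same "or" idea sketched with TypeScript union types (TypeScript just for consistency with the other sketches; Rust's enum syntax differs):

    // A value that is a number or a string; the compiler forces you to branch
    // before using it as either one.
    type IntOrString = number | string;

    function describe(value: IntOrString): string {
      if (typeof value === "number") {
        return `a number: ${value.toFixed(2)}`;
      }
      return `a string of length ${value.length}`; // narrowed to string here
    }

    console.log(describe(42));      // "a number: 42.00"
    console.log(describe("hello")); // "a string of length 5"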
Agreed. Do note that many other languages before Rust do provide good static typing with the niceties you'd expect (and that are often missing from Java), such as type inference, sum types, etc. Some examples include, but are not limited to, Scala, Haskell, the ML family of languages, etc.
Could not agree more. That's why I'm going to Rust next once done with my current F# project - I want to have experience with both JIT and AOT languages that support sum types and Option / Discriminated Unions.
These data types are much more powerful than the fancy arrays that wowed me back in the day :)
In my experience both the benefits and costs of static typing are overstated. It doesn't mean that your code works if it compiles, and it doesn't mean that you can get rid of half your tests. But it's great for refactoring, and it's a useful form of documentation that can't lie, unlike comments. And if you're using a reasonable language it's not much additional work to add types; often with type inference it's zero work.
Clojure spec seems like the way to go. I really like that it defines what should be going in and out while leaving it really easy to merge incoming data without having to write a bunch of extra code.
For me it depends on the style of static types. I do find Java or C#'s static types to be helpful, but also time consuming. Elm or Haskell on the other hand don't force me to write out the static types while still giving me the benefits of them.
There's also the case that I find the type systems of Rust, Elm, etc to be much more helpful than the type systems of C++ or Sorbet (type system for Ruby).
It depends on the situation. I've had code that absolutely benefited from static types and it helped me find bugs before they happened.
But my current job has very, very little that would benefit from static typing. Adding it into the mix would slow us down, both literally and figuratively.
Most of what we do is just data input and data display. And most of that is text. There aren't really any calculations or anything, beyond some simple sizing of UI stuff.
For database stuff, an ORM with some validation rules is generally enough, and couldn't be replaced with static typing anyhow.
For anything that absolutely has to be a certain kind of data, there are things built into dynamic languages to check the type of something, and you just call it as needed on a case-by-case basis.
I think it’s the wrong metric to look at. Static typing still leaves plenty of room for bugs. The capability and discipline of the team would likely be more of a factor than the type system so the studies would be hard to get right.
However I do think static typing provides an enormous benefit to picking up code that is 5 years old and written by someone else. The ability to see “this is a nonnullable int32 value type” greatly reduces the amount of paths you have to go down when you have to change something or understand what’s going wrong with it. Tradeoff is you end up with a lot more code to maintain...
It helps a lot with refactoring and many other things even early in the project.
For example I‘m using TypeScript with a GraphQL code generator. Now let‘s assume I add a new value to a GraphQL enum. I run codegen, then fix everything until the compiler is happy. Afterwards, all places where this enum was ever touched will take it into account correctly, including mappings, translations, all switch statements, conditions, lists where some of the other values are mentioned and so on.
This is something that‘s not possible in a dynamic language and it‘s not even possible in Java, really.
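A hedged sketch of how that plays out (hypothetical enum; the `never` trick is what turns every unhandled switch into a compile error):

    // Hypothetical generated enum. Adding a member (say Status.Archived) makes
    // every exhaustive switch like this one fail to compile until it is handled.
    enum Status {
      Active = "ACTIVE",
      Suspended = "SUSPENDED",
    }

    function label(status: Status): string {
      switch (status) {
        case Status.Active:
          return "Active";
        case Status.Suspended:
          return "Suspended";
        default: {
          // If a case is missing, `status` is not `never` here and this
          // assignment becomes a compile-time error pointing at the switch.
          const unreachable: never = status;
          return unreachable;
        }
      }
    }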
You could do the same process if you used dynamic typing and good test coverage. Just make your change and keep fixing tests until everything is green.
But how could you have written tests against that new enum value if it didn't exist before? You would need to know in advance which places needed to be tested for it.
I would change that "5 years" to "5 months" (or maybe even 5 weeks), and argue that you can just as easily be that "someone else" on a long enough time-scale (and I don't mean 5 years.)
I just can't imagine how strong types can be viewed as anything other than an immense benefit. Maybe on some personal projects or a throwaway prototype or simple scripts it might not be worth the effort, but if you're working on any kind of software that needs to be maintained over a period of years, having strong types is always better than not having them. Strong types encode extremely valuable information about the composition of application data structures; omitting them can save time up front, but it's just a form of technical debt that every engineer on the project will have to pay back when they're tasked with debugging data errors in code they didn't write. So many engineering-hours are wasted by developers stepping through a debugger trying to figure out why some complex object is arbitrarily missing certain properties or why the same property is sometimes a string but other times it's an object with its own potential superposition of states. It's a nightmare.
The study IMHO is reality. Every large, actively maintained system I've ever heard of uses a strongly typed language, or was written in a dynamically typed language but has converted to a gradually typed language. You can't safely refactor without compile time type checking, and you can't maintain a non-trivial system over the long-term if you can't refactor it.
Some contrary anecdata - I've worked at a bunch of places and know people who work at others that have large, actively maintained, decades-old systems using, e.g., Perl. The only one I know of that was actively trying to migrate recently was looking at node instead.
If you're able to share, I'm curious what kinds of systems these are (and also, I guess, how development velocity compared to statically typed projects you've worked on).
Net-A-Porter is probably the prima Perl in London at the minute and have been for decades - although I think they started migrating to Modern Perl with microservices in 2018ish.
Photobox were still on a 10+ year old Perl system in 2018 when I left - development velocity there was only really hampered by an insane belief that rewriting into node.js microservices would solve every problem, and by the ORM + Object system being a ~10yr old handcrafted abomination from someone who had left 5+ years ago.
Can't speak for NAP but Photobox was extremely strong on unit + integration testing. IIRC, a full ground-up test suite run took a few hours because it started from a bare box, installed + tested Perl + every needed module, spun up a blank database, then did the unit tests, then ran a full integration test suite against that server. That was only done for a release; for general commits, it was ok to run a subset of tests to show things worked / still worked.
Tests can't replace types, just like types can't replace tests. You need both.
Types can't check the correctness of everything, but they do prove that certain classes of errors don't exist in your program.
Tests, on the other hand, can test for many more types of bugs, but they can only look for errors, they can't prove correctness (except in very small, closed environments where you can literally test every possible combination of inputs and outputs).
> they do prove that certain classes of errors don't exist in your program
That's particularly important when refactoring because you want to assert that you haven't introduced new bugs, and the type system will often let you prove that with almost zero effort on your part.
If you add a new value to an enum, or a new argument to a method, you have to know all the places where you do switches on the enum or call the method, which is hard to do without static types.
You have to know when semantically it's an instance of the enum, and when it's just a string literal (e.g. in JavaScript, where enums are just a pre-defined set of allowed strings). Also, you might assign to it in one place, and then use the resulting variable in many other places farther down the call chain. Now you have to find all those usages.
Static typing makes both of those trivial if your language (or linter) has enum-exhaustiveness checks for switch statements.
You can't always find everything with a search in a dynamic language. Some things are resolved at runtime. In our Perl code base, finding function calls is difficult for this reason (you can eval a string name to get a module or function and call it, based on a configuration setting in a file).
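The same phenomenon, sketched in TypeScript/JavaScript terms (names invented): a text search for the function name finds its definition but not this call site, because the name comes from data at runtime.

    // Config-driven dispatch: which function runs is only known at runtime.
    const handlers: Record<string, (payload: string) => void> = {
      sendEmail: (payload) => console.log("emailing:", payload),
      sendSms: (payload) => console.log("texting:", payload),
    };

    // e.g. loaded from a configuration file at startup
    const config = { notifier: "sendEmail" };

    handlers[config.notifier]("hello"); // grepping for "sendEmail(" won't find this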
In my personal opinion (based on personal observation), static typing helps one become more lazy and trusting, since it enables features such as autocomplete, type hinting and so on, where one basically gives away understanding of the details.
Don't get me wrong! I love being lazy and trusting because it allows me to leverage more code than I'd be able to produce on my own, but usually it's also the source of many of my own mistakes and failures.
I strongly believe that I'm not the only one.
This is a surprising argument to me. In my opinion, the opposite is true: thinking about types makes you understand your building blocks better, whereas the danger of dynamic typing is that you can go a long way being "lazy" and not understanding/caring about the types involved, "it just works". Sometimes this speed is welcome, but (in my opinion) you end up crashing and burning sooner or later...
And I feel that I'm wasting time on understanding the stuff that I need to pass and have to remember 100 times more things than I need to.
I want my code to be clear and with certain expectations fulfilled, rather than a mystery in front of me. I'm not there to learn what could be passed into my functions - I'm there to create functionality.
I've recently learned and ported a few projects from Javascript to TypeScript and I'll vouch that, so far, it is much better and easier to reason about my code and what it's doing. I also feel I need less test cases to adequately test my code.
Having said that, I'm interested in whether there is any accepted, peer-reviewed literature with quantifiable data as to whether strongly typed languages are "better" (however the study might define better, such as being faster, more scalable, etc.). From what I've heard and read, most of the better-ness that strong typing provides is related to people problems and being able to scale a team, not necessarily scaling a system or making the system better. Having learned Go and TypeScript after primarily writing Ruby and Javascript, I'm convinced of the better-ness strong typing provides, whether it's related to readability, better IDE intellisense, or speed (although Go, for example, is faster than Ruby and JS not just because it's strongly typed but because it's compiled). I'm just interested in whether there's real data to support using them instead of anecdata.
JS -> TS is usually just a matter of adding types, so I think by "ported" the previous commenter just meant adding types and tweaking as needed to make the type system happy.
I much prefer static typing - but the metric used in the article is around bug reduction. Regardless of static (Go or Java), dynamic (JS), or a bit weird (plpgsql), I don’t generally get the type of a variable or object wrong - because when I’m using the object I necessarily must already have a mental map of what it represents. It’s pretty rare to try to call the “fill()” method on a “line” object, so to speak, because I know it’s a line when I access it.
So I’d guess that the number of type related bugs in dynamic languages is just a little bit greater than in static languages, simply because it is harder to make that kind of mistake in a typed language. But as a category, they aren’t common mistakes in the first place.
I can confidently say that I’m a bit of an expert at writing bugs :) and of all the kinds of bugs I write, type related bugs are probably nowhere near the top of the list.
That’s not to say that static typing isn’t better - I definitely think it is. But I can also believe that it doesn’t necessarily reduce the bug count by a huge margin. (For whatever it’s worth I think the main benefits are documentation and refactoring...)
On that note, I've often heard dynamic typing advocates say (or write) "I've seldom written a bug that was a type error".
But a lot of the time, their language is simply unable to encode certain properties as types, so by definition they don't think of some classes of bugs they do write as "type errors". Maybe in a statically typed language they would have been type errors indeed!
It's as if the tool you use sometimes reinforces your blind spots: "you don't know what you don't know".
PS: the anecdote is probably irrelevant, but I've written plenty of dumb type errors with Python. Things that would have been caught by a test or an external tool, sure, or that the type checker of a statically typed language could have caught for me for free, leaving the more relevant logic tests to me. I tend to write type errors left and right. Maybe I'm simply not a good Python programmer, of course!
> I wish it would be possible to have better studies for that. I believe that static typing has huge benefits as software scales.
I also believe that, especially once the software outscales what one person is able to (fully) oversee. Static types are something a program can reason about, so they allow much more productivity-boosting tooling to be created. This also goes way beyond simply catching type errors at compile time vs. runtime (a downside which can largely be mitigated by test coverage). Just look at what an IDE for e.g. Java can do simply in helping you navigate a big codebase. Then throw in refactoring, which in many cases can even be a completely automated operation and in many more cases is at least greatly assisted by the tools. Tools for dynamic languages can often at most guess; making good guesses is hard, so in practice you mostly get tooling that is pretty limited in its usefulness.
I'd love for the study to include various types of static types. Testing against only C#/Java style type systems seems fairly narrow compared to the various kinds of static type systems available.
I've contributed to large codebases that have static typing (C++ and TypeScript) and dynamic typing (JavaScript) and I've come to the conclusion over the years that static typing isn't really worth it as long as you have the discipline to write tests for your code. The most basic unit tests cover type checking concerns. Refactoring might require a bit more search/replace but I don't see how that is a big deal. Tests make refactoring safer than with just types. Tests act as good documentation of how you expect your code to behave and what expected inputs/outputs there are. You just don't get autocomplete which is a pretty overrated feature imo.
(Not a balanced evaluation, just cherry-picking failures. But I suppose we're at the point in the hype cycle where it's easier to find success stories being talked about.)
I don't have the necessary background to find a good "Kubernetes everywhere" ice shower, but if someone else found one and submitted it I think I could evaluate it.
I said it in a jokingly way, but it would really be a nice read.
The hype is huge in that train, but I'm sure there must be lots of professionals that have already learned about its shortcomings. Not sure if proper studies exist about Kubernetes yet, though. Hopefully you'll get a PR with some content.
I was really surprised docker or kubernetes wasn't one of the items on here. While I use both, they definitely both could use cold showers to make sure they provide value.
I seeded most of the list and know basically nothing about docker or kubernetes, so don't know of any cold showers myself. But I would be more than happy to edit a submission by someone who knows the space!
Possibly, but it doesn't have to be only that, or without any merit. Cold baths and even winter swimming [1] (in icy water) are a thing in many parts of the world.
Is thinking that everything is a Silicon Valley thing, a Silicon Valley thing? Is it really plausible that no one thought to feel manly about cold showers until a bunch of nerds came along?
>Is it really plausible that no one thought to feel manly about cold showers until a bunch of nerds came along?
No, but it's quite plausible that it was a niche thing that might have been a fad at some points in the past, only to be revived by a new generation that includes many fad-chasing types, SV people, and BS-artists (aka influencers)...
I'm pretty sure cold showers have been a thing to show your ability to live without comforts ever since hot showers became a possibility.
And the term fits here, I believe: cold showers do very much wake you up and bring you into reality quickly. There's no dreaming about hypes when you're under a cold shower.
No, it is more of a euphemism for "sobering up." It has existed for quite a while, usually applied to drunks though. It has minor connotations of being an oddball health nut: people extolling the benefits of being immersed in chilly water.
Stoicism hasn't been popular since Victorian times. I think SV types and rationalists kind of revived it, since it was moral austerity without any real religious background. To an extent they do the same with meditation; it's somehow gone from being something associated with New Age thinking, to something atheist rationalists tout the benefits of while carefully avoiding any hint of spirituality.
Everyone I know who identifies as a stoic is an emotionally stunted software engineer who realistically isn't tasked with stoically shouldering very much of anything.
As a counterpoint, most of the people I know whom I would consider to follow a Stoic philosophy don't self-classify all that much.
I would also say that there's a pretty big difference between "stiff upper lip/no emotions" that people imagine when using the adjective "stoic" and the Stoic writings of Marcus Aurelius and the like.
> I would also say that there's a pretty big difference between "stiff upper lip/no emotions" that people imagine when using the adjective "stoic" and the Stoic writings of Marcus Aurelius and the like.
A quick way to find out what sort of self-proclaimed lover of Meditations you’re dealing with is to ask what they think of its physics and metaphysics.
Taking amusement in catching little social lies isn’t exactly gatekeeping. More like sport.
I don’t give a damn whether people call themselves Stoics or not and whether they’re sincere—whatever that means—or not, but I’m very sure, specifically, that there are a lot more fans of Meditations than people who’ve read it or even meaningfully read about it, which is funny in that “oh boy, aren’t we humans goofy” sort of way. It’s also an easy phenomenon to stumble on innocently while trying to discuss the book, though I think that goes for a lot of Very Important Books that more people claim to have read than actually have. IIRC someone wrote a whole tongue-partially-in-cheek guide to pretending at having read books, for the reason that it’s pretty common.
One of my hobbies is asking christians if they know about the time god told a bear to maul some kids for making fun of a bald guy. When they say they don't, I dab, tip my fedora, then moonwalk out of the room. Aren't humans goofy?
A Cold Shower for (early) testing of software, maybe:
There used to be an often-cited paper by Boehm about the cost of catching bugs early vs. late in production, usually mentioned by advocates of testing early, where the quoted conclusion was something like "studies show it's 10 times more costly to catch bugs late in production". This is a very well known study, I'm likely misquoting it (the irony!), and readers here are probably familiar with it or its related mantra of early testing.
I haven't read the paper itself (I should!), but later someone claimed that (a) Boehm doesn't state what people quoting him say he said, (b) the relevant studies had serious methodological problems that call into question the conclusions he did draw, and (c) there are plenty of examples where fixing bugs late in production wasn't particularly costly.
edit: I'm not arguing testing isn't necessary, in case that upset someone reading this post. I'm not really arguing anything, except that the study by Boehm that most people quote was called into question (and was probably misquoted to begin with). This doesn't prove/disprove anything, except maybe hinting at a possible Cold Shower. It does show that we have a serious problem in software engineering with backing up claims with well-designed studies and strong evidence, but this shouldn't come as a surprise to anyone reading this.
Laurent Bossavit tears the Boehm paper apart in his book "Leprechauns of Software Engineering"[1]. It's a good read for anyone interested in the empirical side of software research.
This gets at the heart of one of my big gripes about how we talk about engineering and technology.
Often a fancy new thing is introduced with a very long list of pros: "fast, scalable, flexible, safe". Rarely is a list of cons included: "brittle, tough learning curve, complicated, new failure modes".
This practice always strikes me as odd because the first law of engineering is "everything is a trade-off". So, if I am going to do my job as an engineer I really need to understand both the "pros" and "cons". I need to understand what trade-off I'm making to get the "pros". And only then can I reason about whether the cost is justified.
>Researchers had programmers fix bugs in a codebase, either where all of the identifiers were abbreviated, or where all of the identifiers were full words. They found no difference in time taken or quality of debugging.
I would not have expected that. Still, I prefer to use full(er) identifiers. I don't like to guess how things were abbreviated, especially when consistency isn't guaranteed. If I were using a different language and IDE, this might be better.
If you don't have more data than can fit on a reasonably large hard drive, you do not have big data and you are likely able to process it faster and cheaper on one system.
Thoughts on this one? I found the presentation to be somewhat mixed.
I found the initial comb through of the agile principles to be needlessly pedantic ("'Simplicity... is essential' isn't a principle, it's an assertion!"); anyone reading in good faith can extract the principle that's intended in each bullet of that manifesto.
The critique of user stories (~35 mins in) was more interesting; it's something we've been bumping up against recently. I think the agile response would be "if your features interact, you need a user story covering the interaction", i.e. you need to write user stories for the cross-product of your features, if they are not orthogonal.
I'm not really convinced that this is a fatal blow for user stories, and indeed in the telephony example it is pretty easy to see that you need a clarifying user story to say how the call group and DND features interact. But it does suggest that other approaches for specifying complex interactions might be better.
Maybe it would be simpler to show a chart of the relative priorities or abstract interactions? E.g. thinking about Slack's notorious "Should we send a notification" flowchart (https://slack.engineering/reducing-slacks-memory-footprint-4...), I think it's impossible (or at least unreasonably verbose) to describe this using solely user stories. I do wonder if that means it's impossible for users to understand how this set of features interact though?
Regarding the purported opposition in agile to creating artifacts like design docs, it's possible that I'm missing some conversation/context from the development of Agile, but I've never heard agile folks like Fowler, Martin, etc. argue against doing technical design; they just argue against doing too much of it too early (i.e. against waterfall design docs and for lean-manufacturing style just-in-time design) and that battle seems to have largely been won, considering what the standard best-practices were at the time the Agile manifesto was written vs. now.
All research is inconclusive? Sure. I wonder what kind of type systems were in there? I guess Java and similar languages are accounted for, and yet I wouldn’t put any faith in them.
ML, Swift, Haskell... now that’s something else.
My interpretation of that is both that we need more research and that this is a very hard problem to study. One of the big challenges is that the things which are easy to study aren't representative of real world conditions: if you have CS undergrads doing new development on toy problems, that isn't representative of what experienced developers at most businesses do. Having people pick less common languages can select for programmers who aren't representative of the general field[1] and are probably going to invest extra time making their favorite language look as good as possible. You can hire people, train them in various languages, and have them implement something generally applicable but that's now a really expensive study.
1. e.g. what percentage of the gains attributed to Lisp were more likely due to the candidate pool in the 90s/2000s skewing heavily towards people who learned it at elite CS programs, especially if you're doing a challenge competition which benefits from having studied various algorithms?
It doesn’t account for the communication value of static types. Personally, I consider static types primarily a communication tool, so IMO the review’s interesting but not very useful per se. Also the main point of it seems to be “research on this topic is mostly bad, so far, so who the hell knows what’s true”. It could be that the research has sucked, not that there’s little discernible difference between the two on the dimensions measured.
Tests don’t save me from having to go look at other files to learn things about the code I’m actually interested in; they can be misleading in ways that types aren’t—in particular, it’s very hard to know what sorts of things an all-green test suite guarantees versus a passing static type build, without a great deal more information; and they are, in practice, prone to rot and neglect in a way static types rarely are, and are harder to bring back into a useful state when that happens.
They serve very different purposes and generally are not first and foremost good communication tools the way static types are, for a bunch of reasons. That doesn’t mean tests aren’t very useful and welcome things to have, however.
My experience is that tests tend to be much harder to read, and take more effort to understand, than types. Types are a higher-level approximation for your program.
Seems like modern Python or TypeScript are close to the sweet spot for typed code. You don't need to set types for everything, just enough to get checks from a compiler in the most important places (at least for TS; tooling for Python is still lacking). Java is slowly going there too, but from another direction.
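A small sketch of that "annotate only where it matters" style (the `Order` type is invented): one explicit annotation at the boundary, everything downstream is inferred but still checked.

    interface Order { id: string; amount: number }

    function totalDue(orders: Order[]) {           // return type inferred: number
      const amounts = orders.map((o) => o.amount); // inferred: number[]
      return amounts.reduce((sum, a) => sum + a, 0);
    }

    // totalDue([{ id: "a1", amount: 9.5 }, { id: "a2", amount: 3 }]) === 12.5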
I was working on a new type of locking mechanism and thought I would be smart by modelling it in spin [http://spinroot.com], which has been used for this kind of thing before.
I ended up with a model that was proven in spin, but still failed in real code.
Granted, that's anecdata with a sample size of 1, but it was still a valuable experience for me.
The title doesn't really relate to its content very well; the concept of taking cold showers has some scientific backing ([1] & [2]), and is also slightly hyped. After taking cold showers and getting some (minor) benefits for some years, the term "cold shower" started to get a positive association in my mind.
This article isn't about showers, nor positive results, making the title quite confusing :)
Topic: The curious case of the cyclist’s unshaven legs
From a comment (this part clearly intended to be witty I think):
Really, I thought it was weird, and probably inappropriate, to mix so much of an outsider's amateur and unsupported opinion about science into an otherwise interesting story about leg hair drag.