
Can't we extend this argument to eliminating basically all static typing? And frankly that wouldn't even be wrong; it's why Alan Kay defined OOP as dynamically typed and late-bound, and we went against it anyway, relearning the same lessons over and over.



The argument is really more like: Always defer validation until the point where the data is actually consumed, because only the consumer actually knows what is valid.

Which is definitely a counterpoint to the oft-stated argument that you should validate all data upfront.

Either way though, you can still have types, the question is just when and where (in a distributed system, especially) they should be checked.


The argument is actually more like: don't use badly written middleman software that tries to parse messages it doesn't need to parse.

I was at Google when the "let's get rid of optional" crusade started. It didn't make sense to me then and over a decade later it still doesn't. If a program expects a field to be there then it has to be there, removing the protobuf level checking just meant that programs could now read garbage (some default value) instead of immediately crashing. But the whole reason we have types, assertions, bounds checking and so on is because, almost always, we'd like our software to NOT just blindly plough on into undefined territory when it doesn't understand something properly, so in reality it just means everyone ends up coding those very same required-ness assertions by hand.

Now, Google's business model is remarkably robust to generating and processing corrupt data, so you can argue that in the specific case of this specific company, it is actually better to silently serve garbage than to crash. This argument was made explicitly in other forms, like when they deleted all the assertions from the HTTP load balancers. But in every case where I examined an anti-required argument carefully, the actual problem would turn out to be elsewhere, and removing assertions was just covering things up.

The fact that so much of Google's code is written in C++ that not only starts up slowly but also immediately aborts the entire process when something goes wrong also contributes to the brittleness that encourages this kind of thing. If Google had been built on a language with usable exceptions right from the start, it'd have been easier to limit the blast radius of data structure versioning errors to only the requests where that data structure turned up, instead of letting them nuke the entire server (after which the RPC stack helpfully retries, because it doesn't know why the server died, promptly killing all the others).

But this tolerance of undefined behavior doesn't hold for almost any other business (except maybe video games?). In those businesses it's better to be stopped than wrong; if you aren't, you can lose money, lose data, lose customers or in the worst cases even lose your life. I don't think people appreciate the extent to which the unique oddities of Google's business model and infrastructure choices have leaked out into the libraries their staffers/ex-staffers release.


> The argument is actually more like: don't use badly written middleman software that tries to parse messages it doesn't need to parse.

The middleman software in question often needed to process some part of the message but not others. It wasn't realistic to define a boundary between what each middleman might need and what they wouldn't need, and somehow push the "not needed" part into nested encoded blobs.

I'm not sure the rest of your comment is really addressing the issue here. The argument doesn't have anything to do with proceeding forward in the face of corrupt data or undefined behavior. The argument is that validation needs to happen at the consumer. There should still be validation.


> It wasn't realistic to define a boundary between what each middleman might need and what they wouldn't need, and somehow push the "not needed" part into nested encoded blobs.

This is an interesting argument that I would like to see more elaboration on, because that's the obvious solution. Effectively you're building a pipeline of data processors and each stage in the pipeline reads its own information and then passes along a payload with the rest of the information to the next stage. This would preserve full static typing with required fields, but I can see how it might inhibit some forms of dynamic instrumentation, eg. turning verbose logging on/off might dynamically reconfigure the pipeline, which would affect all upstream producers if they're wrapping messages for downstream consumers.

If this were a programming language I would immediately think of row typing to specify the parts that each stage depends on while being agnostic about the rest of the content, but I'm not sure how that might work for a serialization format. Effectively, you're pulling out a typed "view" over the underlying data that contains offsets to the underlying fields (this is the dictionary-passing transform as found in Haskell).
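As a rough sketch of that idea (hypothetical field names, not any real serialization API), TypeScript's structural generics already give you a poor man's row polymorphism: a stage constrains only the fields it reads, while the rest of the record flows through untouched:

```typescript
// Hypothetical pipeline stage: it declares a dependency on `query` only,
// and returns the whole record so fields it knows nothing about survive.
function tagStage<T extends { query: string }>(msg: T): T & { tagged: boolean } {
  // read the one field this stage depends on
  const q = msg.query.trim();
  // pass everything else along unmodified
  return { ...msg, tagged: q.length > 0 };
}

const out = tagStage({ query: " jaguar ", debugBlob: "opaque" });
// out.debugBlob is still present and still typed: the "rest of the row"
// passed through without this stage ever referencing it.
```

The open question, as you say, is how to get the same guarantee at the wire level rather than only in one process's type system.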


The particular piece of infrastructure I worked on sat in the middle of the search pipeline, between the front-end that served HTML web pages, and the back-end index. This middle piece would request search results from the back-end, tweak them in a bunch of ways, and forward them on.

These "tweaks" could be just about anything. Like: "You searched for Jaguar, but I don't know if you meant the cat or the car. The index decided that pages about the car rank higher so the first three pages of results are about the car. I'm going to pull some results about the cat from page 4 and put them near the top so just in case that's what you really wanted, you'll find it."

Google Search, at least when I worked on it, was composed of a huge number of such tweaks. People were constantly proposing them, testing whether they led to an improvement, and shipping them if they did. For a variety of reasons, our middleman server was a great place to implement certain kinds of tweaks.

But what kinds of information are needed for these "tweaks"? Could be anything! It's a general-purpose platform. Search results were annotated with all kinds of crazy information, and any piece of information might be useful in implementing some sort of middleman tweak at some point.

So you couldn't very well say upfront "OK, we're going to put all the info that is only for the frontend into the special 'frontend blob' that doesn't get parsed by the middleman", because you have no idea what fields are only needed by the frontend. In fact, that set would change over time.

> If this were a programming language I would immediately think of row typing to specify the parts that each stage depends on while being agnostic about the rest of the content

Indeed, perhaps one could develop an elaborate system where in the schemas, we could annotate certain fields as being relevant to certain servers. Anywhere else, those fields would be unavailable (but passed through without modification or validation). If you needed the fields in a new place, you change the annotations.

But that sounds... complicated to design, and maintaining the annotations would be cumbersome. Simply banning required fields solved the problem for us, and everything else just worked.


> Indeed, perhaps one could develop an elaborate system where in the schemas, we could annotate certain fields as being relevant to certain servers. Anywhere else, those fields would be unavailable (but passed through without modification or validation). If you needed the fields in a new place, you change the annotations.

I don't think it has to be elaborate. What I was thinking was something more like, in pseudo-C#:

    // the framework's general channel type from which messages are read
    public interface IChannel
    {
        T Read<T>() where T : interface;
    }

    // clients declare the interface they operate on:
    public interface IClientFields
    {
        public int Foo { get; set; }
        public string? Name { get; set; }
    }
    ...
    // client middleware function
    Task MiddlewareFn(IChannel chan)
    {
        var client = chan.Read<IClientFields>();
        ... // do something with client before resuming at next stage
    }
The client's interface type T must simply be a structural subtype of the underlying message type. As long as the underlying format is somewhat self-descriptive, with a name and type map, you can perform the necessary checking that applies only locally to the client. Nothing fancy: the required fields the client cares about are still checked, and the rest are ignored because they're never referenced. This could return an interface that contains a series of offsets into the data stream, which I believe is how capnproto already works.


Are you saying that each service would need to declare, separately, the subset of fields they operate on, and make sure that those fields are always a strict subset of the overall set of fields the protocol contains?

This essentially means declaring the same protocol multiple times, which seems like a big pain.


Assuming the service is not operating on the whole format, then it's already implicitly depending on the presence of those fields and also implicitly depending on the fact that they are a strict subset of the overall set of fields in the protocol. I'm not sure why making this fact slightly more explicit by having the client add a single interface would be a big pain.

In principle, this also enables type checking the whole pipeline before deployments since the interfaces can be known upfront rather than latent in the code.


Single interface?

In the search infrastructure example I mentioned up-thread, we had hundreds, maybe thousands of schemas involved.


I said a single interface per client, where I'm using "client" as a stage in the pipeline. Each piece of middleware that plugged into this pipeline already implicitly depends on a schema, so why not describe the schema explicitly as some subset of the underlying message?


TypeScript has types like Partial, Pick and Required which let you work with subsets of the fields of an existing type (https://www.typescriptlang.org/docs/handbook/utility-types.h...). Could something like that be built for Protobuf message processing?
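For concreteness, a sketch of the TypeScript side (hypothetical message type, not real protobuf codegen):

```typescript
// Hypothetical full message; in practice this would come from codegen.
interface SearchResult {
  url: string;
  title: string;
  snippet: string;
  debugInfo?: string;
}

// A middleman stage declares the subset of fields it actually touches.
type RerankerView = Pick<SearchResult, "url" | "title">;

function rerank(results: RerankerView[]): RerankerView[] {
  // sorts by title; never looks at snippet or debugInfo
  return [...results].sort((a, b) => a.title.localeCompare(b.title));
}

// Structural typing means full messages satisfy the view automatically.
const full: SearchResult[] = [
  { url: "https://b.example", title: "B", snippet: "..." },
  { url: "https://a.example", title: "A", snippet: "..." },
];
const ranked = rerank(full);
```

This only checks the static view inside one program, of course; making the serialization layer verify the subset at runtime would need a self-describing wire format.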


It's easier to understand in context - some services (iirc web search but it might have been ads or something else very core) had kept adding fields to some core protobufs for years and years. It made sense, was the path of least resistance etc, but inevitably some of those fields became obsolete and they wanted to remove them but found it was hard, because every program that did anything with web search was deserializing these structures.

Truly generic middleware like RPC balancers did what you are saying, but there were also a lot of service specific "middlemen" which did need to look at parts of these mega-structures.

Now due to how protobufs work, you can do what you suggest and "cast" a byte stream to multiple different types, so they could have defined subsets of the overall structures and maybe they did, I don't remember, but the issue then is code duplication. You end up defining the same structures multiple times, just as subsets. With a more advanced type system you can eliminate the duplication, but there was a strong reluctance to add features to protobufs.


Honestly, I wonder how big the performance win from static types really is here, because this sounds so terribly well suited to dynamic types (of which optionality-by-default is in fact a limited example). Such an odd choice to calcify a spec in a place where it changes all the time. "Static" optimizations should be local, not distributed.


I think you're defining consumer as the literal line of code where the field is read, whereas a more natural definition would be something like "the moment the data structure is deserialized". After all, it's usually better to abort early than halfway through an operation.

It was quite realistic to improve protobufs to help dig web search out of their "everything+dog consumes an enormous monolithic datastructure" problem, assuming that's what you're thinking of (my memory of the details of this time is getting fuzzy).

A simple brute-force fix for their situation would have been to make validation of required fields toggle-able on a per-parse level, so they could disable validation for their own parts of the stack without taking it away for everyone else (none of the projects I worked on had problems with required fields that I can recall).

A better fix would have been for protobufs to support composition. They could then have started breaking down the mega-struct into overlapping protos, with the original being defined as a recursive merge of them. That'd have let them start narrowing down semantically meaningful views over what the programs really needed.

The worst fix was to remove validation features from the language, thus forcing everyone to manually re-add them without the help of the compiler.

Really, the protobuf type system was too simple for Google even in 2006. I recall during training wondering why it didn't have a URL type, given that this was a web-centric company.

Shortly after, I discovered a very simple and obvious bug in web search in which some local business results were 404s even though the URL existed. It had been there for months, maybe years, and I found it by reading the user support forums (nobody else did this; my manager considered me way out of my lane for doing so). The bug was that nothing anywhere in the pipeline checked that the website address entered by the business owner started with https://, so when the result was stuffed into an <a> tag it turned into <a href="www.business.com"> and the user ended up at https://www.google.com/www.business.com. Oops. These bad strings made it all the way from the business owner, through the LBC frontend, the data pipeline, the intermediate microservices and the web search frontends to the user's browser. The URL did pass crawl validation, because when it was loaded into a URL type the missing protocol was silently added.

SREs were trained to do post-mortems, so after it got fixed and the database was patched up, I naively asked whether there was a systematic fix for this, like maybe adding a URL type to protobufs so data would be validated right at the start. The answer was "it sounds like you're asking how to not write bugs" and nothing was done, sigh. It's entirely possible that similar bugs reoccurred dozens of times without being detected.

Those are just a couple of cases where the simplicity (or primitivity) of the protobuf type system led to avoidable problems. Sure, there are complexity limits too, but the actual languages Googlers were using all had more sophisticated type systems than protobuf and bugs at the edges weren't uncommon.


> I think you're defining consumer as the literal line of code where the field is read

I am.

> After all it's usually better to abort early than half way through an operation.

I realize this goes against common wisdom, but I actually disagree.

It's simply unrealistic to imagine that we can fully determine whether an operation will succeed by examining the inputs upfront. Even if the inputs are fully valid, all sorts of things can go wrong at runtime. Maybe a database connection is randomly dropped. Maybe you run out of memory. Maybe the power goes out.

So we already have to design our code to be tolerant to random failures in the middle. This is why we try to group our state changes into a single transaction, or design things to be idempotent.

Given we already have to do all that, I think trying to validate input upfront creates more trouble than it solves. When your validation code is far away from the code that actually processes the data, it is easier to miss things and harder to keep in sync.

To be clear, though, this does not mean I like dynamic typing. Static types are great. But the reason I like them is more because they make programming easier, letting you understand the structure of the data you're dealing with, letting the IDE implement auto-complete, jump-to-definition, and error checking, etc.

Consider TypeScript, which implements static typing on JavaScript, but explicitly does not perform any runtime checks whatsoever validating types. It's absolutely possible that a value at runtime does not match the type that TypeScript assigned to it. The result is a runtime exception when you try to access the value in a way that it doesn't support (even though its type says it should have). And yet, people love TypeScript, it clearly provides value despite this.

This stuff makes a programming language theorist's head explode, but in practice it works. Look, anything can be invalid in ways you never thought of, and no type system can fully defend you from that. You've got to get comfortable with the idea that exceptions might be thrown from anywhere, and design systems to accommodate failure.
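A minimal sketch of that gap, assuming a hypothetical User type: the cast compiles without complaint, and the failure only surfaces at the point of use:

```typescript
interface User {
  name: string;
}

// TypeScript trusts the cast; no runtime check is emitted.
const user = JSON.parse("{}") as User;

// user.name is undefined at runtime even though its type says string,
// so the error surfaces only where the value is actually used:
let failed = false;
try {
  user.name.toUpperCase();
} catch {
  failed = true; // TypeError: cannot read properties of undefined
}
```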


I agree with a lot of this, but:

1. The advantage of having it in the type system is the compiler can't forget.

2. It's quite hard to unwind operations in C++. I think delaying validation to the last moment is easier when you have robust exceptions. At the top level the frameworks can reject RPCs or return a 400 or whatever it is you want to do, if it's found out 20 frames deep into some massive chunk of code then you're very likely to lose useful context as the error gets unwound (and worse error messages).

On forgetting, the risky situation is something like this:

    message FooRequest {
        required string query = 1;
        optional string options = 2;   // added later
    }
The intention is: in v1 of the message there's some default information returned, but in v2 the client is given more control including the ability to return less information as well as more. In proto2 you can query if options is set, and if not, select the right default value. In proto3 you can't tell the difference between an old client and a client that wants no extra information returned. That's a bug waiting to happen: the difference between "not set" and "default value" is important. Other variants are things like adding "int32 timeout" where it defaults to zero, or even just having a client that forgets to set a required field by mistake.

TypeScript indeed doesn't validate type casts up front, but that's more because it's specifically designed to be compatible with JavaScript, and the runtime doesn't do strong typing. People like it compared to raw JS.


> Consider TypeScript, which implements static typing on JavaScript, but explicitly does not perform any runtime checks whatsoever validating types. It's absolutely possible that a value at runtime does not match the type that TypeScript assigned to it. The result is a runtime exception when you try to access the value in a way that it doesn't support (even though its type says it should have). And yet, people love TypeScript, it clearly provides value despite this.

> This stuff makes a programming language theorist's head explode but it practice it works. Look, anything can be invalid in ways you never thought of, and no type system can fully defend you from that. You gotta get comfortable with the idea that exceptions might be thrown from anywhere, and design systems to accommodate failure.

It's only possible if you're doing something wrong type-wise. In particular, when ingesting an object you're supposed to validate it before/as you assign the type to it. Delaying the error until the particular field is accessed is bad TypeScript! Those kinds of exceptions aren't supposed to be thrown from anywhere.
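A sketch of the idiomatic version (hypothetical User type): validate at ingestion with a type guard, so the type is assigned only once the shape has actually been confirmed:

```typescript
interface User {
  name: string;
  age: number;
}

// Runtime check that doubles as a compile-time type assertion.
function isUser(v: unknown): v is User {
  return (
    typeof v === "object" && v !== null &&
    typeof (v as Record<string, unknown>).name === "string" &&
    typeof (v as Record<string, unknown>).age === "number"
  );
}

const raw: unknown = JSON.parse('{"name":"Ada","age":36}');
if (!isUser(raw)) throw new Error("invalid payload");
// From here on, raw is a User and field accesses cannot surprise us.
const greeting = `hello ${raw.name}`;
```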


I think this comes from everyone wanting to use the same schema and parser. For example, a text editor and a compiler have obvious differences in how to deal with invalid programs.

Maybe there need to be levels of validation, like "it's a text file" versus "it parses" versus "it type checks."


Sure, that would also have been a fine solution. There are lots of ways to tackle it really and some of it is just very subjective. There's a lot of similarities here between the NoSQL vs SQL debates. Do you want a schemaless collection of JSON documents or do you want enforced schemas, people can debate this stuff for a long time.

You can also see it as a version control and awareness problem rather than a schema or serialization problem. The issues don't occur if you always have full awareness of what code is running and what's consuming what data, but that's hard especially when you take into account batch jobs.


> The argument is actually more like: don't use badly written middleman software that tries to parse messages it doesn't need to parse.

> I was at Google when the "let's get rid of optional" crusade started. It didn't make sense to me then and over a decade later it still doesn't. If a program expects a field to be there then it has to be there, removing the protobuf level checking just meant that programs could now read garbage (some default value) instead of immediately crashing. But the whole reason we have types, assertions, bounds checking and so on is because, almost always, we'd like our software to NOT just blindly plough on into undefined territory when it doesn't understand something properly, so in reality it just means everyone ends up coding those very same required-ness assertions by hand.

Yeah, that's what stuck out to me from the linked explanation as well; the issue wasn't that the field was required, it was that the message bus was not doing what was originally claimed. It sounds like either having the message bus _just_ process the header and not the entire message, or having the header carry a version number indicating which fields are required (with version numbers newer than the latest the bus was aware of being treated as having no required fields), would have solved it. I don't claim that it's never correct to design a protocol optimizing for robustness when consumed by poorly written clients, but I similarly struggle to see why making that the only possible way to implement a protocol is the only valid option. Maybe the goal of cap'n proto is to be prescriptive about this sort of thing, so it wouldn't be a good choice for uses where there's more rigor in the implementation of services using the protocol, but if it's intended for more general usage, I don't understand this design decision at all.


What you say is valuable, and it's kind of odd that some people here discard practical experience in favor of their subjective flavor of theoretical correctness.


The distributed part shifts the problem from "find types that represent your solution" to "find a system of types that enable evolution of your solution over time." I think this is why bad things like json or xml do so well: they work fine with a client dev saying, "I need this extra data" and the server dev adding it, and then the client dev consuming it.

The more modern approaches, like protobuf or capn proto are designed with the experience of mutating protocols over time.

It works pretty well too unless the new field changes the semantics of old field values, e.g. adding a field "payment_is_reversal_if_set" to a payment info type, which would change the meaning of the signs of the amounts. In that case, you have to reason more explicitly about when to roll out the protocol readers and when to roll out the protocol writers. Or version it, etc.
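A small sketch of that hazard (hypothetical field names): an old reader that predates the new field silently misreads the writer's intent:

```typescript
// v1 schema: the sign of the amount carries its meaning implicitly
interface PaymentV1 {
  amountCents: number;
}
// v2 adds a field that reinterprets the meaning of amountCents
interface PaymentV2 extends PaymentV1 {
  isReversal?: boolean;
}

// An old reader, unaware of isReversal, sums amounts naively.
function oldReaderTotal(payments: PaymentV1[]): number {
  return payments.reduce((t, p) => t + p.amountCents, 0);
}

const wire: PaymentV2[] = [
  { amountCents: 100 },
  { amountCents: 100, isReversal: true }, // writer means -100
];
const total = oldReaderTotal(wire); // 200, but the writer intended 0
```

No serialization format catches this; the readers and writers have to be rolled out in a deliberate order, or the protocol versioned.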


> Can't we extend this argument to eliminating basically all static typing?

No, because static typing exists in all sorts of places. This argument is primarily about cases where you're exchanging data, which is a very specific use case.


To elaborate on your point:

Static type systems in programming languages are designed to break at compilation-time. The reason this works is because all users are within the same “program unit”, on the same version.

In other words, static typing allows more validation to be automated, and removes the need for multiple simultaneous versions, but assumes that the developer has access and ability to change all other users at the same “time” of their own change.

I find this whole topic fascinating. It seems like programmers are limited to an implicit understanding of these differences but it’s never formalized (or even properly conceptualized). Thus, our intuition often fails with complex systems (eg multiple simultaneous versions, etc). Case in point: even mighty Google distinguished engineers made this “billion-dollar mistake” with required fields, even though they had near-perfect up-front knowledge of their planned use-cases.


It's actually the opposite. The billion dollar mistake is to have pervasive implicit nullability, not to have the concept of optionality in your type system. Encoding optionality in the type system and making things required by default is usually given as the fix for the billion dollar mistake.


Huh? Did you read the link, from the guy who was there during the major failure at Google that led to proto3 being redesigned without that flaw?

The whole lesson is that you can’t apply the lessons from static type systems in PLs when you have multiple versions and fragmented validation across different subsystems. Counter-intuitively! Everyone thought it was a good idea, and it turned out to be a disaster.


I did read the link and I was at Google at the time people started arguing for that. With respect, I think the argument was and still is incorrect, that the wrong lessons were drawn and that proto3 is worse than proto2.


Alright, fair enough. Apologies for the dismissive tone. Could you elaborate (or point to) these wrong lessons or an alternative?


OK, what do you do when a message comes in missing a field? Crash the server?


you reject the message in the framework? and if the client is aware it’s required they fail to send?

the bigger challenge with proto3 is that people use it both for rpc and storage, in some cases directly serializing rpc payloads. Disregarding how awful a choice that is, you likely want to trade off flexible deserialization of old data at the expense of rigidity, and conformance.


It remains a big asterisk to me why some random middleware was validating an end-to-end message between two systems instead of treating it as an opaque payload.

Why are we not having this debate about "everything must be optional" for Internet Protocol (IP) packets, for example? Because it's just a binary payload. If you want to ensure integrity, you checksum the binary payload.


Things like distributed tracing, auth data, metrics, error logging and other "meta-subsystems" are certainly typical use cases. Reverse proxies and other HTTP middleware do exactly this with HTTP headers all the time.


No one has near-perfect up-front knowledge of a software system designed to change and expand. The solution space is too large, and efficient delivery methods are a search through this space.


I may have phrased it poorly. What I should have said is that Google absolutely could have “anticipated” that many of their subsystems would deal with partial messages and multiple versions, because they most certainly already did. The designers would have maintained, developed and debugged exactly such systems for years.


Makes sense: they knew arbitrary mutability was a requirement but did not think it through for the required keyword.


Static types are a partial application/reduction when certain mutable or unknown variables become constants (i.e. "I for sure only need integers between 0-255 here").

I'm not rejecting static types entirely, and yes I was discussing exchanging data here, as Alan Kay's OOP is inherently distributed. It's much closer to Erlang than it is to Java.


> I'm not rejecting static types entirely, and yes I was discussing exchanging data here

OK I guess I'm having a hard time reconciling that with:

> basically all static typing


I'm not the person you're responding to, but I interpreted their comment as, "doesn't the argument against having protobuf check for required fields also apply to all of protobuf's other checks?"

From the linked post: "The right answer is for applications to do validation as-needed in application-level code. If you want to detect when a client fails to set a particular field, give the field an invalid default value and then check for that value on the server. Low-level infrastructure that doesn’t care about message content should not validate it at all."

(I agree that "static typing" isn't exactly the right term here. But protobuf dynamic validation allows the programmer to then rely on static types, vs having to dynamically check those properties with hand-written code, so I can see why someone might use that term.)


Sorry, I see how I was vague. The idea is that you have no "pre-burned" static types, only dynamic types. Static types then become a disposable optimization compiled out of more dynamic code, the same way JIT compilation works in V8 and the JVM, for example (where type specialization is in fact part of the optimization strategy).


You're describing dynamic types


But with the benefit of static types, and without the drawbacks of static types.


No. "Types only known at runtime" are dynamic types. "And also you can optimize by examining the types at runtime" is just dynamic types. And it does not have the benefit of static types because it is dynamic types.


This is devolving into a "word definition war" so I'll leave aside what you call static types and dynamic types and get down to specifics. Type info is available in these flavors, relative to runtime:

1. Type info which is available before runtime, but not at runtime (compiled away).

2. Type info which is available at runtime, but not at compile time (input, statistics, etc.).

3. Type info which is available both at compile time and runtime (say like a Java class).

When you have a JIT optimizer that can turn [3] and [2] into [1], there's no longer a reason to have [1], except if you're micro-optimizing embedded code for some device with 64 KB of RAM or whatever. We've carried these legacy practices forward without ever questioning them, and we try to push them way out of their league into large-scale distributed software.

When I say we don't need [1], this doesn't mean I deny [3], which is still statically analyzable type information. It's static types, but without throwing away flexibility and data at runtime, that doesn't need to be thrown away.


Short of time travel, one cannot turn (3) or (2) into (1). I'm not sure where the confusion here is or what you're advocating for, because this isn't making sense to me.

> there's no longer a reason to have [1]

I guess if you're assuming the value of static types is just performance? But it's not, not by a long shot - hence 'mypy', a static typechecker that in no way impacts runtime.

I think this conversation is a bit too confusing for me so I'm gonna respectfully walk away :)


The confusion is to assume "runtime" is statically defined. A JIT generates code which omits type information that's determined not to be needed in the context of the compiled method/trace/class/module. That code still "runs"; it's still "runtime".


Yes, the types that JIT omits are dynamic types.


It's up to you.

It's easy to imagine any statically typed language having a general-purpose JSON type. You could imagine all functions accepting and returning such objects.

Now it's your turn to implement the sum(a,b) function. Would you like to allow the caller to pass anything in as a and b?
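Concretely, with such a general-purpose JSON type, the check the compiler used to do for free moves into every consumer (a sketch, in TypeScript):

```typescript
type Json =
  | null
  | boolean
  | number
  | string
  | Json[]
  | { [key: string]: Json };

// Every consumer must now re-validate by hand what the static signature
// sum(a: number, b: number) would have guaranteed at compile time.
function sum(a: Json, b: Json): number {
  if (typeof a !== "number" || typeof b !== "number") {
    throw new TypeError("sum expects numbers");
  }
  return a + b;
}

const ok = sum(1, 2); // 3
let rejected = false;
try {
  sum("1", 2); // type-checks against Json, fails at runtime
} catch {
  rejected = true;
}
```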


This is like when people use protobuf to send a list of key-value mappings and call that a protocol. (I've seen that same design in many protocol description arenas, even SQL database schemas that are just (entityId INT, key CLOB, value BLOB).)


Do you need to make different versions of a program exchange information even though they do not agree on the types? No? Then this argument cannot be extended this way.


See my sibling comment, e.g. with respect to Rich Hickey's framing - https://news.ycombinator.com/item?id=36911033





