
I really think the future is schema-based.

The evolution of technologies goes something like this:

1. Generation 1 is statically typed / schemaful because it's principled and offers performance benefits.

2. Everyone recoils in horror at how complicated and over-designed generation 1 is. Generation 2 is dynamically typed / schemaless, and conventional wisdom becomes that this is generally more programmer-friendly.

3. The drawbacks of schemaless become clearer (annoying runtime errors, misspelled field names, harder to statically analyze the program/system/etc.). Meanwhile, the static typing people have figured out how to offer the benefits of static typing without making it feel so complicated.

We see this with programming languages:

1. C++

2. Ruby/Python/PHP/etc.

3. Swift, Dart, Go, Rust to some extent, as well as the general trend of inferred types and optional type annotations

Or messaging formats:

1. CORBA, ASN.1, XML Schema, SOAP

2. JSON

3. Protocol Buffers, Cap'n Proto, Avro, Thrift

Or databases:

1. SQL

2. NoSQL

3. well, sort of a return to SQL to some extent; it wasn't that bad to begin with, given the right tooling.

If you are allergic to the idea of schemas, I would be curious to ask:

1. Isn't most of your data "de facto" schemaful anyway? Like when you send an API call with JSON, isn't there a standard set of keys that the server is expecting? Isn't it nicer to actually write down this set of keys and their expected types in a way that a machine can understand, instead of it just being documentation on a web page? (A small example follows below.)

2. Is it the schema itself that you are opposed to, or the pain that clunky schema-based technologies have imposed on you? If importing your schema types was as simple as importing any other library function in your native language, are you still opposed to it?
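To make (1) concrete: if your API already expects a payload like

  {"userId": 42, "email": "jane@example.com"}

then the schema already exists in everyone's head. Writing it down (here in JSON Schema, with made-up field names) just makes it machine-checkable:

  {
    "type": "object",
    "properties": {
      "userId": {"type": "integer"},
      "email": {"type": "string"}
    },
    "required": ["userId", "email"]
  }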




Completely agree. A key thing we realized recently at Snowplow was that people's data starts schema'ed - in MySQL tables, or Protocol Buffers, or Backbone models. Normally when data is being passed around in JSONs (e.g. into/out of APIs, into SaaS analytics), it means the original schema has been _lost_ - not that there was never a schema in the first place. And that's something that needs fixing. We released Iglu (http://snowplowanalytics.com/blog/2014/07/01/iglu-schema-rep...) as a JSON Schema repository system recently and it's been really cool seeing people start to use it for other schema use cases outside of Snowplow.


If anyone finds JSON Schema in Ruby to be too slow, I developed a Ruby-based schema system that is much faster:

http://rubygems.org/gems/classy_hash

https://github.com/deseretbook/classy_hash

I wrote it for an internal backend system at a small ecommerce site with a large retail legacy.

Edit: Ruby Hashes (the base "language" used by Classy Hash) aren't easily serialized and shared, but if there's enough interest, it would be possible to compile most JSON Schema schemas to Classy Hash schemas.
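To give a flavor of it, the whole schema is a plain Ruby Hash mapping keys to expected types. Something like this (simplified; see the gem docs for the exact entry point):

  require 'classy_hash'

  # The schema is just a Ruby Hash of keys to expected types
  schema = {
    key: String,
    count: Integer
  }

  # Assumed entry point per the gem's docs; raises with a descriptive
  # message if the hash doesn't match the schema
  ClassyHash.validate({ key: 'a value', count: 3 }, schema)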


Have you looked at contracts.ruby (https://github.com/egonSchiele/contracts.ruby)? I'm sure you could share some code


Interesting. It looks like contracts.ruby does for method calls what Classy Hash aims to do for API data.
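Something like this, if I'm reading their docs right (constant names may be slightly off):

  require 'contracts'

  class Api
    include Contracts::Core
    include Contracts::Builtin

    # The contract is checked at runtime on every call
    Contract String, Num => Bool
    def update_price(article_id, price)
      # ... business logic ...
      true
    end
  end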


> 1. isn't most of your data "de facto" schemaful anyway? Like when you send an API call with JSON, isn't there a standard set of keys that the server is expecting? Isn't it nicer to actually write down this set of keys and their expected types in a way that a machine can understand, instead of it just being documentation on a web page?

I'd argue that it's not.

Your schema is implicitly defined somewhere in the business logic, and you have to first learn the schema description language in order to translate your application code into schema description code. And when the application code changes, you won't be very excited to adjust the schema again.

Sometimes it's worth the effort and makes development easier, often it's the opposite. An error message saying `error: articleId missing in sale object` is more informative than `schema error in line 4282`.


Here is a JSON Schema validation failure taken straight out of the Snowplow test suite (pretty printed):

  {
    "level": "error",
    "schema": {
      "loadingURI": "#",
      "pointer": ""
    },
    "instance": {
      "pointer": ""
    },
    "domain": "validation",
    "keyword": "required",
    "message": "object has missing required properties ([\"targetUrl\"])",
    "required": [
      "targetUrl"
    ],
    "missing": [
      "targetUrl"
    ]
  }
You can't seriously prefer a NullPointerException (or choose your poison) three functions later.


I'll admit that this doesn't look too bad, but it's an additional effort. You wouldn't do this for very simple formats.


I'm not sure I agree with all your points. For programming languages, hopefully Python and Ruby will not be going away anytime soon. JavaScript mind share is also continually growing.

For databases, weren't there other reasons people got excited about NoSQL databases? Not having a schema was one aspect of it, but it mostly had to do with scaling. Now people realize SQL scales just fine for most of the use cases that were getting replaced with NoSQL. And also that most data (in web pages, at least) is relational in nature.


> For programming languages hopefully Python and Ruby will not be going away anytime soon.

Neither will C++. But languages being designed these days don't look like Python or Ruby; they look like Swift, Dart, and Go.


I'm not sure I would lump Dart in with those, as its typing discipline is completely optional. Also, Julia, Racket and Clojure are all relatively recent and very actively developed. I think dynamic and schema-less approaches have some very serious legs yet :)


I'm not sure I would describe Racket as recent.


Racket is not so much a single language as a family of languages, quite a few of which are quite recent.


okay, fair enough.


Hmm. Maybe because we don't need a new high-level language like Python or Ruby, but there are opportunities for better low-level languages?


Agreed. The recent spate of compiled languages says more about C++ than anything else.


I like the idea of schemas but only when they're built into the message (which is probably not a good way of conveying my meaning).

Essentially JSON gives you numbers, strings, booleans and nulls, so when it's accepted on the other side the receiver obviously knows what's a number, what's a string, etc.

Honestly, if JSON could be expanded to stay essentially the same but add more built-in types, along with a way to bolt on new types (extensibility), then I think it would be perfect for the job.
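Something like tagged values could do it. This is a convention I'm making up on the spot, not any existing format:

  {
    "orderId": 123,
    "placedAt": {"$type": "timestamp", "$value": "2014-07-22T14:32:00Z"},
    "total": {"$type": "decimal", "$value": "19.99"}
  }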

At least in my opinion.


Schemas built into the message certainly have the benefit of being self-describing. But they also have downsides:

- encoding the schema along with the message makes encodings like this less efficient.

- without an ahead-of-time schema, you don't have any canonical list for all the fields that can exist and their types. Instead this gets specified in ad-hoc ways in documentation. For example, like this: https://developers.facebook.com/docs/graph-api/reference/v2....

That URL describes a schema for groups. The schema exists, it's just not machine-readable! That means you can't use it for IDE auto-completion, you can't reflect over it programmatically, and you can't use it to make encoding/decoding more CPU/memory efficient. It's so close to being useful for these purposes, why not just take that final step and put it in a machine-readable format?


It's really frustrating how in 2014 almost everybody is still writing API definitions which are only human-readable. Two worthy exceptions:

- https://github.com/balanced/balanced-api

- https://helloreverb.com/developers/swagger

You can have self-describing messages without having to embed the schema in the instance. Instead you embed a reference to the schema in the message. We came up with an approach to this called self-describing JSONs: http://snowplowanalytics.com/blog/2014/05/15/introducing-sel... The Avro community do something similar.
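The shape is roughly this (vendor and schema names here are placeholders):

  {
    "schema": "iglu:com.acme_company/viewed_product/jsonschema/1-0-0",
    "data": {
      "productId": "ABC123"
    }
  }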


> Instead you embed a reference to the schema in the message

AKA XML DTD


You're talking about typing, not really about schemas. With JSON, you can know whether a value is a number as opposed to a string, but you can't know whether it's supposed to be a number.


Dumb question: why doesn't the value being a number tell us it's supposed to be a number?


Assuming the service is operating correctly, I think the more accurate statement is that it doesn't tell you the value will always be a number. That is, perhaps a field has multiple valid types. Maybe that field won't exist all the time, or similar data may be available in a different structure. Without a schema, these questions can't really be answered.


Example: Postal codes. Say you're transferring an address in JSON and you have a postal code field. In the UK, postal codes are strings (e.g. "BS42BG"), easy enough. Now, someone enters a US postal code (90505). Should we transfer it as a number, or a string?


Definitely as a string. Numbers aren't things that have digits. Numbers are things you do math with.


OK, that's logical. So where do we specify this without a schema? What happens if a client sends a number instead of a string to the server? Should it accept it and convert it, or return an error?
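With a schema there's exactly one machine-checkable answer to that, e.g. in JSON Schema:

  {
    "type": "object",
    "properties": {
      "postalCode": {"type": "string"}
    },
    "required": ["postalCode"]
  }

A client that sends 90505 as a number then gets rejected at the boundary with a clear error, instead of the server having to guess.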


Many technologies are developed, but few are adopted. That's where your 3s are.

The usefulness of schemas is inversely proportional to the rate of change. They are great for getting it right, but what's the point if it all has to change before you are done?

Rate of change is a question of fact, not personal preference.

NB. I like getting things right, choose static typing, and am developing a tooling technology to further this.


So I'm confused: are you in favor of the Transit way or against it?


Transit appears to be schema-less, so I would be in favor of other formats that have explicit schemas.


Transit lets you define your own semantic types, with handlers and decoders to map from/to your programming language types.

What exactly would one gain from using schemas, if I can send the value (state) of any of my static types to another application using Transit?


> What exactly would one gain from using schemas, if I can send the value (state) of any of my static types to another application using Transit?

Interoperability with other languages, for one. The static type you defined in your language can't be used with any other languages. Schemas are static types that can be used across languages.
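For example, a single Protocol Buffers definition (field names borrowed from the sale example upthread) compiles to native types in C++, Java, Python and more:

  // sale.proto (proto2 syntax)
  message Sale {
    required string article_id = 1;
    optional int64 amount_cents = 2;
  }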


Right, but you can write a decoder and let Transit convert your type to an equivalent type in another language.

That's the whole point of Transit: interoperability with other languages, with a good set of scalar types, basic composite types, and the ability to extend them with your own semantic types built recursively from the base types.
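For example, in transit-ruby it looks something like this (sketched from memory; see the transit-ruby README for the exact handler protocol):

  require 'transit'
  require 'stringio'

  Point = Struct.new(:x, :y)

  # Write handler: maps a Point to a "point" tag plus an array representation
  class PointWriteHandler
    def tag(_) "point" end
    def rep(p) [p.x, p.y] end
    def string_rep(_) nil end
  end

  # Read handler: rebuilds a Point from the wire representation
  class PointReadHandler
    def from_rep(rep) Point.new(*rep) end
  end

  io = StringIO.new('', 'w+')
  writer = Transit::Writer.new(:json, io, handlers: { Point => PointWriteHandler.new })
  writer.write(Point.new(3, 4))

  reader = Transit::Reader.new(:json, StringIO.new(io.string),
                               handlers: { "point" => PointReadHandler.new })
  reader.read  # => #<struct Point x=3, y=4>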



