MiniLang: A type-safe C successor that compiles directly to x86_64 assembly (github.com/nicup14)
53 points by NICUP14 on Dec 20, 2023 | hide | past | favorite | 56 comments


> A type-safe C successor that compiles directly to x86_64 assembly.

...

> Warning

> There is no safety measure regarding string literal manipulation. Doing this will most probably result in a segmentation fault.

  # String literal (int8*)
  "abc"

  # Character literal (int8)
  'c'

  # Undefined behaviour
  "abc" at 0 = 'd'
This doesn't seem very type safe, at least not yet.

This is a cool project, though! I have been wanting to learn to write compilers. They're the coolest things.

Also, this is a side note, but what I actually hope is the successor to C is higher level languages having the option to restrict oneself to a low-level variant that can manipulate pointers, and/or let you embed assembly under macros and special functions, like this old blog shows how to do for Lisp[1]. It would be cool if more languages let you do stuff like that.

[1]: https://pvk.ca/Blog/2014/03/15/sbcl-the-ultimate-assembly-co...


That's what "under development" means, but you're correct


>I actually hope is the successor to C is higher level languages

Rust

>having the option to restrict oneself to a low-level variant that can manipulate pointers

unsafe Rust


A killer feature of C is its simplicity. It is so simple that there are dozens, probably hundreds of independent C compilers and C standard libraries.

Rust has no formal specification and no independent implementations even close to complete. gccrs is still struggling to compile the standard library after years of development by some of the smartest compiler writers in the world.

Rust is a successor to C++, not C. Rust may be a lot of things but I don't believe it will ever be a successor to C.


> Rust is a successor to C++, not C. Rust may be a lot of things but I don't believe it will ever be a successor to C.

Well, the Linux kernel, which is C and ASM, rejected C++ (for reasons that were not so much about the language itself) and is adopting Rust. Rust is a language & a runtime. You can write C-like code in Rust and C++-like code in Rust or mix'n'match however you like (you can even get close to ML style languages).

> Rust has no formal specification and no independent implementations even close to complete

Rust has plenty of formal specifications. Their whole process is to formally specify how a thing works & then implement it against that spec.

It's not standardized by an external body / doesn't have a snapshot of the standards that are relevant to implement for a given edition. I view standardizing a language as an anti-pattern for similar reasons that attempts to do that with human languages fail: it ossifies things whereas languages really need to continue to grow and evolve (+ standards bodies become their own weird little fiefdoms that often have nothing to do with achieving the goals they're nominally supposed to be focusing on).

Same goes for independent implementations. It's fine for C because C as a language is a very simple thing and hasn't really changed in 30+ years and doesn't really have a runtime (most of the things people call C is actually POSIX).

It has not worked out well for C++ which has stagnated quite badly, having to implement the same feature 3+ times and having representatives from each arguing against features that are more difficult to inject into their architecture and for features that are easier is part of the reason. There are positives in that errata can be found more likely when you implement the same thing 3 times, but it's not clear to me that it's net better than just implementing a thing in the nightly release chain & letting it bake until it's ready.

Having a single frontend simplifies a lot of things. gccrs struggling is a positive thing as far as I'm concerned - it means that the Rust language continues to evolve and isn't concerning itself with needing to support a fork of the frontend.


If you want simple, there's always Pascal.


… and Modula 2!!


Rust isn’t even really a successor to C++


Rust is too complex and clunky to ever gain widespread acceptance outside its niches.

It's been a decade already; we recovering and current zealots all need to wake up and smell the roses.

For system programming, treating memory as a critical resource with strictly traced ownership is a good design. For the other 97% of programming, memory needs to be there when you need it, to be plentiful and to get out of your way and not be a source of runtime bugs or compiler friction. Slight inefficiency is acceptable.


> Rust is too complex and clunky to ever gain widespread acceptance outside its niches.

Rust is a systems programming language (it didn't start out that way but is where it pivoted to) so I think we agree on that niche. Most people don't work on problems that Rust would help with, but the places it's suited for are seeing more Rust adoption, not less, and those niches are quite large. For example, Linux kernel, Android, Windows kernel, Chrome, etc etc. Basically any C/C++ codebase, if it really needs to be C/C++, will get migrated and there's a metric fuckton of such code. There's probably more JS, Java, Python, Ruby, etc but those programs don't need to be migrated to Rust unless they need more performance/multithreading and can benefit from Rust in that way. Rust not making sense for JS, Java, Python, Ruby, etc coders does not make it a failure or a valid claim that it doesn't have "widespread acceptance" - after all those other languages also don't see widespread acceptance outside their niches.


What you're describing is not the successor of C, though. It's one of the many managed-memory languages.


Well, as usual, use whatever language fits your needs best. But Rust's main benefit is not performance, it is safety. There are very few languages in the industry that I feel have my back when it comes to reliability and one of them is Rust.


It's both. It's fast enough to provide an alternative to C and C++ purely due to its performance, even ignoring that it's more enjoyable to write Rust than C and C++, and ignoring that Rust code is reliable. That's why Discord switched from Go to Rust for performance reasons, and why there are Python developers that replaced C with Rust for high-performance Python extension modules.


> Slight inefficiency is acceptable.

Sure - and in rust that probably means boxed types everywhere. Maybe the worst feature of Java was the easy access to bare types. For rust it might be a culture thing - but that might change.


A decade is nothing in the lifespan of a programming language


I've written some Rust. This isn't by far the first language that aims to replace C. I haven't touched Rust for a while, and now decided that I want to get into Spark (that's Ada's thing). And going through the process of learning a new language that's somewhere close to C, for... I don't know exactly which time now, I realized that:

The unsafe aspects of C are what you pay for comfort. Sometimes, when you write a program you simply don't know what type it's going to be, and the details of that type aren't important right now. You just need the convenience of casting everything to everything or passing "void *" around just to get things done.

Similar to Rust's lifetimes Ada has what it calls "access checks". Roughly, it's a static check that your pointers are always valid. Similar to Rust's lifetimes it's a huge pain to deal with. Sometimes you'd sit in front of your monitor and be like "this code is correct! why don't you just do what I'm asking you to do and let it fail at runtime so I can tell where the problem is?" And then you start to weasel around. You don't spend time actually working on the problem you need to solve, you just typedef this, expand the definition of that, try making something a constant, a variable, try not passing some piece of data -- all kinds of manipulations just to make the checker happy.

All while in C you can just tell it to shut up and move onto the next goal. It's more natural to deal with problems when your code actually fails.

So, in the aftermath: do I like C programs? -- Absolutely not. They are unsafe and will potentially do all sorts of bad things. But do I like writing in a safer language? -- No... not really.

I wish there was a way for humans to write in C and then some magical compiler would read that code, understand what the human actually wanted and produce an Ada or Rust program from it. While there isn't such a thing, maybe the next best thing is to write first in C, and then rewrite it in a safer language?



I'm pretty sure I had seen a C# project years back that jailbroke the iPhone as well, but I don't even remember how it worked, it was probably a decade back.


Rust is much more akin to C++ than C in terms of complexity. It's early but Zig[1] might fit the bill.

[1] https://ziglang.org/


Not sure how. Zig is targeted as a systems level language but isn't memory safe. It's complex enough in its own way. I don't see it faring well against Rust - it'll have its proponents and enthusiasts but I doubt it'll ever see the same adoption as Rust (i.e. it'll be closer to what D accomplished trying to supplant C++).


I assume also Zig, Nim and D fall into this category as well. I'm not sure about Crystal, and others so apologies if I missed some.


Memory safety and type safety are different things. They usually go together, but not necessarily so.


In this case, a type-safe programming language would stop you from trying to write to read-only memory at compile time.


This seems like it ought to make sense, but do you have a concrete example?


Null pointer dereference, out of bound array accesses.


I was looking for concrete issues which violate either memory safety or type safety but not both. However, these examples are clearly both.

Let's look at the null pointer case first: Is this a memory safety issue? Yes, this is definitely a memory safety issue. The actual hardware has no concept of a "null" pointer, it just uses raw addresses - so if we allow this operation typically we're accessing arbitrary memory which happens to have an address corresponding to our "null" pointer. How about type safety? Well we violate type safety if we perform any operation on an object which isn't sanctioned by its type, and dereferencing isn't a sanctioned operation for the null pointer type.

Now how about array bounds? Memory safety is pretty obvious here, this is the classic stack smash beloved of "hackers". And type safety? The bounds miss is not a sanctioned operation for this type.


That's not really how most type systems work. Take array access. The typical typing is: if you have an `Array[A]` and reference an `Int` index you get an element of type `A`.

ref: Array[A] Int => A

This says nothing about what happens if you access an out of bounds element. In a memory safe language this is enforced by a runtime check. Without memory safety you get YOLO semantics like in C.
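To make the runtime-check point concrete, here is a minimal C sketch of roughly what a memory-safe language does on every indexing operation. The helper name `checked_get` and its signature are made up for illustration; the point is that the check sits next to the type `ref: Array[A] Int => A` and is not part of it.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper (not from any library): reads arr[index] only when
   the index is in bounds. Returns true and stores the element in *out on
   success; returns false and leaves *out untouched otherwise. */
static bool checked_get(const int *arr, size_t len, size_t index, int *out) {
    if (index >= len) {
        return false;      /* the check a memory-safe language inserts for you */
    }
    *out = arr[index];     /* plain C would do only this line */
    return true;
}
```

A caller would write something like `if (!checked_get(data, 3, i, &v)) { /* handle the error */ }`. Plain `data[i]` has the same static type either way, which is why the type alone does not rule out the out-of-bounds case.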


I don't know what kind of type system you have in mind then. Do you have an example of language where all array types have a way to check, statically, that all indexing operations are in bound? How would that even work? A language without dynamically-sized arrays at all, maybe?


I think you can do this with proof assistants, and with the SPARK version of Ada.

You can also do it with a dependent type system. I haven't used it, but from what I've read, in the ATS programming language, which is dependently typed, you can guarantee that array index accesses and pointer arithmetic never go out of bounds, at the cost of having to pass proof variables explicitly to various functions.

This poster goes over how to ensure you don't cause buffer overflows in it:

https://bluishcoder.co.nz/2014/04/11/preventing-heartbleed-b...

https://bluishcoder.co.nz/2018/01/10/capturing-program-invar...

It's a bit obscure; this is the only person who's ever blogged about ATS, as far as I can tell.


Two things.

Firstly, it is simply not the case that all constraints must be able to be statically enforced for them to constitute part of the type's definition. I'd guess that's a huge part of the confusion here.

But, yes, actually all WUFFS indexing is statically checked. It's using inference to conclude the range of the index, so if it won't fit that's a static type mismatch.


Link: https://github.com/NICUP14/MiniLang

# Mini Lang

A type-safe C successor that compiles directly to x86\_64 assembly.

## Features

* Minimal
* Compiled
* Typed
* Functional*
* Inter-op with C functions

Minimal - As close as possible to actual assembly code while maintaining as many high-level features as possible.


Firstly, I guess you state these aren't built in yet, so I guess it might change, but I wonder if the lack of unsigned primitive types is intentional? It seems like something I would want to be built in.

Secondly, you have byte as int8 and char as int8 - in reality, either could be unsigned surely? For me, byte is unsigned. As I mostly do C# these days, char is not generally used as a numeric value also. It feels like - if you have byte, char can represent an actual character value. But it feels like not making char multibyte (UTF8 would work) it makes your typing system need to deal with all the WChar nonsense/legacy. Better to just make it UTF8 now? If you don't have byte, I guess you can have char, but that feels like more legacy.
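For what it's worth, here is a small C sketch of why the signedness of an 8-bit type matters. The helper names `as_signed` and `as_unsigned` are invented for this example; it only assumes the standard `int8_t` and `uint8_t` types from `<stdint.h>`.

```c
#include <stdint.h>
#include <string.h>

/* Reinterpret one raw byte as a signed 8-bit integer (like an int8 "byte"). */
static int as_signed(unsigned char raw) {
    int8_t s;
    memcpy(&s, &raw, 1);   /* copy the bit pattern, no value conversion */
    return s;
}

/* Reinterpret the same raw byte as an unsigned 8-bit integer. */
static int as_unsigned(unsigned char raw) {
    uint8_t u;
    memcpy(&u, &raw, 1);
    return u;
}
```

The same bit pattern `0xFF` reads as -1 through the signed type and as 255 through the unsigned one, so a language with only int8 pushes that choice onto every piece of code that handles raw bytes.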


This language calls itself functional - does it support higher order functions? What about closures? Or function composition and (de)currying? Does it have pattern matching like you find in Haskell or ML?

Also, what about C is type unsafe and how does minilang fix it? Would love to have seen some comparisons.


>what about C is type unsafe

For example, that it lets you implicitly cast incompatible pointers, and implicitly cast between integer and pointers. I believe this compiles:

function_pointer f = "a"["a"];

f();


How does this language address C's biggest problem, memory safety?


It doesn't.


> # Undefined behaviour
> "abc" at 0 = 'd'

Why is this undefined? Because the string literal is read-only?


They are apparently allocated in a non-writable section. Non-const qualified string literals cannot* be modified in C either. The following snippet will cause a segfault:

    #include <stdio.h>

    int main(void) {
        char *s = "abc";   /* points into read-only storage */
        s[0] = 'd';        /* undefined behaviour */
        puts(s);
    }
*) At least in any environment I'm aware of. If you know one where they are, let me know.
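For contrast, a minimal sketch of the well-defined version in C: copy the literal into writable storage first and modify the copy. The helper `patch_first` is a name invented for this example, not something from MiniLang or the C library.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copies the literal "abc" into the caller's buffer
   and overwrites its first character. Returns buf on success, or NULL if
   the buffer is too small to hold the copy plus its terminator. */
static const char *patch_first(char *buf, size_t cap, char c) {
    const char *lit = "abc";   /* const: a write through lit will not compile */
    size_t n = strlen(lit) + 1;
    if (cap < n) {
        return NULL;
    }
    memcpy(buf, lit, n);       /* buf is ordinary writable memory */
    buf[0] = c;
    return buf;
}
```

Declaring the pointer as `const char *` is what turns the segfault into a compile error in C, and `char s[] = "abc";` gives the same kind of writable copy without a helper.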


...not in WASM (running in browsers at least, other WASM runtimes might implement the WASM heap differently and allow write-protected pages).

PS: FWIW, wasmtime happily runs the code:

    wasmtime bla.wasm
    dbc


(Of course, since this is undefined behaviour - https://www.open-std.org/jtc1/sc22/wg14/www/docs/n1570.pdf , section 6.4.5 para 7 - literally any behaviour from any implementation is perfectly compliant.)


My guess is that the actual behaviour depends on the runtime environment. On some platforms you might get a segfault, on others it might overwrite the first character.


Yeah I think this is the answer. Maybe it should be obvious, but I had to think it through. Maybe not the best example for the first-exposure documentation in the README?


Maybe indexing starts at 1 ?


Oh, my bad. I somehow confused the assignment with an equality test.


Well, it looks a bit early to form a strong opinion on the language, but it's always nice to see a new take on low-level-meets-safe! While I'm personally a big fan of Rust, it's clear that there are still many alternative approaches to explore.


Strange syntax choices, Pascal/Ruby style statements with C-like type notation order?


This isn't C-like type notation order though?

    let variable: type = value
Or do you mean pointer and array notation?

    let pointer: type* = address
    let array: type[n] = [elem1, elem2, ..., elemn]
...at least the array type looks better to me than putting the size in front. Putting the '*' after the type is debatable though.


Yes, I meant pointer (and array) notation. To give a few examples of notation in other languages:

Modula-2: p: POINTER TO INTEGER

Nim: p: ref int

Odin: p: ^int

Go: p: *int


The modifiers (pointer, "function taking _ returning _", "array N of", etc.) are prefixed, but they could just as well be postfixed. I think people don't do it because it's more similar to English to make the modifiers prefixed.

C:

  int *p;
Prefix:

  p: pointer to int;
Postfix:

  p: int pointer;
Another example.

C:

  int (*fp_arr[5])(int, int);
Prefix:

  fp_arr: array 5 of pointer to function(int, int) to int;
Postfix:

  fp_arr: int (int, int) pointer array 5;
This is something I was thinking about because some non-English languages use post positions instead of prepositions, and say the verb last instead of in the middle. For example, in English, you could say "He looks at me," and in some other languages, you'd say the equivalent of "He me-at looks". Instead of "I went to the bathroom" it would be the equivalent of "I bathroom-to went." And instead of "the fifth day of the month" it might be something like "month of fifth day", except "of" has the opposite meaning. I'm speaking vaguely because I've only read about such languages from a linguistics perspective, and I don't know them. Specifically, head-initial vs head-final languages: https://en.wikipedia.org/wiki/Branching_(linguistics)
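As a rough illustration in C itself, a typedef lets you read the second example in the "array 5 of pointer to function" order without the inside-out declarator. The names `binop`, `add` and `mul` are made up for this sketch.

```c
/* binop: pointer to function(int, int) returning int */
typedef int (*binop)(int, int);

static int add(int a, int b) { return a + b; }
static int mul(int a, int b) { return a * b; }

/* Same type as: int (*fp_arr[5])(int, int);
   Reads as "array 5 of binop". Slots without an initializer are null. */
static binop fp_arr[5] = { add, mul };
```

Calling through the array looks the same with either spelling, e.g. `fp_arr[0](2, 3)`.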


And as for arrays,

    VAR array: ARRAY n OF type = [elem1, elem2, ..., elemn];
also doesn't look questionable to me, with or without square brackets around "n".


I've noticed that the trend with new languages is the Pascal-style postfix type declaration. E.g.:

    let foo: int = 0
Instead of

    int foo = 0;
This makes me sad


The former syntax allows user defined types (that aren't reserved keywords) to be used, correct?


What do you mean? You can declare user-defined types in C and C++ without that syntax.

In C++:

  struct MyType {
      int field;
  };

  MyType m;
In C, you would need a typedef to get the same behavior, but the same applies otherwise.

  typedef struct {
      int field;
  } MyType;


You're correct. Of course the price for this is not only does it cause compiler authors to go bald earlier, it also means that you have natural language-style garden path parsing problems where the meaning of an expression isn't obvious at all and may take several attempts to discern.

And so it also leads to situations where a human thinks "I made a Doodad" but the compiler sees instead "For some reason the human just decided to draw my attention to the fact Doodads exist, which I already knew". In a language like C or C++ where you do need to constantly repeat yourself because it's designed for 1970s compiler technology, warning about this will lead to too many false positives, so it's not done. C++ lock guards are a famous example where this happens - leading to code where authors and even sometimes reviewers believe a lock has been taken properly, but from the compiler's perspective there was never any lock at all so mutual exclusion isn't achieved.


This is mostly an issue with C++ specifically. If you implement the lexer hack[1], C can be parsed using an otherwise context-free grammar.

C++ suffers from requiring arbitrary tokens of lookahead before being able to determine whether something is a declaration or an expression statement, which is a consequence of the combination of C-style declarations, function-like casts, and function-like constructors.

For example, here, the parser cannot know whether the variables x, y, and z are being redeclared in the same scope, or if they're being evaluated and discarded, until it reaches the "new int" part. Last time I tested it, clang++ parsed it correctly, but not g++.

  int x, *y, *z;
  {
    int(x), *y, z, new int;
  }
In C++, "int(x)" is equivalent to "(int) x" (and this doesn't exist in C), but as a holdover from C, it also allows redundant parentheses in declarations, so one could also just redeclare x in a new scope like this (in both C and C++):

  double x;
  {
    int(x); /* x is an int now */
    x = 5;
  }
I think this is a variant of the most vexing parse[2], which your lock guard example also sounds like an example of.

You probably already know about this, but I was surprised by that example even knowing that C++ requires arbitrary tokens of lookahead to parse correctly, because I had only seen the Timekeeper example from Wikipedia.

There's also the problem that it can't parse something like this unless it knows whether it's supposed to be a template instantiation or an expression statement, but I think that's just an extension of the typename/variable problem that it inherited from C.

  a < b, c > d;
[1]: https://en.wikipedia.org/wiki/Lexer_hack

[2]: https://en.wikipedia.org/wiki/Most_vexing_parse


On the other hand, it makes me happy.



