
Main is usually a function, so when is it not? (2015) - phreack
http://jroweboy.github.io/c/asm/2015/01/26/when-is-main-not-a-function.html
======
ChuckMcM
Here is a link to the several times this has appeared here:
[https://news.ycombinator.com/from?site=jroweboy.github.io](https://news.ycombinator.com/from?site=jroweboy.github.io)

I find these "hey there is this language called C and you can do some really
weird stuff in it!" articles humorous because from the beginning of time there
have been people fascinated by this aspect of the C language, and they gave
rise to the whole obfuscated C contest thing.

The idea that a symbol references an address and its type is only a
convenience in the source code always messes people up. When I've taught C to
people there is always that one person who says "What if you cast a string
pointer to a function and called it! Huh?" and I explain that is perfectly
legal C and has been exploited for years and you can practically see their
brain change conceptual planes in mid-air :-)
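
A minimal sketch of the "a symbol is just an address" idea, without going as far as executing data. The names `add_one`, `address_of_add_one` and `call_through_address` are made up for illustration, and converting a function pointer to `uintptr_t` and back is implementation-defined rather than guaranteed by ISO C, although it behaves as expected on mainstream platforms.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical example function; any int(int) function would do. */
int add_one(int x) { return x + 1; }

/* Erase the type of add_one: all that is left is a number. */
uintptr_t address_of_add_one(void)
{
    return (uintptr_t)add_one;
}

/* Treat `address` as the entry point of an int(int) function and call it.
 * The integer carries no type information; the cast supplies it. */
int call_through_address(uintptr_t address, int arg)
{
    int (*fn)(int) = (int (*)(int))address;
    assert(fn != 0);
    return fn(arg);
}
```

Calling `call_through_address(address_of_add_one(), 41)` runs `add_one` even though the caller only ever held an integer. Nothing checks that the address really points at a function with that signature.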

~~~
microcolonel
I like to start with the assembler, then just tell people that high level
languages are fancy assemblers. Some of them restrict you to using the
designer's favourite macros, some of them allow you to build on highly general
tools at approximately the lowest level, but ultimately all computer
programming languages are fancy macro assembler syntaxes.

~~~
tomjakubowski
This is an OK mental model until a wholly compliant C or C++ compiler blows it
all to smithereens with an optimization exploiting UB.

You gotta write C or C++ for the abstract machine defined by the standards.
Thinking about C in terms of some concrete assembly language or architecture
can be problematic.
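
A small sketch of where the "fancy assembler" model tends to break. Both function names are hypothetical. The first check assumes the add wraps around as it would in assembly; since signed overflow is undefined in C, an optimizing compiler is allowed to fold it to a constant. The second is written for the abstract machine and never overflows.

```c
#include <assert.h>
#include <limits.h>

/* Written with hardware wraparound in mind. Signed overflow is undefined
 * in C, so a compiler may assume x + 1 never wraps and reduce this whole
 * function to "return 1". */
int can_increment_naive(int x)
{
    return x + 1 > x;
}

/* Written against the abstract machine: no overflow ever occurs. */
int can_increment(int x)
{
    return x < INT_MAX;
}
```

For inputs below `INT_MAX` both agree. At `INT_MAX` only `can_increment` has a defined answer.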

------
poizan42
The thing I'm wondering about with this is why ld by default creates only two
LOAD segments, one r-x and one rw-. .text and .rodata are put into the former
while .data is put into the latter. This is the only reason this works. So my
question is why there isn't a third LOAD segment with access r-- for the read
only data? If anything it would reduce the surface of where to find ROP
gadgets slightly so couldn't hurt, no?

~~~
ajross
That's architecture-specific. On some machines there is indeed a separate
"execute-only" segment in the ELF file. But on many architectures (x86) there
has historically been no hardware support for that, and for compatibility
reasons the standard linker output still maintains the same scheme.

~~~
poizan42
It's not a lack of execute-only segment but of a read-only, no-execute segment
I'm talking about. x86 has had support for that for exactly as long as it has
supported no-execute, which the second LOAD segment is marked with (i.e. the
one that .data goes in).

Platform support is irrelevant; systems that don't support the access flags
simply give you more access. E.g. running a modern Linux on pre-Athlon 64 CPUs
will just cause all readable pages to be executable as well.

~~~
pm215
The compatibility requirement is the result of the historic lack of x86
platform support, though: x86 used not to support r-- permissions, so it put
rodata in an r-x segment, so there are likely programs in the wild which
accidentally rely on that, so the linker can't now tighten the rodata
permissions without breaking some existing set of programs of unknown size.
Whether you think that's a good tradeoff depends on your opinion on the size
of that set of programs and how heavily you weight 'avoid breaking code that
used to work' against 'tighten permissions for security reasons'.

It would be interesting to know if you can ask the linux linker to put rodata
in its own r-- segment -- I scanned the docs but didn't see an option for it.

~~~
poizan42
> x86 used not to support r-- permissions,

Depends on what you mean. The original page table entries have an R/W
(Read/Write) flag; if it is not set, the page is read-only. What you couldn't
do was mark a page non-executable. But nevertheless the second LOAD segment is
marked rw- and not rwx, so it would seem that having segments with unsupported
permissions wasn't deemed a problem in the past.

At the time when we got the NX bit it did happen that some programs broke
because they expected executable data, but the security benefits were more
important.

> It would be interesting to know if you can ask the linux linker to put
> rodata in its own r-- segment -- I scanned the docs but didn't see an option
> for it.

You have to write your own linker script, see e.g. [0].

[0] [http://www.cl.cam.ac.uk/~srk31/blog/devel/custom-elf-phdrs.h...](http://www.cl.cam.ac.uk/~srk31/blog/devel/custom-elf-phdrs.html)

~~~
pm215
By the r-- syntax I meant specifically 'readable, not writable, not
executable' as distinct from rw- 'readable, writable, not executable' or r-x
'readable, not writable, executable'.
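
For readers less used to the notation, here is a rough sketch of how those three-character strings map onto the permission bits of an ELF program header. The bit values match `PF_X`, `PF_W` and `PF_R` from the ELF specification; the `SEG_*` names and `format_segment_flags` are made up here so the snippet does not depend on `<elf.h>`.

```c
#include <assert.h>
#include <string.h>

/* Same values as PF_X, PF_W and PF_R for p_flags in the ELF spec. */
enum { SEG_X = 0x1, SEG_W = 0x2, SEG_R = 0x4 };

/* Write the "rwx"-style string for `flags` into out (3 chars plus NUL). */
void format_segment_flags(unsigned flags, char out[4])
{
    out[0] = (flags & SEG_R) ? 'r' : '-';
    out[1] = (flags & SEG_W) ? 'w' : '-';
    out[2] = (flags & SEG_X) ? 'x' : '-';
    out[3] = '\0';
}
```

With this, `SEG_R` alone formats as `r--`, `SEG_R | SEG_W` as `rw-`, and `SEG_R | SEG_X` as `r-x`, which are the three cases being distinguished above.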

~~~
dfox
On i386 a descriptor cannot be both writable and executable at the same time.
But in order to support sane semantics for C, a typical Unix OS (which for
purposes of this discussion includes 32bit Windows) loads CS, DS and SS with
different descriptor selectors that nevertheless alias to the same range of
linear addresses, and thus essentially disables most of the MMU's protection
logic and relies only on paging. And traditional 32bit i386 page table entries
only have two flags: accessible at all (called "present") and writable.
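
To make the consequence concrete, here is a simplified sketch that derives the effective permissions of a page from a classic (non-PAE) 32-bit page table entry. The bit positions (present in bit 0, R/W in bit 1) are the architectural ones; `effective_perms` and the macro names are hypothetical, and the sketch deliberately ignores the user/supervisor bit, CR0.WP and segmentation.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Low bits of a classic (non-PAE) 32-bit x86 page table entry. */
#define PTE_PRESENT  0x1u  /* bit 0: page is mapped */
#define PTE_WRITABLE 0x2u  /* bit 1: R/W, set means writes are allowed */

/* Effective permissions as "rwx" text. There is no execute bit in this
 * format, so every present page is both readable and executable. */
void effective_perms(uint32_t pte, char out[4])
{
    int present = (pte & PTE_PRESENT) != 0;
    out[0] = present ? 'r' : '-';
    out[1] = (present && (pte & PTE_WRITABLE)) ? 'w' : '-';
    out[2] = present ? 'x' : '-';
    out[3] = '\0';
}
```

A segment requested as `r--` or `rw-` therefore ends up as `r-x` or `rwx` at the paging level on such hardware.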

------
panic
On the subject of machine code as data, there was an interesting text file
released at this year's SIGBOVIK:
[https://www.cs.cmu.edu/~tom7/abc/paper.txt](https://www.cs.cmu.edu/~tom7/abc/paper.txt)
(you'll need to resize your browser window to read it properly -- for an
easier reading experience you can also check out the PDF and video at
[https://www.cs.cmu.edu/~tom7/abc/](https://www.cs.cmu.edu/~tom7/abc/)).

~~~
userbinator
That is an amazing work of art. I only skimmed it briefly but it would make
for great spare-time reading.

------
dang
Discussed at the time:
[https://news.ycombinator.com/item?id=8951283](https://news.ycombinator.com/item?id=8951283).

------
gumby
The program entry point is actually _start (which does some setup and later
calls main()) so for even more extreme TA befuddlement, write a program that
doesn't even call main()!

~~~
mmjaa

        #include <stdio.h>

        void noMain(void) __attribute__((constructor));
        void noMain(void) {
            printf("Goodbye, main()");
        } // &etc.

~~~
dfox
This actually would not link, because _start() or something it calls into
(depending on the implementation of the CRT on the given platform) would
contain an unresolved reference to main. (And given the fact that all this CRT
startup code is usually one .o, you cannot just patch out the part that calls
main(), you have to replace it completely.)

On the other hand, something along the lines of:


        #include <stdlib.h>
        #include <unistd.h>

        void _start(void) {
          write(1, "Hello world\n", 12);
          exit(0);
        }

should probably work, given that you link it without the CRT startup code
(i.e. gcc -nostartfiles; plain -nostdlib would also drop libc, which this code
still needs for write and exit)

~~~
im3w1l
Now for more confusion, have a main function that _start doesn't call.

Now for even more confusion, have the main function (which never runs) call
the _start function.

~~~
gumby
You are evil!

~~~
therein
He should be writing some crackmes.

------
userbinator
_We joked with him about how he needs to make a program that works, but the
grading TAs wouldn’t be able to figure out how it works._

Unless you come across a TA like me (I've been one before), who will comment
on the fact that using xor+inc or push+pop would be shorter ways to set a
register to a small immediate. ;-)
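
For anyone curious about the sizes involved, here is a sketch listing the x86-64 encodings as byte arrays, the same way the article stores code. The array names are made up; the bytes are the standard encodings, and the comparison is for 64-bit mode (in 32-bit mode `inc eax` is the single byte 0x40, which makes xor+inc one byte shorter there).

```c
#include <assert.h>

/* Four ways to put 1 in rax/eax on x86-64, as raw machine code. */
const unsigned char mov_rax_imm32[] = { 0x48, 0xC7, 0xC0, 0x01, 0x00, 0x00, 0x00 }; /* mov rax, 1 */
const unsigned char mov_eax_imm32[] = { 0xB8, 0x01, 0x00, 0x00, 0x00 };             /* mov eax, 1 */
const unsigned char xor_inc[]       = { 0x31, 0xC0, 0xFF, 0xC0 };                   /* xor eax, eax ; inc eax */
const unsigned char push_pop[]      = { 0x6A, 0x01, 0x58 };                         /* push 1 ; pop rax */
```

Comparing `sizeof` on the arrays gives 7, 5, 4 and 3 bytes respectively.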

------
protomyth
> Since I knew the target system is going to be 64bit Linux

Knowing the target system makes a lot of these things quite a bit easier. I
had really good luck in college knowing our graphics teacher was using a 286.

As a TA you pretty much can be a hardass or save time for the rest of the
semester and just mark 100.

Another early scenario was declaring main() as void in some embedded systems.
I guess there was nothing to return to but it was still odd.

~~~
TazeTSchnitzel
It's weird how compilers complain if main() is void, despite it not having to
return anything if it's int.

~~~
user5994461
You have to return something. If you don't, the exit code is the value that
happened to be in memory at this point.

~~~
clarry
N1256 5.1.2.2.3 p1: If the return type of the main function is a type
compatible with int, a return from the initial call to the main function is
equivalent to calling the exit function with the value returned by the main
function as its argument; reaching the } that terminates the main function
returns a value of 0. If the return type is not compatible with int, the
termination status returned to the host environment is unspecified.

~~~
user5994461
main() is __cdecl__ or __stdcall__ on the major platforms.

That calling convention on x86 specifies that the return value is read from the
EAX register. So, when your main() function exits, the return value is read
from EAX, it's that simple.

The compilers may add boilerplate code around the main, in fact main() is
rarely the real main() function, but that doesn't change the spec.

~~~
TazeTSchnitzel
Did you read the comment you just replied to?!

~~~
pjc50
The two comments are not incompatible, they are just very different
worldviews. One tells you what the standard says (that the termination status
is unspecified if main does not return an int), the other tells you what
usually happens (you get what happened to be in AX).

And as I've just tested, gcc doesn't return zero termination status if you
reach the } at the end of main.

~~~
user5994461
Calling conventions are as much a standard as the C spec.

The main() is called like a regular function, by the _system thing that
executes programs_.

Depending on the compiler and the flags, the main() is not the real entry
point of the program. It can add another entry point to do some magic, like
setting the return code.
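
A rough sketch of that arrangement, with all names hypothetical: a stand-in for the startup code receives the program's main as an ordinary function pointer, calls it like any other function, and hands the return value on as the exit status. Real startup code does considerably more setup and would pass the value to exit() instead of returning it.

```c
#include <assert.h>

/* Signature of a hosted main(). */
typedef int (*main_fn)(int argc, char **argv);

/* Hypothetical stand-in for the startup code that runs before main. */
int run_like_startup(main_fn program_main, int argc, char **argv)
{
    /* A real CRT would set up the stack, environment, stdio etc. here. */
    int status = program_main(argc, argv);
    /* A real CRT would call exit(status) here instead of returning. */
    return status;
}

/* Hypothetical program entry used for demonstration. */
int demo_main(int argc, char **argv)
{
    (void)argv;
    return argc > 1 ? 2 : 0;
}
```

From the caller's side `demo_main` is just a function whose `int` result comes back through the normal calling convention, which is why both views in this subthread can hold at once.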

------
kaushiks
The CLR uses a similar technique to take over control of execution of a
managed EXE -
[http://srevas.net/notes/2007/12/25/mscoree/](http://srevas.net/notes/2007/12/25/mscoree/)

------
Deestan
Unless the assignment in question specifies architecture, the TA could
(should? I know I would) run it on a 32 bit Windows environment and mark it
down for not working.

~~~
userbinator
Not surprisingly, there are tricks one can use to figure out what the
environment is, using the same binary; start here:

[https://stackoverflow.com/questions/38063529/x86-32-x86-64-p...](https://stackoverflow.com/questions/38063529/x86-32-x86-64-polyglot-machine-code-fragment-that-detects-64bit-mode-at-run-ti)

...and if you want to expand it to run on a SPARC or MIPS or something else...

[https://hackaday.io/project/18614-polyglot-one-binary-multip...](https://hackaday.io/project/18614-polyglot-one-binary-multiple-architectures)

The general idea is to find a set of bytes which can be interpreted in
different (and valid) ways by all the architectures you want to support, and
using those differences, jump to architecture-specific code.
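
As a toy illustration of that idea for the x86-32 versus x86-64 case: the byte 0x40 is `inc eax` in 32-bit mode but a REX prefix in 64-bit mode. The sketch below is not an emulator; `toy_run`, `cpu_mode` and `probe` are made-up names, and the decoder understands only the three opcodes it needs.

```c
#include <assert.h>
#include <stddef.h>

enum cpu_mode { MODE_32, MODE_64 };

/* Toy decoder for three opcodes only. Returns the value eax would hold
 * after running code[0..len) in the given mode, or -1 on an unknown byte. */
int toy_run(const unsigned char *code, size_t len, enum cpu_mode mode)
{
    int eax = 0;
    size_t i = 0;
    while (i < len) {
        if (code[i] == 0x31 && i + 1 < len && code[i + 1] == 0xC0) {
            eax = 0;                /* xor eax, eax */
            i += 2;
        } else if (code[i] == 0x40) {
            if (mode == MODE_32)
                eax += 1;           /* inc eax */
            /* In 64-bit mode 0x40 is a REX prefix that changes nothing here. */
            i += 1;
        } else if (code[i] == 0x90) {
            i += 1;                 /* nop */
        } else {
            return -1;
        }
    }
    return eax;
}

/* 32-bit: xor eax,eax ; inc eax ; nop.  64-bit: xor eax,eax ; (REX) nop. */
const unsigned char probe[] = { 0x31, 0xC0, 0x40, 0x90 };
```

The same four bytes leave eax at 1 in 32-bit mode and 0 in 64-bit mode, so real polyglot code can follow them with a conditional jump to the matching architecture-specific section.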

------
monochromatic
Is it undefined behavior to have main be anything other than a function that
returns int?

~~~
delinka
UB comes with high-level languages. "main" is not specified in the high-level
language, but by the OS. You write some machine code, give it the name "main"
and the OS jumps to that location and starts executing. One should adhere to
the "C calling convention" (managing the stack correctly) if that code is
expected to behave within the system.

But "main is not a function" does not produce undefined behavior.

~~~
clarry
> "main" is not specified in the high-level language, but by the OS. You write
> some machine code, give it the name "main" and the OS jumps to that location

main() is specified by C. The OS as such doesn't know or care about main.
Headers in the binary executable specify where the OS should begin execution,
and this is rarely in main.

------
louithethrid
Ah, this reminded me of some embedded OSes where drivers came as precompiled
array blobs that are included with ifdef guards set in some central config
tool.
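
A guess at what such a setup might look like, with every name invented for illustration: the config tool emits `CONFIG_DRIVER_*` defines, and each driver ships as a byte array that is only compiled in when its guard is set. The blob contents here are placeholder bytes, not a real driver.

```c
#include <assert.h>
#include <stddef.h>

/* Normally emitted by the central config tool; defined here for the demo. */
#define CONFIG_DRIVER_UART 1

#ifdef CONFIG_DRIVER_UART
/* Placeholder bytes standing in for a precompiled driver image. */
const unsigned char uart_driver_blob[] = { 0xDE, 0xAD, 0xBE, 0xEF };
const size_t uart_driver_blob_len = sizeof uart_driver_blob;
#endif

/* Number of driver blobs compiled into this build. */
size_t builtin_driver_count(void)
{
    size_t n = 0;
#ifdef CONFIG_DRIVER_UART
    n++;
#endif
#ifdef CONFIG_DRIVER_SPI
    n++;
#endif
    return n;
}
```

With only `CONFIG_DRIVER_UART` defined, `builtin_driver_count()` returns 1 and the SPI branch never reaches the compiler.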

