I remember the days of the green threads lib on Linux. It abused fork to use multiple cores, so it was really just pushing the problem back to the kernel's process scheduler. It also made 'ps' output ugly as hell on machines running multithreaded code. This was the late 1990s.
I wouldn't go so far as to say threads can't be implemented as a library, but to implement real threads efficiently in a library would require some way for the kernel to expose the scheduler. It would probably be possible for the kernel to provide extremely basic low-level scheduler syscalls and push everything else down into user-space in a more microkernel-ish system. These would be things like "get core count," "get current core," "start code on core X," etc. Given the higher overhead of syscalls vs. user-mode-only code, this might perform worse than threads in the kernel due to a larger number of kernel/user context switches.
LinuxThreads, as you described, implemented each thread as a kernel process created with clone(), sharing the whole address space with its parent.
The NGPT project (Next Generation POSIX Threads) attempted to add a full N:M hybrid scheduler just around the time M:N threading was going out of fashion.
In the end NPTL won: it kept the 1:1 model, extending clone() and the kernel's signal/exit semantics to cover the differences between POSIX threads and Linux processes, and adding futexes for fast signaling.