I would guess the cause is contention on the synchronization primitives in the dispatcher. The run queues appear to require the scheduler lock (Sched) to be held before they can be modified:
// Scheduling helpers. Sched must be locked.
static void gput(G*); // put/get on ghead/gtail
This is the comment on the lock:
* in the uncontended case,
* as fast as spin locks (just a few user-level instructions),
* but on the contention path they sleep in the kernel.
* a zeroed Lock is unlocked (no need to initialize each lock).
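A better approach for multicore would probably be per-thread work-stealing queues, so that each thread mostly works from its own queue and only touches another thread's queue when it runs out of work, instead of every thread contending for one scheduler lock.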
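
To make the work-stealing idea concrete, here is a minimal sketch in C. It is not how the Go runtime does (or would do) it. Every name here (WorkQueue, wsq_push, wsq_pop, wsq_steal, wsq_find_work) is hypothetical, and it assumes pthreads and a fixed-capacity ring buffer. Each queue has its own mutex, so workers only contend when one actually steals from another. Production implementations typically use lock-free deques instead of a mutex per queue.

```c
#include <pthread.h>
#include <stddef.h>

#define WSQ_CAP 64 /* fixed capacity, chosen arbitrarily for this sketch */

/* One queue per worker thread. All names are hypothetical. */
typedef struct {
    pthread_mutex_t mu;   /* guards only this queue, not a global scheduler */
    void *items[WSQ_CAP]; /* ring buffer of opaque task pointers */
    size_t head;          /* index of the oldest task (thieves take from here) */
    size_t count;         /* number of queued tasks */
} WorkQueue;

static void wsq_init(WorkQueue *q) {
    pthread_mutex_init(&q->mu, NULL);
    q->head = 0;
    q->count = 0;
}

/* Owner adds a task at the tail. Returns 0 on success, -1 if the queue is full. */
static int wsq_push(WorkQueue *q, void *task) {
    int rc = -1;
    pthread_mutex_lock(&q->mu);
    if (q->count < WSQ_CAP) {
        q->items[(q->head + q->count) % WSQ_CAP] = task;
        q->count++;
        rc = 0;
    }
    pthread_mutex_unlock(&q->mu);
    return rc;
}

/* Owner takes the newest task from the tail (LIFO). Returns NULL if empty. */
static void *wsq_pop(WorkQueue *q) {
    void *task = NULL;
    pthread_mutex_lock(&q->mu);
    if (q->count > 0) {
        q->count--;
        task = q->items[(q->head + q->count) % WSQ_CAP];
    }
    pthread_mutex_unlock(&q->mu);
    return task;
}

/* A thief takes the oldest task from the head (FIFO). Returns NULL if empty. */
static void *wsq_steal(WorkQueue *q) {
    void *task = NULL;
    pthread_mutex_lock(&q->mu);
    if (q->count > 0) {
        task = q->items[q->head];
        q->head = (q->head + 1) % WSQ_CAP;
        q->count--;
    }
    pthread_mutex_unlock(&q->mu);
    return task;
}

/* Worker `self` looks in its own queue first, then tries to steal from the others. */
static void *wsq_find_work(WorkQueue *queues, size_t n, size_t self) {
    void *task = wsq_pop(&queues[self]);
    for (size_t i = 1; task == NULL && i < n; i++)
        task = wsq_steal(&queues[(self + i) % n]);
    return task;
}
```

In the common case a worker only locks its own queue, which is the uncontended fast path the lock comment describes. Contention happens only when a worker is idle and has to steal.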