Hm. The biggest problem there are the definition of "diagonal" and "box".
I mean non-square sudokus are still in NP. If a solution can be validated in polynomial time, a problem is in NP.
And to validate a (x, y) sized sudoku, you need to check (x * y) [ number of total squares] * (x [row] + y [column] + max(x, y)ish [diagonal-ish] + ((x/3) + (y/3)) [box]). The boxes might be weird, but boxes are smaller than the overall sudoku, so we are looking at some max(x, y)^4 or so. Same for the diagonals. The input is x*y numbers, so max(x, y)^2. Very much polynomial in the size of the input[1]
And it should also be easy to show that if an (n, n) sized sudoku has a solution, an (n+k, n+k) sized sudoku has a solution. You kinda shove in the new numbers in knights-kinda moves and that's it.
1: this can be a bit weird, because you need to be careful "what you're polynomial in". If your input is a number or two, you might be polynomial in the magnitude of the number, which however is exponential with the input length.
In this case however, we wouldn't have encoding shenanigans, since we're just placing abstract symbols from the turing machine's alphabet onto an imagined grid.
That is another very interesting part of complexity theory, yeah.
Like, "Constant time" means, "Runtime independent from input". And, well, solving any sudoku of size 9 or less is a constant times 6.6e21. Maybe a bit more in the exponent, but meh.
Like in graph theory, there are some algorithms for I think maxflow, which solve the thing in O(node_count^4). Theory can push this down to like O(node_count^3) or O(node_count^2.7) or less. That's amazing - you can lose almost 2 orders of magnitude.
Well, implementation of these algorithms and more detailed analysis point out _huge_ precomputations necessary to achieve the speedups. In practice, you'd only see speedups if you had graphs with multiple billions of nodes. In practice, if you deal with a boring subset like "realistically relevant", asymptotically worse algorithms may be the objectively better choice.
Like in this case. Here, some O(n^5) - O(n^9) depending on what the solver does can be better than O(1) for many practical purposes.
In such areas, intuition is little more than a lie.
The problem class of "Solve an arbitrary Sudoku of Size 9" might even be constant runtime, since it's a finite set to search through.