To follow up, after experimenting a bit with the source code:
1. Please, for the love of God, and for scientific reproducibility, specify library versions explicitly, and use pyproject.toml instead of an incomplete requirements.txt.
2. The 1,000 Sudoku examples are augmented with hand-coded permutation algorithms, so the actual input data set is more like 1,000,000 examples, not 1,000.
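For anyone wondering what that kind of augmentation looks like, here is a minimal sketch. It is not the repo's actual code: the function names (`augment`, `is_valid_solution`) and the exact mix of transformations are my own assumptions. It relies only on the standard validity-preserving Sudoku symmetries: relabeling digits, shuffling bands and the rows inside them, doing the same for columns, and transposing.

```python
import random


def augment(grid, rng):
    """Return a randomly transformed copy of a 9x9 Sudoku grid (0 = blank).

    Hypothetical helper: the transformations are standard Sudoku symmetries,
    not necessarily the ones used in the repo under discussion.
    """
    # Relabel digits 1-9 with a random permutation; 0 (blank) stays 0.
    digits = list(range(1, 10))
    rng.shuffle(digits)
    relabel = [0] + digits

    def shuffled_lines():
        # Shuffle the three bands, then the three lines inside each band.
        bands = [0, 1, 2]
        rng.shuffle(bands)
        order = []
        for band in bands:
            inner = [0, 1, 2]
            rng.shuffle(inner)
            order.extend(3 * band + i for i in inner)
        return order

    rows = shuffled_lines()
    cols = shuffled_lines()
    out = [[relabel[grid[r][c]] for c in cols] for r in rows]

    # Transpose half of the time.
    if rng.random() < 0.5:
        out = [list(col) for col in zip(*out)]
    return out


def is_valid_solution(grid):
    """Check that every row, column, and 3x3 box contains 1-9 exactly once."""
    want = set(range(1, 10))
    rows_ok = all(set(row) == want for row in grid)
    cols_ok = all(set(col) == want for col in zip(*grid))
    boxes_ok = all(
        {grid[r][c] for r in range(br, br + 3) for c in range(bc, bc + 3)} == want
        for br in (0, 3, 6)
        for bc in (0, 3, 6)
    )
    return rows_ok and cols_ok and boxes_ok


# A solved grid built from a simple shift pattern, used only as demo input.
solved = [[(3 * (r % 3) + r // 3 + c) % 9 + 1 for c in range(9)] for r in range(9)]
variant = augment(solved, random.Random(0))
```

Digit relabeling alone gives 9! = 362,880 variants per puzzle, so getting from 1,000 base puzzles to around 1,000,000 training examples takes only a small sample of what is available.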
I don't know how common this is, but the fschat maintainers went at least a year without making an official release or updating the version number in their GitHub repo. So the only way to get both current code and a reproducible build (short of just including the fschat library directly, of course) was to pin it to a specific GitHub commit hash. That gets you current code, but it still reports the version number from 12+ months earlier.
fschat is pretty popular for LLM-related work, so I assume this is at least not unheard-of for other notable third-party libraries.
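To make the commit-pinning concrete, here is a rough sketch of building that kind of requirement line. The helper name `pinned_requirement` is made up for illustration, and the commit hash is a placeholder, not a real fschat commit. The output follows the `name @ git+URL@commit` direct-reference form that pip accepts.

```python
import re

# A full Git commit SHA-1 is 40 lowercase hex characters.
_FULL_SHA = re.compile(r"[0-9a-f]{40}")


def pinned_requirement(name, repo_url, commit):
    """Build a direct-reference requirement pinned to one Git commit.

    Hypothetical helper; it only formats the string and does not contact
    the repository or check that the commit exists.
    """
    if not _FULL_SHA.fullmatch(commit):
        # Short hashes and branch names are rejected here because they are
        # ambiguous or can move, which defeats the point of pinning.
        raise ValueError(f"expected a full 40-character commit hash, got {commit!r}")
    return f"{name} @ git+{repo_url}@{commit}"


# Placeholder hash for illustration only; substitute the commit you tested with.
PLACEHOLDER_COMMIT = "0" * 40
requirement = pinned_requirement(
    "fschat", "https://github.com/lm-sys/FastChat.git", PLACEHOLDER_COMMIT
)
```

The resulting string can go into the `dependencies` list in pyproject.toml or on its own line in requirements.txt. Requiring the full hash rather than a branch or tag name is deliberate, since branches and tags can be moved later.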
I don't remember the exact scenario, but it might have been that the underlying Python or some system library was slightly different, and the dependency lock wasn't compatible with it.