Hey, HN!
Long-time lurker; a bit intimidated to post this:
I'm an IB diploma candidate (in HS), and writing a research paper is an important part of that curriculum. I'm hoping to write my paper on how an LLM's training data impacts its output, comparing a model trained on, say, Wikipedia with one trained on Reddit.
I have access to some reasonably powerful Nvidia GPUs and plenty of time to train.
I'm fairly decent at "technology," as wide an umbrella as that is -- I use Linux, have messed with things like koboldcpp, etc. -- but my programming abilities are weak; all I've done is 6.00.1x (intro to Python) through edX.
Does this seem like a reasonable project? I know the results will be bad, but will they be enough to measure differences in some way?
Consider a finetune - it's faster and relatively cheap (like, under $30 of rented compute time). The link above covers the details, but the basic steps are: gather a dataset, run the training, and evaluate the results. Instruction-tuned LLMs give you a clean input/output loop, so it's easy to show results: measure perplexity on held-out text and compare it against the base model.
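For the perplexity part, Hugging Face makes this pretty painless. Here's a minimal sketch, assuming the transformers and torch libraries ("gpt2" is just a stand-in for whichever base/finetuned pair you end up with):

    import math
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # "gpt2" is a placeholder -- score the same held-out text once with
    # your base model and once with your finetune, then compare.
    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")
    model.eval()

    text = "Some held-out sentence the model never saw during training."
    ids = tok(text, return_tensors="pt").input_ids

    with torch.no_grad():
        # passing labels=input_ids makes the model return its average
        # cross-entropy loss; exp(loss) is the perplexity
        loss = model(input_ids=ids, labels=ids).loss

    print(f"perplexity: {math.exp(loss.item()):.2f}")

Lower is better, and run it over a decent chunk of held-out text rather than one sentence before you trust the number.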
If you're interested in building a limited dataset, fun ideas might be quotes or conversations from your classmates, lessons or syllabi from your program, or other specific, local, testable information. Datasets aren't plug and play, and they're the most important part of a model.
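As for the shape of the data: one common convention (Alpaca-style) is JSONL with instruction/output pairs. A rough sketch with made-up rows (the field names and examples are just illustrative; check what your training script actually expects):

    import json

    # Tiny, invented examples -- a real finetune wants hundreds or
    # thousands of rows, but the shape stays the same.
    examples = [
        {"instruction": "Who runs the Tuesday TOK seminar?",
         "output": "Mr. Okafor, in room 12."},
        {"instruction": "What's due for the physics IA this week?",
         "output": "A draft of the data analysis section."},
    ]

    # One JSON object per line (JSONL) -- most training scripts take this.
    with open("dataset.jsonl", "w") as f:
        for ex in examples:
            f.write(json.dumps(ex) + "\n")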
However, even the same dataset can yield different results depending on training parameters. I'd keep it simple: either make the test about the impact of different training parameters on a single dataset, or pick two existing datasets and train with identical parameters for comparison.
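If you go the two-datasets route, pin everything else down. A sketch using Hugging Face's TrainingArguments -- the specific values here are arbitrary placeholders, not recommendations:

    from transformers import TrainingArguments

    # Hold all of this constant across both runs so the dataset is the
    # only variable; the numbers themselves are placeholders.
    args = TrainingArguments(
        output_dir="run-wikipedia",      # "run-reddit" for the other run
        num_train_epochs=3,
        per_device_train_batch_size=4,
        learning_rate=2e-5,
        seed=42,                         # same seed for both runs
        logging_steps=50,
    )

Fixing the seed and logging at the same interval also makes the loss curves from the two runs directly comparable.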
Good luck in IB! I was in it until I moved cities, and it was a blast.