> While LLM projects typically require an exorbitant amount of resources, it is important to remind ourselves that research does not need to assemble full-fledged massively expensive systems in order to have impact.
Four research students from the Singapore University of Technology and Design are pretraining a 1.1B Llama model on 3 trillion tokens using a handful of A100s.
They're also providing the source code, training data, and fine-tuned checkpoints for anyone to run.
I'd agree, but would argue it's affordable for a sponsored dissertation program with 3 research students and an associate professor. They're actually still training it!
For one run, yes. But if they are testing a new architecture or something like that, they need at least dozens of runs. If they are not testing a new architecture, fine-tuning is almost always the way to go.
I'm still using old-school maximum-entropy models and HMMs for NLP and integrating them into our products. The model size is less than 1MB, and the inference time is really negligible. These models are used in batch processing for billions of entries, with no access to the internet. Not sure whether LLMs can fit this task?
If inference time is negligible, you might find that a larger model with slightly better metrics is affordable. Not that an LLM is necessarily the way to go, but if you're already in for the squeeze, perhaps there's more juice to be had.
A tougher question for me here is the why, vs how: 'Why research LLMs in academia if you can work in industry on the same problems with more resources?'. Systems researchers face something similar, where big systems are easier to research at Google etc. and get paid much more to do so.
The bit on reasoning seems relevant: many industry roles in that area will already be filled by people ahead of a not-yet-graduated student, and it's research that fewer industry labs pursue as aggressively as, say, engineering more efficient inference. Doing that kind of research today is a lot more exciting than when we were doing it 10 years ago, so it's a special time.
I'm still in the stone ages trying to use topic modeling on a free text field. Can LLMs help me decipher a large dataset of human text and put it in buckets?
Absolutely. Give it a try. I pasted these comments into ChatGPT and asked it to score them according to Slashdot's system (which I didn't actually define) and got this reply:
In Slashdot's scoring system, comments are scored from -1 to 5, based on factors like insightfulness, informativeness, and whether they're interesting or funny. Here are the hypothetical scores for the comments from Hacker News:
owlninja (42 minutes ago): This user is asking about the applicability of LLMs for text categorization in large datasets. It's a straightforward query, indicating a need for information. Score: 3 (Interesting) - As it opens up a discussion on a relevant and technical topic.
causalmodels (18 minutes ago): This comment provides a direct solution with a resource (zod-gpt). Score: 4 (Informative) - It not only addresses the query but also provides a specific tool to help achieve the goal, adding value to the discussion.
quickthrower2 (41 minutes ago): This comment suggests an approach for creating buckets using a sample and an LLM. Score: 3 (Interesting) - It proposes a practical method, contributing constructively to the original query.
These scores are subjective and would depend on the perspectives of the individual moderators or the community's voting on Slashdot.
You can reduce sentences to vectors and then compute similarity scores to build a graph over the corpus. If you choose to create clusters, then use an LLM to summarise them to create labels.
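A minimal sketch of that pipeline, assuming sentence-transformers and scikit-learn for the embedding/clustering step and the OpenAI API for labelling; the model names, the distance threshold, and the prompt are placeholders, not anything the parent specified:

```python
# Sketch: embed sentences, cluster them, then ask an LLM to label each cluster.
from sentence_transformers import SentenceTransformer
from sklearn.cluster import AgglomerativeClustering
from openai import OpenAI

sentences = ["refund took three weeks", "love the new dashboard", "app crashes on login"]

# 1. Reduce sentences to vectors.
embedder = SentenceTransformer("all-MiniLM-L6-v2")
vectors = embedder.encode(sentences)

# 2. Cluster on cosine distance (a similarity graph plus community detection would also work).
clusters = AgglomerativeClustering(
    n_clusters=None, distance_threshold=0.5, metric="cosine", linkage="average"
).fit_predict(vectors)

# 3. Summarise each cluster with an LLM to get a human-readable label.
client = OpenAI()
for cluster_id in set(clusters):
    members = [s for s, c in zip(sentences, clusters) if c == cluster_id]
    reply = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user",
                   "content": "Give a 2-4 word label for these comments:\n" + "\n".join(members)}],
    )
    print(cluster_id, reply.choices[0].message.content)
```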
I just did this today using OpenAI's function calling. I have a bunch of elements in a scene, and classifying them into various 'buckets' has been the challenge. The way I set it up: in the expected schema, you take that top-level free text, wrap it in quotes, and it becomes the parent object, with elementCategory as a required string inside it and a list of all category types in the description. Then loop through and create a dynamic schema based on how you chunk your data, and add a validation step at the end to ensure that ChatGPT doesn't forget, add, or change keys. I found that if you create a numberedKeywordString for your chunk, wrap each entry in quotes, and stuff that numberedKeywordString into the system prompt, then it's solid and will ensure the LLM catches every key and gets the associated elementCategory value.
It works quite well for dynamic classification for my purposes, but I'm sure there is a better way.
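A hedged sketch of that idea, not the parent's exact setup: build a JSON schema dynamically from the quoted free-text items, force a category per item via function calling, then validate the keys afterwards. The category list, item texts, and function name are placeholders:

```python
import json
from openai import OpenAI

CATEGORIES = ["prop", "character", "lighting", "terrain"]  # assumed category list
items = ["old wooden chair", "torch on the wall", "grass-covered hill"]

# Dynamic schema: one required property per quoted item, each constrained to the category enum.
properties = {f'"{text}"': {"type": "string", "enum": CATEGORIES} for text in items}
schema = {"type": "object", "properties": properties, "required": list(properties)}

client = OpenAI()
resp = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "Assign each quoted element to exactly one category."},
        {"role": "user", "content": "\n".join(f'"{t}"' for t in items)},
    ],
    tools=[{"type": "function",
            "function": {"name": "classify_elements", "parameters": schema}}],
    tool_choice={"type": "function", "function": {"name": "classify_elements"}},
)
result = json.loads(resp.choices[0].message.tool_calls[0].function.arguments)

# Validation step: make sure the model didn't forget, add, or change keys.
assert set(result) == set(properties), "model dropped or invented keys"
assert all(v in CATEGORIES for v in result.values()), "invalid category returned"
print(result)
```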
No problem at all. The easiest way to start is with the GPT-4 API (you just tell it how you want the data organised / what metadata to extract). It will be costly, but you can prove that it works, and generate examples to fine-tune a simpler model to do the same.
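Roughly like this, as a sketch: ask the API to bucket each record and return JSON, and save the pairs so a cheaper model can be fine-tuned on them later. The bucket names, prompt, and example records are made up for illustration:

```python
import json
from openai import OpenAI

client = OpenAI()
buckets = ["billing", "bug report", "feature request", "other"]  # assumed buckets

def classify(text: str) -> dict:
    resp = client.chat.completions.create(
        model="gpt-4o",
        response_format={"type": "json_object"},
        messages=[
            {"role": "system",
             "content": f"Classify the text into one of {buckets} and extract any product names. "
                        'Reply as JSON: {"bucket": ..., "products": [...]}'},
            {"role": "user", "content": text},
        ],
    )
    return json.loads(resp.choices[0].message.content)

# Collect (input, label) pairs; these later become training data for a simpler model.
training_pairs = []
for record in ["The invoice charged me twice", "Please add dark mode to the editor"]:
    training_pairs.append({"text": record, "label": classify(record)})
print(training_pairs)
```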
Absolutely, here is the approach I used with OpenAI. On a first pass you explain the task and let it reply to a hundred-ish examples. Fix the errors, fine-tune, and it should be much better. You can repeat that as many times as you need until you reach satisfactory results, but they came quickly when I worked on a large number of short texts I wanted to classify and extract info from.
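A rough sketch of that loop, assuming the OpenAI fine-tuning API: collect the model's first-pass answers, hand-correct the mistakes, then fine-tune on the corrected pairs. The file name, base model, task prompt, and example are placeholders:

```python
import json
from openai import OpenAI

client = OpenAI()
SYSTEM = "Extract the product and the sentiment from the text as JSON."

# 1. First pass: let the base model answer ~100 short texts, then fix its errors by hand (not shown).
corrected = [
    {"text": "The blender died after a week",
     "answer": '{"product": "blender", "sentiment": "negative"}'},
    # ... roughly a hundred corrected examples ...
]

# 2. Write the corrected pairs in the chat fine-tuning format.
with open("train.jsonl", "w") as f:
    for ex in corrected:
        f.write(json.dumps({"messages": [
            {"role": "system", "content": SYSTEM},
            {"role": "user", "content": ex["text"]},
            {"role": "assistant", "content": ex["answer"]},
        ]}) + "\n")

# 3. Upload and start a fine-tuning job; repeat the whole loop if results are still unsatisfactory.
upload = client.files.create(file=open("train.jsonl", "rb"), purpose="fine-tune")
job = client.fine_tuning.jobs.create(training_file=upload.id, model="gpt-4o-mini-2024-07-18")
print(job.id)
```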
Check out TinyLlama; https://github.com/jzhang38/TinyLlama