Hacker News
NLP Research in the Era of LLMs (nlpnewsletter.substack.com)
78 points by sebg on Dec 22, 2023 | 17 comments



> While LLM projects typically require an exorbitant amount of resources, it is important to remind ourselves that research does not need to assemble full-fledged massively expensive systems in order to have impact.

Check out TinyLlama: https://github.com/jzhang38/TinyLlama

Four research students from the Singapore University of Technology and Design are pretraining a 1.1B-parameter Llama model on 3 trillion tokens using a handful of A100s.

They're also providing the source code, training data, and fine-tuned checkpoints for anyone to run.


Even if they ran it without hitting any issues and with zero test runs, it would still have taken ~35k A100-hours, or $70k-100k. It is not cheap to do.
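Back-of-envelope for that figure, assuming a hypothetical cloud rate of roughly $2-3 per A100-hour (the rate is my assumption, not from the comment):

```python
# Rough cost check: 35k A100-hours at an assumed $2-3 per GPU-hour.
gpu_hours = 35_000
rate_low, rate_high = 2, 3  # $/GPU-hour, assumed typical cloud pricing
low, high = gpu_hours * rate_low, gpu_hours * rate_high
print(f"${low:,} - ${high:,}")  # → $70,000 - $105,000
```

Which lands in the same ballpark as the $70k-100k quoted above.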


I’d agree, but would argue it's affordable for a sponsored dissertation program with 3 research students and an associate professor. They’re actually still training it!


For one run, yes. But if they are testing a new architecture or something like that, they need at least dozens of runs. If they are not testing a new architecture, fine-tuning is almost always the way to go.


I'm still using old-school maximum-entropy models and HMMs for NLP and integrating them into our products. The model size is less than 1MB, and the inference time is negligible. These models are used in batch processing for billions of entries, with no internet access. I'm not sure LLMs could fit this task.
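For a sense of why HMM inference is so cheap, here is a toy Viterbi decoder of the kind such a tagger would run per entry. All states, words, and probabilities are made-up illustrations, not from any real model:

```python
# Toy HMM part-of-speech tagger: a handful of dict lookups and
# multiplications per token, which is why inference cost is negligible.
states = ["NOUN", "VERB"]
start_p = {"NOUN": 0.6, "VERB": 0.4}
trans_p = {
    "NOUN": {"NOUN": 0.3, "VERB": 0.7},
    "VERB": {"NOUN": 0.8, "VERB": 0.2},
}
emit_p = {
    "NOUN": {"dogs": 0.4, "bark": 0.1, "runs": 0.1},
    "VERB": {"dogs": 0.05, "bark": 0.5, "runs": 0.4},
}

def viterbi(obs):
    """Return the most likely state sequence for the observed words."""
    # V[t][state] = (best probability so far, best path so far)
    V = [{s: (start_p[s] * emit_p[s].get(obs[0], 1e-6), [s]) for s in states}]
    for word in obs[1:]:
        layer = {}
        for s in states:
            prob, path = max(
                (V[-1][prev][0] * trans_p[prev][s] * emit_p[s].get(word, 1e-6),
                 V[-1][prev][1] + [s])
                for prev in states
            )
            layer[s] = (prob, path)
        V.append(layer)
    return max(V[-1].values())[1]

print(viterbi(["dogs", "bark"]))  # → ['NOUN', 'VERB']
```

The whole model is three small tables, which is how it stays under 1MB.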


If inference time is negligible, you might find that a larger model with slightly better metrics is affordable. Not that an LLM is the way to go, but if you're already going for the squeeze, maybe there's more juice to be had.


I like this article.

A tougher question for me here is the why, vs. the how: 'Why research LLMs in academia if you can work on the same problems in industry with more resources?' Systems researchers face something similar, where big systems are easier to research at Google etc., and you get paid much more to do so.

The bit on reasoning seems relevant: many of those industry roles will already be taken by people ahead of a pre-graduation student, and it's research that fewer industry labs pursue as intensely as, say, engineering more efficient inference. Doing that kind of research today is a lot more exciting than it was 10 years ago, so it's a special time.


I'm still in the stone ages trying to use topic modeling on a free text field. Can LLMs help me decipher a large dataset of human text and put it in buckets?


Absolutely. Give it a try. I pasted these comments into ChatGPT and asked it to score them according to Slashdot's system (which I didn't actually define) and got this reply:

In Slashdot's scoring system, comments are scored from -1 to 5, based on factors like insightfulness, informativeness, and whether they're interesting or funny. Here are the hypothetical scores for the comments from Hacker News:

owlninja (42 minutes ago): This user is asking about the applicability of LLMs for text categorization in large datasets. It's a straightforward query, indicating a need for information. Score: 3 (Interesting) - As it opens up a discussion on a relevant and technical topic.

causalmodels (18 minutes ago): This comment provides a direct solution with a resource (zod-gpt). Score: 4 (Informative) - It not only addresses the query but also provides a specific tool to help achieve the goal, adding value to the discussion.

quickthrower2 (41 minutes ago): This comment suggests an approach for creating buckets using a sample and an LLM. Score: 3 (Interesting) - It proposes a practical method, contributing constructively to the original query.

These scores are subjective and would depend on the perspectives of the individual moderators or the community's voting on Slashdot.


If you know your buckets, you can pretty easily get most LLMs to generate structured outputs using something like zod-gpt [1].

[1] https://github.com/dzhng/zod-gpt
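zod-gpt itself is a TypeScript library; the same validate-against-known-buckets idea can be sketched in Python. The bucket names and the sample reply below are made up for illustration:

```python
# Sketch: constrain LLM classification output to a fixed schema and
# reject anything that doesn't conform (hypothetical bucket names).
import json

BUCKETS = {"billing", "bug report", "feature request", "other"}

def parse_classification(raw: str) -> dict:
    """Parse an LLM reply and reject anything outside the schema."""
    data = json.loads(raw)
    if set(data) != {"text", "bucket"}:
        raise ValueError(f"unexpected keys: {sorted(data)}")
    if data["bucket"] not in BUCKETS:
        raise ValueError(f"unknown bucket: {data['bucket']}")
    return data

reply = '{"text": "App crashes on login", "bucket": "bug report"}'
print(parse_classification(reply)["bucket"])  # → bug report
```

In practice you would retry the LLM call when validation fails, which is roughly what libraries like zod-gpt automate.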


Have a look at "Using text embeddings and approximate nearest neighbour search to explore ..."

https://medium.com/gft-engineering/using-text-embeddings-and...

You can reduce sentences to vectors and then compute similarity scores to build a graph over the corpus. If you choose to create clusters, then use an LLM to summarise them to create labels.
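A toy sketch of that embed → similarity → cluster pipeline. The 2-D vectors here are hand-made stand-ins for real sentence embeddings, and the greedy single-link clustering is one simple choice among many:

```python
# Cluster sentences by cosine similarity of their (toy) embeddings;
# an LLM would then label each resulting cluster.
import math

sentences = {
    "refund my payment":     (0.9, 0.1),
    "charge was duplicated": (0.8, 0.2),
    "app crashes on login":  (0.1, 0.9),
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def cluster(vecs, threshold=0.9):
    """Greedy single-link clustering over the cosine-similarity graph."""
    groups = []
    for name in vecs:
        for g in groups:
            if any(cosine(vecs[name], vecs[m]) >= threshold for m in g):
                g.append(name)
                break
        else:
            groups.append([name])
    return groups

print(cluster(sentences))  # → two clusters: the billing pair, the crash one
```

With real embeddings you would swap in an approximate nearest neighbour index instead of the all-pairs comparison.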


Interestingly, I don't see much progress on the core of vector DBs. Does anyone have some innovative references?


I just did this today using OpenAI's function calling. I have a bunch of elements in a scene, and classifying them into various 'buckets' has been the challenge.

The way I set it up: in the expected schema, you take the top-level free text, wrap it in quotes, and it becomes the parent object, with elementCategory as a required string inside it and a list of all category types in the description. Then loop through and create a dynamic schema based on how you chunk your data, and add a validation step at the end to ensure that ChatGPT doesn't forget, add, or change keys.

I found that if you create a numberedKeywordString for your chunk, wrap each entry in quotes, and stuff that numberedKeywordString into the system prompt, then it's solid: the LLM catches every key and gets the associated elementCategory value.

It works quite well for dynamic classification in my purposes but I’m sure there is a better way.
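A minimal sketch of the validation step described above, assuming made-up category names: check that the model's reply has exactly the expected keys, with a known elementCategory for each, and nothing forgotten, added, or renamed.

```python
# Validate a dynamic-schema reply: exact key match plus known categories.
CATEGORIES = {"prop", "character", "background"}  # hypothetical buckets

def validate(expected_keys, reply):
    """True iff the reply covers exactly the expected keys with valid categories."""
    if set(reply) != set(expected_keys):
        return False  # a key was forgotten, added, or renamed
    return all(
        isinstance(v, dict) and v.get("elementCategory") in CATEGORIES
        for v in reply.values()
    )

reply = {
    '"old oak tree"': {"elementCategory": "background"},
    '"rusty sword"':  {"elementCategory": "prop"},
}
print(validate(['"old oak tree"', '"rusty sword"'], reply))  # → True
```

On a validation failure you would re-prompt for just the missing or malformed keys.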


Try this: https://github.com/refuel-ai/autolabel

Then the main challenge just becomes prompt design, which can sometimes be nebulous for NLP annotation.


No problem at all. The easiest way to start is the GPT-4 API (you just tell it how you want the data organised / what metadata to extract). It will be costly, but you can prove that it works and generate examples to fine-tune a simpler model to do the same.


It might help you create the buckets, at least, if you take a sample of, say, 1000 and give that to the LLM.


Absolutely. Here is an example flow with OpenAI: first pass, you explain the task and let it reply to a hundred-ish examples. Fix the errors and fine-tune, and it should be much better. You can repeat that any time until you reach satisfactory results; they came quickly when I worked on a large number of short texts I wanted to classify and extract info from.
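The fix-then-fine-tune step above needs the hand-corrected labels in a training file; a minimal sketch, assuming OpenAI's chat-format fine-tuning JSONL (the texts, labels, and instruction are made up):

```python
# Turn corrected (text, label) pairs into chat-format fine-tuning JSONL:
# one JSON object per line, each a full system/user/assistant exchange.
import json

corrected = [
    ("Refund my payment", "billing"),
    ("App crashes on login", "bug report"),
]

def to_jsonl(examples, instruction="Classify the text into a bucket."):
    lines = []
    for text, label in examples:
        lines.append(json.dumps({
            "messages": [
                {"role": "system", "content": instruction},
                {"role": "user", "content": text},
                {"role": "assistant", "content": label},
            ]
        }))
    return "\n".join(lines)

print(len(to_jsonl(corrected).splitlines()))  # → 2
```

Each correction round appends to this file before the next fine-tune.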



