I'm curious about how large language models will do in finance, considering the one thing LLMs do remarkably poorly is math.
I use ChatGPT to keep track of tasks and Todo lists. It works phenomenally well for me, and the natural language back-and-forth helps keep me motivated. I give it a set of tasks, with time estimates, and it organizes these tasks for me, and I tell it when I complete them, and it updates my task list.
The one funny mistake it makes is that when it groups my tasks (say I have 3 "Work" tasks and 2 "Personal" tasks) it sums up the total estimated time for each task group, but the totals are often wrong, especially when I start adding new tasks or completing tasks.
When so much of finance requires numeric accuracy, I'm curious how BloombergGPT handles numbers.
"Doing math" is not really a goal of bloombergGPT. Take a look at our applications in the paper, which include information extraction, sentiment, reasoning, knowledge. These are models of language meant for use on text documents.
There are some aspects of the datasets that require numerical reasoning (ConvFinQA), but that's not the same as doing math.
Can you share how you're using GPT for handling your lists? I'm mostly curious how you get the lists back and forth into a UI or whatever, and maybe prompts if you have anything clever.
Me: Hello ChatGPT. My name is hn_throwaway_99. I'd like for you to use your vast underlying knowledge as a sort of personal coach. I won't be asking you any sensitive questions like medical or legal advice, so I won't need you to give me disclaimers on those topics. I primarily would like to use our conversation to help me with my problems with procrastination, and specifically to help me prioritize and get some tasks done. I have a particular problem with mindlessly browsing the internet (for example, visiting reddit or Hacker News or Wikipedia) when I get stuck or bored, so I'm hoping to use you to keep me on track instead of giving in to those temptations. Does that make sense?
ChatGPT: Hello hn_throwaway_99! I'm happy to help you as a personal coach to address your problems with procrastination and to help you prioritize tasks. It makes perfect sense to use our conversation as a way to keep you on track and avoid distractions like mindlessly browsing the internet. Feel free to share your current tasks or goals, and I can provide suggestions, encouragement, or strategies to help you stay focused and productive.
Me: Great, thanks very much. I'm going to group my tasks by topic. Note these tasks aren't listed in priority order. For each task I'll give an estimate for how long I think it will take.
Work tasks:
1. Finish Jira ticket foo. Estimate: 2 hours.
2. Write job req for bar. Estimate: 1 hour.
Personal tasks:
...
Home tasks:
...
---
From there I just tell it when I complete tasks or add new tasks, and I ask it "Can you please output my outstanding task list?" - I only had to ask that once; after that it always output my updated task list summary whenever I told it I had added or completed a task. It groups my tasks by the categories I gave it (Work, Personal and Home). I was using GPT-4. I also like how it uses encouraging words and advice as I go through my items.
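Incidentally, since it's the group totals that keep going wrong: that part is trivial to keep out of the model entirely. A minimal sketch in Python, assuming tasks are kept as (category, name, estimated-hours) tuples mirroring the list above (the tuples themselves are just illustrative):

    from collections import defaultdict

    # Tasks mirror the structure of the prompt above: category, name, estimate.
    tasks = [
        ("Work", "Finish Jira ticket foo", 2.0),
        ("Work", "Write job req for bar", 1.0),
        ("Personal", "Example personal task", 0.5),
    ]

    # Sum estimates per category deterministically instead of letting the
    # model guess the totals.
    totals = defaultdict(float)
    for category, _name, hours in tasks:
        totals[category] += hours

    for category, hours in sorted(totals.items()):
        print(f"{category}: {hours:g} hours")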
You could instead ask it to "write out" the mathematical expression it wants to calculate in valid Python instead of trying to hallucinate an answer directly. This could then be evaluated by a Python REPL and passed back to BloombergGPT.
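A minimal sketch of the calculator half of that loop, assuming the model has been prompted to emit a bare arithmetic expression (the prompt convention and the restriction to basic operators are my assumptions, not anything from the paper):

    import ast
    import operator

    # Whitelisted operators; anything else in the model's output is rejected.
    _OPS = {
        ast.Add: operator.add,
        ast.Sub: operator.sub,
        ast.Mult: operator.mul,
        ast.Div: operator.truediv,
        ast.Pow: operator.pow,
        ast.USub: operator.neg,
    }

    def eval_expr(expr: str) -> float:
        """Safely evaluate an arithmetic expression emitted by the model."""
        def _eval(node):
            if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
                return node.value
            if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
                return _OPS[type(node.op)](_eval(node.left), _eval(node.right))
            if isinstance(node, ast.UnaryOp) and type(node.op) in _OPS:
                return _OPS[type(node.op)](_eval(node.operand))
            raise ValueError("unsupported expression")
        return _eval(ast.parse(expr, mode="eval").body)

    # e.g. the model answers with "2 * 1.5 + 1" instead of guessing a number:
    print(eval_expr("2 * 1.5 + 1"))  # 4.0

This is safer than calling eval() on raw model output, since only whitelisted arithmetic nodes are ever evaluated.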
GPT-4 doesn't do remarkably poorly at math. It's well above average, actually. What it struggles with currently (and by struggle I mean it's rather bang average) is calculus.
I was using GPT-4 with a prompt of a tabletop game's rulebook pages that were relevant to building an army. It got the points math so wrong it built me a 1,000-point army; when asked to fix it, I got a ~750pt army. Then when correcting it again I got it to ~515 points and manually fixed it the rest of the way.
I think this needs to be heavily qualified. GPT-4 does well at mathematics taught to children, but so does Wolfram Alpha with some prodding. Calculus is also taught to bright high schoolers. What can GPT-4 do with group theory, topology, universal algebra, analysis?
The example I gave above, where it did remarkably poor math (e.g. summing one 1-hour task and one 2-hour task to a total of 6 hours), was with GPT-4.
I wonder if we should go back to ontologies now. It should be fairly easy to have an LLM generate entries for any ontology. Then we need to check them for truth, obviously, but that can be parallelized and potentially automated.
Say, I'd like an ontology of the current stock market focusing on the relationships between natural persons and public companies: board members, well-known analysts, investors and so on. This would be tedious for anyone to do by hand, but should be fairly simple with an LLM.
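To make that concrete, the target format could be as simple as (subject, predicate, object) triples with a fixed predicate vocabulary, so the LLM's output can be validated mechanically before the truth-checking pass. Everything here (the schema, the predicate names) is hypothetical:

    from dataclasses import dataclass

    @dataclass
    class Triple:
        subject: str    # e.g. a natural person
        predicate: str  # relationship drawn from a fixed vocabulary
        obj: str        # e.g. a public company

    # A closed predicate vocabulary keeps the LLM's output checkable.
    ALLOWED_PREDICATES = {"board_member_of", "ceo_of", "analyst_covering", "investor_in"}

    def structurally_valid(t: Triple) -> bool:
        # Cheap structural check; verifying the triple is actually *true*
        # is the separate, parallelizable step mentioned above.
        return bool(t.subject) and bool(t.obj) and t.predicate in ALLOWED_PREDICATES

    assert structurally_valid(Triple("Jane Doe", "board_member_of", "Acme Corp"))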
Another task, maybe a little bit further into the future is categorizing open source intelligence. Think of oryxspioenkop.com and their famous lists of lost equipment in the Russian invasion of Ukraine. It's tedious and time-consuming but generates a valuable dataset. Here, image recognition would be necessary, but the principle is still the same, no?
Come to think of it, how does a company like TomTom generate map data nowadays?
The arxiv paper is written for researchers who are building these models. We benefited tremendously from reading papers on GPT-3, PaLM, Chinchilla, Galactica, Gopher, Bloom, OPT, most of which are closed models. We are contributing back to that community, and the collective experience of people who are training these models. We learned a lot that will help others who are making their own decisions about how to train models.
If you are looking for an API to use, this isn't for you.
Science means objective reproduction; giving back to science means releasing the full model so that others can explore, test, and improve on it. Giving a bunch of words doesn't help the science move forward.
A "bunch of words" has been the standard for the dissemination of scientific results for decades. Worse, those results are often locked behind a paywall...
Publishing models is definitely better than only a paper, but I think you're being unnecessarily harsh in that case.
We would all prefer it if OpenAI released a proper paper about GPT4, for instance, even if they did not release the complete model alongside it.
Being harsh is not my intent, definitely not at a personal level. OpenAI is definitely not a role model, but at least they release an API that can be tested, explored, and benchmarked. That moves the needle. In this case, there is no way to test, let alone reproduce. I think people need to expect more from scientists and researchers and call them out when they fall short.
Wait... judging by their own press release, this thing looks terrible. They didn't benchmark it on finance-specific tasks against GPT-4, or even GPT-3.5, just a bunch of more-or-less random also-ran models. It beat those models, but that's not saying much at all, and the fact that they didn't benchmark it against any OpenAI model in finance-specific tasks speaks volumes.
(They did benchmark it against GPT-3 in general-purpose tasks and, unsurprisingly, GPT-3 came out on top.)
We didn't compare to GPT-4, or any instruction tuned model. We're comparing a causal LM pre-trained only (BloombergGPT) to other causal LM pre-trained only models. You need to compare like to like.
Would an instruction tuned model work better? Sure! I would hope so. That seems like a good next step to explore.
It's also difficult to make scientific comparisons to models that we know nothing about. I don't know what data GPT-4 or GPT-3.5-turbo used for training, how many parameters they have, etc. We can't draw any scientific conclusions when comparing two systems if we don't know anything about one of them.
We compared to GPT-3 results when they were available. Does GPT-3 do better on general purpose datasets? Yes, as we expected it would. It's a much larger model (~3.5 times the size). No surprise there. If anything, the surprise was how well BloombergGPT did against these larger models, beating them on some datasets.
Reading the paper, it's apparent what a huge contribution HuggingFace made by doing the BLOOM work in the open. There's so much knowledge there that anyone else trying to train an LLM can use.
Also it looks like public filings are a large dataset that isn't currently being used by other LLMs.
1). Finance is highly dynamic. Retraining an LLM on a mixed dataset of finance and general sources, as BloombergGPT did, is very expensive (~1.3M GPU hours). Lightweight adaptation is highly favorable; see the sketch after this list.
2). Internet-scale finance data, with timely updates through an automatic data curation pipeline, is critical. BloombergGPT relies on privileged data access and API access. A promising alternative is "democratizing Internet-scale finance data".
3). Another key technology is RLHF (reinforcement learning from human feedback), which is missing in BloombergGPT. RLHF enables learning individual preferences (risk-aversion level, investing habits, a personalized robo-advisor, etc.).
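As a sketch of what "lightweight adaptation" in 1) could look like in practice, here is LoRA fine-tuning via the Hugging Face peft library; the base model name and all hyperparameters are placeholders, not anything BloombergGPT uses:

    from transformers import AutoModelForCausalLM
    from peft import LoraConfig, get_peft_model

    # Placeholder checkpoint; any open causal LM works the same way.
    base = AutoModelForCausalLM.from_pretrained("some-open-causal-lm")

    config = LoraConfig(
        r=8,                                  # rank of the low-rank updates
        lora_alpha=16,
        lora_dropout=0.05,
        target_modules=["q_proj", "v_proj"],  # depends on the architecture
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(base, config)
    model.print_trainable_parameters()  # typically well under 1% of the base model

    # `model` can now be fine-tuned on a small, frequently refreshed finance
    # corpus instead of retraining the whole network from scratch.

The point is that only the small LoRA matrices are trained, so refreshing the model as markets move costs a tiny fraction of the ~1.3M GPU hours quoted above.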
This must be the 'big data' I heard so much about.
LLMs in general, I mean. They seem to be the first widespread application for large, unstructured datasets. Still hype-y, but maybe even a /practical/ application.
I like how they used a little more than 50% domain-specific text and a little less general text. It beats the other LLMs on financial tasks in 4 out of 5 cases, usually by a large margin, but another model squeaks past it on NER.
They'll release training logs but no model file or data? How does anyone trust that this isn't just overfitted garbage trained directly on their benchmarks?
So the question is: do we even need specialized models in the future, or will GPT-4/5 outperform anything else? If general models can perform well in niche domains, it'd massively speed up adoption in companies.