I use scientific language models professionally. I skimmed the paper and was immediately disappointed.
- They benchmarked against general models like GPT-3 but not against well-established models trained for specific tasks, like SPECTER[0] or SciBERT[1]. SPECTER outperformed GPT-3 on tasks like citation prediction two years ago. Nobody seriously uses general LLMs on science tasks, so nobody who actually wants to use this cares about your benchmarks. I want to see task-specific models compared to your general model; otherwise, what's going to happen is I either need to run my own benchmarks or, much more likely, I shelve your paper and never read it again. If you underperform somewhat, that's fine! If you don't compare to science-specific models, all you're claiming is that training on science data gives better science results... that's not exactly an impressive finding. Fine-tuning is a separate thing, I get it, but pleeeeeease just give the people what they want.
- Not released on Hugging Face. No clue why not; on the back end this appears to be based on OPT and Hugging Face compatible, so I'm really confused (see the loading sketch below).
- Flashy website. Combine 1 and 2 with a well-designed website talking about how great you are, and most of my warning lights got set off. Not a fan.
@authors, if you're lurking, please release more relevant benchmarks for citation prediction etc. Thanks.
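To be concrete about point 2: if the weights really are OPT-compatible and were put on the hub, loading them would be the same two-line affair as any other OPT checkpoint. A minimal sketch, using the publicly released facebook/opt-1.3b as a stand-in since the Galactica weights aren't up:

```python
# Minimal sketch: loading an OPT-compatible causal LM through the standard
# Hugging Face API. "facebook/opt-1.3b" is a stand-in checkpoint; an
# OPT-compatible Galactica release would plug into the same two calls.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/opt-1.3b")
model = AutoModelForCausalLM.from_pretrained("facebook/opt-1.3b")

inputs = tokenizer("The Krebs cycle is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```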
My big disappointment, as always with models released by Facebook, is that they're under a non-commercial license, which makes them effectively useless for anything.
They have something like this on the website:
> We believe models want to be free and so we open source the model for those who want to extend it.
IANAL, but it would seem to me this license covers the model itself and not the output of the model.
This is a copyright license for the model, so I think that should just mean you can't sell the model or a derivative of the model.
I guess when it's released you have to fill out some form or tick some box to accept a license agreement, which in practice is a contract saying you won't use it for commercial purposes. But if you were to just download it from somewhere, your only restrictions would be on redistributing it, not on its use.
Open source is not just a term with margin for interpretation: to be open source, you must comply with the 10 rules defined by the Open Source Initiative. Restricting commercial usage goes against rule 6.
You can call it readable source or whatever, but it's not open source as defined by OSI.
"6. No Discrimination Against Fields of Endeavor
The license must not restrict anyone from making use of the program in a specific field of endeavor. For example, it may not restrict the program from being used in a business, or from being used for genetic research."
> to be open source, you must comply with the 10 rules
The Open Source Initiative didn't invent this expression. They worked hard to promote their idea of it and its application. They did a lot of good, but they aren't an authoritative source when it comes to its definition.
The reality is that the vast majority of software developers do not consider a strict conformance to the 10 OSI criteria as being necessary to apply the term "open source".
Maybe they're all just wrong, but it's worth considering why.
> the vast majority of software developers do not consider a strict conformance to the 10 OSI criteria as being necessary to apply the term "open source"
[citation needed]
My counter claim, without citation, is that I actually believe (from experience) that the vast majority of 'open source' projects are in fact released under licenses that already comply with the 10 OSI criteria, and are therefore 'approved' OSI licenses. This is easily witnessed by looking at the licenses of the majority of open source projects — or perhaps even just the most popular ones.
That would seem to go against your claim regarding 'most developers'.
But it's not actually a debate about 'most developers'; it's about the OSS projects out there, not individual devs, no?
Can I ask how you use scientific language models professionally? Or do you have any articles/reviews on how they are being used, and how people see their potential and shortcomings?
Not going to get into details on my own work here, but I'll comment generally on use-cases.
I think a good way to think about scientific language models is that they're useful in exactly the same ways general language models are, but in a very narrow domain (stuff having to do with scientific papers & patents, for the most part).
Use-cases that are possible/useful today:
- Annotation of scientific texts: is this paper about computer science?
- Scientific search: please give me researchers or papers most similar to an input query (a minimal sketch follows this list).
- Helping PhD students graduate (only kind of kidding)
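For the search use-case, a minimal sketch of what this looks like in practice, assuming the allenai/specter checkpoint on the Hugging Face hub and plain cosine similarity over its embeddings (an illustration, not how any particular product does it):

```python
# Sketch: ranking papers against a query with SPECTER-style embeddings.
# The paper titles and query are toy data; any document encoder would
# slot into the same embed-and-compare pattern.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("allenai/specter")
model = AutoModel.from_pretrained("allenai/specter")

papers = [
    "SciBERT: A Pretrained Language Model for Scientific Text",
    "SPECTER: Document-level Representation Learning using Citation-informed Transformers",
    "A Survey of Deep Learning for Image Segmentation",
]
query = "citation-aware embeddings for scientific papers"

def embed(texts):
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        out = model(**batch)
    # SPECTER uses the first-token ([CLS]) embedding as the document vector.
    return out.last_hidden_state[:, 0, :]

query_emb = embed([query])
paper_embs = embed(papers)
scores = torch.nn.functional.cosine_similarity(query_emb, paper_embs)
for score, title in sorted(zip(scores.tolist(), papers), reverse=True):
    print(f"{score:.3f}  {title}")
```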
Use-cases I think will be possible/useful in the foreseeable future:
- Scientific question answering: e.g. ask the model to explain a chemical process
- Scientific advice or guidance: e.g. ask what method might be appropriate in a situation.
- Text completion/editing/etc.: e.g. help me write my paper. You could probably do more of this today if more $ were invested in science models; we're likely ~5 years behind whatever is going on in the "normal" language space.
As far as potential/shortcomings go, I'm really pessimistic. I don't think large language models for science are very useful outside of bespoke projects, or ever will be for people doing serious science. The main issue is that these models are way too general: if you have a specific science problem you want to solve, it's almost always going to be better to train a model to specifically address that problem. You would never, for example, ask a model like Galactica to do what AlphaFold does. Eventually you might be able to, but it's never going to outperform a specific model, so if you're a researcher trying to get the best results, why would you use it?
I should also add, scientists really care about precision. When summarizing a news story exact words might not be that big a deal, but if you're trying to summarize a scientific paper getting a word wrong can REALLY matter. The bar these models need to clear before scientists trust them with tasks where precision matters is likely much, much higher than in other domains.
I think the most likely outcome is that ~75% of LLM use for scientific text outside of academic research papers will be for search related products. That's definitely a place where they can make a big difference: help people find and understand cool papers that are relevant to their research.
My big disappointment is that the model does not provide sources and recommended reading, which is something we can now do and which would increase the usefulness of the model significantly.
I tried it on two topics I am a domain expert in, both using the suggested "lecture notes on …" prompt. It produced rhetorically nice-sounding sentences with little actual content that quickly dissolved into nonsense. I guess to an outside observer that might appear similar to what often happens in academia :)
There is no doubt in my mind that Galactica fine-tuned on these specific datasets will outperform all these previous models. But yeah, someone should definitely do that and perform the benchmarks.
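For the curious, a rough sketch of what that fine-tuning would look like with the Hugging Face Trainer; the checkpoint (facebook/opt-125m) and the toy two-sentence "dataset" are stand-ins, not anything the paper provides, and a real run would use one of the actual task datasets:

```python
# Sketch: fine-tuning an OPT-style causal LM on a task-specific text corpus
# with the Hugging Face Trainer. Checkpoint and corpus are stand-ins.
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

checkpoint = "facebook/opt-125m"  # stand-in for any OPT-compatible model
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)

# Toy in-memory corpus standing in for a real task dataset.
corpus = Dataset.from_dict({"text": [
    "Paper A cites Paper B because both study citation graphs.",
    "Paper C does not cite Paper D; their topics are unrelated.",
]})
tokenized = corpus.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
    batched=True, remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="opt-finetuned", num_train_epochs=1,
                           per_device_train_batch_size=2),
    train_dataset=tokenized,
    # mlm=False makes the collator build causal-LM labels from the inputs.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```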
I've been vaguely following all the AI news on text-to-image and text generation that comes out of promos. But I have no idea how a benchmark for text would work. Is benchmarking subjective? Is it based on accuracy of information? How do you actually measure a benchmark for something like this?
Different benchmarks are performed for different tasks. As there are a lot of things you can use language models for, there are a lot of benchmarks.
With respect to subjectivity it really depends on the task - some tasks are quite amenable to objective classification. One common task for science language models is citation prediction: do these two papers share a citation link? Obviously that's a really simple accuracy metric to report.
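As a toy sketch of how that gets scored (the labels and predictions below are made up purely for illustration):

```python
# Sketch: scoring binary citation-link prediction as plain accuracy.
# `gold` and `predicted` are illustrative stand-ins for a real benchmark's
# labels and a real model's outputs.
gold      = [1, 0, 1, 1, 0, 0, 1, 0]  # 1 = the two papers share a citation link
predicted = [1, 0, 0, 1, 0, 1, 1, 0]  # model's yes/no guesses, same order

correct = sum(g == p for g, p in zip(gold, predicted))
accuracy = correct / len(gold)
print(f"accuracy = {accuracy:.2%}")  # 75.00% on this toy example
```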
Often things are not so simple. An example might be keyphrase extraction - standard practice there is to have grad students sit down with a highlighter and use the terms multiple students agree on (a simplification, but not by much). From there it just gets messier. Are you reporting the accuracy of all keywords identified, or of all sentences correctly processed? What about sentences with multiple keywords? What about sentences with no keywords? Very messy; appropriate metrics can be a real topic of debate.
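To make the highlighter step concrete, a toy sketch: build the gold set from phrases at least two annotators agree on, then score a model's extractions with precision/recall/F1. All names and numbers here are invented for illustration.

```python
# Sketch: agreement-filtered gold keyphrases and precision/recall/F1.
# Annotator labels and model output are toy data.
from collections import Counter

annotations = [
    {"language model", "citation prediction", "transformer"},  # annotator A
    {"language model", "transformer", "scientific text"},      # annotator B
    {"language model", "citation prediction", "benchmark"},    # annotator C
]

# Keep only phrases that at least two annotators highlighted.
counts = Counter(phrase for ann in annotations for phrase in ann)
gold = {phrase for phrase, n in counts.items() if n >= 2}

predicted = {"language model", "benchmark", "transformer"}  # model output

true_positives = len(predicted & gold)
precision = true_positives / len(predicted) if predicted else 0.0
recall = true_positives / len(gold) if gold else 0.0
f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
print(f"gold = {sorted(gold)}")
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```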
[0] https://arxiv.org/abs/2004.07180
[1] https://arxiv.org/abs/1903.10676