It is still a crazy question though, because if you've seen most laptops from the last 15 years, there is basically no room for them except on the large workstation ThinkPads or large gaming laptops.
Just talk to them as if they were already your friend. Most of what you talk about with friends isn't just mutual interests and you start conversations with them all the time.
This blog post describes the basic work of a research engineer and nothing more. The amount of surprise the author has seems to suggest they haven't really worked in ML for very long.
Honestly? This is the best it's ever been. Getting stuff to run before Hugging Face, uv, and Docker containers with CUDA was way worse. Even with full open source, go try to run a 3+ year old model and codebase. The field just moves very fast.
The paper you're talking about is "Deal or No Deal? End-to-End Learning for Negotiation Dialogues" and it was just AIs drifting away from English. The crazy news article was from Forbes with the title "AI invents its own language so Facebook had to shut it down!" before they changed it after backlash.
Friendly reminder that articles like this are not written by Forbes staff but are published directly by the author with little to no oversight by Forbes. Basically a blog running on the forbes.com domain. I'm sure there are many great contributors to Forbes; I'm just saying that without editorial oversight, the domain it was published on is by definition meaningless. I see people all the time saying something like, "It was on Forbes, it must be true!" They wouldn't be saying that if it was published on Substack or WordPress.com.
Expert difficulty is also recognizing that articles from "serious" publications like The New York Times can also be misleading or outright incorrect, sometimes obviously so like with some Bloomberg content the last few years.
Another GitHub PM here. Thanks for the feedback! We're currently working on adding a way to restrict PR creation to collaborators only. We've also heard some feedback around evaluating PRs against contributing guidelines, which would allow maintainers to clearly define criteria that PRs must meet, so we're exploring that option as well.
> the generation of 281,128 augmented examples, from which 1,000 were held out as a benchmark test set.
This model is trained on a custom dataset of 280k examples then tested on 1k very similar examples from the same dataset. Of course it is specialized to outperform general models on this specific task in this specific domain with this specific json format for output.
This is a reasonable hobby project and an interesting approach to synthetic data generation, but not impressive research.
At minimum you should test your model on other benchmarks that have similar tasks, e.g. docbench.
It's not novel research, but I think it drives home the point that many narrow applications of AI do not require the largest, latest (and most expensive) models. And in many of those cases, a small fine-tuned model is the most performant and cost-effective.
It is probably obvious to most who follow the space closely, but you'd be surprised how many engineers don't recognize this.
Well, one day it might be at the level of shell scripting. I don't think about "the tradeoffs of building a specialized shell script", I just do it because it's cheap and easy and solves a problem right then and there.
I don't know how you would even begin to make this same kind of observation for ML models, but it seems possible. The 2010s weren't exactly building out "trivial" models, but compared to the architectures and optimizations out now, yeah, those models are toys by comparison.
yes! check out https://distillabs.ai/ – follows a similar approach except the evaluation set is held out before the synthetic data generation, which I would argue makes it more robust (I'm affiliated)
> Of course, it is specialized to outperform general models on this specific task in this specific domain with this specific json format for output.
My understanding is that this is generally not considered an obvious result, in that high-parameter generalist models largely outperform lower-parameter specialists.
The real issue is they tested on data in their training set.
Yes, but because it is derived from the same underlying source dataset, it is effectively evaluating on the training dataset, not an independent validation/test dataset.
The difference is subtle but important. If we expect the model to truly outperform a general model, it should generalize to a completely independent set.
They synthetically generated about 281k examples and kept 1k of them for testing.
It's worth pointing out that that's technically not testing on the training set, but looking at how similar examples are in the dataset, it's clear that severe overfitting would be unavoidable. That also makes the headline very misleading.
The weights may not be published since using it for document extraction on even the same format but with slightly different content or lengths would show how abysmally this finetune does outside of the synthetic data.
> All examples are already correlated because they are generated in the same way.
All examples of “document information extraction” would be correlated no matter where they come from because they all would be “document information extraction” examples…
The real question is whether or not the examples are representative of the broad “document information extraction” use-case.
The problem is the methodology they use to hold them out. For a truly independent validation set, they need to hold out the material before augmentation, not after.
If you hold out after augmentation, then you leverage biases from the training regimen already and hence you artificially boost your model's performance. This is not sufficient to demonstrate your model is generalizing properly.
In analogy: instead of taking leaves off of different trees, they are taking leaves from different branches from the same tree.
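To make the hold-out point concrete, here is a minimal Python sketch. It is not the authors' pipeline: `augment`, the `doc0`..`doc19` names, and the split sizes are all made up for illustration. It only shows that shuffling and splitting after augmentation leaves the same source documents on both sides, while splitting the source documents first does not.

```python
import random

def augment(doc: str, n: int, rng: random.Random) -> list[tuple[str, str]]:
    """Hypothetical augmentation: returns (source_doc, variant) pairs."""
    return [(doc, f"{doc}-variant-{rng.randint(0, 10**6)}") for _ in range(n)]

def split_after_augmentation(docs, n_aug, test_frac, seed=0):
    """Augment everything, then hold out a random slice of the examples."""
    rng = random.Random(seed)
    examples = [ex for d in docs for ex in augment(d, n_aug, rng)]
    rng.shuffle(examples)
    n_test = int(len(examples) * test_frac)
    return examples[n_test:], examples[:n_test]

def split_before_augmentation(docs, n_aug, test_frac, seed=0):
    """Hold out source documents first, then augment each side separately."""
    rng = random.Random(seed)
    docs = list(docs)
    rng.shuffle(docs)
    n_test = int(len(docs) * test_frac)
    test_docs, train_docs = docs[:n_test], docs[n_test:]
    train = [ex for d in train_docs for ex in augment(d, n_aug, rng)]
    test = [ex for d in test_docs for ex in augment(d, n_aug, rng)]
    return train, test

def leaked_sources(train, test):
    """Source documents that contribute examples to both train and test."""
    return {src for src, _ in train} & {src for src, _ in test}

docs = [f"doc{i}" for i in range(20)]  # stand-ins for real source documents
train_a, test_a = split_after_augmentation(docs, n_aug=50, test_frac=0.1)
train_b, test_b = split_before_augmentation(docs, n_aug=50, test_frac=0.1)

print(len(leaked_sources(train_a, test_a)))  # nonzero: same "tree" on both sides
print(len(leaked_sources(train_b, test_b)))  # 0: no source document is shared
```

Even the split-before version only guarantees independence at the source-document level. If the augmentation step itself bakes in a fixed style or format, a human-labelled set from a different source is still the stronger check.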
That would definitely make the evaluation more robust. My fear is that with LLMs at hand, people have become allergic to preparing good human-labelled evaluation sets and will always, to some degree, use an LLM as a crutch.
Haha that's crazy I'm so used to reading RL papers that when the blog linked to a textbook about RL I just filled in Sutton & Barto without clicking on the link or thinking any further about the matter.
I think the other criticism I have is that the historical importance of RLHF to ChatGPT is sort of sidelined, and the author at the beginning pinpoints something like the rise of agents as the beginning of the influence of RL in language modelling. In fact, the first LLM that attained widespread success was ChatGPT, and the secret sauce was RLHF... no need to start the story so late in 2023-2024.
I think it's pretty obvious it's 1. Given the recent huge, clearly politically-motivated cuts from the current administration, it feels pretty likely that FOIA could be disrupted under the guise of "cost-saving".
And I think you're supposed to be generous to the commenter, not the current administration ;)
I love that JabRef supports working with multiple libraries (having multiple open at the same time, moving entries between them). The best Zotero could do was restart with different preference files (has that changed? I haven't used it in some time).
And I really like that JabRef syncing requires just syncing the library folder. Zotero syncing really nudges you to the paid plan. Setting up WebDAV just isn't as simple, and the list of supported providers isn't that long.
It really helped me that the backend is a plain BibTeX file. I could resolve issues with it myself. I can also version libraries with git.