
mnkv · 2026-05-06T02:53:11 1778035991

you mean like the thinkpad trackpoint?

tempest_ · 2026-05-06T03:18:10 1778037490

I assume they mean an actual roller ball (like https://en.wikipedia.org/wiki/Compaq_Contura )

It is still a crazy question, though, because if you've seen most laptops from the last 15 years, there is basically no room for them except on the large workstation ThinkPads or large gaming laptops.

blacksmith_tb · 2026-05-06T03:13:28 1778037208

Not the OP, but some older laptop designs had a small trackball where modern machines have a touchpad, e.g. the early PowerBooks[1]

1: https://en.wikipedia.org/wiki/PowerBook_180

mnkv · 2026-05-04T20:31:45 1777926705

Just talk to them as if they were already your friend. Most of what you talk about with friends isn't just mutual interests and you start conversations with them all the time.

mnkv · 2026-03-10T20:30:56 1773174656

This blog post describes the basic work of a research engineer and nothing more. The amount of surprise the author has seems to suggest they haven't really worked in ML for very long.

Honestly? This is the best it's ever been. Getting stuff to run before Hugging Face and uv and Docker containers with CUDA was way worse. Even when everything is open source, go try to run a model and codebase that are 3+ years old. The field just moves very fast.

mnkv · 2026-02-13T18:35:03 1771007703

The paper you're talking about is "Deal or No Deal? End-to-End Learning for Negotiation Dialogues" and it was just AIs drifting away from English. The crazy news article was from Forbes with the title "AI invents its own language so Facebook had to shut it down!" before they changed it after backlash.

Not related to alignment though

https://www.forbes.com/sites/tonybradley/2017/07/31/facebook...

frenchtoast8 · 2026-02-13T21:44:26 1771019066

Friendly reminder that articles like this are not written by Forbes staff but are published directly by the author with little to no oversight by Forbes. It's basically a blog running on the forbes.com domain. I'm sure there are many great contributors to Forbes; I'm just saying that without editorial oversight, the domain it was published on is by definition meaningless. I see people all the time saying something like, "It was on Forbes, it must be true!" They wouldn't be saying that if it was published on Substack or Wordpress.com.

Expert difficulty is also recognizing that articles from "serious" publications like The New York Times can also be misleading or outright incorrect, sometimes obviously so, as with some Bloomberg content over the last few years.

kevin_thibedeau · 2026-02-13T22:16:04 1771020964

Forbes is basically a chumbox aggregator now. I'd lend more credence to any Substack.

conorcleary · 2026-02-14T11:26:45 1771068405

Okay, so those are two very broad blanket statements; we'll all give you the opportunity to walk this back.

mnkv · 2026-01-22T19:29:07 1769110147

My ideal is a Copilot that would evaluate the PR against some basic guidelines that maintainers write down.

And perhaps a way to filter PRs down to just contributor PRs would be easy to implement and pretty useful.

moraesc · 2026-01-27T02:18:44 1769480324

Another GitHub PM here. Thanks for the feedback! We're currently working on adding a way to restrict PR creation to collaborators only. We've also heard some feedback around evaluating PRs against contributing guidelines, which would allow maintainers to clearly define criteria that PRs must meet, so we're exploring that option as well.

mnkv · 2025-12-31T03:57:03 1767153423

Nice work. A while back, I learned convolutions from similar animations: Vincent Dumoulin and Francesco Visin's GIFs.

https://github.com/vdumoulin/conv_arithmetic

_giorgio_ · 2026-01-04T16:14:36 1767543276

Very good arXiv paper; I wish there were some updates on that.

mnkv · 2025-09-30T18:00:37 1759255237

> the generation of 281,128 augmented examples, from which 1,000 were held out as a benchmark test set.

This model is trained on a custom dataset of 280k examples then tested on 1k very similar examples from the same dataset. Of course it is specialized to outperform general models on this specific task in this specific domain with this specific json format for output.

This is a reasonable hobby project and interesting approach to synthetic data generation but not impressive research.

At minimum, you should test your model on other benchmarks that have similar tasks, e.g., DocBench.

gundmc · 2025-09-30T18:42:24 1759257744

It's not novel research, but I think it drives home the point that many narrow applications of AI do not require the largest, latest (and most expensive) models. And in many of those cases, a small fine-tuned model is the most performant and cost-effective.

It is probably obvious to most who follow the space closely, but you'd be surprised how many engineers don't recognize this.

Garlef · 2025-09-30T18:50:03 1759258203

It's a matter of ROI: When is it worth it to build something specialized?

sigbottle · 2025-09-30T19:00:28 1759258828

Well, one day it might be at the level of shell scripting. I don't think about "the tradeoffs of building a specialized shell script", I just do it because it's cheap and easy and solves a problem right then and there.

I don't know how you would even begin to make the same kind of observation for ML models, but it seems possible. The 2010s weren't exactly building out "trivial" models, but compared to the architectures and optimizations out now, yeah, those models are toys by comparison.

ImJasonH · 2025-09-30T18:51:16 1759258276

Is anybody working on making building specialized things easier and cheaper?

-_- · 2025-09-30T19:27:28 1759260448

Yes! At https://RunRL.com we offer hosted RL fine-tuning, so all you need to provide is a dataset and reward function or environment.

selim-now · 2025-10-01T07:15:47 1759302947

yes! check out https://distillabs.ai/ – follows a similar approach except the evaluation set is held out before the synthetic data generation, which I would argue makes it more robust (I'm affiliated)

bangaladore · 2025-09-30T18:10:01 1759255801

> Of course, it is specialized to outperform general models on this specific task in this specific domain with this specific json format for output.

My understanding is that this is generally not considered an obvious result, in that high-parameter generalist models largely outperform lower-parameter specialists.

The real issue is they tested on data in their training set. *

* Incorrect. Edit: misread parent comment.

littlestymaar · 2025-09-30T18:43:08 1759257788

> The real issue is they tested on data in their training set.

Hm, no.

They trained on a part of their synthetic set and tested on another part of the set. Or at least that's what they said they did:

> from which 1,000 were held out as a benchmark test set.

Emphasis mine.

_carltg · 2025-09-30T21:51:18 1759269078

Yes, but because it is derived from the same underlying source dataset, it is effectively an evaluation on the training dataset, not on an independent validation/test dataset.

The difference is subtle but important. If we expect the model to truly outperform a general model, it should generalize to a completely independent set.

bangaladore · 2025-09-30T19:24:57 1759260297

Thanks, rereading it makes it clear that you are correct.

disiplus · 2025-09-30T18:18:42 1759256322

They did not test on the data that they trained on; that's not what he wrote.

DetroitThrow · 2025-09-30T18:22:15 1759256535

They synthetically generated ~281k examples and kept 1k of them for testing.

It's worth pointing out that that's technically not testing on the training set, but looking at how similar examples are in the dataset, it's clear that severe overfitting would be unavoidable. That also makes the headline very misleading.

The weights may not be published, since using the model for document extraction on even the same format, but with slightly different content or lengths, would show how abysmally this fine-tune performs outside of the synthetic data.
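A rough way to quantify the "how similar examples are" point is to compute, for each held-out example, its highest word-trigram Jaccard similarity to any training example. This is an illustrative Python sketch, not anything from the article; the trigram size and the 0.8 threshold are arbitrary assumptions, and at the scale of ~280k examples you would likely want MinHash/LSH instead of this quadratic loop.

```python
from typing import List, Set, Tuple

Shingle = Tuple[str, ...]


def shingles(text: str, n: int = 3) -> Set[Shingle]:
    """Lowercased word n-grams; texts shorter than n become one shingle."""
    words = text.lower().split()
    if len(words) < n:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a: Set[Shingle], b: Set[Shingle]) -> float:
    """|A & B| / |A | B|, defined as 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def max_train_similarity(test: List[str], train: List[str], n: int = 3) -> List[float]:
    """For each test example, its highest similarity to any training example."""
    if not train:
        raise ValueError("train set must be non-empty")
    train_sh = [shingles(t, n) for t in train]
    return [max(jaccard(shingles(x, n), s) for s in train_sh) for x in test]


def near_duplicate_rate(test: List[str], train: List[str],
                        threshold: float = 0.8, n: int = 3) -> float:
    """Fraction of test examples that nearly duplicate some training example."""
    if not test:
        return 0.0
    sims = max_train_similarity(test, train, n)
    return sum(s >= threshold for s in sims) / len(sims)
```

A high rate on a split made after augmentation, compared with a split made at the source-document level, would be one concrete signal of the leakage being described.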

bangaladore · 2025-09-30T19:25:00 1759260300

Thanks, rereading it makes it clear that you are correct.

kingjimmy · 2025-09-30T18:15:15 1759256115

In today's news, overfit models are overfit.

m3kw9 · 2025-09-30T18:06:07 1759255567

So they tested using training examples? Lmao

fxwin · 2025-09-30T18:12:46 1759255966

> held out

Aperocky · 2025-09-30T18:33:03 1759257183

Actually in this case that's not exactly true:

> generation of 281,128 augmented examples

All examples are already correlated because they are generated in the same way.

littlestymaar · 2025-09-30T18:47:03 1759258023

> All examples are already correlated because they are generated in the same way.

All examples of “document information extraction” would be correlated no matter where they come from because they all would be “document information extraction” examples…

The real question is whether or not the examples are representative of the broad “document information extraction” use-case.

_carltg · 2025-09-30T21:53:50 1759269230

The problem is the methodology they use to hold them out. For a truly independent validation set, they need to hold out the material before augmentation, not after. If you hold out after augmentation, then you leverage biases from the training regimen already and hence you artificially boost your model's performance. This is not sufficient to demonstrate your model is generalizing properly.

By analogy: instead of taking leaves off of different trees, they are taking leaves from different branches of the same tree.
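The tree/leaf analogy can be made concrete with a small Python sketch (illustrative only, not the authors' pipeline). `toy_augment` and `source_of` are hypothetical helpers standing in for the real augmentation step; what matters is whether all variants of one source document end up on the same side of the split.

```python
import random
from typing import Callable, List, Tuple

Augment = Callable[[str], List[str]]


def toy_augment(doc: str) -> List[str]:
    """Hypothetical stand-in for LLM augmentation: 3 variants per source."""
    return [f"{doc} v{i}" for i in range(3)]


def source_of(variant: str) -> str:
    """Recover the source id from a toy variant like 'doc7 v2'."""
    return variant.rsplit(" ", 1)[0]


def augment_then_split(sources: List[str], augment: Augment,
                       test_fraction: float = 0.1, seed: int = 0
                       ) -> Tuple[List[str], List[str]]:
    """Leaky: augment everything, then hold out random variants.
    Siblings of a test variant usually remain in the training set."""
    rng = random.Random(seed)
    variants = [v for s in sources for v in augment(s)]
    rng.shuffle(variants)
    n_test = max(1, int(len(variants) * test_fraction))
    return variants[n_test:], variants[:n_test]


def split_then_augment(sources: List[str], augment: Augment,
                       test_fraction: float = 0.1, seed: int = 0
                       ) -> Tuple[List[str], List[str]]:
    """Independent: hold out whole sources first, then augment each side.
    Every variant of a source lands on the same side of the split."""
    rng = random.Random(seed)
    shuffled = list(sources)
    rng.shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    test_src, train_src = shuffled[:n_test], shuffled[n_test:]
    train = [v for s in train_src for v in augment(s)]
    test = [v for s in test_src for v in augment(s)]
    return train, test


if __name__ == "__main__":
    docs = [f"doc{i}" for i in range(20)]
    for name, fn in [("augment-then-split", augment_then_split),
                     ("split-then-augment", split_then_augment)]:
        train, test = fn(docs, toy_augment, test_fraction=0.2)
        shared = {source_of(v) for v in train} & {source_of(v) for v in test}
        print(f"{name}: {len(shared)} source docs on both sides")
```

With these toy inputs the first strategy will almost certainly report shared source documents and the second reports none. The second is essentially a group-aware split, which is what scikit-learn's `GroupShuffleSplit` does when you pass source ids as `groups`.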

selim-now · 2025-10-01T07:20:25 1759303225

That would definitely make the evaluation more robust. My fear is that, with LLMs at hand, people have become allergic to preparing good human-labelled evaluation sets and will always, to some degree, use an LLM as a crutch.

fxwin · 2025-10-02T21:26:03 1759440363

I would agree with that

mnkv · 2025-06-28T01:54:15 1751075655

Reasonable post with a decent analogy explaining on-policy learning. The only major thing I take issue with is

> Reinforcement learning is a technical subject—there are whole textbooks written about it.

and then linking to the still-WIP RLHF book instead of the book on RL: Sutton & Barto.

dawnofdusk · 2025-06-28T05:50:55 1751089855

Haha that's crazy I'm so used to reading RL papers that when the blog linked to a textbook about RL I just filled in Sutton & Barto without clicking on the link or thinking any further about the matter.

I think the other criticism I have is that the historical importance of RLHF to ChatGPT is sort of sidelined, and the author at the beginning pinpoints something like the rise of agents as the beginning of the influence of RL in language modelling. In fact, the first LLM that attained widespread success was ChatGPT, and the secret sauce was RLHF... no need to start the story so late in 2023-2024.

mnkv · on March 18, 2025

I think it's pretty obvious it's 1. Given the recent huge, clearly politically-motivated cuts from the current administration, it feels pretty likely that FOIA could be disrupted under the guise of "cost-saving".

And I think you're supposed to be generous to the commenter, not the current administration ;)

mnkv · on Sept 25, 2024

How does this compare to Zotero?

trueismywork · on Sept 25, 2024

It doesn't. I used JabRef for a long time; Zotero is better. Zotero has browser integration and sync, which are its biggest advantages.

decafb · on Sept 26, 2024

I love that JabRef supports working with multiple libraries (having multiple open at the same time, moving entries between them). The best Zotero could do was restart with different preference files (has that changed? I haven't used it in some time).

And I really like that JabRef syncing requires just syncing the library folder. Zotero syncing really nudges you to the paid plan. Setting up WebDAV just isn't as simple, and the list of supported providers isn't that long.

It really helped me that the backend is a plain bibtex file. I could resolve issues with it myself. I can also version libraries with git.