Hacker News | hyperopt's comments

Charles Petzold wrote one of my favorite books - "Code: The Hidden Language of Computer Hardware and Software". Very excited to see how this turns out and thanks for giving some of this knowledge away for free!


I would also recommend the "NAND to Tetris" book. It covers much the same ground (as I remember it, anyway) but takes a hands-on approach. I enjoyed Code too, though, and it's worth a look for those interested.


The demo for this is great. It's the best non-corporate assistant I've used so far. I suspect most of the gains here relative to the Alpaca model come from the fact that the ShareGPT data are full conversations, which let the assistant respond to earlier messages in a cohesive way. Alpaca's data, by contrast, consisted of single question-answer pairs, so that model seems to lose the context of earlier information. Also, Vicuna's coding abilities are significantly improved relative to Alpaca, to the point that I began to suspect they might be calling out to OpenAI in the backend. Please release the model weights and fine-tuning training data.


Not even close.


Does anyone know of any good test suites we can use to benchmark these local models? It would be really interesting to compare all the ones capable of running on consumer hardware so that users can easily choose the best ones to use. Currently, I'm a bit unsure how this compares to the Alpaca model released a few weeks ago.


The measure of a "good" model is still very subjective. OpenAI has used stuff like standardized test scores to compare the latest iterations of GPT, but that is only one of the many possible objective measures and might not be relevant in a lot of cases. Maybe we'll come to a consensus around such a methodology soon, or maybe it'll be something every user has to judge on their own depending on their goals.


It's even harder to measure when what you value are subjective things like creativity.


The simplest and quickest benchmark is to hold a rap battle between GPT-4 and the local model. Copy-paste the responses between them to enable the cross-model battle.

It is instantly clear how strong the model is relative to GPT-4.
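
Something like this works as the scaffolding (just a sketch; ask_a and ask_b stand in for however you call GPT-4 and your local model, nothing here is a real API):

    # Rough sketch of a cross-model "battle": feed each model's reply to the
    # other one, keeping a separate conversation history per model.
    # ask_a and ask_b are callables you provide that take a message history
    # (a list of {"role", "content"} dicts) and return a reply string.

    def battle(ask_a, ask_b, opening_line, rounds=4):
        history_a = [{"role": "user", "content": opening_line}]
        history_b = []
        for _ in range(rounds):
            verse_a = ask_a(history_a)
            history_a.append({"role": "assistant", "content": verse_a})
            history_b.append({"role": "user", "content": verse_a})

            verse_b = ask_b(history_b)
            history_b.append({"role": "assistant", "content": verse_b})
            history_a.append({"role": "user", "content": verse_b})

            print("A:", verse_a)
            print("B:", verse_b)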


Have you tried it? How did it do?


Someone here did Bard v GPT-4 a few days ago, and GPT-4 mopped the floor with Bard: https://news.ycombinator.com/item?id=35252612


You're talking to the model right now.


Test suites are not reflection complete! https://sdrinf.com/reflection-completeness - essentially, the moment a set of testing data gets significant traction, it becomes a target to optimize for.

Instead, I strongly recommend putting together a list of "control questions" of your own that covers the general and specific use cases you're interested in. In particular, I'd recommend adding questions on topics where you have a high degree of expertise, and topics where you can figure out what an "expert" answer actually looks like; then run the list against the available models yourself.
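
Concretely, something like this (a minimal sketch; the generate callables in MODELS are placeholders you'd wire up to whatever models you're testing):

    # Minimal sketch: run a fixed list of control questions against several
    # models and dump the answers side by side for manual review.
    # Each MODELS entry maps a name to a callable you provide that takes a
    # prompt string and returns the model's answer string.

    CONTROL_QUESTIONS = [
        "Explain X in terms a domain expert would accept.",  # topic you know well
        "What are the trade-offs between A and B for use case C?",
    ]

    MODELS = {
        # "vicuna-13b": lambda prompt: ...,
        # "alpaca-7b":  lambda prompt: ...,
    }

    def run_control_questions(models=MODELS, questions=CONTROL_QUESTIONS):
        results = {}
        for name, generate in models.items():
            results[name] = [generate(q) for q in questions]
        return results

    if __name__ == "__main__":
        for name, answers in run_control_questions().items():
            print(f"=== {name} ===")
            for q, a in zip(CONTROL_QUESTIONS, answers):
                print(f"Q: {q}\nA: {a}\n")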


>Test suites are not reflection complete!

This is true of all the existing NLP benchmarks but I don't see why it should be true in general. In machine vision, for example, benchmarks like ImageNet were still useful even when people were trying to optimize directly for them. (ImageNet shows its age now but that's because it's too easy).

I hope we can come up with something similarly robust for language. It can't just be a list of 1000 questions, otherwise it will end up in the training data and everyone will overfit to it.

For example, would it be possible to generate billions of trivia questions from WikiData? Good luck overfitting on that.
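
As a rough illustration of the idea (the SPARQL query and the question template here are made up for the example, but the endpoint is Wikidata's public one):

    # Rough sketch: generate trivia questions from Wikidata facts via its
    # public SPARQL endpoint. Countries and their capitals are just one tiny
    # slice; the point is that the fact space is enormous.
    import requests

    SPARQL = """
    SELECT ?countryLabel ?capitalLabel WHERE {
      ?country wdt:P31 wd:Q6256 ;   # instance of: country
               wdt:P36 ?capital .   # capital
      SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
    }
    LIMIT 50
    """

    def trivia_from_wikidata():
        resp = requests.get(
            "https://query.wikidata.org/sparql",
            params={"query": SPARQL, "format": "json"},
            headers={"User-Agent": "trivia-benchmark-sketch/0.1"},
            timeout=30,
        )
        resp.raise_for_status()
        for row in resp.json()["results"]["bindings"]:
            country = row["countryLabel"]["value"]
            capital = row["capitalLabel"]["value"]
            yield (f"What is the capital of {country}?", capital)

    for question, answer in trivia_from_wikidata():
        print(question, "->", answer)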


You can try holding a tournament with it vs other models if you can think of a game for them to play.


Idea: Create a set of tests that AI experts vet but that are kept secret. New models are run against them, and only the scores are published.




Fantastic! I was involved in the freemyipod project back when the 4G was still a current product. In fact, the freemyipod.org wiki ran from a server I kept downstairs in my parents' home. My favorite memory was when we all built robots to brute force a return address (https://freemyipod.org/wiki/Nanotron_3000). Glad to see the project is still alive!


I totally remember that Notes exploit! And when the Nano 3G showed the corrupted scrollbar.


Discussion on this is already at https://news.ycombinator.com/item?id=18858724


Nice! Extending this further to support associative and commutative operations yields functionality similar to what Mathematica provides. That's the approach taken by the MatchPy project, which will hopefully be integrated into SymPy so that it can use the Rubi integration ruleset.
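
To illustrate why commutative matching is more work than plain syntactic matching, here's a toy matcher that just tries argument orderings. This is only a sketch of the idea, not MatchPy's actual algorithm or API (real matchers are far more efficient and also handle associativity):

    # Toy commutative pattern matcher: a pattern variable is a string starting
    # with "?", an expression is a (head, *args) tuple or an atom. For a
    # commutative head, arguments may match in any order, so we try
    # permutations instead of matching positionally.
    from itertools import permutations

    def match(pattern, subject, bindings=None, commutative_heads=("+", "*")):
        bindings = dict(bindings or {})
        if isinstance(pattern, str) and pattern.startswith("?"):
            if pattern in bindings and bindings[pattern] != subject:
                return None
            bindings[pattern] = subject
            return bindings
        if not isinstance(pattern, tuple):
            return bindings if pattern == subject else None
        if not isinstance(subject, tuple) or pattern[0] != subject[0]:
            return None
        if len(pattern) != len(subject):
            return None
        arg_orders = (
            permutations(subject[1:]) if pattern[0] in commutative_heads
            else [subject[1:]]
        )
        for args in arg_orders:
            trial = dict(bindings)
            ok = True
            for p, s in zip(pattern[1:], args):
                trial = match(p, s, trial, commutative_heads)
                if trial is None:
                    ok = False
                    break
            if ok:
                return trial
        return None

    # ("+", "?x", 1) matches ("+", 1, "y") with ?x -> "y" thanks to commutativity.
    print(match(("+", "?x", 1), ("+", 1, "y")))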


In CA, Quest Diagnostics requires a doctor's note for the tests I've taken. Other providers or other tests may be different.


That's what I'm thinking. I was anticipating the release of GPU instances, but now I suspect they'll simply leapfrog GPU instances and go straight to this.


At the GCP NEXT keynote in March [1], Jeff Dean demos Cloud ML on GCP. He casually mentions passing the argument "replicas=20" to get "20 way parallelism in optimizing this particular model". GCE does not currently offer GPU instances, and I've never heard the term "replicas" in GPU ML discourse, so these devices may enable a type of parallelism we haven't seen before. Furthermore, his experiments apparently use the Criteo dataset, which is about 10 GB. Now, I haven't looked into the complexity of the model or how far they train it, but right now that sounds really impressive to me.

1: https://youtu.be/HgWHeT_OwHc?t=2h13m6s
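
My loose mental model of "replicas" is plain data parallelism: N copies of the model each compute a gradient on their own shard of the data, and the gradients get averaged before the update. A toy sketch of that idea (no relation to how Cloud ML actually implements it):

    # Toy sketch of data parallelism with N replicas: each replica computes a
    # gradient on its own shard of the batch, the gradients are averaged, and
    # the averaged gradient updates the shared parameters.
    import numpy as np

    def replica_gradient(params, x_shard, y_shard):
        # gradient of mean squared error for a linear model y ~ x @ params
        preds = x_shard @ params
        return 2 * x_shard.T @ (preds - y_shard) / len(y_shard)

    def data_parallel_step(params, X, y, num_replicas=20, lr=0.1):
        x_shards = np.array_split(X, num_replicas)
        y_shards = np.array_split(y, num_replicas)
        grads = [replica_gradient(params, xs, ys)
                 for xs, ys in zip(x_shards, y_shards)]  # in parallel, in principle
        return params - lr * np.mean(grads, axis=0)

    # tiny usage example
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 3))
    true_w = np.array([1.0, -2.0, 0.5])
    y = X @ true_w
    w = np.zeros(3)
    for _ in range(100):
        w = data_parallel_step(w, X, y)
    print(w)  # should be close to true_w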

