Hacker News | johndough's comments

You are correct. Dicts are ordered by insertion. Also, I'd like to add that, maybe surprisingly, sets are not.

    >>> set([3, 2, 1])
    {1, 2, 3}

    >>> set([10, 100, 1000])
    {1000, 10, 100}

I did a quick comparison on MNIST with a small ConvNet, comparing this AdamWScheduleFree optimizer against a few other optimizers (RAdam, NAdam, AdamW, SGD, Adam, Adafactor, SophiaG). The validation accuracy seems to be okay and the train loss decreases remarkably quickly.

Validation accuracy: https://i.imgur.com/8ZtX7Rd.png

Train loss: https://i.imgur.com/o5XdQ29.png

Code: https://bpa.st/NVJQ (currently only runs on my computer, but not enough time to clean it up)

Note that this is just a toy benchmark with very little hyperparameter tuning. You could probably get similar results with most optimizers and an appropriate schedule. Nevertheless, I appreciate every hyperparameter that I do not have to set manually.

In summary, this seems to be a promising optimizer. I'll add it to my list of optimizers to try for new deep learning projects.

> "I'll add it to my list of optimizers to try for new deep learning projects."

Can you share the list of your go to optimizers outside of the Adam family?

I think there's Adam and "Nothing is obviously substantially better than Adam so why bother?"

I've had a lot of luck with CAME https://arxiv.org/abs/2307.02047

> Can you share the list of your go to optimizers outside of the Adam family?

Sure! It depends a bit on what I'm doing.

If I want to optimize someone else's model, I start with Adam, because that's most likely what the hyperparameters have been optimized for. Once I've verified that Adam works, I'll try other optimizers.

If I have very few parameters and don't care about overfitting, I try LBFGS, which usually gets to the local optimum the fastest. Note that this will likely find a sharp local optimum. For better generalization performance, you often prefer a wide optimum, so the model still works if there is a bit of drift in the data.
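A minimal sketch of what that looks like with PyTorch's closure-based LBFGS API (a toy linear fit with a handful of parameters, not anything from the benchmark above):

```python
import torch

# Toy example: fit a tiny linear model with L-BFGS. Note the closure -
# L-BFGS re-evaluates the loss several times per step, unlike Adam/SGD.
torch.manual_seed(0)
x = torch.randn(64, 3)
y = x @ torch.tensor([1.0, -2.0, 0.5]) + 0.3

w = torch.zeros(3, requires_grad=True)
b = torch.zeros(1, requires_grad=True)
opt = torch.optim.LBFGS([w, b], lr=1.0, max_iter=50)

def closure():
    opt.zero_grad()
    loss = ((x @ w + b - y) ** 2).mean()
    loss.backward()
    return loss

opt.step(closure)
print(((x @ w + b - y) ** 2).mean().item())  # essentially zero after one step() call
```

With so few parameters and a convex loss, a single `step()` is enough to reach the optimum; for larger non-convex models the sharp-vs-wide-optimum caveat above applies.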

If I do not want to mess around with learning rates, I use Adafactor, which is a bit slower, but usually works okay without any tuning.

If I had very little memory available, I'd use SGD, but in my opinion it's not worth the hassle of tuning learning rate, momentum, dampening and weight decay. I'd rather use a smaller model if possible.

I usually do not train with extremely large batch sizes, but if I did, I'd try the optimizers which claim to work well for large batch sizes.

All in all, it probably does not matter too much which optimizer you are using, as long as you tuned it a little bit. Same goes for the model, loss functions, activation functions and all that other fluff.

What /is/ important is that you design your problem in such a way that it is as easy as possible to solve. For example, it is very difficult to read arbitrary hand-written text from an image. If you have control over where the data comes from, it would be better to write the text character by character into a printed grid with additional optical markers for image registration. Or even better, replace it with a multiple choice list. If there are not too many exceptional cases, an "other" option for manual review could be added. Often, automating 99 % of the work is more than good enough and it is better to keep a human in the loop to handle edge cases.

Secondly, control the data capture as strictly as possible. For example, use uniform lighting, place the object to recognize at exactly the same position, exclude disruptive elements, etc.

Lastly, data is king. If your training data does not match the test data, you can train all you want and still get garbage results. Either collect enough training data to cover all test cases or, if that is not possible from the start, retrain with new data regularly. Data augmentation might help to some degree, but it is impossible to predict everything.

I was looking for Python code but did not find any, until I realized that the website says "Python-like" rather than "Python". I'd suggest changing the title to reflect that. I did not recognize the *.fizz files as Python.

I agree. The bodies of the actions and functions are Python, actually Starlark (a subset of Python). I'll update the description.

Android supports Ethernet over USB-C, but a quick Google search seems to suggest that performance is lacking. I found this odyssey of some developer trying to reach someone at Google to get it fixed, but they keep giving him the ol' runaround. Works fine on Apple devices though.

* https://support.google.com/android/thread/251842240/ethernet...

* https://support.google.com/googleplay/android-developer/thre...

* https://issuetracker.google.com/issues/319406707

I noticed that many of those websites abuse the debugger functionality, freezing the browser whenever someone tries to open the developer console. Did you find a way around that?

The post mentioned using Fiddler instead. https://docs.telerik.com/fiddler/configure-fiddler/tasks/ins...

If read-only access is all that is needed, SSLKEYLOGFILE might help. https://my.f5.com/manage/s/article/K50557518

Yes. You can disable breaking in dev mode, which gets you a bit further.

But they have some script in there as well that somehow checks whether you're in dev mode, and on top of that it misbehaves if you're in private mode.

At that point I got tired of fighting it, and realized none of this Javascript nonsense was going to work against Fiddler or other proxy-based solution.

You don't even need to do that. Just get the video to start playing and then press F12 to open the dev tools; the video will keep streaming and you'll still be able to see the network requests despite the messages.

That approach will miss the transmission of the playlist file. Chasing the individual fragments of video is not the right way to go about it.

I doubt that anyone is going to download and search through over 800 TB just to find a badly formatted copy of some book that could be found much quicker on different websites with better formatting. Authors are losing fractional cents here at most.

so just like Office Space? (paraphrasing) "We steal a fraction of a cent from each transaction, who do we hurt? Nobody. We just put the remainder into our account!"

Sorry that's not how damages are calculated in the US tort system.

I do not know how damages are calculated in the US tort system. What do they say about the books3 dataset?

I also think that the case is different here, since in your example, there is a specific amount of money being stolen, while in the books3 case, there is an unspecified amount of money not being made by the authors.

I am pretty sure if the authors were trying to license their works for this purpose we would just not use them at all; it is difficult to see under what circumstances they would stand to profit from this other than by suing people after the fact over it.

I think you could argue that authors could profit from their works being cited in an LLM response. It could drive sales of their works much like citations do on the web. The counterargument is that an LLM could give you the CliffsNotes version of the work, taking away a portion of sales.

In a world where the options were to

1) pay the author,

2) implement guaranteed citation of the author any time the model gave an answer that was directly derivative, with an option to not do so if the summary was sufficiently vague, or

3) ignore the author's book completely as training data

we would all choose 3).

And the authors would probably be very happy that you did.

The penalty is up to $150k per violation.

For uploading, not downloading.

All this would not be necessary if Signal did not collect phone numbers at all.

The usual excuse is that they need phone numbers to combat spam, but that is only because they allow arbitrary contact requests from random people. It would be easy to imagine accounts without arbitrary contact permission. Contact requests could still be exchanged by e.g. meeting offline in person or with time-limited friend request codes.

The article included comments from Signal devs and Whittaker about this exact issue. There are valid reasons that Signal does not want to stop using phone numbers.

> “You reach a threshold where you’re actually reducing privacy,” Whittaker said. She gave an example of a person who faces severe threats and normally maintains vigilance but whose mother is only on WhatsApp because she can’t figure out the numberless Signal. The high-threat person would be stuck using the less secure option more often.

How does that make sense? Signal just made phone numbers for contact discovery optional, in which case this person still couldn't find their mother, even though Signal has their number on file.

What people are asking for is for phone numbers to be optional for account creation and identification. Everybody that wants to could still provide (and verify) their phone number for contact discovery, and this could even remain the default for non-sophisticated users as the one described above.

So that seems more like a retroactive justification for an early design choice (Signal was originally TextSecure and used SMS as a transport layer, so making numbers the primary key made total sense back then). The only thing that still makes sense to me today is spam prevention:

> Requiring phone numbers also makes it considerably harder for spammers to abuse Signal. “The existence of a handful of small apps that don’t really have a large scale of users, that don’t require phone numbers, I don’t think is proof that it’s actually workable for a large-scale app,” Whittaker said.

One possible solution could be to tie numberless account creation to a nominal donation payment: Still not great, but spam prevention is unfortunately not free to Signal either.

It's probably also related to them not wanting to make Signal selfhostable. The server (and client) code is open source[0], but self-deployment relies on an external SMS service (as well as AWS for file storage and GCM/APN for push notifications, though those aren't nearly as much of a barrier: AWS has numerous FOSS reimplementations, and for GCM you can use ntfy). Signal devs have stated they don't want self-hosting to happen, since providing libsignal implementations is, as far as I understand, part of how Signal makes money.

SMS is afaik the only real barrier they have to preventing that.

[0]: As in, AGPL and effectively a source dump with no instructions.

Do fused multiply-add operations for matrices really matter? There are two relevant cases I could come up with.

1. You have large matrices. In this case, I'd think that the O(n^2) addition can be ignored, because it gets dwarfed by the O(n^3)-ish multiplication.

2. You have small matrices. In this case, the computation is likely bound by memory bandwidth. Waiting for the matrices A and B takes most of the time. Then you multiply them, which should go quickly, since the matrices are small. Then you do the addition with the matrix C, but since the product AB is already in cache and since you'd have to wait for the matrix C anyway, there is not much to be gained with a fused multiply-add.
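Some back-of-the-envelope numbers for case 1, assuming square n-by-n matrices:

```python
# For D = A @ B + C with n x n matrices, the multiply costs ~2n^3 FLOPs
# (n multiplications and n-1 additions per output element), while the
# trailing addition of C costs only n^2 FLOPs.
for n in (64, 1024, 8192):
    mul_flops = 2 * n**3
    add_flops = n**2
    print(f"n={n}: the addition is {add_flops / mul_flops:.6%} of the multiply cost")
```

Already at n=1024 the addition is roughly 0.05% of the total work, so fusing it buys very little for large matrices.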


> Megatron is a large, powerful transformer [...]

You can ask your website: "What is the computational complexity of self-attention with respect to input sequence length?"

It'll answer something along the lines of self-attention being O(n^2) (where n is the sequence length) because you have to compute an attention matrix of size n^2.

There are other attention mechanisms with better computational complexity, but they usually result in worse large language models. To answer jart: We'll have to wait until someone finds a good linear attention mechanism and then wait some more until someone trains a huge model with it (not Groq, they only do inference).
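The quadratic cost is easy to see in a minimal sketch (NumPy, single head, no masking and no learned projections, purely for illustration):

```python
import numpy as np

def self_attention(x):
    """Single-head self-attention without learned projections (illustration only)."""
    n, d = x.shape
    scores = x @ x.T / np.sqrt(d)                 # (n, n) attention matrix: the O(n^2) part
    scores -= scores.max(axis=-1, keepdims=True)  # subtract row max for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over each row
    return weights @ x                              # (n, d) output

x = np.random.randn(128, 64)
out = self_attention(x)
print(out.shape)  # (128, 64); the intermediate attention matrix was (128, 128)
```

Doubling the sequence length quadruples the size of the intermediate `scores` matrix, which is exactly why long contexts are expensive.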

Changing the way transformer models work is orthogonal to gaining good performance on Mistral. Groq did great work considerably reducing the latency of generating tokens during inference. But I wouldn't be surprised if they etched the A matrix weights in some kind of fast ROM, used expensive SRAM for the skinny B matrix, and sent everything else that didn't fit to good old fashioned hardware. That's great for generating text, but prompt processing is where the power is in AI. In order to process prompts fast, you need to multiply weights against 2-dimensional matrices. There is significant inequality in software implementations alone in terms of how quickly they're able to do this, irrespective of hardware. That's why things like BLAS libraries exist. So it'd be super interesting to hear about how a company like Groq that leverages both software and hardware specifically for inference is focusing on tackling its most important aspect.

One GroqCard has 230 MB of SRAM, which is enough for every single weight matrix of Mixtral-8x7B. Code to check:

    import urllib.request, json, math

    # Read only the safetensors headers (not the weights themselves) to get
    # the shape of every weight matrix in Mixtral-8x7B.
    for i in range(1, 20):
        url = f"https://huggingface.co/mistralai/Mixtral-8x7B-v0.1/resolve/main/model-{i:05d}-of-00019.safetensors?download=true"

        with urllib.request.urlopen(url) as r:
            # safetensors layout: 8-byte little-endian header size, then a JSON header
            header_size = int.from_bytes(r.read(8), byteorder="little")
            header = json.loads(r.read(header_size).decode("utf-8"))
            for name, value in header.items():
                if name.endswith(".weight"):
                    shape = value["shape"]
                    mb = math.prod(shape) * 2e-6  # 2 bytes per bf16 parameter
                    print(mb, "MB for", shape, name)

tome's other comment mentions that they use 568 GroqChips in total, which should be enough to fit even Llama2-70B completely in SRAM. I did not do any math for the KV cache, but it probably fits in there as well. Their hardware can do matrix-matrix multiplications, so there should not be any issues with BLAS. I don't see why they'd need other hardware.
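A rough KV cache estimate, assuming Mixtral's published config (32 layers, 8 KV heads via grouped-query attention, head dim 128) and fp16 storage:

```python
# KV cache = K and V vectors per token, per layer, per KV head.
layers, kv_heads, head_dim, bytes_per_val = 32, 8, 128, 2
per_token = 2 * layers * kv_heads * head_dim * bytes_per_val  # K and V
context = 32768  # Mixtral's maximum context length
total_gib = per_token * context / 2**30
print(per_token, "bytes/token ->", total_gib, "GiB at full 32k context")  # 131072 -> 4.0
```

4 GiB at the full context length is small compared to 568 chips times 230 MB (about 130 GB of SRAM), so the cache should indeed fit.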

OK, thanks, that's useful to know. Personally I'm not involved directly in implementing the model, so I don't know what we do there.
