
Simhash is an extremely fast and simple algorithm for detecting near-duplicate text at scale, which makes it particularly useful for deduplicating AI training datasets.
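
For anyone curious, a minimal hand-rolled sketch of the idea in Python (the 64-bit fingerprint, MD5 token hashing, and the whitespace tokenisation are my own simplifications, not any particular library's implementation):

  import hashlib

  def simhash(text, bits=64):
      # Each token votes +1/-1 on every bit position of its hash.
      v = [0] * bits
      for token in text.lower().split():
          h = int.from_bytes(hashlib.md5(token.encode()).digest()[:8], "big")
          for i in range(bits):
              v[i] += 1 if (h >> i) & 1 else -1
      # The fingerprint keeps the sign of each position's vote.
      return sum(1 << i for i in range(bits) if v[i] > 0)

  def hamming(a, b):
      return bin(a ^ b).count("1")

  a = simhash("the quick brown fox jumps over the lazy dog")
  b = simhash("the quick brown fox jumped over the lazy dog")
  print(hamming(a, b))  # near-duplicates end up a small Hamming distance apart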


Is it “this couple reign” or “this couple reigns”? The former just feels wrong despite it being the title.


Looks terrible to me but apparently both are acceptable:

https://www.merriam-webster.com/grammar/is-couple-singular-o...

> When writing of a couple getting married, it is more common to use the plural form ("the couple are to be wed"). When writing of an established couple, it is more common to use a singular verb ("the couple has six puppies, each more destructive than the next").

So according to MW, we're a bit more right than the NYT.


I'd expect the former in British English, because in that standard, grammatically singular nouns that refer to multiple people are conjugated in the plural. The latter would be typical for American English. It's a bit surprising to see the former used by an American author in an American publication, however.


> It's a bit surprising to see the former used by an American author in an American publication, however.

Strange things happen to American writers who read lots of Russian novels translated by Brits.


I think "reigns" sounds better. A couple is actually a singular noun.



Yep. And as that Wikipedia article explains, collective nouns are normally taken to be singular in American English:

> In American English, collective nouns almost always take singular verb forms (formal agreement). In cases that a metonymic shift would be revealed nearby, the whole sentence should be recast to avoid the metonymy. (For example, "The team are fighting among themselves" may become "the team members are fighting among themselves" or simply "The team is infighting.") Collective proper nouns are usually taken as singular ("Apple is expected to release a new phone this year"), unless the plural is explicit in the proper noun itself, in which case it is taken as plural ("The Green Bay Packers are scheduled to play the Minnesota Vikings this weekend").

I guess "couple" may be one of the exceptions?


Translation issue from the original English.

(joke)


* without already labelled training data (assuming you're referring to causal LLMs).

If you have labelled training data (or semi-labelled data), BERT takes the cake, both in terms of accuracy and efficiency. In fact, you can often have luck getting a CLM to generate noisy labels and then training BERT/RoBERTa on them to get a strong, robust classifier.
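
A minimal sketch of that noisy-label route, assuming Hugging Face transformers/datasets (the model name, toy data, and hyperparameters are placeholders; the "label" column would really come from prompting a CLM):

  from datasets import Dataset
  from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                            Trainer, TrainingArguments)

  # Toy stand-in: imagine these labels were produced by a CLM, not a human.
  data = Dataset.from_dict({
      "text": ["great paper", "spam advert", "useful library", "clickbait"],
      "label": [1, 0, 1, 0],
  })

  tok = AutoTokenizer.from_pretrained("roberta-base")
  data = data.map(lambda x: tok(x["text"], truncation=True,
                                padding="max_length", max_length=128),
                  batched=True)

  model = AutoModelForSequenceClassification.from_pretrained("roberta-base",
                                                             num_labels=2)
  trainer = Trainer(
      model=model,
      args=TrainingArguments(output_dir="clf", num_train_epochs=3,
                             per_device_train_batch_size=8),
      train_dataset=data,
  )
  trainer.train()  # the resulting classifier is small and cheap to serve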


Between my own experience and the arXiv papers I've read, I'd say this:

Personally I am willing to label 1000-2000 documents to create a training set. It's reasonable to make about 2000 simple judgements in an 8-hour "day" so it is something you could do at work or in your spare time in a few days if you want.

You can compute an embedding and then use classical ML algorithms from scikit-learn such as the support vector machine. My recommender can train a new model in about 3 minutes, and that includes building about 20 models and testing them against each other to produce a model which is well tested and probability-calibrated. This process is completely industrial, can run unattended, and always makes a good model if it gets good inputs. Running it every day is no sweat.
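
Not the parent's exact pipeline, but a sketch of that general shape with sentence-transformers and scikit-learn (the encoder choice, the C grid, and the loader function are all assumptions):

  from sentence_transformers import SentenceTransformer
  from sklearn.calibration import CalibratedClassifierCV
  from sklearn.model_selection import GridSearchCV, train_test_split
  from sklearn.svm import LinearSVC

  # Hypothetical loader for your ~1,000-2,000 hand-labelled documents.
  texts, labels = load_my_labelled_docs()

  # Frozen embeddings: the deep model is only a feature transform here.
  encoder = SentenceTransformer("all-MiniLM-L6-v2")
  X = encoder.encode(texts)
  X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.2)

  # The grid search plays the role of building ~20 candidate models and
  # testing them against each other; calibration gives usable probabilities.
  search = GridSearchCV(LinearSVC(), {"C": [0.01, 0.1, 1, 10, 100]}, cv=5)
  search.fit(X_train, y_train)
  clf = CalibratedClassifierCV(search.best_estimator_, cv=5)
  clf.fit(X_train, y_train)

  print(clf.score(X_test, y_test))
  print(clf.predict_proba(X_test[:5]))  # calibrated probabilities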

You can also "fine-tune" a model, actually changing the parameters of the deep model. I've fine-tuned BERT family models for my recommender; it takes at least 30 minutes and the training process is not completely reliable. A reliable model builder would probably do that 20 or so times with different parameters (learning rate, how long to train the model, etc.) and pick out the best model. As it is, the best model it produces is about as good as my simpler models, and a bad one is worse. I can picture industrializing it but I'm not sure it's worth it. In a lot of papers people just seem to copy a recipe from another paper and don't do any model selection.

My problem is fuzzy: the answer to "Do I like this article?" could vary from day to day. If I had a more precise problem, "fine-tuning" might pull ahead. Some people do get a significant improvement, which would make it worth it, particularly if you don't expect to retrain frequently.

I see papers where somebody does the embedding transformation but, instead of pooling over the tokens (averaging the vectors), they feed the token vectors into an LSTM, GRU or something like that and train the recurrent model. This kind of model does great when word order matters, as in sentiment analysis. I found that kind of model was easy to train in a repeatable way a decade ago, so that's an option I'd consider.
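
A rough PyTorch sketch of that shape of model, with frozen BERT token embeddings feeding a GRU head (everything here is illustrative, not taken from any particular paper):

  import torch.nn as nn
  from transformers import AutoModel, AutoTokenizer

  class RecurrentOverTokens(nn.Module):
      # Instead of mean-pooling the token vectors, run a GRU over them
      # and classify from its final hidden state.
      def __init__(self, encoder_name="bert-base-uncased", hidden=128, classes=2):
          super().__init__()
          self.encoder = AutoModel.from_pretrained(encoder_name)
          for p in self.encoder.parameters():
              p.requires_grad = False  # frozen encoder; only the GRU head trains
          self.gru = nn.GRU(self.encoder.config.hidden_size, hidden,
                            batch_first=True)
          self.head = nn.Linear(hidden, classes)

      def forward(self, input_ids, attention_mask):
          tokens = self.encoder(input_ids=input_ids,
                                attention_mask=attention_mask).last_hidden_state
          _, h = self.gru(tokens)      # h: (1, batch, hidden)
          return self.head(h.squeeze(0))

  tok = AutoTokenizer.from_pretrained("bert-base-uncased")
  batch = tok(["I loved it", "I did not love it"], return_tensors="pt",
              padding=True)
  model = RecurrentOverTokens()
  logits = model(batch["input_ids"], batch["attention_mask"])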


You're better off trying to figure out which features of the articles you like are less ambiguous as a signal. You could then fine-tune models to classify whether those features are present, whether that's classification of chunks, sentences, or tokens. For these, a BERT model could be fine-tuned to detect them efficiently.


I misinterpreted the title as Apple approving development of iOS apps on PC. That would’ve been an exciting development.


> and indeed, much of what used to be fertile farmland in that part of the world is now desert from over-farming

Do you have a source for this?


Probably because the title of the article uses 402,000,000 Mbps and I think there’s a rule(?) on HN about not misrepresenting titles or rewording them too much, though I’d wager 402 Tbps is a fair alteration.


Although it's worth noting that more recent techniques (e.g., Transformers) need stop words for context.


Does it actually help, though? I would think the embeddings of "cat" and "the cat" would be functionally similar.
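
A quick way to sanity-check that intuition, assuming sentence-transformers (the model choice is arbitrary and this only tests one embedding model):

  from sentence_transformers import SentenceTransformer, util

  model = SentenceTransformer("all-MiniLM-L6-v2")
  a, b = model.encode(["cat", "the cat"])
  # A similarity near 1 would suggest the stop word adds little here.
  print(util.cos_sim(a, b))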


There are two reasons I can think of why someone might reuse a tokeniser:

1. They want to continue pretraining a model instead of starting from scratch. But actually people might not know that you can pretty easily reuse model weights even when training with a new tokeniser (I’ve got a blog post on how to do that: https://umarbutler.com/how-to-reuse-model-weights-when-train... ).

2. Because it’s convenient for end users. Tokenising and chunking really large corpora can take a long time and it’s nice that I can use the GPT2 tokeniser and then train a bunch of different models on that data without having to retokenise everything (rough sketch of that workflow below).
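
For point 2, a minimal tokenise-once sketch assuming Hugging Face datasets (the file paths and glob are placeholders):

  from datasets import load_dataset, load_from_disk
  from transformers import AutoTokenizer

  tok = AutoTokenizer.from_pretrained("gpt2")

  # Tokenise the corpus once and cache the token IDs to disk...
  corpus = load_dataset("text", data_files={"train": "corpus/*.txt"})
  tokenised = corpus.map(lambda x: tok(x["text"]), batched=True,
                         remove_columns=["text"])
  tokenised.save_to_disk("corpus_gpt2_tokenised")

  # ...then every later training run reuses the cache instead of
  # re-tokenising gigabytes of text.
  ready = load_from_disk("corpus_gpt2_tokenised")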


I’ve trained tokenizers on medium-sized datasets (5+ GB of text, although that could be considered small or large depending on who you ask) and have always found training quite fast. As in, it takes a couple of minutes.

Maybe if we’re talking terabytes it might not scale as well but so far in my experience training tokenizers has never been an issue. It’s training models that takes ages.
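
For reference, the sort of thing I mean, using the Hugging Face tokenizers library (the corpus path, vocab size, and special token are placeholders):

  from tokenizers import ByteLevelBPETokenizer

  # Train a byte-level BPE tokenizer from plain-text files.
  tokenizer = ByteLevelBPETokenizer()
  tokenizer.train(files=["corpus.txt"], vocab_size=32_000, min_frequency=2,
                  special_tokens=["<|endoftext|>"])
  tokenizer.save("my_tokenizer.json")

  print(tokenizer.encode("Training tokenizers is fast.").tokens)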


It surprises me that this is a surprise; I thought this was a given.


Yeah, I'm very certain that I've read about this phenomenon in the not-so-recent past. To the point where I've intentionally listened for the difference on many occasions since then.

