MessagePack can encode rows as well; you just have to manage linking the keys back up during deserialization. In fact, it can encode arbitrary binary data without needing base64 the way JSON does.
Although MessagePack is definitely not a drop-in replacement for JSON, it is certainly extremely useful.
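For example, a minimal sketch with the standard `msgpack` package (the msgspec and ormsgpack libraries mentioned below have their own equivalents):

```python
import base64
import json

import msgpack  # pip install msgpack

payload = {"id": 7, "blob": b"\x00\xffraw bytes\x01"}

# MessagePack handles the bytes field natively and round-trips it unchanged.
packed = msgpack.packb(payload)
assert msgpack.unpackb(packed) == payload

# JSON cannot encode bytes, so you have to base64 them (and remember to decode later).
as_json = json.dumps({"id": 7, "blob": base64.b64encode(payload["blob"]).decode()})
print(len(packed), len(as_json))  # the MessagePack payload is also smaller
```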
Unlike JSON, you can’t just open a MessagePack file in Notepad or vim and have it make sense. It’s often not human readable. So using MessagePack to store config files probably isn’t a good idea if you or your users will ever need to read them for debugging purposes.
But as a format for something like IPC or high-performance, low-latency communication in general, MessagePack brings serious improvements over JSON.
I recently had to build an inference server that needed to be able to communicate with an API server with minimal latency.
I started with gRPC and protobuf since that's what everyone recommends, but after a lot of benchmarking I found a much faster setup: serving MessagePack over HTTP with a Litestar Python server (which is much faster than FastAPI), using msgspec for very fast MessagePack encoding and ormsgpack for very fast decoding.
Not sure how this beat protobuf and gRPC but it did. Perhaps the Python implementation is just slow. It was still faster than JSON over HTTP, however.
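Roughly, that setup can be sketched like this (the route, payload shape, and `run_model` are placeholders for illustration, not the actual server):

```python
# Minimal sketch of a MessagePack-over-HTTP endpoint with Litestar.
import msgspec
import ormsgpack
from litestar import Litestar, Request, Response, post


def run_model(inputs: dict) -> dict:
    # Placeholder for the actual inference call.
    return {"prediction": sum(inputs.get("features", []))}


@post("/infer")
async def infer(request: Request) -> Response:
    payload = ormsgpack.unpackb(await request.body())   # fast decode
    result = run_model(payload)
    return Response(
        content=msgspec.msgpack.encode(result),          # fast encode
        media_type="application/x-msgpack",
    )


app = Litestar(route_handlers=[infer])
# Run with: uvicorn this_module:app
```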
Nice. You can do the same in Python like so, and surprisingly it only takes somewhere between a couple of seconds and a minute to get the printout. No special libraries are needed, just `sys` to enable printing very large numbers.
```python
import sys

sys.set_int_max_str_digits(0)  # Allow printing very large numbers.

x = (2 ** 136_279_841) - 1
print(x)
```
Also, note that 1 << 136_279_841 is much faster than 2 ** 136_279_841; the former runs in <10ms, while the latter takes over 600ms on my machine (a 60x difference).
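A quick way to check the difference on your own machine (timings will vary):

```python
import timeit

# Both expressions build the same integer; the shift just skips the general
# exponentiation machinery.
print("shift:", timeit.timeit("1 << 136_279_841", number=5))
print("power:", timeit.timeit("2 ** 136_279_841", number=5))
```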
It means "left shift" in binary. This corresponds to adding a specified number of zeros at the end of the binary representation of the original number. So 1 << 0 is binary 1, 1 << 1 is binary 10, 1 << 2 is binary 100, 1 << 3 is binary 1000...
If you think about the meaning of place value in binary, this is exactly the same as raising two to a specified power. Each time you shift one place further left in binary, it's equivalent to multiplying the existing number by two. So repeating that a specified number of times is multiplying by a specified power of two.
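A quick illustration (the assert just confirms the equivalence):

```python
for n in range(5):
    print(f"1 << {n} = {1 << n:b} (binary) = {1 << n} (decimal)")

assert 1 << 10 == 2 ** 10  # shifting left by n is multiplying by 2**n
```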
One funny thing about Mersenne primes is that, as a result of what you describe, they are exactly those primes whose binary representation consists of a prime number of ones!
The smallest Mersenne prime, three, is binary 11, while the next largest is seven (111), then 31 (11111), then 127 (1111111). The next candidate, 2047 (11111111111), is not prime.
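A quick check of the first few candidates (plain trial division is fine at this size):

```python
def is_prime(n: int) -> bool:
    return n > 1 and all(n % d for d in range(2, int(n ** 0.5) + 1))

for p in (2, 3, 5, 7, 11, 13):
    m = (1 << p) - 1                 # binary representation: p ones
    print(f"2**{p} - 1 = {m:>5} = {m:b}  prime: {is_prime(m)}")
```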
Simhash is an extremely fast and simple algorithm for detecting near-duplicate text at scale, which makes it particularly useful for deduplicating AI training datasets.
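A toy version of the idea (word-level features and 64-bit fingerprints; production setups hash shingles and index the fingerprints so they never compare all pairs):

```python
import hashlib


def simhash(text: str, bits: int = 64) -> int:
    """Per-bit majority vote over the hashes of the document's features."""
    votes = [0] * bits
    for token in text.lower().split():
        h = int.from_bytes(
            hashlib.blake2b(token.encode(), digest_size=bits // 8).digest(), "big"
        )
        for i in range(bits):
            votes[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if votes[i] > 0)


def hamming(a: int, b: int) -> int:
    return bin(a ^ b).count("1")  # near-duplicates have a small Hamming distance


a = simhash("the quick brown fox jumps over the lazy dog")
b = simhash("the quick brown fox jumped over the lazy dog")
c = simhash("an entirely different sentence about training data")
print(hamming(a, b), hamming(a, c))  # the near-duplicate pair should usually be much closer
```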
> When writing of a couple getting married, it is more common to use the plural form ("the couple are to be wed"). When writing of an established couple, it is more common to use a singular verb ("the couple has six puppies, each more destructive than the next").
So according to MW, we're a bit more right than the NYT.
I'd expect the former in British English, because in that standard, grammatically singular nouns that refer to multiple people are conjugated in the plural. The latter would be typical for American English. It's a bit surprising to see the former used by an American author in an American publication, however.
Yep. And as that Wikipedia article explains, collective nouns are normally taken to be singular in American English:
> In American English, collective nouns almost always take singular verb forms (formal agreement). In cases that a metonymic shift would be revealed nearby, the whole sentence should be recast to avoid the metonymy. (For example, "The team are fighting among themselves" may become "the team members are fighting among themselves" or simply "The team is infighting.") Collective proper nouns are usually taken as singular ("Apple is expected to release a new phone this year"), unless the plural is explicit in the proper noun itself, in which case it is taken as plural ("The Green Bay Packers are scheduled to play the Minnesota Vikings this weekend").
* without already labelled training data (assuming you're referring to causal LLMs).
If you have labelled training data (or semi-labelled data), BERT takes the cake, both in terms of accuracy and efficiency. In fact, you can have good luck getting a CLM to generate noisy labels and then training BERT/RoBERTa on them to get a strong, robust classifier.
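The noisy-labelling step might be sketched like this (the OpenAI client, model name, and prompt are assumptions; any CLM works), with BERT/RoBERTa then fine-tuned on the resulting labels:

```python
# Hedged sketch: use a causal LLM to produce noisy labels for an unlabelled corpus.
from openai import OpenAI

client = OpenAI()


def noisy_label(document: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model choice
        messages=[{
            "role": "user",
            "content": f"Answer RELEVANT or IRRELEVANT only.\n\n{document}",
        }],
    )
    return resp.choices[0].message.content.strip()


docs = ["..."]  # placeholder for your unlabelled corpus
labels = [noisy_label(d) for d in docs]
# Then fine-tune BERT/RoBERTa on (docs, labels).
```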
Between my experience and the arXiv papers I've read, I'd say this:
Personally I am willing to label 1000-2000 documents to create a training set. It's reasonable to make about 2000 simple judgements in an 8-hour "day" so it is something you could do at work or in your spare time in a few days if you want.
You can compute an embedding and then use classical ML algorithms from scikit-learn, such as a support vector machine. My recommender can train a new model in about 3 minutes, and that includes building about 20 models and testing them against each other to produce a model which is well tested and probability-calibrated. This process is completely industrial, can run unattended, and always makes a good model if it gets good inputs. Running it every day is no sweat.
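A minimal version of that pipeline might look like the sketch below (sentence-transformers is assumed for the embedding step; the encoder name and the tiny inline dataset are placeholders for real labelled documents):

```python
from sentence_transformers import SentenceTransformer
from sklearn.calibration import CalibratedClassifierCV
from sklearn.svm import LinearSVC

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed encoder choice

# In practice these would be your 1000-2000 labelled documents.
texts = ["great article on compilers", "deep dive into BERT fine-tuning",
         "celebrity gossip roundup", "sports scores from last night",
         "static analysis for rust", "ten weird tricks for abs"]
labels = [1, 1, 0, 0, 1, 0]

X = encoder.encode(texts)

# CalibratedClassifierCV wraps the SVM so predict_proba returns calibrated scores.
clf = CalibratedClassifierCV(LinearSVC(), cv=3)
clf.fit(X, labels)

print(clf.predict_proba(encoder.encode(["a post about profiling python"])))
```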
You can also "fine-tune" a model, actually changing the parameters of the deep model. I've fine-tuned BERT-family models for my recommender; it takes at least 30 minutes, and the training process is not completely reliable. A reliable model builder would probably do that 20 or so times with different parameters (learning rate, how long to train the model, etc.) and pick out the best model. As it is, the best model from it is about as good as my simpler models, and a bad one is worse. I can picture industrializing it, but I'm not sure it's worth it. In a lot of papers, people just seem to copy a recipe from another paper and don't do any model selection.
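Roughly, that kind of sweep might look like the sketch below (Hugging Face transformers assumed; the base model, dataset, and learning-rate grid are placeholders):

```python
# Sketch of a small learning-rate sweep that keeps the best run by eval loss.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

dataset = load_dataset("imdb")                      # stand-in labelled corpus
tok = AutoTokenizer.from_pretrained("distilbert-base-uncased")


def tokenize(batch):
    return tok(batch["text"], truncation=True, padding="max_length", max_length=256)


encoded = dataset.map(tokenize, batched=True)
train = encoded["train"].shuffle(seed=0).select(range(2000))
test = encoded["test"].shuffle(seed=0).select(range(500))

best = None
for lr in (1e-5, 2e-5, 5e-5):                       # widen this grid in practice
    model = AutoModelForSequenceClassification.from_pretrained(
        "distilbert-base-uncased", num_labels=2)
    args = TrainingArguments(output_dir=f"run-lr-{lr}", learning_rate=lr,
                             num_train_epochs=2, per_device_train_batch_size=16,
                             report_to=[])
    trainer = Trainer(model=model, args=args, train_dataset=train, eval_dataset=test)
    trainer.train()
    eval_loss = trainer.evaluate()["eval_loss"]
    if best is None or eval_loss < best[0]:
        best = (eval_loss, lr)

print("best run (eval loss, learning rate):", best)
```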
My problem is fuzzy: the answer to "Do I like this article?" could vary from day to day. If I had a more precise problem, "fine-tuning" might pull ahead. Some people do get a significant improvement, which would make it worth it, particularly if you don't expect to retrain frequently.
I see papers where somebody does the embedding transformation, but instead of pooling over the tokens (averaging the vectors), they feed the token vectors into an LSTM, GRU, or something like that and train the recurrent model. This kind of model does great when word order matters, as in sentiment analysis. I found that kind of model was easy to train in a repeatable way a decade ago, so that's an option I'd consider.
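That setup might be sketched like this (a frozen BERT-family encoder feeding a small GRU head; the encoder name and layer sizes are assumptions):

```python
# Sketch: per-token embeddings from a frozen encoder, classified by a GRU head
# instead of mean pooling.
from torch import nn
from transformers import AutoModel, AutoTokenizer

ENCODER = "bert-base-uncased"  # placeholder encoder choice


class GRUClassifier(nn.Module):
    def __init__(self, hidden: int = 256, classes: int = 2):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(ENCODER)
        for p in self.encoder.parameters():      # freeze; only the GRU head trains
            p.requires_grad = False
        self.gru = nn.GRU(self.encoder.config.hidden_size, hidden, batch_first=True)
        self.out = nn.Linear(hidden, classes)

    def forward(self, **inputs):
        tokens = self.encoder(**inputs).last_hidden_state   # (batch, seq_len, dim)
        _, h = self.gru(tokens)                              # final hidden state
        return self.out(h[-1])                               # (batch, classes) logits


tok = AutoTokenizer.from_pretrained(ENCODER)
model = GRUClassifier()
batch = tok(["word order matters for sentiment"], return_tensors="pt")
print(model(**batch).shape)  # torch.Size([1, 2])
```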
You're better off trying to figure out which features you like about articles that are less ambiguous as signal. You would then be able to fine-tune models to classify whether those features are present, whether that's classification of chunks, sentences, or tokens. For these, a BERT model could be fine-tuned to detect them efficiently.