
It would be nice to see the Phind Instant weights released under a permissive license. It looks like it could be a useful tool in the local-only code model toolbox.


The speedup would not be that high in practice for folks already using speculative decoding[1]. ANPD is similar but uses a simpler and faster drafting approach. These two enhancements can't be meaningfully stacked. Here's how the paper describes it:

> ANPD dynamically generates draft outputs via an adaptive N-gram module using real-time statistics, after which the drafts are verified by the LLM. This characteristic is exactly the difference between ANPD and the previous speculative decoding methods.

ANPD does provide a more general-purpose solution to drafting that does not require training, loading, and running draft LLMs.

[1] https://github.com/ggerganov/llama.cpp/pull/2926
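
For intuition, here's a toy sketch of the drafting idea (not the paper's implementation, and with the LLM verification step omitted): build n-gram statistics from the tokens generated so far, then greedily propose a short draft for the main model to verify in a single forward pass.

    # Toy sketch of adaptive n-gram drafting (not the paper's code): collect
    # n-gram statistics from tokens generated so far and propose a short
    # draft, which the main LLM would then verify in one forward pass.
    from collections import Counter, defaultdict

    class NGramDrafter:
        def __init__(self, n=3, draft_len=4):
            self.n = n
            self.draft_len = draft_len
            self.stats = defaultdict(Counter)  # prefix tuple -> next-token counts

        def update(self, tokens):
            # Record which token followed each (n-1)-token prefix.
            for i in range(len(tokens) - self.n + 1):
                prefix = tuple(tokens[i:i + self.n - 1])
                self.stats[prefix][tokens[i + self.n - 1]] += 1

        def draft(self, tokens):
            # Greedily extend the context using the most frequent continuation.
            out = list(tokens)
            drafted = []
            for _ in range(self.draft_len):
                prefix = tuple(out[-(self.n - 1):])
                if prefix not in self.stats:
                    break
                nxt = self.stats[prefix].most_common(1)[0][0]
                drafted.append(nxt)
                out.append(nxt)
            return drafted

    drafter = NGramDrafter()
    drafter.update("the cat sat on the mat and the cat".split())
    print(drafter.draft("and the".split()))  # ['cat', 'sat', 'on', 'the']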


Who is already using speculative decoding? I haven't seen anything about it in the llama.cpp or ollama docs.



You might be interested in "Text Embeddings Reveal (Almost) As Much As Text":

> We train our model to decode text embeddings from two state-of-the-art embedding models, and also show that our model can recover important personal information (full names) from a dataset of clinical notes.

https://arxiv.org/pdf/2310.06816.pdf

There's certainly information loss, but there is also a lot of information still present.


Yeah, that paper is what I was thinking about. https://simonwillison.net/2024/Jan/8/text-embeddings-reveal-...

“a multi-step method that iteratively corrects and re-embeds text is able to recover 92% of 32-token text inputs exactly”.


Google released the T5 paper about 5 years ago:

https://arxiv.org/abs/1910.10683

This included full model weights along with a detailed description of the dataset, training process, and ablations that led them to that architecture. T5 was state-of-the-art on many benchmarks when it was released, but it was of course quickly eclipsed by GPT-3.

It was common practice for Google (BERT, T5), Meta (BART), OpenAI (GPT-1, GPT-2), and others to release full training details and model weights. Following GPT-3, it became much more common for labs not to release full details or weights.


> PNG uses deflate. General byte-level patterns. It does not do bespoke image-specific stuff.

That's not quite the whole story. PNG does include simple per-scanline filters, such as representing a line as a difference from the line above, and that may be what the original post is referring to. [1]

[1] https://en.wikipedia.org/wiki/PNG#Filtering
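
As a rough illustration (not libpng code), the "Up" filter stores each byte of a scanline as the difference from the byte directly above it, which turns near-identical lines into long runs of small values that deflate compresses well:

    # Illustration of PNG's "Up" filter (filter type 2), not actual libpng code:
    # each scanline byte is stored as the difference from the byte directly
    # above it, modulo 256, which deflate can then compress more effectively.
    def up_filter(scanline, previous):
        return bytes((cur - prev) % 256 for cur, prev in zip(scanline, previous))

    def up_unfilter(filtered, previous):
        return bytes((f + prev) % 256 for f, prev in zip(filtered, previous))

    prev_line = bytes([10, 20, 30, 40])
    line      = bytes([11, 21, 31, 41])   # nearly identical to the line above
    filtered  = up_filter(line, prev_line)
    print(list(filtered))                  # [1, 1, 1, 1] -- much easier to compress
    assert up_unfilter(filtered, prev_line) == line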


There is a large amount of theoretical research on energy limits in computing. For example, Landauer's principle states that any irreversible change in information requires some amount of dissipated heat, and therefore some energy input [1].

Reversible computing is an attempt to get around this limit by removing irreversible state changes [2].

[1] https://en.wikipedia.org/wiki/Landauer%27s_principle

[2] https://en.wikipedia.org/wiki/Reversible_computing
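
For a sense of scale, a back-of-the-envelope check of the Landauer limit at roughly room temperature (~300 K):

    # Back-of-the-envelope Landauer limit: E = k_B * T * ln(2) per bit erased.
    import math

    k_B = 1.380649e-23    # Boltzmann constant, J/K
    T = 300               # roughly room temperature, K

    e_per_bit = k_B * T * math.log(2)
    print(f"{e_per_bit:.3e} J per bit")   # ~2.87e-21 J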


Then it's about finding a mechanical Toffoli or Fredkin gate like this.


You might be interested in OpenWorm:

https://openworm.org/

This paper might be helpful for understanding the nervous system in particular:

https://royalsocietypublishing.org/doi/10.1098/rstb.2017.037...


This is great to see. It looks like the embedding vectors are half the size of text-embedding-ada-002's (768 vs. 1536 dimensions) while providing competitive performance. This will save space in databases and make lookups somewhat faster.

For those unaware, if 512 tokens of context is sufficient for your use case, there are already many options that outperform text-embedding-ada-002 on common benchmarks:

https://huggingface.co/spaces/mteb/leaderboard
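
As a rough illustration of the storage savings (assuming float32 vectors and ignoring index overhead):

    # Rough storage estimate for 1M embeddings (float32, no index overhead).
    n_vectors = 1_000_000
    bytes_per_float = 4

    for dims in (1536, 768):
        gb = n_vectors * dims * bytes_per_float / 1e9
        print(f"{dims} dims: ~{gb:.1f} GB")
    # 1536 dims: ~6.1 GB
    # 768 dims: ~3.1 GB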


The 768-dimensional embeddings are actually a feature compared to OpenAI's 1536-dimensional ones, beyond just the smaller index size.

In my experience, OpenAI's embeddings are overspecified and do very poorly with cosine similarity out of the box, as they match syntax more than semantic meaning (which matters, since cosine similarity is the metric used for RAG retrieval). Ideally you'd want cosine similarities spread across the range [-1, 1] on a variety of data, but in practice the results cluster in [0.6, 0.8].
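
One quick way to sanity-check this on your own data (assuming your embeddings are rows of a numpy array; the random data below is just a placeholder):

    # Inspect the spread of pairwise cosine similarities for a batch of
    # embeddings (rows of a numpy array); random placeholder data shown here.
    import numpy as np

    embeddings = np.random.randn(1000, 1536)          # replace with real vectors
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = normed @ normed.T
    off_diag = sims[~np.eye(len(sims), dtype=bool)]
    print(off_diag.min(), off_diag.mean(), off_diag.max())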


Unless I'm missing something, it should be possible to map out in advance which dimensions represent syntactic aspects, and then downweight or remove them for similarity comparisons. That map should be a function of the model alone, i.e. fully reusable. Are there any efforts to map out the latent space of the ada models like that?


You wrote "out of the box". Did you find a way to improve this?


You can do PCA or some other dimensionality reduction technique. That’ll reduce computation and improve signal/noise ratio when comparing vectors.
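
A minimal sketch with scikit-learn (the 256-component target and the random placeholder data are arbitrary choices for illustration):

    # Minimal dimensionality-reduction sketch with scikit-learn's PCA;
    # the 256-component target is an arbitrary choice for illustration.
    import numpy as np
    from sklearn.decomposition import PCA

    embeddings = np.random.randn(10_000, 1536)    # replace with real vectors
    pca = PCA(n_components=256)
    reduced = pca.fit_transform(embeddings)       # shape: (10000, 256)

    # New queries are projected with the same fitted components (just a
    # matmul under the hood), so the reduction is cheap at query time.
    query = np.random.randn(1, 1536)
    query_reduced = pca.transform(query)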


Unfortunately this is not feasible with a large number of words due to the quadratic scaling. But thanks for the response!


Not sure what you mean by a large number of words. You can fit a PCA on millions of vectors with reasonable performance, and then inference from it is just a matmul.


Not true. You need a distance matrix (for classical PCA it's a covariance matrix), which scales quadratically with the number of points you want to compare. If you have 1 million vectors, each pair creating a float entry in the matrix, you end up with approximately (10^6)^2 / 2 unique values, which is roughly 2,000 GB of memory at 4 bytes per float.


You might be interested in TinyStories:

https://arxiv.org/abs/2305.07759

> In this work, we introduce TinyStories, a synthetic dataset of short stories that only contain words that a typical 3 to 4-year-olds usually understand, generated by GPT-3.5 and GPT-4. We show that TinyStories can be used to train and evaluate LMs that are much smaller than the state-of-the-art models (below 10 million total parameters), or have much simpler architectures (with only one transformer block), yet still produce fluent and consistent stories with several paragraphs that are diverse and have almost perfect grammar, and demonstrate reasoning capabilities.


OpenLLaMA models up to 13B parameters have now been trained on 1T tokens:

https://github.com/openlm-research/open_llama


Unfortunately not OpenLLaMA-33B yet.


20B is done.

