Arctic dev here. Yes, keeping all experts in memory is the recommendation here, and understandably that's a barrier for some. But once you have one H100 node or two (GPU middle-class, I guess...?), a few things to note:
1. FP6/FP8 inference is pretty good. How-to for a single node: https://github.com/Snowflake-Labs/snowflake-arctic/tree/main... (vLLM support coming soon!)
2. The small number of activated parameters shines in the batch-inference case for cloud providers (rough arithmetic in the sketch below).
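To make point 2 concrete, here's some rough, illustrative arithmetic (the parameter counts below are assumptions for the sake of the example, not official specs): per-token compute tracks the *activated* parameters, while the full parameter set only has to sit in memory.

```python
# Back-of-envelope: why a small activated-parameter count helps batched (throughput) serving.
# All figures are illustrative assumptions, not official model specs.

ACTIVE_PARAMS_MOE = 17e9   # assumed params activated per token for a large MoE
TOTAL_PARAMS_MOE = 480e9   # assumed total params that must be resident in VRAM
DENSE_PARAMS = 70e9        # a dense baseline for comparison (assumption)

def flops_per_token(params):
    # ~2 FLOPs per parameter per token for a forward pass (standard rule of thumb)
    return 2 * params

print(f"MoE   : {flops_per_token(ACTIVE_PARAMS_MOE)/1e9:.0f} GFLOPs/token "
      f"(but ~{TOTAL_PARAMS_MOE/1e9:.0f}B params resident in VRAM)")
print(f"Dense : {flops_per_token(DENSE_PARAMS)/1e9:.0f} GFLOPs/token")

# At large batch sizes, the cost of reading the expert weights is amortized across
# many tokens, so per-token cost approaches the FLOPs bound -- which is where the
# small activated-parameter count pays off for a cloud provider.
```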
> 2. The small number of activated parameters shines in the batch-inference case for cloud providers
Could you elaborate more, please? Batch inference activates pretty much all the experts, since a token in every sequence in the batch could hit a different expert. So at BS=128 you're not really getting a sparsity win.
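To put numbers on that, here's a quick back-of-envelope under a uniform top-k routing assumption (the expert and top-k counts are illustrative, not necessarily the model's actual config):

```python
# Rough estimate of how many distinct experts one batched decode step touches,
# assuming uniform routing (real routers are not uniform; this is illustrative).

num_experts = 128     # per-layer expert count (assumption)
top_k = 2             # experts activated per token (assumption)
tokens_in_step = 128  # e.g. batch size 128, one token per sequence at decode time

p_expert_idle = (1 - top_k / num_experts) ** tokens_in_step
expected_active = num_experts * (1 - p_expert_idle)
print(f"~{expected_active:.0f} of {num_experts} experts expected to be hit "
      f"({100 * (1 - p_expert_idle):.0f}%)")
```

Under those assumptions roughly 85-90% of experts get touched per step, which is the point above: the weights still have to be read, even if per-token FLOPs stay low.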
This is essentially ~400B params. With FP8, and comparing to Grok's ~320B model, which requires 320GB of VRAM in int4, I think what the OP meant is actually 8 H100s.
Which is... a lot, to say the least.
And all the optimization is for latency, not throughput, because with 8 H100s you could easily host 4 replicas of a 70B model.
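Rough weights-only math behind that comparison (illustrative; it uses the ~400B figure from above and ignores KV cache, activations, and runtime overhead):

```python
# Weights-only VRAM estimate for one 8x H100 (80GB) node.
H100_GB = 80
NODE_GB = 8 * H100_GB            # 640 GB per node

moe_fp8_gb    = 400 * 1.0        # ~400B params at 1 byte/param (FP8)
dense_fp16_gb = 70 * 2.0         # one 70B dense replica at 2 bytes/param (FP16)

print(f"MoE @ FP8     : ~{moe_fp8_gb:.0f} GB of weights (node has {NODE_GB} GB)")
print(f"4x 70B @ FP16 : ~{4 * dense_fp16_gb:.0f} GB of weights on the same node")
```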
On day one, Spoke's ML and NLP algorithms can respond with the right answer without any training, as long as the answer is fairly directly related to the query (i.e., has a good amount of keyword overlap). For complex queries where Spoke doesn't find the right answer, you can "train" Spoke very simply and quickly, as part of your normal ticketing/knowledge-management flow and without any extra effort, by responding with the right answer. And to answer your question: Spoke's algorithm learns from these responses in real time, so its performance for your organization gets better over time as more responses accumulate.
For full transparency, Spoke will not deliver 100% accuracy on complex questions, because "open-domain" natural language understanding (unlike the closed-domain tasks, such as pizza ordering, that Alexa or Google Home handle) is still far from solved. However, it will augment your ticketing/knowledge-management flow nicely by automatically resolving an increasingly large share of your questions.
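In spirit, the flow resembles this toy sketch (a hypothetical illustration of keyword-overlap matching plus learning from human replies, not Spoke's actual implementation):

```python
# Toy illustration: answer by keyword overlap, "learn" by storing human answers.
# Entirely hypothetical -- not Spoke's real algorithm or data.

def keywords(text: str) -> set[str]:
    return {w.lower().strip(".,?!") for w in text.split()}

knowledge_base = {
    "How do I reset my VPN password?": "Use the self-service reset portal, then re-enroll MFA.",
}

def answer(query: str, threshold: float = 0.5):
    """Return the best stored answer if keyword overlap clears the threshold."""
    q = keywords(query)
    best, best_score = None, 0.0
    for question, reply in knowledge_base.items():
        overlap = len(q & keywords(question)) / max(len(q), 1)
        if overlap > best_score:
            best, best_score = reply, overlap
    return best if best_score >= threshold else None  # None -> route to a human

def learn(query: str, human_reply: str) -> None:
    # Accumulating resolved questions is what improves coverage over time.
    knowledge_base[query] = human_reply
```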
Great work! A few questions for the author(s): In the article you list 9 feature extractors/templates; in the final model, what's the total number (or rough order of magnitude) of features? How much data (or a ballpark estimate) did you train this on? Did you try to account for potential differences in distribution across your data-sampling sources?