Arctic dev here. Yes, keeping all experts in memory is the recommendation here, and understandably that's a barrier for some. But once you have one H100 node or two (GPU middle-class, I guess...?), a few things to note:
1. FP6/FP8 inference is pretty good. How-to for a single node: https://github.com/Snowflake-Labs/snowflake-arctic/tree/main... (vLLM support coming soon!)
2. The small number of activated parameters shines in the batch-inference case for cloud providers (rough arithmetic in the sketch below).
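To make point 2 concrete, here's some rough, illustrative arithmetic (the parameter counts below are assumptions for the sake of the example, not official specs): per-token compute tracks the *activated* parameters, while the full parameter set only has to sit in memory.

```python
# Back-of-envelope: why a small activated-parameter count helps batched (throughput) serving.
# All figures are illustrative assumptions, not official model specs.

ACTIVE_PARAMS_MOE = 17e9   # assumed params activated per token for a large MoE
TOTAL_PARAMS_MOE = 480e9   # assumed total params that must be resident in VRAM
DENSE_PARAMS = 70e9        # a dense baseline for comparison (assumption)

def flops_per_token(params):
    # ~2 FLOPs per parameter per token for a forward pass (standard rule of thumb)
    return 2 * params

print(f"MoE   : {flops_per_token(ACTIVE_PARAMS_MOE)/1e9:.0f} GFLOPs/token "
      f"(but ~{TOTAL_PARAMS_MOE/1e9:.0f}B params resident in VRAM)")
print(f"Dense : {flops_per_token(DENSE_PARAMS)/1e9:.0f} GFLOPs/token")

# At large batch sizes, the cost of reading the expert weights is amortized across
# many tokens, so per-token cost approaches the FLOPs bound -- which is where the
# small activated-parameter count pays off for a cloud provider.
```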
> 2. The small number of activated parameters shines in the batch-inference case for cloud providers
Could you elaborate more, please? Batch inference activates pretty much all the experts, since a token in every sequence in the batch could hit a different expert. So at BS=128 you're not really getting a sparsity win.
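To put numbers on that, here's a quick back-of-envelope under a uniform top-k routing assumption (the expert and top-k counts are illustrative, not necessarily the model's actual config):

```python
# Rough estimate of how many distinct experts one batched decode step touches,
# assuming uniform routing (real routers are not uniform; this is illustrative).

num_experts = 128     # per-layer expert count (assumption)
top_k = 2             # experts activated per token (assumption)
tokens_in_step = 128  # e.g. batch size 128, one token per sequence at decode time

p_expert_idle = (1 - top_k / num_experts) ** tokens_in_step
expected_active = num_experts * (1 - p_expert_idle)
print(f"~{expected_active:.0f} of {num_experts} experts expected to be hit "
      f"({100 * (1 - p_expert_idle):.0f}%)")
```

Under those assumptions roughly 85-90% of experts get touched per step, which is the point above: the weights still have to be read, even if per-token FLOPs stay low.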
This is essentially ~400B params. With FP8, and comparing to Grok's ~320B model, which requires 320GB of VRAM in int4, I think what the OP meant is actually 8 H100s.
Which is... a lot, to say the least.
And all the optimization is for latency, not throughput, because with 8 H100s you could easily host 4 replicas of a 70B model.
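Rough weights-only math behind that comparison (illustrative; it uses the ~400B figure from above and ignores KV cache, activations, and runtime overhead):

```python
# Weights-only VRAM estimate for one 8x H100 (80GB) node.
H100_GB = 80
NODE_GB = 8 * H100_GB            # 640 GB per node

moe_fp8_gb    = 400 * 1.0        # ~400B params at 1 byte/param (FP8)
dense_fp16_gb = 70 * 2.0         # one 70B dense replica at 2 bytes/param (FP16)

print(f"MoE @ FP8     : ~{moe_fp8_gb:.0f} GB of weights (node has {NODE_GB} GB)")
print(f"4x 70B @ FP16 : ~{4 * dense_fp16_gb:.0f} GB of weights on the same node")
```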
On day one, Spoke's ML and NLP algorithms can respond with the right answer without any training, as long as the answer is fairly directly related to the query (i.e., has a good amount of keyword overlap). For complex queries where Spoke doesn't find the right answer, you can "train" Spoke very simply and quickly, as part of your normal ticketing/knowledge-management flow and without any extra effort, by responding with the right answer. And to answer your question: Spoke's algorithm learns from these responses in real time, so its performance for your organization gets better over time as more responses accumulate.
For full transparency, Spoke will not deliver 100% accuracy on complex questions, because "open-domain" natural language understanding (unlike the closed-domain tasks, such as pizza ordering, that Alexa or Google Home handle) is still far from solved. However, it will augment your ticketing/knowledge-management flow nicely by automatically resolving an increasingly large share of your questions.
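In spirit, the flow resembles this toy sketch (a hypothetical illustration of keyword-overlap matching plus learning from human replies, not Spoke's actual implementation):

```python
# Toy illustration: answer by keyword overlap, "learn" by storing human answers.
# Entirely hypothetical -- not Spoke's real algorithm or data.

def keywords(text: str) -> set[str]:
    return {w.lower().strip(".,?!") for w in text.split()}

knowledge_base = {
    "How do I reset my VPN password?": "Use the self-service reset portal, then re-enroll MFA.",
}

def answer(query: str, threshold: float = 0.5):
    """Return the best stored answer if keyword overlap clears the threshold."""
    q = keywords(query)
    best, best_score = None, 0.0
    for question, reply in knowledge_base.items():
        overlap = len(q & keywords(question)) / max(len(q), 1)
        if overlap > best_score:
            best, best_score = reply, overlap
    return best if best_score >= threshold else None  # None -> route to a human

def learn(query: str, human_reply: str) -> None:
    # Accumulating resolved questions is what improves coverage over time.
    knowledge_base[query] = human_reply
```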
Great work! A few questions for the author(s): In the article you list 9 feature extractors/templates; in the final model, what's the total number (or rough order of magnitude) of features? How much data (or a ballpark estimate) did you train this on? Did you try to account for potential differences in distribution across your data-sampling sources?