
The blog post was a little unclear, so my summary was:

- They used QwQ to generate training data (with some cleanup using GPT-4o-mini)

- The training data was then used to FT Qwen2.5-32B-Instruct (non-reasoning model)

- Result was that Sky-T1 performs slightly worse than QwQ but much better than Qwen2.5 on reasoning tasks

There are a few dismissive comments here, but I actually think this is pretty interesting, as it shows how you can FT a foundation model to do better at reasoning.


I wish they had compared it to the R1 distills of Qwen2.5.


I took a brief look (~5 minutes). My $0.02 is that it's not clear what problem you're trying to solve. I get what some of the features do (e.g., templated prompts), but it would be very helpful to have an example of how you actually use magentic versus the non-magentic way. It feels like a lot of syntactic sugar, if I'm being honest (not a bad thing, but something you might want to be clear about, if that's the case).


(author here) I didn't put this in my post, but one of my favorite moments was when I read some of the LlamaIndex source code which pointed to the GitHub commit where they copied the code verbatim from LangChain. (LangChain is MIT-licensed, so it's OK, but I still thought it was funny!)


Not a bad move by Red Hat. Red Hat lost the battle of the cloud to Azure, AWS, and Google, but AI is still a nascent space. vLLM's deployment model fits neatly into Red Hat's traditional on-premise / support-centric business model.


I'm working on something like this! It's simple in concept, but there are lots of fiddly bits. A big one is performance (at least, without spending $$$$$ on GPUs.) I haven't found that much in terms of how to tune/deploy LLMs on commodity cloud hardware, which is what I'm trying this out on.


You can use ONNX versions of embedding models. Those run faster on CPU.

Also, don’t discount plain old BM25 and fastText. For many queries, keyword or bag-of-words based search works just as well as fancy 1536 dim vectors.

You can also do things like tokenize your text using the tokenizer that GPT-4 uses (via tiktoken for instance) and then index those tokens instead of words in BM25.
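
A minimal sketch of that last idea, assuming the tiktoken and rank_bm25 packages (pip install tiktoken rank_bm25); the documents are illustrative:

  import tiktoken
  from rank_bm25 import BM25Okapi

  enc = tiktoken.encoding_for_model("gpt-4")

  docs = [
      "Embedding models can run on CPU via ONNX.",
      "BM25 is a classic bag-of-words ranking function.",
      "fastText provides cheap word vectors.",
  ]

  # Index token IDs instead of whitespace-split words.
  bm25 = BM25Okapi([enc.encode(d) for d in docs])

  scores = bm25.get_scores(enc.encode("bag of words search"))
  print(scores)  # one relevance score per document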


Thanks! I should have been clearer -- embeddings are pretty fast (relatively) -- it's inference that's slow (I'm at 5 tokens/second on AKS).


Could you sidestep inference altogether? Just return the top N results by cosine similarity (or full text search) and let the user find what they need?
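
Something like this sketch, assuming chunk embeddings are precomputed and a hypothetical embed() helper exists:

  import numpy as np

  def top_n(query, doc_embeddings, docs, n=5):
      q = embed(query)  # hypothetical embedding helper, returns a 1-D vector
      q = q / np.linalg.norm(q)
      d = doc_embeddings / np.linalg.norm(doc_embeddings, axis=1, keepdims=True)
      sims = d @ q  # cosine similarity of each document to the query
      idx = np.argsort(-sims)[:n]
      return [(docs[i], float(sims[i])) for i in idx]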

https://ollama.com models also work really well on most modern hardware


I'm running Ollama, but it's still slow on the cloud VM (it's actually quite fast on my M2). My working theory is that with standard cloud VMs, memory <-> CPU bandwidth is the bottleneck. I'm looking into vLLM.

And as to sidestepping inference, I can totally do that. But I think it's so much better to be able to ask the LLM a question, run a vector similarity search to pull relevant content, and then have the LLM summarize this all in a way that answers my question.
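
As a rough sketch of that flow, assuming a local Ollama server and a hypothetical retrieve() helper that does the similarity search (model name is illustrative):

  import requests

  def answer(question):
      # retrieve() is a hypothetical helper returning the most relevant chunks.
      context = "\n\n".join(retrieve(question))
      prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
      resp = requests.post(
          "http://localhost:11434/api/generate",  # Ollama's generate endpoint
          json={"model": "llama3", "prompt": prompt, "stream": False},
      )
      return resp.json()["response"]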


Oh yeah! What I meant is having Ollama run on the user's machine. Might not work for the use case you're trying to build for though :)


This style of embeddings could be quite lightweight/cheap/efficient https://github.com/cohere-ai/BinaryVectorDB


Embedding models are generally lightweight enough to run on CPU, and embedding can be done in the background while the user isn't using their device.


This is cool! I've been trying out bits & pieces of the RAG ecosystem, too, exploring this space.

Here's a question for this crowd: Do we see domain/personalized RAG as the future of search? In other words, instead of Google, you go to your own personal LLM, which has indexed all of the content you care about (whether it's everything from HN, or an extra informative blog post, or ...)? I personally think this would be great. I would still use Google for general-purpose search, but a lot of my search needs are trying to remember that really interesting article someone posted to HN a year ago that is germane to what I'm doing now.


I definitely think there are opportunities to provide more useful & personalized search than what Google offers for at least some queries.

Quality aside, I think the primary challenge is figuring out the right UX for delivering that at scale. One of the really great advantages of Google is that it is right there in your URL bar, and for many of the searches you might do, it works just fine. Figuring out when it doesn't, and how to provide better results then, seems like a big unsolved UX component of personalized search.


(Author here)

One of the things that confused me is that regression models can be predictive, just like time series forecasting — they just do so in a different way. I tried to make this clear in the article (or maybe I’m not understanding what you’re saying).

In a regression model, you’re predicting target variables from feature variables. In a time series, you’re predicting the same variable from its past behavior. This is a subtle but crucial difference.

(And then you can do time series with covariates, which combines the two.)
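
A toy sketch of the two framings (scikit-learn; the data here is synthetic, just to show the shapes):

  import numpy as np
  from sklearn.linear_model import LinearRegression

  y = np.cumsum(np.random.randn(200))  # a time series
  X = np.random.randn(200, 3)          # a feature matrix

  # Regression: predict the target from feature variables.
  reg = LinearRegression().fit(X, y)

  # Time series (autoregression): predict y_t from its own past,
  # here y_{t-1} and y_{t-2}.
  lags = np.column_stack([y[1:-1], y[:-2]])
  ar = LinearRegression().fit(lags, y[2:])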


Many of the most important time series prediction models are called "autoregressive", meaning they are regression models predicting the target from (prior values of) itself. This suggests that statisticians don't really share the view that these domains are distinct, or that regression models should only predict with different variables from the target.


Correct. In terms of what "kind" of model it is, it's all just a variation of the same linear model, y = bx.

That said, there are a lot of special considerations involved with time series data. There is a large number of specialized tools, techniques, and model families dedicated to time series modeling, all of which exist to solve problems that don't arise in other modeling situations. So in practice, time series modeling is a distinct specialization from other kinds of modeling.


Right, AR(n) is a regression model, as are models which take only exogenous variables.

My question is this. According to definitions, can the latter (f(X_t) = y_t) be a time series model if each row of data is a time step? It doesn't have any autoregressive terms in X, so I don't know if it categorically is a time-series model.

Not that this question even matters; it's purely a taxonomy/terminology question.


Yes, it is. A time series model is any model of time series data, and time series data is broadly anything where the data for a single thing/entity varies over time. There are no strict definitions here, just common conventions.


Okay. And we can also say that there are some time series models that aren't regression models, right? For example, the Kalman filter is a "model" of a time series but isn't a regression.


Correct.

Although the term "regression" is a misnomer anyway, and often when people say "regression" they mean "linear model". And by "linear model", we mean specifically a model in which outputs/predictions are some fixed linear combination of the inputs.

It is however possible to interpret the Kalman filter as a kind of dynamic regression model. Check out here if you want a good math workout on that topic: https://stats.stackexchange.com/q/330696

(Another somewhat distinct meaning of the term "regression" is any model with a "continuous" outcome variable. This is usually in contrast to "classification", which is any model that has a "categorical" or discrete outcome variable.)


I have a time series forecasting methodology question that I'll drop here.

Suppose I have exogenous variables that vary over time, X(t). X has about 100 features. What are some methods I can apply to X(t) to automatically engineer features that may be useful for predicting some noisy y(t)?

I want to simultaneously capture interactions/interdependence between the columns of X, as well as the autocorrelation structure of X.

If I treat X as merely tabular data and throw it into a traditional regression model (e.g. XGBoost), it can capture the interdependence structure in X, but it will neglect the autocorrelation structure... unless I manually engineer features that capture the autocorrelation in X (e.g. rolling/shifted/differenced features). I want to explore methods that do that automatically.
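
One mechanical approach is to expand every column into lag/rolling/difference features before handing X to the tabular model; a pandas sketch (the lags and windows are arbitrary):

  import pandas as pd

  def make_ts_features(X: pd.DataFrame, lags=(1, 2, 7), windows=(7, 28)):
      out = {c: X[c] for c in X.columns}
      for c in X.columns:
          for k in lags:
              out[f"{c}_lag{k}"] = X[c].shift(k)                # shifted copies
          for w in windows:
              out[f"{c}_rollmean{w}"] = X[c].rolling(w).mean()  # rolling means
          out[f"{c}_diff1"] = X[c].diff()                       # first differences
      return pd.DataFrame(out).dropna()

Libraries like tsfresh automate this kind of expansion (and prune the resulting features), though 100 columns times many transforms gets big fast.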


It might not be that important to fully capture the autocorrelation structure within X.

Usually our models are doing something like "Y = f(X) + E" where E is some unknown random noise and f() is the relationship that we are trying to infer from the data. We usually take X as "given" or "known", so in that case we are looking at Y conditional on some specific value of X.

If we are just trying to make good predictions, then we don't necessarily care about the structure among the components of X unless that structure tells us something about how Y is affected by X.

Imagine the following "true" relationships in the data, where E and H are unmeasurable random noise:

  Y(t) = b0 + b1 * X(t) + b2 * X(t-1) + E(t)
  X(t) = c * X(t-1) + H(t)

Knowing b0, b1, and b2 is sufficient to predict "Y minus random noise". Knowing c doesn't help us at all.

If you're interested in obtaining good-quality estimates of b1 and b2, then you'll have a problem. That's because the direct effect of X(t-1) on Y is conflated with the indirect effect of X(t-1) on Y via X(t). But if you're just trying to make good predictions for Y, then you don't care as much about confidently distinguishing between b1 and b2.
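
A quick simulation of the two equations above bears this out: a regression on X(t) and X(t-1) recovers b0, b1, and b2 without ever modeling c (the coefficient values here are arbitrary):

  import numpy as np
  from sklearn.linear_model import LinearRegression

  rng = np.random.default_rng(0)
  n, b0, b1, b2, c = 5000, 1.0, 2.0, -1.5, 0.8

  X = np.zeros(n)
  for t in range(1, n):
      X[t] = c * X[t - 1] + rng.normal()        # X(t) = c * X(t-1) + H(t)
  E = rng.normal(size=n)
  Y = b0 + b1 * X + b2 * np.roll(X, 1) + E      # Y(t) = b0 + b1*X(t) + b2*X(t-1) + E(t)

  feats = np.column_stack([X[1:], X[:-1]])      # X(t), X(t-1); drop t = 0
  model = LinearRegression().fit(feats, Y[1:])
  print(model.intercept_, model.coef_)          # ~= b0, (b1, b2)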


If the variables in X(t) have the same time steps, I'd probably look at the cross correlation function of the X vs y, and then build another model on the X to predict X(t+n) and use that as an input for Y(t).


> build another model on the X to predict X(t+n)

I like this idea.

Practically, how would this look? Say X has 100 columns. Do we estimate 100 separate models f_{i}(X_{t}) = X_{i, t+1}, then generate 100 predictions for each time step, and then feed those 100 predictions into a regression to predict Y_{t}?

> cross correlation function of the X vs y

Is this supposed to be combined somehow with the f_{i} outputs?


> > cross correlation function of the X vs y

> Is this supposed to be combined somehow with the f_{i} outputs?

I'd rank the variables by their CCF, and use the top(n) to try to predict the series of interest.

Like, split Y in half, then use the X(1:(t/2)+n) to predict Y(t+n) to see if it works, and then if it works OK, actually model the top n X series and use them to really predict the Y.

It's a pretty manual approach, but you could automate it once you have a better idea what you're aiming for.
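
Automating that first step might look like this sketch (pandas; max_lag and n are arbitrary):

  import numpy as np
  import pandas as pd

  def ccf_rank(X: pd.DataFrame, y: pd.Series, max_lag=14, n=10):
      # Score each column by its strongest correlation with y across a range of lags.
      scores = {}
      for c in X.columns:
          corrs = [y.corr(X[c].shift(k)) for k in range(1, max_lag + 1)]
          scores[c] = np.nanmax(np.abs(corrs))
      return sorted(scores, key=scores.get, reverse=True)[:n]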


Apologies for dropping offline shortly after posting this. :(

I should say that I enjoyed this post. And I think leaning into that confusion is my aim. In particular, my point about stochastic versus random is that they are more synonyms than anything else: just words that different groups came to use to cover similar things.

Which is not to say that there aren't differences in the crowds that use each term. I posit that most of the difference is in the aims of each crowd, and at the end of the day you can get a lot of mileage by embracing the similarities, as opposed to the default of contrasting the differences.

As a fun example: to me, if you view time not as just a number that always goes up, but as a number that cycles through seasonal values, then it is easy to treat it like most any other feature. Similarly, the past is easy to envision as a feature of the present.

I do think the way you described a lot of time series analysis fits a fun read I had in which Mandelbrot proposed a fractal view of time series prediction, where you look for self-similar behavior in the series data and reflect/overlay it on itself. But... as is probably guessable from the rest of my post, a lot of this is far outside my comfort area. I love reading about it from a distance.


(Author here). Or a lot of trial & error :).


My job is now primarily time series forecasting, and we’ve spent a lot of time improving our feature selection and engineering. When I started, I thought: “run correlations against target variables, find the best bunch, and as long as we can explain them and their relation to the target, we are good.”

I was wrong.


I work mostly with regressions, and often it is almost more informative when something you expected to be a significant term isn't. It can help track down interesting behavior.

More recently, machine learning has really enhanced what you can do with regression, for example multivariate regressions where there are non-linear (or partially linear) relationships between feature and target variables.

For example, a recent regression problem involved a chemical reaction. It was suspected that a particular feature began to display non-linear behavior above a threshold, but it was difficult to pinpoint exactly where it departed from linearity. ML was very helpful in analyzing this.

Other than regressions and time series forecasting, I think it's worth knowing about k-means clustering and PCA (Principal Component Analysis) / PLS (Projection to Latent Structures) as well.

I've found PCA to be pretty unknown but very useful. I've had success with it in the past, and found it useful for explaining not just the relationship between the data features and the target variable, but also how the features relate to each other.


I’m just about to start digging into 8 years of data from a few power plants with 16 turbines in total, to see if I can identify some problems we might have before the sensor measurements exceed the alarm threshold.

Taking bearing temperature as an example, I think I will identify periods where the machine has already been generating for an hour, so temperatures have stabilized; then I will have bearing oil inlet temperature and machine load as independent variables, and bearing oil outlet and bearing metal temperatures as dependent variables. It seems like it should be straightforward to find anomalies, but I only started googling how to do this yesterday. There are lots of vendors hawking predictive maintenance software, but I can’t imagine that I couldn’t get similar results with a few weeks’ effort, armed with Python and all of the associated libraries.


Maybe try slopes and second derivatives (change in temperature over time and so forth); you could also try introducing various lag windows into the time series data.

edit: I've also seen a lot of pitches about predictive maintenance / automated anomaly detection. I think the appeal lies in having a one-size-fits-all solution you can apply to multiple pieces of equipment (fans, conveyor belt drives, pumps, etc.) without needing to develop/deploy/maintain bespoke models.

A lot of manufacturing sites won't have a data person on tap (or even people who can write Python). There are also challenges with deployment, especially at remote sites where access is difficult, data connectivity is bad, etc. (think oil/gas pipelines). Most of the pitches seem to combine running ML models with some kind of IoT device, using something like LoRaWAN for connectivity.


Is the product being sold essentially the setup of bespoke models, drawn from the bag of things they’ve done before?

Regressions seem like the obvious way to detect anomalies to me, since they should be 100% repeatable and make sense given the amount of heat being generated and removed. How to apply ML/AI to it, I am not so sure.


A good first step would be scatterplots, time series plots, and the mark 1 eyeball. It helps to understand the shape of the data before you start trying to fit models.


Right, make some scatter plots of bearing temp vs oil inlet temp and machine load, establish a fitted line, then detect an anomaly when new measurements vary from the expected value by more than some threshold.

Doesn’t seem that fancy, but better than waiting for a small problem to turn in to a larger problem.

I think it will also be useful in highlighting differences between identical machines: why does one run 5 degrees hotter on the thrust bearing than the machine right beside it? etc.
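
In code, the fit-then-threshold idea might look like this sketch (scikit-learn; the 3-sigma cutoff is arbitrary):

  import numpy as np
  from sklearn.linear_model import LinearRegression

  def fit_baseline(inlet_temp, load, bearing_temp):
      # Fit bearing temperature as a function of oil inlet temp and load.
      X = np.column_stack([inlet_temp, load])
      model = LinearRegression().fit(X, bearing_temp)
      sigma = (bearing_temp - model.predict(X)).std()
      return model, sigma

  def anomalies(model, sigma, inlet_temp, load, bearing_temp, k=3.0):
      # Flag points whose residual exceeds k standard deviations.
      X = np.column_stack([inlet_temp, load])
      resid = bearing_temp - model.predict(X)
      return np.abs(resid) > k * sigma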


If you have any kind of functional physical model of part wear and eventual failure, you have a huge head start.

That said, there can be a pretty big gap between detecting individual sensor anomalies (undergrad homework) and predicting component failure (build an entire business around it). I have never regretted starting a data project with a small, easy task, and ramping up from there. Whereas I have definitely regretted starting a data project with big goals and/or fancy techniques at the beginning. Set clear incremental goals, and use the early prototyping phases to explore the data and develop a good understanding for what might or might not be possible to accomplish with it.


Having worked in the same problem space, I can heavily recommend to get expert input when evaluating which features to use. Ideally, this is a person who knows the internals of the machinery and/or operations that can help you remove spurious features. As a Data Scientist, one sometimes tends to think that the data explains everything and no expert domain knowledge is needed ("Modern machine translation does work without any knowledge of grammar or language!"). Good luck!


You should look into using generalized additive models (GAMs). They are regression models that allow you to model nonlinear relationships, and even smooth nonlinear interactions between variables, while retaining the benefits of classical regression models like statistically valid confidence intervals and the ability to control for repeated measures. You can also explicitly model periodic behavior, like 24-hour or annual cycles in a predictor variable, and even account for auto-correlation explicitly.

In your example, you could not only pinpoint departure from linearity, but you could get a 95% confidence interval for it.

The best implementation is mgcv in R; pyGAM in Python is OK but lacks many of mgcv's more advanced features. There's even a more ML-flavored implementation in mboost.
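
A minimal pyGAM sketch, assuming a feature matrix X with at least two columns and a response y; mgcv in R offers much more:

  from pygam import LinearGAM, s

  # One smooth term per feature.
  gam = LinearGAM(s(0) + s(1)).fit(X, y)

  # Partial effect of feature 0 with a 95% confidence band, which shows
  # where the relationship departs from linearity.
  XX = gam.generate_X_grid(term=0)
  pdep, conf = gam.partial_dependence(term=0, X=XX, width=0.95)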


This is a case where I have a hard time getting my head round how and why machine learning helps. What models are there available, and what training data do you use? Any background would be appreciated, I use ML for feature detection and image classification, but not yet for regressions.


I'm not a data scientist (my background is engineering); I use Azure ML Studio. The regression feature uses an ensemble of different algorithms; there is an explanation here:

https://learn.microsoft.com/en-us/azure/machine-learning/com...

Once the model has run, it uses something called a mimic to generate model explainability, which lets you explore things like feature importance in the final model. As far as the user interface goes, I mostly used SAS in the past and this feels quite similar.


Wait I still do this, what are your secrets?


My kids have school-issued devices, and there is a ton of tracking on them.

I set up my Pi-Hole on a Raspberry Pi Zero W. I have some brief notes here on my setup: https://www.thelis.org/blog/pi-hole. It works well.


I agree that some GCP services are better than others.

I’ve never used Pub/Sub or Cloud Run, but have been quite happy with BigQuery and GKE.


BigQuery has more footguns than GKE in my experience, but that's perhaps because I have a lot more experience with GKE and know how to avoid its footguns. To me at least, it's understandable enough to say More Nodes is More Money, but completely non-straightforward to say that this query I wrote is going to scan the data in a new and expensive way. Am I doing it wrong?


> To me at least it’s [...] non-straightforward to say that this query I wrote is going to scan the data in a new and expensive way. Am I doing it wrong?

When you put a query in the BigQuery console, it'll tell you "This query will process ??? MB when run" at the top right.

So if you code all your queries interactively in production (which is what everyone else is doing anyway) it's not too hard to keep an eye on.
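
You can also get the same estimate programmatically with a dry run via the google-cloud-bigquery client (the table name below is a placeholder):

  from google.cloud import bigquery

  client = bigquery.Client()
  config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
  job = client.query("SELECT name FROM `project.dataset.table`", job_config=config)
  print(f"This query will process {job.total_bytes_processed} bytes.")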


Are you using slots (https://cloud.google.com/bigquery/docs/slots)? If you aren't, I'd highly recommend you switch. My guess is that it would make your costs much more predictable (it did for us).

Note that this is not the default! :-)

