
“Do politics have artifacts?” was the rejoinder article. IMO that article should be as widely read as the main one, because it provides a warning to those who take the main one as gospel. Link: https://journals.sagepub.com/doi/abs/10.1177/030631299029003...


This seems like a sheets implementation of something like ChainForge (https://github.com/ianarawjo/ChainForge).

It's curious that Anthropic is entering the LLMOps tooling space; this definitely comes as a surprise to me, as both OpenAI and HuggingFace seem to avoid building prompt engineering tooling themselves. Is this a business strategy of Anthropic's? An experiment? Regardless, it's cool to see a company like them throw their hat into the LLMOps space beyond being a model provider. Interested to see what comes next.


The original poster making this claim used a t-test to compare means (https://x.com/RobLynch99/status/1734278713762549970?s=20). Turns out the data is not normally distributed, making a t-test worthless (https://www.statology.org/t-test-assumptions/).

There might be other tests to do, but for this specific setup, the claim has been debunked.
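For reference, a rank-based test like Mann-Whitney U doesn't assume normality. A minimal sketch (the response-length numbers below are made up, purely to show the call):

    # Compare two samples of response lengths without assuming normality.
    # The numbers are placeholders; substitute the actual per-condition lengths.
    from scipy import stats

    lengths_may = [412, 388, 501, 444, 397, 420, 463, 390]   # hypothetical
    lengths_dec = [405, 379, 498, 441, 400, 415, 459, 385]   # hypothetical

    # Mann-Whitney U is rank-based, so non-normal data is fine.
    u_stat, p_value = stats.mannwhitneyu(lengths_may, lengths_dec, alternative="two-sided")
    print(f"U = {u_stat:.1f}, p = {p_value:.3f}")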


Can’t reproduce this. See for yourself: https://x.com/IanArawjo/status/1734307886124474680?s=20

Inspectable evaluation flow in ChainForge: https://chainforge.ai/play/?f=2yvqkpe1vpus8


A sample size of N=470 vs. N=80 can impact replicability.


Wait. WHAT! This app ChainForge is great!


ChainForge lets you do this, and also set up ad-hoc evaluations with code, LLM scorers, etc. It also shows model responses side-by-side for the same prompt: https://github.com/ianarawjo/ChainForge
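For a sense of what an ad-hoc code evaluation can look like, here's a rough sketch of a Python scoring function; the evaluate(response) signature and the response.text field are assumptions for illustration, not a documented API:

    # Rough sketch: score each LLM response by whether it returned valid JSON.
    # The evaluate(response) signature and response.text are illustrative assumptions.
    import json

    def evaluate(response):
        try:
            json.loads(response.text)
            return True   # parsed cleanly
        except json.JSONDecodeError:
            return False  # not valid JSON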


Thanks!


There is a long-term vision of supporting fine-tuning through an existing evaluation flow. We originally created this because we were worried about how to evaluate ‘what changed’ between a fine-tuned LLM and its base model. I wonder if Vertex AI has an API that we could plug into, though, or if it’s limited to the UI.


I meant for completion, chat, and embedding. Some examples here: https://cloud.google.com/vertex-ai/docs/generative-ai/chat/t...

Vertex AI has the same API as PaLM as far as I know. However, the authorization is through Google Cloud. So I use it like any other GCP API.
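For context, here is a minimal sketch of a chat call through the Vertex AI Python SDK; the project ID and model name are placeholders, and auth comes from your Google Cloud credentials (e.g., gcloud auth application-default login):

    # Minimal sketch of a Vertex AI chat call; project/location/model are placeholders.
    # Authentication goes through Google Cloud Application Default Credentials.
    import vertexai
    from vertexai.language_models import ChatModel

    vertexai.init(project="my-gcp-project", location="us-central1")

    chat_model = ChatModel.from_pretrained("chat-bison@001")
    chat = chat_model.start_chat(context="You are a helpful assistant.")
    response = chat.send_message("Summarize what ChainForge does in one sentence.")
    print(response.text)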

I love the idea of adding fine-tuning as a node, though. Here is the API for creating a model tuning job: https://cloud.google.com/vertex-ai/docs/generative-ai/models...

I wish I could use ChainForge nodes in Node-RED.


Hey Eric! Thank you! As an aside, we are looking to interview some people who’ve used ChainForge (you see, we are academics who must justify our creations through publications… crazy, I know). Would you or anyone on your team be interested in a brief chat?

You can contact me here: https://twitter.com/IanArawjo, or find my email on my CV here: ianarawjo.com

At any rate, glad it was helpful!


Thank you for the kind words! Looking at the photo, I think you wouldn’t need the last prompt node there.

As far as evaluating functions goes, that’s unfortunately a ways off. But we generally prioritize features based on how many people have asked for them in GitHub Issues. (For instance, Chat Turn nodes came from an Issue.) If you post a feature request there, it’ll move up our priority list, and we can also clarify precisely what the feature should be.

