I'm working on https://quickchart.io/, a web API for generating chart images. I've expanded it to a WYSIWYG chart editor at https://quickchart.io/chart-maker/, which lets you create an endpoint that you can use to generate variations of custom charts. This is useful for creating charts quickly, or for embedding them in places that don't support dynamic charting (email, SMS, various app plugins, etc).
I messed around with some AI features, mostly just for fun and to see if they could help users onboard. But the core product is decidedly not AI.
I've been interested in automatic testset generation because I find that the chore of writing tests is one of the reasons people shy away from evals. I recently landed eval testset generation for promptfoo (https://github.com/typpo/promptfoo), but it's non-RAG, so it's simpler than your implementation.
I was also eyeballing this paper, https://arxiv.org/abs/2401.03038, which outlines a method for generating asserts from prompt version history that may also be useful for these eval tools.
Thanks! I've been following promptfoo, so I'm glad to see you here. In addition to automatic evals I think every engineer and PM using LLMs should be looking at as many real responses as they can _every day_, and promptfoo is a great way to do that.
It uses meteor data from NASA CAMS [1] to reconstruct the meteoroid cloud that creates the Quadrantids. When Earth passes through the cloud every year, we see a meteor shower.
Each particle in this visualization represents an actual meteor that burned up in the Earth's atmosphere. CAMS reconstructs the orbit of the meteor based on its entry trajectory by triangulating multiple recordings. CAMS is very cool!
Yup, that is definitely cool AF! Is this same visualization available for other showers? Don't want to sound greedy, but this is so compelling that I'm now curious what the other showers look like.
ohmuhgawd. hangs head in shame. I played with the box in the upper right. I looked at the bottom left to see the inset to locate the radiant. But my eyes glazed over at the information in the top left after reading the title, anxious to get to the glorious imagery.
Unless you really want to learn WebGL itself (in which case you should probably learn WebGPU instead), I would recommend learning a framework built on top of it. ThreeJS or BabylonJS are probably the best choices for getting results fast (unless you have prior GPU programming experience).
The Quadrantids are interesting because their source is not obvious, but the most likely one (as noted in the article) is an asteroid with a relatively unusual orbit that is likely an extinct comet.
Evals are important for LLM app development. I've noticed dozens of tools in this space, including 11 (!) YC companies, so I put them together on a page.
Years ago, I built Asterank, an open-source database of asteroids. It landed me a job at Planetary Resources, an "asteroid mining" company: https://www.asterank.com/
Did they have a narrow launch window they couldn't afford to miss? I'm not talking about missions where you eat a big monetary loss on the launchpad and try again, I mean missions which rely on planetary alignments that may not happen again for years, or even the rest of your life, such as Voyager. Or even just missions where you launch successfully, but then after months (or years) of flight time the spacecraft is lost.
I like seeing how familiar structures appear at cosmological scale. Long ago I created a webgl visualization of the Millennium Run, an early large-scale cosmological simulation: https://www.asterank.com/galaxies/
It was a nice way to learn about three.js/webgl and how to make many particles performant. There are probably better visualizations out there nowadays.
"Extensions" and integration into the rest of the Google ecosystem could be how Bard wins at the end of the day. There are many tasks where I'd prefer an integration with my email/docs over a slightly smarter LLM. Unlike ChatGPT plugins, Google has the luxury of finetuning its model for each of their integrations.
The new feature for enriching outputs with citations from Google Search is also pretty cool.
Yes, exactly. Integration is where the real power of these agents can live.
I really want an agent that can help me with pretty simple tasks:
- Hey agent, remember this link and that it is about hyper fast, solar powered, vine ripened, retroencabulators.
- Hey agent, remember that me and Bob Retal talked about stories JIRA-42 and JIRA-72 and we agreed to take actions XYZ
- Hey agent, schedule a zoom meeting with Joe in the afternoon next Tuesday.
- Hey agent, what did I discuss with Bob last week?
Something with retrieval and functional capability could easily end up being easier to use than the actual UIs that are capable of doing these kinds of things now.
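To make the idea concrete, here's a minimal sketch of that "remember/recall" loop. Everything here is hypothetical: a real agent would put an LLM with function calling in front of a proper retrieval store, while this just illustrates the dispatch shape with regexes and an in-memory list.

```python
import re

# Stand-in for a retrieval store (a real agent would use embeddings + a DB).
memory: list[str] = []

def remember(note: str) -> str:
    """Save a free-form note."""
    memory.append(note)
    return f"Saved: {note}"

def recall(keyword: str) -> list[str]:
    """Return all notes mentioning the keyword, case-insensitively."""
    return [note for note in memory if keyword.lower() in note.lower()]

def handle(command: str) -> object:
    """Route a natural-language command to a tool function."""
    if m := re.match(r"remember (.+)", command, re.IGNORECASE):
        return remember(m.group(1))
    if m := re.match(r"what did I discuss with (\w+)", command, re.IGNORECASE):
        return recall(m.group(1))
    return "Sorry, I don't know that command."

handle("remember I talked to Bob about JIRA-42 and JIRA-72")
print(handle("What did I discuss with Bob"))
```

The point is that the hard parts (intent parsing, retrieval quality) are exactly what the LLM would replace; the tool functions themselves stay this simple.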
No doubt about it. Google isn't competing directly with ChatGPT, but is betting that having a small fine-tuned model "close to the data" will dramatically cost-outperform a huge general-purpose LLM. Less resource-intensive inference, less prompt engineering (less noise).
It's a competitive differentiator of Workspace vs. Office that will help retain existing users and maybe someday cause large enterprises to think harder about switching.
In case anyone's interested in running their own benchmark across many LLMs, I've built a generic harness for this at https://github.com/promptfoo/promptfoo.
I encourage people considering LLM applications to test the models on their _own data and examples_ rather than extrapolating general benchmarks.
This library supports OpenAI, Anthropic, Google, Llama/CodeLlama, any model on Replicate or Ollama, and more out of the box. As an example, I wrote up a benchmark comparing GPT model censorship with Llama models here: https://promptfoo.dev/docs/guides/llama2-uncensored-benchmar.... Hope this helps someone.
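For a rough idea of what a run looks like, here's a minimal config sketch (the prompt, test vars, and assertion values are made-up placeholders; the `prompts`/`providers`/`tests` structure follows promptfoo's config format):

```yaml
# promptfooconfig.yaml
prompts:
  - "Summarize this support ticket in one sentence: {{ticket}}"
providers:
  - openai:gpt-4
  - ollama:llama2
tests:
  - vars:
      ticket: "My order arrived damaged and I'd like a refund."
    assert:
      - type: contains
        value: refund
```

Then `npx promptfoo@latest eval` runs every prompt against every provider and shows the outputs side by side, which is how you'd compare models on your own data.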
I noticed on the evaluations, you're looking at the structure of the responses (and I agree this is important). But how do I check the factual content of the responses automatically? I'm wary of manual grading (brings back nightmares of being a TA grading stacks of problem sets for $5/hr).
I was thinking of keyword matching, fuzzy matching, or feeding answers to yet another LLM, but there seems to be no great way that I'm aware of. Any suggestions on tooling here?
The library supports the model-graded factuality prompt used by OpenAI in their own evals. So, you can do automatic grading if you wish (using GPT-4 by default, or your preferred LLM).
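As a sketch, model-graded checks and deterministic checks can be mixed in the same test (the question and reference answer below are made-up examples; `factuality` and `contains` are promptfoo assertion types):

```yaml
tests:
  - vars:
      question: "What year did Apollo 11 land on the moon?"
    assert:
      # Model-graded: a grading LLM compares the output to this reference.
      - type: factuality
        value: "Apollo 11 landed on the moon in 1969."
      # Deterministic fallback: cheap keyword check, no LLM call needed.
      - type: contains
        value: "1969"
```

The deterministic asserts are free and fast; the model-graded one catches paraphrases that keyword matching would miss.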
I'd be interested to see how models behave locally at different parameter sizes or quantization levels with the Ollama integration. For anyone trying promptfoo's Ollama provider, Ollama can be found at https://github.com/jmorganca/ollama
From some early poking around with a basic coding question using Code Llama locally (`ollama:codellama:7b`, `ollama:codellama:13b`, etc. in promptfoo), it seems like quantization has little effect on the output, but changing the parameter count has pretty dramatic effects. This is quite interesting, since the 8-bit quantized 7b model is about the same size as a 4-bit 13b model. Perhaps that's an artifact of this one test, though; I'll be trying more!
They went with a LangChain interface for custom evals, which I really like. I'm curious to hear if anyone has tried both of these. What's been your key takeaway?