
I'm working on https://quickchart.io/, a web API for generating chart images. I've expanded it to a WYSIWYG chart editor at https://quickchart.io/chart-maker/, which lets you create an endpoint that you can use to generate variations of custom charts. This is useful for creating charts quickly, or using them in places that don't support dynamic charting (email, SMS, various app plugins, etc).
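As a minimal sketch of how the API is typically used, you pass a Chart.js config in the URL's `c` parameter (the chart data below is just a placeholder):

    // Build a QuickChart image URL from a Chart.js config (placeholder data).
    const chartConfig = {
      type: "bar",
      data: {
        labels: ["Q1", "Q2", "Q3", "Q4"],
        datasets: [{ label: "Visitors", data: [120, 150, 180, 210] }],
      },
    };

    // QuickChart renders the config passed via the `c` query parameter.
    const url =
      "https://quickchart.io/chart?c=" +
      encodeURIComponent(JSON.stringify(chartConfig));

    // Use the URL anywhere an image works: <img> tags, emails, SMS, etc.
    console.log(url);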

I messed around with some AI features, mostly just for fun and to see if they could help users onboard. But the core product is decidedly not AI.


Congrats on the launch!

I've been interested in automatic testset generation because I find that the chore of writing tests is one of the reasons people shy away from evals. I recently landed eval testset generation for promptfoo (https://github.com/typpo/promptfoo), but it's non-RAG, so it's simpler than your implementation.

Was also eyeballing this paper https://arxiv.org/abs/2401.03038, which outlines a method for generating asserts from prompt version history that may also be useful for these eval tools.


Thanks! I've been following promptfoo, so I'm glad to see you here. In addition to automatic evals I think every engineer and PM using LLMs should be looking at as many real responses as they can _every day_, and promptfoo is a great way to do that.


I posted this visualization of mine in a recent thread on the Quadrantids, but sharing again because people seemed to enjoy it: https://www.meteorshowers.org/view/Quadrantids

It uses meteor data from NASA CAMS [1] to reconstruct the meteoroid cloud that creates the Quadrantids. When Earth passes through the cloud every year, we see a meteor shower.

Each particle in this visualization represents an actual meteor that burned up in the Earth's atmosphere. CAMS reconstructs the orbit of the meteor based on its entry trajectory by triangulating multiple recordings. CAMS is very cool!

[1] http://cams.seti.org/


Yup, that is definitely cool AF! Is this same visualization available for other showers? Don't want to sound greedy, but this is so compelling, I'm now curious what the other showers look like.


There's a dropdown at the top of the page where you can choose a shower, or all of them.


ohmuhgawd. hangs head in shame. i played with the box in the upper right. i looked at the bottom left to see the inset to locate the radiant. But my eyes glazed over at the information in the top left after reading the title, anxious to get to the glorious imagery.


Please be kind to yourself. Life is hard enough.


fair enough, but i do like to own up to my ID10T errors


This seems like a cool concept, but it's broken for me.

Firefox 113.0.2 shows the orbits and background, but no particles as far as I can tell.

Chromium 120.0.6099.129 shows just a black screen plus the widgets, nothing else.

Both on amd64 Debian Linux.

Update: bumping FF to 121.0 did not seem to help.


Wow, this is incredibly cool! It's nerd-sniping me into wanting to (try to) learn WebGL and orbital mechanics.


Unless you really want to learn WebGL itself (in which case you should probably learn WebGPU instead), I would recommend learning a framework built on top of it. ThreeJS or BabylonJS are probably the best choices for getting results fast, unless you have prior GPU programming experience.


Using recorded meteor data from NASA CAMS, I built this visualization of the meteor cloud that creates the Quadrantids: https://www.meteorshowers.org/view/Quadrantids

The Quadrantids are interesting because their source is not obvious. The most likely candidate (as noted in the article) is an asteroid with a relatively unusual orbit that is probably an extinct comet.


This is the first time I'm seeing the data presented like this, and it's really impressive, thank you.

Do you know where I can do further reading on why the orbital planes of these meteors are different from the plane of our solar system?


Great visualization. Thank you for sharing.


Evals are important for LLM app development. I've noticed dozens of tools in this space, including 11 (!) YC companies, so I put them together on a page.


Years ago, I built Asterank, an open-source database of asteroids. It landed me a job at Planetary Resources, an "asteroid mining" company: https://www.asterank.com/


> you may literally see your life's work go up in flames.

Incidentally, this happened to Lewicki a few years later when Planetary Resources' first satellite blew up on an Antares rocket: https://www.geekwire.com/2014/rocket-carrying-planetary-reso...


Did they have a narrow launch window they couldn't afford to miss? I'm not talking about missions where you eat a big monetary loss on the launchpad and try again, I mean missions which rely on planetary alignments that may not happen again for years, or even the rest of your life, such as Voyager. Or even just missions where you launch successfully, but then after months (or years) of flight time the spacecraft is lost.


I like seeing how familiar structures appear at cosmological scale. Long ago I created a webgl visualization of the Millennium Run, an early large-scale cosmological simulation: https://www.asterank.com/galaxies/

It was a nice way to learn about three.js/webgl and how to make many particles performant. There are probably better visualizations out there nowadays.
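If anyone wants to poke at something similar, the usual trick for keeping huge particle counts fast in three.js is to put everything into one THREE.Points object backed by a BufferGeometry, so the whole cloud renders in a single draw call. A rough sketch of that pattern (not the actual Asterank code; positions here are random placeholders):

    import * as THREE from "three";

    // One Points object backed by a BufferGeometry = one draw call,
    // which is what keeps hundreds of thousands of particles performant.
    const COUNT = 200_000;
    const positions = new Float32Array(COUNT * 3);
    for (let i = 0; i < positions.length; i++) {
      positions[i] = (Math.random() - 0.5) * 1000; // placeholder coordinates
    }

    const geometry = new THREE.BufferGeometry();
    geometry.setAttribute("position", new THREE.BufferAttribute(positions, 3));

    const material = new THREE.PointsMaterial({ size: 1.5, color: 0xffffff });
    const particles = new THREE.Points(geometry, material);

    const scene = new THREE.Scene();
    scene.add(particles);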


"Extensions" and integration into the rest of the Google ecosystem could be how Bard wins at the end of the day. There are many tasks where I'd prefer an integration with my email/docs over a slightly smarter LLM. Unlike ChatGPT plugins, Google has the luxury of finetuning its model for each of their integrations.

The new feature for enriching outputs with citations from Google Search is also pretty cool.


Yes, exactly. Integration is where the real power of these agents can live.

I really want an agent that can help me with pretty simple tasks:

- Hey agent, remember this link and that it is about hyper fast, solar powered, vine ripened retroencabulators.
- Hey agent, remember that me and Bob Retal talked about stories JIRA-42 and JIRA-72 and we agreed to take actions XYZ.
- Hey agent, schedule a zoom meeting with Joe in the afternoon next Tuesday.
- Hey agent, what did I discuss with Bob last week?

Something with retrieval and functional capability could easily end up being easier to use than the actual UIs that are capable of doing these kinds of things now.


oh so a working google assistant? :)


Yes, basically. I'm shocked at how bad that is compared to what it could be. If anything, it has gotten worse since introduction.


I think they tuned down the model because TPUs are needed for actual AI work


Having to explicitly say "remember that" is so anachronistic. Of course computers (should) remember everything.


No doubt about it. Google isn't competing directly with ChatGPT, but is betting that having a small fine-tuned model "close to the data" will dramatically cost-outperform a huge general-purpose LLM. Less resource-intensive inference, less prompt engineering (less noise).


Yes - but the big question is: how do you monetize this? It might completely kill search and ads.


Workspace is already monetized


It’s a competitive differentiator of Workspace vs. Office that will help retain existing users and maybe someday cause large enterprises to think more about switching.


By showing ads... They'll figure out units that look native and history will repeat itself.


In case anyone's interested in running their own benchmark across many LLMs, I've built a generic harness for this at https://github.com/promptfoo/promptfoo.

I encourage people considering LLM applications to test the models on their _own data and examples_ rather than extrapolating general benchmarks.

This library supports OpenAI, Anthropic, Google, Llama and Codellama, any model on Replicate, and any model on Ollama, etc. out of the box. As an example, I wrote up an example benchmark comparing GPT model censorship with Llama models here: https://promptfoo.dev/docs/guides/llama2-uncensored-benchmar.... Hope this helps someone.
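For a quick picture of what a run looks like, here is a rough sketch using the Node API (assuming an evaluate() entry point; the prompt, providers, and test case below are placeholders, and exact options may differ from the current API):

    import promptfoo from "promptfoo";

    async function main() {
      // Compare the same prompt across several providers with a simple assertion.
      // The prompt, providers, and test data here are placeholders.
      const results = await promptfoo.evaluate({
        prompts: ["Summarize in one sentence: {{text}}"],
        providers: ["openai:gpt-4", "openai:gpt-3.5-turbo", "ollama:llama2"],
        tests: [
          {
            vars: { text: "The quick brown fox jumps over the lazy dog." },
            assert: [{ type: "contains", value: "fox" }],
          },
        ],
      });

      // Each result pairs a prompt/provider/test combination with pass/fail info.
      console.log(JSON.stringify(results, null, 2));
    }

    main();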


ChainForge has similar functionality for comparing prompts and models: https://github.com/ianarawjo/ChainForge

LocalAI creates a GPT-compatible HTTP API for local LLMs: https://github.com/go-skynet/LocalAI

Is it necessary to have an HTTP API for each model in a comparative study?


Thanks for sharing this, this is awesome!

I noticed that in the evaluations you're looking at the structure of the responses (and I agree this is important). But how do I check the factual content of the responses automatically? I'm wary of manual grading (brings back nightmares of being a TA grading stacks of problem sets for $5/hr).

I was thinking of keyword matching, fuzzy matching, or feeding answers to yet another LLM, but there seems to be no great way that I'm aware of. Any suggestions on tooling here?


The library supports the model-graded factuality prompt used by OpenAI in their own evals, so you can do automatic grading if you wish (using GPT-4 by default, or your preferred LLM).

Example here: https://promptfoo.dev/docs/guides/factuality-eval
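For reference, a test case using it looks roughly like this and slots into the tests array of an eval (the question and reference answer are made-up placeholders):

    // A test case using the model-graded "factuality" assertion, which grades
    // the model's answer against a reference answer. Placeholder values below.
    const test = {
      vars: { question: "In what year did Apollo 11 land on the Moon?" },
      assert: [
        {
          type: "factuality",
          value: "Apollo 11 landed on the Moon in 1969.",
        },
      ],
    };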


OpenAI/evals > Building an eval: https://github.com/openai/evals/blob/main/docs/build-eval.md

"Robustness of Model-Graded Evaluations and Automated Interpretability" (2023) https://www.lesswrong.com/posts/ZbjyCuqpwCMMND4fv/robustness... :

> The results inspire future work and should caution against unqualified trust in evaluations and automated interpretability.

From https://news.ycombinator.com/item?id=37451534 : add'l benchmarks: TheoremQA, Legalbench


Tooling focusing on custom evaluation and testing is sorely lacking, so thank you for building and sharing this!


I'd be interested to see how models behave at different parameter sizes or quantization levels locally with the Ollama integration. For anyone trying promptfoo's local model Ollama provider, Ollama can be found at https://github.com/jmorganca/ollama

From some early poking around with a basic coding question using Code Llama locally (`ollama:codellama:7b`, `ollama:codellama:13b`, etc. in promptfoo), it seems like quantization has little effect on the output, but changing the parameter count has pretty dramatic effects. This is quite interesting since the 8-bit quantized 7b model is about the same size as a 4-bit 13b model. This is just one test though – I'll be trying it with more tests!


This is really cool!

I've been using this auditor tool that some friends at Fiddler created: https://github.com/fiddler-labs/fiddler-auditor

They went with a LangChain interface for custom evals, which I really like. I'm curious to hear if anyone has tried both of these. What's been your key takeaway from them?


Thanks for sharing, looks interesting!

I've actually been using a similar LLM evaluation tool called Arthur Bench: https://github.com/arthur-ai/bench

Some great scoring methods built in and a nice UI on top of it as well


I was just digging into promptfoo the other day for some good starting points in my own LLM eval suite. Thanks for the great work!


This is impressive. Good work.

