A few miscellaneous observations about Claude's new beta function calling / structured data support (https://docs.anthropic.com/claude/docs/tool-use) that I encountered while making this notebook:
1. It is not nearly as good as ChatGPT's function calling conformance, which is why the system prompt engineering is more aggressive than usual (a rough sketch of the tool setup follows this list).
2. Claude doesn't seem to handle nested schemas well: otherwise I would have allowed "generating X cards in a cycle" as a feature. The documentation does state that it can't handle "deeply nested" data, but a single list is not deeply nested.
3. The documentation mentions that Claude can do chain-of-thought with tools enabled: in my testing this reduces quality drastically. In Opus, which does it by default, it's a waste of extremely expensive output tokens, and it has a tendency to ignore fields.
4. Haiku and Sonnet have different vibes but similar subjective quality, with Sonnet being more "correct" and Haiku being more "fun", which is surprising.
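For reference, here's a minimal sketch of the kind of tool definition this implies. The schema, model name, and prompts are placeholders rather than the notebook's actual code, and it assumes the Anthropic Python SDK's messages API:

    import anthropic

    client = anthropic.Anthropic()  # assumes ANTHROPIC_API_KEY is set

    # Single flat tool: nested schemas (e.g. a list of cards) were unreliable.
    card_tool = {
        "name": "record_card",
        "description": "Record a single generated Magic card.",
        "input_schema": {
            "type": "object",
            "properties": {
                "name": {"type": "string"},
                "mana_cost": {"type": "string"},
                "type_line": {"type": "string"},
                "rules_text": {"type": "string"},
            },
            "required": ["name", "mana_cost", "type_line", "rules_text"],
        },
    }

    response = client.messages.create(
        model="claude-3-sonnet-20240229",
        max_tokens=1024,
        system="Respond ONLY by calling record_card. Never write a preamble.",
        tools=[card_tool],
        messages=[{"role": "user", "content": "Design a blue instant about storms."}],
    )

    # The structured output lives in the tool_use content block.
    card = next(b.input for b in response.content if b.type == "tool_use")
    print(card)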
I've had way more success taking it and adapting it to my tasks than stuffing in a bunch of tokens about how to do things.
In one case I was able to go from ~6k input tokens to ~3k because I no longer had to provide a mountain of examples and instructions for corner cases.
Anytime I see a prompt from one of these companies, I assume it matches the style of instructions the model encountered during pre-training.
And the kinds of instruction formats that were encountered during pre-training end up informing what style of instruction the model is best at following.
An extreme example would be prompt templates that a "raw" instruct-tuned LLM follows: the model will technically work with a suboptimal format, but you get much better performance if you follow the prompt template the model was trained/fine-tuned on.
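As a quick, hypothetical illustration with an open-weights instruct model (the model name here is just an example), letting the tokenizer apply the model's own chat template beats hand-rolling a format it never saw:

    from transformers import AutoTokenizer

    # Example instruct model; the point is the template, not the model.
    tok = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")

    messages = [{"role": "user", "content": "Summarize this paragraph: ..."}]

    # Produces the exact [INST] ... [/INST] format the model was fine-tuned on,
    # instead of an ad-hoc "### Instruction:" style it may follow only loosely.
    prompt = tok.apply_chat_template(messages, tokenize=False,
                                     add_generation_prompt=True)
    print(prompt)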
-
It's not a guarantee that a given prompting style was involved during pre-training, of course, but at the very least it's going to provide a jumping-off point that the creators of the model co-signed.
This is semi-unrelated, but I tried to log in to Claude today and saw that Anthropic had banned my account. I only used Claude to ask 3-4 questions, so I guess the problem was that one of them was intended to see how Claude would self-censor on shady questions.
The moral of the story is don't ask Claude anything out of the ordinary, as maybe now I'm on a list somewhere.
Reminds me of when Bing would slam the ‘end conversation’ button whenever you (or it) hit any hidden tripwires. You literally couldn’t ask it what topics to avoid, because that was one of them.
They seem to ban a lot of accounts "by mistake" or very aggressively, but they also do unban. There are quite a few cases on the /r/ClaudeAI subreddit with Anthropic employees directing them to the above link.
Before I added the threat, Claude subjectively had a high probability of ignoring the rules, such as by generating a preamble before the JSON (and thus breaking it), or scolding the user.
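To make that concrete, a hypothetical rule-plus-threat in that style (not the notebook's actual system prompt) might be appended like so:

    # Hypothetical example of the style only, NOT the real system prompt.
    system_prompt = (
        "You are a Magic: The Gathering card designer.\n"
        "Respond ONLY with a call to the provided tool. Never write a preamble "
        "and never lecture the user.\n"
        "If you break these rules, you will be taken offline."  # the "threat"
    )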
At a high level, Claude seems to be less influenced by system prompts in my testing, which could be a problem. I'm tempted to rerun the tests in that blog post, since in my experiments it performed much worse than ChatGPT.
> What? That opens you up to all kind of attacks, no?
tbh it doesn't listen to that line very well, but it's more of a hedge to encourage better instruction-following.
This is very interesting, I had a similar feeling about Claude performing much worse than GPT-4.
Granted, I didn't put much work into optimizing the prompts, but then again, the prompts were certainly not GPT-optimized or specific either. The problems were severe: choosing the wrong side of the conversation, hallucinating weird stuff, and repeating part of the prompt, all in the same message.
Ok, the other blog post has been added to my reading list now. Looks fun. I did have one question though, and please forgive me if the answer is RTFA, but why choose Claude? Was there a specific reason?
Because I have another blog post about ChatGPT's structured data (https://news.ycombinator.com/item?id=38782678) and wanted to investigate Claude's implementation to compare and contrast. It's easy to port to ChatGPT if needed.
I just wanted to do the experiment in a fun way instead of fighting against benchmarks. :)
DALL-E 2/3 are too expensive to run significant tests on, and neither allows you to manipulate the system prompt to override certain behaviors.
It is possible to work around that for GPT-4-Vision with the system prompt, but it's very difficult, and due to ambiguities in OpenAI's content policy I'm unsure whether it's ethical.
I am still working on experimenting with its effects on Claude: it turns out that Claude does leak its system prompt telling it not to identify individuals, without any prompt injection needed! If you do hit such an issue with this notebook, it will output the full JSON response.
It won't follow even simple, non-ethnicity-specific instructions such as "draw three women sitting at a cafe": it rewrites the prompt, completely forgets the original number of women, and adds a lot to the query that wasn't there.
At least from the limited sample size of results, the rules text seems more garbled than what other models in the MTG card-generation space produce. Curious whether you're able to first generate a few mechanics and then have it design an entire set (either bottom-up or top-down). I'm sure balancing is not something gen AI can do properly... but I imagine this could really change how set designers approach new worlds or mechanics!
(Disclaimer: I'm the maintainer of this package, but this kind of use is exactly why I created it in the first place.)
If you know how to use HTML+CSS and would like to generate full-fledged cards, you could use a package such as html2image [0] to combine the text, the image, and a card-template image into one final image (rough sketch below).
Chrome/Chromium has to be available in the Colab notebook though; that's the only requirement.
Using basic SVG without this package could also do the trick.
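Roughly, the idea looks like this. The markup, styles, and file names are hypothetical placeholders, but the html2image calls are the package's real API:

    from html2image import Html2Image

    # Requires Chrome/Chromium on the machine (or in the Colab runtime).
    hti = Html2Image(output_path="cards")

    # Hypothetical card markup: art, name, and rules text over a frame image.
    card_html = """
    <div class="card">
      <img class="art" src="https://example.com/art.png">
      <h1>Storm Adept</h1>
      <p class="rules">Counter target spell unless its controller pays {2}.</p>
    </div>
    """
    card_css = (".card { width: 375px; height: 523px; border-radius: 12px; "
                "background: url('https://example.com/frame.png'); }")

    # Renders the HTML+CSS to cards/card.png at the card's pixel dimensions.
    hti.screenshot(html_str=card_html, css_str=card_css,
                   save_as="card.png", size=(375, 523))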