Great parallel with PubMed RCT vs case reports vs mechanistic inference, and it highlights the same core problem: users trusting output they shouldn't is worse than no output at all. The confidence tiers turned out to be satisfyingly efficient to implement (just YAML frontmatter fields that each skill reads and annotates) and they change how you interact with the results.
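To make that concrete, here's a minimal sketch of the frontmatter pattern — field names like "confidence" and "basis" are invented for illustration, not the actual skill fields:

```python
# Minimal sketch of the confidence-tier pattern: YAML-style frontmatter
# fields that each skill reads and uses to annotate its own output.
# Field names ("confidence", "basis") are hypothetical.

def parse_frontmatter(doc: str) -> tuple[dict, str]:
    """Split '---'-delimited frontmatter into a field dict and the body."""
    _, raw, body = doc.split("---\n", 2)
    fields = {}
    for line in raw.strip().splitlines():
        key, _, value = line.partition(":")
        fields[key.strip()] = value.strip()
    return fields, body

def annotate(output: str, fields: dict) -> str:
    """Prefix generated output with its confidence tier so the user
    sees how much to trust it."""
    tier = fields.get("confidence", "unverified")
    return f"[confidence: {tier}] {output}"

doc = """---
confidence: high
basis: user-provided positioning interview
---
Tagline draft for review.
"""

fields, body = parse_frontmatter(doc)
print(annotate(body.strip(), fields))
# → [confidence: high] Tagline draft for review.
```

The point is that the tier travels with the artifact, so every downstream skill can surface it without re-deciding how trustworthy the content is.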
On the Supabase vs file question: makes total sense for live lookup. The file-based manifest works here because brand building is user-driven and session-based. You're not querying supplement interactions in real time. But the principle is the same: one source of truth, downstream consumers read and annotate, never duplicate state.
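The read-and-annotate rule can be sketched like this — the manifest structure is invented for illustration, but the invariant is the one described above: consumers attach notes to the shared record, they never copy its fields into their own state:

```python
# Sketch of the one-source-of-truth rule: downstream consumers read the
# manifest and attach annotations, but never duplicate its fields into
# skill-local state. The structure shown here is hypothetical.
import copy
import json

MANIFEST = json.loads("""{
  "brand": {"name": "Acme Wellness", "voice": "plainspoken"},
  "annotations": {}
}""")

def annotate_manifest(manifest: dict, skill: str, note: str) -> dict:
    """Return a new manifest with the skill's note attached.
    The brand fields are read in place, never copied elsewhere."""
    out = copy.deepcopy(manifest)
    out["annotations"].setdefault(skill, []).append(note)
    return out

updated = annotate_manifest(MANIFEST, "tagline-skill", "voice check passed")
print(updated["annotations"])
# The original manifest is untouched; only the returned copy carries notes.
```

The same invariant holds whether the store is a file or a Supabase table — only the read path changes.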
About the swap test: it runs at generation time, inside the skill prompt. Each skill has instructions that say "before presenting output, run these checks."
If the swap test fails, the skill flags it inline, explains why it failed, and rewrites before the user ever sees the first version. No separate classifier or post-processing step.
Tradeoff: it's not as rigorous as a trained classifier.
For brand copy the swap test is simple: if a statement could apply generically to any brand or service, it fails. A classifier would be overkill here, though one would probably be necessary for something like your PubMed evidence grading.
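A toy version of that check might look like the following — this is a heuristic stand-in for the LLM's in-prompt judgment, and the brand name and list of specifics are invented:

```python
# Toy swap test: replace the brand name with a placeholder, then ask
# whether anything brand-specific survives. In the real skill this
# judgment happens inside the prompt; here a keyword check stands in.

def swap_test(copy_line: str, brand: str, specifics: list[str]) -> bool:
    """Return True if the line survives the swap test, i.e. it is
    anchored to real brand specifics; False if it reads as generic."""
    swapped = copy_line.replace(brand, "AnyBrand")
    return any(s.lower() in swapped.lower() for s in specifics)

# Hypothetical brand facts the copy must be anchored to.
specifics = ["third-party tested", "single-origin ashwagandha"]

print(swap_test("Acme delivers quality you can trust.", "Acme", specifics))
# → False  (generic: true of any competitor)
print(swap_test("Acme publishes third-party tested batch results.", "Acme", specifics))
# → True   (anchored to a verifiable specific)
```

A failing line would then be flagged inline and rewritten before the user sees it, per the flow above.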