Hacker Newsnew | past | comments | ask | show | jobs | submit | appleroll's commentslogin

Hi HN — I’m the author.

This project started as an open-source system for detecting prompt injections in LLMs. The goal is to flag adversarial prompts before they reach a model, while keeping latency low and probabilities well-calibrated.

The main insight came from ensembles: not all models are equally good at every case. Instead of just averaging outputs, I:

1. Benchmarked each candidate model first to see what it actually contributes.

2. Remove models that don’t improve the ensemble through ablation studies (e.g., ProtectAI's Deberta finetune was dropped as it only contributed 0.5% to ECE and actually decreased accuracy).

3. Weight predictions by each model’s accuracy, letting models specialize in what they’re good at.

With this approach, the ensemble is smaller (~237M parameters vs ~600M for the leading baseline), 2x faster, and more calibrated (lower Expected Calibration Error) while still achieving competitive accuracy. Lower confidence on wrong predictions makes it safer for “human-in-the-loop” fallback systems.

For more info, you can check it out here: https://github.com/appleroll-research/promptforest

This project is open to all forms of contributions, and I’d love to hear feedback from the HN community — especially on ideas to further improve calibration, robustness, or ensemble design.


PromptForest — a fast, ensemble-based prompt injection detector for real-world AI safety

Prompt injection is an adversarial attack in LLM systems: malicious inputs that manipulate model behavior by slipping in hidden instructions. As AI usage grows in products, pipelines, and public APIs, detecting and mitigating these injections becomes a practical production problem.

PromptForest is an open-source ensemble detector that emphasizes speed, uncertainty awareness, and reliability without relying on massive models.

How it works - Runs multiple lightweight prompt-injection detectors in parallel. - Uses a voting/discrepancy mechanism to flag risky prompts. - Generates uncertainty scores: disagreement between models can trigger human review or stricter handling. - Small ensemble → faster inference (~100 ms per request) and lower resource usage. - Better-calibrated confidence estimates reduce overconfident mistakes compared to some existing detectors.

Why it matters

Prompt injection can leak private prompts or subvert agent workflows. Most current defenses rely on large classifiers or hard-coded heuristics:

- Big models are slow and expensive at scale. - Single detectors can be overconfident on edge cases. - Zero-risk doesn’t exist, but better calibration helps trigger sensible defenses.

PromptForest aims to be practical, open, and easy to run without a massive GPU footprint.

Technical Highlights

- Ensemble with voting/discrepancy scoring for ambiguous cases. - Supports multiple detection backends (e.g., LLaMA prompt guard variants). - Python-first with CLI and server mode for easy integration. - Optimized for latency and confidence calibration.

Who is this for

- Developers integrating LLMs in user-generated content pipelines - AI researchers focused on adversarial safety - Infrastructure teams needing fast, explainable detection - Community contributors who prefer open source tools over black boxes

Repo: https://github.com/appleroll-research/promptforest Try it out here: https://colab.research.google.com/drive/1EW49Qx1ZlaAYchqplDI...

Feedback is welcome, especially on integration patterns, benchmarks, or potential improvements.


Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: