You could be the first if you were to develop an eval (preferably automated, with an LLM as judge) and compare local deep research with Perplexity's, OpenAI's, and DeepSeek's implementations on high-information questions.
Given a benchmark corpus, the evaluation criteria could be:
- Facts extracted: the number of relevant facts extracted from the corpus
- Interpretations: based on those facts, the % of correct interpretations made
- Correct predictions: based on the above, the % of correct extrapolations / interpolations / predictions made
The ground truth could be a JSON file per example.
(If the solution you want to benchmark uses a graph DB, you could compare these aspects with an LLM as judge.)
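To make this concrete, here's a minimal sketch of what the judge step could look like. It assumes the per-example ground-truth JSON has `facts`, `interpretations`, and `predictions` keys, and uses a placeholder judge model name and prompt; all of that is my own guess at a schema, not a fixed design.

```python
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set; any capable judge model works

JUDGE_PROMPT = """You are grading a research report against ground truth.
Ground-truth facts: {facts}
Ground-truth interpretations: {interpretations}
Ground-truth predictions: {predictions}

Report:
{report}

Return JSON with three fields:
- facts_found: ground-truth facts the report states
- correct_interpretations: ground-truth interpretations the report makes
- correct_predictions: ground-truth predictions the report makes
"""

def judge_report(report_text: str, ground_truth_path: str) -> dict:
    """Score one generated report against its per-example ground-truth JSON."""
    with open(ground_truth_path) as f:
        gt = json.load(f)  # assumed keys: facts, interpretations, predictions

    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder judge model
        response_format={"type": "json_object"},
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            facts=gt["facts"],
            interpretations=gt["interpretations"],
            predictions=gt["predictions"],
            report=report_text,
        )}],
    )
    result = json.loads(response.choices[0].message.content)

    return {
        "fact_recall": len(result["facts_found"]) / len(gt["facts"]),
        "interpretation_accuracy": len(result["correct_interpretations"]) / len(gt["interpretations"]),
        "prediction_accuracy": len(result["correct_predictions"]) / len(gt["predictions"]),
    }
```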
---
The actual writing is more a matter of formal/business/academic style, which I find less relevant for a benchmark.
However, I would find it crucial to run a "reverse RAG" pass over the generated report to ensure each claim has a source. [0]
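A rough sketch of what I mean by that, again with a placeholder judge model and prompts of my own invention: split the report into atomic claims, then ask whether each claim is supported by the corpus. (In practice you'd retrieve the top-k chunks per claim rather than stuffing the whole corpus into one prompt.)

```python
import json
from openai import OpenAI

client = OpenAI()

def verify_claims(report_text: str, corpus_chunks: list[str]) -> list[dict]:
    """'Reverse RAG': extract the report's claims, then look for a supporting source for each."""
    # 1. Break the report into atomic factual claims.
    claims_resp = client.chat.completions.create(
        model="gpt-4o",  # placeholder
        response_format={"type": "json_object"},
        messages=[{"role": "user", "content":
            'List every factual claim in this report as JSON {"claims": [...]}:\n\n'
            + report_text}],
    )
    claims = json.loads(claims_resp.choices[0].message.content)["claims"]

    # 2. For each claim, ask whether any corpus chunk supports it.
    results = []
    for claim in claims:
        verdict_resp = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content":
                f"Claim: {claim}\n\nSources:\n" + "\n---\n".join(corpus_chunks)
                + "\n\nAnswer 'supported' or 'unsupported'."}],
        )
        verdict = verdict_resp.choices[0].message.content.strip().lower()
        results.append({"claim": claim, "supported": verdict.startswith("supported")})
    return results
```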