> This repository contains the code that Defog uses for the evaluation of generated SQL. It's based off the schema from the Spider, but with a new set of hand-selected questions and queries grouped by query category. For an in-depth look into our process of creating this evaluation approach, see this.
> Our testing procedure comprises the following steps. For each question/query pair:
> 1. We generate a SQL query (possibly from an LLM).
> 2. We run both the "gold" query and the generated query on their respective database to obtain 2 dataframes with the results.
> 3. We compare the 2 dataframes using an "exact" and a "subset" match. TODO add link to blogpost.
> 4. We log these alongside other metrics of interest (e.g. tokens used, latency) and aggregate the results for reporting.
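For concreteness, here is a minimal sketch of how the "exact" and "subset" dataframe comparisons in step 3 could be done with pandas. This is an illustrative assumption, not the actual sql-eval code; the helper names (`normalize`, `exact_match`, `subset_match`) and the normalization details are hypothetical:

```python
# Illustrative sketch of step 3, not the actual sql-eval implementation.
# normalize / exact_match / subset_match are made-up helper names.
import pandas as pd

def normalize(df: pd.DataFrame) -> pd.DataFrame:
    """Sort columns and rows so ordering differences don't count as mismatches."""
    df = df.reindex(sorted(df.columns), axis=1)
    return df.sort_values(by=list(df.columns)).reset_index(drop=True)

def exact_match(gold: pd.DataFrame, generated: pd.DataFrame) -> bool:
    """Generated result has exactly the same rows and columns as the gold result.
    Note: this simplistic version also requires matching column names/dtypes."""
    gold, generated = normalize(gold), normalize(generated)
    return gold.shape == generated.shape and gold.equals(generated)

def subset_match(gold: pd.DataFrame, generated: pd.DataFrame) -> bool:
    """Every gold column appears, with identical values, somewhere in the
    generated result -- extra columns in the generated query are tolerated."""
    generated_cols = [
        generated[c].sort_values().reset_index(drop=True) for c in generated.columns
    ]
    for c in gold.columns:
        gold_col = gold[c].sort_values().reset_index(drop=True)
        if not any(gold_col.equals(gc) for gc in generated_cols):
            return False
    return True
```

These two booleans, together with per-query metrics such as tokens used and latency, are what would then be aggregated per query category as described in steps 4.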
- awesome-Text2SQL: https://github.com/eosphoros-ai/Awesome-Text2SQL :
> Curated tutorials and resources for Large Language Models, Text2SQL, Text2DSL, Text2API, Text2Vis and more.
- Awesome-code-llm > Benchmarks > Text to SQL: https://github.com/codefuse-ai/Awesome-Code-LLM#text-to-sql
- underlines/awesome-ml//llm-tools.md > RAG > OpenAI > dataherald: https://github.com/underlines/awesome-ml/blob/master/llm-too...
- underlines/awesome-ml//llm-tools.md > Benchmarking > Benchmark Suites, Leaderboards: https://github.com/underlines/awesome-ml/blob/master/llm-too...
- sql-eval: https://github.com/defog-ai/sql-eval
- dataherald/services/engine/dataherald/tests/sql_generator/test_generator.py: https://github.com/Dataherald/dataherald/blob/main/services/...