How we hit the top of the BIRD-SQL benchmark: fine-tuning and diverse RAG

April 22, 2024

We're excited to announce that our paper describing the inner workings of Dubo-SQL is now on arXiv. It covers both our v1 model, which reached the top of the BIRD-SQL leaderboard in November 2023, and our v2 model, which does even better.

BIRD-SQL is the most realistic text-to-SQL benchmark, comprising 12,751 real user questions over 95 databases spanning dozens of industries. The questions are challenging: human accuracy is only about 93%.

Dubo-SQL v1 hit #1 on the BIRD-SQL leaderboard, ahead of models from Tencent and Alibaba. We used a simple prompt with a GPT-3.5 Turbo model fine-tuned on the BIRD-SQL training set. For any generated SQL that produced a syntax error, we sent the error message back to the LLM for correction. The model is token-efficient and fast, with inference costs roughly 50x lower than those reported for models with lower accuracy.

[Figure: Dubo-SQL v1 benchmark results]
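
For readers curious what that error-correction loop looks like, here is a minimal Python sketch of the retry pattern. It assumes an OpenAI-style chat client and a local SQLite copy of the target database; the model ID, prompt, and retry budget are illustrative placeholders rather than the exact ones used in Dubo-SQL v1.

```python
# Minimal sketch of the syntax-error retry loop (illustrative, not the production prompt).
# Assumes an OpenAI-style client and a local SQLite copy of the target database.
import sqlite3
from openai import OpenAI

client = OpenAI()
MODEL = "ft:gpt-3.5-turbo:your-org::placeholder"  # hypothetical fine-tuned model ID

def generate_sql(question: str, schema: str, db_path: str, max_retries: int = 3) -> str:
    messages = [
        {"role": "system", "content": "Return a single SQLite query that answers the question."},
        {"role": "user", "content": f"Schema:\n{schema}\n\nQuestion: {question}"},
    ]
    conn = sqlite3.connect(db_path)
    sql = ""
    for _ in range(max_retries):
        response = client.chat.completions.create(model=MODEL, messages=messages)
        sql = response.choices[0].message.content  # assumes the model returns bare SQL
        try:
            conn.execute(sql)  # check the query against the real database
            return sql
        except sqlite3.Error as err:
            # Feed the error message back to the LLM and ask for a corrected query.
            messages.append({"role": "assistant", "content": sql})
            messages.append({"role": "user", "content": f"That query failed with: {err}. Please fix it."})
    return sql  # fall back to the last attempt if retries are exhausted
```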

Dubo-SQL v2 achieves even higher performance on the BIRD-SQL dev set, using GPT-4 Turbo with a novel approach to retrieval-augmented generation (RAG). We use cosine similarity to match each dev-set question with similar questions from the training set, but we force the selected example questions to be diverse, even at the cost of lower similarity to the dev-set question, as sketched below.

[Figure: Dubo-SQL v2 benchmark results]
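
One way to implement this kind of diversity-aware retrieval is a greedy, MMR-style selection over precomputed question embeddings, sketched below. The embedding model, similarity threshold, and number of examples are illustrative assumptions; the exact selection rule for Dubo-SQL v2 is described in the paper.

```python
# Sketch of diversity-aware example retrieval: greedily pick training questions that are
# similar to the dev question but not near-duplicates of examples already chosen.
# The threshold and k are illustrative; embeddings are assumed precomputed and L2-normalized.
import numpy as np

def select_examples(query_emb: np.ndarray,
                    train_embs: np.ndarray,  # shape (n_train, d), rows L2-normalized
                    k: int = 8,
                    max_pairwise_sim: float = 0.9) -> list[int]:
    """Return indices of up to k diverse training examples, ranked by cosine similarity."""
    sims = train_embs @ query_emb  # cosine similarity of each training question to the dev question
    chosen: list[int] = []
    for idx in np.argsort(-sims):  # consider candidates from most to least similar
        # Skip candidates that are too close to an already selected example,
        # trading a little similarity for a more diverse set of few-shot examples.
        if all(float(train_embs[idx] @ train_embs[j]) < max_pairwise_sim for j in chosen):
            chosen.append(int(idx))
        if len(chosen) == k:
            break
    return chosen
```

The diversity check deliberately gives up some raw similarity in exchange for few-shot examples that cover a wider range of question and query patterns, which is the trade-off described above.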

Check out the paper and our GitHub repo for more details, and let us know what you think! The models we share here are greatly simplified versions of the Dubo production code. The production version of Dubo can handle databases and documentation that are orders of magnitude larger than those in BIRD and can carry on full conversations instead of a single round of Q&A.