lyq/nl2sql: Convert natural-language questions into SQL statements. @ 9a9c2a4c7f46a81ea195f88230cda802672f0086

Introduction

This baseline is developed and refined from SQLNet, a baseline model for the WikiSQL dataset.

The model decouples the task of generating a whole SQL query into several sub-tasks, including select-number, select-column, select-aggregation, condition-number, condition-column and so on.

A simple diagram of the model structure is shown here; for implementation details, refer to the original paper.

The difference between SQLNet and this baseline model is that the Select-Number and Where-Relationship sub-tasks are added to better fit this Chinese NL2SQL dataset.
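To make the sub-task decomposition concrete, here is a minimal sketch of how one structured query could be split into per-sub-task targets. The field names (`sel`, `agg`, `cond_conn_op`, `conds`) follow a WikiSQL-style layout and are assumptions for illustration, as are the function name `decompose` and the condition-operator and condition-value targets; the repository's own data loading code may differ.

```python
# Hypothetical query layout, modeled on WikiSQL-style annotations; the field
# names below are assumptions, not taken from this repository's code.
example_sql = {
    "sel": [2, 5],            # indices of the selected columns
    "agg": [0, 4],            # one aggregation id per selected column
    "cond_conn_op": 1,        # how the WHERE conditions are connected
    "conds": [[1, 2, "2019"], [3, 0, "100"]],  # [column, operator, value]
}


def decompose(sql):
    """Split one structured query into per-sub-task training targets."""
    conds = sql["conds"]
    return {
        "select_number": len(sql["sel"]),
        "select_column": list(sql["sel"]),
        "select_aggregation": list(sql["agg"]),
        "where_relationship": sql["cond_conn_op"],
        "condition_number": len(conds),
        "condition_column": [c[0] for c in conds],
        "condition_operator": [c[1] for c in conds],
        "condition_value": [c[2] for c in conds],
    }


targets = decompose(example_sql)
print(targets["select_number"], targets["condition_number"])
```

Each entry of the returned dictionary corresponds to one prediction head, so each sub-task can be trained and evaluated separately.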

Dependencies

Python 2.7
torch 1.0.1
tqdm

Start to train

First, download the provided datasets to ~/data_nl2sql/, including train.json, train.tables.json, dev.json, dev.tables.json and char_embedding.

mkdir ~/nl2sql
cd ~/nl2sql/
git clone https://github.com/ZhuiyiTechnology/nl2sql_baseline.git
cp ~/data_nl2sql/* ~/nl2sql/data
sh ./start_train.sh 0 128

Here the first argument (0) is the GPU number and the second argument (128) is the batch size.
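A missing data file is an easy way for a training run to fail early, so it may help to check the data directory before launching the script. The sketch below uses only the file names listed in the download step; the helper name `missing_data_files` is hypothetical, and `~/nl2sql/data` is simply the copy destination used in the commands above.

```python
import os

# File names are taken from the download step above; whether char_embedding
# is a file or a directory does not matter for this existence check.
REQUIRED = [
    "train.json",
    "train.tables.json",
    "dev.json",
    "dev.tables.json",
    "char_embedding",
]


def missing_data_files(data_dir):
    """Return the required names that are absent from data_dir."""
    data_dir = os.path.expanduser(data_dir)
    return [name for name in REQUIRED
            if not os.path.exists(os.path.join(data_dir, name))]


if __name__ == "__main__":
    missing = missing_data_files("~/nl2sql/data")
    if missing:
        print("Missing data files: " + ", ".join(missing))
    else:
        print("All data files found.")
```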

Start to evaluate

To evaluate on dev.json or test.json, make sure a trained model is ready, then run

sh ./start_test.sh 0 pred_example

Here the first argument (0) is the GPU number and the second argument is the output path for the predictions.

Experiment result

We ran the experiment several times, achieving an average logic form accuracy of 27.5% on the dev dataset.
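For readers unfamiliar with the metric, logic form accuracy counts a prediction as correct only when its whole SQL structure matches the gold query. The following is a rough sketch under assumptions: queries are stored as WikiSQL-style dictionaries (`sel`, `agg`, `cond_conn_op`, `conds`), and the order of conditions is ignored. The evaluation code used for the number above may apply different rules.

```python
def normalize(sql):
    """Canonical form of a query so that condition order does not matter."""
    conds = sorted((c[0], c[1], str(c[2])) for c in sql["conds"])
    selects = sorted(zip(sql["sel"], sql["agg"]))
    return (selects, sql["cond_conn_op"], conds)


def logic_form_accuracy(predictions, golds):
    """Fraction of predictions whose structure matches the gold query."""
    if not golds:
        return 0.0
    hits = sum(normalize(p) == normalize(g) for p, g in zip(predictions, golds))
    return float(hits) / len(golds)


# Toy data, invented for illustration only.
gold = [
    {"sel": [1], "agg": [0], "cond_conn_op": 1,
     "conds": [[2, 2, "a"], [3, 0, "5"]]},
    {"sel": [0], "agg": [5], "cond_conn_op": 0, "conds": [[1, 2, "x"]]},
]
pred = [
    # Same query as gold[0] with the conditions in a different order.
    {"sel": [1], "agg": [0], "cond_conn_op": 1,
     "conds": [[3, 0, "5"], [2, 2, "a"]]},
    # Wrong condition value.
    {"sel": [0], "agg": [5], "cond_conn_op": 0, "conds": [[1, 2, "y"]]},
]
accuracy = logic_form_accuracy(pred, gold)
print(accuracy)  # 0.5
```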

We found that the main challenges of this dataset include poor condition value prediction, select columns and condition columns that are not mentioned in the NL question, and inconsistent representation of condition relationships between the NL question and the SQL. None of these challenges can be solved by existing baseline and SOTA models.

Correspondingly, this baseline model achieves only 77% accuracy on condition column and 62% accuracy on condition value, even on the training set. Contestants should pay particular attention to these two sub-tasks.
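Per-component numbers such as these can be obtained by comparing one part of the query at a time. The sketch below is one possible way to do that, assuming each query is a dictionary whose `conds` entries are `[column, operator, value]` triples; the function names are hypothetical and the toy data is invented.

```python
def component_accuracy(predictions, golds, extract):
    """Share of examples where the extracted component matches exactly."""
    if not golds:
        return 0.0
    hits = sum(extract(p) == extract(g) for p, g in zip(predictions, golds))
    return float(hits) / len(golds)


def cond_columns(sql):
    """Condition columns of a query, order-insensitive."""
    return sorted(c[0] for c in sql["conds"])


def cond_values(sql):
    """Condition values of a query as strings, order-insensitive."""
    return sorted(str(c[2]) for c in sql["conds"])


# Toy data, invented for illustration only.
gold = [
    {"conds": [[1, 2, "2019"], [3, 0, "100"]]},
    {"conds": [[4, 2, "Beijing"]]},
]
pred = [
    {"conds": [[3, 0, "100"], [1, 2, "2019"]]},
    {"conds": [[4, 2, "Shanghai"]]},  # right column, wrong value
]
col_acc = component_accuracy(pred, gold, cond_columns)
val_acc = component_accuracy(pred, gold, cond_values)
print(col_acc, val_acc)  # 1.0 0.5
```

Reporting the components separately shows whether errors come from picking the wrong column or from extracting the wrong value for a correct column.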

Related resources:

https://github.com/salesforce/WikiSQL

https://yale-lily.github.io/spider

Semantic Parsing with Syntax- and Table-Aware SQL Generation
