SPARQL is a highly powerful query language for an ever-growing number of Linked Data resources and Knowledge Graphs. Using it requires a certain familiarity with the entities in the domain to be queried as well as expertise in the language's syntax and semantics, neither of which average human web users can be assumed to possess. To overcome this limitation, automatically translating natural language questions to SPARQL queries has been a vibrant field of research. However, to date, the vast success of deep learning methods has not yet been fully propagated to this research problem.
This paper contributes to filling this gap by evaluating the utilization of eight different Neural Machine Translation (NMT) models for the task of translating from natural language to the structured query language SPARQL. While highlighting the importance of high-quantity and high-quality datasets, the results show a dominance of a CNN-based architecture with a BLEU score of up to 98 and accuracy of up to 94%.
Title: Neural Machine Translating from Natural Language to SPARQL
Authors: Dr. Dagmar Gromann, Prof. Sebastian Rudolph and Xiaoyu Yin
PDF is available on arXiv: http://arxiv.org/abs/1906.09302
@article{DBLP:journals/corr/abs-1906-09302,
  author        = {Xiaoyu Yin and Dagmar Gromann and Sebastian Rudolph},
  title         = {Neural Machine Translating from Natural Language to {SPARQL}},
  journal       = {CoRR},
  volume        = {abs/1906.09302},
  year          = {2019},
  url           = {http://arxiv.org/abs/1906.09302},
  archivePrefix = {arXiv},
  eprint        = {1906.09302},
  timestamp     = {Thu, 27 Jun 2019 18:54:51 +0200},
  biburl        = {https://dblp.org/rec/journals/corr/abs-1906-09302.bib},
  bibsource     = {dblp computer science bibliography, https://dblp.org}
}
Title: Translating Natural Language to SPARQL
Author: Xiaoyu Yin
Supervisors: Dr. Dagmar Gromann, Dr. Dmitrij Schlesinger
The thesis was completed on 8th January 2019 and has since been turned into the paper linked above.
Find the thesis in the thesis folder and the defense slides in the presentation folder, both available in .tex and .pdf format.
The files ending with *.en (e.g. dev.en, train.en, test.en) contain English sentences; *.sparql files contain the corresponding SPARQL queries. Files sharing the same prefix have a line-by-line 1-1 mapping and were used in training as English-SPARQL pairs. vocab.* or dict.* files are vocabulary files. fairseq has its own special requirements on input files, so the aforementioned files were not used by it directly but were processed into binary format, stored in the /fairseq-data-bin folder of each dataset.
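For illustration, here is a minimal Python sketch that loads one split and checks the 1-1 mapping; the function name load_pairs and the paths are my own assumptions, not part of the repository:

```python
from pathlib import Path

def load_pairs(prefix, data_dir="."):
    """Load one English-SPARQL split, e.g. prefix='dev', 'train', or 'test'."""
    en_lines = Path(data_dir, f"{prefix}.en").read_text(encoding="utf-8").splitlines()
    sparql_lines = Path(data_dir, f"{prefix}.sparql").read_text(encoding="utf-8").splitlines()
    # The 1-1 mapping means both files must have the same number of lines.
    assert len(en_lines) == len(sparql_lines), "parallel files out of sync"
    return list(zip(en_lines, sparql_lines))

pairs = load_pairs("dev")
print(pairs[0])  # (English question, corresponding SPARQL query)
```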
The datasets used in this paper were originally downloaded from the Internet. I downloaded them and split them in the way I needed to train the models. The sources are listed as follows:
- Neural SPARQL Machines Monument dataset
- LC-QUAD (v2.0 has been released, but we used v1.0)
- DBpedia Neural Question Answering (DBNQA) dataset
Details of the datasets and splits can be found in the paper.
We kept the inference translations of each model on each dataset, which were used to generate the BLEU scores, accuracy figures, and corresponding graphs in the sections below. The results are saved as dev_output.txt (validation set) and test_output.txt (test set) and are available here (compact version). A full version containing the raw output of the frameworks is also available.
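As a rough illustration of the scoring step, corpus-level BLEU over such an output file can be computed with NLTK. This is only a sketch under the assumption that each line of test_output.txt holds one whitespace-tokenized generated query and test.sparql holds the aligned references; the paper's exact evaluation setup may differ:

```python
from nltk.translate.bleu_score import corpus_bleu

# Assumed layout: one whitespace-tokenized query per line, files aligned line-by-line.
with open("test_output.txt", encoding="utf-8") as f:
    hypotheses = [line.split() for line in f]
with open("test.sparql", encoding="utf-8") as f:
    references = [[line.split()] for line in f]  # one reference per hypothesis

print(f"BLEU: {corpus_bleu(references, hypotheses) * 100:.1f}")
```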
Plots of training perplexity for each model and dataset are available in a separate PDF here.
Table of BLEU scores for all models on the validation and test sets
Table of accuracy (in %) of syntactically correct generated SPARQL queries and F1 scores
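Syntactic correctness of a generated query can be checked, for instance, by attempting to parse it. The sketch below uses rdflib's SPARQL parser, which is my own assumption rather than the checker used in the paper, and presumes the queries have already been decoded from any dataset-specific token encoding back into plain SPARQL:

```python
from rdflib.plugins.sparql import prepareQuery

def is_syntactically_correct(query: str) -> bool:
    """Return True if the SPARQL query parses without error."""
    try:
        prepareQuery(query)
        return True
    except Exception:
        return False

with open("test_output.txt", encoding="utf-8") as f:
    queries = [line.strip() for line in f]

correct = sum(is_syntactically_correct(q) for q in queries)
print(f"Syntactic accuracy: {100 * correct / len(queries):.1f}%")
```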
Please find more results and detailed explanations in the research paper and the thesis.
Because some models were very space-consuming after training (esp. GNMT4, GNMT8) on some specific datasets (esp. DBNQA), I didn't download all of the models from the HPC server. This is an overview of the availability of the trained models on my drive:
| Model | Monument | Monument80 | Monument50 | LC-QUAD | DBNQA |
|---|---|---|---|---|---|
| NSpM | yes | yes | yes | yes | yes |
| NSpM+Att1 | yes | yes | yes | yes | yes |
| NSpM+Att2 | yes | yes | yes | yes | yes |
| GNMT4 | no | yes | no | no | no |
| GNMT8 | no | no | no | no | no |
| LSTM_Luong | yes | yes | yes | yes | no |
| ConvS2S | yes | yes | yes | yes | no |
| Transformer | yes | yes | yes | yes | no |
This paper and thesis could not have been completed without the help of my supervisors (Dr. Dagmar Gromann, Dr. Dmitrij Schlesinger, and Prof. Sebastian Rudolph) and those great open source projects. My sincere appreciation goes to all of the people working on this subject; I hope we will show the world its value in the near future.
By the way, I now work as an Android developer. I still have a passion for AI and may learn more, and perhaps even pursue a career in it in the future, but my current focus is on software engineering. I enjoy any kind of experience or knowledge sharing and would love to make new friends! Connect with me on LinkedIn.