---
pretty_name: KQA-Pro
---
- Homepage: http://thukeg.gitee.io/kqa-pro/
- Repository: https://github.com/shijx12/KQAPro_Baselines
- Paper: KQA Pro: A Dataset with Explicit Compositional Programs for Complex Question Answering over Knowledge Base
- Leaderboard: http://thukeg.gitee.io/kqa-pro/leaderboard.html
- Point of Contact: shijx12 at gmail dot com
KQA Pro is a large-scale dataset for complex question answering over a knowledge base. The questions are diverse and challenging, requiring multiple reasoning capabilities including compositional reasoning, multi-hop reasoning, quantitative comparison, and set operations. Strong supervision in the form of SPARQL queries and programs is provided for each question.
It supports knowledge-graph-based question answering. Specifically, it provides a SPARQL query and a program for each question.
English
train.json/val.json:

```
[
  {
    'question': str,
    'sparql': str,              # executable in our Virtuoso engine
    'program':
    [
      {
        'function': str,        # function name
        'dependencies': [int],  # functional inputs: indices of the preceding functions
        'inputs': [str],        # textual inputs
      }
    ],
    'choices': [str],           # 10 answer choices
    'answer': str,              # golden answer
  }
]
```
test.json:

```
[
  {
    'question': str,
    'choices': [str],           # 10 answer choices
  }
]
```
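For illustration, a record in this format can be loaded and sanity-checked in Python; the field values below are made up, and only the schema follows the description above:

```python
import json

# A hypothetical train/val record following the schema above
record_json = """
{
  "question": "When was Tron: Legacy published?",
  "sparql": "SELECT ?e WHERE { ... }",
  "program": [
    {"function": "Find", "dependencies": [], "inputs": ["Tron: Legacy"]},
    {"function": "QueryAttr", "dependencies": [0], "inputs": ["publication date"]}
  ],
  "choices": ["Tron: Legacy", "The Queen"],
  "answer": "Tron: Legacy"
}
"""

record = json.loads(record_json)

# 'dependencies' must reference earlier steps only
for i, step in enumerate(record["program"]):
    assert all(d < i for d in step["dependencies"])

print(record["answer"])
```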
This dataset has two configs, train_val and test, because they have different available fields. Specify the config when loading, e.g. `load_dataset('drt/kqa_pro', 'train_val')`.
train, val, test
You can find the knowledge graph file kb.json in the original GitHub repository. It has the following format:
```
{
  'concepts':
  {
    '<id>':
    {
      'name': str,
      'instanceOf': ['<id>', '<id>'],  # ids of parent concepts
    }
  },
  'entities':                          # excluding concepts
  {
    '<id>':
    {
      'name': str,
      'instanceOf': ['<id>', '<id>'],  # ids of parent concepts
      'attributes':
      [
        {
          'key': str,                  # attribute key
          'value':                     # attribute value
          {
            'type': 'string'/'quantity'/'date'/'year',
            'value': float/int/str,    # float or int for quantity, int for year, 'yyyy/mm/dd' for date
            'unit': str,               # for quantity
          },
          'qualifiers':
          {
            '<qk>':                    # qualifier key; one key may have multiple qualifier values
            [
              {
                'type': 'string'/'quantity'/'date'/'year',
                'value': float/int/str,
                'unit': str,
              },                       # qualifier values have the same format as attribute values
            ]
          }
        },
      ],
      'relations':
      [
        {
          'predicate': str,
          'object': '<id>',            # NOTE: it may be a concept id
          'direction': 'forward'/'backward',
          'qualifiers':
          {
            '<qk>':                    # qualifier key; one key may have multiple qualifier values
            [
              {
                'type': 'string'/'quantity'/'date'/'year',
                'value': float/int/str,
                'unit': str,
              },                       # qualifier values have the same format as attribute values
            ]
          }
        },
      ]
    }
  }
}
```
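As an illustration of this structure, a small helper can look up an entity's attribute value; the ids and values in this sketch are hypothetical, only the nesting follows the schema above:

```python
# Hypothetical kb.json fragment following the schema above
kb = {
    "concepts": {
        "c1": {"name": "film", "instanceOf": []},
    },
    "entities": {
        "e1": {
            "name": "Tron: Legacy",
            "instanceOf": ["c1"],
            "attributes": [
                {
                    "key": "publication date",
                    "value": {"type": "date", "value": "2010/12/17"},
                    "qualifiers": {},
                }
            ],
            "relations": [],
        }
    },
}

def get_attribute(entity_id, key):
    """Return the first matching attribute value dict, or None."""
    for attr in kb["entities"][entity_id]["attributes"]:
        if attr["key"] == key:
            return attr["value"]
    return None

value = get_attribute("e1", "publication date")
print(value["value"])  # 2010/12/17
```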
We implement multiple baselines in our codebase, including a supervised SPARQL parser and a program parser.
For the SPARQL parser, we implement a query engine based on Virtuoso. You can install the engine following our instructions and then feed it your predicted SPARQL queries to get answers.
For the program parser, we implement a rule-based program executor, which receives a predicted program and returns the answer. Detailed descriptions of our functions can be found in our paper.
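To show how the program format is consumed, here is a minimal toy interpreter over a tiny in-memory knowledge base. This is not the official executor; the `Find` and `QueryAttr` implementations below are simplified stand-ins for the functions described in the paper, and the data is made up:

```python
# Toy in-memory KB: entity name -> attribute key -> value
KB = {"Tron: Legacy": {"publication date": "2010/12/17"}}

def find(deps, inputs):
    # Locate an entity by name (textual input)
    return KB[inputs[0]]

def query_attr(deps, inputs):
    # Read an attribute of the entity produced by the dependency step
    return deps[0][inputs[0]]

FUNCTIONS = {"Find": find, "QueryAttr": query_attr}

def execute(program):
    # Each step consumes the outputs of the earlier steps listed in
    # 'dependencies'; the last step's output is the answer.
    results = []
    for step in program:
        deps = [results[i] for i in step["dependencies"]]
        results.append(FUNCTIONS[step["function"]](deps, step["inputs"]))
    return results[-1]

program = [
    {"function": "Find", "dependencies": [], "inputs": ["Tron: Legacy"]},
    {"function": "QueryAttr", "dependencies": [0], "inputs": ["publication date"]},
]
print(execute(program))  # 2010/12/17
```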
You need to predict answers for all questions in the test set and write them to a text file in order, one per line. Here is an example:

```
Tron: Legacy
Palm Beach County
1937-03-01
The Queen
...
```
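A prediction file in this format can be written as follows; the file name and the answers here are just placeholders:

```python
# Write one predicted answer per line, in test-set order, as the
# submission format above requires.
predictions = ["Tron: Legacy", "Palm Beach County", "1937-03-01", "The Queen"]

with open("predict.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(predictions) + "\n")
```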
Then send the prediction file to us by email at [email protected]; we will reply with your results as soon as possible. To appear on the leaderboard, you also need to provide the following information:
- model name
- affiliation
- open-ended or multiple-choice
- whether your model uses SPARQL supervision
- whether your model uses program supervision
- single model or ensemble model
- (optional) paper link
- (optional) code link
MIT License
If you find our dataset helpful in your work, please cite us:
```
@inproceedings{KQAPro,
  title={{KQA P}ro: A Large Diagnostic Dataset for Complex Question Answering over Knowledge Base},
  author={Cao, Shulin and Shi, Jiaxin and Pan, Liangming and Nie, Lunyiu and Xiang, Yutong and Hou, Lei and Li, Juanzi and He, Bin and Zhang, Hanwang},
  booktitle={ACL'22},
  year={2022}
}
```
Thanks to @happen2me for adding this dataset.