source: https://arxiv.org/html/2412.18702v1 by Yanlin Feng, Simone Papicchio, Sajjadur Rahman (Megagon Labs, Politecnico di Torino)
- Abstract
- 1 Introduction
- 2 Knowledge Graph Modeling in the LLM Era
- 3 Transforming RDF to Property Graphs
- 4 Constructing Questions
- 5 Evaluation Metrics
- 6 Experiments
- 7 Related Work
- 8 Conclusion
- Appendix A Additional Technical Details
- Appendix B Additional CypherBench Statistics
Retrieval from Graph Data
- Crucial for augmenting large language models (LLMs) with:
- Open-domain knowledge
- Private enterprise data
- Key component in the recent GraphRAG system
Challenges with Modern RDF Knowledge Graphs:
- Overly large schemas exceed typical LLM context window
- Use of resource identifiers
- Overlapping relation types
- Lack of normalization
Proposed Solution:
- Property graph views on top of underlying RDF graph
- Can be efficiently queried by LLMs using Cypher
Implementation:
- Instantiated on Wikidata
- Introduced CypherBench:
- First benchmark with 11 large-scale, multi-domain property graphs
- 7.8 million entities and over 10,000 questions
Implementation Challenges:
- Developing an RDF-to-property graph conversion engine
- Creating a systematic pipeline for text-to-Cypher task generation
- Designing new evaluation metrics
Graphs for Storing Entity-Relation Data:
- Widely used in large-scale encyclopedic knowledge and domain-specific enterprise data storage
- Enable efficient processing of complex multi-hop aggregation queries
- Provide more compact representation of knowledge compared to raw textual documents
- Advantages have motivated research on knowledge graphs and KBQA
Challenges in Retrieval from Modern RDF Knowledge Graphs:
- Difficult for LLMs unlike success with relational databases
- Previous studies focused on simplified settings, limiting practical application
- Leading LLM frameworks prioritize retrieval from property graphs instead
Root Cause and Proposed Solution:
- RDF knowledge graphs have complex schemas and relationships
- Transforming them into multiple smaller property graphs simplifies retrieval
- Each graph functions as a domain-specific view, enabling efficient querying by LLMs
- Unified query language for both RDF graphs and property graph databases
CypherBench:
- Collection of 11 property graphs transformed from Wikidata
- Contains complete entities and relations from Wikidata that conform to a domain-specific schema
- Includes over 10,000 natural language questions for evaluation
- Global queries, multi-hop queries, temporal queries, and aggregation queries covered
- Poses significant challenges for current LLMs
Main Contributions:
- Novel methodology to enable efficient text-to-Cypher retrieval over modern RDF knowledge graphs
- Collection of 11 large-scale property graphs for future research and knowledge source deployment
- RDF-to-property-graph transformation engine for Wikidata
- Text-to-Cypher / KBQA benchmark with over 10,000 instances covering various query types
- Automatic text-to-Cypher task generation pipeline
- Set of related tools including a graph deployment Docker image, evaluation scripts, and graph visualization tools.
Knowledge Graphs: RDF vs Property Graph
Definition of Knowledge Graph:
- List of relations (triples): subject entity, relation type, object entity
- Example: ("LeBron James", playsFor, "LA Lakers")
- Relation types are also called properties or predicates
RDF:
- Entities stored using Internationalized Resource Identifiers (IRIs)
- Entity properties stored as relations: subject = entity IRI, object = literal value
- Reification: to attach properties to a relation, the relation is represented as a special statement entity that is linked to its subject, object, and relation properties via additional relations
- Popular knowledge graphs based on RDF: Wikidata, Freebase, DBpedia, queried using SPARQL
Property Graph:
- Entities and relations treated as objects
- Each object can be assigned types and have associated properties
- Access entities directly using names
- Property graph databases popular in industry: Neo4j
- No reification process used.
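To make the contrast concrete, here is a minimal sketch of how the LeBron James fact might look under each model (Python pseudodata; the identifiers and property names are illustrative placeholders, not taken from Wikidata or the benchmark):

```python
# Sketch: the same fact in RDF (triples plus a reified statement) vs. a property graph.
# All identifiers below are illustrative placeholders.

# RDF: entities are opaque IRIs; attaching a property (e.g., a start time) to the
# playsFor relation requires reifying it as a dedicated statement node.
rdf_triples = [
    ("wd:Q_LEBRON", "wdt:P_PLAYS_FOR", "wd:Q_LAKERS"),
    ("wd:Q_LEBRON", "p:P_PLAYS_FOR", "wds:STATEMENT_1"),   # link to reified statement
    ("wds:STATEMENT_1", "ps:P_PLAYS_FOR", "wd:Q_LAKERS"),
    ("wds:STATEMENT_1", "pq:P_START_TIME", "2018"),        # qualifier on the relation
]

# Property graph: entities and relations are typed objects; the start year sits
# directly on the edge, and entities are referenced by name instead of by IRI.
nodes = [
    {"name": "LeBron James", "type": "Player"},
    {"name": "LA Lakers", "type": "Team"},
]
edges = [
    {"source": "LeBron James", "target": "LA Lakers",
     "type": "playsFor", "properties": {"start_year": 2018}},
]
```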
Retrieval over Modern Encyclopedic Knowledge Graphs:
Challenges:
- Overly large schemas: Modern knowledge graphs cover entities and relations across all domains, resulting in a massive schema that exceeds context window sizes of LLMs.
- Wikidata: Over 4 million entity types, 12,000 relation types
- RDF graphs allow arbitrary entities for subjects/objects
- Use of resource identifiers: SPARQL queries require identifiers obtained via external linkers.
- Makes queries less readable
- Semantically overlapping relation types: Wikidata contains multiple relation types with similar meanings, causing confusion during retrieval.
- Example: Six relation types for starting times of an entity
- Lack of normalization: RDF does not enforce type constraints or standardized units on values.
- Resulting in incorrect types and varied units, leading to incorrect results during aggregation.
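As an illustration of the identifier problem (a sketch with a hand-picked example; the property graph schema labels are assumptions), the same question needs opaque identifiers in SPARQL over Wikidata but only human-readable names in Cypher over a property graph view:

```python
# "Which team does LeBron James play for?" over raw Wikidata vs. a property graph view.

# SPARQL requires resolving QIDs/PIDs first (Q36159 = LeBron James, P54 = member of
# sports team), typically via an external entity/relation linker.
sparql_query = """
SELECT ?teamLabel WHERE {
  wd:Q36159 wdt:P54 ?team .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
"""

# Cypher over a domain-specific property graph view uses names and schema labels directly.
cypher_query = """
MATCH (p:Player {name: 'LeBron James'})-[:playsFor]->(t:Team)
RETURN t.name
"""
```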
Challenges in Knowledge Base Question Answering (KBQA)
- Graph retrieval required: answering questions using graphs instead of traditional databases or text corpora
- Simplified settings used: many studies focus on simplified settings, assuming entity and relation identifiers are provided
- Reduces task to retrieval over a small local subgraph
- Custom-designed intermediate logical forms: lack support for certain graph querying functionalities
- Examples: querying relation properties, grouping, variable-length path matching
- Limited capabilities of existing KBQA approaches
- Struggle with queries involving:
- Relation properties (time-sensitive queries)
- Global queries without named entities
- Complex aggregations over a large number of entities
Transforming RDF Graph into Property Graphs
- Proposed solution: multiple domain-specific property graphs and Cypher for querying (Figure 2)
- Choice of Cypher: widespread adoption, efficient querying, and compatibility with LLM frameworks
- Property graphs serve as views on top of original RDF graph
- Each property graph contains data conforming to its schema
- Views can be updated when underlying RDF data changes
- Scalability without complex schemas or ambiguous relation types
- Transformation layer manages datatype conversion and unit standardization
- Efficient process: few seconds for small graphs (fewer than 10,000 entities)
Wikidata Application
- Demonstration of idea on Wikidata, the largest knowledge graph today
- Direct prompting baseline using gpt-4o achieves reasonable performance
- No external linkers or retrievers used.
CypherBench Construction Process (Figure 2)
- RDF data transformed into schema-enforced property graphs
- Efficient and accurate text-to-Cypher querying enabled
- Property graphs used to generate text-to-Cypher tasks.
We transform RDF data into property graphs as the first step in building CypherBench. We chose Wikidata because it is the largest and most actively updated knowledge graph, containing 114 million entities and receiving 270 million edits from over 42,000 editors each month.
Domain and Property Graph Schema Curation
Selecting a Domain:
- Identify entity and relation types and their properties relevant to the domain
- Explore Wikidata to find corresponding QIDs (entity types) and PIDs (relation types and properties)
Schema Development:
- Each property is assigned a datatype
- Properties representing quantities are given a unit, indicated in the property label (e.g., runtime_minute)
- Time investment: approximately 4 hours per graph on average
Example Schema:
- Figure 3
- Complete schema available in Appendix B
RDF-to-Property-Graph Engine Functions:
- Issues SPARQL queries to Wikidata using curated schema identifiers
- Fetches all entities and relations conforming to the schema:
- Film (Q11424) instances as subjects of award received (P166) relations
- Award (Q618779) instances as objects of these relations
- Converts fetched data into receivesAward relations in target property graph
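A sketch of the kind of SPARQL query the engine might issue for this example (the query shape is an assumption; the engine's actual queries also handle ranks, qualifiers, and datatypes):

```python
# Sketch: fetch all (film, award) pairs that conform to the curated schema; each
# result row becomes a receivesAward edge in the property graph. The exact query
# issued by the transformation engine is an assumption here.
FETCH_RECEIVES_AWARD = """
SELECT ?film ?filmLabel ?award ?awardLabel WHERE {
  ?film wdt:P31 wd:Q11424 .                 # instance of: film
  ?film wdt:P166 ?award .                   # award received
  ?award wdt:P31/wdt:P279* wd:Q618779 .     # award (or a subclass of it)
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
"""
```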
Engine Functionalities:
- Datatype conversion: enforces type constraints, converts values to str, int, float, date, list[str]
- Date conversion: retrieves precision using wikibase:timePrecision and keeps only fine-grained dates
- Unit standardization: enforces standardized units (e.g., centimeters) on property values representing quantities
- Rank filtering: fetches preferred relations for time-sensitive properties, normal or deprecated ones with additional properties
- Selective entity fetching: limits fetching to entities linked to certain relations to avoid out-of-memory issues
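A minimal sketch of the datatype-conversion and unit-standardization functionalities listed above (the helper name, unit table, and example property are illustrative assumptions, not the engine's implementation):

```python
from datetime import date

# Sketch of datatype conversion and unit standardization; the property names,
# target units, and helper structure below are illustrative assumptions.
UNIT_TO_CM = {"centimetre": 1.0, "metre": 100.0, "inch": 2.54}

def normalize_value(value, datatype: str, source_unit: str | None = None):
    """Coerce a fetched Wikidata value to the schema's datatype (and unit, if any)."""
    if datatype == "int":
        return int(value)
    if datatype == "float":
        number = float(value)
        if source_unit is not None:            # e.g., a height_cm property expects centimetres
            number *= UNIT_TO_CM[source_unit]
        return number
    if datatype == "date":
        return date.fromisoformat(value[:10])  # keep only day-precision dates
    if datatype == "list[str]":
        return [str(v) for v in value]
    return str(value)

normalize_value("2.06", "float", source_unit="metre")  # -> 206.0 (centimetres)
```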
Execution:
- SPARQL queries executed against local Wikidata endpoint
- Results aggregated into final property graph
- Transformation time ranges from seconds to hours, depending on graph size
Format and Deployment:
- Property graph stored in DBMS-independent JSON format
- Custom Neo4j Docker image initialized with data from JSON file upon startup.
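A sketch of how such a JSON dump could be materialized into Neo4j (the JSON field names and loading logic here are assumptions; the benchmark's Docker image performs this automatically on startup):

```python
import json
from neo4j import GraphDatabase

# Sketch: load a DBMS-independent JSON graph dump into Neo4j.
# The field names ("nodes", "edges", "label", ...) are assumptions,
# not the benchmark's actual JSON format.
def load_graph(json_path: str, uri: str = "bolt://localhost:7687",
               auth: tuple = ("neo4j", "password")) -> None:
    with open(json_path) as f:
        graph = json.load(f)
    driver = GraphDatabase.driver(uri, auth=auth)
    with driver.session() as session:
        for node in graph["nodes"]:
            # Labels cannot be parameterized in Cypher, hence the f-string.
            session.run(f"CREATE (n:{node['label']}) SET n = $props",
                        props=node["properties"])
        for edge in graph["edges"]:
            session.run(
                f"""MATCH (a {{name: $src}}), (b {{name: $dst}})
                    CREATE (a)-[r:{edge['type']}]->(b) SET r = $props""",
                src=edge["source"], dst=edge["target"],
                props=edge.get("properties", {}),
            )
    driver.close()
```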
Text-to-Cypher Task Generation Pipeline
Step 1: Generating Initial (question, Cypher) pairs:
- Input: graph schema and natural language question
- Output: executable Cypher query that returns the desired answer
Patterns:
- Basic MATCH pattern: queries for entities of a particular type
- Special MATCH pattern: queries for relationships between entities
- RETURN template: specifies the answer entity or entities to be returned
Nodes:
- Square nodes: all entities of a particular type
- Circular nodes: named entities
- Optional edges and nodes: denoted by dashed lines
Step 2: Rewriting Questions:
- Use a LLM to make questions sound more natural
Cypher Query Structure:
A Cypher query starts with a MATCH clause that identifies matching subgraphs. The subsequent clauses (e.g., WHERE, WITH, ORDER BY, RETURN) transform the matched data to produce the final result.
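For instance, a hand-written illustration of how such clauses compose (not an actual benchmark instance; the Player/Team schema labels are assumed):

```python
# Illustrative (question, Cypher) pair; the schema labels are assumptions.
question = "Which teams have more than one player taller than 210 cm?"
cypher = """
MATCH (p:Player)-[:playsFor]->(t:Team)   // MATCH locates the relevant subgraph
WHERE p.height_cm > 210                  // WHERE filters the matched rows
WITH t, count(p) AS tall_players         // WITH groups and aggregates
WHERE tall_players > 1
RETURN t.name                            // RETURN produces the final answer table
"""
```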
Graph Retrieval: Locating Relevant Subgraphs
- Core task: locate subgraph relevant to query
- Balanced distribution of graph matching patterns
- Template-based generation approach instead of crowd-sourcing
- Graph patterns categorized by isomorphism structure
- Seven basic graph patterns, each with a single answer node and at most two edges
- Five special graph patterns: comparison, grouping, optional matching, time-sensitive queries, union
Comparison with Existing Benchmarks
- Most KBQA benchmarks overlook global queries (GraphRAG targets)
- Global queries: no specific named entities
- Listing queries like "List the names of all teams"
- Complex queries like "Unique countries of citizenship for individuals in same movie"
- Answers depend on large number of documents and cannot be easily handled by standard RAG approaches.
Text-to-Cypher Question Generation
MATCH Clause Instantiation:
- Create multiple Cypher MATCH clause templates by enumerating edge directions
- Pair each MATCH template with a human-written question template
- Instantiate the MATCH clauses by sampling:
- Entity types for node variables (n, m0)
- Relation types for edge variables (r0)
- Entity names for named nodes (m0)
- Execute a special Cypher query to ensure non-empty results
RETURN Clause Instantiation:
- Pair each instantiated MATCH clause in special categories with dedicated RETURN clause templates
- Pair each instantiated MATCH clause in the basic categories with one of 6 RETURN clause templates:
- NAME, PROPERTY
- SORT
- WHERE
- AGGREGATE, ARGMAX
- Instantiate the RETURN clauses by sampling properties, ranking orders, comparison operators, and aggregate functions
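A schematic sketch of the instantiation step (the template strings and sampled schema elements below are invented for illustration and are not the benchmark's actual templates):

```python
# Sketch: instantiate a MATCH template and a RETURN template into a (question, Cypher) pair.
match_template = "MATCH (n:{N})-[r0:{R0}]->(m0:{M0} {{name: '{NAME}'}})"
question_template = "{N}s that {R0} {NAME}"

sampled = {"N": "Player", "R0": "playsFor", "M0": "Team", "NAME": "LA Lakers"}
match_clause = match_template.format(**sampled)

return_templates = {
    "NAME": "RETURN n.name",
    "AGGREGATE": "RETURN count(n)",
    "SORT": "RETURN n.name ORDER BY n.{prop}",
}
return_clause = return_templates["AGGREGATE"]

cypher = f"{match_clause} {return_clause}"
question = f"How many {question_template.format(**sampled)} are there?"
# cypher   -> MATCH (n:Player)-[r0:playsFor]->(m0:Team {name: 'LA Lakers'}) RETURN count(n)
# question -> "How many Players that playsFor LA Lakers are there?"  (later rewritten by an LLM)
```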
Design Choices:
- Ensure Cypher always returns literal values instead of node objects
- Allows benchmark to be used for evaluating non-Cypher graph retrieval methods
Data Splits:
- Questions split into training and test sets by domain
- 4 graphs allocated for training, 7 for testing
- Remove queries producing more than 10^5 records or taking more than 30 seconds to execute
- Demonstrate diverse and balanced distribution of patterns, templates, domains, and answer lengths
Detecting Semantically Unrealistic Questions:
- Blind sampling can result in semantically unrealistic questions
- Address this systematically by modeling:
- Cardinality: one-to-one, one-to-many, many-to-one, many-to-many
- Participation: total or partial (always/not always associated with the relationship)
- Entailment: a relationship implying another
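A sketch of how these semantic properties could be encoded as checks during sampling (the metadata values and rules below are illustrative assumptions, not the paper's exact implementation):

```python
# Sketch: use relationship metadata to filter semantically unrealistic templates.
RELATION_META = {
    "playsFor": {"cardinality": "many-to-one", "participation": "total"},
    "hasCaptain": {"cardinality": "one-to-one", "participation": "partial"},
}

def count_question_is_realistic(relation: str) -> bool:
    """'How many Xs does Y have?' is unrealistic for one-to-one relations,
    since the answer is trivially at most 1."""
    return RELATION_META[relation]["cardinality"] != "one-to-one"

def needs_optional_match(relation: str) -> bool:
    """Partial participation means some entities lack the relation, so templates
    counting per-entity neighbors should use OPTIONAL MATCH to keep those entities."""
    return RELATION_META[relation]["participation"] == "partial"
```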
We use LLMs to rewrite template-generated questions into natural-sounding ones, while preserving entity names and values. This design choice allows us to evaluate LLMs directly without external linkers or retrievers. However, we observe that LLMs sometimes alter question meanings during rewriting. To address this, we implement three rounds of verification and revision using LLMs, followed by author inspection.
Text-to-SQL Evaluation Metric: Execution Accuracy
Definition:
- Measures if results returned by predicted query match those of ground truth query
- Adapted from text-to-SQL literature for Cypher
- Borrows implementation from Spider leaderboard
Execution Results Comparison:
- Identical tables: one can be transformed into the other through row and column permutations
- Applicable to both Cypher and SQL queries
- Objects in query results are serialized for comparison
Calculation:
- Let V and V̂ denote the execution results of the ground-truth Cypher query q and the predicted Cypher query q̂
- EX(q, q̂) = 1[V = V̂], i.e., 1 if the two result tables are equivalent and 0 otherwise
- Obtain the final metric by averaging EX across all instances
Implementation:
- Adapted from Spider leaderboard at https://github.com/taoyds/test-suite-sql-eval
Notes:
- Cypher supports objects (lists and maps) in query results
- Execution accuracy can be applied to non-Cypher graph retrieval approaches as long as they return results in tabular format.
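A minimal sketch of the result-table comparison behind execution accuracy (simplified relative to the Spider-derived implementation; it assumes both results have already been serialized to lists of rows):

```python
from collections import Counter
from itertools import permutations

# Sketch: two result tables match if one can be transformed into the other by
# permuting rows and columns. Simplified relative to the Spider-derived scripts.
def tables_match(gold: list[tuple], pred: list[tuple]) -> bool:
    if not gold and not pred:
        return True
    if not gold or not pred or len(gold[0]) != len(pred[0]):
        return False
    gold_rows = Counter(tuple(str(v) for v in row) for row in gold)   # row order ignored
    for perm in permutations(range(len(pred[0]))):                    # try every column order
        pred_rows = Counter(tuple(str(row[i]) for i in perm) for row in pred)
        if pred_rows == gold_rows:
            return True
    return False

def execution_accuracy(per_instance_matches: list[bool]) -> float:
    """EX averaged over all benchmark instances."""
    return sum(per_instance_matches) / len(per_instance_matches)
```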
Graph Retrieval and Provenance Subgraph Jaccard Similarity (PSJS)
Core Task of Graph Retrieval:
- Locate relevant subgraph using MATCH clause
- LLM may generate incorrect or partially correct MATCH clauses
- Resulting errors lead to zero execution accuracy
Proposed Measure for Subgraph Matching Performance: Provenance Subgraph Jaccard Similarity (PSJS)
- Definition:
- PSJS calculated as Jaccard similarity between provenance subgraphs of ground truth and predicted Cypher
- Provenance Subgraph:
- Obtained by pairing MATCH clause with RETURN *
- Includes entities/nodes matched by the MATCH clause
- Example:
- For "Q18. What is the average longest lifespan of taxa that feed on Leporidae?":
- Provenance subgraph includes Leporidae and all taxa feeding on it
- For "Q18. What is the average longest lifespan of taxa that feed on Leporidae?":
- PSJS Calculation:
- PSJS(q, q̂) = |G ∩ Ĝ| / |G ∪ Ĝ|
- G and Ĝ are the provenance subgraphs of the ground-truth and predicted Cypher queries, respectively
- Impact of Incorrect MATCH Clauses:
- Receives zero execution accuracy but non-zero PSJS score if some entities/nodes are correctly retrieved.
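A sketch of the PSJS computation, assuming each provenance subgraph has been collected as a set of graph element identifiers by running the query's MATCH clause with RETURN * (how elements are identified is an assumption here):

```python
# Sketch: PSJS as the Jaccard similarity between two provenance subgraphs,
# each represented as a set of graph element identifiers.
def psjs(gold_elements: set, pred_elements: set) -> float:
    if not gold_elements and not pred_elements:
        return 1.0
    overlap = gold_elements & pred_elements
    return len(overlap) / len(gold_elements | pred_elements)

# A partially correct MATCH clause gets EX = 0 but a non-zero PSJS,
# because part of the relevant subgraph is still retrieved.
psjs({"Leporidae", "Lynx lynx", "Vulpes vulpes"}, {"Leporidae", "Lynx lynx"})  # -> 0.666...
```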
Evaluation of Zero-Shot Text-to-Cypher Performance
- State-of-the-art LLMs tested on CypherBench: gpt-4o, gpt-4o-mini, gemini1.5, claude3.5-sonnet, and yi-large
- No use of training set for this study
- Models prompted with question, graph schema, and instruction (Appendix A.3)
- Cost per run: approximately $5.5 for gpt-4o and $0.3 for gpt-4o-mini
- Execution on Neo4j using parallelization and timeout for metrics calculation
Performance Metrics:
- Zero-shot execution accuracy (EX)
- Provenance subgraph Jaccard similarity (PSJS)
- Executable percentage (Exec.)
Models Used:
- gpt-4o and gpt-4o-mini: OpenAI's API
- gemini1.5 and claude3.5-sonnet: Google Cloud Vertex AI
- yi-large: Fireworks AI API
CypherBench Evaluation Results
- claude3.5-sonnet: Execution accuracy of 61.58%, PSJS of 80.85%
- gpt-4o: Slightly worse than claude3.5-sonnet
- Best performing open-source model: 41.87% execution accuracy
- Models with fewer than 10B parameters: Less than 20% execution accuracy
- Difficulty of CypherBench: Highlights the challenges in graph matching, not just formatting errors
- Differentiating LLM capabilities: Smaller models within the same family perform significantly worse (e.g., gpt-, llama-, gemini-series)
- Effectiveness of CypherBench benchmarking: Figure 5 illustrates performance across basic and special MATCH patterns, RETURN templates, and domains.
Performance Analysis of LLMs: gpt-4o, claude3.5-sonnet, qwen2.5-72b, llama3.1-70b
Focus Models:
- gpt-4o
- claude3.5-sonnet
- qwen2.5-72b
- llama3.1-70b
Evaluation Metrics:
- Execution accuracy (EX)
- Provenance subgraph Jaccard similarity (PSJS)
Performance Trends:
- All models achieve near-perfect PSJS on the basic single-node category, though their EX remains noticeably lower
- PSJS declines gradually as the graph patterns involve more relations
Error Analysis:
- Most errors for basic category result from incorrect deduplication of distinct entities
- gpt-4o struggles with time-sensitive questions
- claude3.5-sonnet performs poorly on comparison questions
The top right chart shows execution accuracy across different RETURN templates. Models have varying weaknesses depending on the template. Claude3.5-sonnet achieves near-zero accuracy on SORT questions because it includes unnecessary columns in its output, even though only entity names are requested.
The bottom right chart in Figure 5 shows execution accuracy across domains. The four models exhibit similar trends, with flight_accident and nba being the easiest domains and the remaining domains showing comparable difficulty.
Figure 6 shows the distribution of errors made by gpt-4o and llama3.1-8b over 50 incorrect predictions each.
Error Analysis of LLMs:
- We examine gpt-4o and llama3.1-8b models to understand their error behavior
- Randomly sample 50 incorrect predictions from each model for annotation (Appendix A.5)
- Define error categories during the annotation process (Table 4)
Error Categories:
- Reversed Direction: Model reverses direction of a relation
- Entity Linking: Entity name does not correspond to intended entity in database
- Pattern Not Aligned with Question: MATCH pattern conforms to schema but doesn't align with question's intent
- Schema Violation: llama3.1-8b frequently violates the graph schema, whereas gpt-4o follows the schema far more reliably
Example Predictions: Table 4 shows sample predictions from both models with annotated error categories.
Error Distribution: Figure 6 illustrates the distribution of error categories across the models.
CypherBench: Graph Retrieval Methods Related to Knowledge Base Question Answering (KBQA)
Background:
- CypherBench serves as benchmark for evaluating KBQA and graph retrieval methods
Types of Methods:
- Approximate Retrieval Methods: Identify top relevant elements based on relevance to the question
- Retrieve k-hop neighborhood of mentioned entities
- Verbalize entities or relations into text, use embedding-based methods
- Limitations:
- Inability to handle queries involving a large number of entities (e.g., global queries or complex aggregation queries)
- Reliance on expensive in-memory operations or need to embed entire graph
Precise Retrieval Methods:
- Description:
- Translate question into formal language query that fetches exactly what it asks for
- Custom Logical Forms:
- Transpiled into actual database queries or executed by custom engine
- Limitations:
- Lack support for certain graph querying features like grouping and variable-length path matching
- Inability to handle large graphs due to scalability issues and limitations compared to standard database query languages
- Recent Works:
- Use LLMs to generate graph database queries (e.g., SPARQL or Cypher)
- Assumption of identified entities in the question
- Limitations:
- Simplifications such as assuming identifiers are provided
- Working with smaller graphs only
Our Work:
- Handles realistic scenario of graph retrieval over full-scale knowledge graphs using only the question as input.
Text-to-Query Benchmarks Comparison
CypherBench vs. Existing Text-to-Query Benchmarks:
- CypherBench: text-to-query benchmark with databases and (question, database query) pairs
- KBQA benchmarks: represent specific type of text-to-query benchmarks where the databases are knowledge graphs, and queries are graph database queries
- Comparison in Table 6:
- Schema Size and Data Size: insights into the complexity of databases in existing text-to-query benchmarks
- Most existing KBQA benchmarks are based on text-to-SPARQL over RDF knowledge graphs, but the massive schema poses challenges for LLMs in zero-shot settings
- CypherBench has a comparable schema size to text-to-SQL benchmarks, while still encompassing up to 7 million entities
Text-to-SQL Benchmarks:
- Spider: based on Wikipedia and other sources, with 200 databases, an average schema size of 5.1 tables and 27.6 columns, and a data size of 400k rows
- BIRD-SQL: based on Kaggle and other sources, with 95 databases, an average schema size of 7.3 tables and 54.2 columns, and a data size of 52M rows
Text-to-SPARQL/RDF Benchmarks:
- LC-Quad 2.0: based on Wikidata, with 1 graph, schema size of 12k relation types, and data size of 114M entities
- GrailQA: based on Freebase, with 1 graph, schema size of 37k relation types, and data size of 45M entities
- KQA Pro: based on FB15k-237, with 1 graph, schema size of 0.8k relation types, and data size of 16k entities
Text-to-nGQL/Property Graphs Benchmarks:
- R³-NL2GQL: based on OpenKG, with 3 graphs, schema size of 5.3 relation types and 13 properties, and data size of 46k entities
- Fin/Medi-GQL: based on OpenKG, with 2 graphs, schema size of 13 relation types and 38 properties, and data size of 713k entities
Text-to-Cypher/Property Graphs Benchmarks:
- MetaQA-Cypher: based on OMDb, with 1 graph, schema size of 5 relation types and 5 properties, and data size of 43k entities
- SpCQL: based on OwnThink, with 1 graph, schema size of 480k relation types and 1 property, and data size of 16M entities
Microsoft's GraphRAG:
- Introduced to address corpus-level summarization queries (e.g., "What are the main themes in the dataset?")
- Cannot be handled by standard top-k embedding-based retrieval methods
- Leverages a centralized knowledge graph to index textual documents
- Enables handling of queries that rely on a large volume of documents
GraphRAG Stages:
- Knowledge graph construction during indexing time
- Graph retrieval during query time
Comparison with Original GraphRAG:
- Original GraphRAG uses a slightly different graph formalism than typical knowledge graphs
- Nodes are entity communities at various abstraction levels
- Retrieval performed by fetching all communities at a specific level
LlamaIndex and Property Graph Index:
- Leading open-source LLM framework for RAG workflows
- Constructs a Neo4j property graph from textual documents using LLMs during indexing time
- Conducts graph retrieval via text-to-Cypher during query time
Text-to-Cypher Benchmark:
- Provides the first comprehensive benchmark for evaluating graph retrieval, a critical component in GraphRAG.
RDF to Property Graph Transformations
Methods from Semantic Web Community:
- Several studies have explored methods for transforming RDF graphs into property graphs [^57]
- Transformations can be computationally expensive for full-size modern RDF graphs like Wikidata.
Two-step Process:
- First, RDF triples are mapped directly to edges in the property graph [^57]
- Second, edges representing entity properties are transformed into node properties
Exception:
- g2glab/g2g adopts an approach similar to ours, transforming RDF into property graphs by executing SPARQL queries over the RDF data.
Lacking Key Functionalities:
- Their method lacks essential functionalities for ensuring output quality, as described in Section 3
Several tools have been developed to extract domain-specific subgraphs of Wikidata or general RDF knowledge graphs to address scalability issues in modern knowledge graphs. Examples include KGTK, WDumper, and WDSub. However, these tools process the entire RDF dump and produce RDF output.
Wikidata has received over 2 billion edits from 42,000 active editors worldwide, making it a comprehensive knowledge source. Our study proposes techniques for integrating Wikidata with Large Language Models (LLMs), offering opportunities in knowledge graphs and graph retrieval research.
The graph matching patterns and RETURN templates are listed in Table 11 and Table 12.
The prompt to rewrite template-generated questions is in Table 9.
The text-to-Cypher prompt used for evaluating LLMs is shown in Table 10.
Retrieving Graph Schema from Neo4j Database:
- Use custom queries instead of db.schema.visualization and apoc.meta.data for large graphs
- Queries provide complete and accurate schemas deterministically
- Approximately 15 times slower than built-in procedures (apoc.meta.data)
- Retrieve schemas using the following queries:
- Retrieve Nodes:
- Labels, properties, incoming and outgoing relationships
- Retrieve Relationships:
- Types, properties, source and target nodes
- Aggregate and Serialize Results
- Example Query:
- Custom Cypher query for node retrieval (modify according to your requirements)
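A sketch of one such custom query, issued through the Python driver (illustrative only; the benchmark ships its own schema-retrieval queries, which differ in detail):

```python
from neo4j import GraphDatabase

# Sketch: enumerate node labels with their property keys and outgoing relationship
# types by scanning the graph. Slower than apoc.meta.data but complete and deterministic.
NODE_SCHEMA_QUERY = """
MATCH (n)
UNWIND labels(n) AS label
OPTIONAL MATCH (n)-[r]->()
RETURN label,
       collect(DISTINCT keys(n)) AS property_key_sets,
       collect(DISTINCT type(r)) AS outgoing_relationship_types
"""

def fetch_node_schema(uri: str = "bolt://localhost:7687",
                      auth: tuple = ("neo4j", "password")) -> list[dict]:
    driver = GraphDatabase.driver(uri, auth=auth)
    try:
        with driver.session() as session:
            return [record.data() for record in session.run(NODE_SCHEMA_QUERY)]
    finally:
        driver.close()

# A relationship-schema query (types, properties, source/target labels) follows the same pattern.
```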
Table 13 provides detailed definitions of each error category in CypherBench, including examples.
CypherBench Statistics
- Table 8: Statistics of the graphs
- Wikipedia refers to the number of English Wikipedia articles linked from the entities (roughly the number of entities with Wikipedia articles)
Question Rewriting Prompt and Text-to-Cypher Prompt
- Table 9: Sample question rewriting prompt
- Table 10: Sample text-to-Cypher prompt used in experiments
Graph Matching Patterns
- Table 11: Sample questions with various graph matching patterns from the benchmark
- Nodes in purple denote the answer entities
- Square nodes denote all entities of a particular type
- Circular nodes represent named entities
- Nodes and edges drawn with dashed lines are optional
- Edges with diamond arrowheads indicate relations with time-sensitivity constraints
RETURN Clause Categories and Sample Questions from the Benchmark
- Table 12: RETURN clause categories and sample questions from the benchmark
- RETURN clauses shown here apply to any graph pattern in the basic categories
Text-to-Cypher Error Categories
- Table 13: Definitions and examples for the 10 text-to-Cypher error categories
Schemas of the 11 Graphs in the Benchmark
- Table 14: Schemas of the 11 graphs in the benchmark
- Color of the property boxes indicates whether they are entity properties or relation properties
Schemas of the 11 Graphs in the Benchmark (Continued)
- Table 15: Schemas of the 11 graphs in the benchmark (Continued)