Hyper-Relational Knowledge Graph Pipeline
2026-04-21
1. Overview
The Hyper-Relational Knowledge Graph Pipeline targets scenarios where hyper-relational knowledge graph triples are extracted from natural language text, attribute-specific subgraphs are filtered from them, and QA pairs are then generated and evaluated. Compared with traditional triples, the hyper-relational representation can capture more complex graph structures, such as triples augmented with qualifiers like time and location, and the pipeline uses that richer structure to automatically build high-quality KG-QA datasets.
This pipeline is suitable for the following tasks:
- Extracting hyper-relational triples from complex text
- Retrieving and filtering subgraphs by specific attributes, such as <Time>
- Automatically generating multi-hop or otherwise complex QA pairs from hyper-relational knowledge graphs
- Automatically evaluating QA naturalness with a large language model
The main stages of the pipeline include:
- Hyper-Relational Triple Extraction: Extract a set of hyper-relational triples with qualifier attributes from plain text.
- Attribute-Based Subgraph Filtering: Filter generated triples according to a specified attribute tag, such as <Time>, and retain relevant subgraphs.
- Subgraph QA Generation: Generate QA pairs of a specified type and quantity from the filtered subgraphs.
- QA Naturalness Evaluation: Score the generated QA pairs for naturalness and fluency.
2. Quick Start
Step 1: Create a new DataFlow working directory
mkdir run_dataflow_kg
cd run_dataflow_kg
Step 2: Initialize the pipeline code and default data
dfkg init
After initialization, the following files will be generated:
- Pipeline script: api_pipelines/hyper_kg_qa_pipeline.py
- Default data: example/hyper_kg_qa_pipeline_input.json
Step 3: Configure the API key
This pipeline uses gpt-4o by default. You need to configure your OpenAI API key. The exact environment variable name depends on the underlying implementation of APILLMServing_request, and it is commonly DF_API_KEY or a similar setting:
export DF_API_KEY=sk-xxxx
Step 4: Run the pipeline
python api_pipelines/hyper_kg_qa_pipeline.py
3. Data Flow and Pipeline Logic
3.1 Input Data
The input data for this pipeline should be stored in hyper_kg_qa_pipeline_input.json, and each record must contain at least the following field:
- text: The original natural language text of the record.
An example input is shown below:
[
{
"text": "In March 2022, Alice joined Company A in Beijing. In July 2023, Alice led Project Orion for Company A in Shanghai. Bob collaborated with Alice on Project Orion in Shanghai during 2023."
},
{
"text": "In April 2021, City X hosted the Spring Marathon in Riverside Park. In April 2022, City X hosted the Spring Marathon again in Riverside Park. Club Y won multiple titles at the 2022 event."
},
{
"text": "In January 2020, Hospital Z launched a vaccination program in Shenzhen. In June 2021, Hospital Z expanded the program to Guangzhou. Doctors from Hospital Z reported higher participation in 2021."
}
]
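If you want to build your own input file instead of using the default data, the minimal Python sketch below writes a file in the expected format. The path is illustrative and should match first_entry_file_name in the pipeline script (Section 4).
import json

# Each record only needs a "text" field holding the source passage
records = [
    {"text": "In March 2022, Alice joined Company A in Beijing."},
    {"text": "In April 2021, City X hosted the Spring Marathon in Riverside Park."},
]

# Illustrative path; adjust to wherever the pipeline reads its input from
with open("hyper_kg_qa_pipeline_input.json", "w", encoding="utf-8") as f:
    json.dump(records, f, ensure_ascii=False, indent=2)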
3.2 Hyper-Relational Knowledge Graph QA Pipeline Logic (HyperKGQA_APIPipeline)
Step 1: Hyper-Relational Triple Extraction (HRKGTripleExtraction)
Functionality:
- Use an LLM to extract hyper-relational triples from text, including subject, object, relation, and qualifier attributes such as time and location.
Input: text
Output: tuple
Operator Run:
self.hyper_triple_extraction_step1.run(
storage=self.storage.step(),
input_key="text",
output_key="tuple",
)
Step 2: Attribute-Based Subgraph Filtering (HRKGRelationTripleAttributeFilter)
Functionality:
- Keep only subgraph data that contains a specific attribute qualifier, such as <Time>, to narrow the scope for later QA generation.
Input: tuple
Output: subgraph
Parameter: attr_tag="<Time>"
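For reference, the corresponding call in the pipeline script (Section 4) passes the attribute tag at run time:
self.subgraph_filter_step2.run(
    storage=self.storage.step(),
    input_key="tuple",
    output_key="subgraph",
    attr_tag="<Time>",
)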
Step 3: Subgraph QA Generation (HRKGRelationTripleSubgraphQAGeneration)
Functionality:
- Use the filtered subgraphs with the target attribute to generate numeric or other specific types of QA pairs, such as qa_type="num".
- Control the number of generated pairs, such as num_q=5.
Input: subgraph
Output: QA_pairs
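In the default pipeline script, the QA type and number of questions are fixed when the operator is constructed (qa_type="set", num_q=3), while the run call only wires up the input and output keys:
self.subgraph_qa_generation_step3 = HRKGRelationTripleSubgraphQAGeneration(
    llm_serving=self.llm_serving,
    lang="en",
    qa_type="set",
    num_q=3,
)
self.subgraph_qa_generation_step3.run(
    storage=self.storage.step(),
    input_key="subgraph",
    output_key="QA_pairs",
)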
Step 4: QA Naturalness Evaluation (KGQANaturalEvaluator)
Functionality:
- Use an LLM to evaluate the generated QA pairs for naturalness and language quality, usually producing scores or evaluation results.
Input: QA_pairs
Output: naturalness_scores
3.3 Output Data
After the pipeline finishes, DataFlow storage under the default path ./cache_local will contain the following output fields:
- text: Original data
- tuple: Extracted hyper-relational triples
- subgraph: Graph structures after removing entries that do not contain the target attribute
- QA_pairs: Generated QA pairs, including questions and answers
- naturalness_scores: Naturalness evaluation scores for the QA pairs
Example output conceptually looks like the following. The actual structure depends on the implementation details of the underlying operators:
{
"text":"In January 2020, Hospital Z launched a vaccination program in Shenzhen. In June 2021, Hospital Z expanded the program to Guangzhou. Doctors from Hospital Z reported higher participation in 2021.",
"tuple":[
"<subj> Hospital Z <obj> Vaccination Program <rel> Launched <Time> January 2020 <Location> Shenzhen",
"<subj> Hospital Z <obj> Vaccination Program <rel> Expanded <Time> June 2021 <Location> Guangzhou",
"<subj> Doctors from Hospital Z <obj> Participation <rel> Reported <Time> 2021 <Degree> Higher"
],
"subgraph":[
"<subj> Hospital Z <obj> Vaccination Program <rel> Launched <Time> January 2020 <Location> Shenzhen",
"<subj> Hospital Z <obj> Vaccination Program <rel> Expanded <Time> June 2021 <Location> Guangzhou",
"<subj> Doctors from Hospital Z <obj> Participation <rel> Reported <Time> 2021 <Degree> Higher"
],
"QA_pairs":[
{
"question":"In which locations did Hospital Z's vaccination program take place over time?",
"answer":"Shenzhen, Guangzhou"
},
{
"question":"What are the times associated with the launch and expansion of Hospital Z's vaccination program?",
"answer":"January 2020, June 2021"
},
{
"question":"What changes related to vaccination programs at Hospital Z were reported in 2021?",
"answer":"Vaccination Program Expanded, Doctors Participation Higher"
}
],
"naturalness_scores":[
1,
0.5,
0.5
]
}
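To keep only the most natural QA pairs, you can post-process the cached output. The sketch below is illustrative: it assumes records follow the structure above, that scores fall in the 0-1 range as in the example, and it uses a hypothetical cache file name; check ./cache_local for the actual file written by FileStorage.
import json

# Hypothetical file name; FileStorage decides the actual name under ./cache_local
with open("./cache_local/hyper_kg_qa_pipeline_step4.json", "r", encoding="utf-8") as f:
    records = json.load(f)

kept = []
for record in records:
    # Pair each QA item with its naturalness score and apply an arbitrary threshold
    for qa, score in zip(record["QA_pairs"], record["naturalness_scores"]):
        if score >= 0.8:
            kept.append(qa)

print(f"Kept {len(kept)} of the generated QA pairs")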
4. Pipeline Example
The complete code structure of hyper_kg_qa_pipeline.py is shown below:
from dataflow.serving import APILLMServing_request
from dataflow.utils.storage import FileStorage
from dataflow.operators.general_kg.eval.kg_qa_natural_eval import KGQANaturalEvaluator
from dataflow.operators.hyper_relation_kg import (
    HRKGTripleExtraction,
    HRKGRelationTripleAttributeFilter,
    HRKGRelationTripleSubgraphQAGeneration,
)
class HyperKGQA_APIPipeline:
def __init__(self):
self.storage = FileStorage(
first_entry_file_name="../example_data/hyper_kg_qa_pipeline_input.json",
cache_path="./cache_local",
file_name_prefix="hyper_kg_qa_pipeline",
cache_type="json",
)
self.llm_serving = APILLMServing_request(
api_url="https://api.openai.com/v1/chat/completions",
model_name="gpt-4o",
max_workers=30,
)
self.hyper_triple_extraction_step1 = HRKGTripleExtraction(
llm_serving=self.llm_serving,
lang="en",
)
self.subgraph_filter_step2 = HRKGRelationTripleAttributeFilter(
lang="en",
)
self.subgraph_qa_generation_step3 = HRKGRelationTripleSubgraphQAGeneration(
llm_serving=self.llm_serving,
lang="en",
qa_type="set",
num_q=3,
)
self.qa_natural_eval_step4 = KGQANaturalEvaluator(
llm_serving=self.llm_serving,
lang="en",
)
def forward(self):
self.hyper_triple_extraction_step1.run(
storage=self.storage.step(),
input_key="text",
output_key="tuple",
)
self.subgraph_filter_step2.run(
storage=self.storage.step(),
input_key="tuple",
output_key="subgraph",
attr_tag="<Time>",
)
self.subgraph_qa_generation_step3.run(
storage=self.storage.step(),
input_key="subgraph",
output_key="QA_pairs",
)
self.qa_natural_eval_step4.run(
storage=self.storage.step(),
input_key="QA_pairs",
output_key="naturalness_scores",
)
if __name__ == "__main__":
model = HyperKGQA_APIPipeline()
model.forward()
