Quickstart
2026-04-02
DataFlow-KG follows a workflow of code generation + custom modification + script execution: a command-line call automatically generates default execution scripts and entry Python files, you customize them as needed (change the dataset, switch LLM APIs, rearrange operators, or add your own pipeline), and then run the Python file to execute the corresponding functionality.
You only need three steps to run one of the provided Pipelines.
1. Initialize the Project
Run the following in an empty directory:
```bash
dfkg init
```
This will generate the following folders in your current working directory:
```
$ tree -L 1
.
|-- api_pipelines
|-- core_text
|-- cpu_pipelines
|-- example_data
|-- gpu_pipelines
|-- playground
`-- simple_text_pipelines
```
Directory Purposes:
- `cpu_pipelines`: Pipelines that only use the CPU
- `core_text`: Examples of the most basic operators in DataFlow-KG
- `api_pipelines`: Uses online LLM APIs (recommended for beginners)
- `gpu_pipelines`: Uses local GPU models
- `example_data`: Default input data for all example Pipelines
- `playground`: Lightweight examples that do not constitute a complete Pipeline
- `simple_text_pipelines`: Examples related to basic text processing
2. Pipeline Classification (Choose one)
Pipelines with the same name in different directories form an inclusion hierarchy of required resources:
| Directory | Dependent Resources |
|---|---|
| cpu_pipelines | CPU only |
| api_pipelines | CPU + LLM API |
| gpu_pipelines | CPU + API + Local GPU |
Note: For beginners, it is highly recommended to start directly from `api_pipelines`! Later, if you have a GPU, you only need to change the `LLMServing` to a local model.
3. Run Your First Pipeline
Enter any Pipeline directory, for example:
```bash
cd api_pipelines
```
Open the Python file within it. You typically only need to focus on two configurations:
(1) Input Data Path
```python
self.storage = FileStorage(
    first_entry_file_name="<path_to_dataset>"
)
```
This defaults to the sample dataset we provide and can be run directly. You can change it to your own dataset path to run the pipeline on your own data.
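For example, to run on your own data you only need to swap in your own path; the file name (and its format) below is just a placeholder:

```python
# Replace the bundled sample data with your own dataset.
# The path below is only a placeholder -- point it at your actual file.
self.storage = FileStorage(
    first_entry_file_name="./my_data/my_dataset.jsonl"
)
```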
(2) LLM Serving
If you are using an API, you first need to set the environment variable:
Linux / macOS
```bash
export DF_API_KEY=sk-xxxxx
```
Windows CMD
```bat
set DF_API_KEY=sk-xxxxx
```
PowerShell
```powershell
$env:DF_API_KEY="sk-xxxxx"
```
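The generated script picks the key up from this variable at run time (the `key_name_of_api_key` field shown in the next section is what names the variable a serving reads). If you want to confirm the variable actually reached your Python process, a purely optional, minimal check is:

```python
import os

# Optional sanity check: confirm the API key variable is visible to this
# process before launching a pipeline script.
if not os.environ.get("DF_API_KEY"):
    raise RuntimeError("DF_API_KEY is not set in this environment")
```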
Then simply run the script directly:
```bash
python xxx_pipeline.py
```
4. Multi-API Serving (Optional)
If you need to use multiple LLM APIs simultaneously, you can specify different environment variable names for each Serving:
```python
llm_serving_openai = APILLMServing_request(
    api_url="https://api.openai.com/v1/chat/completions",
    key_name_of_api_key="OPENAI_API_KEY",
    model_name="gpt-4o"
)

llm_serving_deepseek = APILLMServing_request(
    api_url="https://api.deepseek.com/v1/chat/completions",
    key_name_of_api_key="DEEPSEEK_API_KEY",
    model_name="deepseek-chat"
)
```
Then simply export the corresponding variables in your environment (e.g., `export OPENAI_API_KEY=sk-xxxxx`), and the multiple API servings will coexist.
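As a small optional sketch (not part of the generated scripts): since each serving reads the variable named in its `key_name_of_api_key` field, you can verify that every required variable is exported before the pipeline starts:

```python
import os

# Each serving reads the environment variable named in its
# key_name_of_api_key field, so all of them must be set before the run.
required = ["OPENAI_API_KEY", "DEEPSEEK_API_KEY"]
missing = [name for name in required if not os.environ.get(name)]
if missing:
    raise RuntimeError(f"Missing API key variables: {', '.join(missing)}")
```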
5. Add Custom Pipeline Templates (Optional)
If you want to turn your own Pipeline into a standard template that is generated automatically every time you run the initialization command, you can do so by modifying the statics folder in the source code:
- Enter the template directory: In the DataFlow-KG source code repository, find and enter the `dataflow/statics/pipelines/api_pipelines` folder.
- Insert your custom script: Place your own `your_custom_pipeline.py` file into this directory (a minimal skeleton is sketched after this list).
- Update the local installation: Return to the root directory of the DataFlow-KG repository and run `pip install -e .` so the change takes effect in your local environment.
- Re-initialize: Run `dfkg init` again in the directory where you want to work, and your custom Pipeline script will be generated automatically in the corresponding folder, ready to use out of the box.
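As a rough starting point for such a template (the class and method names below are illustrative; copy the actual imports, storage, serving, and operator setup from one of the generated `api_pipelines` examples):

```python
# your_custom_pipeline.py -- a minimal, illustrative skeleton for a custom
# template. Copy the real imports and setup from a generated example.

class YourCustomPipeline:
    def __init__(self):
        # Set up self.storage (a FileStorage pointing at a default dataset),
        # an LLM serving object, and your operators here, mirroring the
        # generated api_pipelines scripts.
        pass

    def forward(self):
        # Chain your operators over self.storage here.
        pass


if __name__ == "__main__":
    YourCustomPipeline().forward()
```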

