
DLT Integration

dagster-odp provides an enhanced integration with DLT (Data Load Tool) for configuration-driven data ingestion. Using the GitHub API ingestion example from our Chess Data Analysis Tutorial as a reference, this guide explains the components and capabilities of ODP's DLT integration.

Prerequisites

This guide assumes familiarity with DLT's core concepts and operations. If you're new to DLT, we recommend reading How DLT Works before continuing.

Understanding DLT Concepts

Before diving in, let's clarify some terminology; the sketch after this list shows how the pieces fit together:

  1. DLT Source: A Python module that defines how to extract data from an external system (like an API or database)
  2. DLT Resource: A logical subset of data from a source (like a specific API endpoint or database table). NOTE: This is different from Dagster/ODP resources!
  3. DLT Pipeline: Defines how data moves from source to destination, including schema management
  4. DLT Destination: Where the data gets loaded (like BigQuery, DuckDB, or filesystem)
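
To make these concrete, here's a minimal, self-contained sketch in plain DLT (no ODP involved); the chess_source, players, and chess_data names are illustrative:

import dlt

@dlt.resource  # DLT resource: one logical table of data
def players():
    yield [{"username": "magnus", "rating": 2830}]

@dlt.source  # DLT source: groups related resources
def chess_source():
    return players

# DLT pipeline: moves data from the source to a destination (DuckDB here)
pipeline = dlt.pipeline(
    pipeline_name="chess",
    destination="duckdb",
    dataset_name="chess_data",
)
pipeline.run(chess_source())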

ODP's Enhanced DLT Integration

ODP provides several improvements over Dagster's built-in DLT integration:

  1. Granular Asset Creation:

    While Dagster's integration creates a single asset per pipeline, ODP:

    • Creates individual assets for each DLT object
    • Automatically generates source assets as Dagster external assets
    • Sets up proper dependencies between assets
  2. Configuration-Driven:

    • Pipeline creation through YAML/JSON configuration
    • Automatic secrets management
    • Parameter validation using Pydantic

Implementation Components

The following sections break down how ODP's DLT integration works, using our tutorial's GitHub API pipeline as an example.

Project Structure

A typical ODP project using DLT has the following structure, with clear separation between ODP configuration and DLT implementation:

my_project/
├── odp_config/
│   ├── dagster_config.yaml     # Resource configuration
│   └── workflows/
│       └── dlt_workflow.yaml   # Pipeline definition
└── dlt_project/
    ├── github/                 # DLT source implementation
    │   ├── __init__.py
    │   └── source.py
    ├── .dlt/
    │   └── secrets.toml        # DLT credentials
    └── schemas/
        └── export/
            └── github.schema.yaml  # Generated DLT schema

Schema Generation

ODP uses DLT's schema file to create appropriate Dagster assets. The schema can be generated by adding schema export to a DLT pipeline:

dlt_project/github/pipeline.py
import dlt

pipeline = dlt.pipeline(
    pipeline_name="github",
    destination="duckdb",
    dataset_name="github_data",
    export_schema_path="schemas/export"  # Enables schema generation
)

Running this pipeline creates a schema file that ODP uses to:

  • Determine what Dagster assets to create
  • Set up proper dependencies between assets
  • Configure asset materializations

Secrets Management

DLT expects secrets in environment variables with a specific format: SECTION__KEY=value (the TOML section and key, uppercased and joined by a double underscore)

For example, when a secrets.toml file contains:

secrets.toml
[github]
api_token = "ghp_123..."

DLT expects an environment variable:

GITHUB__API_TOKEN=ghp_123...

ODP handles this automatically (sketched below) by:

  1. Reading secrets from .dlt/secrets.toml
  2. Converting them to DLT's expected format
  3. Setting them as environment variables
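
Concretely, the conversion is roughly equivalent to the following sketch (the function name and details are illustrative, not ODP's actual implementation):

import os
import tomllib  # Python 3.11+; use the third-party "toml" package on older versions

def export_dlt_secrets(path=".dlt/secrets.toml"):
    """Illustrative: flatten TOML tables into DLT's SECTION__KEY env vars."""
    with open(path, "rb") as f:
        secrets = tomllib.load(f)

    def walk(table, prefix=()):
        for key, value in table.items():
            if isinstance(value, dict):
                walk(value, prefix + (key,))  # recurse into nested sections
            else:
                # [github] api_token = "..." becomes GITHUB__API_TOKEN
                os.environ["__".join(prefix + (key,)).upper()] = str(value)

    walk(secrets)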

Resource Configuration

The DLT resource in dagster_config.yaml tells ODP where to find the DLT implementation:

dagster_config.yaml
resources:
  - resource_kind: dlt
    params:
      project_dir: dlt_project  # Root directory containing DLT code

Single DLT Resource

Each Dagster code location can have only one DLT resource. However, multiple DLT pipelines can be organized in subdirectories within the project_dir.

Asset Configuration

DLT assets are defined in workflow files. Here's an example configuration with explanations of each component:

workflow.yaml
assets:
  - asset_key: github/api/pull_requests  # Creates source asset github/api
    task_type: dlt
    description: "Fetch GitHub pull requests"
    group_name: data_ingestion
    params:
      # Source Configuration
      source_module: github.source      # Relative to project_dir
      schema_file_path: schemas/export/github.schema.yaml

      # DLT Source Parameters 
      source_params:                    # Passed to DLT source function
        repo: "dagster-io/dagster"
        since: "{{#date}}{{context.partition_key}}|%Y-%m-%d{{/date}}"

      # DLT Destination Configuration
      destination: duckdb              # Any supported DLT destination
      destination_params:              # Destination-specific configuration
        database: data/github.db

      # DLT Pipeline Configuration
      pipeline_params:                 # Pipeline configuration options
        dataset_name: github_data

      # DLT Load Configuration  
      run_params:                     # Controls data loading behavior
        write_disposition: append

Asset Configuration Reference

source_module

Specifies the Python path to the DLT source relative to project_dir. For example:

  • github.source resolves to dlt_project/github/source.py
  • The module should contain (see the sketch below):
    • A source function decorated with @dlt.source
    • Resource functions that yield data
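
A minimal source module might look like the following sketch; the function names, endpoint, and parameters are illustrative rather than part of ODP:

dlt_project/github/source.py
import dlt
import requests

@dlt.source
def github_source(repo: str, api_token: str = dlt.secrets.value):
    # dlt.secrets.value injects the token from .dlt/secrets.toml
    @dlt.resource(write_disposition="append")
    def pull_requests():
        # Yield one page of pull request records from the GitHub REST API
        response = requests.get(
            f"https://api.github.com/repos/{repo}/pulls",
            headers={"Authorization": f"Bearer {api_token}"},
            params={"state": "all"},
        )
        response.raise_for_status()
        yield response.json()

    return pull_requests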

schema_file_path

  • Path to the schema file relative to project_dir
  • Generated when you run your DLT pipeline with export_schema_path
  • ODP uses this (see the sketch after this list) to:
    • Create individual assets for each DLT object
    • Set up correct dependencies
    • Configure materializations
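
For a rough picture of what ODP reads, a DLT schema file lists every table the pipeline produces; this illustrative snippet prints them (the exact keys ODP inspects are internal to ODP):

import yaml  # pip install pyyaml

with open("dlt_project/schemas/export/github.schema.yaml") as f:
    schema = yaml.safe_load(f)

# Tables whose names start with "_dlt" are DLT's internal bookkeeping
# tables rather than your data.
for table_name in schema["tables"]:
    if not table_name.startswith("_dlt"):
        print(table_name)  # e.g. pull_requests, pull_requests__comments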

source_params

Parameters passed directly to the DLT source function. Each source has its own configuration options:

  • Refer to DLT Verified Sources for source-specific parameters
  • Supports ODP variables for dynamic values:
    source_params:
      since_date: "{{context.partition_key}}"
      api_endpoint: "{{resource.api.endpoint}}"
    
  • No need for config.toml; all configuration can be passed here

destination

Any DLT-supported destination can be used. ODP adds standardized materialization metadata for:

Destination   Metadata Field          Format
BigQuery      destination_table_id    dataset.table
DuckDB        destination_table_id    schema.table
Filesystem    destination_file_uri    File path

Downstream assets can access this metadata by replacing / in the upstream asset key with _ in the handlebars syntax:

workflow.yaml
assets:
  - asset_key: process_pull_requests
    task_type: duckdb_query
    depends_on: ["github/api/pull_requests"]
    params:
      query: SELECT * FROM {{github_api_pull_requests.destination_table_id}}

destination_params

Configuration for the chosen destination. Common parameters:

# BigQuery
destination_params:
  dataset: my_dataset
  project: my-project

# DuckDB
destination_params:
  database: path/to/db.duckdb

# Filesystem
destination_params:
  bucket_url: gs://my-bucket/path/  # or local path

For destination-specific configuration, refer to DLT Destinations.

pipeline_params

Pipeline-specific configuration passed to DLT's pipeline creation; see DLT Pipeline Configuration for the available options:

pipeline_params:
  dataset_name: github_data       # Logical grouping of tables
  pipeline_name: github_pipeline  # Optional: pipeline identifier

run_params

Controls how DLT loads data:

run_params:
  write_disposition: append    # append, replace, or merge
  merge_key: id               # For merge disposition

Asset Generation

In ODP, a DLT asset configuration corresponds to one DLT resource (like a specific API endpoint or database table) but can generate multiple Dagster assets based on that resource's output. Given an asset key like abc/github/pull_requests, here's what ODP creates:

  1. Source Asset:

    An external Dagster asset (abc/github) representing the data source

    • Created automatically from the asset key prefix
    • Functions as a logical grouping for related DLT objects
    • Allows dependency management at the source level
  2. DLT Object Assets:

    Individual Dagster assets for each object produced by the DLT resource

    • Named by combining source asset prefix with DLT object name
    • Created automatically based on the DLT schema
    • Example for a pull requests API that returns nested data:
      abc/github/pull_requests              # Main pull request data
      abc/github/pull_requests__comments    # Comments on pull requests
      abc/github/pull_requests__reactions   # Reactions to pull requests
      
    • Enables granular dependencies on specific DLT objects
    • All these assets come from a single DLT resource (the pull requests endpoint); the sketch below shows how nested data produces them
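
The child assets come from DLT's normalization of nested data: nested lists become child tables named parent__field. A minimal illustration (the data shape is hypothetical):

import dlt

@dlt.resource
def pull_requests():
    # DLT normalizes the nested lists into child tables named
    # pull_requests__comments and pull_requests__reactions,
    # which ODP then surfaces as separate Dagster assets.
    yield {
        "id": 1,
        "title": "Fix flaky test",
        "comments": [{"body": "LGTM"}],
        "reactions": [{"content": "+1"}],
    }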

One Resource, Many Assets

While you define one DLT asset in your ODP configuration targeting a specific DLT resource (like an API endpoint), that resource might produce multiple related data objects. ODP automatically creates separate Dagster assets for each object, allowing downstream tasks to depend on exactly the data they need.

Multiple DLT Pipelines

While you can only have one DLT resource (and thus one project_dir), you can organize multiple DLT pipelines in subdirectories:

dlt_project/
├── github/                # GitHub API pipeline
│   ├── __init__.py
│   └── source.py
├── stripe/                # Stripe API pipeline
│   ├── __init__.py
│   └── source.py
└── schemas/
    └── export/
        ├── github.schema.yaml
        └── stripe.schema.yaml

Each pipeline can have its own:

  • Source implementation
  • Schema file
  • Secrets in .dlt/secrets.toml
  • Configuration in workflow files

Best Practices

  1. Asset Organization

    • Use meaningful prefixes in asset keys to group related data
    • Create separate assets for different DLT resources
    • Use asset dependencies to manage relationships
  2. Configuration

    • Keep secrets in secrets.toml
    • Pass dynamic configuration through source_params
  3. Project Structure

    • Organize related DLT pipelines in subdirectories
    • Keep schema files in the DLT directory

Common Issues

  1. Missing Schema File

    Error: Schema file not found at 'schemas/export/github.schema.yaml'
    
    Solution: Run your DLT pipeline with export_schema_path to generate the schema

  2. Secrets Not Available

    Error: Environment variable 'GITHUB__API_TOKEN' not found
    
    Solution: Ensure secrets are properly configured in .dlt/secrets.toml

  3. Invalid Asset Keys

    Error: DLT asset key must contain at least one '/'
    
    Solution: DLT asset keys must have at least two parts (e.g., github_api/resource)

This enhanced DLT integration makes it easier to build maintainable data pipelines while keeping the benefits of both Dagster's asset-based paradigm and DLT's powerful data ingestion capabilities.