Think of a Connection as your data pipeline's passport - it tells your data processing system exactly where your data lives and how to access it. Just like you need different documents to travel to different countries, different data sources require different connection types.
Key Analogy: A Connection is like your team's standard procedure for accessing a specific type of data. Once set up, everyone can use the same method to access sales data from BigQuery, customer files from S3, or transaction records from Delta tables.
Physical Connections: Your "Go-To" Data Sources
These are connections to actual data storage you use every day:
╔═════════════════╦════════════════════════════════╦══════════════════════════════╗
║ Connection Type ║ What It's For ║ Notes ║
╠═════════════════╬════════════════════════════════╬══════════════════════════════╣
║ Parquet ║ Optimized columnar files ║ ║
╠═════════════════╬════════════════════════════════╬══════════════════════════════╣
║ ClickHouse ║ Fast analytical database ║ Only if enabled in your plan ║
╠═════════════════╬════════════════════════════════╬══════════════════════════════╣
║ BigQuery ║ Google's cloud data warehouse ║ Only if enabled in your plan ║
╠═════════════════╬════════════════════════════════╬══════════════════════════════╣
║ Delta Table ║ Time-travel enabled data lakes ║ ║
╠═════════════════╬════════════════════════════════╬══════════════════════════════╣
║ CSV/JSON ║ Simple data files ║ ║
╚═════════════════╩════════════════════════════════╩══════════════════════════════╝
Ephemeral Connections: Special-Purpose Tools
These are for processing and transforming data rather than storing it:
Variable: Temporary data storage during processing
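For orientation, here is a hedged sketch of what an ephemeral connection could look like. The kind string Variable follows the name above, but the exact fields an ephemeral connection accepts (such as a locator naming the temporary result) are assumptions:

connection:
  kind: Variable            # ephemeral: exists only for the duration of the run (assumed syntax)
  locator: step1_output     # hypothetical name for the intermediate result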
1. Kind: "What type of data are we working with?"
This is like specifying the data format or system type:
kind: BigQuery # We're accessing Google's data warehouse
kind: Deltatable # We're using a Delta table with version history
kind: CSV # We're reading a simple spreadsheet file
Business Impact: Getting this right means your pipeline uses the correct "dialect" to talk to your data source.
2. Locator: "Exactly where is the data?"
This is the specific address of your data:
# Different locator examples:
locator: sales_data.revenue_2024 # BigQuery table
locator: s3://company-data/sales/q1.csv # S3 file
locator: customer_transactions # Delta table name
locator: /shared/reports/daily_sales.parquet # Local file
If your data lives in your account storage, you can use short locators:
locator: revenue_2024 # BigQuery table
locator: q1.parquet # S3 file
locator: customer_transactions.delta # Delta table name
If you specify a dataset name or a bucket address, you'll also need to add config options and credentials, as shown in the sketch after the list below.
Think of it as:
For files: The path relative to your account's storage, or the full path (for external files)
For databases: The table name
For cloud storage: The complete URL
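Putting the two forms together, here is a minimal sketch (file and bucket names are made up). The short form resolves inside your account storage, while the full form points at an external bucket and therefore also needs the config described in the next step:

# Short locator: the file lives in your account's storage, no config needed
connection:
  kind: Parquet
  locator: q1.parquet

# Full locator: external bucket, so the config described in the next step is also required
connection:
  kind: Parquet
  locator: s3://company-data/sales/q1.parquet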
3. Config: "What permissions do we need?"
This contains access credentials and environment settings:
config:
  kind: s3                            # Which cloud provider?
  bucket: company-data                # Which storage "folder"?
  region: us-east-1                   # Which geographic location?
  # Credentials:
  access_key_id: <<AWS_KEY>>          # Security key (usually stored securely)
  secret_access_key: <<AWS_SECRET>>
Key Insight: The config is typically set up once by your IT team and then reused across multiple pipelines. As an analyst, you usually reference a pre-configured setup.
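In practice that means the credentials usually live in the secret manager and are referenced from the config rather than pasted in. A minimal sketch using the <<secrets.get(...)>> template that appears later in this guide; the secret names here are made up:

config:
  kind: s3
  bucket: company-data
  region: us-east-1
  access_key_id: <<secrets.get("company-aws-key-id")>>          # hypothetical secret name
  secret_access_key: <<secrets.get("company-aws-secret-key")>>  # hypothetical secret name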
Scenario 1: Daily Sales Report from BigQuery
connection:
  kind: BigQuery                # TYPE: Google's data warehouse
  locator: daily_transactions   # LOCATION: Specific table
Scenario 2: Reading Customer Feedback Files
connection:
  kind: CSV                                          # TYPE: Spreadsheet file
  locator: s3://customer-data/feedback/week_24.csv   # LOCATION: Cloud file
  config:                                            # ACCESS: Company S3 bucket
    kind: s3
    bucket: customer-data
    access_key_id:       # reference a secret
    secret_access_key:
Scenario 3: Updating Marketing Campaign Data
connection:
  kind: Deltatable                      # TYPE: Versioned data table
  locator: marketing_campaign_results   # LOCATION: Marketing data table
  data_mode: append                     # ADD new records instead of replacing
  partition:                            # ORGANIZE by date for faster queries
    - campaign_date
Q: Why do I need to specify the "kind"?
A: Different systems speak different languages. Telling the system "this is BigQuery" means it uses Google's specific protocols. Telling it "this is CSV" means it reads simple comma-separated values.
Q: What connections are available?
A: There are several built-in connection types:
File - for connecting to a number of supported file types (Excel and HTML among others)
Parquet - for connecting to Parquet files
CSV - for connecting to CSV files
JSON - for connecting to JSON files
Deltatable - for connecting to a Delta table data store (a Databricks connection is currently not supported in this context)
Bigquery - for connecting to a Google BigQuery table
Clickhouse - for connecting to a ClickHouse table
Rest - for HTTP-based communication, supporting multiple request methods (GET, PUT, POST) and payload formats including JSON, Form Data, and GraphQL queries sent via POST
Q: What if my file moves locations?
A: You only need to update the locator field. The access permissions (config) usually stay the same if it's in the same storage system.
Q: Who sets up the config/credentials?
A: For connections within your site, you don't need credentials. For external connections, the site admin sets them up in the secret manager, and you can reference them from your config.
Q: What's the difference between Delta Table and regular tables?
A: Delta Tables keep history (you can see what changed yesterday), handle errors better, and are optimized for large datasets. Regular tables are simpler but lack these features.
Q: When would I use "raw_dispatch"?
A: When you need complete control over a complex query. Normally, the system helps format your queries, but raw_dispatch: true says "trust me, I know exactly what I want to run."
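A hedged sketch of what that could look like: the raw_dispatch flag is the one named above, while its placement next to the query field, the table name, and whether aggregations are allowed in this mode are all assumptions:

connection:
  kind: Clickhouse
  locator: web_events        # hypothetical table
  raw_dispatch: true         # run the query exactly as written, skipping the automatic rewriting
  query: >
    SELECT region, count(*) AS events
    FROM web_events
    GROUP BY region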
╔═══════════╦══════════════════════════════════════╦═════════════════════════════════╗
║ Component ║ Usually Provided By ║ Example ║
╠═══════════╬══════════════════════════════════════╬═════════════════════════════════╣
║ kind ║ You (what type of data) ║ BigQuery, CSV, Deltatable ║
╠═══════════╬══════════════════════════════════════╬═════════════════════════════════╣
║ locator ║ You (where the data is) ║ revenue_table, path/to/file.csv ║
╠═══════════╬══════════════════════════════════════╬═════════════════════════════════╣
║ config ║ Only needed for external connections ║ ║
╚═══════════╩══════════════════════════════════════╩═════════════════════════════════╝
In most cases, the config section is not needed. You typically only need to specify the kind and locator for your specific data sources.
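For instance, reading a file from your account storage needs nothing more than those two fields (the file name is made up):

connection:
  kind: CSV
  locator: feedback/week_24.csv   # relative to your account's storage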
When you're working with large datasets, it's important to fetch only what you need; for BigQuery tables, Delta tables, and ClickHouse databases, adding a query is mandatory. Adding a query to your connection helps by:
Limiting data volume (faster processing)
Reducing costs (especially for cloud services)
Improving performance (smaller datasets process quicker)
Yes, Add a Query When:
Working with BigQuery, Delta Tables, or ClickHouse
Your dataset is large (millions of records)
You only need specific columns or filtered results
You're working with versioned data (need latest records only)
No Query Needed When:
Working with CSV, JSON, or Parquet files (entire file is loaded)
Dataset is small (under 100,000 records)
You need all available data
Query Basics: Simple Filtering
SELECT what_you_need FROM ConnectionKind WHERE filter_conditions
Examples in Plain Language
# Example 1: Get all customer data from January 2024
query: >
  SELECT * FROM BigQuery
  WHERE order_date >= '2024-01-01'
  AND order_date < '2024-02-01'
# Example 2: Get unique customers who made purchases
query: >
  SELECT DISTINCT customer_id FROM Deltatable
# Example 3: Get specific columns with conditions
query: >
  SELECT customer_name, total_spent, region FROM Clickhouse
  WHERE total_spent > 1000
  AND region = 'North America'
When your connection has version tracking (using key and version fields), the system automatically gives you only the most recent version of each record.
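As a minimal sketch of a versioned connection, reusing the names from the rewrite example below; treat the exact placement of key and version inside the connection block as an assumption:

connection:
  kind: Deltatable
  locator: sales_data_table     # the table referenced in the rewrite below
  key: customer_id              # identifies "the same record" across versions
  version: last_updated         # the most recent value of this field wins
  query: >
    SELECT * FROM Deltatable WHERE department = 'Sales'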
Before (What you write):
SELECT * FROM Deltatable WHERE department = 'Sales'
After (What the system does automatically):
-- System rewrites to get only latest versions
SELECT * FROM (
  -- Rank records by version, keep only #1 (most recent)
  SELECT *, ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY last_updated DESC) as __rank__
  FROM sales_data_table
  WHERE department = 'Sales'
) WHERE __rank__ = 1
Business Impact:
No duplicates: Each customer appears only once with their latest data
Historical accuracy: You always see the current state
Simplified analysis: You don't need to manually filter for latest records
Supported (Common Use Cases) ✅
Filter by date ranges: WHERE date BETWEEN '2024-01-01' AND '2024-01-31'
Select specific columns: SELECT customer_id, name, email
Basic conditions: WHERE status = 'active' AND region = 'US'
Remove duplicates: SELECT DISTINCT product_id
Simple calculations: SELECT revenue, revenue * 0.2 as tax
Not Supported (Complex Operations) ❌
Multiple table joins (do this in a separate transform step)
Complex aggregations (use transform stages for GROUP BY)
Window functions (except the automatic version ranking)
Subqueries (keep it simple and linear)
Issue: Query is too slow
Try (see the sketch after this list):
Add date filters to limit time range
Select only needed columns instead of SELECT *
Check if you're querying the right partition key
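As a rough before/after sketch of those tips (column names are made up):

# Before: pulls every column and the full history
query: >
  SELECT * FROM BigQuery

# After: date-filtered, only the columns the report needs
query: >
  SELECT customer_id, order_date, revenue FROM BigQuery
  WHERE order_date >= '2024-01-01'
  AND order_date < '2024-02-01'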
Issue: Getting duplicate records
Check:
Does your connection have key and version fields set?
Are you working with a versioned data source (like Delta tables)?
Issue: "Unsupported query" error
Remember:
Keep queries simple - just SELECT, FROM, WHERE, basic ORDER BY
Do complex joins and calculations in separate transform stages
By understanding these query basics, you can work efficiently with large datasets while letting the system handle complex tasks like version management automatically.
A datastore connection is set up using a configuration object (S3 connection example):
connection:
  kind: Parquet
  locator: data.parquet
  config:
    kind: s3
    bucket: s3://my-bucket
    region: us-central-1
    access_key_id: ...        # Can be referenced with template: <<secrets.get("my-aws-key-id")>>
    secret_access_key: ...
Internal connections do not require configuration, but as soon as you introduce your own bucket or an absolute locator, you have to provide configuration.
The following connection uses internal storage which counts towards your quota:
connection:
  kind: Parquet
  locator: data.parquet