Think of a Connection as your data pipeline's passport - it tells your data processing system exactly where your data lives and how to access it. Just like you need different documents to travel to different countries, different data sources require different connection types.
Key Analogy: A Connection is like your team's standard procedure for accessing a specific type of data. Once set up, everyone can use the same method to access sales data from BigQuery, customer files from S3, or transaction records from Delta tables.
Physical Connections: Your "Go-To" Data Sources
These are connections to actual data storage you use every day:
╔═════════════════╦════════════════════════════════╦══════════════════════════════╗
║ Connection Type ║ What It's For ║ Notes ║
╠═════════════════╬════════════════════════════════╬══════════════════════════════╣
║ Parquet ║ Optimized columnar files ║ ║
╠═════════════════╬════════════════════════════════╬══════════════════════════════╣
║ ClickHouse ║ Fast analytical database ║ Only if enabled in your plan ║
╠═════════════════╬════════════════════════════════╬══════════════════════════════╣
║ BigQuery ║ Google's cloud data warehouse ║ Only if enabled in your plan ║
╠═════════════════╬════════════════════════════════╬══════════════════════════════╣
║ Delta Table ║ Time-travel enabled data lakes ║ ║
╠═════════════════╬════════════════════════════════╬══════════════════════════════╣
║ CSV/JSON ║ Simple data files ║ ║
╚═════════════════╩════════════════════════════════╩══════════════════════════════╝
Ephemeral Connections: Special-Purpose Tools
These are for processing and transforming data rather than storing it:
Variable: Temporary data storage during processing
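For orientation, here is a hedged sketch of what an ephemeral connection could look like. The kind string Variable follows the name above, but the exact fields an ephemeral connection accepts (such as a locator naming the temporary result) are assumptions:

connection:
  kind: Variable            # ephemeral: exists only for the duration of the run (assumed syntax)
  locator: step1_output     # hypothetical name for the intermediate result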
1. Kind: "What type of data are we working with?"
This is like specifying the data format or system type:
kind: BigQuery # We're accessing Google's data warehouse
kind: Deltatable # We're using a Delta table with version history
kind: CSV # We're reading a simple spreadsheet file
Business Impact: Getting this right means your pipeline uses the correct "dialect" to talk to your data source.
2. Locator: "Exactly where is the data?"
This is the specific address of your data:
# Different locator examples:
locator: sales_data.revenue_2024 # BigQuery table
locator: s3://company-data/sales/q1.csv # S3 file
locator: customer_transactions # Delta table name
locator: /shared/reports/daily_sales.parquet # Local file
If your data lives in your account storage, you can use short locators:
locator: revenue_2024 # BigQuery table
locator: q1.parquet # S3 file
locator: customer_transactions.delta # Delta table name
If you specify a dataset name or a bucket address, you'll also need to add config options and credentials, as shown in the sketch after the list below.
Think of it as:
For files: The path relative to your account's storage, or the full path (for external files)
For databases: The table name
For cloud storage: The complete URL
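Putting the two forms together, here is a minimal sketch (file and bucket names are made up). The short form resolves inside your account storage, while the full form points at an external bucket and therefore also needs the config described in the next step:

# Short locator: the file lives in your account's storage, no config needed
connection:
  kind: Parquet
  locator: q1.parquet

# Full locator: external bucket, so the config described in the next step is also required
connection:
  kind: Parquet
  locator: s3://company-data/sales/q1.parquet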
3. Config: "What permissions do we need?"
This contains access credentials and environment settings:
config:
  kind: s3                            # Which cloud provider?
  bucket: company-data                # Which storage "folder"?
  region: us-east-1                   # Which geographic location?
  # Credentials:
  access_key_id: <<AWS_KEY>>          # Security key (usually stored securely)
  secret_access_key: <<AWS_SECRET>>
Key Insight: The config is typically set up once by your IT team and then reused across multiple pipelines. As an analyst, you usually reference a pre-configured setup.
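In practice that means the credentials usually live in the secret manager and are referenced from the config rather than pasted in. A minimal sketch using the <<secrets.get(...)>> template that appears later in this guide; the secret names here are made up:

config:
  kind: s3
  bucket: company-data
  region: us-east-1
  access_key_id: <<secrets.get("company-aws-key-id")>>          # hypothetical secret name
  secret_access_key: <<secrets.get("company-aws-secret-key")>>  # hypothetical secret name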
Scenario 1: Daily Sales Report from BigQuery
connection:
  kind: BigQuery                # TYPE: Google's data warehouse
  locator: daily_transactions   # LOCATION: Specific table
Scenario 2: Reading Customer Feedback Files
connection:
  kind: CSV                                          # TYPE: Spreadsheet file
  locator: s3://customer-data/feedback/week_24.csv   # LOCATION: Cloud file
  config:                                            # ACCESS: Company S3 bucket
    kind: s3
    bucket: customer-data
    access_key_id:       # reference a secret
    secret_access_key:
Scenario 3: Updating Marketing Campaign Data
connection:
  kind: Deltatable                      # TYPE: Versioned data table
  locator: marketing_campaign_results   # LOCATION: Marketing data table
  data_mode: append                     # ADD new records instead of replacing
  partition:                            # ORGANIZE by date for faster queries
    - campaign_date
Q: Why do I need to specify the "kind"?
A: Different systems speak different languages. Telling the system "this is BigQuery" means it uses Google's specific protocols. Telling it "this is CSV" means it reads simple comma-separated values.
Q: What connections are available?
A: There are several built-in connection types:
File - for connecting to a number of supported file types (Excel and HTML among others)
Parquet - for connecting to Parquet files
CSV - for connecting to CSV files
JSON - for connecting to JSON files
Deltatable - for connecting to a Delta table data store (a Databricks connection is currently not supported in this context)
Bigquery - for connecting to a Google BigQuery table
Clickhouse - for connecting to a ClickHouse table
Rest - for HTTP-based communication, supporting multiple request methods (GET, PUT, POST) and payload formats including JSON, Form Data, and GraphQL queries sent via POST
Q: What if my file moves locations?
A: You only need to update the locator field. The access permissions (config) usually stay the same if it's in the same storage system.
Q: Who sets up the config/credentials?
A: For connections within your site, you don't need credentials. For external connections, the site admin sets them up in the secret manager, and you can reference them from your config.
Q: What's the difference between Delta Table and regular tables?
A: Delta Tables keep history (you can see what changed yesterday), handle errors better, and are optimized for large datasets. Regular tables are simpler but lack these features.
Q: When would I use "raw_dispatch"?
A: When you need complete control over a complex query. Normally, the system helps format your queries, but raw_dispatch: true says "trust me, I know exactly what I want to run."
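A hedged sketch of what that could look like: the raw_dispatch flag is the one named above, while its placement next to the query field, the table name, and whether aggregations are allowed in this mode are all assumptions:

connection:
  kind: Clickhouse
  locator: web_events        # hypothetical table
  raw_dispatch: true         # run the query exactly as written, skipping the automatic rewriting
  query: >
    SELECT region, count(*) AS events
    FROM web_events
    GROUP BY region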
╔═══════════╦══════════════════════════════════════╦═════════════════════════════════╗
║ Component ║ Usually Provided By ║ Example ║
╠═══════════╬══════════════════════════════════════╬═════════════════════════════════╣
║ kind ║ You (what type of data) ║ BigQuery, CSV, Deltatable ║
╠═══════════╬══════════════════════════════════════╬═════════════════════════════════╣
║ locator ║ You (where the data is) ║ revenue_table, path/to/file.csv ║
╠═══════════╬══════════════════════════════════════╬═════════════════════════════════╣
║ config ║ Only needed for external connections ║ ║
╚═══════════╩══════════════════════════════════════╩═════════════════════════════════╝
In most cases, the config section is not needed. You typically only need to specify the kind and locator for your specific data sources.
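For instance, reading a file from your account storage needs nothing more than those two fields (the file name is made up):

connection:
  kind: CSV
  locator: feedback/week_24.csv   # relative to your account's storage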
When you're working with large datasets, it's important to fetch only what you need; for BigQuery tables, Delta tables, and ClickHouse databases, adding a query is mandatory. Adding a query to your connection helps by:
Limiting data volume (faster processing)
Reducing costs (especially for cloud services)
Improving performance (smaller datasets process quicker)
Yes, Add a Query When:
Working with BigQuery, Delta Tables, or ClickHouse
Your dataset is large (millions of records)
You only need specific columns or filtered results
You're working with versioned data (need latest records only)
No Query Needed When:
Working with CSV, JSON, or Parquet files (entire file is loaded)
Dataset is small (under 100,000 records)
You need all available data
Query Basics: Simple Filtering
SELECT what_you_need FROM ConnectionKind WHERE filter_conditions
Examples in Plain Language
# Example 1: Get all customer data from January 2024
query: >
  SELECT * FROM BigQuery
  WHERE order_date >= '2024-01-01'
  AND order_date < '2024-02-01'
# Example 2: Get unique customers who made purchases
query: >
  SELECT DISTINCT customer_id FROM Deltatable
# Example 3: Get specific columns with conditions
query: >
  SELECT customer_name, total_spent, region FROM Clickhouse
  WHERE total_spent > 1000
  AND region = 'North America'
When your connection has version tracking (using key and version fields), the system automatically gives you only the most recent version of each record.
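As a minimal sketch of a versioned connection, reusing the names from the rewrite example below; treat the exact placement of key and version inside the connection block as an assumption:

connection:
  kind: Deltatable
  locator: sales_data_table     # the table referenced in the rewrite below
  key: customer_id              # identifies "the same record" across versions
  version: last_updated         # the most recent value of this field wins
  query: >
    SELECT * FROM Deltatable WHERE department = 'Sales'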
Before (What you write):
SELECT * FROM Deltatable WHERE department = 'Sales'
After (What the system does automatically):
-- System rewrites to get only latest versions
SELECT * FROM (
  -- Rank records by version, keep only #1 (most recent)
  SELECT *, ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY last_updated DESC) as __rank__
  FROM sales_data_table
  WHERE department = 'Sales'
) WHERE __rank__ = 1
Business Impact:
No duplicates: Each customer appears only once with their latest data
Historical accuracy: You always see the current state
Simplified analysis: You don't need to manually filter for latest records
Supported (Common Use Cases) ✅
Filter by date ranges: WHERE date BETWEEN '2024-01-01' AND '2024-01-31'
Select specific columns: SELECT customer_id, name, email
Basic conditions: WHERE status = 'active' AND region = 'US'
Remove duplicates: SELECT DISTINCT product_id
Simple calculations: SELECT revenue, revenue * 0.2 as tax
Not Supported (Complex Operations) ❌
Multiple table joins (do this in a separate transform step)
Complex aggregations (use transform stages for GROUP BY)
Window functions (except the automatic version ranking)
Subqueries (keep it simple and linear)
Issue: Query is too slow
Try (see the sketch after this list):
Add date filters to limit time range
Select only needed columns instead of SELECT *
Check if you're querying the right partition key
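As a rough before/after sketch of those tips (column names are made up):

# Before: pulls every column and the full history
query: >
  SELECT * FROM BigQuery

# After: date-filtered, only the columns the report needs
query: >
  SELECT customer_id, order_date, revenue FROM BigQuery
  WHERE order_date >= '2024-01-01'
  AND order_date < '2024-02-01'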
Issue: Getting duplicate records
Check:
Does your connection have key and version fields set?
Are you working with a versioned data source (like Delta tables)?
Issue: "Unsupported query" error
Remember:
Keep queries simple - just SELECT, FROM, WHERE, basic ORDER BY
Do complex joins and calculations in separate transform stages
By understanding these query basics, you can work efficiently with large datasets while letting the system handle complex tasks like version management automatically.
A datastore connection is set up using a configuration object (S3 connection example):
connection:
  kind: Parquet
  locator: data.parquet
  config:
    kind: s3
    bucket: s3://my-bucket
    region: us-central-1
    access_key_id: ...        # Can be referenced with template: <<secrets.get("my-aws-key-id")>>
    secret_access_key: ...
Internal connections do not require configuration, but as soon as you introduce your own bucket or an absolute locator, you have to provide configuration.
The following connection uses internal storage which counts towards your quota:
connection:
  kind: Parquet
  locator: data.parquet