Building Efficient Data Pipelines with AWS Glue, Redshift, dbt, and Dagster
The world of modern data engineering with a growth mindset
Combining AWS Glue for ETL, Microsoft SQL Server for data storage, Amazon Redshift for data warehousing, dbt for transformation, and Dagster for orchestration creates a powerful stack. But like any stack, I like to set the tone by giving each tool the specific job it does best, and that is what I want to walk through here.
I use AWS Glue for extract and load, but not for transformation. Why? Because its limited flexibility in transformation means I can't deliver what the client needs. AWS Glue is great for basic ETL, but complex transformations are better handled in dbt. My favourite part about choosing dbt is that it allows for version-controlled, modular transformations, making the data pipeline more manageable and collaborative. If you're lucky enough to be on dbt Cloud, you don't need to manage the infrastructure, which is the same flex you get with AWS Glue. This reduces operational overhead, allowing my data team to focus on what really matters: the data. You'll see a sketch of how dbt plugs into Dagster right below.
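Since Dagster is the orchestrator in this stack, here is a minimal sketch of how those dbt models can surface as Dagster assets, assuming the dagster-dbt integration and a local dbt project folder called analytics_dbt with a compiled manifest. The project path and asset name are placeholders, not anything from a real project.

from pathlib import Path

from dagster import AssetExecutionContext, Definitions
from dagster_dbt import DbtCliResource, dbt_assets

# Placeholder path to a local dbt project; assumes `dbt parse` or `dbt compile`
# has already produced target/manifest.json.
DBT_PROJECT_DIR = Path("analytics_dbt")

@dbt_assets(manifest=DBT_PROJECT_DIR / "target" / "manifest.json")
def analytics_dbt_models(context: AssetExecutionContext, dbt: DbtCliResource):
    # Run `dbt build` and stream the results back to Dagster,
    # so each dbt model materializes as its own asset.
    yield from dbt.cli(["build"], context=context).stream()

defs = Definitions(
    assets=[analytics_dbt_models],
    resources={"dbt": DbtCliResource(project_dir=str(DBT_PROJECT_DIR))},
)

With this in place, Dagster schedules and monitors the dbt runs while the transformations themselves stay version-controlled in the dbt project.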
Once the Glue job is created and configured, let's say for moving data from Microsoft SQL Server into Amazon Redshift, the script starts like this:
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
# Initialize the Spark and Glue contexts
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
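From there, a minimal sketch of the rest of the job might look like the following. It assumes the SQL Server table has been crawled into the Glue Data Catalog as sqlserver_db.dbo_orders, that a Glue connection named redshift-connection points at the Redshift cluster, and that the job has a temporary S3 directory configured; all of those names and the target table are placeholders for your own setup.

# Resolve the standard job arguments (Glue passes --JOB_NAME and, when a
# temporary directory is configured, --TempDir).
args = getResolvedOptions(sys.argv, ["JOB_NAME", "TempDir"])
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Read the source table from the Glue Data Catalog entry that a crawler
# created for the SQL Server database (placeholder names).
source_dyf = glueContext.create_dynamic_frame.from_catalog(
    database="sqlserver_db",
    table_name="dbo_orders",
    transformation_ctx="source_dyf",
)

# Load the records into Redshift through the catalog connection, staging
# the data in the job's temporary S3 directory along the way.
glueContext.write_dynamic_frame.from_jdbc_conf(
    frame=source_dyf,
    catalog_connection="redshift-connection",
    connection_options={"dbtable": "public.orders", "database": "dev"},
    redshift_tmp_dir=args["TempDir"],
    transformation_ctx="write_redshift",
)

job.commit()

Notice that Glue only lands the raw table here; any real modelling happens downstream in dbt against Redshift, which is exactly the split described above.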