Data Engineer – AWS-Based Data Lake and Analytics Platform
Summary:
We are seeking an experienced Data Engineer to support the re-implementation of a
large-scale AWS-based data lake and analytics platform. This is a lift-and-shift effort to
recreate ingestion pipelines, EMR/Glue workflows, Redshift loading logic, Databricks integration,
and SageMaker ML orchestration in a new AWS account using Terraform. You’ll play a key role
in rebuilding core data workflows, validating source integrations, and ensuring data accuracy
across the stack.
Responsibilities:
● Rebuild and validate data ingestion pipelines using AWS services (Lambda, Kinesis
Firehose, MSK, S3).
● Migrate and reconfigure processing jobs in Glue, EMR, and Amazon Managed
Workflows for Apache Airflow (MWAA).
● Recreate and validate table definitions in the Glue Data Catalog for downstream Athena
queries.
● Support data ingestion from third-party APIs (e.g., Revature API, eCommerce Affiliates)
using Lambda or Airflow DAGs.
● Collaborate with ML engineers to ensure SageMaker/Personalize workflows are rebuilt
and operational.
● Work with the DevOps team to align Terraform-managed resources with data pipeline
needs.
● Conduct data validation across the migration: object counts, schema consistency, and
source-to-target QA (a minimal sketch follows this list).
● Document data flow logic and maintain lineage across ingestion, transformation, and
analytics layers.
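For illustration only, and not part of the role's actual tooling: a minimal Python/boto3 sketch of the kind of source-to-target check described above, comparing object counts between two S3 prefixes and column definitions between two Glue Data Catalog tables. The bucket, database, and table names are placeholders.

```python
"""Illustrative migration-validation sketch; all resource names are placeholders."""
import boto3

s3 = boto3.client("s3")
glue = boto3.client("glue")

def count_objects(bucket: str, prefix: str) -> int:
    """Count objects under a prefix, paginating past the 1,000-key page limit."""
    total = 0
    for page in s3.get_paginator("list_objects_v2").paginate(Bucket=bucket, Prefix=prefix):
        total += page.get("KeyCount", 0)
    return total

def table_schema(database: str, table: str):
    """Return (column, type) pairs from a Glue Data Catalog table definition."""
    cols = glue.get_table(DatabaseName=database, Name=table)["Table"]["StorageDescriptor"]["Columns"]
    return [(c["Name"], c["Type"]) for c in cols]

# Example comparison; swap in the real source/target resources.
if count_objects("legacy-lake-bucket", "events/") != count_objects("new-lake-bucket", "events/"):
    print("Object counts differ between source and target")
if table_schema("legacy_db", "events") != table_schema("new_db", "events"):
    print("Schema drift detected in the events table")
```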
Required Skills:
● 5+ years of experience in data engineering or analytics engineering roles.
● Strong AWS experience: Lambda, Kinesis (Data Streams & Firehose), MSK, S3, Glue,
Athena, EMR, Redshift.
● Experience with Airflow, either self-managed or MWAA.
● Proficiency in Python, especially for Lambda functions and ETL logic.
● Experience building or rebuilding Glue Data Catalog schemas and maintaining
partitioning strategies.
● Understanding of JSON, Parquet, or Avro file formats and versioning strategies in S3
data lakes.
● Experience integrating and authenticating to external APIs and managing secrets
securely (an illustrative sketch follows this list).
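As a hedged illustration of the Python/Lambda and secrets-handling skills above (not this team's actual integration): a minimal handler that reads an API key from AWS Secrets Manager, calls a hypothetical external endpoint, and lands the raw JSON in a date-partitioned S3 prefix. The secret name, URL, and bucket are placeholders.

```python
"""Illustrative Lambda-style ingestion handler; names and endpoints are placeholders."""
import json
from datetime import datetime, timezone

import boto3
import urllib3

secrets = boto3.client("secretsmanager")
s3 = boto3.client("s3")
http = urllib3.PoolManager()

def handler(event, context):
    # Resolve the API key at runtime instead of baking it into the function.
    api_key = json.loads(
        secrets.get_secret_value(SecretId="example/affiliates-api")["SecretString"]
    )["api_key"]

    resp = http.request(
        "GET",
        "https://api.example.com/v1/orders",  # hypothetical third-party endpoint
        headers={"Authorization": f"Bearer {api_key}"},
    )
    if resp.status != 200:
        raise RuntimeError(f"API returned {resp.status}")

    # Partition the landing zone by ingestion date for downstream Athena queries.
    key = f"raw/affiliates/dt={datetime.now(timezone.utc):%Y-%m-%d}/orders.json"
    s3.put_object(Bucket="example-data-lake-raw", Key=key, Body=resp.data)
    return {"written": key}
```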
Nice to Have:
● Familiarity with Sailthru, Zephr, Databricks, or other Martech tools.
● Experience with SageMaker Pipelines, endpoint deployment, or Feature Store
management.
● Familiarity with cross-account data ingestion strategies in AWS.
● Hands-on experience working with Terraform to define data infrastructure.
● Knowledge of Redshift Spectrum or federated queries from Redshift to S3 (sketched below).
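For illustration, a minimal sketch of querying S3-resident lake data through Redshift Spectrum via the Redshift Data API. The cluster, database, user, and external schema names are placeholders, and the sketch assumes an external schema has already been mapped to the Glue Data Catalog database backing the lake.

```python
"""Illustrative Redshift Spectrum query via the Redshift Data API; names are placeholders."""
import boto3

rsd = boto3.client("redshift-data")

# "spectrum_lake" stands in for an external schema pointing at the Glue Data Catalog.
resp = rsd.execute_statement(
    ClusterIdentifier="example-cluster",
    Database="analytics",
    DbUser="example_user",
    Sql=(
        "SELECT event_type, COUNT(*) "
        "FROM spectrum_lake.events "
        "WHERE dt = '2024-01-01' "
        "GROUP BY event_type;"
    ),
)
print("Statement submitted:", resp["Id"])
```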