Site Reliability Engineer (AWS) – Mexico

LATAM

About Distillery

Distillery accelerates innovation through an unyielding approach to nearshore software development. The world’s most innovative technology teams choose Distillery to help accelerate strategic innovation, fill a pressing technology gap, and hit mission-critical deadlines. We support essential applications, mobile apps, websites, and eCommerce platforms by placing senior, strategic technical leaders and deploying fully managed technology teams that work closely alongside our clients’ in-house development teams.

At Distillery, we’re not here to reinvent nearshore software development; we’re on a mission to perfect it.

Distillery is committed to diversity and inclusion. We actively seek to cultivate a workforce that reflects the rich tapestry of perspectives, backgrounds, and experiences present in our society. Our recruitment efforts are dedicated to promoting equal opportunities for all candidates, regardless of race, ethnicity, gender, sexual orientation, disability, age, or any other dimension of diversity.

About the Position

We’re looking for a Senior Site Reliability Engineer (AWS) to join a high-performing, global engineering team. This role will focus on designing, implementing, and maintaining reliable, scalable, and secure cloud infrastructure in AWS.

You will work closely with development and operations teams to ensure system stability, optimize performance, and enhance automation. The ideal candidate is passionate about reliability, has a strong DevOps mindset, and brings a problem-solving attitude to complex distributed environments.

This position requires working Pacific Time hours and strong communication skills to collaborate effectively across teams. Experience leveraging AI tools to improve development efficiency is also required.

Responsibilities

Design, implement, and maintain high-availability and scalable AWS infrastructure.
Automate infrastructure provisioning and management using tools like Terraform, CloudFormation, or Pulumi.
Improve system reliability through monitoring, alerting, and observability using Prometheus, Grafana, ELK, OpenTelemetry, and related tools.
Manage and optimize containerized environments (Docker, Kubernetes).
Troubleshoot complex system and networking issues (TCP, HTTP, DNS, load balancing, proxies).
Develop scripts and automation in Python, Go, or Bash to streamline operations.
Collaborate cross-functionally with engineering, product, and security teams to ensure reliability best practices are embedded throughout the development lifecycle.
Conduct incident analysis and root cause investigations, and implement long-term fixes.
Contribute to continuous improvement initiatives for deployment pipelines and operational excellence.
Utilize AI tools to enhance automation, debugging, and infrastructure management.

Requirements

Must be able to work Pacific Time hours.
5+ years of experience in Site Reliability Engineering, Operations, or Infrastructure roles.
Strong hands-on experience with AWS cloud services.
Solid knowledge of networking protocols and concepts (TCP, HTTP, DNS, load balancing, proxies).
Experience with Docker, Kubernetes, or other container orchestration systems.
Proven ability to automate infrastructure provisioning (Terraform, CloudFormation, or Pulumi).
Experience building high-availability distributed systems.
Proficiency with monitoring, logging, and tracing tools (Prometheus, Grafana, ELK, OpenTelemetry, etc.).
Strong debugging and incident management skills.
Proficiency in Python, Go, or Bash scripting.
Excellent communication skills, with the ability to collaborate effectively in fast-paced, cross-functional environments.
Experience using AI tools to support development, automation, or operational workflows.

Nice To Have

Experience working in healthcare, telemedicine, or regulated environments (HIPAA, data privacy).
Familiarity with PHP / Laravel or similar web stacks.
Experience integrating e-commerce platforms (BigCommerce, Shopify, etc.) into backend systems.
Working knowledge of CDN / edge services (Cloudflare, Fastly).
Experience with chaos engineering, fault injection, or resilience testing.
Prior experience with consumer-scale applications.
Cloud certifications (AWS, GCP, or Azure).

Why You'll Like Working Here

Join a global team committed to Distillery's core values: Unyielding Commitment, Relentless Pursuit, Courageous Ambition, and Authentic Connection.

100% Remote Work: Enjoy the freedom to work from anywhere while collaborating with a diverse, multinational team.
Competitive Compensation: Generous and competitive package in USD, along with a comprehensive benefits plan.
Flexible Hours: Create a schedule that aligns with your life and priorities.
Home Office Setup: Receive all the hardware and software needed to succeed from home.
Innovative Workplace: Collaborate with the global Top 1% of talent in a multicultural and dynamic environment.
Focus on Growth: Pursue professional and personal development while contributing your unique talents to a team where you can truly shine!
