Posted 1d ago

Principal Engineer

@ A2Z Sync
Denver or United States
Remote, Full Time
Responsibilities: Own AI surface, Design tool interfaces, Mentor engineers
Requirements Summary: 8+ years of software engineering; 3+ years in Staff/Principal/Architect; cloud-native, AWS serverless; mentoring; multi-language proficiency; observability and design docs.
Technical Tools Mentioned: AWS, Lambda, EventBridge, DynamoDB, Step Functions, ECS, RDS, API Gateway, S3, CloudWatch, Aurora, Auth0, LangChain, Playwright, PHP, Python, Java, TypeScript
Job Description

Why This Role Exists

We operate a multi-tenant automotive SaaS platform serving thousands of dealer groups across the United States. Our backend — event-driven serverless on AWS (Lambda, EventBridge, DynamoDB, S3, Step Functions) — orchestrates everything from dealer onboarding to inventory management to real-time transaction processing. That platform works. Now we need to make it think.

We are building agentic AI systems: autonomous, tool-using agents that observe platform state, reason over dealer context, take action through production APIs, and learn from outcomes. These are not chatbots bolted onto a dashboard. They are first-class platform services — backed by AWS Bedrock, connected to production systems via MCP servers — that make decisions, execute workflows, and close loops without human intervention unless guardrails say otherwise.

This Principal Engineer owns that entire surface. You are not advising on AI strategy from a whiteboard. You are writing agent code, defining tool interfaces, building evaluation harnesses, setting cost and latency budgets, and shipping production AI workflows that touch real dealers and real money. You set the engineering patterns the team follows, you help make the build-vs-buy calls, and when an agent misbehaves at 2 AM, your architecture is what determines whether it fails safe or fails loud.

Scope & Scale

  • 5000+ destination dealer tenants, each with isolated databases and per-tenant configuration.
  • Billions in annual Gross Merchandise Value (GMV) flowing through platform transactions.
  • Tens of thousands of API requests per minute across REST, SOAP, and event-driven integration surfaces.
  • Data pipelines spanning 6 integration domains with multi-protocol vendor connectivity.

What You Will Own

  • Ownership and core development of agentic AI systems — designing, building, and operating the AI agent infrastructure (AWS Bedrock, MCP servers) that powers intelligent automation across the platform. You are not advising on AI strategy; you are writing the agent code, defining the tool interfaces, building the evaluation harnesses, and shipping production AI workflows.
  • AI agent lifecycle end to end — from prompt engineering and tool-use design through guardrails, evaluation, cost optimization, and production observability. You own the patterns the team uses to build with AI: how agents connect to production systems, how we evaluate output quality, how we manage model costs at scale, and how we roll back when an agent misbehaves.
  • System design and technical decision-making for migration waves — from identity/tenant services through core domain extraction and frontend decomposition.
  • The dual-write framework, API Gateway traffic-splitting, and per-tenant feature flag rollout that make every migration step reversible.
  • Cross-cutting concerns: observability (OpenTelemetry, CloudWatch), security posture (Auth0 consolidation, IAM), and data architecture (DynamoDB single-table design, Aurora consolidation).
  • Mentoring and force-multiplying senior ICs — establishing patterns, reviewing designs, and raising the technical bar across 5 engineering teams.
  • Consolidation and strategy for the 30+ existing integrations, with the goal of making future integrations easier to add.

Technical Environment

  • Cloud Services: High-availability AWS stack including Lambda, EventBridge, DynamoDB, S3, ECS Fargate, Aurora, API Gateway, CloudWatch, and Secrets Manager.
  • Development Languages: Modern Python and Java (Spring Boot) alongside TypeScript/React (Next.js 16) frontends, with legacy domain coverage in PHP/Laravel.
  • AI & Agentic Systems: Advanced agentic workflow orchestration utilizing lean AWS Bedrock AgentCore, MCP servers, or LangChain/LangGraph frameworks.
  • Data Engineering: Complex data architectures featuring DynamoDB single-table design, MySQL/Aurora, S3 data lakes, Glue Data Catalog, Athena, and data pipelines.
  • Infrastructure & Security: Enterprise-grade CI/CD and observability via CloudFormation, Auth0 consolidation, OpenTelemetry, and CircleCI.
  • Integration Surfaces: Multi-protocol connectivity spanning REST, SOAP/XML, EventBridge event-bus patterns, SES processing, and Playwright browser automation.

First 12 Months

  • Months 1–3: Immerse in the codebase. Audit the current architecture across all stacks. Publish the first Architecture Decision Record (ADR) for the next migration wave. Establish your design review cadence with the team.
  • Months 4–6: Drive the AI/agentic integration layer, with Bedrock-powered automation in at least one production workflow. Establish the patterns for how the team builds with AI going forward, covering both agentic insight retrieval and agentic workflow automation.
  • Months 7–9: Own and deliver the first migration wave end-to-end — from design doc through production cutover with dual-write validation. Stand up the observability baseline (OpenTelemetry instrumentation, dashboards, SLOs).
  • Months 10–12: Second migration wave in production. Architecture runway documented for the next 12 months. The team operates at a higher technical bar because of patterns you set.