Posted 1d ago

Senior Principal AI Architect/Engineer

@ PepsiCo
Plano, Texas, United States
$124k-$207k/yr | Onsite | Full Time
Responsibilities: Design observability, lead red team exercises, develop CQE framework
Requirements Summary: Lead design and operation of an enterprise AI observability platform; strong experience in AI safety, Responsible AI, ML infrastructure, OTEL, and cloud systems; 12+ years in tech; 5+ years in senior/architect roles.
Technical Tools Mentioned: OpenTelemetry, Kafka, Kubernetes, Terraform, Bicep, CI/CD, Python, Datadog, Grafana, New Relic, Dynatrace
Job Description
Overview:

The AI Observability Architect is a senior technical leader responsible for designing, deploying, and operating an enterprise-grade, production-ready AI observability platform that spans the full spectrum of modern agentic AI — from large language model (LLM) workflows and multi-agent orchestration to physical AI systems, reinforcement learning harnesses, multi-modal pipelines, and agentic marketplaces. This role serves as the strategic and engineering authority for end-to-end telemetry, tracing, safety, and quality signals across heterogeneous agent frameworks and platforms.

 

The architect leads the convergence of AI observability with safety & security (including red teaming), Responsible AI (RAI), data science, physical AI, memory/skills engineering, agent fleet management, self-evolving harnesses, reinforcement learning, agent-to-agent protocols (A2A, UCP, AP2), and continuous quality engineering — making this a uniquely broad and high-impact role within the AI Solutions & Platforms organization.

 

The role also owns OpenTelemetry (OTEL) integration across third-party agentic platforms (Salesforce AgentForce, ServiceNow, Microsoft Agent 365, and others), enabling unified observability and governance at enterprise scale.



Responsibilities:

Agentic AI Observability Architecture at Scale  (30%)

  • Define and own the enterprise observability architecture for AI agents, LLMs, multi-agent workflows, and physical AI systems — covering planner/executor loops, tool/function calls, RAG retrieval chains, and memory/state transitions.
  • Build and operate unified telemetry pipelines incorporating metrics, logs, distributed traces, semantic/vector signals, and real-time event streaming (Kafka) at enterprise scale.
  • Instrument OpenTelemetry (OTEL) across heterogeneous platforms including Salesforce AgentForce, ServiceNow, Microsoft Agent 365, and internal frameworks — delivering protocol-level observability for agent ecosystems spanning MCP, A2A, UCP, and AP2 (a minimal instrumentation sketch follows this list).
  • Design and implement observability for Agent Fleets, multi-modal pipelines, physical AI systems, and self-evolving reinforcement learning harnesses — including signal capture for reward shaping and policy evaluation.
  • Deliver dashboards, alerting, SLO/SLA management, incident runbook automation, and RCA tooling that drive measurable reliability improvements and reduce MTTR across agentic services.
  • Establish cost telemetry and FinOps observability for AI workloads — token consumption, inference cost allocation, and GPU/compute efficiency across cloud environments (Azure, AWS, GCP).
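
As an illustration of the OTEL instrumentation work described above, the sketch below traces a hypothetical planner loop and tool call with the OpenTelemetry Python SDK. The span names, attribute keys, model name, and call_tool helper are assumptions made for this example rather than established conventions, and a console exporter stands in for the OTLP exporter a production pipeline would use.

```python
# Minimal sketch: tracing an agent planner loop and tool call with the OpenTelemetry SDK.
# Span names and attribute keys are illustrative assumptions, not fixed conventions.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

# Console exporter for the sketch; production would export via OTLP into the telemetry pipeline.
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("agent.observability.sketch")

def call_tool(tool_name: str, arguments: dict) -> str:
    """Hypothetical tool invocation wrapped in a span that carries agent-specific attributes."""
    with tracer.start_as_current_span("agent.tool_call") as span:
        span.set_attribute("agent.tool.name", tool_name)
        span.set_attribute("agent.tool.arguments", str(arguments))
        result = f"result-of-{tool_name}"  # placeholder for the real tool call
        span.set_attribute("agent.tool.result_length", len(result))
        return result

# One iteration of a planner/executor loop, with the tool call nested under the loop span.
with tracer.start_as_current_span("agent.planner_loop") as loop_span:
    loop_span.set_attribute("agent.llm.model", "example-model")
    call_tool("search_catalog", {"query": "telemetry"})
```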

Safety, Security & Red Teaming  (15%)

  • Lead observability-driven red team exercises targeting agentic AI systems — instrumenting attack surfaces, adversarial prompt injection vectors, model evasion attempts, and multi-agent trust boundary failures.
  • Design telemetry pipelines that capture safety-critical signals: guardrail trigger rates, policy violation events, PII exposure risks, prompt leakage, and agent hallucination rates (an illustrative event sketch follows this list).
  • Partner with Security and RAI teams to embed threat modeling, zero-trust agent authentication, and behavioral anomaly detection into the observability platform.
  • Instrument secure policy enforcement layers across agent-to-agent communication protocols (A2A, UCP, AP2) and maintain audit-ready traceability for all AI decision events.
  • Develop and maintain a Security Observability Playbook covering incident classification, escalation paths, and forensic trace retention policies for agentic AI systems.
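
To make the safety-critical signals above concrete, the sketch below emits a structured guardrail event suitable for aggregation and audit. The field names, agent identifier, and the choice to log only a prompt hash (never the raw prompt) are illustrative assumptions rather than a prescribed schema.

```python
# Minimal sketch: emitting an audit-ready, structured guardrail event as a JSON log line.
# Field names and values are illustrative assumptions, not a defined safety schema.
import json
import logging
import time

logging.basicConfig(level=logging.INFO, format="%(message)s")
safety_log = logging.getLogger("agent.safety")

def record_guardrail_event(agent_id: str, guardrail: str, violation: str, prompt_hash: str) -> None:
    """Emit one safety event for downstream aggregation (guardrail trigger rates, audits)."""
    safety_log.info(json.dumps({
        "timestamp": time.time(),
        "agent.id": agent_id,
        "guardrail.name": guardrail,
        "violation.type": violation,
        "prompt.hash": prompt_hash,  # hash only, so no raw prompt or PII leaves the agent
    }))

record_guardrail_event("agent-42", "pii_filter", "pii_exposure_blocked", "sha256:abc123")
```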

Responsible AI (RAI) & Governance  (10%)

  • Integrate RAI signal capture — fairness, bias detection, explainability, and safety metrics — directly into observability pipelines, making compliance measurable and audit-ready.
  • Deliver governance dashboards that surface RAI compliance posture across all active AI agents and LLM deployments, aligned with global regulatory standards.
  • Support risk assessments, gap analyses, and governance frameworks with real-time observability insights — enabling proactive risk mitigation rather than reactive audit responses.
  • Collaborate with RAI CoE and Legal/Compliance teams to define data retention, consent logging, and model decision traceability standards embedded in the telemetry architecture.

Quality Engineering for Agentic Solutions — Post Go-Live & Continuous QE  (10%)

  • Own the Continuous Quality Engineering (CQE) framework for post-production agentic solutions — defining and tracking quality metrics across accuracy, latency, agent success rate, tool-call fidelity, and user outcome measures.
  • Build automated quality gates within CI/CD pipelines that leverage observability data to detect regressions, drift, and degradation in agent performance — preventing silent failures in production (a gate sketch follows this list).
  • Instrument and monitor Skill Evaluations (evals) across the Memory, Skills, and MCP harness stack — providing traceability from eval results to production behavior.
  • Partner with product and business stakeholders to define SLA-backed quality benchmarks and deliver automated alerting when quality thresholds are breached.
  • Drive root-cause analysis for quality failures using distributed trace data, enabling rapid iteration and continuous improvement cycles for agentic solutions.
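
The quality-gate responsibility above might be realized along these lines: a CI step compares a recent agent success-rate window against a longer baseline pulled from the observability backend and fails the pipeline on regression. The fetch_metric helper, metric name, windows, and threshold are hypothetical placeholders, not an existing API.

```python
# Minimal sketch of a CI/CD quality gate driven by observability data.
# fetch_metric, the metric name, windows, and the regression threshold are all assumptions.
import sys

def fetch_metric(name: str, window: str) -> float:
    """Placeholder for a query against the observability backend (e.g. a Datadog or Grafana API)."""
    return {"7d": 0.94, "1h": 0.87}[window]  # hardcoded demo values for the sketch

def quality_gate(baseline_window: str = "7d", candidate_window: str = "1h",
                 max_regression: float = 0.05) -> bool:
    """Pass only if the recent agent success rate has not regressed beyond the allowed margin."""
    baseline = fetch_metric("agent.success_rate", baseline_window)
    candidate = fetch_metric("agent.success_rate", candidate_window)
    return (baseline - candidate) <= max_regression

if __name__ == "__main__":
    if not quality_gate():
        print("Quality gate failed: agent success-rate regression exceeds threshold")
        sys.exit(1)  # non-zero exit blocks the pipeline stage
```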

Memory, Skills, MCP & Harness Engineering Observability  (10%)

  • Design and implement observability for the agent memory layer — episodic, semantic, and working memory read/write operations — providing latency, accuracy, and drift monitoring across memory backends.
  • Instrument MCP (Model Context Protocol) server interactions, tool registrations, skill invocations, and context injection pipelines with full trace propagation and semantic tagging.
  • Own observability for self-evolving harness and reinforcement learning (RL) systems — capturing reward signals, policy update events, environment state transitions, and learning convergence metrics.
  • Monitor harness execution fidelity, skill eval pass/fail rates, and regression signals across training, fine-tuning, and inference workflows — feeding data back into the quality engineering loop.

Data Science Observability & Hardcore Python Engineering  (5%)

  • Lead a team of senior Python engineers building high-performance, production-grade observability tooling — including custom OTEL exporters, semantic trace enrichers, signal aggregators, and anomaly detection pipelines.
  • Apply data science methods — statistical process control, time-series anomaly detection, clustering, and causal inference — to transform raw telemetry into actionable AI operational intelligence (an anomaly-detection sketch follows this list).
  • Build and maintain Python-native SDKs and libraries that simplify observability onboarding for agent developers across the organization.
  • Establish code quality standards, testing frameworks, and peer review practices for the observability engineering team — embedding software craftsmanship into the team culture.
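
As a toy illustration of the statistical methods mentioned above, the sketch below flags telemetry samples whose rolling z-score exceeds a threshold. The window size, threshold, and synthetic latency series are illustrative assumptions; production pipelines would consume streaming telemetry rather than an in-memory list.

```python
# Minimal sketch: rolling z-score anomaly detection over a latency series.
# Window size, threshold, and the synthetic data are assumptions for illustration only.
from collections import deque
from statistics import mean, stdev

def detect_anomalies(samples, window: int = 30, threshold: float = 3.0):
    """Yield (index, value) pairs whose z-score against the trailing window exceeds the threshold."""
    history = deque(maxlen=window)
    for i, value in enumerate(samples):
        if len(history) == window:
            mu, sigma = mean(history), stdev(history)
            if sigma > 0 and abs(value - mu) / sigma > threshold:
                yield i, value
        history.append(value)

# Synthetic per-request latencies (ms) with one injected spike at index 50.
latencies = [120.0 + (i % 5) for i in range(50)] + [900.0] + [125.0 + (i % 3) for i in range(10)]
print(list(detect_anomalies(latencies)))  # expected to flag the spike at index 50
```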

Agentic Marketplace, Registry & Ecosystem Observability  (5%)

  • Instrument the Agentic Marketplace and Agent Registry platforms — providing usage telemetry, adoption metrics, capability health scores, and dependency mapping for registered agents and skills.
  • Design observability APIs and SDK hooks that allow marketplace-registered agents to self-report health, performance, and behavioral signals into the central observability platform.
  • Monitor inter-agent communication patterns across the marketplace ecosystem — identifying latency hotspots, circular dependencies, and protocol mismatches in agent-to-agent (A2A) workflows.
  • Deliver a Marketplace Observability Dashboard surfacing agent catalog health, adoption trends, quality scores, and incident history — supporting marketplace governance and curation decisions.

Integration, Deployment & CI/CD Automation  (5%)

  • Build and maintain CI/CD pipelines for observability services and agent operations center components, incorporating automated testing, deployment gates, and rollback mechanisms.
  • Automate onboarding for new agent use cases using templates, scaffolding, and configuration validation — reducing time-to-observability from weeks to hours.
  • Drive infrastructure-as-code (IaC) practices for observability platform components across Azure, AWS, and GCP — ensuring reproducible, version-controlled, and auditable deployments.

Product Delivery & Stakeholder Collaboration (10%)

  • Operate with a product mindset — defining observability platform roadmaps, OKRs, adoption playbooks, and release milestones in partnership with AI platform and business teams.
  • Collaborate with transformation teams, enterprise architects, security, and business stakeholders to tailor observability solutions to domain-specific requirements.
  • Serve as the technical authority in executive and governance forums — translating complex observability data into business-relevant insights on risk, cost, and AI performance.
  • Partner with SRE, AI platform, and product teams to drive standard adoption and reduce integration friction across the agentic AI ecosystem.

People Leadership & Team Development (5%)

  • Build, mentor, and lead a high-performing observability engineering team — spanning Python developers, data scientists, and platform engineers — with talent initially based in India.
  • Define career paths, skills development plans, and leveling criteria aligned with PepsiCo job architecture — fostering an inclusive, high-accountability team culture.
  • Drive hiring, coaching, performance management, and succession planning across the observability function.

Decision-Making Autonomy

  • High — Owns architecture decisions, platform roadmap, and engineering standards; seeks strategic alignment from the AI Solutions Director on enterprise-level commitments.

Supervision Required

  • Low to Moderate — Operates independently with periodic alignment reviews. Proactively escalates cross-organizational dependencies and risk trade-offs.

Role Complexity

  • Very High — Spans observability, safety/security, RL harnesses, physical AI, multi-modal systems, agent protocols, quality engineering, and marketplace governance simultaneously.

Compensation and Benefits:

  • The expected compensation range for this position is between $123,500 and $206,750.
  • Location, confirmed job-related skills, experience, and education will be considered in setting actual starting salary. Your recruiter can share more about the specific salary range during the hiring process.
  • Bonus based on performance and eligibility; target payout is 15% of annual salary, paid out annually.
  • Paid time off subject to eligibility, including paid parental leave, vacation, sick, and bereavement.
  • In addition to salary, PepsiCo offers a comprehensive benefits package to support our employees and their families, subject to elections and eligibility: Medical, Dental, Vision, Disability, Health, and Dependent Care Reimbursement Accounts, Employee Assistance Program (EAP), Insurance (Accident, Group Legal, Life), Defined Contribution Retirement Plan.


Qualifications:

Minimum Education & Experience:

  • Bachelor's or Master's degree in Computer Science, AI/ML, Data Science, Software Engineering, or a related field (PhD a plus for research-heavy domains).
  • 12+ years in technology with deep experience in enterprise observability, distributed systems, platform engineering, or AI/ML infrastructure.
  • 5+ years in a senior/principal or architect-level role with demonstrated ownership of complex, cross-functional technical programs.

Core Technical Qualifications

  • AI Observability & Distributed Systems: Expert-level knowledge of observability primitives (metrics, logs, traces, events) applied to LLM/ML/agentic systems; hands-on OpenTelemetry (OTEL) instrumentation including custom exporters, semantic conventions, and trace propagation across agent/tool boundaries.
  • Agentic AI Frameworks: Direct experience with agentic AI platforms, multi-agent orchestration, LLM-based workflow design, and agent lifecycle management at production scale.
  • Safety, Security & Red Teaming: Demonstrated experience conducting red team exercises against AI systems; knowledge of adversarial attack patterns, prompt injection, model evasion, and multi-agent trust boundary failures; ability to design safety telemetry pipelines.
  • Memory, Skills & MCP: Working knowledge of agent memory architectures (episodic, semantic, working memory), Model Context Protocol (MCP), skill registries, and context injection patterns — with ability to design observability for these layers.
  • Agent-to-Agent Protocols: Familiarity with A2A (Agent-to-Agent), UCP (Universal Communication Protocol), and AP2 patterns; ability to implement protocol-level observability and policy enforcement.
  • Reinforcement Learning & Self-Evolving Harnesses: Understanding of RL training loops, reward signal capture, policy evaluation, and harness instrumentation for continuously improving agent systems.
  • Physical AI & Multi-Modal Systems: Experience or strong familiarity with observability for physical AI pipelines (robotics, edge inference, sensor fusion) and multi-modal models (vision, audio, text).
  • Data Science & Python Engineering: Proficiency in Python at a senior engineering level; experience with statistical anomaly detection, time-series analysis, and data pipeline design applied to observability data at scale.
  • Platform Integrations (OTEL / Enterprise): Hands-on experience integrating OTEL with enterprise agentic platforms including Salesforce AgentForce, ServiceNow, Microsoft Agent 365, or similar; strong understanding of enterprise integration patterns and API design.
  • Cloud & Infrastructure: Cloud fluency across Azure, AWS, and GCP; proficiency in Kubernetes, service mesh, IaC (Terraform/Bicep), and CI/CD tooling; experience with event streaming platforms (Kafka, Event Hubs).
  • Quality Engineering for AI: Experience designing continuous quality frameworks (CQE) for agentic solutions including eval harnesses, regression detection, quality gates, and SLA-backed quality benchmarking.
  • Responsible AI (RAI): Familiarity with RAI principles — fairness, bias detection, explainability, and safety — and ability to operationalize RAI signal capture within production observability pipelines.
  • Agentic Marketplace & Registry: Experience or strong familiarity with agent marketplace architectures, capability registries, and platform governance — ideally with observability or monitoring responsibilities for marketplace-registered components.

Preferred / Differentiating Technical Skills

  • Published contributions or hands-on experience with emerging agent frameworks (LangGraph, AutoGen, CrewAI, Semantic Kernel, Bedrock Agents, or equivalent).
  • Experience with Grafana, Datadog, New Relic, Dynatrace, or equivalent enterprise observability platforms — ideally extended to support AI/LLM workloads.
  • Familiarity with vector databases (Pinecone, Weaviate, pgvector) and semantic search observability patterns relevant to RAG pipelines.
  • Background in MLOps, LLMOps, or model lifecycle management — including model versioning, drift detection, and deployment governance.
  • Experience designing observability APIs and SDK hooks for developer self-service onboarding.

Differentiating Competencies Required

  • Translates enterprise AI strategy into observability architecture that simultaneously enables governance, safety, quality, and scale — holding the full picture across deeply technical and business dimensions.
  • Safety-First Engineering Mindset - Instinctively designs systems with security, adversarial resilience, and RAI compliance as first-class requirements — not retrofitted features. Leads red team exercises with intellectual rigor and operational discipline.
  • Outcome & Quality Orientation - Drives measurable impact: reduced MTTR, audit readiness, SLA adherence, agent quality scores, and RL harness convergence — translating telemetry data into business-relevant results.
  • Cross-Functional Influencing - Navigates complex organizational dynamics — aligning engineering, governance, security, data science, and business units around shared observability standards and practices.
  • Governance by Design - Integrates RAI, compliance, and security controls into design decisions from inception — producing systems that are audit-ready by default, not by remediation.
  • Technical Leadership Presence - Commands credibility in both executive and deep-technical forums; able to shift fluidly between C-suite communication and whiteboard architecture sessions with engineers.
  • Adaptability & Continuous Learning - Thrives in a rapidly evolving AI landscape; quickly absorbs and operationalizes new frameworks, protocols, and research — from emerging agent communication standards to novel RL paradigms.
  • Python Engineering Excellence - Holds a high bar for Python code quality, software craftsmanship, testing discipline, and developer experience — modeling best practices for the engineering team.



Our Company will consider for employment qualified applicants with criminal histories in a manner consistent with the requirements of the Fair Credit Reporting Act, and all other applicable laws, including but not limited to, San Francisco Police Code Sections 4901-4919, commonly referred to as the San Francisco Fair Chance Ordinance; and Chapter XVII, Article 9 of the Los Angeles Municipal Code, commonly referred to as the Fair Chance Initiative for Hiring Ordinance.
 
All qualified applicants will receive consideration for employment without regard to age, race, color, religion, sex, sexual orientation, gender identity, national origin, protected veteran status, or disability status.
 
PepsiCo is an Equal Opportunity Employer: Female / Minority / Disability / Protected Veteran / Sexual Orientation / Gender Identity / Age
 
If you'd like more information about your EEO rights as an applicant under the law, please download the available EEO is the Law & EEO is the Law Supplement documents. View PepsiCo EEO Policy.
 
Please view our Pay Transparency Statement