Posted 1mo ago

Senior Operations Expert (Full-Time) - Shanghai

@ Flowith
Shanghai, Shanghai, China
Onsite, Full Time
Responsibilities: Design systems, optimize performance, ensure availability
Requirements Summary: 5+ years SRE/DevOps/Operations; Linux and networking mastery; Cloudflare ecosystem expertise; IaC (Terraform) and CI/CD; Prometheus/Grafana monitoring; security operations; AI infra experience a bonus.
Technical Tools Mentioned: Cloudflare, Terraform, Prometheus, Grafana, CI/CD, Shell, Python, Go, Linux
Job Description

Role Overview

You are the "architect" and "guardian" of Flowith’s global production environment. In this role, you are not just a firefighter putting out outages, but the cornerstone supporting exponential business growth. You will master the Cloudflare ecosystem and mainstream global cloud infrastructure to design and implement high-concurrency, low-latency distributed architectures. Through extreme performance optimization and a relentless pursuit of automation, you will ensure millions of global users always experience silky-smooth and stable AI interactions.

Key Responsibilities

  • Global Architecture Implementation: Design and manage cross-platform cloud-native architectures, driving multi-region deployment, elastic scaling, canary releases, and rapid rollbacks to ensure the efficient operation of global distributed applications.
  • Traffic & Performance Optimization: Lead the architectural design of managed caching and asynchronous messaging capabilities to handle hot-key caching, task decoupling, and traffic peak shaving.
  • High Availability & Continuity: Build and continuously optimize the observability system (SLI/SLO and alert governance). Develop and drill backup/recovery, disaster recovery failover, and emergency response mechanisms to safeguard business continuity.
  • Technical Vision & Empowerment: Participate in tech stack selection and architecture reviews for core business features, finding the optimal balance between reliability, security, cost, and maintainability.