SRE

29.0% above market

This vacancy: 422,700 ₽

Market average: 327,564 ₽

Description

The company develops an advanced platform for creating and managing AI agents. This solution supports deployment within customer infrastructure as an enterprise product, as well as a SaaS version. The platform integrates real-time voice and telephony, GPU and LLM inference, and streaming analytics, operating in both cloud and on-premise environments.

Responsibilities

Ensure service reliability by managing SLIs/SLOs, tracking availability, and identifying system bottlenecks;
Set up monitoring, metrics, alerts, and dashboards;
Build and maintain Grafana dashboards for internal teams and customers;
Conduct load testing, analyze results, and provide scaling recommendations;
Investigate incidents, participate in on-call rotations, and lead postmortems;
Collaborate with developers to challenge technical decisions and find solutions;
Develop and support Kubernetes-based infrastructure on GCP and AWS;
Automate routine tasks and assist with CI/CD processes;
Deliver and support the platform for customers, including on-prem deployments;
Mentor colleagues and contribute to raising the engineering standards of the team.

Requirements

5+ years of experience in SRE/DevOps with responsibility for high-load production systems;
Deep practical understanding of Docker and Kubernetes in production;
Hands-on experience with metrics, alerts, Prometheus, Alertmanager, and Grafana;
Experience with SLIs/SLOs, incident investigation, and postmortems;
Experience with load testing and capacity planning;
Proficiency in Python for automation and tooling;
Cloud experience with GCP and/or AWS, strong Linux skills, and solid networking knowledge;
Proficiency in CI/CD and infrastructure as code (GitHub Actions, Terraform, Ansible);
Ownership mindset, analytical skills, and proactive approach to outage prevention;
Strong communication and mentoring skills;
Nice to have: experience with AI agents, real-time telephony (SIP, FreeSWITCH, RTP, WebRTC), GPU/ML serving (Triton, vLLM, RunPod, Nebius, Lambda, run:ai, DCGM), streaming data (Kafka, ClickHouse), IaC/GitOps (ArgoCD), logging (Loki/ELK), gRPC, working in isolated secure environments, and preparing systems for significant load growth.

Benefits

21 vacation days plus public holidays and 5 sick days;
Private English lessons via Preply.

Skills

sre, devops, kubernetes, docker, python, prometheus, grafana, gcp, aws, linux, terraform, ansible, ci/cd, sli/slo, incident management

Salary is an AI estimate
