SRE
Description
The company develops an advanced platform for creating and managing AI agents. The platform can be deployed inside customer infrastructure as an enterprise product and is also offered as SaaS. It integrates real-time voice and telephony, GPU and LLM inference, and streaming analytics, and runs in both cloud and on-premise environments.
Responsibilities
- Ensure service reliability: manage SLIs/SLOs and availability, and eliminate system bottlenecks;
- Set up monitoring, metrics, alerts, and dashboards;
- Build and maintain Grafana dashboards for internal teams and customers;
- Conduct load testing, analyze results, and provide scaling recommendations;
- Investigate incidents, participate in on-call rotations, and lead postmortems;
- Collaborate with developers to challenge technical decisions and find solutions;
- Develop and support Kubernetes-based infrastructure on GCP and AWS;
- Automate routine tasks and assist with CI/CD processes;
- Deliver and support the platform for customers, including on-prem deployments;
- Mentor colleagues and contribute to raising the engineering standards of the team.
Requirements
- 5+ years of experience in SRE/DevOps with responsibility for high-load production systems;
- Deep practical understanding of Docker and Kubernetes in production;
- Hands-on experience with metrics, alerts, Prometheus, Alertmanager, and Grafana;
- Experience with SLIs/SLOs, incident investigation, and postmortems;
- Experience with load testing and capacity planning;
- Proficiency in Python for automation and tooling;
- Experience with GCP and/or AWS, strong Linux skills, and solid networking knowledge;
- Knowledge of DevOps fundamentals including CI/CD, GitHub Actions, Terraform, and Ansible;
- Ownership mindset, analytical thinking, and a proactive approach to outage prevention;
- Strong communication and mentoring skills;
- Nice to have: experience with AI agents, real-time telephony (SIP, FreeSWITCH, RTP, WebRTC), GPU/ML serving (Triton, vLLM, RunPod, Nebius, Lambda, run:ai, DCGM), streaming data and analytics (Kafka, ClickHouse), IaC and GitOps (ArgoCD), logging (Loki/ELK), and gRPC, as well as experience working in isolated secure environments and preparing systems for significant load growth.
Conditions
- 21 vacation days plus public holidays and 5 sick days;
- Private English lessons via Preply.