platform engineer

23.6% above market

this vacancy: 424,383 ₽

market average: 343,365 ₽

description

Nscale is a GPU cloud provider engineered for AI that delivers cost-effective, high-performance infrastructure for AI start-ups and large enterprise customers. The company enables AI-focused organizations to reduce development complexity, bolster technical capabilities, and support strategic business outcomes through its specialized cloud platform.

responsibilities

Design, build, and maintain observability platforms including monitoring, logging, tracing, and alerting at global scale;
Deploy and manage tools such as Prometheus, Grafana, Datadog, ELK/OpenSearch, OpenTelemetry, and Jaeger;
Automate observability infrastructure using Infrastructure-as-Code and CI/CD pipelines;
Partner with engineering and SRE teams to instrument applications and systems for telemetry;
Develop dashboards, alerts, and analytics to provide real-time visibility into infrastructure health;
Ensure observability data is accurate, reliable, and retained per compliance requirements;
Troubleshoot observability platform issues to ensure high availability and performance;
Drive adoption of best practices for monitoring, logging, and tracing across the company;
Contribute to continuous improvement of incident detection, response, and resolution;
Document observability standards, tools, and processes.

requirements

Strong experience in designing and operating observability platforms at scale;
Hands-on expertise with monitoring, logging, and tracing tools like Prometheus, Grafana, Datadog, ELK/OpenSearch, Splunk, OpenTelemetry, and Jaeger;
Experience with cloud-native infrastructure including Kubernetes, containers, and service meshes;
Proficiency in scripting and automation using Python, Go, or Bash;
Knowledge of Infrastructure-as-Code tools like Terraform, Ansible, or Pulumi and CI/CD practices;
Strong understanding of distributed systems reliability and incident management;
Excellent problem-solving skills to diagnose performance issues across systems;
Good collaboration skills to work with engineering, operations, and product teams;
Nice to have: Experience with AI/ML workload observability, familiarity with hyperscale datacenter environments, knowledge of AIOps and advanced telemetry analytics, exposure to sustainability monitoring.

conditions

No conditions specified

skills

observability prometheus grafana datadog opentelemetry jaeger kubernetes python go terraform ansible ci/cd distributed systems

salary figures are an AI estimate
