Platform Engineer
Description
Nscale is a GPU cloud provider engineered for AI that delivers cost-effective, high-performance infrastructure for AI start-ups and large enterprise customers. The company enables AI-focused organizations to reduce development complexity, bolster technical capabilities, and support strategic business outcomes through its specialized cloud platform.
Responsibilities
- Design, build, and maintain observability platforms including monitoring, logging, tracing, and alerting at global scale;
- Deploy and manage tools such as Prometheus, Grafana, Datadog, ELK/OpenSearch, OpenTelemetry, and Jaeger;
- Automate observability infrastructure using Infrastructure-as-Code and CI/CD pipelines;
- Partner with engineering and SRE teams to instrument applications and systems for telemetry;
- Develop dashboards, alerts, and analytics to provide real-time visibility into infrastructure health;
- Ensure observability data is accurate, reliable, and retained per compliance requirements;
- Troubleshoot observability platform issues to ensure high availability and performance;
- Drive adoption of best practices for monitoring, logging, and tracing across the company;
- Contribute to continuous improvement of incident detection, response, and resolution;
- Document observability standards, tools, and processes.
Requirements
- Strong experience in designing and operating observability platforms at scale;
- Hands-on expertise with monitoring, logging, and tracing tools like Prometheus, Grafana, Datadog, ELK/OpenSearch, Splunk, OpenTelemetry, and Jaeger;
- Experience with cloud-native infrastructure including Kubernetes, containers, and service meshes;
- Proficiency in scripting and automation using Python, Go, or Bash;
- Knowledge of Infrastructure-as-Code tools like Terraform, Ansible, or Pulumi and CI/CD practices;
- Strong understanding of distributed systems reliability and incident management;
- Excellent problem-solving skills to diagnose performance issues across systems;
- Good collaboration skills to work with engineering, operations, and product teams;
- Nice to have: experience with AI/ML workload observability, familiarity with hyperscale datacenter environments, knowledge of AIOps and advanced telemetry analytics, and exposure to sustainability monitoring.
Conditions
- No conditions specified
Skills