📍 Location

Kathmandu

💼 Type

Full-time

📅 Posted Date

May 10, 2026

📊 Field

Engineering - Civil

📌 Sourced from Kumarijob. Summary prepared by KaamNepal.

Senior AI Infrastructure & Platform Engineer

🏢 The Pace Infosys | 📍 Kathmandu | ⏰ Full Time | 📅 Posted May 10, 2026

📊 Quick Overview

Category: Other Opportunities
Location: Kathmandu
Job Type: Full Time
Experience: 4+ years
Deadline: 2026-08-10

🔧 Required Skills

Python, Go, Docker, Kubernetes, AWS, Linux, Redis

📋 Job Description

Department

Engineering / Infrastructure / AI Systems

Location

Kathmandu, Nepal (Hybrid/On-site)

Experience Required

4+ Years (DevOps / Platform Engineering / Cloud Infrastructure)

Job Summary

We are looking for a highly skilled Senior AI Infrastructure & Platform Engineer to design, deploy, scale, and maintain production-grade AI systems and cloud infrastructure.

This role is ideal for a DevOps or Platform Engineer who has strong experience managing microservices at scale and is passionate about deploying AI-powered applications in real-world production environments.

The ideal candidate should have hands-on experience with Kubernetes, AWS cloud infrastructure, CI/CD automation, GPU workload management, observability/monitoring, production troubleshooting, and scalable AI model deployment.

You will work closely with AI/ML Engineers to deploy and optimize AI inference systems, LLM services, and distributed microservice architectures.

Key Responsibilities

AI Infrastructure & Deployment

Deploy and maintain AI-powered microservices in production environments

Manage scalable GPU-based inference systems for live AI/LLM applications

Optimize model-serving infrastructure for low latency and high availability

Deploy AI workloads using Docker and Kubernetes

Cloud & Infrastructure Management

Design and manage AWS cloud infrastructure (EKS, ECS, EC2, VPC, IAM, ALB, Auto Scaling, S3)

Manage on-premise/in-house servers and hybrid infrastructure environments

Ensure infrastructure security, scalability, and reliability

Kubernetes & Container Orchestration

Deploy and manage Kubernetes clusters for distributed AI workloads

Configure auto-scaling for GPU and CPU-intensive services

Manage Helm charts, ingress controllers, service networking, and workload scheduling

Optimize container performance and resource utilization

CI/CD & Automation

Build and maintain CI/CD pipelines for microservices and AI applications

Automate deployments using GitHub Actions, GitLab CI/CD, Jenkins, Terraform, or Ansible

Implement Infrastructure as Code (IaC) best practices

Monitoring & Reliability Engineering

Implement monitoring, logging, and alerting systems using Prometheus, Grafana, Loki, ELK, or similar tools

Monitor microservice health, latency, GPU utilization, and production metrics

Troubleshoot and resolve production incidents, outages, and infrastructure bottlenecks

Ensure high uptime and operational reliability

Performance Optimization

Scale GPU instances dynamically for live inference workloads

Optimize AI inference performance, container startup time, and infrastructure costs

Improve deployment efficiency and system throughput

Required Skills & Qualifications

Must Have

4+ years of experience in DevOps, Platform Engineering, Cloud Infrastructure, or SRE

Strong hands-on experience with Kubernetes in production environments

Experience deploying and managing microservices at scale

Strong AWS experience (EKS, ECS, EC2, IAM, VPC, ALB, CloudWatch, Auto Scaling)

Strong Linux administration and troubleshooting skills

Experience with Docker and container orchestration

Experience building CI/CD pipelines

Experience handling production incidents and debugging distributed systems

Strong scripting/programming skills in Python, Bash, or Go

AI Infrastructure Experience (Preferred)

Experience deploying AI/ML/LLM workloads in production

GPU infrastructure management experience

Familiarity with:

vLLM

Triton Inference Server

KServe

Ray Serve

CUDA containers

NVIDIA GPU Operator

Monitoring & Observability

Prometheus

Grafana

Loki

ELK Stack

OpenTelemetry

Distributed tracing

Infrastructure & Automation Tools

Terraform

Ansible

ArgoCD

Helm

GitOps workflows

Nice to Have

Experience with vector databases

Experience with Kafka, RabbitMQ, or Redis

Understanding of AI inference optimization

Experience with hybrid cloud/on-premise infrastructure

Exposure to security best practices and DevSecOps

Key Competencies

Strong problem-solving and debugging skills

Production-first mindset

Ownership mentality

Ability to work under pressure during incidents

Strong communication and collaboration skills

Continuous learning attitude

What We Offer

Opportunity to work on cutting-edge AI infrastructure systems

Exposure to large-scale AI deployment architectures

Competitive salary and growth opportunities

Collaborative engineering culture

High-impact technical ownership

KPIs / Success Metrics

Infrastructure uptime and reliability

Deployment success rate

Production incident resolution time

GPU utilization efficiency

CI/CD deployment speed

System scalability and performance optimization

Monitoring and alerting effectiveness

✍️ How to Apply

Ready to take the next step in your career?

➡️ Apply on Kumarijob

You will be redirected to the original job posting to complete your application.
KaamNepal does not collect applications or store personal data.
