Job Description
Engineering / Infrastructure / AI Systems
Kathmandu, Nepal (Hybrid/On-site)
4+ years of experience (DevOps / Platform Engineering / Cloud Infrastructure)
Job Summary
We are looking for a highly skilled Senior AI Infrastructure & Platform Engineer to design, deploy, scale, and maintain production-grade AI systems and cloud infrastructure.
This role is ideal for a DevOps or Platform Engineer who has strong experience managing microservices at scale and is passionate about deploying AI-powered applications in real-world production environments.
The role centers on Kubernetes, AWS cloud infrastructure, CI/CD automation, GPU workload management, observability/monitoring, production troubleshooting, and scalable AI model deployment.
You will work closely with AI/ML Engineers to deploy and optimize AI inference systems, LLM services, and distributed microservice architectures.
Key Responsibilities
AI Infrastructure & Deployment
Deploy and maintain AI-powered microservices in production environments
Manage scalable GPU-based inference systems for live AI/LLM applications
Optimize model-serving infrastructure for low latency and high availability
Deploy AI workloads using Docker and Kubernetes
Cloud & Infrastructure Management
Design and manage AWS cloud infrastructure (EKS, ECS, EC2, VPC, IAM, ALB, Auto Scaling, S3)
Manage on-premise/in-house servers and hybrid infrastructure environments
Ensure infrastructure security, scalability, and reliability
Kubernetes & Container Orchestration
Deploy and manage Kubernetes clusters for distributed AI workloads
Configure auto-scaling for GPU- and CPU-intensive services
Manage Helm charts, ingress controllers, service networking, and workload scheduling
Optimize container performance and resource utilization
CI/CD & Automation
Build and maintain CI/CD pipelines for microservices and AI applications
Automate deployments using GitHub Actions, GitLab CI/CD, Jenkins, Terraform, or Ansible
Implement Infrastructure as Code (IaC) best practices
Monitoring & Reliability Engineering
Implement monitoring, logging, and alerting systems using Prometheus, Grafana, Loki, ELK, or similar tools
Monitor microservice health, latency, GPU utilization, and production metrics
Troubleshoot and resolve production incidents, outages, and infrastructure bottlenecks
Ensure high uptime and operational reliability
Performance Optimization
Scale GPU instances dynamically for live inference workloads
Optimize AI inference performance, container startup time, and infrastructure costs
Improve deployment efficiency and system throughput
Required Skills & Qualifications
Must Have
4+ years of experience in DevOps, Platform Engineering, Cloud Infrastructure, or SRE
Strong hands-on experience with Kubernetes in production environments
Experience deploying and managing microservices at scale
Strong AWS experience (EKS, ECS, EC2, IAM, VPC, ALB, CloudWatch, Auto Scaling)
Strong Linux administration and troubleshooting skills
Experience with Docker and container orchestration
Experience building CI/CD pipelines
Experience handling production incidents and debugging distributed systems
Strong scripting/programming skills in Python, Bash, or Go
AI Infrastructure Experience (Preferred)
Experience deploying AI/ML/LLM workloads in production
GPU infrastructure management experience
vLLM
Triton Inference Server
KServe
Ray Serve
CUDA containers
NVIDIA GPU Operator
Monitoring & Observability
Prometheus
Grafana
Loki
ELK Stack
OpenTelemetry
Distributed tracing
Infrastructure & Automation Tools
Terraform
Ansible
ArgoCD
Helm
GitOps workflows
Nice to Have
Experience with vector databases
Experience with Kafka, RabbitMQ, or Redis
Understanding of AI inference optimization
Experience with hybrid cloud/on-premise infrastructure
Exposure to security best practices and DevSecOps
Key Competencies
Strong problem-solving and debugging skills
Production-first mindset
Ownership mentality
Ability to work under pressure during incidents
Strong communication and collaboration skills
Continuous learning attitude
What We Offer
Opportunity to work on cutting-edge AI infrastructure systems
Exposure to large-scale AI deployment architectures
Competitive salary and growth opportunities
Collaborative engineering culture
High-impact technical ownership
KPIs / Success Metrics
Infrastructure uptime and reliability
Deployment success rate
Production incident resolution time
GPU utilization efficiency
CI/CD deployment speed
System scalability and performance optimization
Monitoring and alerting effectiveness
Ready to take the next step in your career?
➡️ Apply on Kumarijob
You will be redirected to the original job posting to complete your application.
KaamNepal does not collect applications or store personal data.
Source: Kumarijob