At AI4I, you will work on a state-of-the-art AI computing environment built with best-in-class technologies acquired over the past year, including next-generation GPU systems such as NVIDIA B200 accelerators and high-performance distributed storage solutions such as VAST. This infrastructure is designed to support AI training, fine-tuning, and inference workloads for both research and industrial deployment.
Leonardo hosts the physical hardware infrastructure and delivers agreed-upon infrastructure services in partnership with AI4I. In this role, you will focus on optimizing performance and providing direct technical support to internal users and clients running AI workloads, while contributing to the continuous evolution and improvement of the system design.
You will act as a key interface between the machine infrastructure and the teams executing AI workflows, ensuring efficient, stable, and predictable operations.
Location: AI4I, OGR – Turin, Italy
Hybrid work: Flexible arrangements may be negotiated
The position will remain open until filled, and multiple candidates may be hired.
About the Role
As HPC / AI Specialist at AI4I, you will operate at the intersection of infrastructure operations and applied AI execution. You will ensure that engineers, researchers, and deployment teams can efficiently run training, fine-tuning, inference, and data-intensive pipelines on shared compute resources.
This is a cross-unit role shared 50% with the new Deployment Unit, working closely with both infrastructure and client-facing teams.
You will work closely with:
- AI engineers and ML / GenAI teams running training, fine-tuning, and inference workloads
- Cloud / DevOps engineers operating the private cloud
- The Deployment Unit supporting industrial AI clients
- Hardware vendors and infrastructure partners
This role is strongly operational and performance-oriented, with a focus on workload efficiency, system tuning, and user support.
Key Responsibilities
- Operate and maintain Linux-based HPC clusters supporting AI training, fine-tuning, and inference workloads
- Manage GPU and CPU compute environments, including workload scheduling, resource isolation, and performance tuning
- Support distributed and software-defined storage systems used for large-scale datasets
- Act as the primary technical interface between infrastructure operations and internal or external users running AI workflows
- Provide hands-on technical support for AI workload optimization, including distributed training and parameter-efficient fine-tuning of foundation models on HPC infrastructure
- Support foundation model fine-tuning workflows end to end: data pipeline configuration, checkpointing, runtime settings, and GPU memory optimization
- Optimize resource utilization and workload performance across multi-tenant environments
- Support containerized workloads running on shared compute infrastructure
- Monitor system health, performance, and capacity; troubleshoot user-facing production issues
- Contribute to the continuous improvement and evolution of system architecture in collaboration with infrastructure teams
- Support internal users and clients with debugging, environment setup, and best practices for scalable AI execution
Required Qualifications
- Strong Linux system administration experience in production environments
- Solid background in CPU and GPU architectures and performance characteristics
- Experience operating HPC clusters or large-scale compute environments
- Hands-on experience with distributed and software-defined storage systems (e.g., VAST or equivalent)
- Experience with workload managers and job schedulers (e.g., Slurm or equivalent)
- Experience troubleshooting performance bottlenecks in compute or storage environments
- Practical understanding of AI training and fine-tuning workloads, including GPU memory management, batch sizing strategies, distributed execution constraints, checkpointing, and data pipeline performance in HPC or large-scale compute environments
- Scripting and automation skills (Bash, Python, or equivalent)
- Experience supporting shared infrastructure with uptime and operational responsibility
Additional Strengths
- Experience supporting AI / ML training workloads in production environments
- Experience with parameter-efficient fine-tuning workflows and runtime optimization in shared HPC environments
- Familiarity with foundation model adaptation workflows and large-scale training constraints
- Familiarity with containerized execution environments
- Experience operating multi-tenant compute environments
- Experience with monitoring and observability systems
- Familiarity with networking fundamentals for high-throughput environments
- Experience collaborating with engineering or deployment teams in production settings
Key Performance Metrics
- Cluster availability and operational stability
- GPU and CPU utilization efficiency
- Workload performance and scheduling effectiveness
- Time required to debug and resolve user issues
- Time required to onboard new workloads and users
What We Offer
- A collaborative environment with engineers and researchers working on real industrial AI deployments
- Direct impact: your infrastructure will run daily AI workloads and production systems
- An office at the heart of Turin's tech scene: the OGR Torino technology hub
- Competitive compensation and access to advanced computing infrastructure
How to Apply
Submit your application exclusively through the online form, including:
- Cover letter (max. 1 page) describing how your profile fits this specific position
- CV and optional links to technical projects or operational experience