FPT Smart Cloud, an OpenInfra Foundation Silver Member, is a leading provider of Artificial Intelligence (AI) and cloud computing solutions in Vietnam. Backed by extensive benchmarking comparing performance to proprietary solutions like VMware and Microsoft Azure, the FPT team has selected OpenStack to power and optimize its critical AI workloads.

Through two primary AI use cases as well as optimizations of existing high-performance computing (HPC) and AI workloads, FPT has built a significant OpenStack footprint spanning three regions, two zones per region, and more than 100 physical servers for each zone. 

When it comes to the infrastructure technology backing these use cases, FPT adopted open-source infrastructure software, specifically OpenStack, for several reasons: 

  • Flexibility for customization 
  • Broad set of tools available to support hardware optimization, acceleration, and offload (NUMA, SR-IOV, CPU pinning, multi-queue VIFs)
  • Mature cloud ecosystem for AI applications (VMs, storage, autoscaling, automation, load balancing, Kubernetes provisioning)
  • Variety of models for supporting GPU workloads (VM PCI passthrough, vGPU, MIG)
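The GPU consumption models above map to Nova flavor extra specs. As a hedged sketch (the alias name "h100" and the counts are illustrative assumptions, not FPT's actual configuration), a full-card passthrough flavor and a vGPU flavor might look like this:

```python
# Illustrative Nova flavor extra specs for two GPU consumption models.

# Full-card PCI passthrough: the flavor requests one whole GPU through a
# PCI alias that the operator defines in nova.conf under [pci].
passthrough_specs = {"pci_passthrough:alias": "h100:1"}

# vGPU: the flavor requests one virtual GPU slice from the Placement
# service's VGPU resource class.
vgpu_specs = {"resources:VGPU": "1"}
```

MIG-backed slices are typically exposed through the same vGPU mechanism, with the slice profile chosen on the hypervisor side.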

Within these two use cases, FPT has outlined the wide range of OpenStack services they leverage, some of the key features that are critical for their production environments, and the efficiencies they have achieved compared to the proprietary solutions they have used in the past. 

Powering Vietnam’s First AI Factory Infrastructure

The FPT Smart Cloud team built an OpenStack and NVIDIA-powered AI factory that started in Vietnam and expanded to Japan in November 2024. Within these AI factories, OpenStack is being used to manage 1,000 H100 GPUs in Vietnam and 1,000 H200 GPUs in Japan. 

FPT Smart Cloud offers a host of customizable OpenStack services to their users through the AI Factory: 

  • Bare metal as-a-service, powered by OpenStack Ironic 
  • GPU/vGPU cloud instances
  • Cloud workstation/desktop using OpenStack Nova 
  • GPU Kubernetes engine provisioning using OpenStack Magnum 
  • GPU container as-a-service using OpenStack Zun 
  • Self-service capabilities with different GPU card models, or multiple cards dedicated to a single virtual machine (VM) in one cluster
  • Ready-to-use OpenStack add-on services, including load balancing (Octavia), autoscaling (Senlin), and storage backup (Cinder)
  • GPU-accelerated FPT Cloud Desktop, built by integrating OpenStack with OpenUDS

For storage, they use SAN arrays (NetApp, Dell) for VM block storage, Ceph to provide the S3 protocol for AI, big data, model training, and data archiving, and local NVMe (ephemeral) storage for I/O-intensive VM workloads and caching.

Throughout the implementation of OpenStack for the FPT AI Factory, the team monitored performance compared to VMware, where they had also hosted some AI workloads. FPT Smart Cloud found that OpenStack provides more administrative advantages and flexibility, as illustrated in the chart below.

FPT AI eKYC at Large Scale 

Another OpenStack-powered AI use case is FPT AI eKYC, a secure digital onboarding and authentication solution for businesses that uses AI to onboard and authenticate customers quickly and to detect fraud. This solution is used by major banks throughout Vietnam.

Through this solution, identity updates are served for Vietnamese banks by scanning customers’ ID cards. To put this scale in perspective, the system handles more than 1,000,000 requests per day and often up to 10,000 requests per minute at peak times. Thus, it requires high uptime, low latency, and fast inference processing: less than 7 seconds per request.
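A quick back-of-the-envelope check shows what these figures imply for capacity, using Little's law (concurrency = arrival rate × latency):

```python
# Capacity sanity check for the eKYC load figures quoted above.
requests_per_day = 1_000_000
peak_per_minute = 10_000
max_latency_s = 7.0

avg_per_minute = requests_per_day / (24 * 60)       # ~694 req/min on average
peak_per_second = peak_per_minute / 60              # ~167 req/s at peak
# Little's law: in-flight requests = arrival rate * time in system.
peak_concurrency = peak_per_second * max_latency_s  # ~1,167 concurrent requests
```

In other words, at peak the platform must keep on the order of a thousand inference requests in flight simultaneously, which is what drives the autoscaling design described next.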

Proprietary solutions simply cannot offer this level of agility. FPT Cloud uses OpenStack with Senlin-driven autoscaling of VMs backed by NVIDIA A30 GPUs, which allows them to scale out to 50 cards across two regions of FPT Cloud, fulfilling 100% of the banks’ requirements.
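The scale-out behavior described above is driven by a Senlin scaling policy attached to the cluster. As a minimal sketch, here is such a policy spec expressed as a Python dict; the adjustment size and cooldown are illustrative assumptions, not FPT's production values:

```python
# A minimal Senlin scaling-policy spec of the kind used for VM scale-out.
scale_out_policy = {
    "type": "senlin.policy.scaling",
    "version": "1.0",
    "properties": {
        "event": "CLUSTER_SCALE_OUT",       # fires on scale-out requests
        "adjustment": {
            "type": "CHANGE_IN_CAPACITY",   # add a fixed number of nodes
            "number": 1,                    # one GPU VM per scaling event
            "cooldown": 300,                # seconds between scaling events
        },
    },
}
```

In practice a matching CLUSTER_SCALE_IN policy releases the GPU VMs again once the request rate falls, which is what makes the 50-card fleet cost-effective.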

Optimizing Existing AI/HPC Workloads 

For AI and HPC workloads that have historically been hosted on proprietary systems, including global hyperscalers, there are many performance enhancements that can be realized through the implementation of OpenStack. FPT improved GPU efficiency by 7% using OpenStack compared to VMware, resulting in multi-million-dollar savings.

But how? When it specifically comes to GPU optimization, there are several technologies within the OpenStack ecosystem that FPT Smart Cloud leverages for these efficiency gains: 

  • NUMA (non-uniform memory access): a computer memory design used in multiprocessing systems where memory access time depends on the memory’s location relative to the processor 
  • HugePage: a memory management feature in Linux that allows the operating system to use larger memory pages than the default size (typically 4 KB). This feature is particularly beneficial for applications with large memory requirements, such as databases and virtual machines, as it improves performance and reduces resource overhead.
  • CPU pinning: a technique that binds a process or thread to a specific CPU core or set of cores
  • NVMe local ephemeral storage: high-performance, non-persistent storage that is directly attached to the physical hardware of a VM or cloud instance 
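The first three techniques above are enabled per flavor through Nova extra specs. As a hedged sketch (the specific values are illustrative assumptions, not FPT's tuning):

```python
# Illustrative Nova flavor extra specs enabling the optimizations above.
hpc_flavor_specs = {
    "hw:numa_nodes": "1",          # confine the guest to one NUMA node
    "hw:mem_page_size": "1GB",     # back guest RAM with 1 GB huge pages
    "hw:cpu_policy": "dedicated",  # CPU pinning: one host core per vCPU
}
```

NVMe local ephemeral storage, by contrast, is not an extra spec: it comes from the flavor's ephemeral disk on compute hosts whose ephemeral backend is local NVMe.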

Beyond GPU throughput optimization, OpenStack has enhanced network and storage capacity for HPC with core technologies including DPDK, SR-IOV, and local NVMe. Their team is also leveraging a feature introduced in OpenStack 2024.1 ‘Caracal’: vGPU live migration. This enables transferring a running VM with an attached virtual GPU (vGPU) from one physical host to another with minimal downtime or disruption to the VM’s operations. By leveraging this feature, FPT was able to improve high availability (HA) for critical AI and HPC workloads.
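Operationally, a live migration like this is triggered with the `openstack server migrate --live-migration` command. As a small sketch, this helper builds the equivalent CLI invocation; the server name used in the test is a hypothetical placeholder:

```python
# Build the `openstack` CLI command for a (v)GPU live migration.
def live_migration_cmd(server, host=None):
    """Return the argv for `openstack server migrate --live-migration`."""
    cmd = ["openstack", "server", "migrate", "--live-migration"]
    if host:
        # Optional target; omit it to let the Nova scheduler pick a host.
        cmd += ["--host", host]
    return cmd + [server]
```

Without `--host`, Nova's scheduler selects a destination with a compatible vGPU type, which is usually what you want for routine maintenance drains.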

‘OpenInfra for AI’ Working Group Highlights OpenStack Testing, Adoption to Support AI Workloads

Within the global OpenInfra community, OpenInfra Foundation members like FPT Smart Cloud have come together to form a working group to further the education and adoption of OpenStack in AI scenarios. By identifying architectural trends for infrastructure powering AI workloads and defining open-source software gaps in running AI workloads, these organizations are collaborating to elevate OpenStack’s position as the de facto open-source infrastructure technology for supporting AI workloads. If you would like to contribute to this effort, encourage your organization to join the OpenInfra Foundation.

Allison Price