OpenStack is an open-source cloud computing platform that powers many of today’s largest cloud environments. However, managing upgrades in a large-scale OpenStack deployment can present substantial challenges. This article examines the upgrade process and efficiencies gained by OpenInfra Foundation Associate Member, Indiana University (IU), which has been running OpenStack since 2015. The university’s experience highlights the importance of strategic upgrade planning, particularly for large deployments, and how the adoption of a more structured approach to upgrades has streamlined operations and minimized disruptions.
A Historical Perspective on OpenStack Upgrades at Indiana University
IU first deployed OpenStack in 2015, beginning with a cloud environment powered by 320 hypervisors. The university’s OpenStack deployment has grown significantly over the years, with their second cloud deployment expanding to 506 hypervisors. This scale introduces unique challenges during upgrades, especially in terms of database and message bus load—both critical components for the operation of OpenStack during the upgrade process. As the size of the deployment increased, the complexity of upgrades grew, and a more efficient approach became necessary.
Indiana University’s OpenStack upgrade path has evolved over the years, with several key milestones:
- Juno, Kilo, Liberty: 2015
- Mitaka, Newton: 2016
- Ocata, Pike: 2017
- Queens, Rocky: 2018
- Stein, Train: 2019
- Wallaby: 2021
- Xena, Yoga, Zed: 2023
- Antelope, Caracal: 2024
Each upgrade cycle presented challenges in managing and maintaining such a large and growing environment. As the number of hypervisors increased, so did the workload involved in the upgrade process, particularly when dealing with upgrades to the database, RPC systems, and network agents.
The Challenges of Large-Scale OpenStack Upgrades
One of the primary challenges IU faced during OpenStack upgrades was the sheer scale of the deployment. Somewhere in the 300 to 500 hypervisor range, the deployment reached an inflection point: the databases and the RabbitMQ message bus that handle inter-component communication (RPC) came under heavy load during upgrades, requiring careful planning and execution.
Before adopting more streamlined upgrade processes, IU faced several key pain points during each upgrade cycle:
- Upgrading Hypervisor Agents/Daemons: With hundreds of hypervisors, IU had to upgrade around 1500 agents and daemons across the environment. This required significant effort and coordination.
- Data Migrations: Each OpenStack release often required data migrations, which added to the complexity and duration of the upgrade process.
- Network Agent Outages: After upgrading centralized networking agents, IU experienced brief outages in tenant networks as agents were restarted, leading to service disruptions.
Despite having conducted 14 successful upgrades across production OpenStack clouds, the process was still a labor-intensive, multi-hour event.
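One common way to keep database and RabbitMQ load bounded at this scale is to roll agent upgrades out in waves rather than restarting everything at once, so only a bounded number of agents re-sync over RPC at any moment. The sketch below is illustrative only, not IU's actual tooling; the hostnames, batch size, and upgrade command are all assumptions:

```python
import subprocess
from itertools import islice

def waves(hosts, batch_size):
    """Split a host list into fixed-size batches for a rolling upgrade."""
    it = iter(hosts)
    while batch := list(islice(it, batch_size)):
        yield batch

def rolling_upgrade(hosts, batch_size=25, dry_run=True):
    """Upgrade agents one wave at a time, so at most batch_size hosts
    restart their compute/network agents (and re-register over RPC) at once."""
    for i, batch in enumerate(waves(hosts, batch_size), start=1):
        for host in batch:
            # Hypothetical upgrade command; real tooling would likely use
            # Ansible or similar rather than raw ssh in a loop.
            cmd = ["ssh", host, "sudo apt-get -y install nova-compute"]
            if dry_run:
                print(f"wave {i}: would run: {' '.join(cmd)}")
            else:
                subprocess.run(cmd, check=True)

# Example: 506 hypothetical hypervisors upgraded in waves of 25,
# i.e. 21 waves, with only 25 agents restarting at a time.
hypervisors = [f"hv{n:03d}.cloud.example.edu" for n in range(506)]
rolling_upgrade(hypervisors, batch_size=25, dry_run=True)
```

Throttling the wave size trades total upgrade duration against peak load on the control plane; the right batch size depends on how much headroom the database and message bus have.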
The Introduction of SLURP: Streamlining the Upgrade Process
The adoption of SLURP (Skip Level Upgrade Release Process) was a game-changer for IU. SLURP designates every other OpenStack release as a skip-level upgrade target, allowing operators to upgrade once a year instead of every six months while still receiving stability fixes and timely patching of critical components. By reducing the number of upgrades required, IU could streamline the process and minimize the disruptions typically associated with frequent upgrades.
Key Benefits of SLURP for IU:
- Faster Upgrades: The Zed to Antelope upgrade took about 4 hours, while the Antelope to Caracal upgrade took only 3 hours. Because SLURP allowed IU to skip Bobcat entirely, a single maintenance window covered two releases' worth of changes, cutting both the total upgrade time and the number of outages roughly in half.
- Minimized Outages: The upgrade process under SLURP resulted in half the outages compared to previous cycles. This was achieved through better planning and a more efficient process, which helped keep service disruptions to a minimum.
- Time Efficiency: Instead of spreading the upgrade over two days, the Antelope to Caracal upgrade was completed in just one day. The reduction in time required to complete upgrades allowed for smoother operations and better resource management.
Scheduling Upgrades for Minimal Impact
An additional advantage of SLURP is the flexibility it gives IU to schedule upgrades during off-peak times. The university traditionally experiences lower usage during the summer months, particularly in July and August, when many users are on vacation. By aligning the upgrade schedule with this quieter period, IU minimizes the impact on end users and ensures that any potential issues are resolved before the start of the academic year.
Improved User Experience
The change in upgrade cadence has had a noticeable positive effect on IU's end users. Previously, users were regularly affected by maintenance windows and service interruptions. Under the new approach with SLURP, users are far less likely to notice that an upgrade has taken place at all, unless they encounter new features or improvements.
With reduced maintenance windows and fewer disruptions, users now enjoy a more stable platform with fewer API issues and service outages, creating a more seamless and reliable experience.
The Evolution of OpenStack Upgrades: From a Multi-Day Ordeal to Routine Maintenance
IU’s approach to OpenStack upgrades has evolved significantly over the past decade. Initially, upgrades were a multi-day ordeal that occurred twice a year, requiring intensive work to patch bugs, deal with network outages, and resolve issues with intermediate versions. These upgrades were laborious, time-consuming, and disruptive to users.
Today, thanks to SLURP and a more efficient upgrade strategy, upgrades have become a routine maintenance task that takes place over the summer when user activity is lower. The upgrades are far less disruptive, and the team has become adept at handling them in a way that minimizes the impact on both infrastructure and users.
The experience of IU highlights how large-scale OpenStack deployments can benefit from a more structured and efficient upgrade process. By adopting SLURP, the university has been able to reduce the frequency of major upgrades, streamline the upgrade process, and minimize service disruptions.
Today, OpenStack upgrades at IU are no longer a significant event but rather a smooth, well-planned part of routine maintenance. This evolution from a challenging, multi-day ordeal to an efficient and predictable process has improved operational efficiency and user satisfaction, offering valuable lessons for other organizations running large-scale OpenStack environments.
As OpenStack continues to evolve, IU’s experience serves as a prime example of how strategic planning, reduced upgrade cadence, and the right tools can transform the upgrade experience and contribute to a more stable, reliable cloud environment.