Upgrading OpenStack at scale is a complex and challenging task, but through the implementation of SLURP and a shift to less frequent, more deliberate upgrade cadences, organizations like Cleura can achieve more efficient and stable upgrade processes.

As OpenStack evolves, so do the challenges of managing large-scale cloud environments. For organizations with multiple regions and complex deployments, staying current with OpenStack upgrades while maintaining system stability and compliance can be daunting. This article outlines the upgrade path and lessons learned from Cleura, an OpenInfra Foundation Gold Member that has been running OpenStack since 2015 and manages multiple regions of varying sizes. Because the company operates under the constant scrutiny of European digital sovereignty initiatives, as reflected in this week’s Compliant Cloud launch, running supported releases is critical for its compliance requirements.

A Historical Perspective of OpenStack at Cleura 

Cleura first deployed OpenStack in 2015 and has seen several releases come and go, as well as a notable shift in how upgrades are managed. With two separate deployments (Cleura Public Cloud and Cleura Compliant Cloud), each spanning multiple regions and each containing up to hundreds of compute nodes, the upgrade process has had a significant impact on operational efficiency. As OpenStack releases progressed, so did the complexity of keeping the deployments up to date.

Cleura’s OpenStack upgrade path has seen several key milestones:

  • Juno → Kilo: ~2015
  • Kilo → Liberty: ~2016
  • Liberty → Mitaka: ~2016
  • Mitaka → Newton: ~2017
  • Newton → Ocata: ~2017
  • Ocata → Pike: 2018
  • Pike → Queens → Rocky: 2019
  • Rocky → Stein → Train: 2020
  • Train → Victoria: 2021
  • Victoria → Xena: 2022
  • Xena → Antelope: 2023
  • Antelope → Caracal: 2024

These upgrades, while necessary for maintaining security, stability, and performance, introduced significant challenges due to the size and complexity of their deployment. The team faced frequent upgrades, often requiring maintenance windows of multiple days, with testing phases that could take up to two weeks in preparation.

The Challenge of Frequent Upgrades

In managing such a large-scale OpenStack deployment, one of the biggest pain points for Cleura, and for many other OpenStack operators, was the sheer frequency of upgrades. With multiple regions to consider, upgrades were happening almost every two weeks, creating challenges in planning and resource allocation. Cleura noted that keeping up with this pace would require almost two full-time engineers focused exclusively on the upgrade process, simply to cover vacations, sick leave, and the “4-eyes” principle for risk management.

The historical shortage of qualified engineers in the market compounded this issue, making it difficult to maintain the staff needed to manage the constant churn of upgrades.

Key Pain Points Before SLURP

Before the implementation of SLURP (Skip Level Upgrade Release Process), OpenStack upgrades posed several technical and operational challenges for Cleura:

  • Frequent Upgrades: The constant stream of upgrades across multiple regions created a heavy operational burden, often requiring upgrades every two weeks.
  • Long Maintenance Windows: Upgrades involving intermediate releases resulted in extensive maintenance windows due to the complexity of the upgrades, often involving multiple layers of patching and fixes.
  • Skipped Releases: Skipping intermediate releases to jump directly to a later version caused compatibility issues between components, such as incompatible RPC API versions or missing required database migrations. This resulted in services like Cinder or Nova crashing or getting stuck in indefinite restart loops. Changes in deployment tooling and processes added further complexity: for example, an overhauled process for issuing Octavia Amphora certificates triggered the generation of a new CA, which in turn required a failover of all load balancers.
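
Problems such as missing database migrations are the kind of thing the per-service upgrade checks shipped with OpenStack aim to surface before services are restarted. The following is a minimal sketch, not Cleura's tooling: it wraps the `nova-status upgrade check` and `cinder-status upgrade check` commands and interprets their exit codes (0 for success, 1 for warnings, 2 for failures, anything else treated as an unexpected error). The function names and the idea of gating an upgrade on the combined result are assumptions made for illustration, and these checks will not catch every incompatibility described above.

```python
import subprocess

# Exit codes used by the `<service>-status upgrade check` commands.
EXIT_CODE_MEANING = {0: "success", 1: "warning", 2: "failure"}


def interpret(returncode: int) -> str:
    """Map an upgrade-check exit code to a short status label."""
    return EXIT_CODE_MEANING.get(returncode, "error")


def run_upgrade_check(service: str) -> str:
    """Run `<service>-status upgrade check` and return its status label.

    Hypothetical wrapper: assumes the command is on PATH on this host.
    """
    result = subprocess.run(
        [f"{service}-status", "upgrade", "check"],
        capture_output=True,
        text=True,
    )
    return interpret(result.returncode)


def safe_to_proceed(statuses: dict[str, str]) -> bool:
    """Allow the upgrade to continue only if no check failed or errored."""
    return all(status in ("success", "warning") for status in statuses.values())
```

On a controller node this could be used as `safe_to_proceed({s: run_upgrade_check(s) for s in ("nova", "cinder")})`. Treating warnings as acceptable is a policy choice; a stricter operator might stop on anything other than "success".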

Introducing SLURP for Upgrade Efficiency

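The introduction of SLURP has significantly improved the upgrade process for Cleura. Under SLURP, every other OpenStack release is designated a SLURP release, and direct upgrades from one SLURP release to the next are supported and tested. Operators can therefore skip the release in between and upgrade roughly once a year instead of every six months, which makes upgrades more stable and predictable.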
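
To make the rule concrete, here is a small sketch of how a supported upgrade path could be checked. It assumes the `year.N` release identifiers used since 2023.1 (Antelope is 2023.1, Bobcat is 2023.2, Caracal is 2024.1) and that the first release of each year is the SLURP release. The function names are hypothetical, and the sketch deliberately ignores releases before 2023.1.

```python
def is_slurp(release: str) -> bool:
    """True for the first release of a year, from 2023.1 onwards."""
    year, _, sequence = release.partition(".")
    return int(year) >= 2023 and sequence == "1"


def next_release(release: str) -> str:
    """Return the identifier of the release that follows `release`."""
    year, sequence = map(int, release.split("."))
    return f"{year}.2" if sequence == 1 else f"{year + 1}.1"


def is_supported_upgrade(source: str, target: str) -> bool:
    """Check whether source -> target is a supported direct upgrade.

    Two cases are accepted: moving to the very next release, or moving
    from one SLURP release to the following SLURP release.
    """
    if next_release(source) == target:
        return True
    return (
        is_slurp(source)
        and is_slurp(target)
        and next_release(next_release(source)) == target
    )


# Antelope (2023.1) -> Caracal (2024.1) skips Bobcat and is supported.
print(is_supported_upgrade("2023.1", "2024.1"))
# Bobcat (2023.2) -> 2024.2 skips a release without starting from a SLURP.
print(is_supported_upgrade("2023.2", "2024.2"))
```

The first call corresponds to the Antelope to Caracal upgrade discussed in this article; the second shows that skipping a release is only supported when starting from a SLURP release.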

Key Benefits of SLURP:

  • Reduced Maintenance Time: For the upgrade from Antelope to Caracal, Cleura planned just one day of maintenance and completed the upgrade in under 8 hours for their largest region, a significant improvement over previous upgrades, which required 3 days of maintenance.
  • Faster Testing Cycles: Testing cycles were reduced to 3 days, down from as much as two weeks, as the upgrade path was already well-tested through the use of tools like OpenStack-Ansible, which also streamlined the deployment process.
  • Stable Upgrades: SLURP releases receive more attention in terms of bug fixes and patches. With this policy, teams have started backporting fixes not only to the latest release but also to SLURP releases, which is a major benefit for organizations facing ongoing issues.

Compliance and Stability

The reduced cadence of upgrades under SLURP also provided important benefits beyond just operational efficiencies for Cleura. With the policy in place, the organization significantly reduced the risk of running an unmaintained version of OpenStack, which was critical for their compliance requirements. The stability of SLURP releases, coupled with a more predictable upgrade schedule, allowed for better planning and fewer disruptions.

Despite these improvements, Cleura suggested that SLURP releases could benefit from a slightly longer “maintenance” cycle compared to non-SLURP releases. This would provide more room for upgrades and planning, helping prevent the system from slipping into an “unmaintained” period.

The User Experience

One of the most notable benefits of the shift in upgrade cadence for Cleura was the positive effect on end users. Prior to implementing SLURP, users often experienced significant API disturbances and multiple maintenance periods due to frequent, overlapping upgrades. However, with the new approach, users saw a noticeable reduction in service disruptions and maintenance windows.

Cleura also observed a reduction in user complaints about “too frequent” maintenance and a general sense of platform stability and maturity. The overall user experience improved, with fewer API issues and less frustration related to upgrades.

Lessons Learned and Future Outlook

The journey from a frequent, high-maintenance upgrade cycle to the more stable SLURP process has been a valuable learning experience for Cleura. The organization found that leap upgrades, which bypass intermediate releases and go directly to the next major version, helped save time and reduce unnecessary complexity.

Although the leap upgrade method was not officially supported at the time, it was still seen as a more cost-effective way to avoid issues associated with running multiple intermediate releases. The shift towards this approach started around the Train release and was instrumental in reducing the upgrade burden.
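
The shift is visible in the milestone list earlier in this article. As a rough illustration, the sketch below derives which releases each hop from Train onwards jumped over, using the chronological order of OpenStack release names. The helper name `skipped_between` is hypothetical, and the hops are copied from the milestone list rather than from any Cleura tooling.

```python
# OpenStack release names in chronological order, Juno through Caracal.
RELEASES = [
    "Juno", "Kilo", "Liberty", "Mitaka", "Newton", "Ocata", "Pike",
    "Queens", "Rocky", "Stein", "Train", "Ussuri", "Victoria",
    "Wallaby", "Xena", "Yoga", "Zed", "Antelope", "Bobcat", "Caracal",
]


def skipped_between(source: str, target: str) -> list[str]:
    """Return the releases that a direct source -> target upgrade jumps over."""
    start, end = RELEASES.index(source), RELEASES.index(target)
    if end <= start:
        raise ValueError(f"{target} is not newer than {source}")
    return RELEASES[start + 1:end]


# Hops taken from the milestone list in this article, from Train onwards.
HOPS = [
    ("Train", "Victoria"),
    ("Victoria", "Xena"),
    ("Xena", "Antelope"),
    ("Antelope", "Caracal"),
]

for source, target in HOPS:
    print(f"{source} -> {target}: skipped {skipped_between(source, target)}")
```

Every hop in this range skips at least one release, and Xena to Antelope skips two (Yoga and Zed). Only the last hop, Antelope to Caracal, falls under the SLURP model.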

Upgrading OpenStack at scale is a complex and challenging task, but through the implementation of SLURP and a shift to less frequent, more deliberate upgrade cadences, organizations like Cleura can achieve more efficient and stable upgrade processes. With better planning, faster testing, and reduced maintenance windows, SLURP has proven to be an invaluable tool for managing large OpenStack deployments while keeping pace with evolving technology and compliance requirements.

The experience of Cleura offers a blueprint for others facing similar challenges, highlighting the importance of strategic planning, the right tooling, and a shift towards more stable release management practices in ensuring the long-term success of OpenStack deployments.

Allison Price