Notable evaluation benchmarks Microsoft

CloudCons: A Comprehensive End-to-End Benchmark for Cloud Resource Consolidation

Xiaobin Zhang, Lefei Shen, Mouxiang Chen, Zhuo Li, Hongkai Li, Han Fu

Published: Jun 11, 2026 — 16:00 UTC

Also in this story: Google

Problem
The paper addresses the gap in evaluating the practical utility of forecasting models for cloud resource consolidation, particularly in the context of the forecast-then-optimize paradigm. Existing benchmarks primarily focus on prediction error metrics, neglecting the actual decision-making capabilities of these models. This work is particularly relevant as it is a preprint and has not undergone peer review, indicating that the findings should be interpreted with caution.

Method
The authors propose CloudCons, a comprehensive end-to-end benchmark that evaluates forecasting models specifically for cloud resource consolidation. They construct high-quality datasets from diverse workloads sourced from Huawei Cloud, Microsoft Azure, and Google Borg. These datasets encapsulate various service characteristics, including synchronized diurnal rhythms, stochastic bursts, and high-frequency noise. The evaluation encompasses a range of models, including statistical methods, deep learning architectures, and emerging foundation models. The authors emphasize the importance of predictive quantile selection, which they identify as a critical factor influencing decision utility. They provide guidelines for calibrating these selections to optimize the trade-off between resource efficiency and service reliability.

Results
The experiments reveal that while foundation models exhibit superior zero-shot forecasting accuracy compared to traditional models, this does not directly correlate with improved decision utility in resource consolidation tasks. The authors present quantitative results demonstrating that the choice of predictive quantiles significantly impacts the effectiveness of resource allocation strategies. Specific metrics and comparisons against baseline models are provided, although exact numerical results are not detailed in the summary.

Limitations
The authors acknowledge that their findings are based on specific datasets and may not generalize across all cloud environments. They also note that the benchmark does not account for all possible operational constraints in real-world scenarios, such as varying service level agreements (SLAs) or unexpected demand spikes. Additionally, the reliance on zero-shot generalization raises questions about the robustness of foundation models in diverse operational contexts.

Why it matters
The implications of this work are significant for both academic research and practical applications in cloud computing. By providing a structured benchmark for evaluating forecasting models in resource consolidation, CloudCons facilitates the development of more effective decision-making tools in cloud environments. The insights regarding predictive quantile selection offer actionable strategies for practitioners aiming to enhance resource efficiency while maintaining service reliability. This research contributes to the ongoing discourse on optimizing cloud resource management, as published in arXiv cs.AI.

By Callan Zhang · Jun 11, 2026 · Editorial standards →

Summarised from the primary source with AI assistance under human editorial oversight. Turing Wire is not a primary source — read the original for the authoritative account.

Source: arXiv cs.AI