Notable efficiency inference

MLSkip: Data Skipping for ML Filters via Lightweight Metadata

Mihail Stoian, Mark Gerarts, Pascal Ginter, Andreas Zimmerer, Jan Van den Bussche, Andreas Kipf

Published
Jun 2, 2026 — 17:36 UTC

Problem
The paper addresses a gap in the literature regarding data management techniques for machine learning (ML) filters, particularly in the context of recent AI functions integrated into database systems. Traditional data skipping methods, which are effective for integer and string data, are inadequate for the new filter types that rely on complex ML models. The authors highlight the absence of mechanisms for efficiently pruning non-qualifying row groups when accessing data from blob storage. This work is a preprint and has not undergone peer review.

Method
The authors propose MLSkip, which leverages Parquet’s default min-max metadata to facilitate data skipping for ML filters. They establish connections to existing research on ML query languages and neural network verification to support their approach. The core technical contribution includes the introduction of a size-bounded 2D convex hull metadata structure, which enhances the effectiveness of pruning by allowing verification tools to utilize the metadata more efficiently. The proposed method is evaluated using ReLU architectures on datasets from TPC-H and TPC-DS, with a focus on filters exhibiting low selectivity.

Results
Preliminary experiments demonstrate that MLSkip achieves an average pruning effectiveness of 27.4% for filters with selectivity below 0.1%. By implementing the enhanced metadata structure, the pruning effectiveness improves to 38.31%. Additionally, the authors report an end-to-end speedup of 1.07× over PyTorch when integrated into DuckDB, indicating a significant performance improvement in query execution times.

Limitations
The authors acknowledge that their results are preliminary and may not generalize across all types of ML models or datasets. They do not address potential scalability issues when applying MLSkip to larger datasets or more complex filter predicates. Furthermore, the reliance on Parquet’s min-max metadata may limit applicability to other storage formats that do not support similar metadata structures.

Why it matters
The introduction of MLSkip has significant implications for optimizing data management in systems that utilize ML filters, particularly in scenarios where query efficiency is critical. By improving data skipping techniques, this work can enhance the performance of database systems that integrate AI functions, thereby facilitating more efficient data retrieval and processing. This research contributes to the ongoing discourse on the intersection of machine learning and database management, as discussed in related works available on arXiv.

Turing Wire

By Turing Wire editorial staff · Jun 2, 2026 · Editorial standards →

Source: arXiv cs.LG