Notable other

IPO-Mine: A Toolkit and Dataset for Section-Structured Analysis of Long, Multimodal IPO Documents

Michael Galarnyk, Siddharth Lohani, Vidhyakshaya Kannan, Sagnik Nandi, Aman Patel, Liqin Ye

Published: May 27, 2026 — 16:36 UTC

Problem
This paper addresses the lack of a large-scale, standardized dataset and benchmark for analyzing Initial Public Offering (IPO) filings, which are critical documents in financial markets. IPO filings are complex, multimodal documents that often exceed 500,000 tokens and lack consistent structural organization. The authors note that existing models struggle with the unique challenges posed by these documents, particularly in terms of multimodal reasoning and alignment with expert human judgments. This work is presented as a preprint and has not yet undergone peer review.

Method
The authors introduce the IPO-Toolkit, an open-source framework designed for downloading and parsing IPO filings into a standardized, section-structured format. The toolkit performs several key functions: it segments lengthy filings, extracts embedded images, and produces structured outputs suitable for large-scale analysis. The resulting IPO-Dataset comprises over 109,000 IPO filings and amendments from 1994 to 2026, including more than 76,000 images. The dataset is structured to facilitate evaluation tasks, particularly focusing on the quality and misleadingness of extracted financial charts. The authors employ state-of-the-art multimodal models to assess these tasks, revealing significant discrepancies between model outputs and expert evaluations.

Results
The experiments conducted using the IPO-Dataset demonstrate that current multimodal models often misalign with human judgments regarding the quality and misleadingness of financial charts extracted from IPO filings. While specific numerical results are not disclosed in the abstract, the authors emphasize that the divergence from expert assessments highlights critical limitations in existing models’ capabilities to interpret long, multimodal regulatory documents. The structured evaluation tasks established in this work serve as a benchmark for future research in this domain.

Limitations
The authors acknowledge several limitations, including the potential biases inherent in the dataset due to the historical context of the filings and the challenges of generalizing findings across different industries. They also note that the multimodal models tested may not be fully representative of the state-of-the-art, as the focus was primarily on chart quality and misleadingness. Additionally, the toolkit’s reliance on the structural organization of IPO filings may limit its applicability to other types of regulatory documents that do not conform to similar formats.

Why it matters
The introduction of the IPO-Toolkit and IPO-Dataset represents a significant advancement in the analysis of IPO filings, providing researchers and practitioners with the necessary tools to conduct large-scale, reproducible analyses of multimodal documents. This work lays the groundwork for future studies aimed at improving multimodal reasoning capabilities in AI systems, particularly in the financial domain. By establishing structured evaluation tasks, the authors highlight the need for better alignment between AI models and expert human judgments, which is crucial for enhancing the interpretability and reliability of AI in regulatory contexts.

Authors: Michael Galarnyk, Siddharth Lohani, Vidhyakshaya Kannan, Sagnik Nandi, Aman Patel, Liqin Ye, Arnav Hiray, Rutwik Routu et al.
Source: arXiv:2605.28714
URL: https://arxiv.org/abs/2605.28714v1

By Callan Zhang · May 27, 2026 · Editorial standards →

Summarised from the primary source with AI assistance under human editorial oversight. Turing Wire is not a primary source — read the original for the authoritative account.

Source: arXiv cs.AI