
Microsoft trained its MAI models on unlicensed web data despite promising "enterprise grade, clean and commercially licensed data"

Published: Jun 5, 2026 — 12:10 UTC


Microsoft’s approach to training its MAI models has come under scrutiny after it emerged that the company used unlicensed web data, despite previously assuring stakeholders that it relied on “enterprise grade, clean and commercially licensed data.” The disclosure raises questions about data ethics and compliance in AI development, particularly as companies increasingly rely on vast datasets to improve their machine learning models.

The controversy centers on data sourced from Common Crawl, a publicly available web archive, which Microsoft reportedly used to train its MAI models. The practice fits a broader pattern in the AI industry, where firms often invoke fair use to justify their data sourcing. Like many other AI labs, Microsoft places the onus on website owners to block its crawlers if they do not want their content used, a stance that has sparked debate about the ethics of that approach. As The Decoder notes, relying on unlicensed data contradicts the company’s public assurances about the quality and legality of its training datasets.

The situation affects more than Microsoft’s reputation. As competitors such as Google and OpenAI refine their own data sourcing strategies, pressure is mounting on all players to comply with copyright law and ethical standards, and transparency and accountability in data usage are becoming central concerns. A recent report indicated that 86% of AI developers are concerned about the legal ramifications of their data sourcing practices, which points to a significant vulnerability for companies operating in this space.

For users and stakeholders, the revelation could bring closer scrutiny of AI products and services as consumers become more aware of the ethical questions surrounding data usage. Investors may also reassess their positions in companies that do not prioritize ethical data practices, which could reshape the competitive landscape. As companies work to build trust with users, transparent and compliant data sourcing is likely to become a key differentiator in the market.

Looking ahead, the key questions are how Microsoft and its competitors adapt their data strategies in response to this scrutiny, and whether ongoing discussions about data ethics in AI lead to regulatory changes.

By Callan Zhang · Jun 5, 2026

Summarised from the primary source with AI assistance under human editorial oversight. Turing Wire is not a primary source — read the original for the authoritative account.

Source: The Decoder