TFGN: Task-Free, Replay-Free Continual Pre-Training Without Catastrophic Forgetting at LLM Scale
Anurup Ganguli
- Published
- May 14, 2026 — 16:46 UTC
Problem
This paper addresses the challenge of continual pre-training of large language models (LLMs) on diverse text domains without relying on replay mechanisms, task identifiers, or regularization penalties, which are known to scale poorly. The authors highlight that existing methods often lead to catastrophic forgetting, where previously learned information is lost when new tasks are introduced. This work is presented as a preprint and has not yet undergone peer review.
Method
The authors propose TFGN (Task-Free, Replay-Free Continual Pre-Training), an architectural overlay for transformer models that enables input-conditioned, parameter-efficient updates while maintaining the integrity of the original transformer architecture. TFGN employs a Read/Write decomposition strategy, where the forward pass remains fully dense, but updates to parameters are structured to prevent overwriting prior-domain subspaces. The model is evaluated across six heterogeneous text domains (Prose, Python, Math, Biomedical, Chinese, JavaScript) with 1 billion tokens per phase, using three model scales (~398M, ~739M, ~9B) and two training regimes (From-Scratch and Retrofit). The authors also introduce two extensions: a closed-loop meta-control layer (Extension A) that further reduces forgetting by 81% and an operator-level plan vector (Extension B) that maintains high fidelity in forward-pass behavior.
Results
TFGN demonstrates significant performance metrics, achieving a backward transfer score of -0.007 at LLaMA 3.1 8B Retrofit and HellaSwag retention scores of 0.506, 0.504, and 0.510 across different model scales. The model exhibits >=99.59% L2-orthogonal gradient separation between domain pairs, indicating effective management of cross-domain learning. Notably, the held-out JavaScript perplexity (PPL) decreases by 26.8% at LLaMA-8B Retrofit and 62.0% at GPT-2 Medium From-Scratch due to training on Python, showcasing positive cross-domain forward transfer. The extensions further enhance performance, particularly in reducing forgetting and maintaining operational fidelity.
Limitations
The authors acknowledge that while TFGN effectively mitigates catastrophic forgetting, the reliance on a specific architectural overlay may limit its applicability to other model types or tasks. Additionally, the performance metrics are based on specific benchmarks, and the generalizability of results across a broader range of tasks remains to be validated. The paper does not address potential computational overhead introduced by the architectural modifications.
Why it matters
The introduction of TFGN represents a significant advancement in the field of continual learning for LLMs, providing a framework that allows for effective learning across diverse domains without the pitfalls of traditional methods. This work has implications for the development of more robust and adaptable language models that can continuously learn from new data while retaining previously acquired knowledge. The architectural innovations may inspire further research into efficient continual learning strategies and the design of LLMs capable of autonomous learning.
Authors: Anurup Ganguli
Source: arXiv:2605.15053
URL: https://arxiv.org/abs/2605.15053v1
By Turing Wire editorial staff · May 14, 2026 · Editorial standards →
Source: arXiv cs.AI