DNQ: Deep Nash Q-Network for Partially Observable n-Player Games
Qintong Xie, Edward Koh, Xavier Cadet, Peter Chin
- Published
- Jun 4, 2026 — 17:58 UTC
Problem
The paper addresses the challenge of training agents in multi-turn simultaneous bidding scenarios within partially observable n-player games, a gap in the literature concerning equilibrium strategies in competitive environments. The authors highlight the limitations of existing methods that struggle with scalability and computational efficiency, particularly when dealing with larger numbers of agents. This work is presented as a preprint, indicating that it has not yet undergone peer review.
Method
The proposed Deep Nash Q-Network (DNQ) framework employs a solver-in-the-loop equilibrium supervision approach. The architecture consists of a shared critic that predicts either pairwise payoff matrices or an N-player payoff tensor. The training process alternates between trajectory collection, critic-based payoff estimation, equilibrium computation, and policy imitation. The agents are trained by minimizing the Kullback-Leibler (KL) divergence between their masked policies and the equilibrium targets derived from an external solver. The pairwise formulation significantly reduces the computational burden associated with equilibrium solving, allowing for scalability to larger agent populations. The authors also detail the training compute requirements, emphasizing the efficiency gains achieved through the shared critic mechanism.
Results
The experimental results demonstrate that the pairwise formulation of DNQ outperforms the exact formulation in terms of critic loss, policy entropy, and training cost. Specifically, the pairwise method scales effectively to larger numbers of agents, while the exact method becomes computationally impractical as the joint game size increases. The authors provide quantitative comparisons, showing that the pairwise approach maintains strategic fidelity while significantly reducing resource usage and training time. The results indicate a clear trade-off between scalability and strategic accuracy in repeated competitive environments.
Limitations
The authors acknowledge that the reliance on a shared critic may introduce biases in payoff estimation, potentially affecting the equilibrium computation. Additionally, the pairwise formulation, while scalable, may not capture the full complexity of interactions in larger n-player games. The paper does not address the potential impact of varying information structures or the effects of noise in the bidding environment, which could further complicate the training dynamics.
Why it matters
The DNQ framework has significant implications for the development of intelligent bidding agents in real-world competitive systems, such as auctions and resource allocation scenarios. By providing a scalable solution to the equilibrium computation problem, this work paves the way for more sophisticated multi-agent systems capable of operating under shared constraints and limited information. The findings contribute to the broader field of multi-agent reinforcement learning and game theory, as discussed in related literature. For further details, see the full paper available on arXiv.
By Turing Wire editorial staff · Jun 4, 2026 · Editorial standards →
Source: arXiv cs.LG