
Pareto Q-Learning with Reward Machines

arXiv:2606.19134

Arnaud Lequen, Clément Legrand-Lixon, Léo Saulières

Published: Jun 17, 2026 — 14:44 UTC

Problem
The paper targets multi-objective reinforcement learning (MORL) in settings where rewards are specified by reward machines (RMs), finite-state automata that can encode non-Markovian reward signals. According to the summary, traditional approaches often struggle with such signals and fail to synthesize Pareto-optimal policies efficiently. The work is a preprint and has not undergone peer review, so its claims still await independent validation.
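To make the notion of a non-Markovian, multi-objective reward concrete, here is a minimal sketch of a reward machine that emits a reward vector on each labelled transition. The task (fetch coffee, then reach the office), the labels, the state names, and the two objectives are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass, field


@dataclass
class RewardMachine:
    """Finite-state machine that emits a reward vector per labelled transition."""
    initial: str
    # Maps (rm_state, label) -> (next_rm_state, reward_vector).
    transitions: dict = field(default_factory=dict)

    def step(self, u, label):
        # Assumption: unlisted (state, label) pairs self-loop with zero reward.
        return self.transitions.get((u, label), (u, (0.0, 0.0)))


# Hypothetical task: objective 0 rewards delivering coffee to the office,
# objective 1 penalises every visit to a hazard.
rm = RewardMachine(
    initial="u0",
    transitions={
        ("u0", "coffee"): ("u1", (0.0, 0.0)),
        ("u1", "office"): ("done", (1.0, 0.0)),
        ("u0", "hazard"): ("u0", (0.0, -1.0)),
        ("u1", "hazard"): ("u1", (0.0, -1.0)),
    },
)


def run(machine, labels):
    """Feed a label sequence through the machine and sum the reward vectors."""
    u, total = machine.initial, [0.0, 0.0]
    for label in labels:
        u, reward = machine.step(u, label)
        total = [t + r for t, r in zip(total, reward)]
    return u, tuple(total)
```

In this sketch, reaching the office pays off only after coffee has been collected, so the reward depends on history rather than on the current observation alone. Tracking the RM state alongside the environment state restores the Markov property.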

Method
PQLRM combines Pareto Q-Learning (PQL) with Q-Learning with Reward Machines (QRM). PQL maintains sets of vector-valued Q-estimates to approximate the Pareto front, so multiple objectives are optimized simultaneously. PQLRM exploits the factored automaton structure of the RM: it uses a multi-policy framework to learn from RM-encoded rewards while remaining sample-efficient. The authors do not disclose training compute requirements. The algorithm is designed for the cross-product Markov Decision Process (MDP) generated by the RM, whose states pair an environment state with an RM state.
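The set-valued Q-estimates described above can be sketched as follows. This is an illustrative, deterministic set-based backup in the spirit of Pareto Q-learning, not the authors' PQLRM update rule. The function names, the discount factor, and the keying of `q_sets` by (environment state, RM state, action) are assumptions made for the example.

```python
def dominates(a, b):
    """True if a is at least as good as b everywhere and strictly better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def non_dominated(vectors):
    """Keep only the vectors that no other vector in the collection dominates."""
    unique = set(vectors)
    return {v for v in unique if not any(dominates(w, v) for w in unique)}


def backup(reward, next_q_sets, gamma=0.9):
    """Set-valued backup: reward + gamma * ND(union of next-state action sets)."""
    candidates = set().union(*next_q_sets)
    # Assumption: a state with no successor estimates is worth the zero vector.
    front = non_dominated(candidates) or {tuple(0.0 for _ in reward)}
    return {tuple(r + gamma * v for r, v in zip(reward, vec)) for vec in front}


# Hypothetical table on the cross-product MDP, keyed by (env_state, rm_state, action).
q_sets = {
    ("s1", "u1", "left"): {(2.0, 0.0)},
    ("s1", "u1", "right"): {(0.0, 2.0), (1.0, 0.0)},
}

# Back up into (s0, u0, "go"), assuming that action leads to (s1, u1).
q_sets[("s0", "u0", "go")] = backup(
    reward=(1.0, 0.0),
    next_q_sets=[q_sets[("s1", "u1", "left")], q_sets[("s1", "u1", "right")]],
    gamma=0.5,
)
```

Because each entry is a set rather than a single vector, one table can represent several trade-offs at once. Here the dominated vector `(1.0, 0.0)` is pruned before the backup, leaving two incomparable options for the predecessor state.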

Results
The authors report that PQLRM converges significantly faster than a naive PQL baseline applied to the cross-product MDP, and that it can synthesize Pareto-optimal policies that QRM alone cannot. The abstract provides no numerical results, so the size of the improvement in convergence speed and policy quality cannot be judged from it.

Limitations
The authors acknowledge that, although PQLRM improves on naive PQL, it may still struggle in complex environments where the RM structure becomes intricate. Its reliance on the RM's factored automaton structure may also limit it to tasks that can be expressed in that representation. The paper does not address scalability or the computational overhead of maintaining multiple Q-estimates, which could affect performance in resource-constrained settings.

Why it matters
PQLRM is relevant to multi-objective reinforcement learning in applications where reward structures are complex and non-Markovian. By improving sample efficiency and policy synthesis, it could support learning in real-world settings such as robotics and autonomous systems, where multiple objectives must be balanced. The work, posted as a preprint on arXiv, also sets the stage for future research on integrating reward machines with other learning algorithms.

By Callan Zhang · Jun 17, 2026

Summarised from the primary source with AI assistance under human editorial oversight. Turing Wire is not a primary source — read the original for the authoritative account.

Source: arXiv cs.AI