Large Language Models (LLMs) are emerging as promising tools for automated reinforcement learning (RL) reward design, owing to their robust capabilities in commonsense reasoning and code generation. By engaging in dialogues with RL agents, LLMs construct a Reward Observation Space (ROS) by selecting relevant environment states and defining their internal operations. However, existing frameworks have not effectively leveraged historical exploration data or manual task descriptions to iteratively evolve this space. In this paper, we propose a novel heuristic framework that enhances LLM-driven reward design by evolving the ROS through a table-based exploration caching mechanism and a text-code reconciliation strategy. Our framework introduces a state execution table, which tracks the historical usage and success rates of environment states, overcoming the Markovian constraint typically found in LLM dialogues and facilitating more effective exploration. Furthermore, we reconcile user-provided task descriptions with expert-defined success criteria using structured prompts, ensuring alignment in reward design objectives. Comprehensive evaluations on benchmark RL tasks demonstrate the effectiveness and stability of the proposed framework.
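To make the table-based exploration caching mechanism concrete, the sketch below gives a minimal, hypothetical illustration of a state execution table; the class and method names (`StateExecutionTable`, `record`, `summary_prompt`) are our own placeholders, not the paper's implementation. It records how often each candidate environment state has been selected into a proposed reward and how often the resulting policy succeeded, then serializes that history into text that can be appended to the next reward-design prompt, so the dialogue is no longer limited to the most recent exchange.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class StateStats:
    """Per-state exploration statistics accumulated across reward-design iterations."""
    uses: int = 0        # times the state was selected into the reward observation space
    successes: int = 0   # times a reward using this state led to a successful policy

    @property
    def success_rate(self) -> float:
        return self.successes / self.uses if self.uses else 0.0


@dataclass
class StateExecutionTable:
    """Hypothetical sketch of the table-based exploration cache: it persists which
    environment states past reward candidates used and how well they performed."""
    stats: Dict[str, StateStats] = field(default_factory=dict)

    def record(self, selected_states: List[str], succeeded: bool) -> None:
        """Update usage/success counts after evaluating one LLM-proposed reward."""
        for name in selected_states:
            entry = self.stats.setdefault(name, StateStats())
            entry.uses += 1
            entry.successes += int(succeeded)

    def summary_prompt(self) -> str:
        """Serialize the history into plain text for the next reward-design prompt,
        giving the LLM memory beyond the current dialogue turn."""
        lines = ["Historical usage of environment states in previous rewards:"]
        for name, s in sorted(self.stats.items(), key=lambda kv: -kv[1].success_rate):
            lines.append(f"- {name}: used {s.uses}x, success rate {s.success_rate:.0%}")
        return "\n".join(lines)


# Example: after two design iterations, surface the cached history to the LLM.
table = StateExecutionTable()
table.record(["object_pos", "fingertip_pos"], succeeded=False)
table.record(["object_pos", "object_rot", "goal_rot"], succeeded=True)
print(table.summary_prompt())
```

A summary of this form would be combined with the reconciled task description and success criteria when prompting the LLM for the next ROS proposal; the state names above are illustrative only.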
In this demo, we visualize the best reward function produced by our framework for each environment (shown unmodified) together with the policy trained using that reward. Our environment suite spans 20 distinct tasks from the Bidexterous Manipulation (Dexterity) benchmark.
Dexterity
The reward function for the selected task will be shown here.
@article{ma2024xxxxx,
title = {Boosting Universal LLM Reward Design through Heuristic Reward Observation Space Evolution},
author = {Zen Kit Heng and Zimeng Zhao and Tianhao Wu and Yuanfei Wang and Mingdong Wu and Yangang Wang and Hao Dong},
year = {2024},
journal = {arXiv preprint arXiv: Coming Soon}
}