Predictive Style Matching: Natural and Robust Humanoid Locomotion

Simeon Nedelchev¹,², Eduard Zaliaev³, Ekaterina Chaikovskaia¹, Egor Davydenko¹, Roman Gorbachev¹

¹Moscow Institute of Physics and Technology (MIPT)
²Innopolis University · ³Sber Robotics Center

2026 · Preprint

Abstract

Reinforcement learning has become the prevailing approach to humanoid locomotion control: policies transfer reliably from simulation to hardware and recover gracefully from disturbances. Motion quality, however, still lags behind. Task-only rewards often converge to stiff, asymmetric gaits. Motion-imitation methods improve appearance but are more sensitive to external disturbances, because reference signals can oppose the transient poses needed to regain balance.

We propose Predictive Style Matching (PSM), in which an offline predictor maps the robot's lower-body state history and velocity commands to interpretable upper-body joint and gait targets that shape the rewards during training. Because the targets are state-conditioned rather than time-indexed and the predictor is used only at training time, the deployed controller inherits the proprioceptive interface and inference cost of a task-only RL baseline.

On the Unitree G1, in both simulation and hardware, PSM reduces upper-body style error by roughly 8× relative to task-only RL while keeping its fall rate within one standard deviation of that baseline, whereas the motion-imitation baseline attains the lowest style error but falls about five times as often under disturbances.

Method

PSM keeps rich style supervision during training while preserving a policy-only deployment interface. An offline predictor f_φ is trained on retargeted human walking, frozen, and queried only inside simulation to supply matching rewards; the deployed actor π_θ uses the same proprioceptive observations as vanilla RL—no clip phase, reference pose, or predictor at run time.

Stage 1

Offline motion predictor

Supervised learning fits f_φ on retargeted walking clips from corpus D. Lower-body joint history, root-frame foot positions, yaw-aligned velocities, and short-horizon velocity command cues form the input; the network predicts 17 upper-body joint angles and 8 gait descriptors (step length/width, foot orientation, root height, duty cycle) over a short horizon. Training minimizes reconstruction loss with a left–right symmetry term, then the predictor is frozen.

Stage 1 predictor training pipeline — Offline supervised learning of f_φ: conditioning inputs from lower-body history and commands; GRU–MLP architecture trained with reconstruction and symmetry losses.
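One way to read "reconstruction loss with a left–right symmetry term" is sketched below. This is a minimal illustration under stated assumptions, not the authors' loss: the symmetry term here is an equivariance penalty (predicting from a mirrored input should give the mirrored prediction), and the names `mirror`, `style_loss`, `sym_weight`, and the permutation/sign tables are all hypothetical.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def mse(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean squared error between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def mirror(v: Sequence[float], perm: Sequence[int], sign: Sequence[float]) -> Vector:
    """Left-right mirror: out[i] = sign[i] * v[perm[i]].

    `perm` swaps left/right channels; `sign` flips channels whose value
    changes sign under reflection. Both tables are assumptions that would
    depend on the robot's joint ordering.
    """
    return [s * v[p] for p, s in zip(perm, sign)]


def style_loss(
    predict: Callable[[Vector], Vector],
    x: Vector,
    y: Vector,
    in_perm: Sequence[int],
    in_sign: Sequence[float],
    out_perm: Sequence[int],
    out_sign: Sequence[float],
    sym_weight: float = 0.1,  # hypothetical weighting
) -> float:
    """Reconstruction error plus a mirror-equivariance penalty."""
    y_hat = predict(x)
    recon = mse(y_hat, y)
    # Predicting from the mirrored input should match the mirrored prediction.
    y_hat_from_mirrored = predict(mirror(x, in_perm, in_sign))
    sym = mse(y_hat_from_mirrored, mirror(y_hat, out_perm, out_sign))
    return recon + sym_weight * sym


# Toy two-channel example: channel 0 is "left", channel 1 is "right".
SWAP, KEEP = [1, 0], [1.0, 1.0]


def symmetric_predictor(x: Vector) -> Vector:
    return [x[1], x[0]]  # each arm follows the opposite leg


def asymmetric_predictor(x: Vector) -> Vector:
    return [x[0], 0.0]  # ignores one side entirely


x_demo, y_demo = [0.2, -0.4], [-0.4, 0.2]
loss_sym = style_loss(symmetric_predictor, x_demo, y_demo, SWAP, KEEP, SWAP, KEEP)
loss_asym = style_loss(asymmetric_predictor, x_demo, y_demo, SWAP, KEEP, SWAP, KEEP)
```

In this toy case the symmetric predictor incurs no penalty, while the asymmetric one pays both a reconstruction and a symmetry cost. In practice `predict` would be the GRU–MLP and the loss would be averaged over a batch and over the prediction horizon.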

Stage 2

RL with predictive style matching

PPO trains π_θ in massively parallel MuJoCo simulation with standard locomotion rewards r^loco. Each step, the frozen f_φ is queried on rolling buffers and the first forecast defines exponential matching rewards on upper-body joints and gait scalars. A locomotion-first curriculum ramps the matching weight only after balance and recovery are established; only π_θ is exported for hardware.
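The exponential matching reward can be sketched as follows. The squared-error kernel, the scale `sigma`, and the function names are assumptions for illustration; the source states only that the first forecast defines exponential matching rewards on upper-body joints and gait scalars.

```python
import math
from typing import List, Sequence


def exp_matching_reward(
    actual: Sequence[float], target: Sequence[float], sigma: float
) -> float:
    """Return exp(-||actual - target||^2 / sigma^2), a value in (0, 1].

    The reward is 1 when the robot matches the predicted target exactly and
    decays smoothly as the error grows. `sigma` (hypothetical) sets how
    quickly the reward falls off.
    """
    if len(actual) != len(target):
        raise ValueError("actual and target must have the same length")
    if sigma <= 0.0:
        raise ValueError("sigma must be positive")
    sq_err = sum((a - t) ** 2 for a, t in zip(actual, target))
    return math.exp(-sq_err / sigma**2)


def reward_from_forecast(
    actual: Sequence[float], forecast: List[Sequence[float]], sigma: float
) -> float:
    """Use only the first step of a short-horizon forecast as the target."""
    return exp_matching_reward(actual, forecast[0], sigma)


# Toy forecast over a two-step horizon for two upper-body joints (radians).
forecast_demo = [[0.10, -0.20], [0.15, -0.25]]
r_perfect = reward_from_forecast([0.10, -0.20], forecast_demo, sigma=0.5)
r_off = reward_from_forecast([0.10, 0.30], forecast_demo, sigma=0.5)
```

Because the target comes from the robot's own recent state rather than a clip clock, a push that changes the lower-body history also changes the target the upper body is rewarded for matching.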

Stage 2 RL training pipeline — PPO with frozen f_φ: predicted targets shape training-time matching rewards added to locomotion rewards; deployment keeps a standard proprioceptive actor without running the predictor.
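A locomotion-first curriculum of this kind could look like the sketch below. The gating metric (`success_rate`), the threshold, the linear ramp, and the class name are all hypothetical; the source states only that the matching weight is ramped after balance and recovery are established.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class MatchingCurriculum:
    """Keeps the matching weight at zero until a gate is passed, then ramps it."""

    gate_success_rate: float = 0.9  # hypothetical threshold on a recovery metric
    ramp_iters: int = 1000          # hypothetical ramp length in training iterations
    max_weight: float = 1.0         # final weight on the matching reward
    _start_iter: Optional[int] = field(default=None, init=False)

    def weight(self, iteration: int, success_rate: float) -> float:
        if self._start_iter is None:
            if success_rate < self.gate_success_rate:
                return 0.0  # locomotion only: balance not yet established
            self._start_iter = iteration  # latch the start of the ramp
        progress = min(1.0, (iteration - self._start_iter) / self.ramp_iters)
        return self.max_weight * progress


def total_reward(r_loco: float, r_match: float, w_match: float) -> float:
    """Locomotion reward plus the weighted training-time matching reward."""
    return r_loco + w_match * r_match


curriculum = MatchingCurriculum(gate_success_rate=0.9, ramp_iters=100, max_weight=2.0)
w_before = curriculum.weight(iteration=0, success_rate=0.5)
w_at_gate = curriculum.weight(iteration=10, success_rate=0.95)
w_half = curriculum.weight(iteration=60, success_rate=0.95)
w_full = curriculum.weight(iteration=500, success_rate=0.2)
```

In this sketch the ramp latches once the gate is passed, so a later dip in the success metric does not reset the weight. Whether the actual schedule latches or can back off is not stated in the source.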

Deployment

Hardware locomotion

On the Unitree G1, the exported policy autonomously coordinates arm swing, pelvis motion, and stepping under fixed velocity commands, with the same inference cost as a vanilla task-only controller.

Example locomotion

Disturbance recovery (short)

Results

We compare vanilla RL, clip tracking (BeyondMimic on the same clips), and PSM on the Unitree G1 in simulation and hardware under matched MDP, randomization, curriculum, and disturbance schedules. PSM reduces upper-body dynamic time warping (DTW) error by roughly 8× relative to vanilla RL while staying within one standard deviation on fall rate; clip tracking attains the lowest style error but falls about 5× more often under pushes.
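For readers unfamiliar with the metric, a textbook dynamic time warping distance between two joint-angle trajectories is sketched below. The Euclidean per-frame cost and the absence of path-length normalization or a window constraint are assumptions; the exact DTW variant used in the evaluation is not specified here.

```python
import math
from typing import List, Sequence

Frame = Sequence[float]


def dtw_distance(a: List[Frame], b: List[Frame]) -> float:
    """Classic O(len(a) * len(b)) DTW with Euclidean frame-to-frame cost."""
    n, m = len(a), len(b)
    if n == 0 or m == 0:
        raise ValueError("both trajectories must be non-empty")
    # acc[i][j] = minimal accumulated cost aligning a[:i] with b[:j].
    acc = [[math.inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = math.dist(a[i - 1], b[j - 1])
            acc[i][j] = cost + min(
                acc[i - 1][j],      # a advances, b holds
                acc[i][j - 1],      # b advances, a holds
                acc[i - 1][j - 1],  # both advance
            )
    return acc[n][m]


# The same motion played at a different speed aligns with zero cost.
slow = [[0.0], [1.0], [1.0], [2.0]]
fast = [[0.0], [1.0], [2.0]]
d_same_shape = dtw_distance(fast, slow)

# A genuinely different end pose leaves a residual cost.
d_different = dtw_distance([[0.0], [1.0]], [[0.0], [2.0]])
```

DTW tolerates timing differences between the robot's motion and the human reference, which makes it a reasonable style metric for a policy that is not locked to a clip phase.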

Full Videos

Extended hardware recordings for walking and disturbance scenarios.

Walking (full)

Disturbance (full)

Code

A first-draft implementation is in the GitHub repository (repo root: offline PsmPredictor training; RL via the mjlab task Psm-G1). This release is a work in progress: APIs, defaults, and docs may change. See the README for installation and usage. Sim-to-real deployment follows unitree_rl_mjlab (mjlab + Unitree G1). Questions: simkaned@gmail.com.