Predictive Style Matching

Natural and Robust Humanoid Locomotion

Simeon Nedelchev1,2, Eduard Zaliaev3, Ekaterina Chaikovskaia1, Egor Davydenko1, Roman Gorbachev1
1Moscow Institute of Physics and Technology (MIPT)
2Innopolis University  ·  3Sber Robotics Center

2026 · Preprint

Abstract

Reinforcement learning has become the prevailing approach to humanoid locomotion control: policies transfer reliably from simulation to hardware and recover gracefully from disturbances. Motion quality, however, still lags behind. Task-only rewards often converge to stiff, asymmetric gaits, while motion-imitation methods improve appearance but become more sensitive to external disturbances, because reference signals can oppose the transient poses needed to regain balance.

We propose Predictive Style Matching (PSM), in which an offline predictor maps the robot's lower-body state history and velocity commands to interpretable upper-body joint and gait targets that shape the rewards during training. Because the targets are state-conditioned rather than time-indexed and the predictor is used only at training time, the deployed controller inherits the proprioceptive interface and inference cost of a task-only RL baseline.

On the Unitree G1, in both simulation and hardware, PSM reduces upper-body style error by roughly an order of magnitude over task-only RL while preserving its fall-recovery rate, whereas the motion-imitation baseline attains the lowest style error but fails to recover from disturbances about five times as often.

Method

PSM keeps rich style supervision during training while preserving a policy-only deployment interface. An offline predictor fφ is trained on retargeted human walking, frozen, and queried only inside simulation to supply matching rewards. The deployed actor πθ uses the same proprioceptive observations as vanilla task-only RL, with no clip phase, reference pose, or predictor at run time.

Stage 1

Offline motion predictor

Supervised learning fits fφ on retargeted walking clips from corpus D. Lower-body joint history, root-frame foot positions, yaw-aligned velocities, and short-horizon velocity command cues form the input; the network predicts 17 upper-body joint angles and 8 gait descriptors (step length/width, foot orientation, root height, duty cycle) over a short horizon. Training minimizes reconstruction loss with a left–right symmetry term, then the predictor is frozen.
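The combination of a reconstruction loss with a left–right symmetry term can be illustrated with a small NumPy sketch. The exact form of the symmetry term is not given here, so the version below assumes an equivariance penalty: predicting from a mirrored input should give the mirrored prediction. The channel layout, the `mirror` helper, the weight `w_sym`, and the two toy predictors are hypothetical and stand in for the real GRU–MLP and joint ordering.

```python
import numpy as np

def mirror(x, perm, sign):
    """Reflect left/right: reorder channels with perm, then flip lateral signs."""
    return np.asarray(x)[..., perm] * sign

def predictor_loss(predict, x, y, in_mirror, out_mirror, w_sym=0.5):
    """Reconstruction MSE plus an assumed left-right equivariance penalty.

    in_mirror / out_mirror are (perm, sign) pairs for inputs and targets.
    """
    y_hat = predict(x)
    recon = np.mean((y_hat - y) ** 2)
    # A symmetric predictor should commute with mirroring:
    # predict(mirror(x)) == mirror(predict(x)).
    y_from_mirrored = predict(mirror(x, *in_mirror))
    sym = np.mean((y_from_mirrored - mirror(y_hat, *out_mirror)) ** 2)
    return recon + w_sym * sym, recon, sym

# Toy layout (hypothetical): inputs [L_hip, R_hip, L_knee, R_knee],
# targets [L_shoulder, R_shoulder]; these channels need no sign flips.
in_mirror = (np.array([1, 0, 3, 2]), np.ones(4))
out_mirror = (np.array([1, 0]), np.ones(2))

def symmetric_predictor(x):
    # Each shoulder depends only on its same-side leg joints, identically.
    return x[..., :2] + x[..., 2:]

def biased_predictor(x):
    # Treats the right side differently, so mirroring changes its output.
    return np.stack([x[..., 0] + x[..., 2], 0.5 * x[..., 1] + x[..., 3]], axis=-1)

rng = np.random.default_rng(0)
x = rng.normal(size=(32, 4))
y = symmetric_predictor(x)  # stand-in for retargeted upper-body targets

_, recon_sym, sym_sym = predictor_loss(symmetric_predictor, x, y, in_mirror, out_mirror)
_, recon_bias, sym_bias = predictor_loss(biased_predictor, x, y, in_mirror, out_mirror)
```

In this toy setup the symmetric predictor incurs no symmetry penalty, while the biased one is penalized even where its left-side output is correct, which is the behavior a symmetry term is meant to encourage.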

Figure: Stage 1 predictor training pipeline. Offline supervised learning of fφ: conditioning inputs from lower-body history and commands; GRU–MLP architecture trained with reconstruction and symmetry losses.

Stage 2

RL with predictive style matching

PPO trains πθ in massively parallel MuJoCo simulation with standard locomotion rewards rloco. Each step, the frozen fφ is queried on rolling buffers and the first forecast defines exponential matching rewards on upper-body joints and gait scalars. A locomotion-first curriculum ramps the matching weight only after balance and recovery are established; only πθ is exported for hardware.
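A minimal sketch of how exponential matching rewards and a locomotion-first weight schedule might be combined is given below. It is illustrative only: the kernel widths `sigma_q` and `sigma_g`, the step-based linear ramp, and all function names are assumptions, and the actual curriculum may be triggered by balance and recovery performance instead of a step count.

```python
import numpy as np

def exp_matching_reward(value, target, sigma):
    """exp(-||value - target||^2 / sigma^2): 1 at a perfect match, decaying toward 0."""
    diff = np.asarray(value, dtype=float) - np.asarray(target, dtype=float)
    err = np.sum(diff ** 2, axis=-1)
    return np.exp(-err / sigma ** 2)

def style_weight(step, start_step, ramp_steps, w_max=1.0):
    """Assumed schedule: zero until start_step, then a linear ramp up to w_max."""
    frac = np.clip((step - start_step) / ramp_steps, 0.0, 1.0)
    return float(w_max * frac)

def shaped_reward(r_loco, q_upper, q_target, gait, gait_target, step,
                  sigma_q=0.5, sigma_g=0.2, start_step=1000, ramp_steps=500):
    """Locomotion reward plus curriculum-weighted style matching terms."""
    r_style = (exp_matching_reward(q_upper, q_target, sigma_q)
               + exp_matching_reward(gait, gait_target, sigma_g))
    return r_loco + style_weight(step, start_step, ramp_steps) * r_style

# 17 upper-body joint targets and 8 gait scalars, as predicted by the frozen model.
q_target = np.zeros(17)
gait_target = np.zeros(8)

# Perfect style match: the style terms contribute nothing early and fully later.
early = shaped_reward(1.0, q_target, q_target, gait_target, gait_target, step=0)
late = shaped_reward(1.0, q_target, q_target, gait_target, gait_target, step=5000)
```

Because the style weight is zero early on, the policy is first optimized for the locomotion reward alone. The matching terms only start to influence the gradient once the ramp begins.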

Figure: Stage 2 RL training pipeline. PPO with frozen fφ: predicted targets shape training-time matching rewards added to locomotion rewards; deployment keeps a standard proprioceptive actor without running the predictor.

Deployment

Hardware locomotion

On the Unitree G1, the exported policy autonomously coordinates arm swing, pelvis motion, and stepping under fixed velocity commands, with the same inference cost as a vanilla task-only controller.

Video: Example locomotion

Video: Disturbance recovery (short)

Results

We compare vanilla RL, clip tracking (BeyondMimic on the same clips), and PSM on the Unitree G1 in simulation and hardware under matched MDP, randomization, curriculum, and disturbance schedules. PSM reduces the upper-body dynamic time warping (DTW) error by roughly 8× vs. vanilla RL while keeping its fall rate within one standard deviation of vanilla RL's; clip tracking attains the lowest style error but falls about 5× more often under pushes.

Naturalness

Figure: Naturalness vs. command scenario. Simulation batch without pushes: upper-body DTW to the reference motion (log scale, top) and velocity RMSE (bottom). The velocity RMSE indicates that all methods solve the same locomotion task.
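For readers unfamiliar with the DTW metric, the sketch below shows the textbook dynamic-programming recurrence applied to joint-angle sequences. It is not the evaluation code behind these results: the per-frame Euclidean distance and the absence of any path-length normalization are assumptions, and `dtw_distance` is a hypothetical name.

```python
import numpy as np

def dtw_distance(a, b):
    """Classic DTW cost between sequences of shape (T, D) or (T,)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    if a.ndim == 1:
        a = a[:, None]
    if b.ndim == 1:
        b = b[:, None]
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])  # per-frame distance
            # Extend the cheapest of: insertion, deletion, or a diagonal match.
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return float(cost[n, m])

reference = [0.0, 1.0, 2.0, 3.0]
slowed = [0.0, 0.0, 1.0, 2.0, 2.0, 3.0]  # same shape, different timing
offset = [0.0, 1.0, 3.0]                 # last frame differs by 1

same_shape_cost = dtw_distance(reference, slowed)
offset_cost = dtw_distance([0.0, 1.0, 2.0], offset)
```

Because the alignment may stretch or compress time, a motion that follows the reference shape at a different cadence scores zero here, whereas a genuine pose difference accumulates cost. This is why DTW suits comparing policies that are not time-locked to a clip.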

Robustness

Figure: Robustness vs. disturbance scenario. The same trajectories under impulsive pushes and velocity kicks: fall rate (top) and task recovery time Tvel (bottom), by disturbance type.
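The two robustness metrics could be computed from rollout logs roughly as follows. The definitions here are assumptions, since the page does not spell them out: a fall is taken as a per-episode boolean flag, and Tvel is taken as the time from the push until the velocity tracking error stays below a tolerance for a fixed number of consecutive steps. The names `fall_rate`, `recovery_time`, `tol`, and `hold_steps` are hypothetical.

```python
import numpy as np

def fall_rate(fell):
    """Fraction of episodes flagged as falls (fell is a sequence of booleans)."""
    return float(np.mean(np.asarray(fell, dtype=float)))

def recovery_time(vel_err, push_idx, dt, tol, hold_steps):
    """Seconds from the push until |vel_err| stays below tol for hold_steps steps.

    Returns inf if tracking never settles within the logged window.
    """
    ok = np.abs(np.asarray(vel_err, dtype=float)[push_idx:]) < tol
    run = 0
    for k, good in enumerate(ok):
        run = run + 1 if good else 0
        if run >= hold_steps:
            # Recovery is dated to the first step of the settled run.
            return (k - hold_steps + 1) * dt
    return float("inf")

# Toy log: push at index 2, tracking error decays back under tolerance.
vel_err = [0.0, 0.0, 0.5, 0.4, 0.3, 0.05, 0.04, 0.03, 0.02]
t_vel = recovery_time(vel_err, push_idx=2, dt=0.02, tol=0.1, hold_steps=3)
rate = fall_rate([True, False, False, False])
```

Requiring several consecutive in-tolerance steps avoids counting a momentary zero crossing of the error as a recovery.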

Full Videos

Extended hardware recordings for walking and disturbance scenarios.

Video: Walking (full)

Video: Disturbance (full)

Code

A first-draft implementation is available in the GitHub repository: the repo root holds offline PsmPredictor training, and RL runs via the mjlab task Psm-G1. This release is a work in progress; APIs, defaults, and docs may change. See the README for installation and usage. Sim2real deployment follows unitree_rl_mjlab (mjlab + Unitree G1). Questions: simkaned@gmail.com.