ChatDyn🗣️
ChatDyn: Language-Driven Multi-Actor Dynamics Generation in Street Scenes

† denotes corresponding author.
1 Shanghai Jiao Tong University, 2 Shanghai AI Laboratory, 3 Max Planck Institute for Intelligent Systems, 4 ETH Zurich, 5 The University of Hong Kong
arXiv

Result 1

Language command: "A person is taking a taxi at the left roadside, and a vehicle overtakes the taxi. Two persons are walking together with one's arm around another's shoulder. A person is chasing another person along the roadside."

Result 2

Language command: "A person pushes another person, and another person is making a phone call, then walks along the roadside. A vehicle turns right at the intersection, and a hurried vehicle overtakes a stationary one."

Abstract

Generating realistic and interactive dynamics of traffic participants according to specific instructions is critical for street scene simulation. However, there is currently no comprehensive method that generates realistic dynamics for different types of participants, including vehicles and pedestrians, with different kinds of interactions between them. In this paper, we introduce ChatDyn, the first system capable of generating interactive, controllable, and realistic participant dynamics in street scenes based on language instructions. To achieve precise control through complex language, ChatDyn employs a multi-LLM-agent role-playing approach, which uses natural language inputs to plan the trajectories and behaviors of different traffic participants. To generate realistic fine-grained dynamics based on this planning, ChatDyn designs two novel executors: the PedExecutor, a unified multi-task executor that generates realistic pedestrian dynamics under different task plans; and the VehExecutor, a physical-transition-based policy that generates physically plausible vehicle dynamics. Extensive experiments show realistic generation results under complex commands and validate the effectiveness of each component.

Teaser.

Overview

ChatDyn interprets and analyzes the user's language instructions, then produces scene dynamics that align with them. It employs a two-stage process: high-level planning, which plans trajectories and behaviors from complex and abstract commands; and low-level generation, which produces fine-grained, realistic dynamics. Since user instructions may contain many specific details that require precise control, as well as abstract semantics that must be understood, ChatDyn leverages multi-LLM-agent role-playing, treating each traffic participant as an LLM agent. This approach capitalizes on the LLM's ability to comprehend semantic information and on its extensive commonsense priors, using specific tools and an interaction process to complete high-level trajectory and behavior planning. Each traffic participant's agent is also equipped with an executor as one of its tools. Once high-level planning is complete, the executor uses the planning results to carry out low-level generation, producing fine-grained, realistic, and physically feasible dynamics.
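The two-stage flow (one agent per participant, plan first, then execute with the agent's executor tool) can be pictured with a minimal Python sketch. It is an illustration under stated assumptions, not ChatDyn's implementation: the `Plan` and `ParticipantAgent` classes, `run_scene`, and `interpolate_executor` are hypothetical names, the LLM call is replaced by a stub `planner` callable, and the inter-agent interaction process is omitted entirely.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

Waypoint = Tuple[float, float]


@dataclass
class Plan:
    # Hypothetical high-level output for one participant.
    trajectory: List[Waypoint]
    behavior: str


@dataclass
class ParticipantAgent:
    # One role-playing agent per traffic participant (hypothetical structure).
    name: str
    kind: str  # "pedestrian" or "vehicle"
    planner: Callable[[str], Plan]  # stub standing in for an LLM call
    executor: Callable[[Plan], List[Waypoint]]  # the agent's executor tool
    plan: Optional[Plan] = None

    def high_level(self, instruction: str) -> None:
        self.plan = self.planner(instruction)

    def low_level(self) -> List[Waypoint]:
        if self.plan is None:
            raise RuntimeError("high-level planning must run first")
        return self.executor(self.plan)


def interpolate_executor(plan: Plan, steps: int = 2) -> List[Waypoint]:
    # Toy executor: densify the planned waypoints by linear interpolation.
    pts = plan.trajectory
    out: List[Waypoint] = []
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        for k in range(steps):
            t = k / steps
            out.append((x0 + t * (x1 - x0), y0 + t * (y1 - y0)))
    out.append(pts[-1])
    return out


def run_scene(instruction: str, agents: List[ParticipantAgent]) -> Dict[str, List[Waypoint]]:
    # Stage 1: every agent plans from the shared instruction.
    for agent in agents:
        agent.high_level(instruction)
    # Stage 2: each agent's executor turns its plan into dynamics.
    return {agent.name: agent.low_level() for agent in agents}


agents = [
    ParticipantAgent("ped_0", "pedestrian",
                     lambda s: Plan([(0.0, 0.0), (2.0, 0.0)], "walk"),
                     interpolate_executor),
    ParticipantAgent("veh_0", "vehicle",
                     lambda s: Plan([(0.0, 5.0), (0.0, 9.0)], "drive"),
                     interpolate_executor),
]
result = run_scene("A person walks along the roadside while a vehicle drives past.", agents)
print(result["ped_0"])  # [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
```

Keeping planning and execution as separate passes mirrors the split described above: every plan exists before any executor runs, so an executor could in principle consult other agents' plans.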


Method.

PedExecutor

The pedestrian executor (PedExecutor) generates low-level pedestrian dynamics from the trajectories and behaviors produced by high-level planning. Pedestrian behaviors can be divided into single-agent behaviors directly specified by language and interactive behaviors that occur between multiple agents. The challenge therefore lies in simultaneously handling trajectory following, single-agent motion specification, and multi-agent interactions while maintaining human-like quality. To achieve this, PedExecutor uses unified multi-task training so that a single policy executes the trajectories, single-agent behaviors, and multi-agent interactions planned by the LLM. For human-like quality, the action space incorporates hierarchical control to provide priors, while the reward function uses body-masked AMP (Adversarial Motion Priors) to encourage human-like control. Ultimately, PedExecutor returns realistic dynamics that follow the planned trajectory and complete the desired behaviors.
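One way to picture a body-masked AMP reward is a weighted sum of a task term and a style term, where the style discriminator only sees the unmasked body parts. The sketch below makes several assumptions not stated on this page. It uses the least-squares AMP style reward max(0, 1 - 0.25 (d - 1)^2) from the original AMP formulation, and it applies the mask by zeroing features. The weights `w_task`/`w_style`, the helper names, and the toy discriminator are all hypothetical.

```python
from typing import Callable, List, Sequence


def apply_body_mask(features: Sequence[float], mask: Sequence[bool]) -> List[float]:
    # Zero out features of body parts the style prior should not judge
    # (hypothetically, joints driven by a language-specified behavior);
    # zeroing keeps the discriminator input size fixed.
    return [f if keep else 0.0 for f, keep in zip(features, mask)]


def amp_style_reward(d: float) -> float:
    # Least-squares AMP style reward: 1.0 when the discriminator outputs 1
    # ("looks like reference motion"), clipped below at 0.
    return max(0.0, 1.0 - 0.25 * (d - 1.0) ** 2)


def combined_reward(task_reward: float,
                    transition_features: Sequence[float],
                    mask: Sequence[bool],
                    discriminator: Callable[[List[float]], float],
                    w_task: float = 0.5,
                    w_style: float = 0.5) -> float:
    # Weighted sum of the task term (e.g. trajectory, behavior, or
    # interaction progress) and the masked style term.
    d = discriminator(apply_body_mask(transition_features, mask))
    return w_task * task_reward + w_style * amp_style_reward(d)


def toy_disc(x: List[float]) -> float:
    # Toy discriminator for illustration only: the mean of its inputs.
    return sum(x) / len(x)


r = combined_reward(task_reward=1.0,
                    transition_features=[1.0, 1.0, 3.0, 3.0],
                    mask=[True, True, False, False],
                    discriminator=toy_disc)
print(r)  # 0.96875
```

Without the mask, the toy discriminator would score the same transition at 2.0 and the total reward would drop to 0.875. This shows how masking keeps unconstrained body parts from dragging down the style term.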


Method.

VehExecutor

The vehicle executor (VehExecutor) uses a control policy to generate the final realistic and physically feasible vehicle dynamics from the high-level planned trajectory, which may initially violate certain dynamic constraints. To incorporate physical constraints and achieve precise control, VehExecutor uses a physics-based transition environment, combined with a history-aware design of the state and action spaces. The final dynamics are obtained by accumulating the vehicle's position and heading from the environment.
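To make "accumulating the vehicle's position and heading from the environment" concrete, here is a minimal sketch of a physics-based transition. It assumes a kinematic bicycle model as a stand-in for the environment and an action of (acceleration, steering angle); the page does not specify either. The policy that would choose actions is omitted. The wheelbase, time step, steering limit, history length, and all function names are hypothetical.

```python
import math
from collections import deque
from dataclasses import dataclass
from typing import Deque, List, Sequence, Tuple

Pose = Tuple[float, float, float]  # (x, y, heading)


@dataclass
class VehicleState:
    x: float = 0.0
    y: float = 0.0
    heading: float = 0.0  # radians
    speed: float = 0.0    # m/s


def bicycle_step(s: VehicleState, accel: float, steer: float,
                 dt: float = 0.1, wheelbase: float = 2.7,
                 max_steer: float = 0.6) -> VehicleState:
    # Physical limits: steering is clipped and speed cannot go negative.
    steer = max(-max_steer, min(max_steer, steer))
    speed = max(0.0, s.speed + accel * dt)
    # Semi-implicit Euler: update speed first, then integrate the pose with it.
    x = s.x + speed * math.cos(s.heading) * dt
    y = s.y + speed * math.sin(s.heading) * dt
    heading = s.heading + speed / wheelbase * math.tan(steer) * dt
    return VehicleState(x, y, heading, speed)


def rollout(actions: Sequence[Tuple[float, float]],
            history_len: int = 4) -> Tuple[List[Pose], List[Pose]]:
    # Accumulate position and heading step by step; `history` holds the most
    # recent pre-step poses that a history-aware policy could observe.
    s = VehicleState()
    history: Deque[Pose] = deque(maxlen=history_len)
    traj: List[Pose] = []
    for accel, steer in actions:
        history.append((s.x, s.y, s.heading))
        s = bicycle_step(s, accel, steer)
        traj.append((s.x, s.y, s.heading))
    return traj, list(history)


traj, hist = rollout([(1.0, 0.0)] * 10)  # constant acceleration, straight
print(round(traj[-1][0], 6))  # 0.55
```

Because every step passes through the clipped transition, the rolled-out trajectory cannot exceed the steering limit even if the commanded action asks for more. A planned path that violates the limit therefore cannot be followed exactly.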


Method.

Pedestrian results and comparison

Results and comparison of PedExecutor. The video includes comparisons with other methods on the following, imitation, and interaction tasks. Our PedExecutor accomplishes the tasks specified by the control signals and generates the most realistic pedestrian dynamics among the compared methods.

Vehicle results and comparison

Results and comparison of VehExecutor. Without VehExecutor, the results obtained from Bézier curves show obvious oversteering and unnatural behavior. The results of VehExecutor are significantly more natural and physically feasible.
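A purely geometric path can be physically infeasible. A rough check compares a cubic Bézier curve's maximum curvature with the tightest curvature a kinematic bicycle model can follow, tan(max_steer) / wheelbase. This is an illustrative sketch only. The page does not say how its Bézier baseline was constructed, and the wheelbase of 2.7 m, the steering limit of 0.6 rad, the sampling density, and the function names are all hypothetical.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]


def bezier_derivatives(p: Sequence[Point], t: float) -> Tuple[float, float, float, float]:
    # First and second derivatives of a cubic Bezier curve with control points p[0..3].
    (x0, y0), (x1, y1), (x2, y2), (x3, y3) = p
    u = 1.0 - t
    dx = 3 * u * u * (x1 - x0) + 6 * u * t * (x2 - x1) + 3 * t * t * (x3 - x2)
    dy = 3 * u * u * (y1 - y0) + 6 * u * t * (y2 - y1) + 3 * t * t * (y3 - y2)
    ddx = 6 * u * (x2 - 2 * x1 + x0) + 6 * t * (x3 - 2 * x2 + x1)
    ddy = 6 * u * (y2 - 2 * y1 + y0) + 6 * t * (y3 - 2 * y2 + y1)
    return dx, dy, ddx, ddy


def max_curvature(p: Sequence[Point], samples: int = 101) -> float:
    # kappa = |x'y'' - y'x''| / (x'^2 + y'^2)^(3/2), evaluated on a uniform grid in t.
    best = 0.0
    for i in range(samples):
        dx, dy, ddx, ddy = bezier_derivatives(p, i / (samples - 1))
        speed_sq = dx * dx + dy * dy
        if speed_sq == 0.0:
            continue  # curvature is undefined where the parametrization stalls
        best = max(best, abs(dx * ddy - dy * ddx) / speed_sq ** 1.5)
    return best


def is_drivable(p: Sequence[Point], wheelbase: float = 2.7, max_steer: float = 0.6) -> bool:
    # A kinematic bicycle cannot follow curvature above tan(max_steer) / wheelbase.
    return max_curvature(p) <= math.tan(max_steer) / wheelbase


straight = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
sharp_corner = [(0.0, 0.0), (1.0, 0.0), (1.0, 0.0), (1.0, 1.0)]
print(is_drivable(straight), is_drivable(sharp_corner))  # True False
```

The sharp corner peaks at a curvature of roughly 3.8 per metre, more than ten times the assumed limit of about 0.25 per metre. Tracking such a path literally would require steering beyond what the vehicle can produce, which is the kind of infeasibility a physics-based executor is meant to absorb.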