An Interactive Agent Foundation Model
- Zane Durante
- Bidipta Sarkar
- Ran Gong
- Rohan Taori
- Yusuke Noda
- Paul Tang
- Ehsan Adeli
- Shrinidhi Kowshika Lakshmikanth
- Kevin Schulman
- Arnold Milstein
- Demetri Terzopoulos
- Ade Famoti
- Noboru Kuno
- Ashley Llorens
- Hoi Vo
- Katsu Ikeuchi
- Li Fei-Fei
- Jianfeng Gao
- Naoki Wake
- Qiuyuan Huang
Paper
Model (Coming Soon!)
Abstract
The development of artificial intelligence systems is transitioning from creating static, task-specific models to dynamic, agent-based systems capable of performing well in a wide range of applications. We propose an Interactive Agent Foundation Model that uses a novel multi-task agent training paradigm to train AI agents across a wide range of domains, datasets, and tasks. Our training paradigm unifies diverse pre-training strategies, including visual masked auto-encoders, language modeling, and next-action prediction, enabling a versatile and adaptable AI framework. We demonstrate our framework across three separate domains: Robotics, Gaming AI, and Healthcare. In each area, our model generates meaningful and contextually relevant outputs. The strength of our approach lies in its generality, leveraging a variety of data sources, such as robotics sequences, gameplay data, large-scale video datasets, and textual information, for effective multimodal and multi-task learning. Our approach provides a promising avenue for developing generalist, action-taking, multimodal systems.
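The training paradigm described above combines three objectives in one model. Below is a minimal, runnable PyTorch sketch of how visual masked auto-encoding, language modeling, and next-action prediction might share a single backbone and be summed into one loss; the architecture, dimensions, 75% masking ratio, joint text-action vocabulary, and equal loss weighting are all illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB = 1000   # size of a joint text + action token vocabulary (assumed)
DIM = 256      # hidden width (assumed)

class UnifiedAgentSketch(nn.Module):
    def __init__(self):
        super().__init__()
        self.patch_embed = nn.Linear(16 * 16 * 3, DIM)    # toy image-patch embedding
        self.patch_decode = nn.Linear(DIM, 16 * 16 * 3)   # MAE reconstruction head
        self.token_embed = nn.Embedding(VOCAB, DIM)
        self.trunk = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(DIM, nhead=4, batch_first=True),
            num_layers=2,
        )
        self.lm_head = nn.Linear(DIM, VOCAB)  # shared text/action logits

    def forward(self, patches, mask, tokens):
        # 1) Visual masked auto-encoding: hide masked patches, reconstruct them.
        visible = self.patch_embed(patches * (~mask).unsqueeze(-1).float())
        recon = self.patch_decode(self.trunk(visible))
        mae_loss = F.mse_loss(recon[mask], patches[mask])

        # 2) Language modeling and next-action prediction, folded into a single
        #    next-token objective over interleaved text + action tokens,
        #    conditioned on the visual features via sequence concatenation.
        seq = torch.cat([visible, self.token_embed(tokens[:, :-1])], dim=1)
        causal = nn.Transformer.generate_square_subsequent_mask(seq.size(1))
        logits = self.lm_head(self.trunk(seq, mask=causal))[:, visible.size(1):]
        ntp_loss = F.cross_entropy(
            logits.reshape(-1, VOCAB), tokens[:, 1:].reshape(-1)
        )
        return mae_loss + ntp_loss  # equal weighting is an assumption

# Toy usage: 2 clips of 8 patches each, paired with 6-token sequences.
model = UnifiedAgentSketch()
patches = torch.randn(2, 8, 16 * 16 * 3)
mask = torch.rand(2, 8) < 0.75                 # 75% masking ratio (assumed)
tokens = torch.randint(0, VOCAB, (2, 6))
model(patches, mask, tokens).backward()
```

Folding language modeling and next-action prediction into one next-token objective over a shared vocabulary is one simple way to realize the unified paradigm, since action tokens can then be predicted with the same head as text.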
Robotics Examples
We pre-train our model on the CALVIN and Language Table datasets. Below, we show example successful rollouts of our policy from randomized initial conditions in both environments; a sketch of the evaluation loop follows the task lists.
CALVIN:
Open Drawer
Move Slider Right
Lift Red Block
Close Drawer
Language Table:
Block to Absolute Location
Block to Block
Block to Relative Location
Separate Blocks
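Below is a minimal sketch of the closed-loop evaluation protocol behind rollouts like these: reset the environment to a randomized initial state, then run the policy until the instruction is satisfied or a step budget runs out. The gym-style environment interface and the `policy.act` API are hypothetical, for illustration only; this is not the actual CALVIN or Language Table code.

```python
def rollout(env, policy, instruction, max_steps=200):
    """Run one closed-loop episode; return True if the task succeeds."""
    obs = env.reset()  # randomized initial conditions
    history = []
    for _ in range(max_steps):
        # Hypothetical policy API: condition on the current observation,
        # the language instruction, and the interaction history so far.
        action = policy.act(obs, instruction, history)
        obs, reward, done, info = env.step(action)
        history.append((obs, action))
        if done:  # environment signals episode termination
            return info.get("success", False)
    return False

# Example usage (assumed interfaces): success rate over randomized episodes.
# successes = [rollout(env, policy, "open the drawer") for _ in range(100)]
# print(sum(successes) / len(successes))
```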
Gaming Examples
We also pre-train our model on Minecraft and Bleeding Edge gameplay videos annotated with actions. Below, we show actions predicted by our policy alongside the ground-truth actions taken by a player following specific instructions; a sketch of how to read these action strings follows the tables.
Minecraft Task: The player is using an iron sword to attack and kill pigs in a forest …
Video
Actions
| Ground Truth | Predicted Action |
| --- | --- |
| [CAMERAX7] | [CAMERAX4] |
| [CAMERAY22] | [CAMERAY9] |
| [CAMERAX0] | [CAMERAX4] |
| [CAMERAY2] | [CAMERAY23] |
| [attack] | [attack] |
| [CAMERAX-1] | |
| [CAMERAY2] | |
| [attack] | [attack] |
| [attack] | [attack] |
Bleeding Edge Task: The player is controlling a red robot … fighting other characters …
Video
Actions
| Ground Truth | Predicted Action |
| --- | --- |
| [lockon][meleeattack] [lrot159] [lmag4] | [specialability2] [lrot216] [lmag4] |
| [lockon][meleeattack] [lrot159] [lmag4] | [lockon][meleeattack] [lrot159] [lmag4] |
| [lockon][meleeattack] [lrot159] [lmag4] | [lockon][meleeattack] [lrot159] [lmag4] |
| [lockon][meleeattack] [lrot160] [lmag4] | [lockon][meleeattack] [lrot162] [lmag4] |
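Both games use bracketed action tokens: camera deltas ([CAMERAX7], [CAMERAY22]) and button presses ([attack]) in Minecraft, and ability, rotation, and stick-magnitude tokens ([lockon], [lrot159], [lmag4]) in Bleeding Edge. The small sketch below shows one way to parse these strings and score a per-frame exact match; the parsing helper and the metric are illustrative assumptions, not our evaluation code.

```python
import re

def parse_actions(s):
    """Split '[lockon][meleeattack] [lrot159] [lmag4]' into a token list."""
    return re.findall(r"\[([^\]]+)\]", s)

def frame_match(ground_truth, predicted):
    """Exact match between ground-truth and predicted tokens for one frame."""
    return set(parse_actions(ground_truth)) == set(parse_actions(predicted))

gt = "[lockon][meleeattack] [lrot159] [lmag4]"
pred = "[lockon][meleeattack] [lrot159] [lmag4]"
print(frame_match(gt, pred))  # True: an exact per-frame match
```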
Healthcare Examples
We fine-tune our model on healthcare data and find that pre-training on the robotics and gaming datasets provides some positive transfer for predicting actions. Below, we show examples of our fine-tuned model performing various healthcare tasks; a sketch of how the RASS output can be decoded follows the examples.
Video Captioning
Video
Caption
The patient is awake and calm. The patient is cooperative. The patient is alert.
Video Question Answering
Video
Question and Answer
Q: Where is the patient? A: patient is in deep sedation. The patient likely requires assistance.
Action Recognition (RASS)
Video
Predicted Label
0 - Alert and calm
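For the RASS task above, the model emits one of the ten Richmond Agitation-Sedation Scale labels, which range from -5 (unarousable) to +4 (combative), with 0 meaning alert and calm. Below is a small sketch of decoding such an output into its integer score; the label table and parsing function are illustrative assumptions, not our pipeline.

```python
import re

# Standard RASS levels, mapped from integer score to label.
RASS = {
    +4: "Combative", +3: "Very agitated", +2: "Agitated", +1: "Restless",
     0: "Alert and calm", -1: "Drowsy", -2: "Light sedation",
    -3: "Moderate sedation", -4: "Deep sedation", -5: "Unarousable",
}

def parse_rass(output: str) -> int:
    """Parse a model output like '0 - Alert and calm' into its RASS score."""
    m = re.match(r"\s*([+-]?\d+)", output)  # leading signed integer
    return int(m.group(1))

score = parse_rass("0 - Alert and calm")
print(score, RASS[score])  # 0 Alert and calm
```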
Citation
The website template was borrowed from Jon Barron.