An Interactive Agent Foundation Model


  • Zane Durante
  • Bidipta Sarkar
  • Ran Gong
  • Rohan Taori
  • Yusuke Noda
  • Paul Tang
  • Ehsan Adeli
  • Shrinidhi Kowshika Lakshmikanth
  • Kevin Schulman
  • Arnold Milstein
  • Demetri Terzopoulos
  • Ade Famoti
  • Noboru Kuno
  • Ashley Llorens
  • Hoi Vo
  • Katsu Ikeuchi
  • Li Fei-Fei
  • Jianfeng Gao
  • Naoki Wake
  • Qiuyuan Huang



Abstract

The development of artificial intelligence systems is transitioning from creating static, task-specific models to dynamic, agent-based systems capable of performing well in a wide range of applications. We propose an Interactive Agent Foundation Model that uses a novel multi-task agent training paradigm to train AI agents across diverse domains, datasets, and tasks. Our training paradigm unifies diverse pre-training strategies, including visual masked auto-encoders, language modeling, and next-action prediction, enabling a versatile and adaptable AI framework. We demonstrate the performance of our framework across three separate domains: Robotics, Gaming AI, and Healthcare. In each area, our model generates meaningful and contextually relevant outputs. The strength of our approach lies in its generality, leveraging a variety of data sources, such as robotics sequences, gameplay data, large-scale video datasets, and textual information, for effective multimodal and multi-task learning. Our approach provides a promising avenue for developing generalist, action-taking, multimodal systems.
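To make the unified objective concrete, below is a minimal sketch of how the three pre-training strategies could be combined into a single joint training step. All module names, shapes, and loss weights (e.g. InteractiveAgentSketch and the equal weighting of the three terms) are illustrative assumptions, not the model's actual implementation.

```python
import torch
import torch.nn as nn

# Sketch of a joint training step that sums the three objectives named above:
# masked visual reconstruction, language modeling, and next-action prediction.
# Module names, shapes, and loss weights are assumptions for illustration.
class InteractiveAgentSketch(nn.Module):
    def __init__(self, vocab_size=32000, num_actions=512, dim=768, patches=196):
        super().__init__()
        self.visual_encoder = nn.Linear(patches * 768, dim)    # stand-in for a ViT/MAE encoder
        self.text_embed = nn.Embedding(vocab_size, dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.pixel_head = nn.Linear(dim, patches * 768)         # masked-patch reconstruction
        self.token_head = nn.Linear(dim, vocab_size)            # next-token prediction
        self.action_head = nn.Linear(dim, num_actions)          # next-action prediction

    def forward(self, frame_patches, text_tokens):
        v = self.visual_encoder(frame_patches.flatten(1)).unsqueeze(1)  # (B, 1, dim)
        t = self.text_embed(text_tokens)                                # (B, T, dim)
        h = self.backbone(torch.cat([v, t], dim=1))                     # (B, 1 + T, dim)
        return (self.pixel_head(h[:, 0]),       # reconstruct masked visual patches
                self.token_head(h[:, 1:]),      # predict the next language tokens
                self.action_head(h[:, -1]))     # predict the next action


def training_step(model, frame_patches, text_tokens, next_tokens, next_action):
    pixels, token_logits, action_logits = model(frame_patches, text_tokens)
    loss_mae = nn.functional.mse_loss(pixels, frame_patches.flatten(1))
    loss_lm = nn.functional.cross_entropy(
        token_logits.reshape(-1, token_logits.size(-1)), next_tokens.reshape(-1))
    loss_act = nn.functional.cross_entropy(action_logits, next_action)
    return loss_mae + loss_lm + loss_act   # equal weighting assumed for illustration
```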

Robotics Examples

We pre-train our model on the CALVIN and Language Table datasets. Below, we show example successful rollouts of our policy from randomized initial conditions in both environments; a minimal rollout sketch follows the task lists.

CALVIN:

Open Drawer


Move Slider Right


Lift Red Block


Close Drawer


Language Table:

Block to Absolute Location


Block to Block


Block to Relative Location


Separate Blocks

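A rollout here means executing the policy step by step in an environment that starts from a randomized initial state, until the instructed task succeeds or a step budget runs out. The loop below is a minimal sketch of that evaluation procedure; the env and policy interfaces are assumptions for illustration and do not match the actual CALVIN or Language Table APIs.

```python
# Minimal sketch of evaluating a language-conditioned policy with rollouts
# from randomized initial states. The `env` and `policy` interfaces are
# illustrative assumptions, not the real CALVIN / Language Table APIs.
def rollout(env, policy, instruction, max_steps=200):
    obs = env.reset()                              # samples a random initial state
    for step in range(max_steps):
        action = policy.act(obs, instruction)      # condition on the instruction text
        obs, done, info = env.step(action)
        if done and info.get("success", False):
            return True, step + 1                  # task completed within the budget
    return False, max_steps                        # rollout failed

# Example usage over several randomized episodes (hypothetical objects):
# successes = [rollout(env, policy, "open the drawer")[0] for _ in range(10)]
# print(f"success rate: {sum(successes) / len(successes):.2f}")
```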

Gaming Examples

We also pre-train our model on datasets of Minecraft and Bleeding Edge gameplay videos annotated with actions. Below, we show example actions predicted by our policy alongside the ground-truth actions taken by a player following specific instructions; a sketch of how continuous controls can be discretized into action tokens follows the tables.

Minecraft Task: The player is using an iron sword to attack and kill pigs in a forest …

Video


Actions

Ground Truth    | Predicted Action
[CAMERAX7]      | [CAMERAX4]
[CAMERAY22]     | [CAMERAY9]
[CAMERAX0]      | [CAMERAX4]
[CAMERAY2]      | [CAMERAY23]
[attack]        | [attack]
[CAMERAX-1]     |
[CAMERAY2]      |
[attack]        | [attack]
[attack]        | [attack]

Bleeding Edge Task: The player is controlling a red robot … fighting other characters …

Video


Actions

Ground Truth                              | Predicted Action
[lockon][meleeattack] [lrot159] [lmag4]   | [specialability2] [lrot216] [lmag4]
[lockon][meleeattack] [lrot159] [lmag4]   | [lockon][meleeattack] [lrot159] [lmag4]
[lockon][meleeattack] [lrot159] [lmag4]   | [lockon][meleeattack] [lrot159] [lmag4]
[lockon][meleeattack] [lrot160] [lmag4]   | [lockon][meleeattack] [lrot162] [lmag4]
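The bracketed strings in these tables are discrete action tokens. As a rough illustration of how continuous controller signals could be mapped to tokens like [CAMERAX7] or [lrot159] [lmag4], the snippet below bins camera deltas and left-stick inputs; the bin counts and token formats are our assumptions, not the datasets' actual discretization.

```python
import math

# Illustrative discretization of continuous controls into action tokens
# resembling the ones above. Bin counts and token formats are assumptions,
# not the actual tokenization used for the Minecraft / Bleeding Edge data.
def camera_tokens(dx_degrees, dy_degrees):
    # Round camera deltas to integer bins, e.g. 7.3 deg -> "[CAMERAX7]".
    return [f"[CAMERAX{round(dx_degrees)}]", f"[CAMERAY{round(dy_degrees)}]"]

def left_stick_tokens(x, y, n_rot_bins=256, n_mag_bins=8):
    # Map a stick position in [-1, 1]^2 to rotation and magnitude bins,
    # producing tokens such as "[lrot159]" and "[lmag4]".
    angle = (math.degrees(math.atan2(y, x)) + 360.0) % 360.0
    rot_bin = int(angle / 360.0 * n_rot_bins)
    mag_bin = min(int(math.hypot(x, y) * n_mag_bins), n_mag_bins - 1)
    return [f"[lrot{rot_bin}]", f"[lmag{mag_bin}]"]

# Example: a camera move plus an attack press in Minecraft.
# camera_tokens(7.0, 22.0) + ["[attack]"]  ->  ["[CAMERAX7]", "[CAMERAY22]", "[attack]"]
```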

Healthcare Examples

We fine-tune our model on healthcare data and find that pre-training on the robotics and gaming datasets provides some positive transfer for predicting actions. Below, we show examples of the fine-tuned model performing various healthcare tasks; a sketch of how these tasks can be framed as text-generation prompts follows the examples.

Video Captioning

Video


Caption

The patient is awake and calm. The patient is cooperative. The patient is alert.

Video Question Answering

Video


Answer

Q: Where is the patient? A: patient is in deep sedation. The patient likely requires assistance.

Action Recognition (RASS)

Video


Predicted RASS Level

0 - Alert and calm
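All three healthcare tasks above can be framed as text generation conditioned on video features and a task prompt. The snippet below is one hypothetical way to format such prompts; the templates and the model.generate interface are illustrative assumptions, not the actual fine-tuning setup.

```python
# Hypothetical prompt templates for the three healthcare tasks shown above,
# treating each as text generation conditioned on video features.
# The wording of these templates is an assumption for illustration.
def build_prompt(task, question=""):
    if task == "captioning":
        return "Describe the patient's current state."
    if task == "vqa":
        return f"Answer the question about the video. Q: {question} A:"
    if task == "rass":
        return "Predict the patient's RASS level (e.g. '0 - Alert and calm')."
    raise ValueError(f"unknown task: {task}")

# Example usage with a hypothetical `model.generate(video_features, prompt)` interface:
# caption = model.generate(video_features, build_prompt("captioning"))
# answer  = model.generate(video_features, build_prompt("vqa", "Where is the patient?"))
```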

Citation



The website template was borrowed from Jon Barron