Learning Object States from Actions via Large Language Models

¹The University of Tokyo, ²National Institute of Advanced Industrial Science and Technology (AIST)

We formulate the object state recognition task as a frame-wise multi-label classification problem.
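
As a concrete illustration of this formulation, below is a minimal sketch of a frame-wise multi-label classifier in PyTorch. The feature dimension, the number of state categories, and the variable names are illustrative assumptions, not details taken from the paper.

import torch
import torch.nn as nn

# Minimal sketch of frame-wise multi-label object state classification.
# feat_dim, num_states, and the random inputs are illustrative assumptions.
feat_dim, num_states = 512, 10

classifier = nn.Linear(feat_dim, num_states)      # one logit per state, per frame
frame_features = torch.randn(300, feat_dim)       # features for 300 video frames
logits = classifier(frame_features)               # shape: (300, num_states)
probs = torch.sigmoid(logits)                     # independent presence score per state

# Multiple states can hold in the same frame, so each state is supervised
# with an independent binary cross-entropy term against frame-wise labels.
pseudo_labels = torch.randint(0, 2, (300, num_states)).float()
loss = nn.functional.binary_cross_entropy_with_logits(logits, pseudo_labels)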

We propose a novel method to learn object states from actions using large language models.

Abstract

Temporally localizing the presence of object states in videos is crucial in understanding human activities beyond actions and objects. This task has suffered from a lack of training data due to object states' inherent ambiguity and variety.

To avoid exhaustive annotation, learning from transcribed narrations in instructional videos is an attractive alternative. However, narrations describe object states far less often than actions, which limits their effectiveness as direct supervision. In this work, we propose to extract object state information from the action information included in narrations, using large language models (LLMs). Our key observation is that LLMs encode world knowledge about the relationship between actions and their resulting object states, and can infer the presence of object states from a past action sequence. The proposed LLM-based framework offers the flexibility to generate plausible pseudo object state labels for arbitrary state categories.
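
As a rough sketch of how such pseudo-labels could be generated, the snippet below builds a prompt from a past action sequence and parses yes/no answers into a binary label vector. The state vocabulary, the prompt wording, and the query_llm wrapper are hypothetical; the paper's actual prompts and parsing may differ.

# Minimal sketch of turning narrated actions into pseudo object state labels.
# STATE_VOCAB, the prompt text, and query_llm are assumptions, not the paper's.

STATE_VOCAB = ["whole", "peeled", "sliced"]  # arbitrary target state categories

def build_prompt(object_name, past_actions, states=STATE_VOCAB):
    """Ask an LLM which states hold after the narrated actions so far."""
    actions = "\n".join(f"{i + 1}. {a}" for i, a in enumerate(past_actions))
    return (
        f"The following actions were performed on a {object_name}, in order:\n"
        f"{actions}\n"
        f"For each state in {states}, answer whether the {object_name} is "
        f"currently in that state. Reply with one 'state: yes' or 'state: no' "
        f"line per state."
    )

def parse_labels(llm_reply, states=STATE_VOCAB):
    """Convert the LLM's yes/no answers into a binary pseudo-label vector."""
    labels = {s: 0 for s in states}
    for line in llm_reply.lower().splitlines():
        for s in states:
            if line.startswith(s) and "yes" in line:
                labels[s] = 1
    return labels

# Example usage (query_llm is a hypothetical wrapper around any chat-style LLM API):
# reply = query_llm(build_prompt("apple", ["wash the apple", "peel the apple"]))
# pseudo_labels = parse_labels(reply)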

We evaluate our method on our newly collected Multiple Object States Transition (MOST) dataset, which includes dense temporal annotation of 60 object state categories. Our model trained with the generated pseudo-labels improves mAP by over 29% against strong zero-shot vision-language models, showing the effectiveness of explicitly extracting object state information from actions through LLMs.
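
For reference, frame-wise mAP over state categories can be computed roughly as follows. The array shapes and the use of scikit-learn's average_precision_score are assumptions about the evaluation setup, not necessarily the paper's exact protocol.

import numpy as np
from sklearn.metrics import average_precision_score

def frame_wise_map(scores, labels):
    """scores, labels: (num_frames, num_states) arrays; mAP averaged over states."""
    aps = []
    for k in range(labels.shape[1]):
        if labels[:, k].sum() > 0:  # skip states that never appear in the video
            aps.append(average_precision_score(labels[:, k], scores[:, k]))
    return float(np.mean(aps))

# Example with random predictions against random frame-wise labels:
rng = np.random.default_rng(0)
print(frame_wise_map(rng.random((300, 10)), rng.integers(0, 2, (300, 10))))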

Video

Multiple Object States Transition (MOST) Dataset

We created a new evaluation dataset for temporally localizing the presence of object states. The videos include complicated transitions between different states, which are annotated with dense temporal intervals. The dataset covers a variety of object states, including those that are not necessarily associated with actions (e.g., straight, dry, smooth). A sketch of converting the interval annotations into frame-wise labels follows the list below.

  • More than 150 minutes of videos.
  • Six object categories (apple, egg, flour, shirt, tire, wire) from various domains.
  • Interval annotation for about 10 states per object.
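
The interval annotations can be rasterized into the frame-wise label matrix used above. Here is a minimal sketch, assuming intervals are given as (state, start_sec, end_sec) tuples and a fixed frame rate; both are illustrative assumptions about the annotation format.

import numpy as np

def intervals_to_frame_labels(intervals, state_names, num_frames, fps=1.0):
    """Rasterize (state, start_sec, end_sec) intervals into a binary
    (num_frames, num_states) matrix. Tuple format and fps are assumptions."""
    index = {name: k for k, name in enumerate(state_names)}
    labels = np.zeros((num_frames, len(state_names)), dtype=np.int64)
    for state, start, end in intervals:
        s = int(np.floor(start * fps))
        e = min(int(np.ceil(end * fps)), num_frames)
        labels[s:e, index[state]] = 1
    return labels

# Example: an apple annotated as "whole" for 0-12 s, then "sliced" from 12 s on.
states = ["whole", "peeled", "sliced"]
frame_labels = intervals_to_frame_labels([("whole", 0, 12), ("sliced", 12, 30)], states, 30)
print(frame_labels.sum(axis=0))  # frames per state
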
Qualitative Results

Videos: results on the target objects apple and shirt.

Frames: results on the target objects apple, egg, flour, shirt, tire, and wire.

BibTeX

@article{tateno2024learning,
  title={Learning Object States from Actions via Large Language Models},
  author={Tateno, Masatoshi and Yagi, Takuma and Furuta, Ryosuke and Sato, Yoichi},
  journal={arXiv preprint arXiv:2405.01090},
  year={2024}
}