Referring Atomic Video Action Recognition
Karlsruhe Institute of Technology, Germany.
RISE Research Institutes of Sweden, Digital Systems, Data Science. KTH Royal Institute of Technology, Sweden. ORCID iD: 0009-0004-3798-8603
Hunan University, China.
Karlsruhe Institute of Technology, Germany.
2025 (English). In: Lecture Notes in Computer Science, ISSN 0302-9743, E-ISSN 1611-3349, Vol. 15077 LNCS, p. 166-185. Article in journal (Refereed). Published
Abstract [en]

We introduce a new task called Referring Atomic Video Action Recognition (RAVAR), aimed at identifying atomic actions of a particular person based on a textual description and the video data of this person. This task differs from traditional action recognition and localization, where predictions are delivered for all present individuals. In contrast, we focus on recognizing the correct atomic action of a specific individual, guided by text. To explore this task, we present the RefAVA dataset, containing 36,630 instances with manually annotated textual descriptions of the individuals. To establish a strong initial benchmark, we implement and validate baselines from various domains, e.g., atomic action localization, video question answering, and text-video retrieval. Since these existing methods underperform on RAVAR, we introduce RefAtomNet – a novel cross-stream attention-driven method specialized for the unique challenges of RAVAR: the need to interpret a textual referring expression for the targeted individual, utilize this reference to guide the spatial localization, and harvest the prediction of the atomic actions for the referred person. The key ingredients are: (1) a multi-stream architecture that connects video, text, and a new location-semantic stream, and (2) cross-stream agent attention fusion and agent token fusion, which amplify the most relevant information across these streams and consistently surpass standard attention-based fusion on RAVAR. Extensive experiments demonstrate the effectiveness of RefAtomNet and its building blocks for recognizing the action of the described individual. The dataset and code will be made publicly available at RAVAR.

Place, publisher, year, edition, pages
Springer Science and Business Media Deutschland GmbH, 2025. Vol. 15077 LNCS, p. 166-185
Keywords [en]
Action recognition; Atomic actions; Localisation; Multi-stream architecture; Question Answering; Referring expressions; Spatial localization; Textual description; Video data; Video retrieval; Semantics
National Category
Computer and Information Sciences
Identifiers
URN: urn:nbn:se:ri:diva-77988; DOI: 10.1007/978-3-031-72655-2_10; Scopus ID: 2-s2.0-85213009172; OAI: oai:DiVA.org:ri-77988; DiVA, id: diva2:1941977
Conference
18th European Conference on Computer Vision, ECCV 2024. Milan, Italy. 29 September 2024 through 4 October 2024
Note

The project served to prepare the SFB 1574 Circular Factory for the Perpetual Product (project ID: 471687386), approved by the German Research Foundation (DFG) with a start date of April 1, 2024. This work was also supported in part by the SmartAge project sponsored by the Carl Zeiss Stiftung (P2019-01-003; 2021–2026). This work was performed on the HoreKa supercomputer funded by the Ministry of Science, Research and the Arts Baden-Württemberg and by the Federal Ministry of Education and Research. The authors also acknowledge support by the state of Baden-Württemberg through bwHPC and the German Research Foundation (DFG) through grant INST 35/1597-1 FUGG. This project is also supported by the National Key R&D Program under Grant 2022YFB4701400. Lastly, the authors thank Dr. Sepideh Pashami, the Swedish Innovation Agency VINNOVA, and Digital Futures for their support.

Available from: 2025-03-03. Created: 2025-03-03. Last updated: 2025-09-23. Bibliographically approved

Open Access in DiVA

No full text in DiVA

Authority records

Fu, Jia
