Abstract
Multimodal learning has demonstrated remarkable performance improvements over unimodal architectures. However, the performance of multimodal learning methods often deteriorates when one or more modalities are missing. This may be attributed to the commonly used multi-branch design containing modality-specific components, which makes such approaches reliant on the availability of a complete set of modalities. In this work, we propose a robust multimodal learning framework, Chameleon, that adapts a common-space visual learning network to align all input modalities. To enable this, we unify the input modalities into one format by encoding any non-visual modality into a visual representation, thus making the framework robust to missing modalities. Extensive experiments are performed on multimodal classification tasks using four textual-visual (Hateful Memes, UPMC Food-101, MM-IMDb, and Ferramenta) and two audio-visual (avMNIST, VoxCeleb) datasets. Chameleon not only achieves superior performance when all modalities are present at train/test time but also demonstrates notable resilience in the case of missing modalities.
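To give a flavor of the unification idea, the sketch below shows one plausible way to encode a non-visual modality (here, text) into a visual representation that a standard image backbone could consume. This is a hypothetical illustration, not the paper's actual encoding scheme: the function name `text_to_visual`, the fixed 32x32 grid, and the byte-to-intensity mapping are all assumptions for the sake of the example.

```python
import numpy as np

def text_to_visual(text: str, size: int = 32) -> np.ndarray:
    """Hypothetical sketch: encode text as a fixed-size grayscale 'image'.

    UTF-8 bytes are scaled to [0, 1] pixel intensities and zero-padded
    (or truncated) to a size x size grid, so a visual network designed
    for images can process the text modality in the same input space.
    """
    data = np.frombuffer(text.encode("utf-8"), dtype=np.uint8)
    data = data.astype(np.float32) / 255.0  # scale bytes to [0, 1]
    flat = np.zeros(size * size, dtype=np.float32)
    n = min(len(data), size * size)
    flat[:n] = data[:n]  # pad with zeros or truncate to fit the grid
    return flat.reshape(size, size)

img = text_to_visual("a caption for a multimodal sample")
print(img.shape)  # (32, 32)
```

Because every modality ends up in the same visual format, a missing modality simply means one fewer image-like input, rather than a dead modality-specific branch.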
Citation
Muhammad Irzam Liaqat, Shah Nawaz, Muhammad Zaigham Zaheer, Muhammad Saad Saeed, Hassan Sajjad, Tom De Schepper, Karthik Nandakumar, Muhammad Haris Khan, Ignazio Gallo, Markus Schedl
Chameleon: A Multimodal Learning Framework Robust to Missing Modalities
International Journal of Multimedia Information Retrieval, doi:10.1007/s13735-025-00370-y, 2025.
BibTeX
@article{Liaqat2025Chameleon,
  title   = {Chameleon: A Multimodal Learning Framework Robust to Missing Modalities},
  author  = {Liaqat, Muhammad Irzam and Nawaz, Shah and Zaheer, Muhammad Zaigham and Saeed, Muhammad Saad and Sajjad, Hassan and De Schepper, Tom and Nandakumar, Karthik and Khan, Muhammad Haris and Gallo, Ignazio and Schedl, Markus},
  journal = {International Journal of Multimedia Information Retrieval},
  doi     = {10.1007/s13735-025-00370-y},
  url     = {https://doi.org/10.1007/s13735-025-00370-y},
  year    = {2025}
}