
Abstract
Music retrieval and recommendation systems often leverage representations of the audio content of music tracks and of related data, such as lyrics, user-generated tags, or interaction data. It is therefore important to understand how the choice of representation affects the results of similarity-based music retrieval. In this work, we address this question from several perspectives. We analyze the accuracy, coverage, hubness, popularity bias, and robustness of retrieval systems based on different content modalities (text, audio, video) and on user–item interactions, and we assess the impact of the corresponding features on multimodal retrieval systems. The paper offers insight into which modality to leverage depending on which aspects of the retrieval results are prioritized, and hence provides guidelines for practical real-world scenarios.
Citation
Marta Moscati, Gustavo Escobedo, Eduardo Hernandez Almanza, Jonas Peché, Markus Schedl
Audio, Lyrics, Videoclips, Interactions? An Analysis of Uni- and Multi-modal Music Retrieval Systems in Terms of Accuracy and Beyond-accuracy Aspects
Proceedings of the 3rd Music Recommender Systems Workshop (MuRS), co-located with the 19th ACM Conference on Recommender Systems (RecSys 2025), Prague, Czech Republic, 2025.
BibTeX
@inproceedings{Moscati2025multimodal_mir,
  title     = {Audio, Lyrics, Videoclips, Interactions? An Analysis of Uni- and Multi-modal Music Retrieval Systems in Terms of Accuracy and Beyond-accuracy Aspects},
  author    = {Moscati, Marta and Escobedo, Gustavo and Hernandez Almanza, Eduardo and Peché, Jonas and Schedl, Markus},
  booktitle = {Proceedings of the 3rd Music Recommender Systems Workshop (MuRS), co-located with the 19th ACM Conference on Recommender Systems (RecSys 2025), Prague, Czech Republic},
  editor    = {Ferraro, Andrés and Porcaro, Lorenzo and Bauer, Christine},
  publisher = {CEUR-WS.org},
  year      = {2025}
}