Fashion Retrieval

Multi-Modal Retrieval in Fashion Domain

Gianluca Moro, Stefano Salvatori, Giacomo Frisoni

Description

State-of-the-art (SOTA) works in the literature use vision-and-language transformers to assign similarity scores to joint text-image pairs, which are then used to rank results at retrieval time. However, this approach is inefficient, since it requires coupling the query with every record in the dataset and computing a forward pass for each pair at runtime, precluding scalability to large-scale datasets. We thus propose a solution that overcomes this limitation by combining transformers and deep metric learning to create a latent space where text and images are embedded separately and spatial proximity translates into semantic similarity. Our architecture does not use any convolutional neural networks to process images, allowing us to test different levels of image-processing detail, together with multiple metric learning losses. We improve retrieval accuracy (+18.71% and +9.22% Rank@1 on image-to-text and text-to-image, respectively) on the FashionGen benchmark dataset while being up to 512x faster. Finally, we analyze the speed-up obtainable with different approximate nearest neighbor retrieval strategies, an optimization unavailable to current SOTA contributions.

Keywords: multi-modal retrieval; deep metric learning; vision-and-language transformers; fashion domain
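As a rough illustration of the retrieval scheme described above, the sketch below shows why separate text and image embeddings avoid the pairwise forward passes required by joint scorers: the gallery is embedded once offline, and each query needs only a single forward pass plus a nearest-neighbor lookup. The `TextEncoder`/`ImageEncoder` branches are assumed to be already trained and are not shown; this is a minimal sketch, not the project's actual code.

```python
# Minimal sketch (not the repository's implementation): text and images are
# embedded *separately* into a shared latent space, so gallery embeddings can
# be precomputed once and queries answered with one forward pass plus a
# nearest-neighbor search -- unlike joint text-image scorers, which need one
# forward pass per (query, candidate) pair at runtime.

import numpy as np


def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Project embeddings onto the unit sphere so that the inner product
    equals cosine similarity."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


# --- Offline indexing: embed the whole image gallery once -------------------
# Placeholder for (N, d) embeddings produced by the image branch.
image_embeddings = l2_normalize(np.random.randn(10_000, 512).astype("float32"))


# --- Online retrieval: one forward pass for the query, then a k-NN search ---
def text_to_image_search(query_embedding: np.ndarray, k: int = 10):
    """Rank gallery images by cosine similarity to a single text query."""
    q = l2_normalize(query_embedding[None, :])   # (1, d)
    scores = q @ image_embeddings.T              # (1, N) cosine similarities
    top_k = np.argsort(-scores[0])[:k]           # indices of the best matches
    return top_k, scores[0, top_k]


# Because the gallery index is query-independent, the exhaustive search above
# can be handed off to an approximate nearest-neighbor library for sub-linear
# lookups, e.g. with FAISS:
#
#   import faiss
#   index = faiss.IndexFlatIP(512)       # exact inner-product baseline
#   index.add(image_embeddings)          # swap in an IVF/HNSW index for ANN speed-ups
#   scores, ids = index.search(q, k)
```

The same index structure works in both directions: image-to-text retrieval simply embeds the query image and searches a precomputed gallery of text embeddings.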

Try Me