Query-Oriented Micro-Video Summarization

The query-oriented micro-video summarization task aims to generate a concise sentence with two properties: (a) summarizing the main semantics of the micro-video and (b) being expressed in the form of a search query to facilitate retrieval. Despite its great application value to retrieval, this direction has barely been explored. Previous summarization studies mostly focus on content summarization for traditional long videos. Directly applying these methods tends to yield unsatisfactory results because of the unique characteristics of micro-videos and queries: diverse entities and complex scenes within a short duration, semantic gaps between modalities, and varied queries with distinct expressions. To adapt to these characteristics, we propose a query-oriented micro-video summarization model, dubbed QMS. It employs an encoder-decoder transformer architecture as the skeleton. The multi-modal (visual and textual) signals are passed through two modal-specific encoders to obtain their representations, followed by an entity-aware representation learning module that identifies and highlights critical entity information. To address the large semantic gaps between modalities during optimization, we assign different confidence scores to the modalities according to their semantic relevance. Additionally, we develop a novel strategy to sample an effective target query from the diverse query set with varied expressions. Extensive experiments demonstrate the superiority of QMS over several state-of-the-art methods on both the summarization and retrieval tasks.
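To make the described architecture concrete, below is a minimal PyTorch-style sketch of the encoder-decoder skeleton. It is written under several assumptions: visual features are pre-extracted, a simple learned gate stands in for the entity-aware representation learning module, and the two encoder outputs are fused by plain concatenation; the confidence-weighted optimization and target-query sampling strategy are not shown. All names, dimensions, and hyperparameters (e.g., QMSSketch, d_model, n_heads) are illustrative and not the authors' implementation.

```python
import torch
import torch.nn as nn


class QMSSketch(nn.Module):
    """Illustrative encoder-decoder transformer skeleton for query-oriented summarization."""

    def __init__(self, vocab_size=30000, d_model=512, n_heads=8, n_layers=4):
        super().__init__()
        # Modal-specific encoders: one for visual frame features, one for text.
        self.visual_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True), n_layers
        )
        self.text_embed = nn.Embedding(vocab_size, d_model)
        self.text_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True), n_layers
        )
        # Stand-in for the entity-aware module: a learned gate that re-weights
        # positions so entity-bearing content is highlighted (a simplification).
        self.entity_gate = nn.Sequential(nn.Linear(d_model, 1), nn.Sigmoid())
        # Decoder generates the query-style summary autoregressively.
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True), n_layers
        )
        self.out_proj = nn.Linear(d_model, vocab_size)

    def forward(self, frame_feats, text_ids, tgt_ids):
        # frame_feats: (B, T_v, d_model) pre-extracted visual features
        # text_ids:    (B, T_t) token ids of the textual signal (e.g., title)
        # tgt_ids:     (B, T_q) target query tokens for teacher forcing
        v = self.visual_encoder(frame_feats)
        t = self.text_encoder(self.text_embed(text_ids))
        memory = torch.cat([v, t], dim=1)            # fused multi-modal memory
        memory = memory * self.entity_gate(memory)   # highlight entity positions
        tgt = self.text_embed(tgt_ids)
        mask = nn.Transformer.generate_square_subsequent_mask(tgt.size(1))
        dec = self.decoder(tgt, memory, tgt_mask=mask)
        return self.out_proj(dec)                    # (B, T_q, vocab_size) logits
```

In this sketch, the decoder's token logits would be trained with a cross-entropy loss against a sampled target query; per-modality confidence scores, as described above, could then scale the contribution of each modality to that loss.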


Introduction

Micro-Video Summarization for Retrieval

Challenges


Method

Experiment