LEOPARD: A Vision Language Model for Text-Rich Multi-Image Tasks

Text-rich images, where text serves as the central visual element guiding overall understanding, are prevalent in real-world applications, such as presentation slides, scanned documents, and webpage snapshots. Tasks involving multiple text-rich images are especially challenging, as they require not only understanding the content of individual images but also reasoning about inter-relationships and logical flows across multiple visual inputs. Despite the importance of these scenarios, current multimodal large language models (MLLMs) struggle to handle such tasks due to two key challenges: (1) the scarcity of high-quality instruction-tuning datasets for text-rich multi-image scenarios, and (2) the difficulty of balancing image resolution with visual feature sequence length. To address these challenges, we propose Leopard, an MLLM designed specifically for vision-language tasks involving multiple text-rich images. First, we curated about one million high-quality multimodal instruction-tuning examples tailored to text-rich, multi-image scenarios. Second, we developed an adaptive high-resolution multi-image encoding module that dynamically optimizes the allocation of visual sequence length based on the original aspect ratios and resolutions of the input images. Experiments across a wide range of benchmarks demonstrate our model's superior capabilities on text-rich, multi-image evaluations and competitive performance on general-domain evaluations.
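To make the adaptive encoding idea concrete, the sketch below shows one way such an allocation could work; it is a minimal illustration, not Leopard's actual implementation. The constants `TOKENS_PER_TILE` and `MAX_VISUAL_TOKENS`, and the even split of the budget across images, are all assumptions introduced here for illustration: each image receives a share of a global token budget, and a tiling grid is chosen per image to match its original aspect ratio within that share.

```python
import math

# Hypothetical constants for illustration; Leopard's real budget and
# per-tile token count may differ.
TOKENS_PER_TILE = 576      # e.g., tokens one 336x336 tile yields after encoding
MAX_VISUAL_TOKENS = 16384  # global budget shared by all images in a sample

def grid_for(width: int, height: int, max_tiles: int) -> tuple[int, int]:
    """Pick a (cols, rows) tiling whose shape best matches the image's
    aspect ratio while using at most max_tiles tiles."""
    best, best_cost = (1, 1), float("inf")
    for cols in range(1, max_tiles + 1):
        for rows in range(1, max_tiles // cols + 1):
            # Penalize grids whose aspect ratio deviates from the image's.
            cost = abs(cols / rows - width / height)
            if cost < best_cost:
                best, best_cost = (cols, rows), cost
    return best

def allocate(images: list[tuple[int, int]]) -> list[tuple[int, int]]:
    """Split the global token budget evenly across images, then choose an
    aspect-ratio-preserving grid for each within its share."""
    per_image_tiles = max(1, MAX_VISUAL_TOKENS // (TOKENS_PER_TILE * len(images)))
    return [grid_for(w, h, per_image_tiles) for (w, h) in images]

# Example: a wide slide, a tall scanned page, and a square chart.
print(allocate([(1920, 1080), (850, 1100), (800, 800)]))
```

The key property this sketch captures is that the visual sequence length stays bounded regardless of how many images arrive, while wide or tall inputs still receive grids that respect their shape instead of being squashed into a fixed square.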


Multi-image Text-rich Scenarios Are Common in Real Life

Q: What's the total amount of Income from continuing operations of 2006, and Assets of Pro forma?

A: (13649.0 + 1623.9) = 15272.9.


Challenges in Developing Multi-image Text-rich Models

C1: Limited training resources


C2: Balancing image tokens and visual resolution

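A back-of-the-envelope calculation shows why this balance is hard. With a standard ViT-style encoder using 14x14-pixel patches, the token count grows quadratically with resolution; the numbers below are generic illustrations, not Leopard's actual configuration:

```python
# Visual tokens produced by a ViT-style encoder with 14x14-pixel patches.
# Generic numbers for illustration; not Leopard's actual configuration.
PATCH = 14

def num_tokens(width: int, height: int) -> int:
    return (width // PATCH) * (height // PATCH)

print(num_tokens(336, 336))        # 576 tokens at a common low resolution
print(num_tokens(1344, 1344))      # 9216 tokens at 4x the side length
print(5 * num_tokens(1344, 1344))  # 46080 tokens: five high-res images
                                   # already exceed a 32k-token context
```

Lowering resolution shrinks the sequence but destroys the fine-grained text that these tasks depend on; keeping full resolution for every image exhausts the context window. Per-image resolution therefore has to adapt to the number of images.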


Solutions

S1: Data Collection and Construction
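To make the curated data concrete, one instruction-tuning instance might look like the following; the schema and field names are hypothetical illustrations rather than the dataset's actual format, and the question and answer are taken from the financial-report example above:

```python
# Hypothetical schema for one text-rich multi-image instruction-tuning
# instance; field names are illustrative, not the dataset's actual format.
example = {
    "images": ["report_2005_income.png", "report_2006_balance.png"],
    "instruction": (
        "What's the total amount of Income from continuing operations "
        "of 2006, and Assets of Pro forma?"
    ),
    "response": "(13649.0 + 1623.9) = 15272.9",
}
```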