ACM Multimedia 2026 · Dataset Track · Rio de Janeiro, Brazil

MMArt: A Multi-Perspective Multimodal Dataset
for Visual Art Understanding

Shuai Wang  ·  Wangyuan Ding  ·  Yixian Shen  ·  Jia-Hong Huang  ·  Stevan Rudinac  ·  Monika Kackovic  ·  Nachoem Wijnberg  ·  Marcel Worring

University of Amsterdam

Overview

MMArt is a large-scale dataset of 74,234 WikiArt paintings, each annotated with four independently generated interpretive perspectives — Narrative, Formal, Emotional, Historical — plus a harmonized unified caption. This website provides supplementary materials for the paper; see the paper for full methodology, experiments, and results.

Figure 1. Each painting is annotated with four independent perspectives and a harmonized unified caption. Validity is evaluated through reconstruction and retrieval experiments.

Dataset at a Glance

74k Paintings
20 Art Styles
743 Artists
5 Centuries
35.75M Total Words

The Four Perspectives

Each painting is annotated with four independently generated, specialist-model captions — one per interpretive dimension — plus a synthesized unified caption (U). Full prompt templates for each are in S1.

N — narrative

Narrative & Scene

Depicted entities, figures, scene composition, and story.

Model: Qwen3-VL-8B-Instruct

F — formal

Formal Analysis

Composition, brushwork, palette, light and shadow.

Model: GalleryGPT (LLaVA-7B + LoRA)

E — emotional

Emotional Response

Mood, atmosphere, and psychological tone.

Model: Qwen3-VL-8B-Instruct + ArtEmis-v2

H — historical

Historical Context

Art-historical meaning, symbolism, and cultural codes.

Model: Qwen3-VL-8B-Instruct + RAG

Dataset Access

Data Schema

{
  "image_id":         "Romanticism/delacroix_liberty-leading-the-people.jpg",
  "title":            "liberty-leading-the-people",
  "artist":           "delacroix",
  "style":            "romanticism",
  "date":             "1830",
  "e_narrative":      "A triumphant woman strides forward...",
  "e_formal":         "The diagonal composition surges from lower left...",
  "e_emotional":      "The painting radiates fierce exhilaration...",
  "e_historical":     "Painted in the wake of the July Revolution...",
  "e_unified":        "Liberty Leading the People captures the explosive...",
  "dominant_emotion": "awe",
  "n_perspectives":   4
}
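
As a minimal sketch of how a record in this schema might be checked after loading, the snippet below validates field presence and types. The field names come from the schema above; the expected types are assumptions inferred from the single example record, and `validate_record` is a hypothetical helper, not part of any released tooling.

```python
import json

# Field names follow the schema above; the types are assumptions inferred
# from the one example record shown on this page.
EXPECTED_FIELDS = {
    "image_id": str, "title": str, "artist": str, "style": str, "date": str,
    "e_narrative": str, "e_formal": str, "e_emotional": str,
    "e_historical": str, "e_unified": str,
    "dominant_emotion": str, "n_perspectives": int,
}

def validate_record(record: dict) -> list:
    """Return a list of problems found in one record (empty if none)."""
    problems = []
    for name, expected_type in EXPECTED_FIELDS.items():
        if name not in record:
            problems.append(f"missing field: {name}")
        elif not isinstance(record[name], expected_type):
            problems.append(f"{name}: expected {expected_type.__name__}")
    return problems

# The example record from the schema, with caption text truncated as shown.
sample = json.loads("""
{
  "image_id": "Romanticism/delacroix_liberty-leading-the-people.jpg",
  "title": "liberty-leading-the-people",
  "artist": "delacroix",
  "style": "romanticism",
  "date": "1830",
  "e_narrative": "A triumphant woman strides forward...",
  "e_formal": "The diagonal composition surges from lower left...",
  "e_emotional": "The painting radiates fierce exhilaration...",
  "e_historical": "Painted in the wake of the July Revolution...",
  "e_unified": "Liberty Leading the People captures the explosive...",
  "dominant_emotion": "awe",
  "n_perspectives": 4
}
""")

print(validate_record(sample))
```

Note that `date` is a string in the example record, so numeric comparisons would need an explicit conversion.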

Citation

@inproceedings{wang2026mmart,
  title     = {MMArt: A Multi-Perspective Multimodal Dataset for Visual Art Understanding},
  author    = {Wang, Shuai and others},
  booktitle = {Under submission to ACM International Conference on Multimedia},
  year      = {2026},
}

S1 — Full Prompt Templates

The paper states: "Full prompt templates for all five generation steps are provided in the supplementary website." (§3.2). All prompts are reproduced verbatim below.

S1.1   Narrative Perspective   (π_narr)
Qwen3-VL-8B-Instruct  ·  Image + text  ·  Target: ~80 words

You are an art interpretation assistant.

Given the painting titled "{title}" by {artist}, write a detailed
**narrative and scene interpretation** — what is happening or might be
happening in the scene. Focus on storytelling, implied action,
relationships between figures, and atmosphere.

Guidelines:
- Length: ~80 words
- Tone: descriptive and interpretive, not technical
- Avoid: artistic terms (e.g. "chiaroscuro", "composition"),
  historical facts, or the artist's name

Write the narrative and scene interpretation:

S1.2   Formal Perspective   (π_form)
GalleryGPT (LLaVA-7B + LoRA)  ·  Deterministic (temp=0, beams=1)  ·  Target: ~80 words

Compose a short paragraph of formal analysis for this painting.
Describe the composition, use of color and light, brushwork or technique,
spatial organisation, and any notable visual effects.
Focus purely on how the painting is made, not what it depicts or its
historical context.
Length: ~80 words.

S1.3   Emotional Perspective   (π_emot)
Qwen3-VL-8B-Instruct  ·  Image + text + ArtEmis utterances  ·  Target: ~80 words

Variant A — With ARTEMIS-v2 grounding (99.0% of paintings)

You are an art interpretation assistant.

Look at the painting "{title}" by {artist}.

Real viewers responded to this painting with the following emotional reactions:
{utterances}

The most common emotional response was: {dominant_emotion}.

Using both what you see in the painting and these viewer reactions as grounding,
write a coherent ~80-word **emotional interpretation** — the mood it evokes,
the atmosphere, and the psychological tone.
Synthesize the visual qualities of the painting with the viewer reactions
into a unified emotional description.
Write in third person (e.g. "The painting evokes..."), not first person.

Write the emotional interpretation:

Variant B — Vision only (fallback, 1.0%)

You are an art interpretation assistant.

Given the painting titled "{title}" by {artist}, write an ~80-word
**emotional interpretation** — the mood it evokes, the atmosphere, and
the psychological tone it creates in a viewer.
Focus on emotional and affective qualities only.
Avoid describing what is depicted or analysing technique.
Write in third person (e.g. "The painting evokes...").

Write the emotional interpretation:
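
The choice between the two emotional variants can be sketched as below. This is an illustration under stated assumptions: `build_emotional_prompt` is a hypothetical helper, the bullet formatting of `{utterances}` is a guess (the page does not say how utterances are joined), and the two short templates in the demo are stand-ins for the full Variant A and Variant B texts above.

```python
def build_emotional_prompt(template_a, template_b, title, artist,
                           utterances=None, dominant_emotion=None):
    """Pick Variant A when viewer grounding exists, otherwise Variant B.

    Returns (variant_label, filled_prompt).
    """
    if utterances and dominant_emotion:
        # Assumption: one utterance per line, prefixed with "- ".
        joined = "\n".join(f"- {u}" for u in utterances)
        prompt = template_a.format(title=title, artist=artist,
                                   utterances=joined,
                                   dominant_emotion=dominant_emotion)
        return "A", prompt
    # Vision-only fallback: no utterances available for this painting.
    return "B", template_b.format(title=title, artist=artist)

# Stand-in templates for the demo; the real ones are the full texts above.
stand_in_a = ('Look at the painting "{title}" by {artist}.\n'
              "{utterances}\n"
              "The most common emotional response was: {dominant_emotion}.")
stand_in_b = 'Given the painting titled "{title}" by {artist}, ...'

variant, prompt = build_emotional_prompt(
    stand_in_a, stand_in_b, "liberty-leading-the-people", "delacroix",
    utterances=["It feels powerful and urgent."], dominant_emotion="awe")
print(variant)
print(prompt)
```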

S1.4   Historical Perspective   (π_hist)
Qwen3-VL-8B-Instruct  ·  Image + text + RAG context  ·  Target: ~80 words

Variant A — With retrieved art-history context (similarity ≥ 0.25)

You are an art historian.

The following context has been retrieved from an art knowledge base about
"{title}" by {artist} ({style}, {date}):
{context}

Using the retrieved context and your knowledge of art history,
write a coherent ~80-word **historical and cultural interpretation** of this
painting — covering the artistic movement, historical period, cultural setting,
and any relevant influences, patronage, or significance.

Guidelines:
- Focus on history and cultural context only
- Do NOT describe visual elements, colours, brushwork, or composition
- Do NOT speculate on specific details not supported by the context or
  established art history
- Write in third person (e.g. "The painting reflects...")

Write the historical interpretation:

Variant B — No RAG (fallback when similarity < 0.25)

You are an art historian.

Write a ~80-word **historical and cultural interpretation** of "{title}"
by {artist} ({style}, {date}) — covering the artistic movement, historical
period, cultural setting, and any relevant influences or significance.

Guidelines:
- Focus on history and cultural context only
- Do NOT describe visual elements, colours, brushwork, or composition
- If you are uncertain about specific facts for this artist or work,
  speak at the level of the movement and period
- Write in third person (e.g. "The painting reflects...")

Write the historical interpretation:

S1.5   Unified Caption   (π_unif)
Qwen3-8B  ·  Text-only via vLLM  ·  Target: ~150 words  ·  Applied to all 74,234 paintings

System

You are an art writer producing unified painting descriptions for an academic dataset.

Given four analytical perspectives on a painting, write a single unified
description of approximately 150 words that integrates all four perspectives
into coherent prose.

Rules:
- Do not use section headers or bullet points
- Do not start with "This painting" or "The painting"
- Write in present tense, third person
- Preserve specific details from each perspective: what is depicted,
  visual structure and technique, emotional atmosphere, and art-historical context
- Output only the description, nothing else

User

Painting: "{title}" by {artist}

[Narrative]
{e_narrative text}

[Formal]
{e_formal text}

[Emotional]
{e_emotional text}

[Historical]
{e_historical text}

Write a unified ~150-word description integrating all perspectives above.
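
A minimal sketch of how the User message above might be assembled from one dataset record. The record keys follow the Data Schema; `build_unified_user_message` is a hypothetical helper, and the exact whitespace between sections is an assumption based on the layout shown.

```python
# (section label in the prompt, record key in the data schema)
SECTIONS = [
    ("Narrative", "e_narrative"),
    ("Formal", "e_formal"),
    ("Emotional", "e_emotional"),
    ("Historical", "e_historical"),
]

def build_unified_user_message(record: dict) -> str:
    """Fill the unified-caption User template with the four perspectives."""
    lines = [f'Painting: "{record["title"]}" by {record["artist"]}', ""]
    for label, key in SECTIONS:
        # Each perspective goes under its bracketed label, then a blank line.
        lines += [f"[{label}]", record[key], ""]
    lines.append("Write a unified ~150-word description "
                 "integrating all perspectives above.")
    return "\n".join(lines)

record = {
    "title": "liberty-leading-the-people",
    "artist": "delacroix",
    "e_narrative": "A triumphant woman strides forward...",
    "e_formal": "The diagonal composition surges from lower left...",
    "e_emotional": "The painting radiates fierce exhilaration...",
    "e_historical": "Painted in the wake of the July Revolution...",
}
print(build_unified_user_message(record))
```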

S2 — Additional Implementation Details

The main paper reports max_tokens=256 with default decoding temperature for VLM perspectives. The following parameters were omitted due to space constraints.

Text-only LLM Generation (Qwen3-8B via vLLM)

Parameter            Phase 1: Synthesis   Phase 4: Unified
Temperature          0.3                  0.3
Top-P                0.9                  0.9
Max tokens           200                  250
Repetition penalty   1.05                 1.05
Dtype                bfloat16             bfloat16
Batch size           512                  64
Chain-of-thought     disabled             disabled
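
For readers reproducing the text-only runs, the decoding settings can be kept as plain keyword dictionaries, as in the sketch below. It assumes the batch sizes in the table read as 512 (Phase 1) and 64 (Phase 4), and takes the seed from the shared infrastructure settings. The key names match arguments that vLLM's `SamplingParams` accepts, but how the authors actually passed these values is not stated here.

```python
from math import ceil

# Decoding values shared by both phases; the seed comes from the shared
# vLLM settings reported on this page.
SHARED = {"temperature": 0.3, "top_p": 0.9,
          "repetition_penalty": 1.05, "seed": 42}

# Per-phase values (assumed reading of the table: 512 and 64 as batch sizes).
PHASE_SETTINGS = {
    "phase1_synthesis": {"max_tokens": 200, "batch_size": 512},
    "phase4_unified": {"max_tokens": 250, "batch_size": 64},
}

def sampling_kwargs(phase: str) -> dict:
    """Keyword arguments for one phase, e.g. for a sampling-params object."""
    return {**SHARED, "max_tokens": PHASE_SETTINGS[phase]["max_tokens"]}

def num_batches(n_items: int, phase: str) -> int:
    """How many batches are needed to cover n_items in the given phase."""
    return ceil(n_items / PHASE_SETTINGS[phase]["batch_size"])

print(sampling_kwargs("phase4_unified"))
print(num_batches(74234, "phase4_unified"))
```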

Image Generation (Phase 2)

Model              Steps   Guidance Scale   Gen. Res.   Output Res.
FLUX.2-Klein-4B    4       1.0              1024×1024   512×512
Qwen-Image-2512    25      4.0 (true CFG)   1024×1024   512×512

Qwen-Image uses negative prompt: "blurry, low quality, deformed, ugly, text, watermark, signature, extra limbs, bad anatomy". Downsampling via Lanczos. Seed: 42.

Shared vLLM Infrastructure

Parameter                Value
max_model_len            4096 tokens
gpu_memory_utilization   0.85
Vision image range       min 256×28², max 1280×28² pixels
Random seed              42
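
The vision image range is given as multiples of 28², which can be hard to read at a glance. The short sketch below works out the two pixel budgets and checks an image size against them. Treating the range as a total pixel count (width × height) is an assumption; `within_budget` is a hypothetical helper.

```python
PATCH_SIDE = 28  # the "28" in the 28² factor

MIN_PIXELS = 256 * PATCH_SIDE ** 2    # 200,704 pixels
MAX_PIXELS = 1280 * PATCH_SIDE ** 2   # 1,003,520 pixels

def within_budget(width: int, height: int) -> bool:
    """True if width*height lies inside the stated pixel range."""
    return MIN_PIXELS <= width * height <= MAX_PIXELS

print(MIN_PIXELS, MAX_PIXELS)
print(within_budget(512, 512))     # 262,144 pixels
print(within_budget(1024, 1024))   # 1,048,576 pixels
```

Under this reading, a 512×512 image falls inside the range, while a 1024×1024 image slightly exceeds the upper bound.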

S3 — RAG Construction (Historical Perspective)

The main paper states that the top-5 context documents are retrieved via embedding ranking. The complete retrieval pipeline is as follows.

Chunk Index

  • Embedding model: sentence-transformers/all-MiniLM-L6-v2 (384-dim, L2-normalised)
  • Chunking: sliding window — 1,000 tokens per chunk, 100-token stride over Wikipedia art pages
  • Index: binary embedding matrix + text key-value store
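
The chunking step can be sketched as follows. This is a minimal illustration that assumes "stride" means the offset between consecutive window starts and that tokens are already available as a list; the tokenizer is not specified on this page, so the demo uses integers as stand-in tokens.

```python
def sliding_window_chunks(tokens, size=1000, stride=100):
    """Split a token list into overlapping windows.

    Each window holds up to `size` tokens and consecutive windows start
    `stride` tokens apart. The last window ends at the end of the input.
    """
    if size <= 0 or stride <= 0:
        raise ValueError("size and stride must be positive")
    if not tokens:
        return []
    chunks = []
    start = 0
    while True:
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reaches the end of the page
        start += stride
    return chunks

# Stand-in "tokens": a 1,200-token page yields windows starting at 0, 100, 200.
page = list(range(1200))
chunks = sliding_window_chunks(page)
print(len(chunks), [c[0] for c in chunks], len(chunks[-1]))
```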

Retrieval at Inference

A query string is constructed from title, artist, style, and date, then:

  1. Embed with local sentence-transformer (~2 ms on CPU)
  2. Cosine similarity vs. all chunks (L2-normalised dot product)
  3. If max_similarity < 0.25 → fall back to no-RAG prompt (Variant B)
  4. Otherwise return top-5 chunks as {context}

Parameter              Value
Embedding model        sentence-transformers/all-MiniLM-L6-v2
Embedding dim          384
Similarity metric      Cosine (L2-normalised dot product)
Similarity threshold   0.25
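
The retrieval steps above can be sketched in pure Python as follows. Embeddings are passed in as plain lists so the example stays self-contained; in the described pipeline they would come from the sentence-transformer model. `retrieve_context` is a hypothetical helper, and returning `None` to signal the no-RAG fallback is an illustrative choice.

```python
import math

def l2_normalise(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)

def retrieve_context(query_vec, chunk_vecs, chunk_texts,
                     threshold=0.25, top_k=5):
    """Return the top-k chunk texts, or None to signal the Variant B fallback."""
    q = l2_normalise(query_vec)
    scored = []
    for vec, text in zip(chunk_vecs, chunk_texts):
        c = l2_normalise(vec)
        # Cosine similarity as a dot product of unit vectors.
        scored.append((sum(a * b for a, b in zip(q, c)), text))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    if not scored or scored[0][0] < threshold:
        return None  # best match too weak: use the no-RAG prompt
    return [text for _, text in scored[:top_k]]

# Toy 2-dimensional embeddings standing in for 384-dimensional ones.
texts = ["chunk a", "chunk b", "chunk c"]
vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(retrieve_context([1.0, 0.0], vecs, texts))
print(retrieve_context([1.0, 0.0], [[0.0, 1.0], [0.1, 1.0]], ["x", "y"]))
```

As described in the steps, the threshold is applied only to the best-scoring chunk; once it passes, the top five chunks are returned regardless of their individual scores.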

S4 — Quality Evaluation: LLM-as-Judge Prompt

The paper reports judge scores (Table 3) and states the full prompt is in the supplementary. Judge model: google/gemma-3-27b-it (temperature 0.1, max tokens 150, bfloat16).

System Prompt
You are an expert art critic evaluating AI-generated art descriptions.
You will be shown a painting and one perspective description. Rate it on three dimensions.
Respond ONLY with valid JSON, no other text.

User Prompt Template
Painting metadata: Artist={artist}, Style={style}, Date={date}

Perspective type: {label}
Expected focus: {focus_desc}

Description to evaluate:
"""{perspective_text}"""

Rate this description on a scale of 1–5 for each dimension:
- perspective_fidelity: Does the text genuinely focus on {label} aspects?
  (1=wrong focus, 5=excellent focus)
- factual_accuracy: Is the content factually correct about this painting?
  (1=many errors, 5=accurate)
- depth: Does it provide substantive detail beyond generic description?
  (1=generic, 5=highly specific)

Respond with ONLY this JSON:
{"perspective_fidelity": <1-5>, "factual_accuracy": <1-5>, "depth": <1-5>, "reasoning": "<one sentence>"}
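
Because the judge is asked to reply with JSON only, its output can be parsed and range-checked before scores are aggregated. The sketch below is one way to do that; `parse_judge_response` is a hypothetical helper, and rejecting non-integer or out-of-range scores is an assumption about how malformed replies should be handled.

```python
import json

DIMENSIONS = ("perspective_fidelity", "factual_accuracy", "depth")

def parse_judge_response(raw: str) -> dict:
    """Parse one judge reply and check that each score is an integer in 1..5."""
    data = json.loads(raw)
    result = {}
    for dim in DIMENSIONS:
        value = data[dim]  # raises KeyError if a dimension is missing
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, int):
            raise ValueError(f"{dim} is not an integer: {value!r}")
        if not 1 <= value <= 5:
            raise ValueError(f"{dim} out of range: {value}")
        result[dim] = value
    result["reasoning"] = str(data.get("reasoning", ""))
    return result

reply = ('{"perspective_fidelity": 5, "factual_accuracy": 4, '
         '"depth": 3, "reasoning": "Focused and mostly accurate."}')
print(parse_judge_response(reply))
```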

Expected Focus per Perspective

Perspective   Expected focus passed to judge
Narrative     the depicted subjects, scenes, narrative elements, and actions
Formal        composition, palette, brushwork, technique, and visual structure
Emotional     mood, emotional atmosphere, and affective qualities the work evokes
Historical    art movement, historical period, cultural context, and artist's style
Unified       an integrated view combining narrative, formal, emotional, and historical aspects

S6 — Per-Style Analysis

The figures below break down reconstruction and retrieval results by the 20 WikiArt styles in the 1,000-painting benchmark sample — supplementary to the paper's aggregated numbers.

Retrieval Heatmap: All 10 Conditions × 20 Styles

The H column is near-zero across all styles. Pop Art is consistently easiest to retrieve; Abstract Expressionism is hardest.

Figure B. R@1 heatmap across 20 art styles and 10 query conditions. Vertical line separates single-perspective from multi-perspective conditions.
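
For reference, R@1 can be computed from a query-by-painting similarity matrix as sketched below. The setup is an assumption made for illustration: query i is taken to describe painting i, so a hit means the highest-scoring painting for row i is index i. The page does not specify the retrieval model or how ties are handled.

```python
def recall_at_1(similarity):
    """Fraction of queries whose top-scoring painting is the correct one.

    similarity[i][j] is the score of query i against painting j, and the
    correct painting for query i is assumed to be index i.
    """
    if not similarity:
        raise ValueError("similarity matrix is empty")
    hits = 0
    for i, row in enumerate(similarity):
        # Index of the best-scoring painting (first one wins on ties).
        best = max(range(len(row)), key=row.__getitem__)
        hits += int(best == i)
    return hits / len(similarity)

# Toy 3x3 example: queries 0 and 2 retrieve their own painting, query 1 does not.
toy = [
    [0.9, 0.1, 0.2],
    [0.6, 0.5, 0.1],
    [0.2, 0.3, 0.8],
]
print(recall_at_1(toy))
```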

Regeneration CLIP: Single Perspectives × 20 Styles

Formal (F) dominates CLIP similarity for most styles. Exception: Ukiyo-e achieves its highest CLIP score with Historical (H = 0.765), where cultural context uniquely encodes the woodblock aesthetic.

Figure C. Per-style CLIP similarity for regenerated images under four single-perspective conditions (averaged over FLUX.2-Klein and Qwen-Image).