Vistoria

A Multimodal System to Support Fictional Story Writing through Instrumental Image-Text Co-Editing

Kexue Fu1,2*, Jingfei Huang3*, Long Ling4*, Sumin Hong2, Yihang Zuo5, RAY LC1, Toby Jia-Jun Li2

1City University of Hong Kong    2University of Notre Dame    3Harvard University    4Tongji University    5HKUST (Guangzhou)

CHI 2026


Background

Human thinking involves multimodal processing. Visual processes play a central role in cognition: we recall experiences as spatial scenes, form mental models through imagery, and use visual structure to organize and interpret information. Language is similarly entwined with imagery. Text comprehension often evokes mental pictures, and abstract ideas are commonly articulated through spatial metaphors such as path, framework, or perspective.

Story writing is particularly multimodal in nature. During the planning phase, experienced writers often use both imagery and language to construct the story world. They visualize spatial layouts, character interactions, and scene dynamics, while using textual notes to label, sequence, and reason about narrative structure. In the translating phase, visual details serve as an anchor that shapes how writers organize narrative detail and emotional tone.

Yet current writing tools remain overwhelmingly text-centric. Although recent systems powered by large language models incorporate visual elements through image generation or retrieval, these visuals remain peripheral. They function mainly as prompts, static references, or organizational diagrams, rather than as tightly integrated, manipulable representations alongside text.

Theory & Design

Theoretical Foundation

Dual Coding Theory posits that humans draw on two interconnected cognitive channels: verbal (language-based) and nonverbal (visual-spatial). For story writing, visual representations serve as a parallel cognitive resource that complements verbal planning and sparks creative connections.

Instrumental Interaction provides design principles for building interfaces where user actions become manipulable objects: (1) Reification, which transforms abstract commands into tangible instruments; (2) Polymorphism, which allows instruments to operate across different content types; (3) Reuse, which enables outputs to feed into subsequent operations.
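The three principles can be illustrated with a minimal Python sketch. It is not Vistoria's implementation: the `Instrument`, `TextContent`, and `ImageContent` names are hypothetical, and an image is represented by a plain description string so the example stays self-contained.

```python
from dataclasses import dataclass
from typing import Callable, Union


# Hypothetical content types; an image is stood in for by a description string.
@dataclass(frozen=True)
class TextContent:
    body: str


@dataclass(frozen=True)
class ImageContent:
    description: str


Content = Union[TextContent, ImageContent]


@dataclass(frozen=True)
class Instrument:
    """Reification: a command becomes an object that can be held and passed around."""
    name: str
    on_text: Callable[[str], str]
    on_image: Callable[[str], str]

    def apply(self, content: Content) -> Content:
        # Polymorphism: the same instrument accepts either content type.
        if isinstance(content, TextContent):
            return TextContent(self.on_text(content.body))
        return ImageContent(self.on_image(content.description))


emphasize = Instrument(
    name="emphasize",
    on_text=str.upper,
    on_image=lambda d: f"{d} (high contrast)",
)

# Reuse: the output of one application is valid input for the next.
once = emphasize.apply(ImageContent("a lighthouse at dusk"))
twice = emphasize.apply(once)
```

Because `apply` returns the same kind of content it receives, instruments can be chained freely, which is the property the Reuse principle describes.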

Formative Study

We conducted a Wizard-of-Oz co-design study with 10 experienced fiction writers to understand how they integrate multimodal content when planning and drafting stories:

1. Multimodal Input Reifies Vague Ideas

Sketches externalized spatial imagination; text expressed connections to existing stories; images communicated style and mood expectations.

2. Image-Text Interplay

Neither modality sufficed alone: images "set the vibe" while text reframed meaning. We observed a recurring cycle of divergence and convergence between abstract exploration and concrete grounding.

3. Direct Manipulation

Participants frequently merged elements through collaging and recombination, and wanted fine-grained control: regenerating regions, extracting elements, or annotating personas.

4. Fragmented to Coherent

Organizing fragments into coherent narratives was challenging. Writers requested clustering, surfacing relations, and consolidating materials into reusable "setting cards."

Cyclic Workflow

Synthesizing these insights, Vistoria introduces a cyclic workflow where multimodal artifacts and ideas co-evolve: (A) Instrumental operations enable image-text co-editing; (B) Resulting artifacts prompt new story directions; (C) Image-text alignment supports idea formation by coordinating verbal and non-verbal processing; (D) Emerging ideas are externalized into new artifacts, closing the loop for iterative ideation.

Figure: Vistoria cyclic workflow framework.

Key Features

Vistoria introduces a set of instrumental operations designed around three principles: Reification, Polymorphism, and Dual Coding alignment. These operations enable synchronized image-text co-editing to support the planning and translating phases of fictional story writing.

Multimodal Generation

Vistoria transforms multimodal input (sketches, text, and images) into image-text cards that reflect the writer's intention. Two modes are available: Exact Craft for precise realization, and Creative Spark for diverse exploration.

Intention Reification

Use sketches, text, and images to turn vague ideas into concrete narrative materials.

Dual Generation Modes

Exact Craft for precision, Creative Spark for diverse exploration and serendipitous discovery.
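As a rough illustration, the two modes could be modeled as presets over a shared generation request. The sketch below is an assumption throughout: the page does not say how Exact Craft and Creative Spark differ internally, so `num_candidates`, `variation`, and every class name here are hypothetical.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class Mode(Enum):
    EXACT_CRAFT = "exact_craft"
    CREATIVE_SPARK = "creative_spark"


@dataclass
class MultimodalInput:
    text: Optional[str] = None
    sketch_path: Optional[str] = None
    image_paths: List[str] = field(default_factory=list)


@dataclass
class GenerationRequest:
    prompt_parts: List[str]
    num_candidates: int
    variation: float  # 0.0 stays close to the input, 1.0 explores freely


# Assumed presets: illustrative values only, not taken from the paper.
MODE_SETTINGS = {
    Mode.EXACT_CRAFT: {"num_candidates": 1, "variation": 0.2},
    Mode.CREATIVE_SPARK: {"num_candidates": 4, "variation": 0.9},
}


def build_request(inp: MultimodalInput, mode: Mode) -> GenerationRequest:
    """Collect whichever modalities the writer supplied and attach the mode preset."""
    parts: List[str] = []
    if inp.text:
        parts.append(f"text: {inp.text}")
    if inp.sketch_path:
        parts.append(f"sketch: {inp.sketch_path}")
    parts.extend(f"image: {p}" for p in inp.image_paths)
    if not parts:
        raise ValueError("at least one of text, sketch, or image is required")
    return GenerationRequest(prompt_parts=parts, **MODE_SETTINGS[mode])


request = build_request(
    MultimodalInput(text="a flooded library", image_paths=["mood.png"]),
    Mode.CREATIVE_SPARK,
)
```

Keeping the mode as a preset over one request type means the same multimodal input can be rerun in the other mode without being restated.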

Lasso - Extract & Focus

The Lasso instrument turns the abstract action of "focusing on part of a story" into a manipulable unit. Selecting a region in either an image or text triggers the generation of a new card with enriched narrative and visual details, enabling writers to shift between narrative scales.

Extract from Image

Circle visual details to generate focused story segments with enriched descriptions.

Multi-scale Editing

Work at different narrative scales - from global story arcs to specific scene details.
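One way to picture the Lasso is as a function from a card plus a selection to a new child card. The sketch below uses hypothetical `Card`, `TextSpan`, and `ImageRegion` types and covers only the selection bookkeeping; the enrichment step, which in Vistoria is generative, is left out.

```python
from dataclasses import dataclass
from typing import Optional, Tuple, Union


@dataclass(frozen=True)
class Card:
    card_id: str
    text: str
    image_size: Tuple[int, int]  # (width, height) in pixels
    parent_id: Optional[str] = None


@dataclass(frozen=True)
class TextSpan:
    start: int
    end: int


@dataclass(frozen=True)
class ImageRegion:
    left: int
    top: int
    right: int
    bottom: int


Selection = Union[TextSpan, ImageRegion]


def lasso(card: Card, selection: Selection, new_id: str) -> Card:
    """Turn a selection in either modality into a child card that seeds a focused segment."""
    if isinstance(selection, TextSpan):
        if not 0 <= selection.start < selection.end <= len(card.text):
            raise ValueError("text span is outside the card text")
        seed = card.text[selection.start:selection.end]
        size = card.image_size
    else:
        width, height = card.image_size
        if not (0 <= selection.left < selection.right <= width
                and 0 <= selection.top < selection.bottom <= height):
            raise ValueError("image region is outside the card image")
        seed = (f"region {selection.left},{selection.top}-"
                f"{selection.right},{selection.bottom} of {card.card_id}")
        size = (selection.right - selection.left, selection.bottom - selection.top)
    # The parent link lets a writer move back up from detail to the wider scene.
    return Card(card_id=new_id, text=seed, image_size=size, parent_id=card.card_id)


scene = Card("scene-1", "The harbor slept under a red moon.", (1024, 768))
detail = lasso(scene, TextSpan(4, 10), "detail-1")
crop = lasso(scene, ImageRegion(100, 50, 400, 350), "detail-2")
```

The `parent_id` field is the piece that supports multi-scale editing in this sketch: each extracted card remembers the scene it came from.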

Collage - Recombine & Create

The Collage instrument reifies "recombining inspirations" into tangible manipulation. Image fragments, sketches, or text can be directly composed within a collage frame to form new narrative possibilities unachievable through text or images alone.

Creative Recombination

Merge extracted elements from different sources to discover new narrative directions.

Sketch + Image Fusion

Combine hand-drawn sketches with images to externalize and specify your creative vision.
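A collage frame can be thought of as a list of placed fragments with a stacking order. The following is a minimal sketch under that assumption; `Placement`, `CollageFrame`, and the textual `describe` summary are hypothetical and stand in for whatever representation the system passes to its generative backend.

```python
from dataclasses import dataclass, field
from typing import List

ALLOWED_KINDS = {"image", "sketch", "text"}


@dataclass(frozen=True)
class Placement:
    source_id: str
    kind: str  # one of ALLOWED_KINDS
    x: int
    y: int
    z: int = 0  # stacking order, higher values sit on top


@dataclass
class CollageFrame:
    width: int
    height: int
    placements: List[Placement] = field(default_factory=list)

    def place(self, p: Placement) -> None:
        """Add a fragment, rejecting unknown kinds and positions outside the frame."""
        if p.kind not in ALLOWED_KINDS:
            raise ValueError(f"unsupported fragment kind: {p.kind}")
        if not (0 <= p.x < self.width and 0 <= p.y < self.height):
            raise ValueError("fragment anchor lies outside the frame")
        self.placements.append(p)

    def describe(self) -> str:
        """Summarize the composition bottom-to-top as a single line of text."""
        ordered = sorted(self.placements, key=lambda p: p.z)
        return "; ".join(f"{p.kind} {p.source_id} at ({p.x}, {p.y})" for p in ordered)


frame = CollageFrame(width=800, height=600)
frame.place(Placement("tower-sketch", "sketch", 300, 100, z=1))
frame.place(Placement("marsh-photo", "image", 0, 0, z=0))
summary = frame.describe()
```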

Filter - Align Style & Tone

The Filter instrument parameterizes affect: applying a "melancholic" or "dreamy" filter simultaneously adjusts visual style and rewrites accompanying prose to match the emotional tone, maintaining stylistic coherence critical in fictional story writing.

Synchronized Tone Setting

Visual filters automatically align image style with text emotion for consistent narrative mood.

Affective Parameterization

From warm and intimate to dramatic and mysterious - set the perfect emotional atmosphere.
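The idea of one affect parameter driving both modalities can be sketched as a filter object that yields a paired image instruction and text instruction. The filter names come from the examples above; the style and tone wording, and the `AffectFilter` and `apply_filter` names, are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class AffectFilter:
    name: str
    visual_style: str  # appended to the image prompt
    prose_tone: str    # guidance for rewriting the text


# Assumed wording for each filter; not taken from the paper.
FILTERS: Dict[str, AffectFilter] = {
    "melancholic": AffectFilter(
        "melancholic", "desaturated palette, soft low light", "slow rhythm, wistful word choice"
    ),
    "dreamy": AffectFilter(
        "dreamy", "hazy glow, pastel colors", "flowing sentences, loose imagery"
    ),
}


def apply_filter(card_text: str, image_prompt: str, filter_name: str) -> Tuple[str, str]:
    """Return (image instruction, text instruction) derived from one affect setting."""
    f = FILTERS[filter_name]
    image_instruction = f"{image_prompt}, {f.visual_style}"
    text_instruction = f"Rewrite in a {f.name} tone ({f.prose_tone}): {card_text}"
    return image_instruction, text_instruction


image_out, text_out = apply_filter(
    "She waited at the station.", "empty train platform", "melancholic"
)
```

Deriving both instructions from the same `AffectFilter` is what keeps the image style and the prose tone from drifting apart.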

Perspective Shift - Change Viewpoint

The Perspective-Shift instrument changes the visual viewpoint and automatically regenerates story fragments from a first-, second-, or third-person perspective, allowing writers to explore empathy, distance, and awareness in a synchronized manner.

Narrative Voice Transformation

A new camera angle corresponds to a new narrative voice - reshape how events are perceived.

Empathy & Distance Control

First-person for intimacy, third-person for structural awareness - explore different framings.
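A small sketch of the camera-to-voice pairing follows. The specific framing chosen for each narrative person is an assumption made for illustration, as are the `Person`, `PerspectiveRequest`, and `shift_perspective` names.

```python
from dataclasses import dataclass
from enum import Enum


class Person(Enum):
    FIRST = "first"
    SECOND = "second"
    THIRD = "third"


# Assumed pairings of narrative person and camera framing (illustrative only).
CAMERA_FOR_PERSON = {
    Person.FIRST: "point-of-view shot through the character's eyes",
    Person.SECOND: "over-the-shoulder shot that addresses the viewer",
    Person.THIRD: "wide shot observing the whole scene",
}


@dataclass(frozen=True)
class PerspectiveRequest:
    image_instruction: str
    text_instruction: str


def shift_perspective(scene: str, person: Person) -> PerspectiveRequest:
    """Build matching image and text instructions for one narrative person."""
    return PerspectiveRequest(
        image_instruction=f"{scene}, {CAMERA_FOR_PERSON[person]}",
        text_instruction=f"Retell in {person.value} person: {scene}",
    )


shifted = shift_perspective("a child opens the attic door", Person.FIRST)
```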

Cluster - Organize Fragments

The Cluster instrument supports writers in moving from scattered notes and visuals toward coherent structure: grouping related materials, surfacing relationships, and consolidating them into reusable setting cards for ongoing planning.

Clustering & Setting Cards

Organize multimodal fragments and strengthen narrative coherence as ideas accumulate.
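The grouping step can be sketched with a simple tag-overlap rule: fragments that share any tag, directly or through a chain of other fragments, end up in the same setting card. This is a stand-in technique (union-find over shared tags) chosen for illustration; the page does not describe Vistoria's actual clustering method, and the `Fragment` and `SettingCard` types are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List


@dataclass(frozen=True)
class Fragment:
    frag_id: str
    tags: FrozenSet[str]


@dataclass
class SettingCard:
    members: List[str]  # fragment ids, in input order
    tags: List[str]     # sorted union of the members' tags


def cluster_fragments(fragments: List[Fragment]) -> List[SettingCard]:
    """Group fragments connected by shared tags into setting cards."""
    parent: Dict[str, str] = {f.frag_id: f.frag_id for f in fragments}

    def find(x: str) -> str:
        # Follow parent links to the group root, shortening the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    first_with_tag: Dict[str, str] = {}
    for f in fragments:
        for tag in f.tags:
            if tag in first_with_tag:
                # Merge this fragment's group with the group that already has the tag.
                parent[find(f.frag_id)] = find(first_with_tag[tag])
            else:
                first_with_tag[tag] = f.frag_id

    groups: Dict[str, List[Fragment]] = {}
    for f in fragments:
        groups.setdefault(find(f.frag_id), []).append(f)

    return [
        SettingCard(
            members=[m.frag_id for m in members],
            tags=sorted(set().union(*(m.tags for m in members))),
        )
        for members in groups.values()
    ]


cards = cluster_fragments([
    Fragment("a", frozenset({"harbor", "night"})),
    Fragment("b", frozenset({"harbor"})),
    Fragment("c", frozenset({"forest"})),
    Fragment("d", frozenset({"night", "storm"})),
])
```

In this example, fragments `a`, `b`, and `d` form one setting card because `a` shares a tag with each of the other two, while `c` stays on its own.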

Key Results

A controlled study with 12 participants demonstrated that Vistoria's multimodal co-editing approach:

  • Enhances expressiveness (p = .023) - participants could better capture their creative intentions
  • Increases immersion (p = .0006) - the exploratory nature felt playful and engaging
  • Improves collaboration perception (p = .0418) - the system felt like a creative partner
  • Supports divergent exploration - participants explored more directions with greater depth
  • Preserves agency and ownership - writers felt in control while benefiting from AI assistance

While the multimodal workflow increased mental and physical workload, participants valued the enhanced sense of mastery over narrative content and treated the system as a supportive co-pilot rather than a substitute for their own creativity.

Citation

If you find this work useful, please cite our paper:

BibTeX

@inproceedings{fu2026vistoria,
  author    = {Fu, Kexue and Huang, Jingfei and Ling, Long and Hong, Sumin and Zuo, Yihang and LC, RAY and Li, Toby Jia-Jun},
  title     = {Vistoria: A Multimodal System to Support Fictional Story Writing through Instrumental Image-Text Co-Editing},
  booktitle = {Proceedings of the 2026 CHI Conference on Human Factors in Computing Systems},
  series    = {CHI '26},
  year      = {2026},
  location  = {Barcelona, Spain},
  publisher = {Association for Computing Machinery},
  address   = {New York, NY, USA},
  doi       = {10.1145/3772318.3790400},
  url       = {https://doi.org/10.1145/3772318.3790400}
}

ACM Reference Format

Kexue Fu, Jingfei Huang, Long Ling, Sumin Hong, Yihang Zuo, RAY LC, and Toby Jia-Jun Li. 2026. Vistoria: A Multimodal System to Support Fictional Story Writing through Instrumental Image-Text Co-Editing. In Proceedings of the 2026 CHI Conference on Human Factors in Computing Systems (CHI '26), April 13–17, 2026, Barcelona, Spain. ACM, New York, NY, USA, 26 pages. https://doi.org/10.1145/3772318.3790400