HarmoView: Harmonizing Multi-View Constraints
for Identity-Consistent Video Generation

Cong Wang, Zhentao Yu, Hongmei Wang, Weicong Liang, Zixiang Zhou,
Zilin Yang, Jiarong Ou, Rui Chen, Yuan Zhou, Qinglin Lu

Tencent Hunyuan

Corresponding author: cw_research@163.com

Abstract

Current identity-consistent video generation methods struggle to preserve appearance fidelity under large viewpoint changes. While introducing multi-view reference input offers a natural solution, progress remains constrained by the lack of effective frameworks for multi-view inputs and the scarcity of multi-view data. We address these challenges by proposing HarmoView, a robust framework for identity-consistent video generation that effectively integrates multi-view cues through three synergistic architectural refinements complemented by a staged training curriculum. Specifically, (1) we first introduce Multi-level Feature Injection (MFI) to anchor identity fidelity; by injecting raw ViT features from frontal reference images alongside text tokens via cross-attention, MFI provides persistent low-level appearance anchors that complement the high-level identity features within DiT blocks, leading to enhanced identity preservation. Then, (2) we employ learnable proxy tokens to unify heterogeneous reference layouts across single- and multi-view settings while simultaneously resolving the reference–view mismatch problem. Finally, (3) we develop Jump-RoPE, which isolates features per identity to reduce identity crosstalk. To activate these structural capabilities while preserving the model’s original generative priors, we propose the Progressive View Curriculum, a four-stage training strategy that employs view dropout to facilitate a stable transition from vanilla T2V generation to high-fidelity, identity-persistent spatial reasoning. Furthermore, we construct a large-scale multi-view dataset to address the issue of data scarcity using an in-house LoRA-augmented pipeline and multi-stage filtering processes.
Extensive evaluation on our multi-view benchmark (100 manually curated cases spanning 52 unique identities) demonstrates that HarmoView significantly outperforms open-source baselines and matches leading closed-source engines, achieving state-of-the-art performance in identity-consistent video generation.

Method Overview

Three architectural refinements on a pre-trained Wan2.2-T2V backbone, plus a four-stage Progressive View Curriculum for stable training.

Figure 1. Overview of HarmoView. Multi-view references are encoded and concatenated with video noise along the sequence dimension, while frontal appearance features are injected into the text branch via cross-attention. Missing views are filled with learnable proxy tokens, and Jump-RoPE inserts positional gaps to isolate different identities. The four-stage Progressive View Curriculum (PVC) drives a stable transition from single-view to multi-view spatial reasoning.

MFI: Multi-level Feature Injection

Raw ViT appearance features from frontal references are projected and concatenated with text tokens, providing persistent low-level identity anchors that complement the high-level features inside the DiT blocks throughout the entire denoising process.
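The projection-and-concatenation step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the names `project` and `build_cross_attention_context`, the plain linear projection, and the ordering (text tokens first, reference features after) are all assumptions made for illustration, and nested lists stand in for tensors.

```python
from typing import List

Vector = List[float]
Matrix = List[Vector]


def project(features: Matrix, weight: Matrix) -> Matrix:
    """Linear map: each row of `features` (length d_in) times `weight` (d_in x d_out)."""
    d_out = len(weight[0])
    return [
        [sum(row[i] * weight[i][j] for i in range(len(row))) for j in range(d_out)]
        for row in features
    ]


def build_cross_attention_context(
    text_tokens: Matrix, vit_features: Matrix, weight: Matrix
) -> Matrix:
    """Append projected reference features to the text tokens.

    Assumption: the projected features share the text embedding width, so the
    cross-attention layers can attend over one combined key/value sequence.
    """
    projected = project(vit_features, weight)
    if projected and text_tokens and len(projected[0]) != len(text_tokens[0]):
        raise ValueError("projected feature width must match text token width")
    return text_tokens + projected


# Two text tokens of width 2, one ViT feature of width 3 (toy sizes).
context = build_cross_attention_context(
    text_tokens=[[1.0, 0.0], [0.0, 1.0]],
    vit_features=[[1.0, 2.0, 3.0]],
    weight=[[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
)
# context -> [[1.0, 0.0], [0.0, 1.0], [4.0, 5.0]]
```

Because the same context would be passed to every cross-attention layer at every denoising step, the appearance features stay available throughout generation, which is the "persistent anchor" property the description refers to.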

LPT: Learnable Proxy Tokens

Learnable placeholders unify the sequence layout across single- and multi-view settings. They act as “attention sinks” that absorb unmappable features, mitigating the reference–view mismatch problem and stabilizing the structural priors of the backbone.
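As a rough illustration of how placeholders can unify the sequence layout, the sketch below fills every view slot that has no reference with a proxy entry. The slot count, the `unify_layout` helper, and the string `PROXY` marker are hypothetical; in a real model the proxy would be a trainable embedding rather than a string.

```python
from typing import Dict, List

# Hypothetical stand-in for a learned proxy embedding.
PROXY = "<proxy>"


def unify_layout(views: Dict[int, str], num_slots: int) -> List[str]:
    """Return a fixed-length reference layout.

    `views` maps a slot index to the reference available for that slot.
    Slots with no reference are filled with the proxy marker, so single-view
    and multi-view inputs produce sequences of the same length.
    """
    out_of_range = [slot for slot in views if not 0 <= slot < num_slots]
    if out_of_range:
        raise ValueError(f"view slots out of range: {out_of_range}")
    return [views.get(slot, PROXY) for slot in range(num_slots)]


single = unify_layout({2: "frontal"}, num_slots=5)
multi = unify_layout({0: "left", 2: "frontal", 4: "right"}, num_slots=5)
# Both layouts have five entries; only the number of proxies differs.
```

Keeping the layout fixed means the backbone always sees the same reference positions, whether one view or several are supplied.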

Jump-RoPE: Identity-wise Feature Isolation

Discontinuous coordinate jumps insert logical gaps between the positional bands of different identities and between references and video noise, enforcing strict feature boundaries to suppress cross-identity crosstalk.
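The effect of the coordinate jumps can be sketched on a one-dimensional position index. This is a simplified illustration under stated assumptions, not the paper's formulation: the function name `jump_positions`, the single shared `gap` value, the choice to place video tokens first, and the 1-D index are all hypothetical.

```python
from typing import List, Tuple


def jump_positions(
    num_video: int, refs_per_identity: List[int], gap: int
) -> Tuple[List[int], List[List[int]]]:
    """Assign position indices with gaps between bands.

    Video tokens take positions 0..num_video-1. Each identity's reference
    tokens then get a contiguous band, and `gap` unused positions are skipped
    before every band. With gap=0 this reduces to plain sequential indexing.
    """
    if gap < 0:
        raise ValueError("gap must be non-negative")
    video = list(range(num_video))
    next_pos = num_video + gap
    bands: List[List[int]] = []
    for count in refs_per_identity:
        bands.append(list(range(next_pos, next_pos + count)))
        next_pos += count + gap
    return video, bands


video, bands = jump_positions(num_video=4, refs_per_identity=[2, 2], gap=10)
# video -> [0, 1, 2, 3]; bands -> [[14, 15], [26, 27]]

_, sequential = jump_positions(num_video=4, refs_per_identity=[2, 2], gap=0)
# sequential -> [[4, 5], [6, 7]]
```

With sequential indexing the two identities sit at adjacent positions, whereas the jumped layout separates them by a wide positional distance, which is the intuition behind the isolation described above.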

PVC: Progressive View Curriculum

A four-stage schedule—single-view scaling, multi-view warm-up, strategic view dropout, and high-fidelity refinement—gradually increases reference complexity, enabling robust multi-view reasoning without degrading text-following or motion-synthesis capability.
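A staged schedule with view dropout might be configured along the following lines. Only the four stage names come from the description above; the `Stage` fields, the numeric values for `max_views` and `view_dropout`, and the rule that at least one view always survives are placeholders chosen for illustration, not reported settings.

```python
import random
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Stage:
    name: str
    max_views: int       # hypothetical cap on reference views per identity
    view_dropout: float  # hypothetical probability of dropping each view


# Placeholder values; the actual settings are not given in this document.
CURRICULUM = [
    Stage("single-view scaling", max_views=1, view_dropout=0.0),
    Stage("multi-view warm-up", max_views=5, view_dropout=0.0),
    Stage("strategic view dropout", max_views=5, view_dropout=0.5),
    Stage("high-fidelity refinement", max_views=5, view_dropout=0.0),
]


def sample_views(views: List[str], stage: Stage, rng: random.Random) -> List[str]:
    """Pick the reference views used for one training sample in `stage`.

    Each candidate view is dropped independently with `stage.view_dropout`.
    Assumption: if every view is dropped, the first view is kept so the
    sample never loses all identity references.
    """
    if not views:
        raise ValueError("at least one reference view is required")
    candidates = views[: stage.max_views]
    kept = [v for v in candidates if rng.random() >= stage.view_dropout]
    return kept if kept else [candidates[0]]


rng = random.Random(0)
views = ["frontal", "left", "right"]
for stage in CURRICULUM:
    print(stage.name, sample_views(views, stage, rng))
```

Randomly withholding views during training is one way to make a model tolerate incomplete reference sets at inference time, which fits the single-view and multi-view settings shown in the comparisons.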

Comparison with the state-of-the-art methods

Eight methods per case on HarmoView-Bench: four open-source baselines, two commercial engines, and HarmoView (1-View & Multi-View).

References: Person 1, Person 2
Single shot, starting with a medium shot in an ancient Chinese imperial palace setting. On the left is a young woman with long brown hair, wearing a golden crown inlaid with jewels and a red robe embroidered with dragons and phoenixes. She turns her face slightly to the right, gazes at the man on the right, and gently rests her head on his shoulder while speaking to him. On the right is a young man with a shaved head, wearing a yellow Buddhist robe, hands clasped together in front of his chest, palms facing each other with fingertips pointing upward, body slightly leaning forward, head bowed, eyes lowered, brows furrowed, listening with a tense expression. The camera slowly pushes in from the medium shot of both characters, ending in a close-up on the man's face to capture his tense expression. The background features red palace walls, carved window lattices, and hanging red curtains. Realistic style, cinematic quality.
Compared with the other methods, HarmoView better preserves identity consistency under large viewpoint changes while offering stronger prompt fidelity—including her head resting on his shoulder, a shaved-head young man in a yellow Buddhist robe, body slightly leaning forward, and head bowed with eyes lowered. It is also worth noting that Kling 3.0 exhibits a clear reference-copy artifact (white earphones absent from the text or references), whereas our result stays aligned with the intended wardrobe and accessories.
Open-Source Models: VACE, Kaleido, RefAlign, HuMo
Commercial Engines & HarmoView: Kling 3.0, Wan 2.7, HarmoView★ (1-View), HarmoView (Multi-View)
References: Person 1, Person 2
The video is a medium close-up shot, captured from a low-angle perspective with a wide-angle lens and a fixed camera. On the left side of the frame is a young man wearing a shirt and vest, with black rectangular-framed glasses. He is bowing with his hands clasped in front of his chest, speaking with an excited tone. On the right side is another young man, dressed in a luxurious dark green tailcoat, wearing silver round-framed glasses. He occupies most of the frame, looking toward the upper left of the frame with a cold, intense gaze, speaking with an excited tone. The background features the Canton Tower and parked luxury cars. The camera slowly and smoothly rotates from a tilted diagonal composition to a frontal view, during which the left character is shown in profile and the right character is shown in full frontal expression. The low-angle perspective is maintained throughout. Realistic style, cinematic quality.
Compared with every open-source and closed-source baseline, HarmoView delivers more consistent identity preservation, semantic faithfulness, and photorealism. It better follows the staged layout (on the left, a young man in a shirt and vest, bowing with hands clasped in front of his chest; on the right, the companion occupying most of the frame) and the camera move, which rotates from a tilted diagonal composition to a frontal view, while keeping both faces, the wardrobe, and the Canton Tower backdrop stable throughout.
Open-Source Models: VACE, Kaleido, RefAlign, HuMo
Commercial Engines & HarmoView: Kling 3.0, Wan 2.7, HarmoView★ (1-View), HarmoView (Multi-View)
References: Person 1, Person 2
Cinematic, photorealistic quality, modern urban cultivation style. High-angle frontal view, showing a full-body shot of the young man in the center of the frame. He has deep brown eyes and slightly messy black short hair, wearing a cultivation robe over a black casual shirt, standing on a huge, gleaming sword, flying rapidly through a sea of clouds and mountain peaks, his hair flowing wildly with the wind, conveying extreme speed. The camera dynamically follows him, slowly rising and tilting down, gradually expanding from a close-up of the character to a panoramic view of the entire scene, fully showcasing the undulating, bizarre peaks, the misty morning fog, and the vast sea of clouds. The lighting is rich in layers, with extremely fine details, creating a strong epic feeling and a grand spatial atmosphere. Realistic style, cinematic quality.
HarmoView (Multi-View) maintains strong identity consistency for the hero while outperforming all other methods on semantic alignment and photorealism. In particular, it better matches the prompt details (slightly messy black short hair, standing on a huge sword, and the camera slowly rising and tilting down), so costume, sword scale, and cloudscape read as a coherent cinematic take rather than a drifting stylization.
Open-Source Models: VACE, Kaleido, RefAlign, HuMo
Commercial Engines & HarmoView: Kling 3.0, Wan 2.7, HarmoView★ (1-View), HarmoView (Multi-View)
References: Person 1, Person 2
The video is a static medium shot, captured from a frontal, eye-level angle. In the center of the frame is a young man with dyed yellow short hair and dark brown eyes, wearing a dark jacket over a white crew-neck shirt. He holds a giant blue water jug with both hands, initially facing the camera. At the start of the video, he turns his body from front to side, looks up, and gently pushes the jug towards the camera. He then speaks a line, followed by a 'cheers' gesture, raising the jug with both hands as if toasting, his gaze sincere. Immediately after, he strains with both hands to lift the heavy jug high, tilts his head back, opens his mouth, and drinks deeply from the jug's opening, the water inside sloshing naturally with his movement. The background is a static yellow wooden wall. At the bottom of the frame, white text with a black outline reads: '为我的莽撞自罚一杯'. Realistic style, cinematic quality.
HarmoView (Multi-View) delivers stronger physical plausibility and identity consistency than both open-source and closed-source baselines on this clip. It also follows the prompt more faithfully—including dyed yellow short hair, pushing the jug toward the camera, raising the jug with both hands for the toast, and the generated on-screen subtitle—while keeping liquid motion, hand–object contact, and facial likeness coherent through the drinking motion.
Open-Source Models: VACE, Kaleido, RefAlign, HuMo
Commercial Engines & HarmoView: Kling 3.0, Wan 2.7, HarmoView★ (1-View), HarmoView (Multi-View)
References: Person 1, Person 2
A single shot rendered as a single frame, in a realistic style, shot from a medium-close distance. The scene shows an Asian man circling around on the grass while holding a golden retriever, looking very happy. The camera gradually zooms in from a distance, eventually focusing on the man’s face as he spins.
HarmoView (Multi-View) clearly surpasses the open-source baselines on this prompt: it faithfully executes the instruction to hold the golden retriever and spin on the grass while maintaining stronger identity consistency for both the person and the dog. Among closed-source engines, our model additionally achieves tighter ID preservation than Wan 2.7, keeping the handler and the dog recognizable and coupled throughout the rotation and zoom-in.
Open-Source Models: VACE, Kaleido, RefAlign, HuMo
Commercial Engines & HarmoView: Kling 3.0, Wan 2.7, HarmoView★ (1-View), HarmoView (Multi-View)

Ablation Study

Architectural ablation on Jump-RoPE.

w/o Jump-RoPE — Sequential Indexing

Case 1
References: Person 1, Person 2
Compared: w/o Jump-RoPE (Sequential RoPE) vs. Full HarmoView (w/ Jump-RoPE)
Issue: Without the logical gaps introduced by Jump-RoPE, the glasses of the two characters are confused with each other.
Case 2
References: Person 1, view 3 (frontal); Person 2, view 3 (frontal)
Compared: w/o Jump-RoPE (Sequential RoPE) vs. Full HarmoView (w/ Jump-RoPE)
Issue: Without the logical gaps introduced by Jump-RoPE, the person on the right in the generated video blends the jawline of Identity 1 with the upper face of Identity 2.

Dataset & Benchmark

One training-corpus sample and one HarmoView-Bench case, each with multi-view references and a prompt; the training sample also includes the target video, while the benchmark case does not.

Training Dataset Sample

A representative entry from the HarmoView training corpus — two identities, each captured from seven reference views, paired with a structured prompt and the target video.

Dataset Sample — Two-person Multi-View Tuple

Person 1: views V1 to V7
Person 2: views V1 to V7

An East Asian man in his 20s or early 30s stands behind an East Asian woman who has fair skin and long brown hair tied back. He wears a black vest over a dark green tunic and has a topknot hairstyle. She is dressed in light-colored robes with a high collar. They both look towards something just outside the right side of the frame where part of their audience's head can be seen. The woman's expression shifts between shock and anxiety; she opens her mouth slightly before closing it again. Her hands grip the man's arm for support. The man maintains eye contact forward throughout the clip.

HarmoView-Bench Case

A curated evaluation case from HarmoView-Bench — two identities with five reference views each, plus the evaluation prompt. The benchmark only ships the input tuple; no ground-truth video is included.

HarmoView-Bench — Large-Pose Evaluation

Person 1: views V1 to V5
Person 2: views V1 to V5

A single-shot video begins with a medium close-up, slowly zooming in to a facial close-up, presenting a cinematic still quality. On the left is a young man with an average face shape, dark brown eyes, and short black hair combed back and tied with a ribbon, wearing a flowing, exquisite xianxia-style robe and matching trousers. On the right is a young man with an average face shape, dark brown eyes, and short black hair with bangs, combed back and secured with a hairpin, wearing a flowing, exquisite xianxia-style robe and matching trousers. The two stand back-to-back at the foot of a mountain in a magnificent fairyland. The man on the left turns his head slightly, gazing at the man on the right with a melancholic expression. The man on the right bows his head in thought, then slowly lifts his head and speaks first, his voice low and firm. The man on the left replies resolutely. Afterwards, the man on the right stands still, staring into the distance, while the man on the left turns and flies away on the wind, his robes billowing. The camera pans right, focusing on a close-up of the man on the right's face; he looks sorrowful, tears welling in his eyes, with the blurred figure of the man on the left flying away visible in the background. The background is a magnificent fairyland: a waterfall cascades from a cliff into a lake below, with rising mist and swirling clouds, bathed in soft sunlight creating a dreamy atmosphere. Realistic style, cinematic quality.
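The two tuple layouts shown in this section (a training sample with a target video, and a benchmark case without one) could be represented by a single record type. The sketch below is only an assumed schema: the class names, field names, and file paths are hypothetical and do not reflect a released data format. The two-identity limit follows the constraint stated under Limitations.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class IdentityReferences:
    name: str
    views: List[str]  # hypothetical paths to the reference images


@dataclass
class MultiViewSample:
    identities: List[IdentityReferences]
    prompt: str
    target_video: Optional[str] = None  # None for benchmark cases

    def is_benchmark_case(self) -> bool:
        """Benchmark cases ship only the input tuple, with no target video."""
        return self.target_video is None

    def validate(self, max_identities: int = 2) -> None:
        """Check the basic shape of the tuple."""
        if not 1 <= len(self.identities) <= max_identities:
            raise ValueError(f"expected 1 to {max_identities} identities")
        if any(not identity.views for identity in self.identities):
            raise ValueError("every identity needs at least one reference view")
        if not self.prompt.strip():
            raise ValueError("prompt must not be empty")


def make_identity(name: str, num_views: int) -> IdentityReferences:
    # Hypothetical path scheme, used only to build the example below.
    return IdentityReferences(name, [f"{name}/v{i}.png" for i in range(1, num_views + 1)])


train = MultiViewSample(
    identities=[make_identity("person1", 7), make_identity("person2", 7)],
    prompt="An East Asian man stands behind an East Asian woman ...",
    target_video="clip.mp4",
)
bench = MultiViewSample(
    identities=[make_identity("person1", 5), make_identity("person2", 5)],
    prompt="A single-shot video begins with a medium close-up ...",
)
train.validate()
bench.validate()
```

Making the target video optional lets one loader handle both the training corpus and the benchmark, with `is_benchmark_case` distinguishing the two.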

Limitations & Societal Impacts

Scope boundaries and responsible-use considerations for HarmoView.

Limitations

HarmoView is currently designed for portrait-identity consistency only. The method is optimized for preserving human facial identity and has not been validated for broader object categories or full-scene identity correspondence. In addition, the current formulation supports at most two identities per generation, which may limit its applicability in scenarios with large casts, dense crowds, or more complex compositional role assignments. Future work should investigate scalable identity slots, improved multi-person disentanglement, and broader-domain consistency modeling.

Societal Impacts

As a portrait-consistent video generation method, HarmoView may be misused to create deceptive or non-consensual synthetic media (e.g., identity impersonation and deepfake-style content). Such misuse can amplify reputational harm, harassment, and misinformation. We therefore recommend responsible deployment practices, including explicit policy constraints, access control, provenance signaling (e.g., watermarking or content credentials when feasible), and human-in-the-loop review for high-risk use cases. We also encourage downstream users to comply with local legal and ethical requirements regarding consent, privacy, and biometric identity usage.

BibTeX

If you find HarmoView useful for your research, please consider citing our work.

@article{wang2026harmoview,
  title   = {HarmoView: Harmonizing Multi-View Constraints for Identity-Consistent Video Generation},
  author  = {Wang, Cong and Yu, Zhentao and Wang, Hongmei and Liang, Weicong and
             Zhou, Zixiang and Yang, Zilin and Ou, Jiarong and Chen, Rui and
             Zhou, Yuan and Lu, Qinglin},
  journal = {arXiv preprint arXiv:2606.10839},
  year    = {2026}
}