HarmoView: Harmonizing Multi-View Constraints
for Identity-Consistent Video Generation

Cong Wang✉, Zhentao Yu, Hongmei Wang, Weicong Liang, Zixiang Zhou,
Zilin Yang, Jiarong Ou, Rui Chen, Yuan Zhou, Qinglin Lu

Tencent Hunyuan

✉ Corresponding author: cw_research@163.com

chunjie_4.mp4

The entire video is rendered as a single take, shot from a medium-shot selfie perspective, starting with a level angle. The main character in the center of the frame corresponds to the first image, with the character’s left arm wrapped around the tiger beside them and the right arm around the lion beside them. As the video begins, the camera rises from a level shot to a high-angle shot. The tiger on the left acts like a clingy big cat, constantly rubbing its furry head against the main character’s cheeks and ears, even sticking out its tongue as if to lick them; the lion on the right, however, appears lazy and domineering. It opens its huge mouth for a long yawn, revealing its fangs but showing no sign of aggression, then presses its heavy head down hard on the main character’s shoulder. Sandwiched between the two “big cats,” the main character is tickled by the tiger’s rubbing. Unable to resist, they squint and burst into laughter, their body swaying slightly as they try to maintain their selfie pose amidst the enthusiastic onslaught of the two beasts. A breeze sweeps across the grassland in the background, casting the swaying shadows of acacia trees, while warm sunlight dances across all three of them, creating an atmosphere that is joyful, intimate, and comical.

chunjie_6.mp4

The video opens with a close-up shot from a side-frontal eye-level angle, focusing on a young woman in the center of the frame. She faces the camera directly, her right hand raised, holding a lit sparkler. The sparkler is positioned in the foreground, slightly off-center, continuously emitting fine golden sparks that fly outward in all directions before fading. Thin wisps of smoke slowly rise and dissipate around the sparkler. The woman has shoulder-length black hair, slightly tousled and gently moving in the breeze. She wears a dark gray LA baseball cap, a thick gray scarf wrapped around her neck, and a dark top with light-colored inner lining visible at the cuffs. Initially, the focus is sharply on the sparkler; then, the focus smoothly shifts to her face. Her eyes gaze gently at the camera, and her chest rises and falls with subtle breaths. The environment is a dark night, with a blurred background showing only soft bokeh lights on the right, shimmering in shades of teal and warm yellow. The primary lighting comes from the warm-toned sparks, casting soft highlights on her cheeks and the edges of her scarf, creating delicate, rhythmic shifts in light and shadow across her face that accentuate her contours. The contrast between the dark background and the bright sparks makes the sparkle more prominent. The overall lighting is soft with scattered warm glows, creating a serene and warm atmosphere. The camera movement is extremely slow push-pull with a cinematic handheld shake. Realistic style, cinematic quality.

chunjie_7.mp4

Cinematographic, realistic photography; medium shot. The video is generated as a single continuous shot, with the main character in the center of the frame matching the first image. The character is a tough special forces soldier, wearing black sunglasses and a tactical vest. His arms are heavily muscled with prominent veins, and he holds an assault rifle with a cold, stern expression. The character is leading a squad of fully armed soldiers through the dimly lit corridors of an abandoned factory, with blurred teammates trailing behind him. A string of warm yellow light bulbs hangs from the ceiling of the corridor, creating a contrast between cool and warm lighting. The shot employs a tracking camera movement, starting slightly to the left of the character’s front. The camera then slowly pans to the right, gradually shifting to a rear-side angle as the character advances. The background is blurred, with high-definition, 8K quality, exuding raw intensity and featuring a style reminiscent of action movie stills.

chunjie_12.mp4

The video consists of a single take, showing the characters in a medium shot. The camera follows the characters as they move; the person in front of the horse corresponds to the first image, while the character behind corresponds to the second image. The two characters, dressed in traditional attire, are galloping on a brown steed, entering the frame from the right and moving toward the left. The character in front (wearing a pink Hanfu) faces the left side of the frame but turns her head to the left, cheerfully uttering a line. The character behind her, wearing a light blue Hanfu, also speaks a line with a smile. Both have their hair styled in ancient buns, their beaming smiles radiating immense joy and freedom. The camera pulls back to a wide shot, revealing the two galloping across a vast green grassland. The scene features a deep blue sky, rolling hills in the distance, and bright sunshine.

chunjie_13.mp4

The video consists of a single shot, beginning with a medium shot of an ancient Chinese court scene. The figure on the left corresponds to the first image, and the figure on the right corresponds to the second image. The figure on the left wears a red court robe embroidered with dragons and phoenixes and a golden crown inlaid with jewels. His face is turned slightly to the right; he gazes at the figure on the right, leans against that person’s shoulder, and speaks to him. The figure on the right is a bald monk dressed in a yellow robe, with his hands pressed together in front of his chest, palms facing each other and fingertips pointing upward, and his body leaning slightly forward. The camera zooms in for a close-up of the right-hand figure’s facial expression: his head is bowed, his eyes cast downward, his expression tense, and his brow furrowed. The background features red palace walls, carved window frames, and hanging red curtains.

chunjie_16.mp4

The entire sequence is rendered as a single-take video in a realistic style, shot in medium-close-ups from a low, upward angle using a wide-angle lens with a fixed camera. The person on the left corresponds to the first image, and the person on the right corresponds to the second image. In front of the Canton Tower, the camera tilts to capture a diagonal composition of the imposing figure of the character on the right. The character on the right, dressed in an elegant dark green tailcoat, fills most of the frame. He looks toward the lower left of the screen with a cold gaze and delivers a line with an air of authority. The character on the left, wearing a shirt and vest, bows with clasped fists while looking down, shouting a line with excitement, with luxury cars parked in the background. The camera rotates slowly and smoothly, gradually shifting from the original tilted diagonal angle to a frontal perspective, capturing the left character’s profile and the right character’s facial expression. The low-angle, upward shot is maintained throughout the camera movement.

chunjie_33.mp4

Cinematic, photorealistic quality, modern urban cultivation style. High-angle frontal view, showing a full-body shot of the young man in the center of the frame. He has deep brown eyes and slightly messy black short hair, wearing a cultivation robe over a black casual shirt, standing on a huge, gleaming sword, flying rapidly through a sea of clouds and mountain peaks, his hair flowing wildly with the wind, conveying extreme speed. The camera dynamically follows him, slowly rising and tilting down, gradually expanding from a close-up of the character to a panoramic view of the entire scene, fully showcasing the undulating, bizarre peaks, the misty morning fog, and the vast sea of clouds. The lighting is rich in layers, with extremely fine details, creating a strong epic feeling and a grand spatial atmosphere. Realistic style, cinematic quality.

chunjie_46.mp4

The entire scene is rendered as a single-shot video, with the central subject corresponding to the first image. A snowy portrait, a beautiful, artistic shot. A close-up captures a young character with fair skin, long straight black hair, and a straight fringe. The character wears a thick black knitted scarf and a black coat. Snow swirls through the air, fluttering onto the character’s hair and eyelashes, while blurred snowflakes drift in the foreground. The video begins with a slight profile shot as she looks up at the sky. The camera circles around the subject, with a blurred snowy landscape in the background. Heavy snow falls as a gentle breeze ruffles her bangs and the ends of her hair. The lighting is soft and cool, creating a rich cinematic atmosphere.

homemade_0.mp4

A single shot rendered as a single frame, in a realistic style, shot from a medium-close distance. The scene shows an Asian man circling around on the grass while holding a golden retriever, looking very happy. The camera gradually zooms in from a distance, eventually focusing on the man’s face as he spins.

Abstract

Current identity-consistent video generation methods struggle to preserve appearance fidelity under large viewpoint changes. While introducing multi-view reference input offers a natural solution, progress remains constrained by the lack of effective frameworks for multi-view inputs and the scarcity of multi-view data. We address these challenges by proposing HarmoView, a robust framework for identity-consistent video generation that integrates multi-view cues through three complementary architectural refinements and a staged training curriculum. Specifically, (1) we introduce Multi-level Feature Injection (MFI) to anchor identity fidelity: by injecting raw ViT features from frontal reference images alongside text tokens via cross-attention, MFI provides persistent low-level appearance anchors that complement the high-level identity features within DiT blocks, improving identity preservation. (2) We employ learnable proxy tokens to unify heterogeneous reference layouts across single- and multi-view settings while simultaneously resolving the reference–view mismatch problem. (3) We develop Jump-RoPE for identity-wise feature isolation to reduce identity crosstalk. To activate these structural capabilities while preserving the model’s original generative priors, we propose the Progressive View Curriculum, a four-stage training strategy that employs view dropout to facilitate a stable transition from vanilla T2V generation to high-fidelity, identity-persistent spatial reasoning. Furthermore, to address data scarcity, we construct a large-scale multi-view dataset using an in-house LoRA-augmented pipeline and multi-stage filtering.
Extensive evaluation on our multi-view benchmark, comprising 100 manually curated cases spanning 52 unique identities, demonstrates that HarmoView significantly outperforms open-source baselines and matches leading closed-source engines, achieving state-of-the-art performance in identity-consistent video generation.

Method Overview

Three architectural refinements on a pre-trained Wan2.2-T2V backbone, plus a four-stage Progressive View Curriculum for stable training.

**Figure 1. Overview of HarmoView.** Multi-view references are encoded and concatenated with video noise along the sequence dimension, while frontal appearance features are injected into the text branch via cross-attention. Missing views are filled with learnable proxy tokens, and Jump-RoPE inserts positional gaps to isolate different identities. The four-stage Progressive View Curriculum (PVC) drives a stable transition from single-view to multi-view spatial reasoning.

MFI: Multi-level Feature Injection

Raw ViT appearance features from frontal references are projected and concatenated with text tokens, providing persistent low-level identity anchors that complement the high-level features inside the DiT blocks throughout the entire denoising process.
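The projection-and-concatenation step described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `project` and `build_context`, the toy shapes, and the bias-free linear projection are all assumptions made for illustration, and plain Python lists stand in for tensors.

```python
def project(features, weight):
    """Bias-free linear projection: (n, d_in) x (d_in, d_out) -> (n, d_out)."""
    columns = list(zip(*weight))  # d_out columns, each of length d_in
    return [[sum(f * w for f, w in zip(row, col)) for col in columns]
            for row in features]


def build_context(text_tokens, vit_features, weight):
    """Append projected ViT tokens to the text tokens along the sequence axis.

    The result is the key/value sequence that a cross-attention layer
    would attend to (hypothetical helper, not from the paper).
    """
    appearance_tokens = project(vit_features, weight)
    return text_tokens + appearance_tokens


# Toy shapes (hypothetical): 3 text tokens of width 4, 2 ViT patches of width 6.
text_tokens = [[0.0] * 4 for _ in range(3)]
vit_features = [[1.0] * 6 for _ in range(2)]
weight = [[0.5] * 4 for _ in range(6)]  # maps width 6 -> width 4

context = build_context(text_tokens, vit_features, weight)
print(len(context), len(context[0]))
```

Because the appearance tokens share the text branch's sequence, every block that cross-attends to text also sees them, which is one way to read "persistent" in the description above.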

LPT: Learnable Proxy Tokens

Learnable placeholders unify the sequence layout across single- and multi-view settings. They act as “attention sinks” that absorb unmappable features, mitigating the reference–view mismatch problem and stabilizing the structural priors of the backbone.
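A minimal sketch of the layout-unification idea follows, assuming a fixed set of view slots. The slot names, the `fill_view_slots` helper, and the string stand-ins for learnable embeddings are all hypothetical; in a real model each proxy would be a trainable parameter rather than a string.

```python
VIEW_SLOTS = ("front", "left", "right", "back")  # hypothetical slot names


def fill_view_slots(references, proxy_tokens, slots=VIEW_SLOTS):
    """Return one entry per slot, using a proxy wherever a view is missing."""
    layout = []
    for view in slots:
        if view in references:
            layout.append((view, "reference", references[view]))
        else:
            layout.append((view, "proxy", proxy_tokens[view]))
    return layout


# Strings stand in for learnable embeddings in this sketch.
proxy_tokens = {view: f"<proxy:{view}>" for view in VIEW_SLOTS}

single_view = fill_view_slots({"front": "front.png"}, proxy_tokens)
multi_view = fill_view_slots(
    {"front": "front.png", "left": "left.png", "back": "back.png"},
    proxy_tokens,
)
print(len(single_view), len(multi_view))
```

Both calls yield the same number of slots in the same order, so the downstream sequence layout does not depend on how many views were actually supplied.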

Jump-RoPE: Identity-wise Feature Isolation

Discontinuous coordinate jumps insert logical gaps between the positional bands of different identities and between references and video noise, enforcing strict feature boundaries to suppress cross-identity crosstalk.
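The coordinate jumps can be sketched as a position-index assignment. This simplification makes several assumptions not stated in the source: positions are one-dimensional (video RoPE is typically multi-axis), video tokens come first, and the gap size of 8 is an arbitrary placeholder. The function name `jump_positions` is hypothetical.

```python
def jump_positions(num_video_tokens, ref_tokens_per_identity, gap=8):
    """Assign position indices with `gap` skipped positions between bands.

    Returns the video positions and one list of positions per identity.
    """
    video = list(range(num_video_tokens))
    next_pos = num_video_tokens + gap  # jump between video and references
    bands = []
    for num_tokens in ref_tokens_per_identity:
        bands.append(list(range(next_pos, next_pos + num_tokens)))
        next_pos += num_tokens + gap  # jump between identities
    return video, bands


# 4 video tokens, then two identities with 3 and 2 reference tokens.
video, bands = jump_positions(4, [3, 2], gap=8)
print(video, bands)
```

With contiguous indexing the second identity would start right after the first; here each band is separated from its neighbours by unused positions, so relative offsets across bands are always larger than any offset within a band.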

PVC: Progressive View Curriculum

A four-stage schedule—single-view scaling, multi-view warm-up, strategic view dropout, and high-fidelity refinement—gradually increases reference complexity, enabling robust multi-view reasoning without degrading text-following or motion-synthesis capability.
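As a rough illustration of how such a schedule might be expressed in code, the sketch below encodes the four stage names from the description and a simple view-dropout sampler. The `max_views` and `view_dropout` values are placeholders, not numbers from the paper, and the rule of always keeping the first (frontal) view is an assumption.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    max_views: int       # placeholder value, not from the paper
    view_dropout: float  # placeholder value, not from the paper


SCHEDULE = (
    Stage("single-view scaling", max_views=1, view_dropout=0.0),
    Stage("multi-view warm-up", max_views=7, view_dropout=0.0),
    Stage("strategic view dropout", max_views=7, view_dropout=0.5),
    Stage("high-fidelity refinement", max_views=7, view_dropout=0.5),
)


def sample_views(stage, views, rng):
    """Keep the first (frontal) view; drop each other view with
    probability `stage.view_dropout`. Assumes `views` is non-empty."""
    candidates = views[:stage.max_views]
    kept = [candidates[0]]
    for view in candidates[1:]:
        if rng.random() >= stage.view_dropout:
            kept.append(view)
    return kept


views = ["front", "left", "right", "back"]
rng = random.Random(0)
for stage in SCHEDULE:
    print(stage.name, sample_views(stage, views, rng))
```

Dropped views would then be the ones replaced by proxy tokens, which is how a curriculum like this could expose the model to both sparse and dense reference sets.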

Comparison with the state-of-the-art methods

Eight methods per case on HarmoView-Bench: four open-source baselines, two commercial engines, and HarmoView (1-View & Multi-View).

Person 1

Person 2

Single shot, starting with a medium shot in an ancient Chinese imperial palace setting. On the left is a young woman with long brown hair, wearing a golden crown inlaid with jewels and a red robe embroidered with dragons and phoenixes. She turns her face slightly to the right, gazes at the man on the right, and gently rests her head on his shoulder while speaking to him. On the right is a young man with a shaved head, wearing a yellow Buddhist robe, hands clasped together in front of his chest, palms facing each other with fingertips pointing upward, body slightly leaning forward, head bowed, eyes lowered, brows furrowed, listening with a tense expression. The camera slowly pushes in from the medium shot of both characters, ending in a close-up on the man's face to capture his tense expression. The background features red palace walls, carved window lattices, and hanging red curtains. Realistic style, cinematic quality.

Compared with the other methods, HarmoView better preserves identity consistency under large viewpoint changes while offering stronger prompt fidelity—including her head resting on his shoulder, a shaved-head young man in a yellow Buddhist robe, body slightly leaning forward, and head bowed with eyes lowered. It is also worth noting that Kling 3.0 exhibits a clear reference-copy artifact (white earphones absent from the text or references), whereas our result stays aligned with the intended wardrobe and accessories.

Open-Source Models

VACE

Kaleido

RefAlign

HuMo

Commercial Engines & HarmoView

Kling 3.0

Wan 2.7

HarmoView★ (1-View)

HarmoView (Multi-View)

Person 1

Person 2

The video is a medium close-up shot, captured from a low-angle perspective with a wide-angle lens and a fixed camera. On the left side of the frame is a young man wearing a shirt and vest, with black rectangular-framed glasses. He is bowing with his hands clasped in front of his chest, speaking with an excited tone. On the right side is another young man, dressed in a luxurious dark green tailcoat, wearing silver round-framed glasses. He occupies most of the frame, looking toward the upper left of the frame with a cold, intense gaze, speaking with an excited tone. The background features the Canton Tower and parked luxury cars. The camera slowly and smoothly rotates from a tilted diagonal composition to a frontal view, during which the left character is shown in profile and the right character is shown in full frontal expression. The low-angle perspective is maintained throughout. Realistic style, cinematic quality.

Compared with every open-source and closed-source baseline, HarmoView most consistently delivers stronger identity preservation, semantic faithfulness, and photorealism. It better follows the staged layout (on the left, a young man in a shirt and vest, bowing with hands clasped in front of his chest; on the right, the companion occupying most of the frame) and the camera move, rotating from a tilted diagonal composition to a frontal view, while keeping both faces, the wardrobe, and the Canton Tower backdrop stable throughout.

Open-Source Models

VACE

Kaleido

RefAlign

HuMo

Commercial Engines & HarmoView

Kling 3.0

Wan 2.7

HarmoView★ (1-View)

HarmoView (Multi-View)

Person 1

Person 2

HarmoView (Multi-View) maintains strong identity consistency for the hero while outperforming all other methods on semantic alignment and photorealism. In particular it better matches the prompt details—slightly messy black short hair, standing on a huge sword, and the camera slowly rising and tilting down—so costume, sword scale, and cloudscape read as a coherent cinematic take rather than a drifting stylization.

Open-Source Models

VACE

Kaleido

RefAlign

HuMo

Commercial Engines & HarmoView

Kling 3.0

Wan 2.7

HarmoView★ (1-View)

HarmoView (Multi-View)

Person 1

Person 2

The video is a static medium shot, captured from a frontal, eye-level angle. In the center of the frame is a young man with dyed yellow short hair and dark brown eyes, wearing a dark jacket over a white crew-neck shirt. He holds a giant blue water jug with both hands, initially facing the camera. At the start of the video, he turns his body from front to side, looks up, and gently pushes the jug towards the camera. He then speaks a line, followed by a 'cheers' gesture, raising the jug with both hands as if toasting, his gaze sincere. Immediately after, he strains with both hands to lift the heavy jug high, tilts his head back, opens his mouth, and drinks deeply from the jug's opening, the water inside sloshing naturally with his movement. The background is a static yellow wooden wall. At the bottom of the frame, white text with a black outline reads: '为我的莽撞自罚一杯'. Realistic style, cinematic quality.

HarmoView (Multi-View) delivers stronger physical plausibility and identity consistency than both open-source and closed-source baselines on this clip. It also follows the prompt more faithfully—including dyed yellow short hair, pushing the jug toward the camera, raising the jug with both hands for the toast, and the generated on-screen subtitle—while keeping liquid motion, hand–object contact, and facial likeness coherent through the drinking motion.

Open-Source Models

VACE

Kaleido

RefAlign

HuMo

Commercial Engines & HarmoView

Kling 3.0

Wan 2.7

HarmoView★ (1-View)

HarmoView (Multi-View)

Person 1

Person 2

HarmoView (Multi-View) clearly surpasses the open-source baselines on this prompt: it faithfully executes the instruction to hold the golden retriever and spin on the grass while maintaining stronger identity consistency for both the person and the dog. Among closed-source engines, our model additionally achieves tighter ID preservation than Wan 2.7, keeping the handler and the dog recognizable and coupled throughout the rotation and zoom-in.

Open-Source Models

VACE

Kaleido

RefAlign

HuMo

Commercial Engines & HarmoView

Kling 3.0

Wan 2.7

HarmoView★ (1-View)

HarmoView (Multi-View)

Dataset & Benchmark

One training-corpus sample and one HarmoView-Bench case, each with multi-view references and a prompt; the training sample also includes the target video.

Training Dataset Sample

A representative entry from the HarmoView training corpus — two identities, each captured from seven reference views, paired with a structured prompt and the target video.

Dataset Sample — Two-person Multi-View Tuple

Person 1

Person 2

An East Asian man in his 20s or early 30s stands behind an East Asian woman who has fair skin and long brown hair tied back. He wears a black vest over a dark green tunic and has a topknot hairstyle. She is dressed in light-colored robes with a high collar. They both look towards something just outside the right side of the frame where part of their audience's head can be seen. The woman's expression shifts between shock and anxiety; she opens her mouth slightly before closing it again. Her hands grip the man's arm for support. The man maintains eye contact forward throughout the clip.

HarmoView-Bench Case

A curated evaluation case from HarmoView-Bench — two identities with five reference views each, plus the evaluation prompt. The benchmark only ships the input tuple; no ground-truth video is included.
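The two tuple shapes described in this section (a training sample with a target video, a benchmark case without one) could be represented with a single record type. The sketch below is an assumption about structure only: the class names, field names, and file names are hypothetical and do not reflect the actual dataset format.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Identity:
    name: str
    reference_views: List[str]  # paths to reference images (hypothetical)


@dataclass
class MultiViewCase:
    identities: List[Identity]
    prompt: str
    target_video: Optional[str] = None  # None for benchmark cases

    def is_benchmark_case(self) -> bool:
        """Benchmark cases ship only the input tuple, with no target video."""
        return self.target_video is None


def make_identity(name, num_views):
    """Build an identity with placeholder view file names."""
    return Identity(name, [f"{name}_view{i}.png" for i in range(num_views)])


# Training sample: two identities, seven views each, plus a target video.
train_sample = MultiViewCase(
    identities=[make_identity("person1", 7), make_identity("person2", 7)],
    prompt="An East Asian man ... stands behind an East Asian woman ...",
    target_video="target.mp4",
)

# Benchmark case: two identities, five views each, no target video.
bench_case = MultiViewCase(
    identities=[make_identity("person1", 5), make_identity("person2", 5)],
    prompt="A single-shot video begins with a medium close-up ...",
)
print(train_sample.is_benchmark_case(), bench_case.is_benchmark_case())
```

Keeping `target_video` optional lets the same loader serve both training and evaluation, with the missing field marking evaluation-only inputs.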

HarmoView-Bench — Large-Pose Evaluation

Person 1

Person 2

A single-shot video begins with a medium close-up, slowly zooming in to a facial close-up, presenting a cinematic still quality. On the left is a young man with an average face shape, dark brown eyes, and short black hair combed back and tied with a ribbon, wearing a flowing, exquisite xianxia-style robe and matching trousers. On the right is a young man with an average face shape, dark brown eyes, and short black hair with bangs, combed back and secured with a hairpin, wearing a flowing, exquisite xianxia-style robe and matching trousers. The two stand back-to-back at the foot of a mountain in a magnificent fairyland. The man on the left turns his head slightly, gazing at the man on the right with a melancholic expression. The man on the right bows his head in thought, then slowly lifts his head and speaks first, his voice low and firm. The man on the left replies resolutely. Afterwards, the man on the right stands still, staring into the distance, while the man on the left turns and flies away on the wind, his robes billowing. The camera pans right, focusing on a close-up of the man on the right's face; he looks sorrowful, tears welling in his eyes, with the blurred figure of the man on the left flying away visible in the background. The background is a magnificent fairyland: a waterfall cascades from a cliff into a lake below, with rising mist and swirling clouds, bathed in soft sunlight creating a dreamy atmosphere. Realistic style, cinematic quality.
