Tencent Hunyuan
Corresponding author: cw_research@163.com
Current identity-consistent video generation methods struggle to preserve appearance fidelity under large viewpoint changes. While introducing multi-view reference input offers a natural solution, progress remains constrained by the lack of effective frameworks for multi-view inputs and the scarcity of multi-view data. We address these challenges by proposing HarmoView, a robust framework for identity-consistent video generation that integrates multi-view cues through three complementary architectural refinements and a staged training curriculum. Specifically, (1) we introduce Multi-level Feature Injection (MFI) to anchor identity fidelity: by injecting raw ViT features from frontal reference images alongside text tokens via cross-attention, MFI provides persistent low-level appearance anchors that complement the high-level identity features within DiT blocks, improving identity preservation. (2) We employ learnable proxy tokens to unify heterogeneous reference layouts across single- and multi-view settings while simultaneously resolving the reference–view mismatch problem. (3) We develop Jump-RoPE for identity-wise feature isolation to reduce identity crosstalk. To activate these structural capabilities while preserving the model’s original generative priors, we propose the Progressive View Curriculum, a four-stage training strategy that employs view dropout to facilitate a stable transition from vanilla T2V generation to high-fidelity, identity-persistent spatial reasoning. Furthermore, to address data scarcity, we construct a large-scale multi-view dataset using an in-house LoRA-augmented pipeline and multi-stage filtering.
Extensive evaluation on our multi-view benchmark, comprising 100 manually curated cases spanning 52 unique identities, demonstrates that HarmoView significantly outperforms open-source baselines and matches leading closed-source engines, achieving state-of-the-art performance in identity-consistent video generation.
Three architectural refinements on a pre-trained Wan2.2-T2V backbone, plus a four-stage Progressive View Curriculum for stable training.
Multi-level Feature Injection (MFI): raw ViT appearance features from frontal references are projected and concatenated with text tokens, providing persistent low-level identity anchors that complement the high-level features inside the DiT blocks throughout the entire denoising process.
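The projection-and-concatenation step described here can be illustrated with a minimal sketch. It is not the authors' implementation: the widths `D_VIT` and `D_TEXT`, the token counts, the random weights, and the helper names `project` and `build_cross_attention_context` are all hypothetical, and plain Python lists stand in for tensors. The sketch only shows how reference tokens might be mapped to the text width and appended along the sequence axis so that cross-attention can read both.

```python
import random

def project(features, weight):
    """Apply a linear map to every token: (n, d_in) x (d_in, d_out) -> (n, d_out)."""
    d_in, d_out = len(weight), len(weight[0])
    return [
        [sum(token[i] * weight[i][j] for i in range(d_in)) for j in range(d_out)]
        for token in features
    ]

def build_cross_attention_context(text_tokens, vit_features, weight):
    """Concatenate text tokens and projected ViT tokens along the sequence axis."""
    return text_tokens + project(vit_features, weight)

rng = random.Random(0)
D_VIT, D_TEXT = 6, 4  # hypothetical feature widths, chosen only for illustration

# Hypothetical inputs: 5 text tokens and 3 ViT patch tokens from a frontal reference.
text_tokens = [[rng.random() for _ in range(D_TEXT)] for _ in range(5)]
vit_features = [[rng.random() for _ in range(D_VIT)] for _ in range(3)]
# Stand-in for a learned projection matrix of shape (D_VIT, D_TEXT).
weight = [[rng.gauss(0.0, 0.02) for _ in range(D_TEXT)] for _ in range(D_VIT)]

context = build_cross_attention_context(text_tokens, vit_features, weight)
print(len(context), len(context[0]))  # prints "8 4"
```

Because the same context would be supplied to cross-attention at every denoising step, the appearance tokens stay available throughout sampling, which is one way to read "persistent" in the description above.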
Proxy tokens: learnable placeholders unify the sequence layout across single- and multi-view settings. They act as “attention sinks” that absorb unmappable features, mitigating the reference–view mismatch problem and stabilizing the structural priors of the backbone.
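One way to picture the unified layout is a fixed number of view slots per identity, with empty slots filled by a shared placeholder. The sketch below assumes exactly that; the slot count `MAX_VIEWS`, the helper `unify_layout`, and the toy vectors are hypothetical, and the attention-sink behaviour itself (which would emerge from training) is not modelled.

```python
MAX_VIEWS = 5  # hypothetical number of view slots per identity

def unify_layout(view_tokens, proxy_token, max_views=MAX_VIEWS):
    """Pad a variable number of view tokens to a fixed slot layout.

    Returns the padded slot list and a mask that is True for real views
    and False for slots filled with the proxy token.
    """
    if len(view_tokens) > max_views:
        raise ValueError("more views than available slots")
    n_missing = max_views - len(view_tokens)
    slots = list(view_tokens) + [proxy_token] * n_missing
    is_real = [True] * len(view_tokens) + [False] * n_missing
    return slots, is_real

# Stand-in for a learned embedding; in a real model this would be a trainable parameter.
proxy = [0.0, 0.0, 0.0]

single_view = [[0.9, 0.1, 0.3]]
multi_view = [[0.9, 0.1, 0.3], [0.8, 0.2, 0.4], [0.7, 0.3, 0.5]]

single_slots, single_mask = unify_layout(single_view, proxy)
multi_slots, multi_mask = unify_layout(multi_view, proxy)

# Both settings now share one sequence length, so the backbone sees the same layout.
print(len(single_slots), single_mask)
print(len(multi_slots), multi_mask)
```

Under this assumption, a single-view sample and a multi-view sample differ only in which slots carry real features, not in sequence shape.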
Jump-RoPE: discontinuous coordinate jumps insert logical gaps between the positional bands of different identities and between references and video noise, enforcing strict feature boundaries to suppress cross-identity crosstalk.
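The idea of gapped positional bands can be sketched with plain integer indices. This is a one-dimensional simplification under stated assumptions: the function name `jump_positions`, the gap size `jump`, the token counts, and the ordering (reference bands first, video tokens last) are all hypothetical and are not taken from the paper, which may apply the jumps on different RoPE axes.

```python
def jump_positions(ref_lengths, video_length, jump):
    """Assign position indices with `jump` unused indices between bands.

    ref_lengths: number of reference tokens for each identity.
    Returns (per-identity index lists, video index list).
    """
    bands = []
    cursor = 0
    for n in ref_lengths:
        bands.append(list(range(cursor, cursor + n)))
        cursor += n + jump  # skip `jump` indices before the next band
    video = list(range(cursor, cursor + video_length))
    return bands, video

# Two identities with 3 reference tokens each, 4 video tokens, a gap of 10 indices.
bands, video = jump_positions(ref_lengths=[3, 3], video_length=4, jump=10)
print(bands)  # prints "[[0, 1, 2], [13, 14, 15]]"
print(video)  # prints "[26, 27, 28, 29]"
```

Since rotary embeddings encode relative offsets, tokens in different bands end up at least `jump + 1` positions apart, while tokens inside one band keep consecutive indices.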
Progressive View Curriculum: a four-stage schedule (single-view scaling, multi-view warm-up, strategic view dropout, and high-fidelity refinement) gradually increases reference complexity, enabling robust multi-view reasoning without degrading text-following or motion-synthesis capability.
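A staged schedule with view dropout could be configured roughly as follows. Only the four stage names come from the description above; the `max_views` caps, the `view_dropout` probabilities, the rule of always keeping the first (assumed frontal) view, and the helper `sample_views` are hypothetical choices made for illustration.

```python
import random

# Stage names follow the text; every numeric value here is a placeholder.
STAGES = [
    {"name": "single-view scaling", "max_views": 1, "view_dropout": 0.0},
    {"name": "multi-view warm-up", "max_views": 3, "view_dropout": 0.0},
    {"name": "strategic view dropout", "max_views": 5, "view_dropout": 0.5},
    {"name": "high-fidelity refinement", "max_views": 5, "view_dropout": 0.1},
]

def sample_views(views, stage, rng):
    """Pick the reference views used for one training sample.

    The first view (assumed frontal) is always kept. Each remaining view is
    dropped independently with probability `view_dropout`, and the result is
    capped at `max_views`.
    """
    frontal, extras = views[0], views[1:]
    kept = [v for v in extras if rng.random() >= stage["view_dropout"]]
    return [frontal] + kept[: stage["max_views"] - 1]

rng = random.Random(0)
views = ["front", "left", "right", "back", "top"]
for stage in STAGES:
    print(stage["name"], sample_views(views, stage, rng))
```

With a cap of one view and no dropout, the first stage reduces to single-view training; later stages expose the model to a varying subset of the available views.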
Eight methods per case on HarmoView-Bench: four open-source baselines, two commercial engines, and HarmoView (1-View & Multi-View).
Architectural ablation on Jump-RoPE.
One training-corpus sample and one HarmoView-Bench case, each with multi-view references, prompt, and target video.
A representative entry from the HarmoView training corpus — two identities, each captured from seven reference views, paired with a structured prompt and the target video.
An East Asian man in his 20s or early 30s stands behind an East Asian woman who has fair skin and long brown hair tied back. He wears a black vest over a dark green tunic and has a topknot hairstyle. She is dressed in light-colored robes with a high collar. They both look towards something just outside the right side of the frame where part of their audience's head can be seen. The woman's expression shifts between shock and anxiety; she opens her mouth slightly before closing it again. Her hands grip the man's arm for support. The man maintains eye contact forward throughout the clip.
A curated evaluation case from HarmoView-Bench — two identities with five reference views each, plus the evaluation prompt. The benchmark only ships the input tuple; no ground-truth video is included.
A single-shot video begins with a medium close-up, slowly zooming in to a facial close-up, presenting a cinematic still quality. On the left is a young man with an average face shape, dark brown eyes, and short black hair combed back and tied with a ribbon, wearing a flowing, exquisite xianxia-style robe and matching trousers. On the right is a young man with an average face shape, dark brown eyes, and short black hair with bangs, combed back and secured with a hairpin, wearing a flowing, exquisite xianxia-style robe and matching trousers. The two stand back-to-back at the foot of a mountain in a magnificent fairyland. The man on the left turns his head slightly, gazing at the man on the right with a melancholic expression. The man on the right bows his head in thought, then slowly lifts his head and speaks first, his voice low and firm. The man on the left replies resolutely. Afterwards, the man on the right stands still, staring into the distance, while the man on the left turns and flies away on the wind, his robes billowing. The camera pans right, focusing on a close-up of the man on the right's face; he looks sorrowful, tears welling in his eyes, with the blurred figure of the man on the left flying away visible in the background. The background is a magnificent fairyland: a waterfall cascades from a cliff into a lake below, with rising mist and swirling clouds, bathed in soft sunlight creating a dreamy atmosphere. Realistic style, cinematic quality.
Scope boundaries and responsible-use considerations for HarmoView.
HarmoView is currently designed for portrait-identity consistency only. The method is optimized for human facial identity preservation and has not been validated for broader object categories or full-scene identity correspondence. In addition, the current formulation supports at most two identities per generation. This constraint may limit applicability in scenarios requiring large-cast interactions, dense crowds, or more complex compositional role assignments. Future work should investigate scalable identity slots, improved multi-person disentanglement, and broader-domain consistency modeling.
As a portrait-consistent video generation method, HarmoView may be misused to create deceptive or non-consensual synthetic media (e.g., identity impersonation and deepfake-style content). Such misuse can amplify reputational harm, harassment, and misinformation. We therefore recommend responsible deployment practices, including explicit policy constraints, access control, provenance signaling (e.g., watermarking or content credentials when feasible), and human-in-the-loop review for high-risk use cases. We also encourage downstream users to comply with local legal and ethical requirements regarding consent, privacy, and biometric identity usage.
If you find HarmoView useful for your research, please consider citing our work.
@article{wang2026harmoview,
title = {HarmoView: Harmonizing Multi-View Constraints for Identity-Consistent Video Generation},
author = {Wang, Cong and Yu, Zhentao and Wang, Hongmei and Liang, Weicong and
Zhou, Zixiang and Yang, Zilin and Ou, Jiarong and Chen, Rui and
Zhou, Yuan and Lu, Qinglin},
journal = {arXiv preprint arXiv:2606.10839},
year = {2026}
}