Motion Inversion for Video Customization

Luozhou Wang¹,*, Ziyang Mai¹,*, Guibao Shen¹, Yixun Liang¹,

Xin Tao³, Pengfei Wan³, Di Zhang³, Yijun Li⁴, Yingcong Chen¹,²,†

¹HKUST(GZ), ²HKUST, ³Kuaishou, ⁴Adobe Research
*Indicates equal contribution. †Indicates corresponding author.


Source Video (Orbit shot)

"A rabbit, low poly game art style."

Source Video (Crane up shot)

"An island by the sea."

Source Video

"A robot is dancing."

Source Video

"Monkeys are playing coconut."

Motion Embeddings

In this research, we propose Motion Embeddings, a set of temporally coherent embeddings derived from a given video. They provide a compact and efficient motion representation built from two types of embeddings: a Motion Query-Key Embedding, which modulates the temporal attention map, and a Motion Value Embedding, which modulates the attention values.
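To make the two roles concrete, here is a minimal NumPy sketch of a single-head temporal attention layer with both embeddings injected. It is an illustration under stated assumptions, not the released implementation: the additive injection, the tensor shapes, and the names `temporal_attention`, `m_qk`, and `m_v` are all hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def temporal_attention(x, w_q, w_k, w_v, m_qk=None, m_v=None):
    """Single-head attention across frames, computed per spatial location.

    x:    (S, F, C) features; S spatial locations, F frames, C channels.
    m_qk: (1, F, C) assumed Motion Query-Key Embedding, shared over space.
    m_v:  (S, F, C) assumed Motion Value Embedding, one per spatial location.
    Assumption: both embeddings are added to the layer input.
    """
    qk_in = x if m_qk is None else x + m_qk  # shifts queries and keys only
    v_in = x if m_v is None else x + m_v     # shifts values only
    q, k, v = qk_in @ w_q, qk_in @ w_k, v_in @ w_v
    # (S, F, F) attention map between frames.
    attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1]))
    return attn @ v, attn

# Usage with random stand-in tensors.
rng = np.random.default_rng(0)
S, F, C = 16, 8, 32
x = rng.normal(size=(S, F, C))
w_q, w_k, w_v = (rng.normal(size=(C, C)) / np.sqrt(C) for _ in range(3))
m_qk = rng.normal(size=(1, F, C))
m_v = rng.normal(size=(S, F, C))
out, attn = temporal_attention(x, w_q, w_k, w_v, m_qk=m_qk, m_v=m_v)
print(out.shape, attn.shape)
```

In this sketch `m_qk` changes only which frames attend to which, while `m_v` changes only what is aggregated, which mirrors the split between the attention map and the attention values described above.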

More results

Source Video (Orbit shot)

"A house, 3d style."

Source Video

"Skeleton in suit is dancing, in autumn."

Source Video (Crane up shot)

"Ice on the sea in sunset."

Source Video

"A tiger doing pull-ups in the forest."

Source Video (Custom shot)

"A high-tech chip."

Source Video

"A dragon sitting in a flora garden."

Debiasing

Left: For the Motion Query-Key Embedding, which influences the attention map, we exclude the spatial dimensions. Including them would cause the attention map between frames to capture the object's shape (e.g., the shape of the tank in the original video is visible in the attention map). Right: Following the concept of optical flow, we apply a differential operation to the Spatial-2D Motion Value Embedding, removing static appearance and preserving dynamic motion.
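The differential operation on the value embedding can be illustrated with a small NumPy sketch. The exact form used here (a backward difference between consecutive frames, with the first frame set to zero) is an assumption made for illustration, and `temporal_difference` is a hypothetical name.

```python
import numpy as np

def temporal_difference(m_v):
    """Frame-to-frame difference of a value embedding.

    m_v: (S, F, C) array; S spatial locations, F frames, C channels.
    Returns an array of the same shape whose frame t holds
    m_v[:, t] - m_v[:, t - 1]; frame 0 is zero (assumed convention).
    """
    diff = np.zeros_like(m_v)
    diff[:, 1:] = m_v[:, 1:] - m_v[:, :-1]
    return diff

# A component that is constant over frames (static appearance) cancels out,
# so only the frame-varying part (motion) survives the operation.
rng = np.random.default_rng(0)
static = rng.normal(size=(16, 1, 32))   # same in every frame
motion = rng.normal(size=(16, 8, 32))   # varies over frames
debiased = temporal_difference(static + motion)
print(np.allclose(debiased, temporal_difference(motion)))
```

Because the static term is identical in every frame, it drops out of each difference, which is the sense in which the operation removes static appearance and keeps dynamic motion.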

Motion Embeddings for ZeroScope and AnimateDiff

Our motion embeddings function like positional embeddings, which are found in almost all video generative models. As a result, they can be applied to many different video generative models using common techniques.
The first three slides are based on ZeroScope, and the last slide is based on AnimateDiff.

BibTeX

@misc{wang2024motioninversionvideocustomization,
  title={Motion Inversion for Video Customization},
  author={Luozhou Wang and Ziyang Mai and Guibao Shen and Yixun Liang and Xin Tao and Pengfei Wan and Di Zhang and Yijun Li and Yingcong Chen},
  year={2024},
  eprint={2403.20193},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2403.20193},
}