Motion Inversion for Video Customization

1HKUST(GZ), 2HKUST, 3Kuaishou, 4Adobe Research
*Indicates Corresponding Author.

Motion Embeddings

In this research, we propose Motion Embeddings, a set of explicit, temporally coherent one-dimensional embeddings derived from a given video. This representation enables the customization of video motion with remarkable efficiency, requiring fewer than 0.5 million parameters and less than 10 minutes of training time.
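As a rough illustration, a per-frame one-dimensional embedding of this kind can be realized as a small learnable parameter that is broadcast-added to the hidden states entering a temporal attention layer. The sketch below is a minimal, hypothetical rendering of that idea (module and tensor names are ours, not the authors' code); note how easily the parameter count stays under the stated 0.5M budget.

```python
import torch
import torch.nn as nn

class MotionEmbedding(nn.Module):
    """A learnable 1-D embedding per frame, added to the hidden states
    entering a temporal attention layer (hypothetical sketch)."""

    def __init__(self, num_frames: int, dim: int):
        super().__init__()
        # One dim-sized vector per frame: (num_frames, dim) parameters.
        self.embed = nn.Parameter(torch.zeros(num_frames, dim))

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch * spatial, num_frames, dim), as in a
        # temporal transformer; broadcast-add one vector per frame.
        return hidden_states + self.embed

# Parameter budget: one embedding for a 16-frame clip at a typical
# channel width is tiny; even several such embeddings (one per
# temporal layer) stay well under 0.5M parameters.
emb = MotionEmbedding(num_frames=16, dim=1280)
n_params = sum(p.numel() for p in emb.parameters())
print(n_params)  # 20480
```

Because the embedding is a plain additive term, training it amounts to standard inversion: freeze the backbone and optimize only these parameters against the source video.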


Motion Embeddings for UNet3D (ZeroScope)

Source Video

"A tank is running in the desert."

"A tiger is running in the forest."

Source Video

"A knight in armor rides a Segway"

"A toy train chugs around a roundabout tree"

Source Video

"A teddy bear is riding a tricycle in Times Square"

"A pigeon is strutting around a town square"

Source Video

"A little airplane does loops over the grass"

"A model car speeds down a miniature track"

Source Video

"A turtle plods in the sea"

"A penguin is sliding on an icy slope"

Source Video

"A soccer ball weaves through cones on its own"

"A rabbit bounds across a green lawn"

Temporal Discrepancy

We also identify a Temporal Discrepancy in video generative models: different motion modules process the temporal relationships between frames in markedly different ways. In UNet3D:

  • (a) Attention maps of the temporal transformers in the down and up blocks exhibit a "local" pattern, in which each frame attends to its neighboring frames, a property crucial for accurately capturing motion when inverting a custom video into motion embeddings.
  • (b) Attention maps in the mid blocks exhibit a "global" pattern, in which all frames attend primarily to the first and last frames while disregarding adjacent-frame information, leading to ineffective motion representation.
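One simple way to make the two patterns concrete is to measure how much attention mass lies near the diagonal of a temporal attention map versus on the first and last frames. The diagnostic below is our own illustrative metric, not one from the paper:

```python
import numpy as np

def locality_score(attn: np.ndarray, window: int = 1) -> float:
    """Fraction of attention mass within +/-window frames of the diagonal.
    attn: (num_frames, num_frames) row-stochastic temporal attention map."""
    f = attn.shape[0]
    near_diag = np.abs(np.arange(f)[:, None] - np.arange(f)[None, :]) <= window
    return float(attn[near_diag].sum() / attn.sum())

f = 8
# "Local" pattern (down/up blocks): each frame attends to its neighbors.
local = np.eye(f) + 0.5 * (np.eye(f, k=1) + np.eye(f, k=-1))
local /= local.sum(axis=1, keepdims=True)

# "Global" pattern (mid blocks): every frame splits its attention
# between the first and last frames.
glob = np.zeros((f, f))
glob[:, 0] = glob[:, -1] = 0.5

print(locality_score(local))  # 1.0
print(locality_score(glob))   # 0.25
```

A map with a high locality score carries frame-to-frame motion information; a low score (as in the mid-block pattern) indicates the module is summarizing global context rather than motion, which is why inverting into such blocks is ineffective.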


Motion Embeddings for UNetMotion (AnimateDiff)

Our motion embedding operates functionally like a positional embedding, a component found in nearly all video generative models. As a result, our motion embeddings can be applied to many different video generative models using standard techniques.

Source Video

"A fish swimming in the lake"

"A motorbike driving in a city"

Source Video

"A ship sailing on the sea"

"A car driving on a road"

Source Video

"A tiger walking in the forest"

"An elephant walking on the rocks"

Motion Embeddings for DiT (Latte, Open-Sora)

Coming soon!

BibTeX

@misc{wang2024motion,
  title={Motion Inversion for Video Customization},
  author={Luozhou Wang and Guibao Shen and Yixun Liang and Xin Tao and Pengfei Wan and Di Zhang and Yijun Li and Yingcong Chen},
  year={2024},
  eprint={2403.20193},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}