In this work, we propose Motion Embeddings, a set of explicit, temporally coherent one-dimensional embeddings derived from a given video. This representation enables highly efficient customization of video motion, requiring fewer than 0.5 million parameters and under 10 minutes of training time.
We also identify the Temporal Discrepancy in video generative models: different motion modules process the temporal relationships between frames in different ways, as observed in UNet3D-based architectures.
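To make the idea concrete, here is a minimal sketch of how such a one-dimensional motion embedding could be injected: one learnable vector per frame, added to the hidden states entering a temporal module. The names, shapes, and injection point are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

# Hypothetical motion embedding: one learnable vector per frame.
# Shapes are assumptions for illustration (16 frames, 1280 channels).
num_frames, channels = 16, 1280
motion_embedding = np.zeros((num_frames, channels), dtype=np.float32)

def inject_motion(features: np.ndarray, emb: np.ndarray) -> np.ndarray:
    """Add the per-frame embedding to temporal-layer hidden states.

    features: (batch, frames, channels) array entering a temporal module.
    emb:      (frames, channels) learnable motion embedding.
    """
    return features + emb[None, :, :]  # broadcast over the batch dimension

features = np.random.randn(2, num_frames, channels).astype(np.float32)
out = inject_motion(features, motion_embedding)
print(out.shape)             # (2, 16, 1280)
print(motion_embedding.size) # 20480 parameters, far below 0.5M
```

Because the embedding is indexed only by frame (not by spatial position), its parameter count stays tiny relative to the backbone, which is consistent with the sub-0.5M-parameter budget described above.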
```bibtex
@misc{wang2024motion,
  title={Motion Inversion for Video Customization},
  author={Luozhou Wang and Guibao Shen and Yixun Liang and Xin Tao and Pengfei Wan and Di Zhang and Yijun Li and Yingcong Chen},
  year={2024},
  eprint={2403.20193},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}
```