Generating realistic 3D human motion has long been a fundamental goal of the game and animation industries. This work presents a novel transition generation technique that bridges the actions of people in the foreground by generating the 3D poses and shapes in between photos, allowing both professional 3D animators and novice users to easily create or edit 3D motions. To achieve this, we propose an adaptive motion network (ADAM-Net) that effectively learns human motion from masked action sequences to generate kinematically compliant 3D poses and shapes in between given temporally sparse photos. Three core learning designs underpin ADAM-Net. First, we introduce a random masking process that randomly masks images in an action sequence and fills the masked regions in latent space by interpolating between the unmasked images, simulating the diverse transitions possible under given temporally sparse photos. Second, we propose a long-range adaptive motion (L-ADAM) attention module that leverages visual cues observed from human motion to adaptively recalibrate the range of a sequence that needs attention, together with multi-head cross-attention. Third, we develop a short-range adaptive motion (S-ADAM) attention module that adaptively weights and integrates adjacent feature representations at different levels to strengthen temporal correlation. By coupling these designs, ADAM-Net excels not only at generating 3D poses and shapes in between photos, but also at classic 3D human pose and shape estimation.
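To make the random masking idea more concrete, the following is a minimal PyTorch sketch (not the authors' released code) of one plausible reading of this step: frames of a latent action sequence are randomly masked, and each masked slot is filled by linearly interpolating between its nearest unmasked neighbors. The function names (random_mask, fill_masked_latents), the rule of keeping the boundary frames visible, and the choice of linear interpolation are all illustrative assumptions.

import torch

def random_mask(seq_len: int, num_visible: int) -> torch.Tensor:
    # Randomly choose which frames stay visible; the rest are masked.
    # Returns a boolean tensor of shape (seq_len,), True = masked.
    # The two boundary frames are kept visible (an assumption, so that
    # every masked frame has neighbors to interpolate from).
    mask = torch.ones(seq_len, dtype=torch.bool)
    mask[0] = mask[-1] = False  # keep the boundary "photos"
    interior = torch.randperm(seq_len - 2)[: max(num_visible - 2, 0)] + 1
    mask[interior] = False
    return mask

def fill_masked_latents(latents: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    # Fill masked latent frames by linear interpolation between the
    # nearest unmasked frames on either side.
    # latents: (T, D) per-frame latent features; mask: (T,) bool, True = masked.
    filled = latents.clone()
    visible = torch.nonzero(~mask).flatten()
    for t in torch.nonzero(mask).flatten():
        left = visible[visible < t].max()   # nearest visible frame before t
        right = visible[visible > t].min()  # nearest visible frame after t
        w = (t - left).float() / (right - left).float()
        filled[t] = (1 - w) * latents[left] + w * latents[right]
    return filled

# Toy usage: a 16-frame sequence of 512-d latents with 4 visible frames.
T, D = 16, 512
latents = torch.randn(T, D)
mask = random_mask(T, num_visible=4)
filled = fill_masked_latents(latents, mask)

Under this reading, the interpolated latents act as a coarse stand-in for the missing transition, which the network is then trained to refine into kinematically compliant poses and shapes.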
Acknowledgements

This webpage template was adapted from https://netease-gameai.github.io/ChoreoMaster/. Thanks a lot. All images and videos on this website are for research purposes only.
@article{In-BetweenPhotos2024,
author = {Wen-Li Wei and Jen-Chun Lin},
title = {Bridging Actions: Generate 3D Poses and Shapes In-Between Photos},
journal = {IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI)},
volume = {},
number = {},
year = {2024},
publisher = {IEEE}
}