Bridging Actions: Generate 3D Poses and Shapes In-Between Photos

Accepted by IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI)

Wen-Li Wei and Jen-Chun Lin


Institute of Information Science, Academia Sinica, Taiwan

Our system automatically generates the 3D pose and shape transitions in-between photos. It supports transitions of user-defined variable lengths and arbitrary editing (replacement/insertion/deletion) of the input photos. Bright green represents the generated 3D human poses and shapes of the given photos, while dark green represents the generated transitions. (Corresponding to Fig. 1 in our paper)

Paper
Supplementary Material
Code (Coming soon...)
(Visualization of Attention)


Abstract

Generating realistic 3D human motion has long been a fundamental goal of the game and animation industry. This work presents a novel transition generation technique that bridges the actions of people in the foreground by generating 3D poses and shapes in-between photos, allowing 3D animators and novice users to easily create and edit 3D motions. To achieve this, we propose an adaptive motion network (ADAM-Net) that effectively learns human motion from masked action sequences to generate kinematically compliant 3D poses and shapes in-between given temporally-sparse photos. Three core learning designs underpin ADAM-Net. First, we introduce a random masking process that randomly masks images from an action sequence and fills the masked regions in latent space by interpolating the unmasked images, simulating the various transitions that can occur between given temporally-sparse photos. Second, we propose a long-range adaptive motion (L-ADAM) attention module that, together with multi-head cross-attention, leverages visual cues observed from human motion to adaptively recalibrate the range of attention within a sequence. Third, we develop a short-range adaptive motion (S-ADAM) attention module that selects and integrates adjacent feature representations at different levels in a weighted manner to strengthen temporal correlation. By coupling these designs, ADAM-Net excels not only in generating 3D poses and shapes in-between photos, but also in classic 3D human pose and shape estimation, as our results demonstrate.
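To give a concrete picture of the first design, the Python snippet below is a minimal sketch (not our released implementation) of random masking with latent-space interpolation. It assumes the action sequence has already been encoded into per-frame latent features of shape (T, D); the function name mask_and_interpolate and the choice to always keep the first and last frames unmasked (so they act as keyframes) are illustrative assumptions.

import torch

def mask_and_interpolate(latents, mask_ratio=0.5, generator=None):
    """Randomly mask frames of a latent action sequence (shape (T, D)) and fill
    each masked position by linear interpolation between its nearest unmasked
    neighbors. Sketch only; details of the actual training pipeline may differ."""
    T = latents.shape[0]
    # Keep the first and last frames unmasked so every masked position has an
    # unmasked neighbor on both sides (they play the role of given keyframes).
    candidates = torch.arange(1, T - 1)
    num_masked = int(mask_ratio * candidates.numel())
    perm = torch.randperm(candidates.numel(), generator=generator)
    masked = candidates[perm[:num_masked]]

    keep = torch.ones(T, dtype=torch.bool)
    keep[masked] = False
    kept_idx = torch.nonzero(keep, as_tuple=False).squeeze(1)

    filled = latents.clone()
    for t in masked.tolist():
        # Nearest unmasked frames to the left and right of position t.
        left = kept_idx[kept_idx < t].max()
        right = kept_idx[kept_idx > t].min()
        w = (t - left).float() / (right - left).float()
        filled[t] = (1 - w) * latents[left] + w * latents[right]
    return filled

Conceptually, sampling a fresh mask for each training sequence exposes the network to transitions of varying lengths and positions, which is what allows the transition length in-between photos to be chosen freely at inference time.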


Demo


Qualitative Evaluation

- Visual comparisons with the baseline method, TCMR [3], and MPS-Net [10].

Qualitative comparison of the baseline method (a), TCMR (b), and our ADAM-Net (c) on the challenging in-the-wild 3DPW dataset. The first and last photos in the sequence are the given keyframes, which correspond to the photos with bright backgrounds and red borders. The photos in-between the keyframes are the original image references (i.e., the dark-background photos). The transition length is set to 5 in this experiment. Bright colors represent the generated 3D human poses and shapes of the given photos, while dark colors represent the generated transitions. (Corresponding to Fig. 6 in our paper)

Qualitative comparison of TCMR (a) and our ADAM-Net (b) on the challenging in-the-wild MPI-INF-3DHP dataset, where the transition length is set to 20. Bright colors represent the generated 3D human poses and shapes of the given photos, while dark colors represent the generated transitions preserved after sampling. (Corresponding to Fig. 7 in our paper)



Qualitative comparison of the baseline method (a), TCMR (b), MPS-Net (c), and our ADAM-Net (d) on photos downloaded from the Internet, where the transition length is set to 31. Bright colors represent the generated 3D human poses and shapes of the given photos, while dark colors represent the generated transitions preserved after sampling. (Corresponding to Fig. 8 in our paper)



Qualitative comparison among our ADAM-Net, ADAM-Net with only MHA [21], and a linear interpolation method whose 3D human meshes for the start and end keyframes are taken from ADAM-Net. Bright colors represent the generated 3D human poses and shapes of the given photos, while dark colors represent the generated transitions preserved after sampling. (Corresponding to Fig. 9 in our paper)
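For clarity, the snippet below sketches what such a linear interpolation baseline amounts to, assuming SMPL-style parameters (72-D axis-angle pose, 10-D shape) for the two keyframes taken from ADAM-Net. Directly blending axis-angle pose vectors in this way is an illustrative assumption about the baseline, not a description of our method.

import numpy as np

def linear_interp_baseline(start_pose, end_pose, start_shape, end_shape, num_inbetween):
    """Naive baseline: linearly blend the SMPL pose (axis-angle) and shape (betas)
    of the start and end keyframes for each in-between frame. Sketch only."""
    frames = []
    for i in range(1, num_inbetween + 1):
        w = i / (num_inbetween + 1)            # interpolation weight in (0, 1)
        pose = (1 - w) * start_pose + w * end_pose
        shape = (1 - w) * start_shape + w * end_shape
        frames.append((pose, shape))
    return frames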




Qualitative Evaluation

- Visual effects of our ADAM-Net under different transition lengths, inserting/replacing different photos.

Qualitative results of 3D pose and shape transitions generated by ADAM-Net under arbitrarily edited input photos and transition lengths. Compared to (a), (b) inserts one more photo (keyframe); compared to (b), (c) replaces one of the photos in (b); and compared to (c), (d) changes the photo positions in (c) and uses a different transition length. Bright colors represent the generated 3D human poses and shapes of the given photos, while dark colors represent the generated transitions preserved after sampling. (Corresponding to Fig. 11 in our paper)

Qualitative results of animating 3D pose transitions (generated by our ADAM-Net) into avatars. Photos (keyframes) are downloaded from the Internet and edited arbitrarily by the author. The transition length in-between keyframes is set to 10. From top to bottom, each example displays the given keyframes, the unfolded generated 3D pose and shape transitions, and the animated character. (Corresponding to Fig. 13 in our paper)




EXPERIMENTS

- 3D Joint Transition Generation In-Betweening

Qualitative comparison of RMIB [7] with the pose estimator of [2], RMIB [7] with ADAM-Net as the pose estimator, and our ADAM-Net on the challenging in-the-wild 3DPW dataset. The red and blue skeletons represent the starting and ending keyframes, respectively, while the pink skeletons represent the generated 3D joint transitions. (Corresponding to Fig. 15 in our paper)



Qualitative comparison of RMIB [7] with the pose estimator of [56] and our ADAM-Net on the challenging EMDB dataset. The red and blue skeletons represent the starting and ending keyframes, respectively, while the pink skeletons represent the generated 3D joint transitions. (Corresponding to Fig. 16 in our paper)




EXPERIMENTS

- 3D Human Pose and Shape Estimation from Monocular Video

Qualitative comparison of TCMR (left), MPS-Net (middle), and our ADAM-Net (right) for 3D human pose and shape estimation on the challenging in-the-wild 3DPW dataset and on video downloaded from the Internet. (Corresponding to Fig. 17 in our paper)


References


[2] N. Kolotouros, G. Pavlakos, M. J. Black, and K. Daniilidis, “Learning to reconstruct 3D human pose and shape via model-fitting in the loop,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2019.

[3] H. Choi, G. Moon, J. Y. Chang, and K. M. Lee, “Beyond static features for temporally consistent 3D human pose and shape from a video,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021.

[7] F. G. Harvey, M. Yurick, D. Nowrouzezahrai, and C. Pal, “Robust motion in-betweening,” ACM Transactions on Graphics (TOG), vol. 39, no. 4, 2020.

[10] W.-L. Wei, J.-C. Lin, T.-L. Liu, and H.-Y. M. Liao, “Capturing humans in motion: Temporal-attentive 3D human pose and shape estimation from monocular video,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022.

[21] A. Vaswani, N. M. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in Advances in Neural Information Processing Systems (NeurIPS), 2017.

[56] Z. Li, J. Liu, Z. Zhang, S. Xu, and Y. Yan, “CLIFF: Carrying location information in full frames into human pose and shape estimation,” in Proceedings of the European Conference on Computer Vision (ECCV), 2022.

Acknowledgements

This webpage template is adapted from https://netease-gameai.github.io/ChoreoMaster/. Many thanks.
All images and videos on the website are for research purposes only.

BibTeX

@article{In-BetweenPhotos2024,
    author = {Wen-Li Wei and Jen-Chun Lin},
    title = {Bridging Actions: Generate 3D Poses and Shapes In-Between Photos},
    journal = {IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI)},
    volume = {},
    number = {},
    year = {2024},
    publisher = {IEEE}
}