Synthesizing accurate hand-object interactions (HOI) is critical for applications in Computer Vision, Augmented Reality (AR), and Mixed Reality (MR). Despite recent advances, the accuracy of reconstructed or generated HOI leaves room for refinement. Some techniques have improved the accuracy of dense correspondences by shifting focus from generating explicit contacts to using rich HOI fields. Still, they lack full differentiability or continuity and are tailored to specific tasks. In contrast, we present a Coarse Hand-Object Interaction Representation (CHOIR), a novel, versatile and fully differentiable field for HOI modelling.
CHOIR leverages discrete unsigned distances for continuous shape and pose encoding, alongside multivariate Gaussian distributions to represent dense contact maps with few parameters. To demonstrate the versatility of CHOIR, we design JointDiffusion, a diffusion model that learns a grasp distribution conditioned on noisy hand-object interactions or on object geometries alone, for both refinement and synthesis applications.
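The two ingredients named above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `unsigned_distances` and `fit_gaussian`, the choice of anchor points, and the use of a plain maximum-likelihood fit are all assumptions made for illustration. The first function samples an unsigned distance field at a discrete set of anchors, and the second summarizes a cluster of contact points with the mean and covariance of a 3D Gaussian, which is how a dense contact map can be reduced to a few parameters.

```python
import math


def unsigned_distances(anchors, points):
    """Distance from each anchor to its nearest point in `points`.

    `anchors` and `points` are sequences of (x, y, z) tuples. The anchors
    stand in for a fixed, discrete set of sample locations (an assumption).
    """
    return [min(math.dist(a, p) for p in points) for a in anchors]


def fit_gaussian(points):
    """Maximum-likelihood mean and 3x3 covariance of a set of 3D points."""
    n = len(points)
    mean = [sum(p[i] for p in points) / n for i in range(3)]
    cov = [
        [
            sum((p[i] - mean[i]) * (p[j] - mean[j]) for p in points) / n
            for j in range(3)
        ]
        for i in range(3)
    ]
    return mean, cov


# Toy data: two anchors, two hand points, two contact points.
anchors = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
hand_points = [(0.0, 0.0, 1.0), (1.0, 0.0, 2.0)]
distances = unsigned_distances(anchors, hand_points)

contact_points = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
mean, cov = fit_gaussian(contact_points)
```

In this toy setup, each anchor stores one scalar distance, and the contact cluster is described by nine numbers (three for the mean, six unique covariance entries) regardless of how many contact points it contains.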
We demonstrate JointDiffusion’s improvements over the SOTA in both applications: it increases the contact F1 score by 5% for refinement and decreases the simulation displacement by 46% for synthesis. Our experiments show that JointDiffusion with CHOIR yields superior contact accuracy and physical realism compared to SOTA methods designed for specific tasks.
We introduce JointDiffusion, a diffusion model with a contact prediction branch and a shape & pose prediction branch. JointDiffusion learns the data distribution of CHOIR and can be conditioned on an object for grasp synthesis, or a noisy hand-object pair for grasp denoising.
Architecture of JointDiffusion. The 3D U-Net predicts the noise sample ϵ_d for the hand distance field d_H. The contact prediction branch predicts the noise sample ϵ_c for the contact Gaussian parameters c_H from the features of the U-Net’s bottleneck. This joint learning encourages the U-Net to extract features relevant to both tasks, enhancing the accuracy of the learned CHOIR distribution.
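A joint noise-prediction objective of this kind can be sketched as follows. This is a hedged illustration using the standard DDPM forward process and a mean-squared-error loss on each branch; the actual loss, noise schedule, and branch weighting used by JointDiffusion are not stated here, so `add_noise`, `joint_loss`, and `contact_weight` are hypothetical names and choices.

```python
import math
import random


def add_noise(x0, alpha_bar, rng):
    """Standard DDPM forward step: x_t = sqrt(a) * x_0 + sqrt(1 - a) * eps."""
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [
        math.sqrt(alpha_bar) * x + math.sqrt(1.0 - alpha_bar) * e
        for x, e in zip(x0, eps)
    ]
    return xt, eps


def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def joint_loss(eps_d, pred_d, eps_c, pred_c, contact_weight=1.0):
    """Sum of the distance-field and contact noise-prediction errors.

    `contact_weight` is an assumed hyperparameter, not taken from the source.
    """
    return mse(eps_d, pred_d) + contact_weight * mse(eps_c, pred_c)


rng = random.Random(0)
d_H = [0.1, 0.4, 0.2]   # toy hand distance field values
c_H = [0.0, 1.0]        # toy contact Gaussian parameters

# Noise both targets at the same (assumed) noise level.
d_t, eps_d = add_noise(d_H, alpha_bar=0.5, rng=rng)
c_t, eps_c = add_noise(c_H, alpha_bar=0.5, rng=rng)

# A perfect predictor would return the true noise, giving zero loss.
perfect = joint_loss(eps_d, eps_d, eps_c, eps_c)
```

Because both error terms depend on features computed by the same network, gradients from the contact term also shape the shared features, which is the intuition behind the joint learning described in the caption.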
We evaluate our proposed method (JointDiffusion + CHOIR + test-time optimization, TTO) on the Perturbed ContactPose benchmark and compare against ContactOpt and TOCH.
We also evaluate our method on grasp synthesis with the ContactPose dataset.
@article{morales2024CHOIR,
author = {Théo Morales and Omid Taheri and Gerard Lacey},
title = {A Versatile and Differentiable Hand-Object Interaction Representation},
journal = {WACV},
year = {2025},
}
This work was conducted with the financial support of the Science Foundation Ireland Centre for Research Training in Digitally-Enhanced Reality (d-real) under Grant No. 18/CRT/6224. For the purpose of Open Access, the author has applied a CC BY public copyright licence to any Author Accepted Manuscript version arising from this submission.
This project is supported by Science Foundation Ireland under the SFI Frontiers for the Future Programme - Project, award number 22/FFP-P/11522.