We present an approach to generate a 360-degree view of a person with a consistent, high-resolution appearance from a single input image.
NeRF and its variants typically require videos or images captured from multiple viewpoints. Most existing approaches that take monocular input either rely on ground-truth 3D scans for supervision or lack 3D consistency. While recent 3D generative models show promise for 3D-consistent human digitization, they do not generalize well to diverse clothing appearances, and their results lack photorealism.
Unlike existing work, we utilize high-capacity 2D diffusion models pretrained for general image synthesis tasks as an appearance prior for clothed humans. To achieve better 3D consistency while retaining the input identity, we progressively synthesize multiple views of the person in the input image by inpainting missing regions with shape-guided diffusion conditioned on silhouette and surface normal maps. We then fuse these synthesized multi-view images via inverse rendering to obtain a fully textured, high-resolution 3D mesh of the given person. Experiments show that our approach outperforms prior methods and achieves photorealistic 360-degree synthesis of a wide range of clothed humans with complex textures from a single image.
To generate a 360-degree view of a person from a single image, we first synthesize multi-view images of the person. We use off-the-shelf methods to infer the 3D geometry and to synthesize an initial back view of the person as guidance. We add the input view and this synthesized back view to our support set. To generate a new view, we aggregate all the visible pixels from the support set by blending their RGB colors, weighted by visibility, viewing angle, and distance to missing regions. To hallucinate the unseen appearance and complete the novel view, we use a pretrained inpainting diffusion model guided by shape cues (normal and silhouette maps). We then add the generated view to the support set and repeat this process for all remaining views.
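The pixel aggregation step above can be sketched as a weighted blend over the support set. The exact weighting formula is not specified here, so the product of visibility, viewing-angle cosine, and distance to the nearest missing region used below is an illustrative assumption, not the paper's precise scheme:

```python
import numpy as np

def blend_support_views(colors, visibility, view_cos, dist_to_hole):
    """Blend pixels reprojected from K support views into a target view.

    colors:       (K, H, W, 3) RGB from each support view, reprojected
    visibility:   (K, H, W)    1 where the support view sees the surface point
    view_cos:     (K, H, W)    cosine between viewing ray and surface normal
    dist_to_hole: (K, H, W)    distance (in pixels) to the nearest missing region
    Returns the blended (H, W, 3) image and a mask of pixels with no valid
    source, which are left for the inpainting diffusion model to fill.
    """
    # Assumed weighting: the paper combines visibility, viewing angle, and
    # distance to missing regions; this particular product is a stand-in.
    w = visibility * np.clip(view_cos, 0.0, None) * dist_to_hole  # (K, H, W)
    w_sum = w.sum(axis=0)                                          # (H, W)
    hole = w_sum <= 1e-8            # no support view covers these pixels
    safe = np.where(hole, 1.0, w_sum)
    blended = (w[..., None] * colors).sum(axis=0) / safe[..., None]
    return blended, hole
```

Pixels flagged by `hole` correspond to the missing regions that the shape-guided diffusion model inpaints in the next step.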
We then fuse the synthesized multi-view images to obtain a textured 3D human mesh. Using a UV parameterization of the estimated mesh, we optimize a UV texture map while keeping the geometry fixed. In each iteration, we differentiably render the UV texture map from every viewpoint in our support set and minimize the reconstruction loss between the rendered image and the corresponding synthesized view, using both LPIPS and L1 losses. The fusion results in a textured mesh that can be rendered from any view.
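The fusion loop can be sketched as a simple texture optimization. As a minimal stand-in for a full differentiable mesh renderer, the sketch below samples the texture at precomputed per-pixel UV coordinates with `grid_sample`; the `uv_coords` inputs, the hypothetical `fuse_texture` helper, and the optimizer settings are assumptions for illustration. The perceptual term is passed in optionally (e.g. from the `lpips` package) so the sketch stays dependency-free:

```python
import torch
import torch.nn.functional as F

def fuse_texture(uv_coords, target_views, tex_res=256, iters=200, lpips_fn=None):
    """Optimize a UV texture map so that rendering it in every synthesized
    view matches the synthesized image (geometry and UVs held fixed).

    uv_coords:    list of (1, H, W, 2) per-view UV lookups in [-1, 1],
                  a stand-in for rasterized UV coordinates of the mesh
    target_views: list of (1, 3, H, W) synthesized images
    lpips_fn:     optional perceptual loss module; None keeps L1 only
    """
    texture = torch.full((1, 3, tex_res, tex_res), 0.5, requires_grad=True)
    opt = torch.optim.Adam([texture], lr=1e-2)
    for _ in range(iters):
        opt.zero_grad()
        loss = torch.zeros(())
        for uv, target in zip(uv_coords, target_views):
            # Differentiable "render": look up the texture at each pixel's UV.
            rendered = F.grid_sample(texture, uv, align_corners=True)
            loss = loss + F.l1_loss(rendered, target)
            if lpips_fn is not None:  # perceptual term, as in the paper
                loss = loss + lpips_fn(rendered, target).mean()
        loss.backward()
        opt.step()
    return texture.detach()
```

Because the geometry and UV mapping are fixed, only the texture receives gradients, so the optimization converges quickly and yields a texture consistent across all synthesized views.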
This example shows specular highlights baked into the face and clothing texture, even though such effects should ideally be view-dependent.
Since the back-view synthesis approach lacks awareness of the underlying geometry, it may misjudge the length of skirts, dresses, and other garments. This can be observed where the legs are incorrectly textured with the garment pattern.
Due to imperfect shape estimation, our 360-degree generation can exhibit missing geometry; in this example, a foot is missing.
@inproceedings{albahar2023humansgd,
author = {AlBahar, Badour and Saito, Shunsuke and Tseng, Hung-Yu and Kim, Changil and Kopf, Johannes and Huang, Jia-Bin},
title = {Single-Image 3D Human Digitization with Shape-Guided Diffusion},
booktitle = {SIGGRAPH Asia},
year = {2023},
}