PERF: Panoramic Neural Radiance Field from a Single Panorama
TPAMI 2024

TL;DR: We present PERF, a 360-degree novel view synthesis framework that trains a panoramic neural radiance field from a single panorama.

Abstract

Neural Radiance Fields (NeRF) have achieved substantial progress in novel view synthesis given multi-view images. Recently, some works have attempted to train a NeRF from a single image with 3D priors. They mainly focus on settings with a limited field of view and few invisible occlusions, which greatly limits their scalability to real-world 360-degree panoramic scenes with large occluded regions. In this paper, we present PERF, a 360-degree novel view synthesis framework that trains a panoramic neural radiance field from a single panorama. Notably, PERF allows 3D roaming in a complex scene without expensive and tedious image collection. To achieve this goal, we propose a novel collaborative RGBD inpainting method and a progressive inpainting-and-erasing method to lift a 360-degree 2D scene to a 3D scene. Specifically, given a single panorama, we first predict a panoramic depth map as initialization and reconstruct the visible 3D geometry with volume rendering. Then we introduce a collaborative RGBD inpainting approach into NeRF training to complete RGB images and depth maps at random views, drawing on a Stable Diffusion model for appearance and a monocular depth estimator for geometry, so as to generate both the 3D geometry and the appearance of invisible regions. Finally, we introduce an inpainting-and-erasing strategy to avoid inconsistent geometry between a newly-added view and the reference views. The two components are integrated into the learning of NeRFs in a unified optimization framework and achieve promising results. Extensive experiments on Replica and a new dataset, PERF-in-the-wild, demonstrate the superiority of PERF over state-of-the-art methods. PERF can be widely applied to real-world tasks such as panorama-to-3D, text-to-3D, and 3D scene stylization.
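To make the visible-region reconstruction step concrete, the sketch below shows standard NeRF volume rendering with an added depth term: rendered colors are supervised by the input panorama and rendered ray depths by the predicted panoramic depth map. This is a minimal illustration under our own assumptions; the helper names and the loss weight lambda_d are not the paper's exact implementation.

```python
import torch

def volume_render(sigmas, rgbs, z_vals):
    """Composite per-sample densities/colors along each ray (standard NeRF).

    sigmas: (R, S) densities, rgbs: (R, S, 3) colors, z_vals: (R, S) sample depths.
    Returns the rendered color (R, 3) and the expected ray depth (R,).
    """
    deltas = z_vals[:, 1:] - z_vals[:, :-1]                          # (R, S-1)
    deltas = torch.cat([deltas, 1e10 * torch.ones_like(deltas[:, :1])], dim=-1)
    alphas = 1.0 - torch.exp(-sigmas * deltas)                       # per-sample opacity
    trans = torch.cumprod(                                           # transmittance
        torch.cat([torch.ones_like(alphas[:, :1]), 1.0 - alphas + 1e-10], dim=-1),
        dim=-1)[:, :-1]
    weights = alphas * trans                                         # (R, S)
    color = (weights[..., None] * rgbs).sum(dim=1)                   # (R, 3)
    depth = (weights * z_vals).sum(dim=1)                            # expected depth (R,)
    return color, depth

def init_losses(color, depth, gt_rgb, gt_depth, lambda_d=0.1):
    """RGB loss against the input panorama plus depth supervision from the
    monocular depth estimate; lambda_d is an assumed weighting, not the paper's."""
    return ((color - gt_rgb) ** 2).mean() + lambda_d * ((depth - gt_depth) ** 2).mean()
```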

Framework


PERF consists of three modules: 1) single-view NeRF training with depth maps; 2) collaborative RGBD inpainting; and 3) progressive inpainting-and-erasing. Specifically, given a single panorama, we predict its depth map with a depth estimation model and train a NeRF on the input view as initialization. Then a collaborative RGBD inpainting module, which combines a depth estimator and a Stable Diffusion model, extends the NeRF to random views. To avoid geometry conflicts, a progressive inpainting-and-erasing module computes a mask of the conflicting regions and eliminates them. We fine-tune the panoramic NeRF with the reference single-view panorama and the new panoramas generated from random viewpoints until convergence, as sketched below.
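The sketch below is a structural paraphrase of this loop, not the authors' code: estimate_depth, inpaint_rgb, align_depth, and conflict_mask are hypothetical stand-ins for the monocular depth estimator, the Stable Diffusion inpainter, depth alignment, and the inpainting-and-erasing check, and the nerf object's interface is assumed.

```python
# A minimal sketch of PERF's three-stage loop. All helpers are passed in as
# callables because they are placeholders, not a real API; this shows the
# assumed control flow only.

def train_perf(pano_rgb, nerf, estimate_depth, inpaint_rgb, align_depth,
               conflict_mask, sample_random_pose, identity_pose, num_views=60):
    # Stage 1: single-view NeRF training with a predicted panoramic depth map.
    pano_depth = estimate_depth(pano_rgb)
    train_views = [(identity_pose(), pano_rgb, pano_depth)]
    nerf.fit(train_views)

    for _ in range(num_views):
        pose = sample_random_pose()
        rgb, depth, visible = nerf.render(pose)  # visible: mask of seen pixels

        # Stage 2: collaborative RGBD inpainting. A diffusion model fills the
        # unseen RGB pixels; a monocular depth estimate, aligned to the
        # rendered depth on visible pixels, fills the unseen geometry.
        rgb = inpaint_rgb(rgb, mask=~visible)
        depth = align_depth(estimate_depth(rgb), depth, visible)

        # Stage 3: progressive inpainting-and-erasing. Pixels in the new view
        # whose geometry conflicts with the reference views are masked out.
        keep = ~conflict_mask(pose, depth, train_views)
        train_views.append((pose, rgb * keep, depth * keep))

        # Fine-tune on the reference panorama plus all accepted new views.
        nerf.fit(train_views)
    return nerf
```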

Demo Video

Application 1: Single Panorama to 3D


Application 2: Text to 3D Scene

Text to 3D Scene = Text to 2D Panorama + 2D Panorama to 3D Scene (PERF)
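In code, this composition is just two stages chained together. In the sketch below, text_to_panorama stands in for any text-to-panorama generator (e.g., Text2Light from the related links) and perf_reconstruct for the panorama-to-NeRF pipeline above; both names are illustrative assumptions.

```python
def text_to_3d(prompt, text_to_panorama, perf_reconstruct):
    """Text -> 2D panorama -> 3D scene; both callables are placeholders."""
    panorama = text_to_panorama(prompt)  # stage 1: text to 2D panorama
    return perf_reconstruct(panorama)    # stage 2: 2D panorama to 3D scene (PERF)
```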




Application 3: 3D Stylization

3D Stylization = 2D Panorama Stylization + 2D Panorama to 3D Scene (PERF)
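The same two-stage composition applies here, with a 2D panorama stylization model replacing the text-to-panorama generator; stylize_panorama is again an illustrative placeholder.

```python
def stylize_3d(panorama, stylize_panorama, perf_reconstruct):
    """2D panorama stylization -> 3D scene; both callables are placeholders."""
    return perf_reconstruct(stylize_panorama(panorama))
```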


Citation

Related Links

SceneDreamer: Unbounded 3D Scene Generation from 2D Image Collections.
CaG: Traditional Classification Neural Networks are Good Generators: They are Competitive with DDPMs and GANs.
SparseNeRF: Novel View Synthesis with Sparse Views.
F2-NeRF: Fast Neural Radiance Field Training with Free Camera Trajectories.
NeuS: Learning Neural Implicit Surfaces by Volume Rendering for Multi-view Reconstruction.
Text2Light: Zero-Shot Text-Driven HDR Panorama Generation.
StyleLight: HDR Indoor Panorama Generation from a Limited Field-of-View Image.
Fast-Vid2Vid: Spatial-Temporal Compression for Video-to-Video Synthesis.
AvatarCLIP: A Zero-Shot Text-Driven Framework for 3D Avatar Generation and Animation.
Text2Human: A Text-Driven Controllable Human Image Generation Framework.
Relighting4D: Relighting Human Actors Using Generated HDRI Panoramas.
NeuRIS: Neural Reconstruction of Indoor Scenes Using Normal Priors.
SHERF: Generalizable Human NeRF from a Single Image.
EVA3D: Compositional 3D Human Generation from 2D Image Collections.

Acknowledgments

This work is supported by the National Research Foundation, Singapore under its AI Singapore Programme, NTU NAP, MOE AcRF Tier 2 (T2EP20221-0033), and under the RIE2020 Industry Alignment Fund - Industry Collaboration Projects (IAF-ICP) Funding Initiative, as well as cash and in-kind contribution from the industry partner(s).
The website template is borrowed from Mip-NeRF.