SceneConductor: 3D Scene Generation from a Single Image with Multi-Agent Orchestration

Kim, Jeonghwan; Lan, Y.; Chen, Y.; Nguyen, H. T.; Pan, C.; Pan, Xingang

Jeonghwan Kim¹, Yushi Lan², Yongwei Chen¹, Hieu Trung Nguyen³, Chuanyu Pan³, Xingang Pan¹

¹Nanyang Technological University · ²University of Oxford · ³Meshy AI

TL;DR: Our framework decomposes single-image 3D scene generation into three structured stages (scene initialization, environment construction, and multi-agent refinement) and introduces a geometry-aware layout predictor trained with sparse point-map priors, achieving consistent gains in geometric accuracy and perceptual realism.

Pipeline Overview

(a) Scene Initialization

Per-object segmentation masks are lifted to object-level 3D representations, and a geometry-aware layout predictor places them into a coarse initial scene — no scene-level layout annotations required.
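To make the placement step concrete, here is a minimal sketch that rests on an assumption of ours, not one stated by the authors: the layout predictor outputs, for each lifted object, a uniform scale, a yaw angle about the vertical axis, and a translation. `Layout`, `place_object`, and `assemble_scene` are hypothetical names. The sketch only shows how such a layout would map canonical object points into one shared scene frame to form the coarse initial scene.

```python
import math
from dataclasses import dataclass

Point = tuple[float, float, float]


@dataclass
class Layout:
    """Hypothetical per-object layout: uniform scale, yaw (radians about +y), translation."""
    scale: float
    yaw: float
    translation: Point


def place_object(points: list[Point], layout: Layout) -> list[Point]:
    """Map canonical object points into the scene: p' = R_y(yaw) (scale * p) + t."""
    c, s = math.cos(layout.yaw), math.sin(layout.yaw)
    tx, ty, tz = layout.translation
    placed = []
    for x, y, z in points:
        # Uniform scaling in the object's canonical frame.
        x, y, z = layout.scale * x, layout.scale * y, layout.scale * z
        # Rotation about the vertical (y) axis; y itself is unchanged.
        xr = c * x + s * z
        zr = -s * x + c * z
        placed.append((xr + tx, y + ty, zr + tz))
    return placed


def assemble_scene(objects: dict[str, list[Point]],
                   layouts: dict[str, Layout]) -> dict[str, list[Point]]:
    """Coarse initial scene: every lifted object placed by its (assumed) predicted layout."""
    return {name: place_object(pts, layouts[name]) for name, pts in objects.items()}
```

A richer parameterization (full 3D rotation, anisotropic scale) would slot into `place_object` without changing `assemble_scene`.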

(b) Environment Construction

Point-map geometry is fused with the initialization to build the environmental scaffold: floors, walls, materials, and illumination are reconstructed and grounded around the placed objects.
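One ingredient of grounding placed objects on a reconstructed floor can be sketched as a least-squares plane fit to point-map samples, followed by a vertical snap. This is a hypothetical illustration rather than the authors' procedure. It assumes a y-up frame and that floor points have already been identified in the point map. `fit_floor_plane` and `ground_offset` are illustrative names.

```python
Point = tuple[float, float, float]


def _det3(m: list[list[float]]) -> float:
    """Determinant of a 3x3 matrix by cofactor expansion."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def _solve3(m: list[list[float]], rhs: list[float]) -> tuple[float, float, float]:
    """Solve a 3x3 linear system with Cramer's rule."""
    d = _det3(m)
    if abs(d) < 1e-12:
        raise ValueError("degenerate floor samples (e.g. all collinear)")
    sol = []
    for col in range(3):
        mc = [row[:] for row in m]
        for r in range(3):
            mc[r][col] = rhs[r]  # replace one column with the right-hand side
        sol.append(_det3(mc) / d)
    return sol[0], sol[1], sol[2]


def fit_floor_plane(points: list[Point]) -> tuple[float, float, float]:
    """Fit y = a*x + b*z + c (y-up assumed) by solving the normal equations."""
    sxx = sxz = sx = szz = sz = n = 0.0
    sxy = szy = sy = 0.0
    for x, y, z in points:
        sxx += x * x
        sxz += x * z
        sx += x
        szz += z * z
        sz += z
        n += 1.0
        sxy += x * y
        szy += z * y
        sy += y
    return _solve3([[sxx, sxz, sx], [sxz, szz, sz], [sx, sz, n]], [sxy, szy, sy])


def ground_offset(object_points: list[Point], plane: tuple[float, float, float]) -> float:
    """Vertical shift dy that makes the object's lowest clearance above the floor zero."""
    a, b, c = plane
    # Clearance is y - floor(x, z); after shifting by dy its minimum is 0 when dy = max(floor - y).
    return max(a * x + b * z + c - y for x, y, z in object_points)
```

A robust estimator such as RANSAC would be a natural swap for the plain least-squares fit when the floor samples contain outliers.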

(c) Multi-Agent Refinement

A planner agent detects structural and visual inconsistencies, applies simple corrections itself, and dispatches specialist agents for complex localized edits that are reintegrated into the global scene.
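The planner's control flow can be sketched as a small loop. Every structural choice here is an assumption made for illustration, since the agents are not described at this level of detail. Issues carry a hypothetical `needs_specialist` flag set by the planner. Specialists are plain callables keyed by issue kind that return a localized edit, and reintegration is a dictionary merge into the global scene.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical scene representation: object name -> mutable attribute dict.
Scene = dict[str, dict]


@dataclass
class Issue:
    target: str             # object the inconsistency concerns
    kind: str               # hypothetical label, e.g. "penetration" or "texture"
    needs_specialist: bool  # planner's call: fix directly or delegate


def refine(scene: Scene,
           detect: Callable[[Scene], list[Issue]],
           simple_fix: Callable[[Scene, Issue], None],
           specialists: dict[str, Callable[[dict, Issue], dict]],
           max_rounds: int = 3) -> list[tuple[str, str]]:
    """Planner loop: detect issues, fix simple ones, delegate complex ones, re-check."""
    log = []
    for _ in range(max_rounds):
        issues = detect(scene)
        if not issues:
            break  # scene judged consistent
        for issue in issues:
            agent = specialists.get(issue.kind)
            if issue.needs_specialist and agent is not None:
                # The specialist sees only a copy of the target's attributes,
                # so its edit stays localized until the planner merges it back.
                edit = agent(dict(scene[issue.target]), issue)
                scene[issue.target].update(edit)
                log.append((issue.kind, issue.target))
            else:
                # Simple issue (or no matching specialist): the planner fixes it itself.
                simple_fix(scene, issue)
                log.append(("planner", issue.target))
    return log
```

Re-running `detect` after each round reflects the idea that locally edited objects must still be consistent once reintegrated into the global scene.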

Results

The results gallery shows example scenes after Stage 1, Stage 2, and Stage 3, illustrating how SceneConductor progressively reconstructs a complete 3D environment from a single input image.

Geometry-Aware Layout Predictor

The layout predictor is supervised by sparse geometric priors derived from point maps, which enables training on segmentation-level data without scene-level layout annotations.
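A minimal sketch of what point-map supervision could look like, under assumptions of our own: a point map is an H×W grid of 3D points, the sparse prior is taken by striding over the pixels inside an object's segmentation mask, and the loss is a one-sided Chamfer distance from those samples to the placed object's points. `sample_sparse_prior` and `sparse_prior_loss` are hypothetical names, and the paper's actual loss may differ. This plain-Python version only evaluates the loss; training would compute it in an autodiff framework.

```python
import math

Point = tuple[float, float, float]


def sample_sparse_prior(point_map: list[list[Point]],
                        mask: list[list[bool]],
                        stride: int = 4) -> list[Point]:
    """Keep every `stride`-th pixel (in rows and columns) that lies inside the object mask."""
    return [point_map[r][c]
            for r in range(0, len(mask), stride)
            for c in range(0, len(mask[0]), stride)
            if mask[r][c]]


def sparse_prior_loss(placed_points: list[Point], sparse_prior: list[Point]) -> float:
    """One-sided Chamfer: mean squared distance from each prior sample to its nearest placed point."""
    if not sparse_prior:
        return 0.0  # no observations, no supervision
    total = 0.0
    for q in sparse_prior:
        total += min(math.dist(q, p) for p in placed_points) ** 2
    return total / len(sparse_prior)
```

The one-sided direction is deliberate in this sketch: a point map only observes visible surfaces, so penalizing object points with no nearby observation (for example, the hidden back of a chair) would push the layout in the wrong direction.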

Qualitative Comparison

The same input photo reconstructed by prior methods and by our Geometry-Aware Layout Predictor.

BibTeX

@article{kim2026sceneconductor,
  title   = {SceneConductor: 3D Scene Generation from a Single Image with Multi-Agent Orchestration},
  author  = {Kim, Jeonghwan and Lan, Yushi and Chen, Yongwei and Nguyen, Hieu Trung and Pan, Chuanyu and Pan, Xingang},
  journal = {arXiv preprint arXiv:2606.08402},
  year    = {2026}
}