
Text-to-City

Controllable 3D Urban Block Generation with Latent Diffusion Model
  • 06.2023 - 01.2024

  • Authors: Junling Zhuang, Guanhong Li, Hang Xu, Jintu Xu, Runjia Tian

  • Citation: J.L. Zhuang, G.H. Li, H. Xu, J.T. Xu, R.J. Tian (2024). Text-to-City: Controllable 3D Urban Block Generation with Latent Diffusion Model. In ACCELERATED DESIGN, Proceedings of the 29th International Conference of the Association for Computer-Aided Architectural Design Research in Asia (CAADRIA) 2024, vol. 2, pp. 169-178.

Abstract

The rise of deep learning has introduced novel computational tools for urban block design. Many researchers have explored generative urban block design using either rule-based or deep learning methods. However, these methods often fall short in adequately capturing morphological features and essential design indicators like building density. Latent diffusion models, particularly in the context of urban design, offer a groundbreaking solution. These models can generate cityscapes directly from text descriptions, incorporating a wide array of design indicators. This paper introduces a novel workflow that utilizes Stable Diffusion, a state-of-the-art latent diffusion model, to generate 3D urban environments. The process involves reconstructing 3D urban block models from generated depth images, employing a systematic depth-to-height mapping technique. Additionally, the paper explores the extrapolation between various urban morphological characteristics, aiming to generate novel urban forms that transcend existing city models. This innovative approach not only facilitates the accurate generation of urban blocks with specific morphological characteristics and design metrics, such as building density, but also demonstrates its versatility through application to three distinct cities. This methodology, tested on select cities, holds potential for a broader range of urban environments and additional design indicators, setting the stage for future computational urban design research.

Experiments and Applications
[Figure: DataCreation.jpg]
The study analyzed Berlin, Hamburg, and Cambridge (USA), each with a distinctive street pattern: Berlin's linear, Hamburg's enclosed, and Cambridge's grid-like. The dataset featured diverse roof styles and used a standardized 140-meter height mapping for the depth images. Building density metrics were calculated to keep the distribution balanced within each city, yielding a dataset of 100 images per city.
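Below is a minimal sketch of the depth-to-height encoding described above, assuming a linear map from building height in meters to 8-bit grayscale, capped at the standardized 140 meters; the function names and the exact mapping are illustrative assumptions, not taken from the paper.

```python
import numpy as np

MAX_HEIGHT_M = 140.0  # standardized height cap from the dataset description

def heights_to_depth_image(height_map_m: np.ndarray) -> np.ndarray:
    """Encode per-pixel building heights (meters) as an 8-bit depth image.

    Assumes a linear height-to-intensity mapping clipped at MAX_HEIGHT_M:
    0 = ground, 255 = 140 m. The paper's exact encoding may differ.
    """
    clipped = np.clip(height_map_m, 0.0, MAX_HEIGHT_M)
    return np.round(clipped / MAX_HEIGHT_M * 255.0).astype(np.uint8)

def depth_image_to_heights(depth_img: np.ndarray) -> np.ndarray:
    """Inverse mapping, used when reconstructing 3D blocks from generated images."""
    return depth_img.astype(np.float32) / 255.0 * MAX_HEIGHT_M
```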
[Figure: Textual Data Format.jpg]
In our study, we tested eight caption formats for image-text pairing; the figure below highlights the top three. Format 3, e.g., "Baroque town texture, Density_16, City Plan View" for Berlin, was the most effective. It aligns with DreamBooth's identifier strategy, ensures accurate density labels, and reduces ambiguity compared to natural-language formats.
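A tiny sketch of how Format 3 captions could be assembled programmatically. Only the Berlin descriptor comes from the example above; the other texture descriptors are placeholders, and the function name is illustrative.

```python
# Illustrative caption builder for the Format 3 style described above.
# Only the Berlin descriptor is taken from the text; the others are placeholders.
TEXTURE_DESCRIPTORS = {
    "berlin": "Baroque town texture",       # from the example above
    "hamburg": "Hamburg town texture",      # placeholder descriptor
    "cambridge": "Cambridge town texture",  # placeholder descriptor
}

def make_caption(city: str, density_percent: int) -> str:
    """Build a 'descriptor, Density_XX, City Plan View' caption."""
    return f"{TEXTURE_DESCRIPTORS[city]}, Density_{density_percent}, City Plan View"

print(make_caption("berlin", 16))
# -> Baroque town texture, Density_16, City Plan View
```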
[Figure: dataset exhibit.png]
We expanded our dataset sixteenfold using rotation and flip transformations, creating three datasets, one per city. Each contains 1600 pairs of depth maps and text descriptions, totaling 4800 training pairs. Figure 7 shows these pairs, highlighting city-wise building density variations.
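The paper does not spell out the exact transform set, so the sketch below shows one combination that yields sixteen variants per image: the four right-angle rotations crossed with four flip states. Note that some of these coincide geometrically (a horizontal plus vertical flip equals a 180-degree rotation), so the actual scheme may differ.

```python
from PIL import Image

def augment_sixteenfold(img: Image.Image) -> list[Image.Image]:
    """Generate 16 variants: 4 right-angle rotations x 4 flip states.

    One plausible way to reach a 16x expansion; some outputs coincide
    geometrically, and the paper's exact transform set is unspecified.
    """
    variants = []
    for angle in (0, 90, 180, 270):
        rotated = img.rotate(angle, expand=True)
        variants.append(rotated)
        variants.append(rotated.transpose(Image.FLIP_LEFT_RIGHT))
        variants.append(rotated.transpose(Image.FLIP_TOP_BOTTOM))
        variants.append(
            rotated.transpose(Image.FLIP_LEFT_RIGHT).transpose(Image.FLIP_TOP_BOTTOM)
        )
    return variants

# The paired caption is reused unchanged, since rotations and flips
# do not alter building density or texture class.
```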
[Figure: val metrics_0.png]
We tested numerous combinations of learning rates and schedulers during hyperparameter optimization. The best setup (No. 09) achieved a 0.03557 reconstruction error and a 0.040 density loss over 160 epochs on an NVIDIA RTX 4090 GPU, using a 3e-06 learning rate with a constant scheduler. Extensive testing over 200 epochs revealed that the loss is more sensitive to the learning rate than to the scheduler.
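A minimal sketch of the winning optimizer and scheduler configuration using the diffusers scheduler helper. The base checkpoint ID and step count are assumptions for illustration; only the 3e-06 learning rate and constant schedule come from the experiments above.

```python
import torch
from diffusers import UNet2DConditionModel
from diffusers.optimization import get_scheduler

# Base checkpoint is an assumption; the paper fine-tunes Stable Diffusion.
unet = UNet2DConditionModel.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="unet"
)

optimizer = torch.optim.AdamW(unet.parameters(), lr=3e-6)  # best LR, setup No. 09
lr_scheduler = get_scheduler(
    "constant",                 # constant schedule, per the best setup
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=10_000,  # illustrative; 160 epochs on the real dataset
)
```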
[Figure: DensityEvaluation.png]
We also tested the performance of our top model, No. 09, using Berlin as an example. The model is clearly more accurate at the textual (input) density values it was trained on.
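One plausible way to score that density accuracy, assuming building density is measured as the footprint ratio (the fraction of above-ground pixels) in the generated depth image; the threshold and function names are illustrative, not the paper's definitions.

```python
import numpy as np

def measured_density(depth_img: np.ndarray, ground_threshold: int = 0) -> float:
    """Building density as the fraction of pixels above ground level.

    Assumes 0 encodes ground; any brighter pixel counts as building footprint.
    """
    return float(np.mean(depth_img > ground_threshold))

def density_error(depth_img: np.ndarray, labeled_density_percent: float) -> float:
    """Absolute gap between the caption's Density_XX label and the image."""
    return abs(measured_density(depth_img) - labeled_density_percent / 100.0)
```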
[Figure: Application.png]
We ran batch inference to generate grayscale urban morphology patterns and extruded these images with our Grasshopper script to reconstruct 3D urban models. Using ControlNet, the model lets users steer the output according to site and road conditions, including a specific city morphology; the generated results fit the text input and connect well to the surrounding urban morphology.
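A sketch of ControlNet-conditioned batch inference with Hugging Face diffusers. The checkpoint IDs are placeholders standing in for the paper's fine-tuned weights, and the road-condition image is a hypothetical line drawing of the site.

```python
import torch
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
from PIL import Image

# Placeholder checkpoints; the paper uses its own fine-tuned Stable Diffusion
# weights plus a ControlNet suited to site and road conditioning.
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

site_condition = Image.open("site_roads.png")  # hypothetical site/road drawing
densities = (12, 16, 20)
prompts = [f"Baroque town texture, Density_{d}, City Plan View" for d in densities]

# Batch inference: one morphology pattern per density prompt, all sharing the
# same road-network condition so results connect to the surrounding fabric.
images = pipe(prompts, image=[site_condition] * len(prompts)).images
for d, img in zip(densities, images):
    img.save(f"block_density_{d}.png")
```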
[Figure: Morphological interpolation2.png]
Deforum Diffusion was utilized to merge city morphologies by interpolating between keyframes based on different text prompts. This approach generated hybrids of city styles.
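The essence of that interpolation, sketched here by blending CLIP text embeddings between two prompts and feeding them to the pipeline; Deforum's actual keyframe scheduling is richer, and the prompts, weights, and checkpoint ID below are illustrative assumptions.

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

def encode(prompt: str) -> torch.Tensor:
    """CLIP text embedding for one prompt."""
    tokens = pipe.tokenizer(
        prompt, padding="max_length",
        max_length=pipe.tokenizer.model_max_length, return_tensors="pt",
    )
    with torch.no_grad():
        return pipe.text_encoder(tokens.input_ids.to(pipe.device))[0]

# Illustrative keyframe prompts for two city styles.
emb_a = encode("Baroque town texture, Density_16, City Plan View")
emb_b = encode("Grid town texture, Density_16, City Plan View")

# Linear blend between the two keyframes; Deforum schedules such weights
# across frames to animate the morphological transition.
for i, w in enumerate(torch.linspace(0.0, 1.0, 5)):
    blended = torch.lerp(emb_a, emb_b, w.item())
    frame = pipe(prompt_embeds=blended).images[0]
    frame.save(f"morph_{i}.png")
```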