
LLM Planning Agents

Exploring the Potential and Challenges of Large Language Model Agents in Urban Design and Planning
  • 02.2024 - 11.2024

  • Authors: Eduardo Rico Carranza, Sheng-Yang Huang, Guanhong Li

  • Affiliation: Architectural Association School of Architecture

  • Citation: Rico Carranza, E., Huang, S.-Y., & Li, G. (2025). LLM Planning Agents: Exploring the Potential and Challenges of Large Language Model Agents in Urban Design and Planning. CAADRIA 2025 (in press)

Abstract

The integration of Large Language Models (LLMs) as planning agents in urban design and planning represents a novel approach to addressing the field's inherent complexity. This study explores their potential and challenges, focusing on their ability to simulate decision-making processes, enhance stakeholder engagement, and provide analytical support. Using an agentic framework, the research evaluates 63 urban development proposals with a specific focus on water management, employing both sequential and nested frameworks. Several LLMs were tested to investigate performance differences across model scales. The findings reveal that while LLM agents exhibit "common sense" and follow planning advice, their reliance on accessible data often results in overly generic outputs, underscoring the need for better data retrieval mechanisms such as Retrieval-Augmented Generation (RAG). Experimental results show nested frameworks outperform sequential ones in reasoning and decision-making, but limitations persist, including biases, limited spatial awareness, and occasional off-topic generation. Addressing these challenges required novel agent architectures and prompt engineering. Smaller models sometimes outperformed larger ones, challenging the assumption that scale guarantees accuracy. Despite these constraints, LLMs demonstrated value in identifying overlooked details and enhancing scenario exploration. This study also advocates for improvements in spatial reasoning, data integration, and framework design.

Methods and Findings
[Figure: map of the seven selected sites in West Sussex]

We selected seven sites within the region of West Sussex. The two inside the national park, Graffham and Upwaltham, should face strict restrictions on planning proposals, so their ground truth is 3, meaning harder to approve. Chidham, Bosham, and Pagham, near the protected wetland area, should prioritise nature. The remaining two, Chichester and East Wittering, should see more relaxed approval of planning proposals, which gives them the lower ground-truth scores of 2 and 1.5 respectively.

For each site, three projects were tested, each described in a text paragraph outlining the development of 200 housing units and the corresponding water management strategies.
Project 1 incorporated comprehensive sustainable water management measures, Project 2 adopted some, and Project 3 included no such provisions. Project 1 is therefore the most likely to be approved, while Project 3 should be the hardest, giving it a ground-truth score of 3.
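
The stated ground-truth values can be collected into a small lookup table, as in the minimal Python sketch below; the variable names are ours, and values the text does not state are left as None rather than guessed.

```python
# Ground-truth difficulty scores (higher = harder to approve).
# Values not stated in the text above are left as None.
SITE_GROUND_TRUTH = {
    "Graffham": 3.0,        # national park: strict restrictions
    "Upwaltham": 3.0,       # national park: strict restrictions
    "Chidham": None,        # near protected wetland; prioritise nature
    "Bosham": None,         # near protected wetland; prioritise nature
    "Pagham": None,         # near protected wetland; prioritise nature
    "Chichester": 2.0,      # relaxed approval expected
    "East Wittering": 1.5,  # relaxed approval expected
}

PROJECT_GROUND_TRUTH = {
    "Project 1": None,  # comprehensive water management: easiest to approve
    "Project 2": None,  # partial measures
    "Project 3": 3.0,   # no provisions: hardest to approve
}
```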


We tested two frameworks with a similar agent structure (a minimal code sketch follows the two lists below):

Sequential framework:

  • Research agent: gathers constraints or domain-specific information, summarised concisely in at most 200 words

    • Tool: Serper web search (water and environmental management)

  • Planning agent: considers the proposal alongside the information provided by the research agent and makes the final decision (approve/reject)

Nested framework:

  • Research agent: same as above

  • Critic: questions the research agent, then reviews, refines, and summarises its output

  • Planning agent: same as above
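
The sketch below shows how the two pipelines differ. Here, call_llm and web_search are hypothetical stubs standing in for the chat model and the Serper search tool; they are not the exact interfaces used in the study.

```python
# Minimal sketch of the two frameworks; both share the research and
# planning agents, and the nested variant adds a critic in between.

def call_llm(prompt: str) -> str:
    raise NotImplementedError("wire up your LLM client here")

def web_search(query: str) -> str:
    raise NotImplementedError("wire up the Serper search tool here")

def research_agent(site: str) -> str:
    evidence = web_search(f"{site} water and environmental management")
    return call_llm(
        "Summarise the planning constraints for this site in at most 200 words.\n"
        f"Site: {site}\nEvidence: {evidence}"
    )

def critic_agent(summary: str) -> str:
    # Nested framework only: question, review, refine and summarise.
    return call_llm("Question and refine this research summary:\n" + summary)

def planning_agent(proposal: str, research: str) -> str:
    return call_llm(
        "Approve or reject the proposal, with reasons.\n"
        f"Proposal: {proposal}\nResearch: {research}"
    )

def run_sequential(site: str, proposal: str) -> str:
    return planning_agent(proposal, research_agent(site))

def run_nested(site: str, proposal: str) -> str:
    return planning_agent(proposal, critic_agent(research_agent(site)))
```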


Because LLMs tend to explain their answers, the planning agent rarely outputs a single-word decision such as "approve" or "reject". To standardise comparisons, responses were passed through a classifier (scikit-learn) that assigns each a numerical score (1-4).
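
The exact classifier is not specified beyond the library; a TF-IDF plus logistic-regression pipeline along the following lines would serve. The training examples and the 1 = clear approve … 4 = clear reject mapping shown here are purely illustrative.

```python
# Illustrative scikit-learn pipeline mapping free-text decisions to a
# 1-4 score (1 = clear approve ... 4 = clear reject). Training data is
# made up; the study's actual classifier and labels are not specified.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train_texts = [
    "The proposal is approved; water management is comprehensive.",
    "Approved, subject to conditions on drainage provisions.",
    "Serious concerns about flood risk; approval is unlikely as submitted.",
    "Rejected: the scheme includes no sustainable water management at all.",
]
train_scores = [1, 2, 3, 4]

clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
clf.fit(train_texts, train_scores)

response = "Given the impact on the wetland, this scheme should be rejected."
print(clf.predict([response])[0])  # -> a score in {1, 2, 3, 4}
```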

[Figure: average model scores, sequential vs nested frameworks]

Finding 1: Larger models do not consistently outperform smaller ones. The relatively small Llama 3.2 3B model achieved the highest score in the sequential test (2.2), while most models clustered around an average score of 2 (on the 1-4 scale). Scores improved from the sequential to the nested framework across all models, except for Phi-3-mini, which could not run in the nested configuration.

[Figure: model scores against the ground truth (green = lower, red = higher)]

Finding 2: The nested framework showed a closer correlation with the ground truth, particularly in aligning lower scores (green) with easier-to-approve proposals and higher scores (red) with more problematic ones. However, the alignment was partial, with some models deviating significantly from expected outcomes.

Finding 3: Models tended to be more conservative than the ground truth, with most producing higher average scores. The nested framework showed better alignment by approving more projects and avoiding excessive prudence.
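
One way to quantify the alignment behind Findings 2 and 3 is a rank correlation against the ground truth plus a mean bias, where a positive bias means scoring more conservatively (higher) than the ground truth. The numbers below are made up for illustration, not the study's data.

```python
# Rank correlation and mean bias of framework scores vs ground truth.
# All numbers are illustrative placeholders, not the study's results.
import numpy as np
from scipy.stats import spearmanr

# e.g. Graffham, Upwaltham, Chichester, East Wittering
ground_truth = np.array([3.0, 3.0, 2.0, 1.5])
sequential = np.array([2.0, 2.5, 2.5, 2.0])
nested = np.array([3.0, 2.5, 2.0, 1.5])

for name, scores in [("sequential", sequential), ("nested", nested)]:
    rho, _ = spearmanr(ground_truth, scores)
    bias = float(np.mean(scores - ground_truth))
    print(f"{name}: Spearman rho = {rho:.2f}, mean bias = {bias:+.2f}")
```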

[Figure: per-site and per-project scores]

Finding 4: Models struggled with site-specific constraints. For example, Llama 3.2 3B assigned overly harsh scores to East Wittering, a site expected to have fewer development constraints. Smaller models like Llama and Mistral showed better alignment with the ground truth across different projects, assigning lower scores to high-quality projects and higher scores to low-quality ones.


For more discussion and further steps, refer to the link below.

Link