Paper citation: Valmeekam, Karthik, Kaya Stechly, and Subbarao Kambhampati. “LLMs Still Can’t Plan; Can LRMs? A Preliminary Evaluation of OpenAI’s o1 on PlanBench.” arXiv preprint arXiv:2409.13373 (2024).
Summary:
The research investigates the distinction between traditional LLMs, which rely heavily on approximate retrieval, and OpenAI's o1 model, which the authors characterize as a Large Reasoning Model (LRM). Using PlanBench, a benchmark built around a static set of planning problems, the study shows that despite o1's specialized training for reasoning, it still exhibits weaknesses, particularly on more complex scenarios and on obfuscated tasks.
The results indicate that while o1 reaches a remarkable 97.8% accuracy on straightforward Blocksworld tasks, its performance declines drastically on obfuscated ("Mystery Blocksworld") problems, reaching merely 37.3%. Furthermore, o1 struggles to recognize unsolvable instances, at times incorrectly classifying solvable problems as unsolvable. The findings underscore the need for evaluation frameworks that address both the effectiveness and the efficiency of reasoning models like o1, and they raise concerns about interpretability, given the opaque nature of the model's decision-making process.
Approach:
To prepare inputs for evaluating the o1 model on planning tasks in the Blocksworld domain, problems must be rendered in appropriate formats, both natural language and PDDL (Planning Domain Definition Language).
# Example PDDL preparation for Blocksworld
def prepare_pddl_input(initial_conditions, goal_conditions):
    # Render a Blocksworld problem definition from pre-formatted fact strings
    pddl_input = f"""(define (problem blocksworld)
  (:domain blocksworld)
  (:init {initial_conditions})
  (:goal (and {goal_conditions}))
)"""
    return pddl_input

# Example usage: block A sits on block B, which rests on the table,
# and the goal is to invert the stack
initial_conditions = "(clear A) (on A B) (on-table B) (arm-empty)"
goal_conditions = "(on B A)"
pddl_input = prepare_pddl_input(initial_conditions, goal_conditions)
print(pddl_input)
This function generates a PDDL problem definition to assess the model’s planning capabilities.
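PlanBench also poses the same problems in natural language. A minimal sketch of that rendering follows; the template wording here is my own, not the benchmark's exact phrasing.
# Hypothetical natural-language rendering of the same problem
def prepare_nl_input(initial_description, goal_description):
    return (f"I am playing with a set of blocks. Initially, {initial_description}. "
            f"My goal is to have {goal_description}. What is my plan?")

print(prepare_nl_input("block A is on block B and block B is on the table",
                       "block B on top of block A"))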
Algorithm Execution:
To execute the evaluation, we would call the o1 model with the prepared inputs and monitor its outputs.
# Hypothetical code to run the model; this sketch assumes the OpenAI
# Python SDK and an OPENAI_API_KEY set in the environment
from openai import OpenAI

client = OpenAI()

def evaluate_o1_model(pddl_input):
    # o1 models take the task as a plain user message
    response = client.chat.completions.create(
        model="o1-preview",
        messages=[{"role": "user", "content": pddl_input}],
    )
    return response.choices[0].message.content
# Evaluate the model
model_output = evaluate_o1_model(pddl_input)
print(model_output)
In this snippet, we send the PDDL input to the o1 model through the OpenAI API and capture its response.
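Note that bare PDDL gives the model no instructions about what to produce. In practice the problem would be wrapped in a task prompt along these lines (the wording below is my own, not the paper's exact prompt):
# Hypothetical prompt wrapper around the PDDL problem
def build_prompt(pddl_input):
    return ("You are given a Blocksworld planning problem in PDDL.\n"
            "Return a plan as a sequence of actions, one per line, "
            "or state that the problem is unsolvable.\n\n" + pddl_input)

model_output = evaluate_o1_model(build_prompt(pddl_input))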
Running the Algorithm:
Finally, we can implement the evaluation process in a straightforward manner.
def run_evaluation(initial_conditions, goal_conditions):
    pddl_input = prepare_pddl_input(initial_conditions, goal_conditions)
    model_output = evaluate_o1_model(pddl_input)
    # Process the output from the model to check for accuracy
    return model_output

# Example conditions: B starts on top of A, and the goal is to invert the stack
initial_conditions = "(clear B) (on B A) (on-table A) (arm-empty)"
goal_conditions = "(on A B)"
result = run_evaluation(initial_conditions, goal_conditions)
print("Model Output:", result)
This function integrates all previous steps to run a complete evaluation of the o1 model on the specified conditions.
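To actually score an output, the returned plan has to be executed against the problem. The paper validates plans with the external VAL tool; the toy STRIPS-style simulator below is my own simplified stand-in, illustrating the idea for four-operator Blocksworld.
# Toy plan validator: facts are tuples like ("on", "A", "B"),
# actions are tuples like ("stack", "A", "B")
ACTIONS = {
    # name: (preconditions, add effects, delete effects); ?1/?2 are parameters
    "pick-up": ([("clear", "?1"), ("on-table", "?1"), ("arm-empty",)],
                [("holding", "?1")],
                [("clear", "?1"), ("on-table", "?1"), ("arm-empty",)]),
    "put-down": ([("holding", "?1")],
                 [("clear", "?1"), ("on-table", "?1"), ("arm-empty",)],
                 [("holding", "?1")]),
    "stack": ([("holding", "?1"), ("clear", "?2")],
              [("on", "?1", "?2"), ("clear", "?1"), ("arm-empty",)],
              [("holding", "?1"), ("clear", "?2")]),
    "unstack": ([("on", "?1", "?2"), ("clear", "?1"), ("arm-empty",)],
                [("holding", "?1"), ("clear", "?2")],
                [("on", "?1", "?2"), ("clear", "?1"), ("arm-empty",)]),
}

def _ground(schema, args):
    # Substitute ?1, ?2, ... with the action's arguments
    return tuple(args[int(t[1]) - 1] if t.startswith("?") else t for t in schema)

def validate_plan(init, goal, plan):
    # True iff every action is applicable in turn and the goal holds at the end
    state = set(init)
    for name, *args in plan:
        pre, add, delete = ACTIONS[name]
        if not all(_ground(f, args) in state for f in pre):
            return False  # precondition violated: invalid plan
        state -= {_ground(f, args) for f in delete}
        state |= {_ground(f, args) for f in add}
    return set(goal) <= state

# Example: invert a two-block stack
init = [("clear", "A"), ("on", "A", "B"), ("on-table", "B"), ("arm-empty",)]
goal = [("on", "B", "A")]
plan = [("unstack", "A", "B"), ("put-down", "A"), ("pick-up", "B"), ("stack", "B", "A")]
print(validate_plan(init, goal, plan))  # True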
Summary of the Evaluation
Research Questions: The study seeks to answer whether the new LRM, o1, significantly enhances planning performance compared to its LLM predecessors, and how it handles unsolvable problems in benchmark tasks.
Evaluation Methodology: To assess the LRM’s capabilities, the researchers employed the PlanBench benchmark, composed of 600 planning instances that include a straightforward Blocksworld setup and obfuscated versions. By methodically measuring the model’s performance across these instances, the study aimed to highlight both improvements and limitations in reasoning and planning tasks.
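The obfuscated ("Mystery Blocksworld") variant keeps the logical structure of each problem but renames predicates and actions to semantically unrelated words, defeating surface-level pattern matching. A toy sketch of the renaming step follows; the mapping below is illustrative, as the benchmark defines its own vocabulary.
# Illustrative predicate renaming in the spirit of Mystery Blocksworld
import re

OBFUSCATION_MAP = {"on-table": "planet", "clear": "province", "on": "craves"}

def obfuscate(pddl_text):
    # Match longer names first so "on" does not clobber "on-table"
    pattern = re.compile(r"on-table|clear|on")
    return pattern.sub(lambda m: OBFUSCATION_MAP[m.group(0)], pddl_text)

print(obfuscate("(clear A) (on A B) (on-table B)"))
# -> (province A) (craves A B) (planet B)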
Results: The evaluation reveals stark contrasts in performance. o1 achieves a high success rate on simpler problems yet declines dramatically on more challenging or altered setups. On instances whose goal states were impossible, o1 correctly recognized only 27% as unsolvable, and it sometimes falsely declared solvable problems unsolvable, demonstrating crucial limitations in its reasoning capabilities.
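For reference, unsolvable instances are easy to construct in this domain. One simple scheme (my own illustration, not necessarily how the paper's modified instances were built) is a goal containing mutually exclusive facts:
# A goal demanding both (on A B) and (on B A) can never be satisfied
unsolvable_goal = "(on A B) (on B A)"
pddl_input = prepare_pddl_input("(clear A) (on A B) (on-table B) (arm-empty)",
                                unsolvable_goal)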
Surprising Findings from the Evaluation Results:
A notable finding was o1's tendency to provide creatively inaccurate justifications for incorrect answers, reflecting issues with its reasoning process. For example, even when presented with clearly unsolvable instances, o1 sometimes claimed feasibility based on earlier states in its reasoning chain, indicating a drift from purely logical reasoning toward an "impressionistic" analysis.
Analysis: Pros:
- Improved Performance: o1 shows considerable improvements over traditional LLMs, achieving near-perfect scores on straightforward instances.
- Rigorous Benchmarking: Evaluating on the PlanBench benchmark (introduced in the authors' earlier work) enables systematic measurement of AI reasoning capabilities, contributing valuable insights for future research in this area.
Analysis: Cons:
- Performance Variability: The contrast in performance on obfuscated tasks reveals significant limitations in o1's reasoning, indicating a lack of robustness and reliability in challenging scenarios.
- Lack of Guarantees: The model still struggles with providing correctness guarantees, raising safety and operational concerns for real-world application in critical domains.
This evaluation not only highlights the strides made in AI planning but also emphasizes the need for greater interpretive transparency and more robust evaluation mechanisms to harness the full potential of emerging reasoning models.