Research Article

Sabr Research, May 2025

Enhancing LLM Reasoning with Agentic Systems

How structure enhances reasoning capabilities

Large Language Models (LLMs) have revolutionized how we interact with information. They are remarkably proficient at generative tasks: drafting emails, summarizing documents, or writing code. They have also proven highly capable at extracting information from unstructured data, making them effective assistants for reading-intensive tasks.

However, a crucial and distinct capability for any decision-support tool is the ability to reason: to think through a problem in a logical, structured way. Unlike simple generation, true reasoning must satisfy a set of rigorous standards:

  • Multi-step planning and strategic foresight.
  • Strict adherence to logical and physical constraints.
  • Performing accurate, deterministic calculations.
  • Maintaining a consistent chain of logic over long horizons.

When we directly ask an LLM to solve a difficult problem using a single prompt, we often encounter recurring issues: subtle inaccuracies, logical disconnects, or outright hallucinations.

The "State Management" Problem

At their core, LLMs are "stateless" next-token prediction machines. They do not "think" in the human sense; they predict the most likely continuation of the text currently visible in their context window. When dealing with complex tasks, the context window—the model's "working memory"—is prone to flooding.

Figure 1: Context Window Illustration and Attention Decay

Maintaining a logical chain requires the ability to keep track of "states," which is inherently difficult without an explicit dedicated mechanism. Imagine trying to solve a complex calculus problem without taking notes on paper. You would have to hold every intermediate number in your immediate memory. Eventually, the brain gets overwhelmed, a variable is dropped, and the final answer is incorrect.

Similarly, in a "one-shot" prompt, the LLM's context window gets flooded with intermediate steps, noise, and instructions, causing the model to "forget" context or hallucinate details.

Evolving Architectures and Prompting

Recent research focuses on two primary methods for improving reasoning quality: either refining the instructions within the prompt or building coordinated systems of multiple components.

Chain of Thought (CoT)

By forcing the model to "show its work," we create an internal scratchpad. This helps the LLM stay on track longer, though it remains highly sensitive to minor prompt variations and relies entirely on internal state.
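
As an illustration, the sketch below contrasts a direct prompt with a CoT-style prompt for the same question. The wording is a hypothetical example of the technique, not the exact instructions used in the experiment that follows.

# Hypothetical illustration of Chain of Thought prompting.
# The question and the instruction wording are examples only.

question = (
    "A train travels 120 km in 1.5 hours, then 80 km in 1 hour. "
    "What is its average speed?"
)

direct_prompt = question + "\nReturn ONLY the final answer."

cot_prompt = (
    question + "\n"
    "Think step by step: first compute the total distance, then the "
    "total time, then divide. Show every intermediate result before "
    "stating the final answer."
)

# Both strings target the same model; only the instructions differ.
# The CoT variant gives the model a visible scratchpad for intermediate state.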

Agentic Systems

This architecture pairs specialized LLMs with deterministic code. By decoupling reasoning from computation, responsibilities are cleanly separated: an agent explores the problem space while an external system manages the state.

Experiment: Traveling Salesperson Problem (TSP)

To evaluate these approaches, we used a 15-city TSP instance with an optimal cost of 291. We tested three distinct scenarios using the Amazon Nova Premier model to observe how structure affects mathematical and logical outcomes.
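
For reference, both the validity and the cost of a returned tour can be checked deterministically. The sketch below uses a small 4-city matrix as a stand-in for the 15-city instance, which we do not reproduce here.

# Deterministic check of a TSP answer: is the tour valid, and what does it cost?
# The 4-city matrix is an illustrative stand-in for the 15-city instance.

DIST = [
    [0, 10, 15, 20],
    [10, 0, 35, 25],
    [15, 35, 0, 30],
    [20, 25, 30, 0],
]

def tour_cost(path):
    # Every city must appear exactly once; then sum consecutive edge
    # weights, including the closing edge back to the start city.
    assert sorted(path) == list(range(len(DIST))), "each city exactly once"
    closed = path + [path[0]]
    return sum(DIST[a][b] for a, b in zip(closed, closed[1:]))

print(tour_cost([0, 1, 3, 2]))  # 10 + 25 + 30 + 15 = 80

A deterministic check like this is precisely what catches the invalid cost arithmetic flagged in the results table below.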

Scenario A: Zero-Shot

Solve the Traveling Salesperson Problem (TSP) for this Distance Matrix.
Format: Return ONLY the path (e.g., [0, 5, 2...]) and the total cost.
Distance Matrix: {matrix}

Scenario B: Chain of Thought

You are an expert mathematician... Strategy: Step-by-Step Instruction for Branch and Bound Algorithm.
Final Verification Example:
- Path: [0, 12, ..., 0]
- Total Calculation: 23 + ... = TOTAL

Scenario C: The Agentic System

Figure 2: Simple Agentic System Architecture

Strategy Engine (LLM)

Provided with the current optimization status and cumulative costs, it analyzes partial solutions and strategically selects the most promising branch to explore.
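
The summary the strategy engine receives each iteration might look like the sketch below; the frontier schema and the prompt wording are assumptions of ours, meant to illustrate the kind of compact state the LLM sees.

# Hypothetical prompt construction for the strategy engine.
# The frontier format and wording are assumed, not the study's exact prompt.

def strategy_prompt(frontier, best_cost):
    # frontier: (partial_path, cumulative_cost) pairs maintained entirely
    # by the execution engine; the LLM only ever sees this summary.
    lines = [
        f"Best complete tour so far: cost {best_cost}",
        "Open branches (partial path -> cumulative cost):",
    ]
    for i, (path, cost) in enumerate(frontier):
        lines.append(f"  [{i}] {path} -> {cost}")
    lines.append("Reply with the index of the single most promising branch.")
    return "\n".join(lines)

print(strategy_prompt([([0, 3], 20), ([0, 1], 10)], best_cost=80))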

Execution Engine (Script)

Handles the computational heavy lifting. It calculates new path costs and verifies exploration validity, managing the state externally so the LLM stays focused on exploration.
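
A sketch of the execution engine's core duties, under the same assumed schema: extend a partial path, update its cost incrementally, and prune any branch that can no longer beat the best complete tour.

# Illustrative execution-engine helpers (assumed design). All cost
# arithmetic and validity checks live in deterministic code.

def extend(dist, path, cost, city):
    # Append one city and update the cumulative cost incrementally.
    return path + [city], cost + dist[path[-1]][city]

def expand(dist, path, cost, best_cost):
    # Generate every valid child branch, pruning those whose partial
    # cost already meets or exceeds the best complete tour (the bound).
    children = []
    for city in range(len(dist)):
        if city not in path:
            new_path, new_cost = extend(dist, path, cost, city)
            if new_cost < best_cost:
                children.append((new_path, new_cost))
    return children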

Stopping Criteria: The loop continues until the search space is exhausted, at which point the best tour found is provably optimal, or until the maximum iteration count is reached.
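
Putting the pieces together, the control loop might look like the sketch below, reusing the expand helper above. Here choose_branch is a stand-in for the strategy-engine LLM call; with the greedy stub shown, the loop reduces to plain branch and bound, which is why an exhausted frontier proves the returned tour optimal.

# Illustrative agentic loop (assumed design). choose_branch is a stand-in
# for the strategy-engine LLM; everything else is deterministic.

def choose_branch(frontier, best_cost):
    # Stand-in for the LLM call; here, a greedy lowest-cost heuristic.
    return min(range(len(frontier)), key=lambda i: frontier[i][1])

def solve(dist, max_iters=10_000):
    n = len(dist)
    best_path, best_cost = None, float("inf")
    frontier = [([0], 0)]                    # start every tour at city 0
    for _ in range(max_iters):
        if not frontier:                     # search space exhausted:
            break                            # best_cost is provably optimal
        path, cost = frontier.pop(choose_branch(frontier, best_cost))
        if len(path) == n:                   # complete tour: add closing edge
            total = cost + dist[path[-1]][0]
            if total < best_cost:
                best_path, best_cost = path + [0], total
            continue
        frontier.extend(expand(dist, path, cost, best_cost))
    return best_path, best_cost

On the 4-city matrix sketched earlier, solve(DIST) returns a cost-80 tour such as [0, 1, 3, 2, 0].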

Experimental Results

Metric                    | Zero-Shot     | CoT Prompt | Agentic System
Found Feasible Solution?  | Yes           | Yes        | Yes
Valid Cost Calculation?   | No (Err: 516) | Yes        | Yes
Total Path Cost           | 493           | 341        | 291 (Optimal)

Closing the Gap in Machine Reasoning

Reasoning can be decomposed into two distinct aspects: a creative "exploration" component and a component dedicated to mathematical accuracy and logic. Agentic systems provide a robust framework for integrating these elements, effectively addressing the issue of context decay that plagues long-form tasks.

At Sabr Research, we are currently developing new architectural designs for agentic systems to further enhance these reasoning capabilities. By explicitly managing memory and state, we create systems that are not only more capable of multi-step logic but are also fully auditable and transparent—foundational requirements for production-grade decision-support systems.

Infrastructure · Engineering