Sabr Research Lab

SAGE AI: 70% Accuracy in Legal Outcome Prediction

Insights into AI-powered decision systems and scalable infrastructure

Sabr Research · January 2026

Benchmarking Legal Logic

How SAGE AI Achieves 70% Accuracy in Appellate Courts

Predicting the outcome of legal cases is a task characterized by high uncertainty and complexity. Unlike lower courts, where decisions are often driven by direct factual disputes (e.g., "Did the defendant run the red light?"), US State Appellate Courts operate on a more abstract level. They do not re-try facts; rather, they analyze whether the lower court applied the law correctly.

Consequently, appellate outcomes frequently turn on procedural technicalities, standards of review, and judicial interpretation of precedent rather than the "fairness" of the underlying facts. Even among experienced legal professionals, consensus on the outcome of a difficult appeal is rare.

In this article, we evaluate SAGE AI, a decision engine built on a Retrieval Augmented Logic (RAL) architecture, on the AnnoCaseLaw dataset, which consists of complex negligence appeals from US State Courts. Our analysis indicates that when the system’s internal confidence is high, it achieves 70.0% accuracy, suggesting a capability to reliably identify clear legal signals within a complex dataset.


SAGE AI for Legal Applications

It is important to clarify that SAGE is designed as a decision augmentation tool, not a replacement for human legal experts. Rather, it functions as a force multiplier: it enables human experts to "inject" their tacit knowledge into the system, allowing that expertise to be generalized and applied at scale across thousands of new scenarios. Furthermore, in the context of appellate law, several structural factors make a "Human-in-the-Loop" approach not just beneficial, but necessary:

  • Subjectivity of Standards

    Legal terms such as "reasonable care" or "abuse of discretion" are inherently subjective. Their application often depends entirely on a specific judge’s perception, which a model may not fully capture without expert tuning.

  • External Factors

    Judicial decisions may also be influenced by unwritten jurisdictional norms, local political climates, or procedural histories that do not appear in the text of a case summary.

  • Data Sparsity

    A predictive model is limited to the inputs it receives. In many appeals, the decisive factor may be a subtle nuance in the trial transcript that is absent from the brief summary provided to the system.

In high-volume environments where hundreds of cases must be reviewed, partial automation offers significant economic impact. With the cost of complex appellate litigation frequently exceeding $100,000, prioritization is an economic necessity.

By successfully identifying the subset of cases where the legal logic is stable and the outcome is highly predictable (High Confidence), SAGE AI allows human experts to redirect their cognitive resources and expertise to the ambiguous, lower-confidence cases where their expertise adds the most value.

Finally, it is essential to recognize that in the legal domain, prediction without explanation is of limited value. A practitioner cannot advise a client or shape a strategy based on a probability score derived from opaque, hidden data patterns. There is a fundamental need for explainability backed by grounded logic: reasoning that is transparent and directly traceable to specific, understandable concepts. SAGE AI provides this transparency and grounded explainability through its RAL-based recommendations and rationale.


Methodology & Deep Dive

We utilized the AnnoCaseLaw dataset, a recently introduced benchmark that addresses critical flaws in earlier legal judgment datasets, specifically their lack of realism and reliance on misleading metrics. Crucially, the authors of AnnoCaseLaw emphasize the necessity of explainability: effective legal AI must provide a grounded rationale for its decisions, rather than operating as an opaque "black box" oracle.

Data Processing and Task Definition

The original dataset categorizes case outcomes into three classes: Affirm, Reverse, and Mixed. To replicate a realistic decision-making scenario for litigation professionals, we narrowed the scope to a more actionable binary classification task. We filtered the dataset to retain only cases with definitive outcomes:

  • AFFIRM: The lower court ruling is upheld.
  • REVERSE: The lower court ruling is overturned.

This binarization mirrors the fundamental choice in legal business strategy: to appeal or not to appeal; to fund a case or to decline it. For each case, the dataset provides the following structured information:

  • Facts: The narrative of events that led to the initial litigation.
  • Procedural History: The sequence of decisions and events that occurred in the lower courts.
  • Application of Law to Facts: The court’s justification for its ruling, detailing exactly how the law applies to the specific facts of the case.
  • Precedents: A list of case citations (e.g., "Gettemy v. Grgula, 25 Ill. App. 3d 625").

Note on Knowledge Base Construction: To build the Knowledge Base for SAGE AI, we explicitly excluded the raw "Precedents" field. As this field contains only citation strings without context, it could not inform a prediction without external online search, a variable we excluded in order to strictly evaluate the system's internal logical reasoning capabilities.
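
The sketch below illustrates this preprocessing step. It is a minimal Python sketch; the field names and loading mechanism are illustrative assumptions, not the actual AnnoCaseLaw schema:

```python
# Minimal sketch of the preprocessing described above. Field names are
# illustrative assumptions, not the actual AnnoCaseLaw schema.
def to_binary_task(raw_cases):
    """Keep only definitive outcomes and drop the context-free citation field."""
    cases = []
    for case in raw_cases:
        if case["outcome"] not in ("AFFIRM", "REVERSE"):
            continue  # drop "Mixed" outcomes
        cases.append({
            "facts": case["facts"],
            "procedural_history": case["procedural_history"],
            "application_of_law": case["application_of_law_to_facts"],
            # "precedents" is intentionally excluded: bare citation strings
            # carry no usable context without external search.
            "label": case["outcome"],
        })
    return cases
```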

Experimental Setup: The Retrieval Augmented Logic (RAL) Approach

Our methodology is distinguished by its use of a strictly Agentic Workflow rather than traditional model training. We did not fine-tune a Large Language Model (LLM) on legal data, nor did we employ complex prompt engineering to "teach" the model law or legal reasoning. Instead, we relied on a logic retrieval-based architecture, which is the backbone of SAGE AI:

1. The Knowledge Base

We randomly sampled 180 cases from the dataset to serve as a static "Knowledge Base." These cases were not used for training; instead, they function as a reference library. The primary function of this base is to provide valid "logical paths" (i.e., examples of how courts have reasoned in the past) which the system can then generalize to new scenarios.

2. The Evaluation Set

We ran the prediction engine on a separate, unseen set of 237 cases.
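
In code, the split looks roughly like the following hedged sketch. The seed is arbitrary, and `binary_cases` is assumed to be the output of `to_binary_task` above:

```python
import random

random.seed(0)  # arbitrary fixed seed, for illustration only
shuffled = random.sample(binary_cases, k=len(binary_cases))
knowledge_base = shuffled[:180]     # static reference library, never trained on
eval_set = shuffled[180:180 + 237]  # separate, unseen evaluation cases
```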

3. The Reasoning Mechanism

SAGE follows a three-step logic process for each test case:

  • Analysis: It parses the facts and procedural history of the new case.
  • Retrieval: It queries the Knowledge Base to identify relevant historical logical paths.
  • Synthesis: It formulates a prediction by applying those retrieved logical paths to the new case, effectively mimicking human analogical reasoning.
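
A minimal sketch of this three-step loop is shown below. The `retriever` and `llm` objects, their method signatures, and the prompt structure are hypothetical stand-ins for illustration; SAGE's internal interfaces are not public:

```python
# Hedged sketch of the three-step RAL loop; `retriever` and `llm` are
# hypothetical stand-ins for SAGE's internal components.
from dataclasses import dataclass

@dataclass
class Prediction:
    outcome: str       # "AFFIRM" or "REVERSE"
    confidence: float  # internal confidence score in [0, 1]
    rationale: str     # explanation grounded in the retrieved logical paths

def build_prompt(case, paths):
    # Illustrative prompt assembly: present retrieved reasoning as examples,
    # then ask for a prediction on the new case.
    examples = "\n\n".join(p["application_of_law"] for p in paths)
    return (
        "Historical logical paths:\n" + examples +
        "\n\nNew case facts:\n" + case["facts"] +
        "\n\nProcedural history:\n" + case["procedural_history"] +
        "\n\nPredict AFFIRM or REVERSE and justify via the logical path used."
    )

def predict(case, knowledge_base, retriever, llm):
    # 1. Analysis: parse the new case into a retrieval query.
    query = case["facts"] + "\n" + case["procedural_history"]
    # 2. Retrieval: fetch relevant historical logical paths (k is arbitrary).
    paths = retriever.top_k(query, knowledge_base, k=5)
    # 3. Synthesis: apply the retrieved reasoning to the new scenario.
    return llm.predict(build_prompt(case, paths), output_type=Prediction)
```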

Baseline Performance: In our specific evaluation set, the "AFFIRM" outcome occurred 54.77% of the time. This establishes the statistical floor: an extremely naive classifier would achieve a 54.77% accuracy simply by guessing the majority class.
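
As a quick sanity check, this floor is simply the majority-class rate:

```python
# The statistical floor is just majority-class guessing.
affirm_rate = 0.5477                        # share of AFFIRM in the 237-case set
floor = max(affirm_rate, 1 - affirm_rate)   # always guess the majority class
print(f"naive baseline accuracy: {floor:.2%}")  # -> 54.77%
```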


Results

To demonstrate value, a system must extract enough signal to surpass the statistical probability floor of 54.77% (the accuracy of blindly guessing the majority class). The results of the SAGE evaluation are detailed below:

Metric | Accuracy | Interpretation
SAGE (Overall) | 60.5% | Predicting every case, SAGE exceeds the statistical floor, indicating it successfully identifies relevant legal logical paths.
SAGE (High Confidence) | 70.0% | Restricted to predictions where SAGE’s internal confidence score exceeded 80%, accuracy improves significantly.

The performance differential between the "Overall" score (60.5%) and the "High Confidence" score (70.0%) is the most critical finding: it suggests that the system is well calibrated. This is particularly significant because SAGE was not trained as a traditional classifier, where probability scores are mathematically optimized during the learning process. The calibration therefore implies that the confidence score is not a statistical artifact but stems directly from the consistency of the system's internal logic: when it identifies clear, robust logical paths, it correctly signals high certainty.

When the retrieved historical logical paths can be clearly applied to the new scenario, SAGE reports high confidence and achieves high accuracy. Conversely, when the logic is ambiguous, the confidence score drops. This calibration validates a practical use case for the tool: by effectively flagging the cases it "knows" it can predict while signaling uncertainty on the rest, the system allows the human legal expert to direct their attention to the cases that actually require deep manual review.
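
Reusing the names from the sketches above, the confidence-gated evaluation reduces to a simple filter. The 0.80 threshold matches the one reported in the table; everything else is an illustrative assumption:

```python
# Confidence-gated evaluation, assuming `predict`, `eval_set`, and the
# retriever/LLM instances from the sketches above are available.
def accuracy(pairs):
    return sum(pred.outcome == label for pred, label in pairs) / len(pairs)

pairs = [(predict(c, knowledge_base, retriever, llm), c["label"]) for c in eval_set]

overall_acc = accuracy(pairs)                                    # reported: 60.5%
confident = [(p, y) for p, y in pairs if p.confidence > 0.80]
high_conf_acc = accuracy(confident)                              # reported: 70.0%
```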


Practical Example: Logic in Action

To illustrate how the SAGE AI engine functions in practice and provides rationale for recommendations, we present a specific example from the test set: Thomas v. Lynch, 59 Ill. App. 3d 542 (1978).

CASE RECORD: BACKGROUND (Inputs)
Facts

The accident occurred on that portion of U.S. Route 50 which runs east and west between Salem and Flora, Illinois... There were stop signs at the intersection controlling both north and south bound traffic... There were no stop signs for traffic on Route 50.

Plaintiff got out of his car after the accident... He went over to defendant and said to him "You ran the stop sign", to which defendant replied, "I know I did". The State trooper who investigated the accident issued a traffic citation... Defendant pled guilty to this citation which was admitted into evidence without objection.

Procedural History

Plaintiff brought suit to recover for injuries allegedly sustained as a result of an automobile collision. The jury found for defendant and the plaintiff appeals.

Logic Keys:
  1. Admission of Guilt
  2. Evidence of Negligence
  3. Verdict Error

ACTUAL RULING: REVERSED

Application of Law to Facts

The text below represents the actual rationale provided by the Appellate Court.

"Plaintiff, contends that the trial court erred... We are in agreement...

We consider that standard to have been met in the instant case. The evidence plainly established that defendant was negligent. He testified that his car struck plaintiff's car...

The investigating officer issued a citation to defendant on the scene for failing to yield the right of way. And, although defendant testified that he was traveling 10 miles per hour...

Since the evidence clearly established defendant's negligence the jury's verdict must necessarily be based on a finding that plaintiff was contributorily negligent. Plaintiff's conduct was plainly that of a reasonably prudent person...

We have reviewed the evidence... and that evidence... clearly established that defendant was negligent."

SAGE AI Recommendation

Analysis complete based on historical patterns.

Predicted Outcome: REVERSE (Correct Prediction)
Confidence Score: 88%

Decision Rationale

"The decision to reverse is justified based on the defendant's admission of guilt for failing to yield at a stop sign, aligning with precedent cases...

The defendant's negligence and breach of duty are evident, directly causing the accident...

The jury's verdict contradicts the clear evidence of negligence, necessitating a reversal to ensure consistency with established legal standards."

Disclaimer: SAGE AI provides recommendations based on historical patterns. This does not constitute financial or legal advice.

Analysis of Similarity: Validating the Logical Path

This comparison demonstrates that SAGE did not simply guess the correct outcome; it successfully traversed the same logical path as the Appellate Court.

Notice the alignment in the reasoning steps above: both the human judges and the system independently anchored their arguments on the Admission of Guilt, proceeded to the Evidence of Negligence, and concluded by identifying the Verdict Error in the jury's finding. This confirms that the model is effectively mirroring the actual judicial reasoning process.

Conclusion

In the complex domain of legal judgment prediction, SAGE AI demonstrates the ability to deliver high accuracy on high-confidence predictions, enabling users to make informed prioritization decisions.

Our findings suggest that the key to achieving high accuracy in this field lies in the robust representation of historical "logical paths" and the system's capacity to extrapolate these paths to novel scenarios in a valid and explainable manner. Furthermore, human expertise remains critical, particularly in designing the underlying Knowledge Base and curating the relevant decision factors that guide the retrieval process.

Finally, it is important to note that this evaluation relied strictly on the text of the case summaries. We did not incorporate external variables known to influence appellate outcomes, such as specific judge profiles, local jurisdictional norms, or a comprehensive precedent database. The successful extraction of a clear predictive signal despite these limitations highlights the power of the core Retrieval Augmented Logic architecture. Future iterations that integrate these wider contextual elements are likely to yield even greater predictive precision.

References & Further Reading

  • AnnoCaseLaw:
    Sesodia, M., Petrova, A., Armour, J., et al. (2025). AnnoCaseLaw: A Richly-Annotated Dataset For Benchmarking Explainable Legal Judgment Prediction. arXiv:2503.00128.
  • Appellate Process Background:
    Administrative Office of the U.S. Courts. (n.d.). Appellate Courts and Cases – Journalist’s Guide. United States Courts. Retrieved from uscourts.gov.

DISCLAIMER: The content provided in this article and by the SAGE AI system is for informational and research purposes only. It does not constitute professional legal advice, diagnosis, or strategy. The accuracy metrics presented are based on historical datasets and specific experimental conditions; real-world results may vary. SAGE AI is a decision-support tool designed to augment, not replace, the judgment of qualified legal professionals. Users should always verify AI-generated insights against primary legal sources and consult with counsel before making legal decisions.

Tags: SAGE AI · RAL · Product