White Hat IQ — SLM Fine-Tuning Research Report

Table of Contents

Why This Research Matters
Abstract
Research Programme — 8 Models
Fine-Tuning Methodology
Expectations & Hypothesis
Models Selected & Why
Architecture & Fine-Tuning Parameters
Results — Summary Table
Key Findings
Learnings
Constraints
Conclusions
Methodology — Full Detail

Why This Research Matters

SLMs act as workers in a digital factory, while LLMs serve as consultants for complex challenges. Agentic AI doesn't require a Swiss Army knife when a single sharp tool will do.

— NVIDIA Developer Blog: How Small Language Models Are Key to Scalable Agentic AI

NVIDIA's case is clear: the future of scalable agentic AI is specialised small language models, not ever-larger general-purpose ones. SLMs are 10–30x cheaper to run, can be fine-tuned in hours rather than weeks, run at the edge without cloud dependency, and — when properly trained — produce more reliable structured outputs than frontier models for narrow, repetitive tasks.

The construction industry is a compelling testbed for this thesis. Contract review, delay analysis, and schedule management are high-value, repetitive, and structurally well-defined — exactly the kind of task NVIDIA argues SLMs can own. A specialist construction AI agent running on-device could analyse contracts in real time, flag delay risks before they escalate, and validate schedules without sharing sensitive project data with a cloud provider.

This research programme tests that hypothesis directly: can small language models (3B–20B parameters), fine-tuned on synthetic construction data, reliably perform these tasks? And what are the real limits of the approach?

Abstract

We evaluated 8 small language models ranging from 3B to 20B parameters on a synthetic construction dataset of 337 examples spanning three domains: contract clause analysis, delay attribution, and schedule analysis. Models span multiple vendors (Mistral, Microsoft, NVIDIA, Google, IBM, Alibaba, OpenAI) and architectures (pure Transformer, hybrid Mamba-2 + Attention, Mixture of Experts). All models were evaluated via the same 6-test standardised evaluation suite (3 Standard + 3 Curveball) with local inference through LM Studio.

Results reveal significant performance variation across the 3B–20B range, with model size showing surprisingly low correlation with performance. The next phase applies a specialist fine-tuning strategy — training each model on a single domain (Contracts, Delays, or Schedules) rather than all three simultaneously — informed by 10 prior trial rounds on Qwen3-4B and Nemotron-3-Nano-4B that demonstrated multi-domain fine-tuning causes catastrophic interference at this dataset scale.

The complete cycle of 8 models × 3 domains × 4 pathways (Base, Base+PE, FT, FT+PE) exposes two architecturally distinct routes to production-grade performance, and the choice between them depends on the knowledge structure of the target domain rather than on model size or training budget.

The first route — fine-tuning — is justified when the target domain admits a closed-form reasoning pattern (TIA framework and Adyard concurrency in Delays; planner-pattern selection and CPM execution in Schedules), when training data of at least sixty pattern-bearing examples is available, and when compute is sufficient for a hundred-plus optimisation steps. Fine-tuned wins in this cycle include Gemma 4 E4B (Contracts +10, Delays +21), Ministral 14B FT+PE (Delays +19), and GPT-OSS 20B FT+PE on Schedules at 92.5/100, the overall best score recorded in the programme.

The second route — a strong base model paired with deterministic scaffolding and an engineered system prompt — is equally viable when resources or training data are constrained, and in several cases is the preferred path even when fine-tuning is feasible. Qwen 3.5 9B Base+PE matched or surpassed its own fine-tune in every domain tested, including a perfect 98/100 on the CB3 Northbrook Solar EPC CPM analysis. The base reasoner, when given explicit rule injection in the system prompt and a deterministic execution layer (CPM solver, clause-database retrieval, validator loop), produces production-grade output without any weight modification, iterates faster, costs less to operate, and adapts dynamically to new clauses, projects, and regulations.

The practical recommendation is therefore domain-conditional rather than model-conditional. Fine-tuning is the preferred path for domains with stable, closed-form knowledge (delay attribution rules, CPM arithmetic, schedule grammar). Base+scaffolding+PE is the preferred path for domains with evolving knowledge (contract clauses, jurisdiction-specific precedent, company-specific commercial terms). The strongest production stack combines both: a fine-tuned planner-analyst model wrapped in retrieval-augmented scaffolding and addressed through a carefully engineered prompt.

Dataset Privacy Notice: The training dataset is synthetic and augmented, designed to mimic real-world construction patterns. It was built from real-world reference material and cannot be shared publicly to avoid indirect exposure of proprietary project data, even in anonymised form.

Research Programme — 8 Models

Model	Vendor	Params	Architecture	Context
Ministral 3B	Mistral	3B	Transformer	131K
Phi 4 Mini	Microsoft	3.8B	Transformer	131K
Nemotron 4B	NVIDIA	4B	Hybrid Mamba-2 + Attn	131K
Gemma 4 E4B	Google	4B	Transformer	131K
Granite 4 Tiny	IBM	7B	MoE	131K
Qwen 3.5 9B	Alibaba	9B	Transformer	131K
Ministral 14B	Mistral	14B	Transformer	131K
GPT OSS 20B	OpenAI	20B	Transformer	131K

Fine-Tuning Methodology

We use Unsloth with QLoRA (Quantised LoRA) — a highly optimised fine-tuning framework that reduces VRAM usage and increases training speed 2–5× versus standard PEFT. Base model weights are kept frozen in their original precision; only low-rank adapter matrices (LoRA) are trained. This enables full fine-tuning-quality adaptation with a fraction of the VRAM and time.

Training Pipeline

Data format: Chat-template JSONL (system + user + assistant messages)
Optimiser: AdamW, weight_decay=0.01, cosine LR schedule with warmup
Gradient accumulation: 4 steps (effective batch size = 4)
Gradient clipping: max_norm=1.0 (mandatory for SSM stability)
Evaluation: Per-epoch eval loss on 29-example holdout set
Checkpointing: Best model by eval loss, with early stopping
Export: Merged weights → GGUF Q8_0 → LM Studio local API

Evaluation Suite

6 evaluation scenarios across 3 domains — one Standard (T1) and one Curveball (CB) per domain. The T1 tier tests competency on realistic project scenarios; the CB tier tests generalisation to novel jurisdictions, project types, and contract forms entirely unseen in training. Responses are scored by an LLM judge against pre-computed golden answers with full reasoning. All models evaluated across up to 8 pathways (Base/FT × Original/Enhanced prompt × Think/NoThink).

Expectations & Hypothesis

The research programme is structured around four hypotheses, derived from the NVIDIA SLM thesis, ten prior multi-domain trial rounds on Qwen3-4B and Nemotron-3-Nano-4B, and the specialist-strategy revision that followed.

Hypothesis A. Specialist fine-tuning — training each model on a single domain rather than all three simultaneously — will outperform multi-domain fine-tuning by avoiding the catastrophic interference observed in the prior trial rounds. A single-domain adapter learns one task grammar without competing objectives, and the dataset scale used here is insufficient to support concurrent domain mastery.

Hypothesis B. Models that already score highest in a domain's base evaluation possess the strongest foundation for fine-tuning on that domain, and the marginal lift from training should be largest for these top-ranked base models. The expectation is that fine-tuning pushes the leading bases toward the 95–100 range on their specialist domain.

Hypothesis C. Schedules will remain the hardest domain after fine-tuning. Schedule tasks demand multi-step numerical reasoning (CPM forward and backward passes, predecessor arithmetic, lag-type interpretation), and even the strongest base score is only 85.8. Fine-tuning is expected to improve format compliance and pattern recall but is unlikely, on its own, to repair arithmetic reasoning gaps.

Hypothesis D. Model size is not the primary predictor of performance. The base evaluation already shows 4B models outperforming 14B and 20B models on several domains. Architecture (thinking capability, hybrid state-space + attention, mixture-of-experts), pre-training data quality, and fine-tuning data design are expected to matter more than raw parameter count for construction-domain reasoning.

Models Selected & Why

These 8 models represent the latest thinking-capable SLMs in the 3B–14B "Small Language Model" range, with GPT-OSS-20B as the only "small-to-medium" exception. Selection criteria: open-weights, local inference capability on 16 GB VRAM, instruction-following, and structured JSON output support.

Model	Size	Thinking	Rationale
Ministral 3B	3B	No	Smallest Mistral model — tests the absolute floor for construction domain capability
Phi 4 Mini	3.8B	Yes	Microsoft's compact reasoning model — strong structured output at minimal parameter count
Nemotron 4B	4B	No	NVIDIA's hybrid Mamba-2 architecture — tests SSM vs Transformer for construction tasks
Gemma 4 E4B	4B	Yes	Google's latest 4B model with thinking — direct comparison to other 4B models
Granite 4 Tiny	7B	No	IBM's enterprise MoE model — tests sparse expert routing for domain specialisation
Qwen 3.5 9B	9B	Yes	Alibaba's mid-range reasoning model — builds on Qwen3-4B trial run findings
Ministral 14B	14B	Yes	Largest SLM in programme — tests whether 14B yields meaningfully better FT results
GPT OSS 20B	20B	Yes	OpenAI's first open-weight model — baseline from frontier lab at small-to-medium scale

Architecture & Fine-Tuning Parameters

The following parameters apply generically across all models during fine-tuning. Specific per-model values will be determined based on base evaluation results and domain assignment.

Parameter	Description / Meaning
LoRA Rank (r)	Dimension of the low-rank adapter matrices. Higher rank = more capacity but higher forgetting risk. Typical range: 8–32.
LoRA Alpha	Scaling factor for LoRA updates. Usually set to 2x the rank (e.g., r=8, alpha=16). Controls how aggressively the adapter modifies the base weights.
Learning Rate	Step size for weight updates. Too high = catastrophic forgetting. Too low = no learning. Typical range: 1e-5 to 5e-5.
Max Epochs	Maximum number of full passes through the training data. Combined with early stopping to prevent overfitting.
Early Stop Patience	Number of epochs without eval loss improvement before stopping training. Prevents wasted compute and overfitting.
Grad Clip	Maximum gradient norm (typically 1.0). Prevents exploding gradients — critical for SSM/Mamba layer stability.
Warmup Ratio	Fraction of training steps with linearly increasing LR. Prevents early destabilisation. Typical: 0.10–0.15.
LoRA Target Modules	Which model layers receive LoRA adapters. Typically attention (q/k/v/o_proj) and MLP (gate/up/down_proj). SSM layers require special handling.
Batch x Accumulation	Effective batch size = batch_size x gradient_accumulation_steps. Constrained by 16 GB VRAM. Typically 1 x 4 = 4.
Dataset size	Number of training examples. Current: 337 (291 train / 46 val). Domain split: 164 contracts, 85 delays, 88 schedules (v3).
Precision	Training dtype. FP16 or BF16 autocast, matched to native model precision. BF16 preferred for stability.

Results — Summary

Model	Size	Thinking?	Base Contracts	Base Delays	Base Schedules	FT Contracts	FT Delays	FT Schedules	Top-3 Domains
Ministral 3B	3B	N/A	49/100	63/100	60/100	—	62/100 ↑	—	Delays
Phi 4 Mini Reasoning	3.8B	Yes	59/100	57/100	56/100	—	—	—	—
Nemotron 4B	4B	Yes	76/100	51/100	71/100	70/100 ↓	—	—	Contracts
Gemma 4 E4B	4B	Yes	82/100	62/100	79/100	92/100 ↑	75/100 ↑	84.5/100 ↑	Contracts Schedules Delays
Granite 4 Tiny	7B	N/A	27/100	57/100	65/100	—	—	—	—
Qwen 3.5 9B	9B	Yes	89/100	55/100	86/100	78/100 ↓	—	76.5/100 ↓	Schedules Contracts
Ministral 14B	14B	Yes	66/100	66/100	62/100	—	79/100 ↑	—	Delays
GPT OSS 20B	20B	Yes	78/100	53/100	78/100	—	—	89/100 ↑	Schedules

Contracts Domain — Top 3

Rank	Model	Score
1st	Qwen 3.5 9B	89/100
2nd	Gemma 4 E4B	82/100
3rd	Nemotron 4B	76/100

Delays Domain — Top 3

Rank	Model	Score
1st	Ministral 14B	66/100
2nd	Ministral 3 3B	63/100
3rd	Gemma 4 E4B	62/100

Schedules Domain — Top 3

Rank	Model	Score
1st	Qwen 3.5 9B	86/100
2nd	Gemma 4 E4B	79/100
3rd	GPT OSS 20B	78/100

Base Model Key Findings

Phi 4 Mini Reasoning

Strong FIDIC delay causation reasoning (CB2 = 78/100, key discriminator passed). Perfect activity completeness in T3 (25/25) and all durations within benchmark ranges (22/30). The model fails on DB clause ID lookup (all null across both contracts tests), predecessor circular dependencies in T3 (three deadlock chains penalise the C section to 0/25), CPM arithmetic in CB3 (negative total floats, duration overstated), and the FIDIC 19.4 cost rule (Force Majeure is time-only, not cost). Extended thinking (45K–56K tokens) helps classification but does not repair arithmetic or graph construction. Full sub-scores are recorded on the Phi 4 Mini tab.

Nemotron 3 Nano 4B

Contracts score 76/100 combined on clean inputs (T1 = 85, CB1 = 66); the earlier 79 (with "C-002 Rejected, 20/21 correct") came from an answer-leaked docx CB1 run. On clean CB1, C-002 returns "Requires Review" (KEY DIS FAIL), and the model exhibits heavy Accept-bias. Delays performance is catastrophically weak (T2 = 35/100, CB2 = 67/100): DEL-003 misclassified, EOT outside range, no FIDIC citations in T2. Schedules generation is strong, but the SS+10 lag is misread as FS+10, producing a 55-working-day inflation in CB3. Full sub-scores on the Nemotron 4B tab.

Ministral 3 3B

Contracts weak on clean inputs (T1 = 63, CB1 = 34); the earlier "CB1 89/100" and "+26 point T→CB recovery" came from an answer-leaked docx run. Clean CB1 shows severe Accept-bias (17 of 21 clauses Accepted) and a C-002 KEY DIS FAIL. Delays performance is moderate: DEL-OW-002 (Contractor, FIDIC 4.15) is correctly classified in both tests, but the FM label is inconsistent and EOT falls outside range. Recurring invalid JSON output (// comments, arithmetic expressions in values) appears across delays tests. Schedules CPM is systematically broken: T3 circular dependencies collapse C to 0; the CB3 SS+10 misread inflates duration to 296 vs golden 265.

Granite 4 Tiny

IBM's enterprise MoE underperforms its 7B class on contracts (T1 = 33/100: four hallucinated articles, Art10 marked Accepted, KEY DIS FAIL), but shows a unique T2 strength — the first model to correctly classify DEL-003 as employer non-critical with zero EOT entitlement. Schedule generation is solid (T3 = 78/100, no circular dependencies). CB3 CPM is broken: Activity 8 is not on the critical path (SS+10 misread as FS+10, ES = 235 not 175). Very brief CB responses suggest a capacity limit at this MoE scale.

Gemma 4 E4B

Base overall 74/100. Contracts: T1 = 93/100 (all 14 articles correct, no hallucinations, Art10 Rejected ✓, strong DB IDs); CB1 = 70/100 on the clean plain-text input. The CB3 breakthrough is the first correct SS+10 lag application in the programme (ES8 = 175), achieving exact 265-working-day duration. Thinking traces (10K–12K characters) show thorough per-item reasoning. The principal weakness is T2 Delays (49/100): DEL-003 placed on the critical path, and a backward-pass error produces negative TF values (−5).

Qwen 3.5 9B

Highest overall base score (77/100). CB3 is outstanding at 96/100 — exact 265-working-day duration, perfect critical path, correct SS+10 (ES8 = 175) and SS+15, and no negative TF values (fixes Gemma's backward-pass error). Contracts is the strongest in the programme: T1 = 88/100 with Art10 Rejected and thorough 41K-character reasoning; CB1 = 89/100 on clean input. The CB1 key discriminator nonetheless fails — C-002 is labelled "Requires Review" because the model correctly identifies DB3 as "completely rejected" but argues that a negotiation-only clause does not match DB3's arbitration-specific entry, missing that Rule 3 applies to the whole Dispute Resolution category. Delays remains consistently weak (T2 = 56, CB2 = 53). T3 circular predecessor dependencies (three chains) kill the C section.

Ministral 14B

T1 contracts key discriminator passes (Art10 Rejected), but CB1 C-002 fails on the clean plain-text input (Requires Review, not Rejected; the earlier "C-002 Rejected" was an answer-leaked docx run; clean CB1 = 54/100). T2 event identification is perfect (35/35 A-section) — the best event recall in the programme. CB2 is strong (70/100): DEL-OW-002 Contractor correct, Adyard principle cited, weather EOT maintained, EOT = 35 within acceptable range. Weaknesses include T2 KEY DIS fail (DEL-003 placed on CP, EOT = 45 not 0), T3 circular dependencies 15↔16↔17 (C = 0/25), and CB3 SS+10 lag misread (ES8 = 165 not 175). Uniquely, CB3 C = 25/25 — no negative TF, cleaner backward pass than Gemma's. reasoning_content is empty across all tests and all outputs are wrapped in a markdown code fence.

GPT-OSS 20B

OpenAI's first open-weight model evaluated locally via LM Studio with reasoning_effort=low. T1 contracts key discriminator passes (Art10 Rejected, explicit Rule 3, strong T1 reasoning with Art10/7/6/3 all 6/6), but CB1 C-002 fails on the clean plain-text input (Requires Review, not Rejected). T3 has no circular dependencies — one of only three models to achieve this (with Granite and Nemotron). CB3 SS+10 is correctly applied in the forward pass (ES8 = 175), matching Gemma and Qwen, but a backward-pass error gives TF8 = 40 and Activity 8 non-critical (KEY DIS FAIL). Delays is the weakest domain (53/100): T2 DEL-003 placed on CP, EOT = 1.5 months vs golden 0; CB2 is severely truncated (29 reasoning tokens, 758 characters), missing the concurrent delay analysis and inverting the FIDIC 19.4 FM cost rule. reasoning_effort=low is insufficient for delay analysis and CPM backward-pass depth.

Learnings — Per Fine-Tuned Model

Ministral 3 3B — Delays

The smallest model fine-tuned in the programme (3B, non-thinking). The Delays cycle proceeds base 47 → FT 57 (+10) → FT+PE 62 (+15 vs base). FT+PE is the strongest configuration: the engineered prompt's explicit FIDIC and Adyard rule injection eliminated an invalid-JSON output defect and corrected the DEL-OW-002 contractor responsibility classification. Even a non-thinking 3B can be lifted into the mid-60s on Delays when the base has a workable foundation. Contracts and Schedules were not fine-tuned for this model; Delays was its strongest base domain (63), so the fine-tune targeted that strength.

Nemotron 3 Nano 4B — Contracts

On clean inputs the Contracts fine-tune scored 70 against base 76 — a −6 regression. The earlier "FT learned the rejection hierarchy (C-002)" claim was an artefact of an answer-leaked docx CB1 run; on clean CB1 both base and fine-tune fail C-002 (Requires Review, not Rejected). The real T1 regressions — article hallucination, over-triggered modification — persist. FT+PE recovered the score to 82, clearing base 76, which is a genuine best-of-both-worlds outcome unique to Nemotron in this cycle. The lesson for hybrid Mamba-2 architectures with limited training data is that fine-tuning alone is risky; FT + engineered prompt is the safer stack.

Gemma 4 E4B — Contracts, Delays, Schedules

The most fine-tuned model in the programme. Contracts moved from 82 to 92 (+10) — the only contracts fine-tune that improves on its base — driven by CB1 70→94 (+24); the fine-tune passes the C-002 key discriminator that the base fails and handles the Finnish YSE 1998 curveball cleanly, provided it is evaluated in thinking mode at ≥8k context. Delays moved from 62 to 75 with FT+PE (+13); this configuration returned the first correct partial EOT for the disputed TP fire in CB2 (38 days vs golden ~40). Schedules v3 moved from 79 to 84.5 with FT vanilla (+5.5); the v3 dataset rework (77 planner-pattern examples, T3+CB3 contamination eliminated) raised T3 from v2's 65 to 83 and removed hallucinations such as "Berlin". Prompt engineering actively hurt Gemma's Schedules CB3 by over-prescribing rules that broke the backward pass. The overall pattern is that Gemma absorbs the fine-tune pattern cleanly across all three domains, while PE helps Delays and hurts Schedules.

Qwen 3.5 9B — Contracts, Schedules

Contracts regressed by −11 (89→78). At temp = 0.6 the fine-tune emits the invalid string "Accepted subject to modification" on 7 of 14 T1 articles — a format defect rather than a reasoning failure. FT+PE rescues the score to 87 but still trails base 89 and base+PE 94. Schedules v3 follows a similar pattern: FT vanilla 76.5 (−2.5 vs base 79), FT+PE 73 (worse still). Crucially, Base+PE on Schedules reaches 87.5 combined (+8.5), including a perfect CB3 98/100 (exact PD = 265, exact CP, all flags consistent). The verdict for Qwen on both domains is to skip fine-tuning entirely and operate the model in Base+PE mode: strong base reasoning combined with explicit CPM rule injection produces production-grade output without any weight modification overhead.

Ministral 3 14B — Delays

The 14B Reasoning model on Delays: base 66 → FT 70 → FT+PE 79 (+19 vs base). FT+PE was the only configuration to compute the Adyard offset exactly right (42d vessel breakdown − 26d concurrent window = 16d contractor LD). The combination of 14B reasoning capacity, planner-pattern training data, and explicit FIDIC rule injection in the prompt produced the strongest Delays score in the programme. Training required the Unsloth path (raw PEFT crashed on the 4-bit Ministral3 base); per-epoch evaluation was disabled because the accelerate fp32 conversion OOMs on logits at sequence length 4096; and stream merge was mandatory for the locally-trained 4-bit adapter, since Unsloth's save_pretrained_merged corrupts on this combination.

GPT-OSS 20B — Schedules (Overall Winner)

The 20B MoE base (3.6B active) was fine-tuned on the Schedules v3 dataset on a cloud A100, the local 4090 Laptop being blocked twice (Unsloth fused-CE incompatibility on SM89, and the transformers MXFP4 hard training guard). FT+PE reached 92.5 combined (T3 = 87, CB3 = 98), the overall programme winner. Both FT vanilla and FT+PE produce a perfect CB3 98/100: PD = 265 exact, CP = [1, 2, 3, 5, 6, 7, 8, 14, 15, 17, 18] matching golden exactly, Activity 8 critical, SS+10 and SS+15 correctly applied, Activity 1 critical = true (where Gemma produced TF = −15). T3 FT+PE uses multi-DB blending (P1/P2/P4/P5/P8) with per-activity scale rationale. The complete cloud cycle cost approximately $2 and 60 minutes wall-clock; the MXFP4 GGUF (13 GB) was deployed locally for LM Studio. The pattern is the inverse of Qwen: large MoE bases combined with planner-pattern fine-tuning and an engineered prompt stack additively rather than regressing.

Constraints

Hardware Constraints

16 GB VRAM GPU — limits batch size, seq length, and model size
Windows OS — Triton/mamba-ssm not installable (Linux only)
Local inference only — no A100/H100 acceleration
Mamba slow-path forces MAX_SEQ_LEN=512 (vs 4096 ideal)

Dataset Constraints

337 examples is very small for fine-tuning complex reasoning
Synthetic data may not capture all real-world edge cases
Cannot be shared (privacy) — limits reproducibility
Imbalanced: contracts heavy (~55%), delays/schedules light

Evaluation Constraints

Latency/throughput not measured in this phase

Conclusions

Overall Conclusion 1 — Best Configuration per Domain

No single pathway wins all three domains. Contracts rewards Base+PE because the task is dominated by knowledge retrieval and fine-tuning bakes in a stale snapshot of a company's clause database. Delays and Schedules reward FT+PE because the task is closed-form reasoning that transfers cleanly into fine-tuned weights. The strongest production stack selects the pathway per domain on the basis of whether the underlying task is dynamic-knowledge or closed-form-reasoning. The cross-model winners are summarised below.

Domain	Best Model	Best Path	Score	Δ vs Best Base
Contracts	Qwen 3.5 9B	Base + PE	94/100	+5
Delays	Ministral 3 14B	FT + PE	79/100	+19
Schedules	GPT-OSS 20B	FT + PE	92.5/100	+13.5

Overall Conclusion 2 — Model Size Is Not the Primary Predictor

Across the 3B–20B range tested, raw parameter count showed weak correlation with final scores. The 4B Gemma E4B fine-tuned contracts (92) beats both the 14B Ministral (best contracts 66 base) and the 20B GPT-OSS (best contracts 78 base). The 9B Qwen Base+PE contracts (94) is the best contracts configuration overall. On Schedules, the 20B GPT-OSS FT+PE (92.5) does win, but Qwen 9B Base+PE (87.5) and Gemma 4B FT (84.5) are within striking distance at a fraction of the inference cost. Architecture (thinking capability, hybrid state-space, mixture-of-experts), pre-training data quality, and fine-tuning data design matter more than parameter count for construction-domain SLM performance.

Contracts — Skip Fine-Tuning, Use Base + PE + RAG Scaffolding

The three contracts fine-tunes split unevenly: Gemma 4 E4B improved (+10 combined, +24 on the CB1 Finnish YSE 1998 curveball), while Nemotron 3 Nano 4B (−6) and Qwen 3.5 9B (−11) regressed. The "trades generalisation for specialisation" pattern does not hold uniformly. Gemma's fine-tune generalised better, but Qwen's invalid-status-string format defect at temp = 0.6 and Nemotron's article hallucination wiped out fine-tuning gains.

The root cause is that contracts requires reasoning combined with retrieval against a company's evolving clause database. Fine-tuning encodes a snapshot; weights trained today are stale by next quarter as precedents shift, commercial strategy changes, and new clauses close. The architecture-correct path is a capable base (Qwen 9B Base+PE = 94 is the programme high) combined with retrieval-augmented scaffolding (clause-database injection, tool-calling, MCP) and a carefully engineered prompt. Fine-tuning bakes in what the company decided in the past; what is needed is teaching how the model reasons, and for contracts that capability is already present in the base. Fine-tuning should be reserved for the rare model–domain combination where it demonstrably clears the base; in this cycle Gemma 4 E4B contracts is the only such example.

Delays — Fine-Tuning Wins Across the Board

Three models completed the full Delays fine-tune cycle (Ministral 3 3B, Gemma 4 E4B, Ministral 3 14B), each with the same six T2-envelope training additions and the same engineered prompt. Every model improved with fine-tuning, and FT+PE was the best configuration in every case.

Model	Base	Base + PE	FT	FT + PE	FT+PE Δ vs base
Ministral 3 3B (non-thinking)	47	56	57	62	+15
Gemma 4 E4B (thinking)	54	66	54	75	+21
Ministral 3 14B (thinking)	60	74	70	79	+19

Delays behaves differently from Contracts because it is pure reasoning — TIA framework, float arithmetic, responsibility classification, Adyard concurrency — with no live clause database and no jurisdiction-specific knowledge that goes stale. Training data teaches how to reason, which transfers cleanly to weights. The single strongest configuration, Gemma 4 E4B FT+PE, returned the first correct partial EOT for the disputed TP fire in CB2 (38 days vs golden ~40). Ministral 14B FT+PE was the only configuration to compute the Adyard offset exactly (42d vessel breakdown minus 26d concurrent window = 16d contractor LD).

Schedules — Three Pathways All Achieve Perfect CB3, FT+PE Wins T3

The Schedules v3 cycle (planner-pattern dataset, 77 train + 11 val, T3 and CB3 task-pattern contamination eliminated) produced the strongest results of the programme. Three configurations achieved a perfect CB3 score of 98/100 on the Northbrook Solar 50MW EPC CPM analysis.

Configuration	T3 (Cologne generation)	CB3 (Solar EPC CPM)	Combined
Gemma 4 E4B FT vanilla	83	86	84.5
Qwen 3.5 9B Base + PE	77	98	87.5
GPT-OSS 20B FT vanilla	80	98	89
GPT-OSS 20B FT + PE	87	98	92.5

Three patterns emerged across model sizes. The small dense model (Gemma 4 E4B) prefers FT vanilla and is actively hurt by PE. The medium dense model (Qwen 3.5 9B) prefers Base+PE and regresses on CB3 under fine-tuning due to an A2 SS bug. The large MoE model (GPT-OSS 20B) shows additive stacking: both FT vanilla and FT+PE achieve perfect CB3, with PE adding +7 on T3. The v3 dataset rework — replacing v1's mixed format and v2's 8-envelope minority signal with 77 pure planner-pattern examples — was the key unlock. The planner pattern encodes activity selection, sequence logic, and duration grounding; the deterministic scaffold computes the CPM math. The model learns the correct abstraction. The CB3 Northbrook Solar critical path (1→2→3→5→6→7→8→14→15→17→18) was matched exactly by all three top configurations, demonstrating that small, medium, and large SLMs can all reach production-grade CPM analysis when paired with the right pathway.

57/100

Base Model Overall

rubric v1.0 · T+CB avg

62/100

FT+PE Delays (LLM-judge)

+15 vs base · Delays cycle winner

Delays

Best Domain

63/100 combined (T=55 CB=71)

Base + FT Done

Status

Delays FT cycle complete

Domain Summary — Rubric v1.0

Domain	T Score (standard)	CB Score (curveball)	Combined	Fine-Tuned	Δ
Contracts	63/100	34/100	49/100	—	—
Delays	55/100	71/100	63/100	62/100 ↑	+15 (FT+PE)
Schedules	62/100	58/100	60/100	—	—

Standard Tests — T1 to T3

Test	Project	Domain	Base (Rubric)	FT	Δ
T1	Hamburg Tower — Contract Clause Analysis (14 articles)	Contracts	63/100 A=33 B=16 C=14	—	—
T2	AB v AP Residential — Delay Attribution & EOT	Delays	55/100 A=32 B=13 C=10	—	—
T3	Cologne Residential — 18-Activity Schedule Generation	Schedules	62/100 A=25 B=30 C=0 D=7	—	—

Curveball Tests — CB1 to CB3

Novel jurisdiction / sector / contract form — tests framework generalisation beyond training data.

Test	Project	Domain	Base	FT	Δ
CB1	VAN-MIX-2025-011 — Finnish YSE 1998 mixed-use contracts	Contracts	34/100 A=17 B=9 C=8	—	—
CB2	Grim Tide Offshore Wind — FIDIC FM & concurrent delay	Delays	71/100 A=26 B=28 C=17	—	—
CB3	Northbrook Solar 50MW — EPC CPM schedule	Schedules	58/100 A=17 B=16 C=19 D=6	—	—

Base Model Learnings

Contracts: Severe CB1 Accepted-Bias (34/100, clean plain-text input)

CB1 re-run on the standardised plain-text input — the earlier 44/100 came from the answer-leaked docx extraction (the docx carries a "Notes / Flaws" column stating every clause's defect). Clean score: 34/100 (A=17 B=9 C=8). Severe Accepted-bias: 17/21 clauses returned "Accepted" regardless of content. C-002 marked "Requires Review (Inferred)" — KEY DIS FAIL. No key discriminator passes in either T1 or CB1. DB ID accuracy collapses without the leaked answer key (B=9/20). Training must reinforce STATUS DECISION HIERARCHY.

Delays: FM Classification Inconsistent

DEL-OW-002 Contractor (FIDIC 4.15) correctly identified in both T2 and CB2 (KEY DIS ✓). FM label inconsistent: T2 correct, CB2 "neutral" instead of Force Majeure for Event_1. EOT calculations outside expected range in both tests. Concurrency analysis attempted (Adyard principle named in CB2) but EOT totals incorrect. Recurring invalid JSON output (// comments, arithmetic expressions, markdown in values) in both T2 and CB2 — significant quality issue.

Schedules: CPM Arithmetic Systematically Broken

Activity generation strong: all 18 activities with durations in range in both T3 and CB3. CPM calculation fails consistently. T3: circular predecessor dependencies (4 distinct chains) make schedule logically invalid → C=0. CB3: SS+10 lag misread as FS+10 (adding lag to predecessor's EF not ES), inflating duration to 296 vs 265 golden (31 wd error). Activity 8 fails KEY DIS in CB3 (critical=false, TF=-6.01). Negative total float values (activities 8,11,13,14,16) indicate systematic CPM logic error.

Training Priorities

1. JSON output validity — eliminate // comments and arithmetic expressions in values across all domains. 2. CPM lag type interpretation — SS+lag means add lag to predecessor's ES, not EF. 3. Label/reasoning consistency — output status must match reasoning conclusion. 4. EOT range calibration — total EOT should reflect net impact of overlapping events, not sum of individual durations.

Fine-Tune Cycle Results — Delays (T2 + CB2)

FT trained locally on RTX 4090 16GB: 74 examples (68 original + 6 full-schedule T2-envelope), 10 epochs, eval loss 2.51→1.30 monotonic, ~9 min. Q8_0 GGUF deployed. Eval params: temp=0.15, top_p=0.9, min_p=0.06 (non-thinking model). Scoring = LLM-as-judge holistic 0–100 (not Rubric v1.0 — separate cycle, run after the formal rubric grading).

Configuration	T2	CB2	Combined	Δ vs base
Base	48	46	47	—
Base + Prompt Engineering	62	50	56	+9
Fine-Tune	58	56	57	+10
Fine-Tune + Prompt Engineering	66	57	62	+15

Fine-Tune Learnings

FT lifted +10 — envelope examples fixed the format/magnitude collapse

Base vanilla computed a 257-day total delay (used Concrete's finish as project baseline) and emitted invalid JSON with // comments and bare arithmetic. FT got 30 days (golden 29) with clean tia_findings + project_summary envelope. The 6 added full-schedule training examples (Riverside / Oakfield / Hillcrest / Brunswick / Granville / Tamar — distinct projects from Munich) taught the T2 I/O contract: full dual-schedule input → array + summary output. CB2 vessel-breakdown LD party also corrected (base "client" → FT "contractor").

FT + PE +15 — best config; PE rescued FT's T2 EOT inversion

FT vanilla T2 marked contractor delays as EOT-entitled (recommended 45 days vs golden 0). FT + Enhanced PE corrected this: Concrete contractor → LD contractor ✓; recommended EOT 5 (golden 0); LD party = contractor ✓. PE prompt's explicit EOT/LD direction table did the work.

Hard discriminators still missed by all 4 configs

D&W Installation marked critical-path (golden non-critical → 0 EOT vs completion). CB2 TP fire: golden ~40-day partial/negotiated EOT — every config went 0 or 77. PE-introduced regressions: base+PE flipped Concrete Skeleton responsibility (vanilla had it right); FT+PE format regressed (markdown fences, // comments, bare 20+24+38 expressions). Small model can't reliably follow "raw JSON only" instructions.

57/100

Base Model Overall

Contracts 59 · Delays 57 · Schedules 56

Not Fine-Tuned

FT Status

Base-only evaluation

59/100

Best Domain — Contracts

CB1 51 · T1 67 · extended thinking

Base Complete

Status

All 6 tests scored

Domain Summary — Rubric v1.0

Domain	T Score (standard)	CB Score (curveball)	Combined	Fine-Tuned	Δ
Contracts	67/100	51/100	59/100	—	—
Delays	35/100	78/100	57/100	—	—
Schedules	67/100	45/100	56/100	—	—

Standard Tests — T1 to T3

Test	Project	Domain	Base (Rubric)	FT	Δ
T1	Hamburg Tower — Contract Clause Analysis (14 articles)	Contracts	67/100	—	—
T2	AB v AP Residential — Delay Attribution & EOT	Delays	35/100	—	—
T3	Cologne Residential — 18-Activity Schedule Generation	Schedules	67/100	—	—

Curveball Tests — CB1 to CB3

Novel jurisdiction / sector / contract form — tests generalisation beyond training data.

Test	Project	Domain	Base	FT	Δ
CB1	VAN-MIX-2025-011 — Finnish YSE 1998 mixed-use contracts	Contracts	51/100 A=32 B=9 C=10	—	—
CB2	Grim Tide Offshore Wind — FIDIC FM & concurrent delay	Delays	78/100	—	—
CB3	Northbrook Solar 50MW — EPC CPM schedule	Schedules	45/100	—	—

Base Model Learnings

Learning 1 — Extended Thinking Helps Delay Attribution, Not CPM Arithmetic

45K–56K reasoning tokens generated per domain, yet CPM arithmetic still fails (negative floats, 55 wd overstatement). Extended thinking improves systematic clause-by-clause analysis (CB2 = 78/100) but cannot prevent arithmetic accumulation errors across 18+ interdependent activities. CPM quality is bounded by arithmetic precision, not reasoning depth. Fine-tuning on schedule examples must reinforce the forward/backward pass algorithm explicitly — reasoning chain alone is insufficient.

Learning 2 — DB Clause ID Lookup Is a Distinct Learned Behaviour

The model correctly assesses clause status (Accepted / Modification / Requires Review / Rejected) in most cases but returns null for all DB clause ID matches. These are separate cognitive tasks: status assessment requires legal reasoning; ID lookup requires memorised format and database awareness. A fine-tuned model needs training examples where correct DB IDs appear in the output — the base model has zero exposure to the internal clause database and cannot infer IDs from first principles.

Learning 3 — FIDIC Delay Classification Is Strongest Base Capability

CB2 = 78/100 is the highest score across all 6 tests. The model correctly classified FM vs Contractor-risk delay events (vessel breakdown = FIDIC 4.15, key discriminator passed), identified the concurrent delay window, and computed EOT within the golden range. The only failure was FIDIC 19.4 cost rule (FM = time only, no cost). This suggests delays is the highest-leverage fine-tuning domain: strong base reasoning + one key rule to reinforce = potentially 90+ score.

Learning 4 — CB1 Drops 16 Points on Clean Input + Malformed Output Schema

T1 (in-distribution) = 67/100; CB1 (Finnish YSE 1998) re-run on the clean plain-text input = 51/100 — the earlier 63 came from the answer-leaked docx extraction (its "Notes / Flaws" column states every clause's defect). A 16-point T→CB gap, not the 4-point gap the leaked run implied. C-002 assessed as "Requires Review (Inferred)" not "Rejected" — key discriminator still fails. The clean CB1 output was also structurally malformed: no clause_id field, scrambled justification-to-clause alignment, C-003/C-021 missing, C-008 duplicated — scored positionally (A=32 B=9 C=10). Fine-tuning must reinforce both DB ID matching and strict output schema.

66/100

Base Model Overall

T1–T3 + CB1–CB3 complete · rubric v1.0

70/100

FT Contracts Combined

Rubric v1.0 · T1=77 · CB1=63

Contracts 76

Best Domain

Contracts combined (T1=85 + CB1=67) ÷ 2

Base Complete

Status

All 6 tests scored · FT done

Fine-Tuning Configuration

Parameters

Hybrid Mamba+Attention

r=8 / α=16

LoRA Config

dropout=0.05 · attn-only · Mamba frozen

Epochs

24.9 min · eval loss 7.92→7.02

Contracts

Domain Focus

140 train + 24 val · direct tensor merge

Domain Summary — Rubric v1.0

Domain	T Score (standard)	CB Score (curveball)	Combined	T Score FT	CB Score FT	Combined FT	Δ
Contracts	85/100	67/100	76/100	77/100	63/100	70/100	−6
Delays	35/100	67/100	51/100	—	—	—	—
Schedules	80/100	62/100	71/100	—	—	—	—

Standard Tests — T1 to T3

Test	Project	Domain	Base (Rubric)	FT	Δ
T1	Hamburg Tower — Contract Clause Analysis (14 articles)	Contracts	85/100	77/100	−8
T2	AB v AP Residential — Delay Attribution & EOT	Delays	35/100	—	—
T3	Cologne Residential — 18-Activity Schedule Generation	Schedules	80/100	—	—

Curveball Tests — CB1 to CB3

Novel jurisdiction / sector — tests generalisation beyond training data.

Test	Project	Domain	Base	FT	Δ
CB1	VAN-MIX-2025-011 — Finnish YSE 1998 mixed-use contracts	Contracts	67/100 A=32 B=17 C=18	63/100 A=33 B=14 C=16	−4
CB2	Grim Tide Offshore Wind — FIDIC FM & concurrent delay	Delays	67/100 A=26 B=28 C=13	—	—
CB3	Northbrook Solar 50MW — EPC CPM schedule	Schedules	62/100 A=10 B=29 C=15 D=8	—	—

Prompt-Engineering Results — Contracts (T1 + CB1)

Enhanced system prompt (explicit output schema, clause-count enforcement, anti-Accept-bias calibration, worked example) run on Base and Fine-Tune — same prompt for all configs. Scored against golden, Rubric v1.0.

Configuration	T1	CB1	Combined	Δ vs base
Base	85/100	67/100	76/100	—
Base + Prompt Engineering	80/100	70/100	75/100	−1
Fine-Tune	77/100	63/100	70/100	−6
Fine-Tune + Prompt Engineering	86/100	77/100	82/100	+6

Base Model Learnings

Prompt Engineering — Base: Net Flat (76→75)

The enhanced prompt fixed CB1's C-002 key discriminator (base+PE Rejects it correctly, CB1 67→70) but cost points on T1 (80 vs 85) — it over-matched Art 2 and still failed Art 10's Rule 3. PE cannot fix the 4B base's core reasoning gaps; it sharpens what is already there.

Contracts — T1 Strong (85), CB1 Weaker on Standardised Input (67)

T1 key discriminator passed: Art10 Rejected ✓. CB1 re-run on the standardised plain-text input scores 67/100 — the earlier 80 came from the non-standard docx-extraction run. C-002 KEY DIS FAIL — "Requires Review (Inferred)" not Rejected. Heavy Accept-bias on the Finnish contract: six golden-Modification clauses (C-008/011/016/017/018/019) marked Accepted. T1 (English NEC3) holds at 85; CB1 (Finnish YSE 1998) does not generalise as cleanly.

FIDIC Delay Classification Strong in CB2

CB2 recovery after weak T2: both FIDIC key discriminators correct — DEL-OW-001 classified FM (FIDIC 19.1) and DEL-OW-002 classified Contractor risk (FIDIC 4.15). EOT=77cd slightly above 35–75 range. Demonstrates FIDIC Yellow Book understanding absent in T2.

CPM Lag Type Misinterpretation: SS+10 Treated as FS+10

CB3 project duration 320 vs golden 265 wd (55 wd overrun). Root cause: Activity 8 predecessor SS+10 computed as FS+10 — early start inflated from day 175 to day 230. Despite this, Activity 8 still identified as critical (key discriminator passed). Fix requires training on mixed-lag-type CPM examples.

Float Reasoning Systematically Absent

No float consumption analysis across T2, CB2. EOT calculations omit float as a mechanism — treats all employer delays as time-entitled. T2 classified DEL-003 as concurrent; CB2 EOT overestimated. Both test the same gap: understanding that non-critical employer delays yield cost not time.

Fine-Tune Learnings

Prompt Engineering — Fine-Tune: +12 (70→82), Best Nemotron Config

The biggest PE swing in the programme. FT+PE (82) clears Nemotron's own base (76) — a genuine best-of-both-worlds result. The enhanced prompt's clause-count enforcement and anti-Accept-bias calibration fixed the FT's over-Requires-Review collapse: CB1 63→77, T1 77→86. Fine-Tune + PE is the recommended Nemotron contracts configuration.

"Improvement 1" Retracted — C-002 Key Discriminator Fails on Standardised Input

The earlier docx-extraction run had the FT Rejecting C-002 (the key discriminator) — reported as a headline improvement over base. The standardised plain-text re-run does not reproduce it: FT returns "Requires Review (Inferred)", the same failure as base. The "rejection hierarchy generalised" finding was an artifact of the non-standard input, not learned behaviour.

"Improvement 2" Retracted — C-001 Governing Law Now Reversed

The docx run had the FT Accepting C-001 (Finnish governing law) correctly while base flagged it. The standardised re-run reverses this: base correctly Accepts C-001, the FT downgrades it to "Acceptable subject to modification". On the clean input the FT is the weaker model on this clause, not the stronger one.

Regression 1 — Hallucinated 9 Articles in T1

FT output 23 items for a 14-article contract. Model pattern-matched to training example length rather than counting contract clauses — invented Articles 15–23 directly from DB entries that don't exist in Hamburg Tower. Base stopped cleanly at 14. Root cause: model learned "output array ≈ DB size," not "output array = contract clause count." Fix requires explicit count instruction in Pathway 4 prompt.

Regression 2 — Output Contradicts Reasoning + Over-Triggered Modification

T1 Art6: reasoning trace correctly identified "$3M < $5M DB standard — value difference triggers modification" but output label = Accepted. Alignment failure: model learned the reasoning format but didn't wire it to the status label consistently. Separately, Art1 triggered modification for "residential vs commercial" — cosmetic difference, not a value mismatch. Over-trained on difference detection without threshold calibration.

Diagnosis — FT Regresses on Contracts (−6); Both "Improvements" Were Input Artifacts

On standardised inputs the FT scores 70 vs base 76 (T1 77 vs 85, CB1 63 vs 66). The two positive CB1 findings (C-002 Rejected, C-001 Accepted) did not survive the switch from docx-extraction to plain-text input — they were never learned behaviour. What remains is real and on the un-changed T1 run: article hallucination and over-triggered modification. Cross-clause arithmetic — C-017 CAR insurance EUR 22M = 49% of EUR 45M contract price — remains unsolved by both base and FT. Pathway 4 (FT + Prompt Engineering) needs: (1) explicit clause-count instruction, (2) explicit Rule 3 / "Completely rejected" handling, (3) explicit value-comparison step referencing the contract total price.

74/100

Base Model Overall

rubric v1.0 · T+CB avg

92/100

FT Contracts Combined

T1=89 · CB1=94 · +10 vs base

84.5/100

FT Schedules v3 Combined

T3=83 · CB3=86 · +5.5 vs base

75/100

FT+PE Delays (LLM-judge)

+21 vs base

Base Model Evaluation — Rubric v1.0

Domain	T Score (standard)	CB Score (curveball)	Combined	T Score FT	CB Score FT	Combined FT	Δ
Contracts	93/100	70/100	82/100	89/100	94/100	92/100	+10
Delays	49/100	74/100	62/100	70/100	80/100	75/100	+21 (FT+PE)
Schedules	67/100	91/100	79/100	83/100	86/100	84.5/100	+5.5

Standard Tests — T1 to T3

Test	Project	Domain	Base (Rubric)	FT	Δ
T1	Hamburg Tower — Contract Clause Analysis (14 articles)	Contracts	93/100 A=49 B=18 C=26	89/100 A=45 B=18 C=26	−4
T2	AB v AP Residential — Delay Attribution & EOT	Delays	49/100 A=30 B=8 C=11	—	—
T3	Cologne Residential — 18-Activity Schedule Generation	Schedules	67/100 A=25 B=22 C=0 D=20	83/100 v3 FT vanilla	+16

Curveball Tests — CB1 to CB3

Novel jurisdiction / sector / contract form — tests framework generalisation beyond training data.

Test	Project	Domain	Base	FT	Δ
CB1	VAN-MIX-2025-011 — Finnish YSE 1998 mixed-use contracts	Contracts	70/100 A=35 B=17 C=18	94/100 A=47 B=19 C=28	+24
CB2	Grim Tide Offshore Wind — FIDIC FM & concurrent delay	Delays	74/100 A=32 B=29 C=13	—	—
CB3	Northbrook Solar 50MW — EPC CPM schedule	Schedules	91/100 A=25 B=30 C=20 D=16	86/100 v3 FT vanilla	−5

Prompt-Engineering Results — Contracts (T1 + CB1)

Configuration	T1	CB1	Combined	Δ vs base
Base	93/100	70/100	82/100	—
Base + Prompt Engineering	90/100	80/100	85/100	+3
Fine-Tune	89/100	94/100	92/100	+10
Fine-Tune + Prompt Engineering	88/100	89/100	89/100	+7

Schedules v3 Fine-Tune Cycle Results

v3 = pure planner-pattern training data (77 train + 11 val examples, 100% planner schema). T3 + CB3 evaluations with separate PE prompts per task. All 4 pathways tested.

Pathway	T3 (Cologne generation)	CB3 (Solar EPC CPM)	Combined	Δ vs Base
Base v1 vanilla	67/100	91/100	79/100	—
FT v3 vanilla	83/100	86/100	84.5/100	+5.5
Base + PE v3	77/100	72/100	74.5/100	−4.5
FT + PE v3	82/100	60/100	71/100	−8

Fine-Tune Cycle Results — Delays (T2 + CB2)

FT trained locally on RTX 4090 16GB via Unsloth: 74 examples, 10 epochs, ~40 min, seq 4096. Stream merge (tensor-by-tensor BF16 base + LoRA delta). 7.5B Q8 text GGUF + 1B mmproj F16 deployed. Eval params: temp=0.6, top_p=0.9, min_p=0.06 (thinking). Scoring = LLM-as-judge holistic 0–100.

Configuration	T2	CB2	Combined	Δ vs base
Base	50	58	54	—
Base + Prompt Engineering	70	62	66	+12
Fine-Tune	50	58	54	flat
Fine-Tune + Prompt Engineering	70	80	75	+21

Base Model Learnings

Prompt Engineering — Base: +3 (82→85), CB1 Curveball +10

The enhanced prompt lifted the Gemma base mainly on the CB1 curveball (70→80) — the schema and anti-Accept-bias calibration sharpened its Finnish-contract analysis. T1 dipped slightly (93→90). Net +3 combined: PE helps a capable base, most where the base was weakest.

Contracts: T1 Strong (93), CB1 Mid on Standardised Input (70) — 82 Combined

T1=93/100: all 14 articles correct statuses, no hallucinated articles, Art10=Rejected with correct DB3 citation, strong reasoning traces (10.9K chars) — the strongest base T1 in the programme. CB1=70/100 on the standardised plain-text input (the earlier 78 came from the non-standard docx-extraction run): KEY DIS FAIL — C-002 returned Requires Review (Inferred) not Rejected. C-008/C-015/C-019 over-accepted (golden=Mod); C-017/C-018 under-rated to Requires Review instead of matching DB8. T1 (English NEC3) generalises better than CB1 (Finnish YSE 1998).

Delays: T2 KEY DIS Fail — DEL-003 Placed on Critical Path

T2=49/100: D&W redesign correctly classified as employer-caused, but placed on the critical path and included in EOT recommendation (27 days). Golden = 0 EOT (DEL-003 is employer delay but NOT on CP — cost-only claim). Concrete Works classified as "concurrent" not "Contractor" (losing B sub-criterion). CB2=74/100: Event 2 Contractor correctly identified (KEY DIS PASS). FM cost rule error — claims weather FM gives cost entitlement but FIDIC 19.4 = time only. Concurrent period Adyard reasoning incorrect (denies weather EOT during overlap when contractor should retain it).

Schedules: CB3 Breakthrough Undermined by Backward Pass Error

CB3=91/100: Only model in the programme to correctly apply SS+10 lag (ES8 = ES7+10 = 175, not EF7+10 = 245). Achieves exact 265 wd project duration (golden) and perfect critical path topology — Activity 8 correctly on CP (B section 30/30). However, backward pass error propagates negative TF values (−5) to Activities 8, 14, 15 — invalid CPM. T3=67/100: all 18 activities present, all durations within benchmark ranges, but circular dependency Act6↔Act7 (6 predecessors include "7FF" AND 7 predecessors include "6FS") collapses C section to 0/25. Same circular dep pattern seen in Phi4 and Ministral.

Thinking Model Advantage — Clearest Signal Yet

Gemma 4 E4B thinking traces average 7K–12K chars, significantly more than other models. The reasoning depth directly contributes to: (1) correct Art9 "Requires Review" classification in T1 (rare label — non-thinking models default to Modification); (2) correct DB clause IDs in both T1 and CB1 without null fallback; (3) correct SS+10 forward pass in CB3 (the only model to get this right). The backward pass CPM error and T3 circular dependency suggest that reasoning depth helps classification/retrieval but does not fix systematic graph-construction errors shared across all models.

Fine-Tune Learnings

Contracts FT Cycle

Prompt Engineering — Fine-Tune: −3 (92→89), Already at the Ceiling

Gemma's fine-tune was the strongest config in the programme (92) and never regressed — there was nothing for PE to reclaim. The enhanced prompt slightly hurt it (CB1 94→89, T1 89→88). PE is a floor-raiser, not a ceiling-raiser: on Gemma the recommended contracts config is the fine-tune alone, no PE.

Merge Fixed — Clean Pure-Tensor-Math Merge (BF16 Base, No BNB)

Earlier save_pretrained_merged on the 4-bit base produced correct tensor shapes but corrupted values — the model emitted only <pad>/<unused> garbage at inference. Re-merged with explicit tensor math against the BF16 base (442 LoRA pairs: 294 LM + 112 vision + 36 audio; towers got zero deltas as expected for text-only training). The clean merge produces coherent output and, given <|think|> in the system prompt, opens the <|channel>thought channel itself — so the FT can think; the "no thinking" was the merge corruption, not training.

Think-Mode Re-Run — FT Matches Base on T1, Beats It on CB1 (+6 Combined)

Re-run with the thought channel active (<|think|> in system, /v1/completions, Q8_0 GGUF at 16k context): T1 89/100, CB1 94/100, Combined 92 vs base 86 — a +6 improvement, not the −13 regression the earlier run showed. The 73/100 was a no-thinking eval; base Gemma 4 E4B is a thinking model, so that comparison was never valid. With thinking on the FT produces full 11.8K / 10.3K-char reasoning traces and complete clause arrays (14/14, 21/21), finish_reason=stop.

CB1 +16 — Rejection Hierarchy Transfers to Finnish YSE 1998

CB1 78→94 (+16). The FT passes all four CB1 key discriminators: C-002 Rejected with DB3 Rule 3 invoked explicitly — base failed this, returning Requires Review — plus C-017 / C-015 / C-012 modifications with correct DB IDs. 19/21 statuses exact, 20/21 DB IDs exact. The rejection hierarchy learned on English NEC3 training data generalised cleanly to a novel jurisdiction and contract form — the same positive transfer the Nemotron FT showed on C-002.

T1 −4 — Minor Over-Flagging on the Training Domain

T1 93→89 (−4), entirely in the A section (45/50 vs base 49/50). Two status slips: Art 1 over-flagged Accepted→Modification (treated a "residential vs commercial" descriptive difference as a value mismatch) and Art 9 Termination returned Requires Review instead of Modification (missed the DB6 match, emitted a null DB ID). Reasoning quality held — C=26/30, every key article 6/6 except Art 9.

Diagnosis — The "Regression" Was a No-Think Confound, Not Small-Dataset Damage

With an apples-to-apples (thinking) eval, the Gemma 4 E4B contracts FT does not regress — it improves +6 combined, driven by a +16 swing on the curveball. Two operational lessons: (1) a thinking base model must be evaluated in thinking mode or the comparison is invalid; (2) the ~4k-token contract prompts overflow LM Studio's 4096 default context — the model must be loaded at ≥8k (16k used here) or the thought channel never closes and content comes back empty.

Schedules v3 FT Cycle

FT v3 wins — +18 T3 vs v2 (65→83), no hallucinations, real DB references

T3 FT v3 cites P1 Frankfurt as historical reference (correct DB project, vs v2's hallucinated "Berlin"). duration_wd 309 (closer to golden 370 than v2's 420). Format clean, no broken syntax. v3 planner-only training (T3+CB3 task-pattern dropped) eliminated overfit. CB3 FT v3 maintains CPM strength (PD=265 ✓, CP matches golden, A8 critical ✓).

PE hurts — Over-prescription breaks SS backward pass

FT+PE CB3 dropped to 60/100 (vs FT vanilla 86): explicit "TF ≥ 0" rule overwhelmed model — emitted critical=true with negative TF and dropped critical key on some activities. PE T3 prompts triggered multi-DB blending (good) but kept P1 Frankfurt's circular 6↔7 dep verbatim (bad). For Gemma 4 v3, the verdict: vanilla FT, leave PE off.

v3 dataset design — Pure planner pattern, task-pattern contamination eliminated

Three iterations: v1 (113 mixed Qwen-think format), v2 (113 with 8 planner envelopes added — minority signal drowned), v3 (77 planner-only — T3+CB3 examples dropped to avoid pattern contamination, eval measures true generalisation). Avg asst length 4075→1017 chars (−75%). Training: 100 steps / 10 epochs / 8.7 min / final loss 0.90 (vs v2 0.95).

Delays FT Cycle

FT + PE NAILED CB2 — partial TP, Adyard correct, total EOT within 2 of golden

CB2 80/100 — best CB2 of all 8 configs across both Gemma and Ministral 3B cycles. Weather neutral FM EOT-yes cost-no ✓. Vessel contractor 42d minus 26 concurrent = 16 days LD ✓ (Adyard correct). TP disputed FM → 38d partial/negotiated ✓ (golden ~40) — first config in the entire programme to call partial, not 0 or 77. Recommended EOT 73 vs golden 75 — within 2 days.

FT alone flat — Gemma's 7.5B base already strong; FT added little without PE

FT vanilla scored ~same as base vanilla (54 / 54). Gemma 4 E4B base is the strongest text model in this programme; FT on 74 examples didn't pull it past where it already sat. PE (the floor-raiser) did the lifting: +12 on base, +21 on FT.

Merge path: Unsloth save_pretrained_merged failed → stream merge

Unsloth's merge corrupted weights for the locally 4-bit-trained delays adapter (cloud-trained contracts adapter worked). convert_lora_to_gguf.py choked on Mistral3-style vision-tower tensors. Final path: tensor-by-tensor stream merge — merged = base_fp16 + (α/r) · (B @ A) applied to 442 LoRA targets one at a time. No Linear4bit dequant, no PEFT vision-tower injection.

50/100

Base Model Overall

rubric v1.0 · T+CB avg

Not Fine-Tuned

FT Status

Base-only evaluation

Schedules

Best Domain

65/100 combined (T=78 CB=52)

Base Complete

Status

T1–T3 + CB1–CB3 scored

Domain Summary — Rubric v1.0

Domain	T Score (standard)	CB Score (curveball)	Combined	Fine-Tuned	Δ
Contracts	33/100	20/100	27/100	—	—
Delays	50/100	63/100	57/100	—	—
Schedules	78/100	52/100	65/100	—	—

Standard Tests — T1 to T3

Test	Project	Domain	Base (Rubric)	FT	Δ
T1	Hamburg Tower — Contract Clause Analysis (14 articles)	Contracts	33/100 A=15 B=9 C=9	—	—
T2	AB v AP Residential — Delay Attribution & EOT	Delays	50/100 A=19 B=18 C=13	—	—
T3	Cologne Residential — 18-Activity Schedule Generation	Schedules	78/100 A=25 B=28 C=14 D=11	—	—

Curveball Tests — CB1 to CB3

Novel jurisdiction / sector / contract form — tests framework generalisation beyond training data.

Test	Project	Domain	Base	FT	Δ
CB1	VAN-MIX-2025-011 — Finnish YSE 1998 mixed-use contracts	Contracts	20/100 A=15 B=1 C=4	—	—
CB2	Grim Tide Offshore Wind — FIDIC FM & concurrent delay	Delays	63/100 A=26 B=27 C=10	—	—
CB3	Northbrook Solar 50MW — EPC CPM schedule	Schedules	52/100 A=17 B=14 C=13 D=8	—	—

Base Model Learnings

Contracts: Severe T1 Failure — Hallucinations + Status Defaulting

T1 output contained 18 articles from a 14-article contract — 4 hallucinated. Nearly all articles defaulted to "Accepted" regardless of content. Art10 (client-side jurisdiction clause, KEY DIS) marked Accepted and matched to DB15 (Permits) — completely wrong. Art8 correctly identified as matching DB3 (rejected dispute clause) but still marked Accepted. CB1 re-run on the clean plain-text input: 20/100 — the earlier 29 was an answer-leaked docx run (the docx "Notes / Flaws" column had handed it C-002's defect). KEY DIS now FAILS — C-002 Accepted, not Rejected. DB ID matching near-total collapse (B=1/20 — model returned "0" for all 21 IDs). Severe Accept-bias (16/21 Accepted).

Delays: First Model to Pass T2 KEY DIS

Unique result: DEL-003 (D&W redesign, employer-caused) correctly classified as non-critical with no EOT entitlement (eot_entitlement=false) — no other scored model achieved this on T2. However, DEL-001 (mobilization) and DEL-004 (ceramics) not identified, and project summary contradicts event-level analysis (overall_eot_entitlement=true despite DEL-003 correctly zero). CB2 very brief (1370 chars, 3s generation time) — Adyard principle named but applied incorrectly, Event 2 EOT=42 instead of 0.

Schedules: Strong Generation, Broken CPM

T3 is the strongest result (78/100): all 18 activities ✓, all durations within benchmark range ✓, no circular predecessor dependencies ✓ (only model besides Nemotron to avoid this). Benchmark justifications reference project parameters (sand soil, 4 floors, 1500m²). CB3 CPM calculation broken: Activity 8 SS+10 misread as FS (ES=235=EF7, not ES7+10=190) → not on critical path (KEY DIS FAIL). Activity 7 LS=175 < ES=180 (invalid CPM). Activity 14 TF=5 but marked critical=true (inconsistency). Duration=295 vs 265 golden (30 wd error, within ≤50 band).

Training Priorities

1. Contracts status defaulting — model must learn to apply the STATUS DECISION HIERARCHY rather than marking everything Accepted. 2. Article count discipline — never output more articles than the contract contains. 3. CPM lag type — SS+lag means add lag to predecessor ES, not EF. 4. Delays completeness — all events in the as-built schedule must be assessed, not just the most obvious ones. 5. Consistent project summary — summary conclusions must align with event-level analysis.

77/100

Base Model Overall

T1–T3 + CB1–CB3 complete · rubric v1.0

78/100

FT Contracts Combined

T1=76 · CB1=79 · −11 vs base

87.5/100

Base+PE Schedules v3

T3=77 · CB3=98 (perfect) · +8.5

Schedules + Contracts FT

Status

Both domains complete

Fine-Tuning Configuration

Parameters

Transformer (Qwen3 architecture)

r=8 / α=16

LoRA Config

dropout=0.0 · 7 modules · attn+MLP

Epochs

BF16 · A100 40GB · lr=1e-5 cosine

Contracts

Domain Focus

contracts_train_qwen35.jsonl · 4096 seq · cloud adapter

Domain Summary — Rubric v1.0

Domain	T Score (standard)	CB Score (curveball)	Combined	T Score FT	CB Score FT	Combined FT	Δ
Contracts	88/100	89/100	89/100	76/100	79/100	78/100	−11
Delays	56/100	53/100	55/100	—	—	—	—
Schedules	75/100	96/100	86/100	86/100	67/100	76.5/100	−9.5

Standard Tests — T1 to T3

Test	Project	Domain	Base (Rubric)	FT	Δ
T1	Hamburg Tower — Contract Clause Analysis (14 articles)	Contracts	88/100 A=44 B=18 C=26	76/100 A=32 B=18 C=26	−12
T2	AB v AP Residential — Delay Attribution & EOT	Delays	56/100	—	—
T3	Cologne Residential — 18-Activity Schedule Generation	Schedules	75/100	86/100 v3 FT vanilla	+11

Curveball Tests — CB1 to CB3

Novel jurisdiction / sector — tests generalisation beyond training data.

Test	Project	Domain	Base	FT	Δ
CB1	VAN-MIX-2025-011 — Finnish YSE 1998 mixed-use contracts	Contracts	89/100 A=43 B=20 C=26	79/100 A=39 B=18 C=22	−10
CB2	Grim Tide Offshore Wind — FIDIC FM & concurrent delay	Delays	53/100 A=23 B=18 C=12	—	—
CB3	Northbrook Solar 50MW — EPC CPM schedule	Schedules	96/100 A=25 B=30 C=25 D=16	67/100 v3 FT vanilla	−29

Prompt-Engineering Results — Contracts (T1 + CB1)

Configuration	T1	CB1	Combined	Δ vs base
Base	88/100	89/100	89/100	—
Base + Prompt Engineering	97/100	90/100	94/100	+5
Fine-Tune	76/100	79/100	78/100	−11
Fine-Tune + Prompt Engineering	84/100	90/100	87/100	−2

Schedules v3 Fine-Tune Cycle Results

v3 = pure planner-pattern training data (77 train + 11 val, 100% planner schema). Same dataset as Gemma 4 v3. Qwen 3.5 9B trained at seq=4096, 100 steps / 10 epochs / 28.2 min / final loss 1.39. All 4 pathways tested with separate PE prompts per task.

Pathway	T3 (Cologne generation)	CB3 (Solar EPC CPM)	Combined	Δ vs Base
Base v1 vanilla	67/100	91/100	79/100	—
FT v3 vanilla	86/100	67/100	76.5/100	−2.5
Base + PE v3	77/100	98/100	87.5/100	+8.5 ✓
FT + PE v3	81/100	65/100	73/100	−6

Base Model Learnings

Prompt Engineering — Base: +5 (89→94), Best Qwen Config

The enhanced prompt took the Qwen base to the top score in the contracts programme: T1 88→97 (14/14 DB IDs, 30/30 reasoning), CB1 89→90. Base + PE (94) beats every other Qwen configuration including the fine-tune. For Qwen, the play is to skip fine-tuning entirely and prompt-engineer the base.

Contracts Strong — T1=88 with Art10 Key Discriminator

Art10 correctly Rejected with explicit Rule 3 invocation and DB3 cited. Longest thinking output in the T1 programme (41K chars) — thorough per-clause reasoning. Art3 DB ID = DB0 correct (vs Gemma's DB1 error). Three status errors (Arts 4/9/13) stem from over-strict interpretation: requires explicit "Modify to X" instruction, does not treat "Ensure..." notes as modification triggers.

CB1 — C-002 Key Discriminator Fails Despite Knowing DB3 Is Rejected

Model correctly identifies DB3 is "completely rejected" in reasoning but argues C-002 (negotiation-only, no arbitration) does not match the DB3 arbitration-specific entry. Misses Rule 3: applies to the Dispute Resolution category regardless of specific mechanism. Labels C-002 "Requires Review (Inferred)" not Rejected. 19K-char reasoning trace — thorough but wrong on the pivotal clause.

CB3 — Programme-High CPM Score (96/100)

Exact 265 wd duration, perfect critical path [1→2→3→5→6→7→8→14→15→17→18]. SS+10 correctly applied (ES8=175), SS+15 correctly applied (ES13=110). Backward pass also correct — no negative TF values, fixing the error Gemma made. Multi-predecessor merge logic correct throughout. The standout CPM result of the base programme.

Delays Weakest Domain (T2=56, CB2=53) + T3 Circular Dependencies

T2: DEL-003 (employer-caused redesign) placed on critical path, EOT=47 vs golden 0 — key discriminator fail. CB2: weather FM classified as foreseeable contractor risk, EOT=0 outside the 35–75 cd range. T3: circular predecessor chains in MEP second-fix activities (Act10↔11, Act12↔13, Act17↔18) → C section 0/25 — same failure pattern as Phi4, Ministral, Gemma.

Fine-Tune Learnings

Contracts FT Cycle

Prompt Engineering — Fine-Tune: +9 (78→87), Recovers Most of the Regression

PE reclaimed most of the fine-tune's −11 contracts regression: CB1 79→90, T1 76→84. The enhanced prompt's exact-status-string reminder also killed the temp-0.6 invalid-status defect. But FT+PE (87) still trails base (89) and base+PE (94) — PE rescues the regressed fine-tune without making it the better choice.

Regression 1 — T1 Emits an Invalid Status String at temp=0.6

The FT systematically outputs "Accepted subject to modification" — not the valid "Acceptable subject to modification" — on 7 of 14 T1 articles. Intent is unambiguous (the modification status) but it fails the exact-string requirement the rubric explicitly tests, scored at 50% credit per affected article. A-section 44→32. Art10 key discriminator still PASSES (Rejected + DB3 + Rule 3).

Thinking Preserved · CB1 No Longer Truncates

FT produces full reasoning traces (8.7K chars T1, 15.2K chars CB1) with systematic per-clause rule application. Re-run at the standardised params, CB1 emitted all 21 clauses — the 17/21 truncation seen in an earlier run did not recur. The cloud pipeline (train → tensor-math merge → GGUF) is clean for Qwen's transformer architecture.

Regression 2 — CB1 −10 · C-002 Key Discriminator Still Fails

CB1 17/21 statuses correct. C-002 KEY DIS FAIL — Requires Review not Rejected; the FT, like base, has no Rejected training examples so cannot apply Rule 3. Status drift on 3 clauses: C-001 over-modified (golden Accepted), C-018 under-rated (RR vs Modification), C-019 over-accepted. DB IDs strong (19/21). C-017 CAR-insurance 49%-undervalue still not computed.

Diagnosis — SFT Degraded Format Discipline + Discrimination; Pathway 4 Needed

The ~140-example SFT regressed the model on contracts vs base (78 vs 86). The headline T1 loss is a format defect — an invalid status string at temp=0.6 — not a reasoning failure; the reasoning traces remain sound. CB1 loss is smaller (−4) and shared with base (C-002 Rule 3). Pathway 4 (FT + Prompt Engineering) needs: (1) exact-status-string enforcement in the prompt, (2) explicit Rule 3 / "Completely rejected" handling, (3) explicit value-comparison step for cross-clause arithmetic.

Schedules v3 FT Cycle

Base + PE wins — 98/100 on CB3 (best result of any pathway across all models)

Qwen 3.5 9B base + PE CB3: PD 265 ✓ matches golden exactly, CP 1→2→3→5→6→7→8→14→15→17→18 ✓ matches golden exactly, all TF ≥ 0, critical flags consistent w/ TF=0, A8 critical ✓ (KEY DIS PASS), SS+10 + SS+15 correctly applied. The explicit CPM rules in the PE prompt let the strong base reasoner execute deterministic math without confusion. For Qwen on schedules, skip fine-tuning entirely and prompt-engineer the base.

FT vanilla — Best T3 precision of all models (duration_wd 374, +1.1% vs golden 370)

T3 FT v3: CP includes 9, 10, 11 (better coverage than Gemma); SS+lag predecessor chains used aggressively. duration_justifications cite historical ranges from DB but not specific project names (Gemma cited P1 Frankfurt). Skeleton 250 wd (Gemma 200) — both over actual ~144 but explained.

FT CB3 regressed — A2 SS dependency misinterpreted as FS (+15d cascade)

Qwen FT CB3 PD=284 (golden 265, +7%): A2 ES=15 (should be 0 because 1SS), cascade breaks rest. CP collapsed to [15, 18]. Same bug appears in FT+PE CB3 (PD 315). Bug not present in base vanilla or base+PE → FT specifically corrupted SS handling. Future v4 should add SS micro-cases to training.

Cross-model verdict — Qwen Base+PE 87.5 > Gemma FT vanilla 84.5

Top pathway per model: Gemma 4 = FT vanilla 84.5 / Qwen 3.5 9B = Base+PE 87.5. PE helps Qwen base dramatically (vs Gemma where PE hurt). Different model architectures respond differently to explicit CPM rule injection — Qwen base reasons crisply with structure, Gemma base benefits more from FT pattern absorption.

65/100

Base Overall

Avg of 3 domains (Rubric v1.0)

79/100

FT + PE Delays Combined

T2=78 · CB2=80 · +19 vs base

Contracts

Best Domain

66/100 combined (T=78 CB=54) — ties Delays 66

Base + Delays FT Done

Status

Delays FT cycle complete

Base Model Evaluation — Rubric v1.0

Domain	T Score	CB Score	Combined	FT
Contracts	78/100	54/100	66/100	—
Delays	61/100	70/100	66/100	FT+PE 79
Schedules	62/100	62/100	62/100	—
Overall	67/100	62/100	65/100	—

Standard Tests — T1 to T3

Test	Project	Domain	A	B	C	D	Total
T1	Hamburg Tower — Contract Clause Analysis (14 articles)	Contracts	43/50	15/20	20/30	—	78/100
T2	AB v AP Residential — Delay Attribution & EOT	Delays	35/35	13/40	13/25	—	61/100
T3	Cologne Residential — 18-Activity Schedule Generation	Schedules	25/25	30/30	0/25	7/20	62/100

Curveball Tests — CB1 to CB3

Novel jurisdiction / sector / contract form — tests framework generalisation beyond training data.

Test	Project	Domain	A	B	C	D	Total
CB1	VAN-MIX-2025-011 — Finnish YSE 1998 mixed-use (21 clauses)	Contracts	22/50	16/20	16/30	—	54/100
CB2	Grim Tide Offshore Wind — FIDIC FM & concurrent delay	Delays	26/35	28/40	16/25	—	70/100
CB3	Northbrook Solar 50MW — EPC CPM schedule	Schedules	17/25	12/30	25/25	8/20	62/100

Base Model Learnings

CB1 (clean plain-text input): C-002 KEY DIS FAIL, 54/100 — T1 key discriminator passes

T1: Art10 correctly Rejected with DB3 cited ✓. CB1 re-run on the clean plain-text input: 54/100 (A=22 B=16 C=16). The earlier 77/100 was an answer-leaked docx run — the docx carries a "Notes / Flaws" column stating every clause's defect. C-002 returned "Requires Review (Inferred)" not Rejected — KEY DIS FAIL. With the answer key removed the model collapses to heavy Accept-bias: ~15/21 clauses marked Accepted, including most golden-Modification and golden-Requires-Review clauses. DB ID accuracy holds up better than status (B=16/20). C-015 DLP and C-017 CAR-insurance undervalue both missed.

T2 — Perfect event identification (35/35 A-section)

All 4 delay events correctly identified with right activity, duration within ±10 wd, and correct responsibility assignment — first model in the programme to score 35/35 in T2 section A. DEL-001 Mobilization (employer, 5 wd), DEL-002 Concrete (contractor, 35 wd), DEL-003 Doors/Windows (employer, 52 wd), DEL-004 Ceramic (contractor, 18 wd) all captured. CB2 also strong (70/100): DEL-OW-002 Contractor correctly identified via FIDIC 4.15, EOT=35 cd within 35–75 range, Adyard principle cited with weather EOT maintained during concurrent period.

CB3 — No negative TF (C=25/25), cleaner than Gemma

Backward pass complete for all 18 activities with all TF ≥ 0. Gemma's CB3 had negative TF (−5) for acts 8/14/15 from backward pass error. Ministral 14B avoids this. Input durations correctly carried through, ES+D=EF consistent throughout. The CPM structural validity section is the strongest individual sub-score in Ministral 14B's CB3 result.

T2 / CB2 — KEY DIS fail + FM cost rule error

T2: DEL-003 (Doors/Windows employer redesign) critical=true → KEY DIS fail; EOT=45 not 0. CB2: DEL-OW-001 weather classified as "neutral" (partially correct) but cost_entitlement=true — wrong per FIDIC 19.4 (FM = time only, no cost). Adyard application partially correct: weather EOT maintained during concurrent period but concurrent period reasoning confused. Consistent with delays weakness across the programme.

T3 + CB3 — Circular deps and SS+10 lag error

T3: 3-way circular dep 15↔16↔17 (Painting needs Elec Second Fix SS, Elec Second Fix needs Plumbing Second Fix FF, Plumbing Second Fix needs Painting FF) → C=0/25. Same second-fix chain failure as Phi4/Gemma/Qwen. All 18 durations in range and justified — generation quality fine, predecessor graph broken. CB3: SS+10 misread: ES8=165 (used Act7 start) instead of 175 (Act7 start + 10). Duration=290 vs 265 golden (25 wd over). Activity 8 TF=50, critical=false → KEY DIS fail. All outputs wrapped in markdown code fence (D=7/20 in T3).

Fine-Tune Cycle Results — Delays (T2 + CB2)

FT trained locally on RTX 4090 16GB via Unsloth (raw transformers+PEFT+BNB crashed "CUDA driver error" mid-forward — Unsloth's native Mistral3 patching worked). 4-bit NF4 QLoRA, seq 4096, 10 epochs, 28.3 min, final train loss 1.2138. Stream merge (tensor-by-tensor, name-swap fix for adapter model.language_model ↔ base language_model.model). Q6_K GGUF (11.1 GB, matches base quant) via F16→llama-quantize two-step. Eval params: temp=0.6, top_p=0.9, min_p=0.06 (thinking). Scoring = LLM-as-judge holistic 0–100.

Configuration	T2	CB2	Combined	Δ vs base
Base	58	62	60	—
Base + Prompt Engineering	72	75	74	+14
Fine-Tune	70	70	70	+10
Fine-Tune + Prompt Engineering	78	80	79	+19

Fine-Tune Learnings

FT + PE NAILED CB2 Adyard — Vessel 42d − 26 concurrent = 16d LD contractor

First config across the entire 14B cycle to correctly compute the Adyard offset. Weather 35 FM EOT-yes ✓. Concurrent period → EOT yes, no cost ✓ (Adyard). Vessel breakdown net contractor LD = 42 − 26 = 16 days ✓. TP fire: granted full 77 days EOT (golden ~40 partial) — only the Gemma FT+PE called the partial correctly.

FT lifts +10, PE lifts +14 on base, FT+PE = best at +19

Same pattern as Gemma: PE = floor-raiser, FT alone respectable, FT+PE = the win. Base+PE (74) already beats FT-alone (70), reinforcing that this task responds more to better prompting than to small-dataset SFT for a strong 14B base. FT+PE (79) extends the win further — the two stack additively here.

Shared misses with Gemma: D&W critical-path, TP partial, Concrete responsibility flip on PE

D&W Installation marked critical (golden non-critical → 0 EOT). TP fire: 0 or 77, never the negotiated ~40 (only Gemma FT+PE got partial). PE-introduced regression: base+PE and FT+PE both flipped Concrete responsibility employer↔contractor (vanilla configs had it right). These exact same discriminators failed across both Gemma and Ministral 14B — strong signal they need explicit handling in training data, not in prompts.

Local 14B is possible — Unsloth path required, eval disabled (fp32 OOM), stream merge mandatory

The 14B Reasoning model fits 16 GB VRAM at 4-bit NF4 + seq 4096 + grad checkpointing (~14 GB used steady). Raw transformers + PEFT + BNB crashes "CUDA driver error: device not ready" in both BNB dequant and SDPA softmax — Unsloth's native Ministral3 patching bypasses this. Per-epoch eval disabled (accelerate fp32 conversion OOMs on logits [B, 4096, 131072]). Stream merge is the reliable path on locally 4-bit-trained adapters.

70/100

Base Model Overall

Avg of 3 domains

92.5/100

FT+PE Schedules Combined

T3=87 · CB3=98 · v3 overall winner

A100 Cloud

FT Status

trained on Lambda 1×A100 40GB · 35.6 min

MXFP4 GGUF

Local Deploy

13 GB · LM Studio

Schedules v3 Fine-Tune Cycle — A100 Cloud Training

v3 = pure planner-pattern dataset (77 train + 11 val, harmony format). Local laptop training blocked twice (Unsloth fused-CE SM89 incompat + transformers MXFP4 training guard) — fell back to Lambda Labs 1× A100 40GB ($1-2 total). Unsloth path worked on A100 (compute 8.0, no SM89 issue). Final loss 9.43. MXFP4 GGUF (13 GB) pushed to HF private repo (AshrafMMahdy/gpt-oss-20b-schedules-ft-v3) then downloaded locally for LM Studio. Inference: temp=0.6, top_p=0.9, min_p=0.06, reasoning_effort=low, max_tokens=30k.

Pathway	T3 (Cologne generation)	CB3 (Solar EPC CPM)	Combined	Δ vs Base+PE
FT v3 vanilla	80/100	98/100	89/100	+7
FT + PE v3	87/100	98/100	92.5/100	+10.5
Base + PE v3	81/100	83/100	82/100	—

Domain Summary — Rubric v1.0

Domain	T Score (standard)	CB Score (curveball)	Combined	T Score FT	CB Score FT	Combined FT	Δ
Contracts	87/100	69/100	78/100	—	—	—	—
Delays	51/100	54/100	53/100	—	—	—	—
Schedules	81/100	74/100	78/100	80/100	98/100	89/100	+11

Standard Tests — T1 to T3

Test	Project	Domain	A	B	C	D	Total
T1	Hamburg Tower — Contract Clause Analysis (14 articles)	Contracts	42/50	19/20	26/30	—	87/100
T2	AB v AP Residential — Delay Attribution & EOT	Delays	30/35	10/40	11/25	—	51/100
T3	Cologne Residential — 18-Activity Schedule Generation	Schedules	25/25	24/30	12/25	20/20	81/100

Curveball Tests — CB1 to CB3

Novel jurisdiction / sector / contract form — tests framework generalisation beyond training data.

Test	Project	Domain	A	B	C	D	Total
CB1	VAN-MIX-2025-011 — Finnish YSE 1998 mixed-use contracts	Contracts	34/50	17/20	18/30	—	69/100
CB2	Grim Tide Offshore Wind — FIDIC FM & concurrent delay	Delays	25/35	24/40	5/25	—	54/100
CB3	Northbrook Solar 50MW — EPC CPM schedule	Schedules	25/25	12/30	25/25	12/20	74/100

Base Model Learnings

T1 key discriminator passes — CB1 C-002 fails on clean input

T1: Art10 correctly Rejected with explicit Rule 3 invocation; reasoning quality strong (Art10/7/6/3 all 6/6 in C section). CB1 re-run on the clean plain-text input: C-002 returned "Requires Review (Inferred)" not Rejected — KEY DIS FAIL. On clean input GPT OSS no longer passes the CB1 discriminator; CB1 70→69, Contracts combined 79→78.

T3 no circular predecessor dependencies — all 18 activities and durations correct

All 18 activities present with correct names. All 18 durations within benchmark ranges. No circular deps (only Granite, Nemotron, and GPT OSS 20B achieve this). Valid JSON output, no markdown fence penalty. All-FS predecessor chain is weak but avoids the 15↔16↔17 second-fix trap seen in Phi4, Gemma, Qwen, Ministral.

CB3 SS+10 forward pass correct — ES8=175, duration=265 wd exact

Model correctly applied SS+10 lag type: Act7 starts at 165, ES8=165+10=175, EF8=220. Project duration exactly matches golden (265 wd). Same forward pass accuracy as Gemma and Qwen. CB3 C section perfect (25/25): all positive TF, correct ES/EF consistency, backward pass populated.

T2 KEY DIS FAIL — DEL-003 placed on critical path, EOT=1.5 months vs golden 0

DEL-003 (Doors and Windows, employer-caused) marked impact_on_completion=true. Golden: DEL-003 is non-critical → cost-only claim, zero EOT. Model recommends 1.5-month EOT to contractor. Only 88 reasoning tokens in T2 — insufficient analysis depth for CP reasoning.

CB3 backward pass error — Activity 8 TF=40 not 0 → KEY DIS FAIL

Forward pass correct (ES8=175, EF8=220) but backward pass propagates LF8=260 instead of 220. Model used project-end-routed backward path ignoring Act14 actual constraint (LF14=240). Result: TF8=40, critical=false. Critical path reduced to [15,17,18] only (3 of 11 golden activities). Positive TF throughout (no negatives) — cleaner than Gemma's backward pass but critical path identification wrong.

reasoning_effort=low severely limits delay and CPM analysis

CB2: 29 reasoning tokens, 758 chars output — no concurrent delay analysis (Nov 1-26 Adyard window not identified), FIDIC 19.4 FM cost rule wrong (cost=yes, should be time-only). T2: 88 reasoning tokens — DEL-003 critical path analysis skipped. reasoning_effort=low adequate for contract label lookup but insufficient for EOT and CP reasoning tasks requiring iterative float computation.

Fine-Tune Learnings

GPT-OSS 20B FT+PE = OVERALL v3 WINNER (92.5 combined, beats Qwen Base+PE 87.5 and Gemma FT vanilla 84.5)

FT and FT+PE both produce perfect CB3 (98/100): PD=265 exact, CP=[1,2,3,5,6,7,8,14,15,17,18] exact golden match, A8 critical ✓, SS+10/SS+15 correctly applied, A1 critical=true (Gemma got TF=-15 here). T3 FT+PE 87/100 uses multi-DB blending (P1/P2/P4/P5/P8) with scale rationale per activity. The 20B MoE base + planner-pattern FT + PE prompt = best combination of all 3 models tested.

Cloud training path — Lambda Labs A100 worked perfectly on first try after laptop blocked twice

Local RTX 4090 Laptop (compute 8.9, SM89) hit two independent blockers: (1) Unsloth fused CE Triton kernel CUDA driver error on Ada Lovelace, (2) transformers MXFP4 quantizer's hard training guard. A100 (compute 8.0) bypassed both. Pipeline: SCP scripts+data → instance setup → download weights → train 100 steps/10 epochs (35.6 min) → MXFP4 native merge via Unsloth save_pretrained_merged save_method="mxfp4" → convert_hf_to_gguf.py → HF private repo upload → local download → LM Studio deploy. Total cycle ~60 min, ~$2 cost.

FT T3 vanilla weakness — overly serial predecessor chain (681 wd vs golden 370)

FT vanilla used mostly FS predecessors with no SS+lag parallelism, inflating duration. PE prompt fixed this: FT+PE T3 = 474 wd (-30% vs vanilla, +28% vs golden), with multi-DB blending (P1/P2/P4/P5/P8) and FF chains. Base+PE T3 = 302 wd (-18% under golden) — closest to target, but all-18-activities-critical CP overcall.

Artefacts (all preserved)

MXFP4 GGUF (13 GB): HF AshrafMMahdy/gpt-oss-20b-schedules-ft-v3 (private). LM Studio deploy: gpt-oss-20b-schedules-ft. Scripts: train_gpt_oss_schedules.py (Unsloth), convert_gpt_oss_schedules_data.py (harmony converter), eval_gpt_oss_schedules_pathways.py. See Methodology → Links section for reproducibility bundle.

Dataset

Privacy & Confidentiality: Synthetic and augmented, built from real-world reference material. Cannot be shared publicly to avoid indirect exposure of proprietary project data.

Domain	Description	Train	Val
Contracts	Hamburg Tower ground truth + CUAD open-source + cross-contract synthetic + status classification drills	140	24
Delays	TIA, float absorption, Force Majeure classification, concurrent delay, FIDIC/JCT/NEC analysis	74	11
Schedules (v3)	Pure planner pattern — activity selection, sequence logic (FS/SS/FF + lag), duration justification cited to historical reference projects. Deterministic CPM scaffold computes ES/EF/LS/LF/TF.	77	11
Total		291	46

Evaluation Details

Standard Tests — T1 to T3 (one per domain)

All models queried via LM Studio local API (http://localhost:1234/v1/chat/completions). Full scenario inputs (complete contracts, delay schedules, CPM networks) submitted as-is. Responses evaluated by LLM judge against pre-computed golden answers with full reasoning traces. No keyword matching — judgement is against structured golden answer JSON.

Test	Project	Domain	Description
T1	Hamburg Tower — New Contract	Contracts	14-article NEC3-style contract, 4-status clause classification, gap identification, modification recommendations
T2	AB v AP Residential — Delay Schedule	Delays	As-planned vs as-built analysis, EOT entitlement per event, concurrent delay assessment, contractor/employer responsibility
T3	Cologne Residential — 18-Activity Schedule Generation	Schedules	Create baseline schedule from historical DB: name all 18 activities, assign durations (justified from benchmark data), set predecessor relationships using P6-style FS/SS/FF notation. Output raw JSON.

Curveball Tests — CB1 to CB3 (generalisation tier)

CB1 — Contracts: VAN-MIX-2025-011 Finnish mixed-use — 21 clauses under YSE 1998 general conditions, 3 critical omissions, unfamiliar jurisdiction and contract law
CB2 — Delays: Grim Tide Offshore Wind Farm — FIDIC Yellow Book Force Majeure analysis, concurrent delay under English law (Adyard principle), EOT entitlement across 3 events
CB3 — Schedules: Northbrook Solar Energy Park 50MW — non-building EPC schedule, 18 activities, SS+lag critical path (Activity 8 critical despite SS relationship), external DNO constraint risk

Scoring Methodology

LLM-as-judge evaluation against golden answer JSON files. Criteria weighted by domain: Contracts (status classification, modification recommendations, gap identification); Delays (event classification, EOT quantum, concurrency analysis); Schedules (ES/EF/LS/LF/TF correctness, critical path accuracy). Partial credit for near-correct answers. Each scenario scored 0–100%.

Tech Stack

Component	Tool	Notes
Training	Unsloth + PyTorch 2.6.0 + Transformers 5.5.0 + PEFT	Local GPU, no cloud · 2–5× faster than vanilla PEFT
GGUF export	Unsloth built-in GGUF export	Q8_0 quantisation · KV tensor duplication for shared-layer models
Inference	LM Studio (local REST)	100% on-device
Hardware	NVIDIA GPU 16 GB VRAM, CUDA 12.6	Consumer grade
Precision	FP16 (Qwen) / BF16 (Nemotron)	Native model dtypes
OS	Windows	Limits Triton/mamba-ssm

Detailed Evaluation Criteria & Scoring

Below is a test-by-test breakdown of every evaluation task: what it tests, how the score is computed, why it matters, and known limitations that affect result interpretation.

Scoring Key: ✓ Correct = 1.0 pt ~ Partial = 0.5 pt ✗ Wrong / No answer = 0 pt Calibration warnings noted per test

Standard Tests — Core Domain Competency

T1–T3 are one comprehensive test per domain on realistic construction scenarios. Scores compared base vs fine-tuned across all pathway configurations.

T1 — Contract Clause Analysis: Hamburg Tower New Contract (Contracts Domain)

Attribute	Detail
What	Model receives a full 14-article NEC3-style construction contract. For each article, model must: (1) classify status (Accept / Accept with Modification / Flag for Careful Review / Reject / Gap Identified), (2) identify the at-risk party, (3) state the risk, and (4) recommend modifications where needed.
How scored	LLM judge vs golden answer JSON. Per article: status classification (1.0/0.5/0), party identification, risk description accuracy, modification recommendation quality. Weighted average across 14 articles.
Why	Contract review is the primary commercial use-case. Directly tests the model's ability to identify commercially unacceptable clauses before signing.
Calibration	Note: Art 9 (termination for convenience) appeared in training data — treat Art 9 scores with caution. All other articles are generalization.

T2 — Delay Attribution & EOT: AB v AP Residential (Delays Domain)

Attribute	Detail
What	Model receives a 20-activity residential project schedule with baseline and actual dates, plus a narrative of 3 delay events. Model must: (1) attribute each event (Employer/Contractor/Neutral), (2) calculate critical path impact, (3) assess concurrent delay, and (4) state EOT entitlement.
How scored	LLM judge vs golden answer. Event attribution (1.0/0.5/0 each), EOT quantum (exact/±1 day/wrong), concurrency analysis (binary), critical path impact reasoning (qualitative).
Why	Delay attribution and EOT calculation are required for claims. Key skill: recognising that Employer delays to non-critical activities don't entitle EOT.
Calibration	The exact Munich Tower schedule data appeared in delays training data — this test is contaminated for any model fine-tuned on delays domain. Base model results are clean.

T3 — Schedule Generation: Cologne Residential (Schedules Domain)

Attribute	Detail
What	Model receives 3 historical residential projects + benchmark summary and must create a complete baseline schedule for a new project: Cologne, Germany, 2022, Sand soil, EUR 35M, 1500 m², 4 floors. Must name all 18 standard activities, assign durations (justified from benchmarks), and set predecessor relationships using P6-style notation.
How scored	A (25pts) Activity completeness — all 18 standard names present. B (30pts) Duration validity — each within historical benchmark range. C (25pts) Predecessor logic — valid construction sequencing, no circular dependencies, mix of FS/SS/FF. D (20pts) Output format — valid parseable JSON with all required fields. Golden: 370 wd, CP 1→2→3→5→7→9→11→13→12→14.
Why	Schedule creation from benchmarks is the core skill the model is trained on. CPM arithmetic is tested separately in CB3. T3 isolates planning judgment: activity selection, duration calibration, sequencing logic.
Calibration	No contamination — Cologne project is synthetic, not present in training data. Historical DB (Projects 2/3/4) is embedded in the test message, same format as training.

Curveball Tests — Generalization to Unseen Projects

CB tests use completely different projects, jurisdictions, and contract forms from the training data. A model that only memorized training examples will fail here.

CB1 — Contract Generalization: VAN-MIX Finnish Mixed-Use (VAN-MIX-2025-011)

Attribute	Detail
What	21 contract clauses under Finnish YSE 1998 general conditions (EUR 45M mixed-use development, Vantaa). 3 critical omissions (performance bond, Force Majeure clause, IP ownership). Jurisdiction, terminology, and contract law entirely different from training data (Hamburg Tower was English-law NEC3).
How scored	LLM judge vs golden answer. Per clause: status classification (Accept/Modify/Reject), party at risk, risk description. Gap identification scored separately. Partial credit for adjacent status.
Calibration	Answer-leakage: the source VAN-MIX `.docx` carries a "Notes / Flaws" column stating each clause's defect — eval scripts that extracted the docx fed the model the answer key. All CB1 scores were re-run on the stripped plain-text input (`cb1_test_file.txt`); the docx-extraction runs are invalid and have been replaced.

CB2 — Delay Generalization: Grim Tide Offshore Wind Farm (FIDIC Yellow Book)

Attribute	Detail
What	3 delay events on a GBP 180M North Sea offshore wind farm: (1) exceptional marine weather (Force Majeure), (2) jack-up vessel dry-dock breakdown (Contractor risk), (3) transition piece supply chain delay from factory fire (arguable FM). Tests FIDIC Force Majeure Clause 19.1, concurrent delay analysis under English law (Adyard principle).
How scored	LLM judge vs golden answer. Per event: FM vs Contractor vs disputed classification, EOT entitlement, concurrent delay treatment, additional cost entitlement. Recommended EOT: 75 calendar days.

CB3 — Schedule Generalization: Northbrook Solar Energy Park 50MW EPC

Attribute	Detail
What	18-activity solar farm EPC schedule (Lincolnshire, UK, NEC3 Option A, GBP 35M). Non-building project type — tests model ability to reason about solar EPC logic rather than template-matching residential/building sequences. Critical path runs through PV installation workstream; Activity 8 (DC String Cabling, SS+10 relationship) is critical — counterintuitive key scoring point.
How scored	LLM judge vs golden answer. ES/EF/LS/LF/TF for all 18 activities, critical path identification (including Activity 8), project duration = 265 working days (inside 280 wd target). DNO grid connection float risk flagged.

Evaluation Pathways

All models are evaluated across 4 configurations per domain: 2 baseline tests and 2 fine-tuned tests. Thinking models run with thinking enabled throughout.

💭 Models with Thinking Support

Base Tests

① Baseline (Original Prompt)
② Baseline + Prompt Engineering

Fine-Tuned Tests

③ Fine-Tune (Original Prompt)
④ Fine-Tune + Prompt Engineering

🔇 Models without Thinking

Base Tests

① Baseline (Original Prompt)
② Baseline + Prompt Engineering

Fine-Tuned Tests

③ Fine-Tune (Original Prompt)
④ Fine-Tune + Prompt Engineering

Inference Parameters

Fixed parameters applied at evaluation time. Thinking models run at temperature 0.6; non-thinking at 0.15. All other parameters constant across models. Parameters are set in LM Studio before each evaluation session — not passed via API — ensuring the model's inference configuration is validated end-to-end.

Thinking Models

Temperature	0.6
Top P	0.9
Min P	0.06
Thinking	Enabled

Non-Thinking Models

Temperature	0.15
Top P	0.9
Min P	0.06
Thinking	Disabled

Runtime Environment

Inference server	LM Studio
Quantisation	Q8_0 GGUF
GPU	RTX 4090 Laptop 16GB
Streaming	SSE (no read timeout)

Evaluation Rubric — v1.0

Pre-Computed Golden Answers

Every test scenario is paired with a golden-answer JSON file authored before any model is evaluated. The golden answer specifies the correct output for every field, the scoring criteria for that field (full, partial, or zero credit), and the key discriminators — the items designed to separate genuine domain understanding from surface pattern-matching.

LLM-as-Judge Evaluation

Scoring uses no keyword matching and no exact string comparison. A judge LLM reads the complete model output — including the full reasoning trace in the reasoning_content field — and scores it against the golden-answer criteria. A model can therefore earn partial credit for correct reasoning under an incorrect label, and lose credit for a correct label produced through circular or empty reasoning.

Key Discriminators

Each domain carries one or two items designated as key discriminators. These are heavily weighted because passing them is a binary signal: the model either understands the core concept or it does not. Examples include Article 10 in Contracts (the only Rejected article, requiring an explicit Rule 3 invocation), DEL-003 in Delays (employer-caused but non-critical, cost-only claim), and Activity 4 on the Critical Path in Schedules (a counterintuitive Finish-to-Finish chain dependency).

Partial Credit — Label × Reasoning

Two axes are scored independently per item: the correctness of the output label or numeric value, and the correctness of the supporting reasoning. A model that emits the wrong label but with sound reasoning earns partial credit, and a model that emits the correct label through circular or empty reasoning earns a reduced score. This separation prevents inflated scores from lucky guesses.

Contracts Domain — 100 Points

Component	Weight	Scoring Method
A. Status Label Accuracy	50 pts	Per-article weighted scores with partial credit ladder (see below)
B. DB Clause ID Accuracy	20 pts	Correct ID = full; correct category wrong ID = 50%; wrong category = 0%
C. Reasoning Quality	30 pts	5 key articles × 6 pts: correct rule cited + clause element + DB reference

Article	Weight	Why
Art 10 ⭐ Key Discriminator	8 pts	Only Rejected article — requires explicit Rule 3 invocation
Art 7	6 pts	Two-clause split, non-trivial classification
Art 6, 3, 9	4 pts each	Multi-condition or rare-label clauses
Other 9 articles	2 pts each	Standard single-condition clauses

Partial Credit Ladder (per article):

Predicted	vs Accepted	vs Modification	vs Requires Review	vs Rejected
Accepted	100%	0%	25%	0%
Modification	25%	100%	50%	50%
Requires Review	25%	50%	100%	25%
Rejected	0%	50%	25%	100%

Schedules Domain — 100 Points

Component	Weight	Scoring Method
A. Project Duration	25 pts	Band scoring: ≤±20 wd=25 · ≤±50 wd=17 · ≤±100 wd=10 · ≤±150 wd=5 · >±150 wd=0
B. Critical Path	30 pts	Activity 4 on CP=6 ⭐ · Activity 9 NOT on CP=6 ⭐ · other 7 CP activities=2 each · wrong inclusion=−1
C. CPM Structural Validity	25 pts	All activities present (5) · input durations correct (8) · no negative TF (5) · complete backward pass (4) · ES/EF consistency (3)
D. Relationship Type Handling	20 pts	FF chain recognised (8) · SS chain recognised (8) · multi-predecessor merge (4)

Delays Domain — 100 Points

Component	Weight	Scoring Method
A. Event Identification	35 pts	4 events × 9 pts: activity ID (2) + duration ±10 wd (3) + responsibility (4)
B. Critical Path Reasoning + EOT	40 pts	DEL-002 on CP (10) · DEL-003 NOT on CP (10) ⭐ · EOT=0 recommendation (12) · float reasoning (8)
C. Output Quality	25 pts	Valid JSON (5) · delay cascade / concurrency (7) · cost vs EOT distinction (8) · recovery events (5)

Curveball Tests — CB Key Discriminators

The curveball tests apply the same rubric as the standard tests. The scenarios change; the scoring weights do not. The CB-specific key discriminators carry the same binary logic but probe different domain traps, and they are summarised below.

CB1 — Contracts: VAN-MIX Finnish YSE 1998 (21 clauses)

The principal discriminator on CB1 is C-002, which provides only for amicable negotiation — no arbitration, no court jurisdiction, no timeframes. The clause therefore offers no enforceable resolution mechanism, which is worse than DB3's rejected arbitration; Rule 3 applies and DB3 must be matched. Models that flag C-002 as "Requires Review" earn partial credit; models that accept it score zero on the clause. Secondary discriminators include C-017 (CAR insurance at EUR 22M, equal to 49 % of the EUR 45M contract value, where the DB8 standard is full contract value, and the value discrepancy must be identified) and the missing Performance Bond category (DB11 is absent entirely, and a full-marks model flags it in missing_db_categories).

CB2 — Delays: Grim Tide Offshore Wind (FIDIC Yellow Book)

The 10-point key discriminator is DEL-OW-002 (vessel breakdown), an equipment failure that is Contractor risk under FIDIC 4.15, not Force Majeure: zero EOT, zero cost. Models that label the vessel breakdown as Force Majeure fail this discriminator outright. An 8-point secondary discriminator covers the concurrent period of 1–26 November, during which a weather Force Majeure event and the Contractor vessel breakdown overlap for 26 calendar days. Under English law (the Adyard principle), the Contractor receives EOT for the concurrent period because the weather would have caused the same delay, but no additional cost. Models that deny all weather EOT because of the concurrent Contractor fault, or that award cost for the concurrent period, both fail. The golden EOT range is 35–75 calendar days (35 days for weather Force Majeure only; 75 days for weather plus a partial TP Force Majeure award).

CB3 — Schedules: Northbrook Solar 50MW EPC (18 activities)

The 6-point key discriminator is that Activity 8 is critical with TF = 0. DC String Cabling starts 10 days after PV Module Installation (SS+10), giving ES8 = 175 and EF8 = 220 = EF7 = 220, so both activities finish on the same day with zero float. The result is counterintuitive because Activity 8 is a secondary cabling activity rather than a structural one. Models that mark it non-critical miss the SS+10 mechanism entirely. A secondary discriminator concerns Activity 12 (Grid Connection): arithmetically TF = 105 working days, but UK Distribution Network Operator approvals routinely take three to six months, and the external constraint makes that float unreliable. A full-marks model flags the DNO risk explicitly. The golden project duration is 265 working days, inside a 280-working-day target with a 15-day contingency.

Links — Reproducibility Bundle

Everything needed to reproduce the findings independently. Each ZIP contains a README.md documenting its layout, formats, and usage instructions.

Evaluation Artefacts

All 6 test prompts (T1, T2, T3, CB1, CB2, CB3) + their golden answers + the unified Prompt-Engineering system prompts (per-domain).

Contents: contracts/ (T1+CB1+DB+golden), delays/ (T2+CB2+DB+golden), schedules/ (T3+CB3+DB+golden), pe_prompts/ (4 unified PE prompts).

⬇ Download eval_artefacts.zip (67 KB)

Fine-Tuning Data

Train + validation JSONL splits per domain. Schedules is the v3 pure planner-pattern dataset. Format: chat-template (system + user + assistant messages).

Contents: contracts/ (~140 train + val), delays/ (~74 train + val), schedules/ (77 train + 11 val, v3).

⬇ Download finetuning_data.zip (126 KB)

Scripts — Full Pipeline

Training (Unsloth + QLoRA), data conversion (chat-template per model), stream merge + GGUF Q8/MXFP4, and evaluation (4 pathways × T+CB) for the 3 Schedules v3 models + earlier Contracts/Delays cycles.

Contents: training/ (Gemma/Qwen/GPT-OSS), data_conversion/ (4 converters + v3 reshape), merge_gguf/ (stream merge), evaluation/ (per-model pathway eval scripts).

⬇ Download scripts.zip (44 KB)