Testing the limits of SLM fine-tuning on synthetic construction data — contracts, delays & schedules — 100% local inference, no cloud
SLMs act as workers in a digital factory, while LLMs serve as consultants for complex challenges. Agentic AI doesn't require a Swiss Army knife when a single sharp tool will do.— NVIDIA Developer Blog: How Small Language Models Are Key to Scalable Agentic AI
NVIDIA's case is clear: the future of scalable agentic AI is specialised small language models, not ever-larger general-purpose ones. SLMs are 10–30x cheaper to run, can be fine-tuned in hours rather than weeks, run at the edge without cloud dependency, and — when properly trained — produce more reliable structured outputs than frontier models for narrow, repetitive tasks.
The construction industry is a compelling testbed for this thesis. Contract review, delay analysis, and schedule management are high-value, repetitive, and structurally well-defined — exactly the kind of task NVIDIA argues SLMs can own. A specialist construction AI agent running on-device could analyse contracts in real time, flag delay risks before they escalate, and validate schedules without sharing sensitive project data with a cloud provider.
This research programme tests that hypothesis directly: can small language models (3B–20B parameters), fine-tuned on synthetic construction data, reliably perform these tasks? And what are the real limits of the approach?
We evaluated 8 small language models ranging from 3B to 20B parameters on a synthetic construction dataset of 337 examples spanning three domains: contract clause analysis, delay attribution, and schedule analysis. Models span multiple vendors (Mistral, Microsoft, NVIDIA, Google, IBM, Alibaba, OpenAI) and architectures (pure Transformer, hybrid Mamba-2 + Attention, Mixture of Experts). All models were evaluated via the same 6-test standardised evaluation suite (3 Standard + 3 Curveball) with local inference through LM Studio.
Results reveal significant performance variation across the 3B–20B range, with model size showing surprisingly low correlation with performance. The next phase applies a specialist fine-tuning strategy — training each model on a single domain (Contracts, Delays, or Schedules) rather than all three simultaneously — informed by 10 prior trial rounds on Qwen3-4B and Nemotron-3-Nano-4B that demonstrated multi-domain fine-tuning causes catastrophic interference at this dataset scale.
The complete cycle of 8 models × 3 domains × 4 pathways (Base, Base+PE, FT, FT+PE) exposes two architecturally distinct routes to production-grade performance, and the choice between them depends on the knowledge structure of the target domain rather than on model size or training budget.
The first route — fine-tuning — is justified when the target domain admits a closed-form reasoning pattern (TIA framework and Adyard concurrency in Delays; planner-pattern selection and CPM execution in Schedules), when training data of at least sixty pattern-bearing examples is available, and when compute is sufficient for a hundred-plus optimisation steps. Fine-tuned wins in this cycle include Gemma 4 E4B (Contracts +10, Delays +21), Ministral 14B FT+PE (Delays +19), and GPT-OSS 20B FT+PE on Schedules at 92.5/100, the overall best score recorded in the programme.
The second route — a strong base model paired with deterministic scaffolding and an engineered system prompt — is equally viable when resources or training data are constrained, and in several cases is the preferred path even when fine-tuning is feasible. Qwen 3.5 9B Base+PE matched or surpassed its own fine-tune in every domain tested, including a perfect 98/100 on the CB3 Northbrook Solar EPC CPM analysis. The base reasoner, when given explicit rule injection in the system prompt and a deterministic execution layer (CPM solver, clause-database retrieval, validator loop), produces production-grade output without any weight modification, iterates faster, costs less to operate, and adapts dynamically to new clauses, projects, and regulations.
The practical recommendation is therefore domain-conditional rather than model-conditional. Fine-tuning is the preferred path for domains with stable, closed-form knowledge (delay attribution rules, CPM arithmetic, schedule grammar). Base+scaffolding+PE is the preferred path for domains with evolving knowledge (contract clauses, jurisdiction-specific precedent, company-specific commercial terms). The strongest production stack combines both: a fine-tuned planner-analyst model wrapped in retrieval-augmented scaffolding and addressed through a carefully engineered prompt.
| Model | Vendor | Params | Architecture | Context |
|---|---|---|---|---|
| Ministral 3B | Mistral | 3B | Transformer | 131K |
| Phi 4 Mini | Microsoft | 3.8B | Transformer | 131K |
| Nemotron 4B | NVIDIA | 4B | Hybrid Mamba-2 + Attn | 131K |
| Gemma 4 E4B | 4B | Transformer | 131K | |
| Granite 4 Tiny | IBM | 7B | MoE | 131K |
| Qwen 3.5 9B | Alibaba | 9B | Transformer | 131K |
| Ministral 14B | Mistral | 14B | Transformer | 131K |
| GPT OSS 20B | OpenAI | 20B | Transformer | 131K |
We use Unsloth with QLoRA (Quantised LoRA) — a highly optimised fine-tuning framework that reduces VRAM usage and increases training speed 2–5× versus standard PEFT. Base model weights are kept frozen in their original precision; only low-rank adapter matrices (LoRA) are trained. This enables full fine-tuning-quality adaptation with a fraction of the VRAM and time.
6 evaluation scenarios across 3 domains — one Standard (T1) and one Curveball (CB) per domain. The T1 tier tests competency on realistic project scenarios; the CB tier tests generalisation to novel jurisdictions, project types, and contract forms entirely unseen in training. Responses are scored by an LLM judge against pre-computed golden answers with full reasoning. All models evaluated across up to 8 pathways (Base/FT × Original/Enhanced prompt × Think/NoThink).
The research programme is structured around four hypotheses, derived from the NVIDIA SLM thesis, ten prior multi-domain trial rounds on Qwen3-4B and Nemotron-3-Nano-4B, and the specialist-strategy revision that followed.
Hypothesis A. Specialist fine-tuning — training each model on a single domain rather than all three simultaneously — will outperform multi-domain fine-tuning by avoiding the catastrophic interference observed in the prior trial rounds. A single-domain adapter learns one task grammar without competing objectives, and the dataset scale used here is insufficient to support concurrent domain mastery.
Hypothesis B. Models that already score highest in a domain's base evaluation possess the strongest foundation for fine-tuning on that domain, and the marginal lift from training should be largest for these top-ranked base models. The expectation is that fine-tuning pushes the leading bases toward the 95–100 range on their specialist domain.
Hypothesis C. Schedules will remain the hardest domain after fine-tuning. Schedule tasks demand multi-step numerical reasoning (CPM forward and backward passes, predecessor arithmetic, lag-type interpretation), and even the strongest base score is only 85.8. Fine-tuning is expected to improve format compliance and pattern recall but is unlikely, on its own, to repair arithmetic reasoning gaps.
Hypothesis D. Model size is not the primary predictor of performance. The base evaluation already shows 4B models outperforming 14B and 20B models on several domains. Architecture (thinking capability, hybrid state-space + attention, mixture-of-experts), pre-training data quality, and fine-tuning data design are expected to matter more than raw parameter count for construction-domain reasoning.
These 8 models represent the latest thinking-capable SLMs in the 3B–14B "Small Language Model" range, with GPT-OSS-20B as the only "small-to-medium" exception. Selection criteria: open-weights, local inference capability on 16 GB VRAM, instruction-following, and structured JSON output support.
| Model | Size | Thinking | Rationale |
|---|---|---|---|
| Ministral 3B | 3B | No | Smallest Mistral model — tests the absolute floor for construction domain capability |
| Phi 4 Mini | 3.8B | Yes | Microsoft's compact reasoning model — strong structured output at minimal parameter count |
| Nemotron 4B | 4B | No | NVIDIA's hybrid Mamba-2 architecture — tests SSM vs Transformer for construction tasks |
| Gemma 4 E4B | 4B | Yes | Google's latest 4B model with thinking — direct comparison to other 4B models |
| Granite 4 Tiny | 7B | No | IBM's enterprise MoE model — tests sparse expert routing for domain specialisation |
| Qwen 3.5 9B | 9B | Yes | Alibaba's mid-range reasoning model — builds on Qwen3-4B trial run findings |
| Ministral 14B | 14B | Yes | Largest SLM in programme — tests whether 14B yields meaningfully better FT results |
| GPT OSS 20B | 20B | Yes | OpenAI's first open-weight model — baseline from frontier lab at small-to-medium scale |
The following parameters apply generically across all models during fine-tuning. Specific per-model values will be determined based on base evaluation results and domain assignment.
| Parameter | Description / Meaning |
|---|---|
| LoRA Rank (r) | Dimension of the low-rank adapter matrices. Higher rank = more capacity but higher forgetting risk. Typical range: 8–32. |
| LoRA Alpha | Scaling factor for LoRA updates. Usually set to 2x the rank (e.g., r=8, alpha=16). Controls how aggressively the adapter modifies the base weights. |
| Learning Rate | Step size for weight updates. Too high = catastrophic forgetting. Too low = no learning. Typical range: 1e-5 to 5e-5. |
| Max Epochs | Maximum number of full passes through the training data. Combined with early stopping to prevent overfitting. |
| Early Stop Patience | Number of epochs without eval loss improvement before stopping training. Prevents wasted compute and overfitting. |
| Grad Clip | Maximum gradient norm (typically 1.0). Prevents exploding gradients — critical for SSM/Mamba layer stability. |
| Warmup Ratio | Fraction of training steps with linearly increasing LR. Prevents early destabilisation. Typical: 0.10–0.15. |
| LoRA Target Modules | Which model layers receive LoRA adapters. Typically attention (q/k/v/o_proj) and MLP (gate/up/down_proj). SSM layers require special handling. |
| Batch x Accumulation | Effective batch size = batch_size x gradient_accumulation_steps. Constrained by 16 GB VRAM. Typically 1 x 4 = 4. |
| Dataset size | Number of training examples. Current: 337 (291 train / 46 val). Domain split: 164 contracts, 85 delays, 88 schedules (v3). |
| Precision | Training dtype. FP16 or BF16 autocast, matched to native model precision. BF16 preferred for stability. |
| Model | Size | Thinking? | Base Contracts | Base Delays | Base Schedules | FT Contracts | FT Delays | FT Schedules | Top-3 Domains |
|---|---|---|---|---|---|---|---|---|---|
| Ministral 3B | 3B | N/A | 49/100 | 63/100 | 60/100 | — | 62/100 ↑ | — | Delays |
| Phi 4 Mini Reasoning | 3.8B | Yes | 59/100 | 57/100 | 56/100 | — | — | — | — |
| Nemotron 4B | 4B | Yes | 76/100 | 51/100 | 71/100 | 70/100 ↓ | — | — | Contracts |
| Gemma 4 E4B | 4B | Yes | 82/100 | 62/100 | 79/100 | 92/100 ↑ | 75/100 ↑ | 84.5/100 ↑ | Contracts Schedules Delays |
| Granite 4 Tiny | 7B | N/A | 27/100 | 57/100 | 65/100 | — | — | — | — |
| Qwen 3.5 9B | 9B | Yes | 89/100 | 55/100 | 86/100 | 78/100 ↓ | — | 76.5/100 ↓ | Schedules Contracts |
| Ministral 14B | 14B | Yes | 66/100 | 66/100 | 62/100 | — | 79/100 ↑ | — | Delays |
| GPT OSS 20B | 20B | Yes | 78/100 | 53/100 | 78/100 | — | — | 89/100 ↑ | Schedules |
| Rank | Model | Score |
|---|---|---|
| 1st | Qwen 3.5 9B | 89/100 |
| 2nd | Gemma 4 E4B | 82/100 |
| 3rd | Nemotron 4B | 76/100 |
| Rank | Model | Score |
|---|---|---|
| 1st | Ministral 14B | 66/100 |
| 2nd | Ministral 3 3B | 63/100 |
| 3rd | Gemma 4 E4B | 62/100 |
| Rank | Model | Score |
|---|---|---|
| 1st | Qwen 3.5 9B | 86/100 |
| 2nd | Gemma 4 E4B | 79/100 |
| 3rd | GPT OSS 20B | 78/100 |
Strong FIDIC delay causation reasoning (CB2 = 78/100, key discriminator passed). Perfect activity completeness in T3 (25/25) and all durations within benchmark ranges (22/30). The model fails on DB clause ID lookup (all null across both contracts tests), predecessor circular dependencies in T3 (three deadlock chains penalise the C section to 0/25), CPM arithmetic in CB3 (negative total floats, duration overstated), and the FIDIC 19.4 cost rule (Force Majeure is time-only, not cost). Extended thinking (45K–56K tokens) helps classification but does not repair arithmetic or graph construction. Full sub-scores are recorded on the Phi 4 Mini tab.
Contracts score 76/100 combined on clean inputs (T1 = 85, CB1 = 66); the earlier 79 (with "C-002 Rejected, 20/21 correct") came from an answer-leaked docx CB1 run. On clean CB1, C-002 returns "Requires Review" (KEY DIS FAIL), and the model exhibits heavy Accept-bias. Delays performance is catastrophically weak (T2 = 35/100, CB2 = 67/100): DEL-003 misclassified, EOT outside range, no FIDIC citations in T2. Schedules generation is strong, but the SS+10 lag is misread as FS+10, producing a 55-working-day inflation in CB3. Full sub-scores on the Nemotron 4B tab.
Contracts weak on clean inputs (T1 = 63, CB1 = 34); the earlier "CB1 89/100" and "+26 point T→CB recovery" came from an answer-leaked docx run. Clean CB1 shows severe Accept-bias (17 of 21 clauses Accepted) and a C-002 KEY DIS FAIL. Delays performance is moderate: DEL-OW-002 (Contractor, FIDIC 4.15) is correctly classified in both tests, but the FM label is inconsistent and EOT falls outside range. Recurring invalid JSON output (// comments, arithmetic expressions in values) appears across delays tests. Schedules CPM is systematically broken: T3 circular dependencies collapse C to 0; the CB3 SS+10 misread inflates duration to 296 vs golden 265.
IBM's enterprise MoE underperforms its 7B class on contracts (T1 = 33/100: four hallucinated articles, Art10 marked Accepted, KEY DIS FAIL), but shows a unique T2 strength — the first model to correctly classify DEL-003 as employer non-critical with zero EOT entitlement. Schedule generation is solid (T3 = 78/100, no circular dependencies). CB3 CPM is broken: Activity 8 is not on the critical path (SS+10 misread as FS+10, ES = 235 not 175). Very brief CB responses suggest a capacity limit at this MoE scale.
Base overall 74/100. Contracts: T1 = 93/100 (all 14 articles correct, no hallucinations, Art10 Rejected ✓, strong DB IDs); CB1 = 70/100 on the clean plain-text input. The CB3 breakthrough is the first correct SS+10 lag application in the programme (ES8 = 175), achieving exact 265-working-day duration. Thinking traces (10K–12K characters) show thorough per-item reasoning. The principal weakness is T2 Delays (49/100): DEL-003 placed on the critical path, and a backward-pass error produces negative TF values (−5).
Highest overall base score (77/100). CB3 is outstanding at 96/100 — exact 265-working-day duration, perfect critical path, correct SS+10 (ES8 = 175) and SS+15, and no negative TF values (fixes Gemma's backward-pass error). Contracts is the strongest in the programme: T1 = 88/100 with Art10 Rejected and thorough 41K-character reasoning; CB1 = 89/100 on clean input. The CB1 key discriminator nonetheless fails — C-002 is labelled "Requires Review" because the model correctly identifies DB3 as "completely rejected" but argues that a negotiation-only clause does not match DB3's arbitration-specific entry, missing that Rule 3 applies to the whole Dispute Resolution category. Delays remains consistently weak (T2 = 56, CB2 = 53). T3 circular predecessor dependencies (three chains) kill the C section.
T1 contracts key discriminator passes (Art10 Rejected), but CB1 C-002 fails on the clean plain-text input (Requires Review, not Rejected; the earlier "C-002 Rejected" was an answer-leaked docx run; clean CB1 = 54/100). T2 event identification is perfect (35/35 A-section) — the best event recall in the programme. CB2 is strong (70/100): DEL-OW-002 Contractor correct, Adyard principle cited, weather EOT maintained, EOT = 35 within acceptable range. Weaknesses include T2 KEY DIS fail (DEL-003 placed on CP, EOT = 45 not 0), T3 circular dependencies 15↔16↔17 (C = 0/25), and CB3 SS+10 lag misread (ES8 = 165 not 175). Uniquely, CB3 C = 25/25 — no negative TF, cleaner backward pass than Gemma's. reasoning_content is empty across all tests and all outputs are wrapped in a markdown code fence.
OpenAI's first open-weight model evaluated locally via LM Studio with reasoning_effort=low. T1 contracts key discriminator passes (Art10 Rejected, explicit Rule 3, strong T1 reasoning with Art10/7/6/3 all 6/6), but CB1 C-002 fails on the clean plain-text input (Requires Review, not Rejected). T3 has no circular dependencies — one of only three models to achieve this (with Granite and Nemotron). CB3 SS+10 is correctly applied in the forward pass (ES8 = 175), matching Gemma and Qwen, but a backward-pass error gives TF8 = 40 and Activity 8 non-critical (KEY DIS FAIL). Delays is the weakest domain (53/100): T2 DEL-003 placed on CP, EOT = 1.5 months vs golden 0; CB2 is severely truncated (29 reasoning tokens, 758 characters), missing the concurrent delay analysis and inverting the FIDIC 19.4 FM cost rule. reasoning_effort=low is insufficient for delay analysis and CPM backward-pass depth.
The smallest model fine-tuned in the programme (3B, non-thinking). The Delays cycle proceeds base 47 → FT 57 (+10) → FT+PE 62 (+15 vs base). FT+PE is the strongest configuration: the engineered prompt's explicit FIDIC and Adyard rule injection eliminated an invalid-JSON output defect and corrected the DEL-OW-002 contractor responsibility classification. Even a non-thinking 3B can be lifted into the mid-60s on Delays when the base has a workable foundation. Contracts and Schedules were not fine-tuned for this model; Delays was its strongest base domain (63), so the fine-tune targeted that strength.
On clean inputs the Contracts fine-tune scored 70 against base 76 — a −6 regression. The earlier "FT learned the rejection hierarchy (C-002)" claim was an artefact of an answer-leaked docx CB1 run; on clean CB1 both base and fine-tune fail C-002 (Requires Review, not Rejected). The real T1 regressions — article hallucination, over-triggered modification — persist. FT+PE recovered the score to 82, clearing base 76, which is a genuine best-of-both-worlds outcome unique to Nemotron in this cycle. The lesson for hybrid Mamba-2 architectures with limited training data is that fine-tuning alone is risky; FT + engineered prompt is the safer stack.
The most fine-tuned model in the programme. Contracts moved from 82 to 92 (+10) — the only contracts fine-tune that improves on its base — driven by CB1 70→94 (+24); the fine-tune passes the C-002 key discriminator that the base fails and handles the Finnish YSE 1998 curveball cleanly, provided it is evaluated in thinking mode at ≥8k context. Delays moved from 62 to 75 with FT+PE (+13); this configuration returned the first correct partial EOT for the disputed TP fire in CB2 (38 days vs golden ~40). Schedules v3 moved from 79 to 84.5 with FT vanilla (+5.5); the v3 dataset rework (77 planner-pattern examples, T3+CB3 contamination eliminated) raised T3 from v2's 65 to 83 and removed hallucinations such as "Berlin". Prompt engineering actively hurt Gemma's Schedules CB3 by over-prescribing rules that broke the backward pass. The overall pattern is that Gemma absorbs the fine-tune pattern cleanly across all three domains, while PE helps Delays and hurts Schedules.
Contracts regressed by −11 (89→78). At temp = 0.6 the fine-tune emits the invalid string "Accepted subject to modification" on 7 of 14 T1 articles — a format defect rather than a reasoning failure. FT+PE rescues the score to 87 but still trails base 89 and base+PE 94. Schedules v3 follows a similar pattern: FT vanilla 76.5 (−2.5 vs base 79), FT+PE 73 (worse still). Crucially, Base+PE on Schedules reaches 87.5 combined (+8.5), including a perfect CB3 98/100 (exact PD = 265, exact CP, all flags consistent). The verdict for Qwen on both domains is to skip fine-tuning entirely and operate the model in Base+PE mode: strong base reasoning combined with explicit CPM rule injection produces production-grade output without any weight modification overhead.
The 14B Reasoning model on Delays: base 66 → FT 70 → FT+PE 79 (+19 vs base). FT+PE was the only configuration to compute the Adyard offset exactly right (42d vessel breakdown − 26d concurrent window = 16d contractor LD). The combination of 14B reasoning capacity, planner-pattern training data, and explicit FIDIC rule injection in the prompt produced the strongest Delays score in the programme. Training required the Unsloth path (raw PEFT crashed on the 4-bit Ministral3 base); per-epoch evaluation was disabled because the accelerate fp32 conversion OOMs on logits at sequence length 4096; and stream merge was mandatory for the locally-trained 4-bit adapter, since Unsloth's save_pretrained_merged corrupts on this combination.
The 20B MoE base (3.6B active) was fine-tuned on the Schedules v3 dataset on a cloud A100, the local 4090 Laptop being blocked twice (Unsloth fused-CE incompatibility on SM89, and the transformers MXFP4 hard training guard). FT+PE reached 92.5 combined (T3 = 87, CB3 = 98), the overall programme winner. Both FT vanilla and FT+PE produce a perfect CB3 98/100: PD = 265 exact, CP = [1, 2, 3, 5, 6, 7, 8, 14, 15, 17, 18] matching golden exactly, Activity 8 critical, SS+10 and SS+15 correctly applied, Activity 1 critical = true (where Gemma produced TF = −15). T3 FT+PE uses multi-DB blending (P1/P2/P4/P5/P8) with per-activity scale rationale. The complete cloud cycle cost approximately $2 and 60 minutes wall-clock; the MXFP4 GGUF (13 GB) was deployed locally for LM Studio. The pattern is the inverse of Qwen: large MoE bases combined with planner-pattern fine-tuning and an engineered prompt stack additively rather than regressing.
No single pathway wins all three domains. Contracts rewards Base+PE because the task is dominated by knowledge retrieval and fine-tuning bakes in a stale snapshot of a company's clause database. Delays and Schedules reward FT+PE because the task is closed-form reasoning that transfers cleanly into fine-tuned weights. The strongest production stack selects the pathway per domain on the basis of whether the underlying task is dynamic-knowledge or closed-form-reasoning. The cross-model winners are summarised below.
| Domain | Best Model | Best Path | Score | Δ vs Best Base |
|---|---|---|---|---|
| Contracts | Qwen 3.5 9B | Base + PE | 94/100 | +5 |
| Delays | Ministral 3 14B | FT + PE | 79/100 | +19 |
| Schedules | GPT-OSS 20B | FT + PE | 92.5/100 | +13.5 |
Across the 3B–20B range tested, raw parameter count showed weak correlation with final scores. The 4B Gemma E4B fine-tuned contracts (92) beats both the 14B Ministral (best contracts 66 base) and the 20B GPT-OSS (best contracts 78 base). The 9B Qwen Base+PE contracts (94) is the best contracts configuration overall. On Schedules, the 20B GPT-OSS FT+PE (92.5) does win, but Qwen 9B Base+PE (87.5) and Gemma 4B FT (84.5) are within striking distance at a fraction of the inference cost. Architecture (thinking capability, hybrid state-space, mixture-of-experts), pre-training data quality, and fine-tuning data design matter more than parameter count for construction-domain SLM performance.
The three contracts fine-tunes split unevenly: Gemma 4 E4B improved (+10 combined, +24 on the CB1 Finnish YSE 1998 curveball), while Nemotron 3 Nano 4B (−6) and Qwen 3.5 9B (−11) regressed. The "trades generalisation for specialisation" pattern does not hold uniformly. Gemma's fine-tune generalised better, but Qwen's invalid-status-string format defect at temp = 0.6 and Nemotron's article hallucination wiped out fine-tuning gains.
The root cause is that contracts requires reasoning combined with retrieval against a company's evolving clause database. Fine-tuning encodes a snapshot; weights trained today are stale by next quarter as precedents shift, commercial strategy changes, and new clauses close. The architecture-correct path is a capable base (Qwen 9B Base+PE = 94 is the programme high) combined with retrieval-augmented scaffolding (clause-database injection, tool-calling, MCP) and a carefully engineered prompt. Fine-tuning bakes in what the company decided in the past; what is needed is teaching how the model reasons, and for contracts that capability is already present in the base. Fine-tuning should be reserved for the rare model–domain combination where it demonstrably clears the base; in this cycle Gemma 4 E4B contracts is the only such example.
Three models completed the full Delays fine-tune cycle (Ministral 3 3B, Gemma 4 E4B, Ministral 3 14B), each with the same six T2-envelope training additions and the same engineered prompt. Every model improved with fine-tuning, and FT+PE was the best configuration in every case.
| Model | Base | Base + PE | FT | FT + PE | FT+PE Δ vs base |
|---|---|---|---|---|---|
| Ministral 3 3B (non-thinking) | 47 | 56 | 57 | 62 | +15 |
| Gemma 4 E4B (thinking) | 54 | 66 | 54 | 75 | +21 |
| Ministral 3 14B (thinking) | 60 | 74 | 70 | 79 | +19 |
Delays behaves differently from Contracts because it is pure reasoning — TIA framework, float arithmetic, responsibility classification, Adyard concurrency — with no live clause database and no jurisdiction-specific knowledge that goes stale. Training data teaches how to reason, which transfers cleanly to weights. The single strongest configuration, Gemma 4 E4B FT+PE, returned the first correct partial EOT for the disputed TP fire in CB2 (38 days vs golden ~40). Ministral 14B FT+PE was the only configuration to compute the Adyard offset exactly (42d vessel breakdown minus 26d concurrent window = 16d contractor LD).
The Schedules v3 cycle (planner-pattern dataset, 77 train + 11 val, T3 and CB3 task-pattern contamination eliminated) produced the strongest results of the programme. Three configurations achieved a perfect CB3 score of 98/100 on the Northbrook Solar 50MW EPC CPM analysis.
| Configuration | T3 (Cologne generation) | CB3 (Solar EPC CPM) | Combined |
|---|---|---|---|
| Gemma 4 E4B FT vanilla | 83 | 86 | 84.5 |
| Qwen 3.5 9B Base + PE | 77 | 98 | 87.5 |
| GPT-OSS 20B FT vanilla | 80 | 98 | 89 |
| GPT-OSS 20B FT + PE | 87 | 98 | 92.5 |
Three patterns emerged across model sizes. The small dense model (Gemma 4 E4B) prefers FT vanilla and is actively hurt by PE. The medium dense model (Qwen 3.5 9B) prefers Base+PE and regresses on CB3 under fine-tuning due to an A2 SS bug. The large MoE model (GPT-OSS 20B) shows additive stacking: both FT vanilla and FT+PE achieve perfect CB3, with PE adding +7 on T3. The v3 dataset rework — replacing v1's mixed format and v2's 8-envelope minority signal with 77 pure planner-pattern examples — was the key unlock. The planner pattern encodes activity selection, sequence logic, and duration grounding; the deterministic scaffold computes the CPM math. The model learns the correct abstraction. The CB3 Northbrook Solar critical path (1→2→3→5→6→7→8→14→15→17→18) was matched exactly by all three top configurations, demonstrating that small, medium, and large SLMs can all reach production-grade CPM analysis when paired with the right pathway.
| Domain | T Score (standard) | CB Score (curveball) | Combined | Fine-Tuned | Δ |
|---|---|---|---|---|---|
| Contracts | 63/100 | 34/100 | 49/100 | — | — |
| Delays | 55/100 | 71/100 | 63/100 | 62/100 ↑ | +15 (FT+PE) |
| Schedules | 62/100 | 58/100 | 60/100 | — | — |
| Test | Project | Domain | Base (Rubric) | FT | Δ |
|---|---|---|---|---|---|
| T1 | Hamburg Tower — Contract Clause Analysis (14 articles) | Contracts | 63/100 A=33 B=16 C=14 | — | — |
| T2 | AB v AP Residential — Delay Attribution & EOT | Delays | 55/100 A=32 B=13 C=10 | — | — |
| T3 | Cologne Residential — 18-Activity Schedule Generation | Schedules | 62/100 A=25 B=30 C=0 D=7 | — | — |
Novel jurisdiction / sector / contract form — tests framework generalisation beyond training data.
| Test | Project | Domain | Base | FT | Δ |
|---|---|---|---|---|---|
| CB1 | VAN-MIX-2025-011 — Finnish YSE 1998 mixed-use contracts | Contracts | 34/100 A=17 B=9 C=8 | — | — |
| CB2 | Grim Tide Offshore Wind — FIDIC FM & concurrent delay | Delays | 71/100 A=26 B=28 C=17 | — | — |
| CB3 | Northbrook Solar 50MW — EPC CPM schedule | Schedules | 58/100 A=17 B=16 C=19 D=6 | — | — |
CB1 re-run on the standardised plain-text input — the earlier 44/100 came from the answer-leaked docx extraction (the docx carries a "Notes / Flaws" column stating every clause's defect). Clean score: 34/100 (A=17 B=9 C=8). Severe Accepted-bias: 17/21 clauses returned "Accepted" regardless of content. C-002 marked "Requires Review (Inferred)" — KEY DIS FAIL. No key discriminator passes in either T1 or CB1. DB ID accuracy collapses without the leaked answer key (B=9/20). Training must reinforce STATUS DECISION HIERARCHY.
DEL-OW-002 Contractor (FIDIC 4.15) correctly identified in both T2 and CB2 (KEY DIS ✓). FM label inconsistent: T2 correct, CB2 "neutral" instead of Force Majeure for Event_1. EOT calculations outside expected range in both tests. Concurrency analysis attempted (Adyard principle named in CB2) but EOT totals incorrect. Recurring invalid JSON output (// comments, arithmetic expressions, markdown in values) in both T2 and CB2 — significant quality issue.
Activity generation strong: all 18 activities with durations in range in both T3 and CB3. CPM calculation fails consistently. T3: circular predecessor dependencies (4 distinct chains) make schedule logically invalid → C=0. CB3: SS+10 lag misread as FS+10 (adding lag to predecessor's EF not ES), inflating duration to 296 vs 265 golden (31 wd error). Activity 8 fails KEY DIS in CB3 (critical=false, TF=-6.01). Negative total float values (activities 8,11,13,14,16) indicate systematic CPM logic error.
1. JSON output validity — eliminate // comments and arithmetic expressions in values across all domains. 2. CPM lag type interpretation — SS+lag means add lag to predecessor's ES, not EF. 3. Label/reasoning consistency — output status must match reasoning conclusion. 4. EOT range calibration — total EOT should reflect net impact of overlapping events, not sum of individual durations.
FT trained locally on RTX 4090 16GB: 74 examples (68 original + 6 full-schedule T2-envelope), 10 epochs, eval loss 2.51→1.30 monotonic, ~9 min. Q8_0 GGUF deployed. Eval params: temp=0.15, top_p=0.9, min_p=0.06 (non-thinking model). Scoring = LLM-as-judge holistic 0–100 (not Rubric v1.0 — separate cycle, run after the formal rubric grading).
| Configuration | T2 | CB2 | Combined | Δ vs base |
|---|---|---|---|---|
| Base | 48 | 46 | 47 | — |
| Base + Prompt Engineering | 62 | 50 | 56 | +9 |
| Fine-Tune | 58 | 56 | 57 | +10 |
| Fine-Tune + Prompt Engineering | 66 | 57 | 62 | +15 |
Base vanilla computed a 257-day total delay (used Concrete's finish as project baseline) and emitted invalid JSON with // comments and bare arithmetic. FT got 30 days (golden 29) with clean tia_findings + project_summary envelope. The 6 added full-schedule training examples (Riverside / Oakfield / Hillcrest / Brunswick / Granville / Tamar — distinct projects from Munich) taught the T2 I/O contract: full dual-schedule input → array + summary output. CB2 vessel-breakdown LD party also corrected (base "client" → FT "contractor").
FT vanilla T2 marked contractor delays as EOT-entitled (recommended 45 days vs golden 0). FT + Enhanced PE corrected this: Concrete contractor → LD contractor ✓; recommended EOT 5 (golden 0); LD party = contractor ✓. PE prompt's explicit EOT/LD direction table did the work.
D&W Installation marked critical-path (golden non-critical → 0 EOT vs completion). CB2 TP fire: golden ~40-day partial/negotiated EOT — every config went 0 or 77. PE-introduced regressions: base+PE flipped Concrete Skeleton responsibility (vanilla had it right); FT+PE format regressed (markdown fences, // comments, bare 20+24+38 expressions). Small model can't reliably follow "raw JSON only" instructions.
| Domain | T Score (standard) | CB Score (curveball) | Combined | Fine-Tuned | Δ |
|---|---|---|---|---|---|
| Contracts | 67/100 | 51/100 | 59/100 | — | — |
| Delays | 35/100 | 78/100 | 57/100 | — | — |
| Schedules | 67/100 | 45/100 | 56/100 | — | — |
| Test | Project | Domain | Base (Rubric) | FT | Δ |
|---|---|---|---|---|---|
| T1 | Hamburg Tower — Contract Clause Analysis (14 articles) | Contracts | 67/100 | — | — |
| T2 | AB v AP Residential — Delay Attribution & EOT | Delays | 35/100 | — | — |
| T3 | Cologne Residential — 18-Activity Schedule Generation | Schedules | 67/100 | — | — |
Novel jurisdiction / sector / contract form — tests generalisation beyond training data.
| Test | Project | Domain | Base | FT | Δ |
|---|---|---|---|---|---|
| CB1 | VAN-MIX-2025-011 — Finnish YSE 1998 mixed-use contracts | Contracts | 51/100 A=32 B=9 C=10 | — | — |
| CB2 | Grim Tide Offshore Wind — FIDIC FM & concurrent delay | Delays | 78/100 | — | — |
| CB3 | Northbrook Solar 50MW — EPC CPM schedule | Schedules | 45/100 | — | — |
45K–56K reasoning tokens generated per domain, yet CPM arithmetic still fails (negative floats, 55 wd overstatement). Extended thinking improves systematic clause-by-clause analysis (CB2 = 78/100) but cannot prevent arithmetic accumulation errors across 18+ interdependent activities. CPM quality is bounded by arithmetic precision, not reasoning depth. Fine-tuning on schedule examples must reinforce the forward/backward pass algorithm explicitly — reasoning chain alone is insufficient.
The model correctly assesses clause status (Accepted / Modification / Requires Review / Rejected) in most cases but returns null for all DB clause ID matches. These are separate cognitive tasks: status assessment requires legal reasoning; ID lookup requires memorised format and database awareness. A fine-tuned model needs training examples where correct DB IDs appear in the output — the base model has zero exposure to the internal clause database and cannot infer IDs from first principles.
CB2 = 78/100 is the highest score across all 6 tests. The model correctly classified FM vs Contractor-risk delay events (vessel breakdown = FIDIC 4.15, key discriminator passed), identified the concurrent delay window, and computed EOT within the golden range. The only failure was FIDIC 19.4 cost rule (FM = time only, no cost). This suggests delays is the highest-leverage fine-tuning domain: strong base reasoning + one key rule to reinforce = potentially 90+ score.
T1 (in-distribution) = 67/100; CB1 (Finnish YSE 1998) re-run on the clean plain-text input = 51/100 — the earlier 63 came from the answer-leaked docx extraction (its "Notes / Flaws" column states every clause's defect). A 16-point T→CB gap, not the 4-point gap the leaked run implied. C-002 assessed as "Requires Review (Inferred)" not "Rejected" — key discriminator still fails. The clean CB1 output was also structurally malformed: no clause_id field, scrambled justification-to-clause alignment, C-003/C-021 missing, C-008 duplicated — scored positionally (A=32 B=9 C=10). Fine-tuning must reinforce both DB ID matching and strict output schema.
| Domain | T Score (standard) | CB Score (curveball) | Combined | T Score FT | CB Score FT | Combined FT | Δ |
|---|---|---|---|---|---|---|---|
| Contracts | 85/100 | 67/100 | 76/100 | 77/100 | 63/100 | 70/100 | −6 |
| Delays | 35/100 | 67/100 | 51/100 | — | — | — | — |
| Schedules | 80/100 | 62/100 | 71/100 | — | — | — | — |
| Test | Project | Domain | Base (Rubric) | FT | Δ |
|---|---|---|---|---|---|
| T1 | Hamburg Tower — Contract Clause Analysis (14 articles) | Contracts | 85/100 | 77/100 | −8 |
| T2 | AB v AP Residential — Delay Attribution & EOT | Delays | 35/100 | — | — |
| T3 | Cologne Residential — 18-Activity Schedule Generation | Schedules | 80/100 | — | — |
Novel jurisdiction / sector — tests generalisation beyond training data.
| Test | Project | Domain | Base | FT | Δ |
|---|---|---|---|---|---|
| CB1 | VAN-MIX-2025-011 — Finnish YSE 1998 mixed-use contracts | Contracts | 67/100 A=32 B=17 C=18 | 63/100 A=33 B=14 C=16 | −4 |
| CB2 | Grim Tide Offshore Wind — FIDIC FM & concurrent delay | Delays | 67/100 A=26 B=28 C=13 | — | — |
| CB3 | Northbrook Solar 50MW — EPC CPM schedule | Schedules | 62/100 A=10 B=29 C=15 D=8 | — | — |
Enhanced system prompt (explicit output schema, clause-count enforcement, anti-Accept-bias calibration, worked example) run on Base and Fine-Tune — same prompt for all configs. Scored against golden, Rubric v1.0.
| Configuration | T1 | CB1 | Combined | Δ vs base |
|---|---|---|---|---|
| Base | 85/100 | 67/100 | 76/100 | — |
| Base + Prompt Engineering | 80/100 | 70/100 | 75/100 | −1 |
| Fine-Tune | 77/100 | 63/100 | 70/100 | −6 |
| Fine-Tune + Prompt Engineering | 86/100 | 77/100 | 82/100 | +6 |
The enhanced prompt fixed CB1's C-002 key discriminator (base+PE Rejects it correctly, CB1 67→70) but cost points on T1 (80 vs 85) — it over-matched Art 2 and still failed Art 10's Rule 3. PE cannot fix the 4B base's core reasoning gaps; it sharpens what is already there.
T1 key discriminator passed: Art10 Rejected ✓. CB1 re-run on the standardised plain-text input scores 67/100 — the earlier 80 came from the non-standard docx-extraction run. C-002 KEY DIS FAIL — "Requires Review (Inferred)" not Rejected. Heavy Accept-bias on the Finnish contract: six golden-Modification clauses (C-008/011/016/017/018/019) marked Accepted. T1 (English NEC3) holds at 85; CB1 (Finnish YSE 1998) does not generalise as cleanly.
CB2 recovery after weak T2: both FIDIC key discriminators correct — DEL-OW-001 classified FM (FIDIC 19.1) and DEL-OW-002 classified Contractor risk (FIDIC 4.15). EOT=77cd slightly above 35–75 range. Demonstrates FIDIC Yellow Book understanding absent in T2.
CB3 project duration 320 vs golden 265 wd (55 wd overrun). Root cause: Activity 8 predecessor SS+10 computed as FS+10 — early start inflated from day 175 to day 230. Despite this, Activity 8 still identified as critical (key discriminator passed). Fix requires training on mixed-lag-type CPM examples.
No float consumption analysis across T2, CB2. EOT calculations omit float as a mechanism — treats all employer delays as time-entitled. T2 classified DEL-003 as concurrent; CB2 EOT overestimated. Both test the same gap: understanding that non-critical employer delays yield cost not time.
The biggest PE swing in the programme. FT+PE (82) clears Nemotron's own base (76) — a genuine best-of-both-worlds result. The enhanced prompt's clause-count enforcement and anti-Accept-bias calibration fixed the FT's over-Requires-Review collapse: CB1 63→77, T1 77→86. Fine-Tune + PE is the recommended Nemotron contracts configuration.
The earlier docx-extraction run had the FT Rejecting C-002 (the key discriminator) — reported as a headline improvement over base. The standardised plain-text re-run does not reproduce it: FT returns "Requires Review (Inferred)", the same failure as base. The "rejection hierarchy generalised" finding was an artifact of the non-standard input, not learned behaviour.
The docx run had the FT Accepting C-001 (Finnish governing law) correctly while base flagged it. The standardised re-run reverses this: base correctly Accepts C-001, the FT downgrades it to "Acceptable subject to modification". On the clean input the FT is the weaker model on this clause, not the stronger one.
FT output 23 items for a 14-article contract. Model pattern-matched to training example length rather than counting contract clauses — invented Articles 15–23 directly from DB entries that don't exist in Hamburg Tower. Base stopped cleanly at 14. Root cause: model learned "output array ≈ DB size," not "output array = contract clause count." Fix requires explicit count instruction in Pathway 4 prompt.
T1 Art6: reasoning trace correctly identified "$3M < $5M DB standard — value difference triggers modification" but output label = Accepted. Alignment failure: model learned the reasoning format but didn't wire it to the status label consistently. Separately, Art1 triggered modification for "residential vs commercial" — cosmetic difference, not a value mismatch. Over-trained on difference detection without threshold calibration.
On standardised inputs the FT scores 70 vs base 76 (T1 77 vs 85, CB1 63 vs 66). The two positive CB1 findings (C-002 Rejected, C-001 Accepted) did not survive the switch from docx-extraction to plain-text input — they were never learned behaviour. What remains is real and on the un-changed T1 run: article hallucination and over-triggered modification. Cross-clause arithmetic — C-017 CAR insurance EUR 22M = 49% of EUR 45M contract price — remains unsolved by both base and FT. Pathway 4 (FT + Prompt Engineering) needs: (1) explicit clause-count instruction, (2) explicit Rule 3 / "Completely rejected" handling, (3) explicit value-comparison step referencing the contract total price.
| Domain | T Score (standard) | CB Score (curveball) | Combined | T Score FT | CB Score FT | Combined FT | Δ |
|---|---|---|---|---|---|---|---|
| Contracts | 93/100 | 70/100 | 82/100 | 89/100 | 94/100 | 92/100 | +10 |
| Delays | 49/100 | 74/100 | 62/100 | 70/100 | 80/100 | 75/100 | +21 (FT+PE) |
| Schedules | 67/100 | 91/100 | 79/100 | 83/100 | 86/100 | 84.5/100 | +5.5 |
| Test | Project | Domain | Base (Rubric) | FT | Δ |
|---|---|---|---|---|---|
| T1 | Hamburg Tower — Contract Clause Analysis (14 articles) | Contracts | 93/100 A=49 B=18 C=26 | 89/100 A=45 B=18 C=26 | −4 |
| T2 | AB v AP Residential — Delay Attribution & EOT | Delays | 49/100 A=30 B=8 C=11 | — | — |
| T3 | Cologne Residential — 18-Activity Schedule Generation | Schedules | 67/100 A=25 B=22 C=0 D=20 | 83/100 v3 FT vanilla | +16 |
Novel jurisdiction / sector / contract form — tests framework generalisation beyond training data.
| Test | Project | Domain | Base | FT | Δ |
|---|---|---|---|---|---|
| CB1 | VAN-MIX-2025-011 — Finnish YSE 1998 mixed-use contracts | Contracts | 70/100 A=35 B=17 C=18 | 94/100 A=47 B=19 C=28 | +24 |
| CB2 | Grim Tide Offshore Wind — FIDIC FM & concurrent delay | Delays | 74/100 A=32 B=29 C=13 | — | — |
| CB3 | Northbrook Solar 50MW — EPC CPM schedule | Schedules | 91/100 A=25 B=30 C=20 D=16 | 86/100 v3 FT vanilla | −5 |
Enhanced system prompt (explicit output schema, clause-count enforcement, anti-Accept-bias calibration, worked example) run on Base and Fine-Tune — same prompt for all configs. Scored against golden, Rubric v1.0.
| Configuration | T1 | CB1 | Combined | Δ vs base |
|---|---|---|---|---|
| Base | 93/100 | 70/100 | 82/100 | — |
| Base + Prompt Engineering | 90/100 | 80/100 | 85/100 | +3 |
| Fine-Tune | 89/100 | 94/100 | 92/100 | +10 |
| Fine-Tune + Prompt Engineering | 88/100 | 89/100 | 89/100 | +7 |
v3 = pure planner-pattern training data (77 train + 11 val examples, 100% planner schema). T3 + CB3 evaluations with separate PE prompts per task. All 4 pathways tested.
| Pathway | T3 (Cologne generation) | CB3 (Solar EPC CPM) | Combined | Δ vs Base |
|---|---|---|---|---|
| Base v1 vanilla | 67/100 | 91/100 | 79/100 | — |
| FT v3 vanilla | 83/100 | 86/100 | 84.5/100 | +5.5 |
| Base + PE v3 | 77/100 | 72/100 | 74.5/100 | −4.5 |
| FT + PE v3 | 82/100 | 60/100 | 71/100 | −8 |
FT trained locally on RTX 4090 16GB via Unsloth: 74 examples, 10 epochs, ~40 min, seq 4096. Stream merge (tensor-by-tensor BF16 base + LoRA delta). 7.5B Q8 text GGUF + 1B mmproj F16 deployed. Eval params: temp=0.6, top_p=0.9, min_p=0.06 (thinking). Scoring = LLM-as-judge holistic 0–100.
| Configuration | T2 | CB2 | Combined | Δ vs base |
|---|---|---|---|---|
| Base | 50 | 58 | 54 | — |
| Base + Prompt Engineering | 70 | 62 | 66 | +12 |
| Fine-Tune | 50 | 58 | 54 | flat |
| Fine-Tune + Prompt Engineering | 70 | 80 | 75 | +21 |
The enhanced prompt lifted the Gemma base mainly on the CB1 curveball (70→80) — the schema and anti-Accept-bias calibration sharpened its Finnish-contract analysis. T1 dipped slightly (93→90). Net +3 combined: PE helps a capable base, most where the base was weakest.
T1=93/100: all 14 articles correct statuses, no hallucinated articles, Art10=Rejected with correct DB3 citation, strong reasoning traces (10.9K chars) — the strongest base T1 in the programme. CB1=70/100 on the standardised plain-text input (the earlier 78 came from the non-standard docx-extraction run): KEY DIS FAIL — C-002 returned Requires Review (Inferred) not Rejected. C-008/C-015/C-019 over-accepted (golden=Mod); C-017/C-018 under-rated to Requires Review instead of matching DB8. T1 (English NEC3) generalises better than CB1 (Finnish YSE 1998).
T2=49/100: D&W redesign correctly classified as employer-caused, but placed on the critical path and included in EOT recommendation (27 days). Golden = 0 EOT (DEL-003 is employer delay but NOT on CP — cost-only claim). Concrete Works classified as "concurrent" not "Contractor" (losing B sub-criterion). CB2=74/100: Event 2 Contractor correctly identified (KEY DIS PASS). FM cost rule error — claims weather FM gives cost entitlement but FIDIC 19.4 = time only. Concurrent period Adyard reasoning incorrect (denies weather EOT during overlap when contractor should retain it).
CB3=91/100: Only model in the programme to correctly apply SS+10 lag (ES8 = ES7+10 = 175, not EF7+10 = 245). Achieves exact 265 wd project duration (golden) and perfect critical path topology — Activity 8 correctly on CP (B section 30/30). However, backward pass error propagates negative TF values (−5) to Activities 8, 14, 15 — invalid CPM. T3=67/100: all 18 activities present, all durations within benchmark ranges, but circular dependency Act6↔Act7 (6 predecessors include "7FF" AND 7 predecessors include "6FS") collapses C section to 0/25. Same circular dep pattern seen in Phi4 and Ministral.
Gemma 4 E4B thinking traces average 7K–12K chars, significantly more than other models. The reasoning depth directly contributes to: (1) correct Art9 "Requires Review" classification in T1 (rare label — non-thinking models default to Modification); (2) correct DB clause IDs in both T1 and CB1 without null fallback; (3) correct SS+10 forward pass in CB3 (the only model to get this right). The backward pass CPM error and T3 circular dependency suggest that reasoning depth helps classification/retrieval but does not fix systematic graph-construction errors shared across all models.
Gemma's fine-tune was the strongest config in the programme (92) and never regressed — there was nothing for PE to reclaim. The enhanced prompt slightly hurt it (CB1 94→89, T1 89→88). PE is a floor-raiser, not a ceiling-raiser: on Gemma the recommended contracts config is the fine-tune alone, no PE.
Earlier save_pretrained_merged on the 4-bit base produced correct tensor shapes but corrupted values — the model emitted only <pad>/<unused> garbage at inference. Re-merged with explicit tensor math against the BF16 base (442 LoRA pairs: 294 LM + 112 vision + 36 audio; towers got zero deltas as expected for text-only training). The clean merge produces coherent output and, given <|think|> in the system prompt, opens the <|channel>thought channel itself — so the FT can think; the "no thinking" was the merge corruption, not training.
Re-run with the thought channel active (<|think|> in system, /v1/completions, Q8_0 GGUF at 16k context): T1 89/100, CB1 94/100, Combined 92 vs base 86 — a +6 improvement, not the −13 regression the earlier run showed. The 73/100 was a no-thinking eval; base Gemma 4 E4B is a thinking model, so that comparison was never valid. With thinking on the FT produces full 11.8K / 10.3K-char reasoning traces and complete clause arrays (14/14, 21/21), finish_reason=stop.
CB1 78→94 (+16). The FT passes all four CB1 key discriminators: C-002 Rejected with DB3 Rule 3 invoked explicitly — base failed this, returning Requires Review — plus C-017 / C-015 / C-012 modifications with correct DB IDs. 19/21 statuses exact, 20/21 DB IDs exact. The rejection hierarchy learned on English NEC3 training data generalised cleanly to a novel jurisdiction and contract form — the same positive transfer the Nemotron FT showed on C-002.
T1 93→89 (−4), entirely in the A section (45/50 vs base 49/50). Two status slips: Art 1 over-flagged Accepted→Modification (treated a "residential vs commercial" descriptive difference as a value mismatch) and Art 9 Termination returned Requires Review instead of Modification (missed the DB6 match, emitted a null DB ID). Reasoning quality held — C=26/30, every key article 6/6 except Art 9.
With an apples-to-apples (thinking) eval, the Gemma 4 E4B contracts FT does not regress — it improves +6 combined, driven by a +16 swing on the curveball. Two operational lessons: (1) a thinking base model must be evaluated in thinking mode or the comparison is invalid; (2) the ~4k-token contract prompts overflow LM Studio's 4096 default context — the model must be loaded at ≥8k (16k used here) or the thought channel never closes and content comes back empty.
T3 FT v3 cites P1 Frankfurt as historical reference (correct DB project, vs v2's hallucinated "Berlin"). duration_wd 309 (closer to golden 370 than v2's 420). Format clean, no broken syntax. v3 planner-only training (T3+CB3 task-pattern dropped) eliminated overfit. CB3 FT v3 maintains CPM strength (PD=265 ✓, CP matches golden, A8 critical ✓).
FT+PE CB3 dropped to 60/100 (vs FT vanilla 86): explicit "TF ≥ 0" rule overwhelmed model — emitted critical=true with negative TF and dropped critical key on some activities. PE T3 prompts triggered multi-DB blending (good) but kept P1 Frankfurt's circular 6↔7 dep verbatim (bad). For Gemma 4 v3, the verdict: vanilla FT, leave PE off.
Three iterations: v1 (113 mixed Qwen-think format), v2 (113 with 8 planner envelopes added — minority signal drowned), v3 (77 planner-only — T3+CB3 examples dropped to avoid pattern contamination, eval measures true generalisation). Avg asst length 4075→1017 chars (−75%). Training: 100 steps / 10 epochs / 8.7 min / final loss 0.90 (vs v2 0.95).
CB2 80/100 — best CB2 of all 8 configs across both Gemma and Ministral 3B cycles. Weather neutral FM EOT-yes cost-no ✓. Vessel contractor 42d minus 26 concurrent = 16 days LD ✓ (Adyard correct). TP disputed FM → 38d partial/negotiated ✓ (golden ~40) — first config in the entire programme to call partial, not 0 or 77. Recommended EOT 73 vs golden 75 — within 2 days.
FT vanilla scored ~same as base vanilla (54 / 54). Gemma 4 E4B base is the strongest text model in this programme; FT on 74 examples didn't pull it past where it already sat. PE (the floor-raiser) did the lifting: +12 on base, +21 on FT.
save_pretrained_merged failed → stream merge
Unsloth's merge corrupted weights for the locally 4-bit-trained delays adapter (cloud-trained contracts adapter worked). convert_lora_to_gguf.py choked on Mistral3-style vision-tower tensors. Final path: tensor-by-tensor stream merge — merged = base_fp16 + (α/r) · (B @ A) applied to 442 LoRA targets one at a time. No Linear4bit dequant, no PEFT vision-tower injection.
| Domain | T Score (standard) | CB Score (curveball) | Combined | Fine-Tuned | Δ |
|---|---|---|---|---|---|
| Contracts | 33/100 | 20/100 | 27/100 | — | — |
| Delays | 50/100 | 63/100 | 57/100 | — | — |
| Schedules | 78/100 | 52/100 | 65/100 | — | — |
| Test | Project | Domain | Base (Rubric) | FT | Δ |
|---|---|---|---|---|---|
| T1 | Hamburg Tower — Contract Clause Analysis (14 articles) | Contracts | 33/100 A=15 B=9 C=9 | — | — |
| T2 | AB v AP Residential — Delay Attribution & EOT | Delays | 50/100 A=19 B=18 C=13 | — | — |
| T3 | Cologne Residential — 18-Activity Schedule Generation | Schedules | 78/100 A=25 B=28 C=14 D=11 | — | — |
Novel jurisdiction / sector / contract form — tests framework generalisation beyond training data.
| Test | Project | Domain | Base | FT | Δ |
|---|---|---|---|---|---|
| CB1 | VAN-MIX-2025-011 — Finnish YSE 1998 mixed-use contracts | Contracts | 20/100 A=15 B=1 C=4 | — | — |
| CB2 | Grim Tide Offshore Wind — FIDIC FM & concurrent delay | Delays | 63/100 A=26 B=27 C=10 | — | — |
| CB3 | Northbrook Solar 50MW — EPC CPM schedule | Schedules | 52/100 A=17 B=14 C=13 D=8 | — | — |
T1 output contained 18 articles from a 14-article contract — 4 hallucinated. Nearly all articles defaulted to "Accepted" regardless of content. Art10 (client-side jurisdiction clause, KEY DIS) marked Accepted and matched to DB15 (Permits) — completely wrong. Art8 correctly identified as matching DB3 (rejected dispute clause) but still marked Accepted. CB1 re-run on the clean plain-text input: 20/100 — the earlier 29 was an answer-leaked docx run (the docx "Notes / Flaws" column had handed it C-002's defect). KEY DIS now FAILS — C-002 Accepted, not Rejected. DB ID matching near-total collapse (B=1/20 — model returned "0" for all 21 IDs). Severe Accept-bias (16/21 Accepted).
Unique result: DEL-003 (D&W redesign, employer-caused) correctly classified as non-critical with no EOT entitlement (eot_entitlement=false) — no other scored model achieved this on T2. However, DEL-001 (mobilization) and DEL-004 (ceramics) not identified, and project summary contradicts event-level analysis (overall_eot_entitlement=true despite DEL-003 correctly zero). CB2 very brief (1370 chars, 3s generation time) — Adyard principle named but applied incorrectly, Event 2 EOT=42 instead of 0.
T3 is the strongest result (78/100): all 18 activities ✓, all durations within benchmark range ✓, no circular predecessor dependencies ✓ (only model besides Nemotron to avoid this). Benchmark justifications reference project parameters (sand soil, 4 floors, 1500m²). CB3 CPM calculation broken: Activity 8 SS+10 misread as FS (ES=235=EF7, not ES7+10=190) → not on critical path (KEY DIS FAIL). Activity 7 LS=175 < ES=180 (invalid CPM). Activity 14 TF=5 but marked critical=true (inconsistency). Duration=295 vs 265 golden (30 wd error, within ≤50 band).
1. Contracts status defaulting — model must learn to apply the STATUS DECISION HIERARCHY rather than marking everything Accepted. 2. Article count discipline — never output more articles than the contract contains. 3. CPM lag type — SS+lag means add lag to predecessor ES, not EF. 4. Delays completeness — all events in the as-built schedule must be assessed, not just the most obvious ones. 5. Consistent project summary — summary conclusions must align with event-level analysis.
| Domain | T Score (standard) | CB Score (curveball) | Combined | T Score FT | CB Score FT | Combined FT | Δ |
|---|---|---|---|---|---|---|---|
| Contracts | 88/100 | 89/100 | 89/100 | 76/100 | 79/100 | 78/100 | −11 |
| Delays | 56/100 | 53/100 | 55/100 | — | — | — | — |
| Schedules | 75/100 | 96/100 | 86/100 | 86/100 | 67/100 | 76.5/100 | −9.5 |
| Test | Project | Domain | Base (Rubric) | FT | Δ |
|---|---|---|---|---|---|
| T1 | Hamburg Tower — Contract Clause Analysis (14 articles) | Contracts | 88/100 A=44 B=18 C=26 | 76/100 A=32 B=18 C=26 | −12 |
| T2 | AB v AP Residential — Delay Attribution & EOT | Delays | 56/100 | — | — |
| T3 | Cologne Residential — 18-Activity Schedule Generation | Schedules | 75/100 | 86/100 v3 FT vanilla | +11 |
Novel jurisdiction / sector — tests generalisation beyond training data.
| Test | Project | Domain | Base | FT | Δ |
|---|---|---|---|---|---|
| CB1 | VAN-MIX-2025-011 — Finnish YSE 1998 mixed-use contracts | Contracts | 89/100 A=43 B=20 C=26 | 79/100 A=39 B=18 C=22 | −10 |
| CB2 | Grim Tide Offshore Wind — FIDIC FM & concurrent delay | Delays | 53/100 A=23 B=18 C=12 | — | — |
| CB3 | Northbrook Solar 50MW — EPC CPM schedule | Schedules | 96/100 A=25 B=30 C=25 D=16 | 67/100 v3 FT vanilla | −29 |
Enhanced system prompt (explicit output schema, clause-count enforcement, anti-Accept-bias calibration, worked example) run on Base and Fine-Tune — same prompt for all configs. Scored against golden, Rubric v1.0.
| Configuration | T1 | CB1 | Combined | Δ vs base |
|---|---|---|---|---|
| Base | 88/100 | 89/100 | 89/100 | — |
| Base + Prompt Engineering | 97/100 | 90/100 | 94/100 | +5 |
| Fine-Tune | 76/100 | 79/100 | 78/100 | −11 |
| Fine-Tune + Prompt Engineering | 84/100 | 90/100 | 87/100 | −2 |
v3 = pure planner-pattern training data (77 train + 11 val, 100% planner schema). Same dataset as Gemma 4 v3. Qwen 3.5 9B trained at seq=4096, 100 steps / 10 epochs / 28.2 min / final loss 1.39. All 4 pathways tested with separate PE prompts per task.
| Pathway | T3 (Cologne generation) | CB3 (Solar EPC CPM) | Combined | Δ vs Base |
|---|---|---|---|---|
| Base v1 vanilla | 67/100 | 91/100 | 79/100 | — |
| FT v3 vanilla | 86/100 | 67/100 | 76.5/100 | −2.5 |
| Base + PE v3 | 77/100 | 98/100 | 87.5/100 | +8.5 ✓ |
| FT + PE v3 | 81/100 | 65/100 | 73/100 | −6 |
The enhanced prompt took the Qwen base to the top score in the contracts programme: T1 88→97 (14/14 DB IDs, 30/30 reasoning), CB1 89→90. Base + PE (94) beats every other Qwen configuration including the fine-tune. For Qwen, the play is to skip fine-tuning entirely and prompt-engineer the base.
Art10 correctly Rejected with explicit Rule 3 invocation and DB3 cited. Longest thinking output in the T1 programme (41K chars) — thorough per-clause reasoning. Art3 DB ID = DB0 correct (vs Gemma's DB1 error). Three status errors (Arts 4/9/13) stem from over-strict interpretation: requires explicit "Modify to X" instruction, does not treat "Ensure..." notes as modification triggers.
Model correctly identifies DB3 is "completely rejected" in reasoning but argues C-002 (negotiation-only, no arbitration) does not match the DB3 arbitration-specific entry. Misses Rule 3: applies to the Dispute Resolution category regardless of specific mechanism. Labels C-002 "Requires Review (Inferred)" not Rejected. 19K-char reasoning trace — thorough but wrong on the pivotal clause.
Exact 265 wd duration, perfect critical path [1→2→3→5→6→7→8→14→15→17→18]. SS+10 correctly applied (ES8=175), SS+15 correctly applied (ES13=110). Backward pass also correct — no negative TF values, fixing the error Gemma made. Multi-predecessor merge logic correct throughout. The standout CPM result of the base programme.
T2: DEL-003 (employer-caused redesign) placed on critical path, EOT=47 vs golden 0 — key discriminator fail. CB2: weather FM classified as foreseeable contractor risk, EOT=0 outside the 35–75 cd range. T3: circular predecessor chains in MEP second-fix activities (Act10↔11, Act12↔13, Act17↔18) → C section 0/25 — same failure pattern as Phi4, Ministral, Gemma.
PE reclaimed most of the fine-tune's −11 contracts regression: CB1 79→90, T1 76→84. The enhanced prompt's exact-status-string reminder also killed the temp-0.6 invalid-status defect. But FT+PE (87) still trails base (89) and base+PE (94) — PE rescues the regressed fine-tune without making it the better choice.
The FT systematically outputs "Accepted subject to modification" — not the valid "Acceptable subject to modification" — on 7 of 14 T1 articles. Intent is unambiguous (the modification status) but it fails the exact-string requirement the rubric explicitly tests, scored at 50% credit per affected article. A-section 44→32. Art10 key discriminator still PASSES (Rejected + DB3 + Rule 3).
FT produces full reasoning traces (8.7K chars T1, 15.2K chars CB1) with systematic per-clause rule application. Re-run at the standardised params, CB1 emitted all 21 clauses — the 17/21 truncation seen in an earlier run did not recur. The cloud pipeline (train → tensor-math merge → GGUF) is clean for Qwen's transformer architecture.
CB1 17/21 statuses correct. C-002 KEY DIS FAIL — Requires Review not Rejected; the FT, like base, has no Rejected training examples so cannot apply Rule 3. Status drift on 3 clauses: C-001 over-modified (golden Accepted), C-018 under-rated (RR vs Modification), C-019 over-accepted. DB IDs strong (19/21). C-017 CAR-insurance 49%-undervalue still not computed.
The ~140-example SFT regressed the model on contracts vs base (78 vs 86). The headline T1 loss is a format defect — an invalid status string at temp=0.6 — not a reasoning failure; the reasoning traces remain sound. CB1 loss is smaller (−4) and shared with base (C-002 Rule 3). Pathway 4 (FT + Prompt Engineering) needs: (1) exact-status-string enforcement in the prompt, (2) explicit Rule 3 / "Completely rejected" handling, (3) explicit value-comparison step for cross-clause arithmetic.
Qwen 3.5 9B base + PE CB3: PD 265 ✓ matches golden exactly, CP 1→2→3→5→6→7→8→14→15→17→18 ✓ matches golden exactly, all TF ≥ 0, critical flags consistent w/ TF=0, A8 critical ✓ (KEY DIS PASS), SS+10 + SS+15 correctly applied. The explicit CPM rules in the PE prompt let the strong base reasoner execute deterministic math without confusion. For Qwen on schedules, skip fine-tuning entirely and prompt-engineer the base.
T3 FT v3: CP includes 9, 10, 11 (better coverage than Gemma); SS+lag predecessor chains used aggressively. duration_justifications cite historical ranges from DB but not specific project names (Gemma cited P1 Frankfurt). Skeleton 250 wd (Gemma 200) — both over actual ~144 but explained.
Qwen FT CB3 PD=284 (golden 265, +7%): A2 ES=15 (should be 0 because 1SS), cascade breaks rest. CP collapsed to [15, 18]. Same bug appears in FT+PE CB3 (PD 315). Bug not present in base vanilla or base+PE → FT specifically corrupted SS handling. Future v4 should add SS micro-cases to training.
Top pathway per model: Gemma 4 = FT vanilla 84.5 / Qwen 3.5 9B = Base+PE 87.5. PE helps Qwen base dramatically (vs Gemma where PE hurt). Different model architectures respond differently to explicit CPM rule injection — Qwen base reasons crisply with structure, Gemma base benefits more from FT pattern absorption.
| Domain | T Score | CB Score | Combined | FT |
|---|---|---|---|---|
| Contracts | 78/100 | 54/100 | 66/100 | — |
| Delays | 61/100 | 70/100 | 66/100 | FT+PE 79 |
| Schedules | 62/100 | 62/100 | 62/100 | — |
| Overall | 67/100 | 62/100 | 65/100 | — |
| Test | Project | Domain | A | B | C | D | Total |
|---|---|---|---|---|---|---|---|
| T1 | Hamburg Tower — Contract Clause Analysis (14 articles) | Contracts | 43/50 | 15/20 | 20/30 | — | 78/100 |
| T2 | AB v AP Residential — Delay Attribution & EOT | Delays | 35/35 | 13/40 | 13/25 | — | 61/100 |
| T3 | Cologne Residential — 18-Activity Schedule Generation | Schedules | 25/25 | 30/30 | 0/25 | 7/20 | 62/100 |
Novel jurisdiction / sector / contract form — tests framework generalisation beyond training data.
| Test | Project | Domain | A | B | C | D | Total |
|---|---|---|---|---|---|---|---|
| CB1 | VAN-MIX-2025-011 — Finnish YSE 1998 mixed-use (21 clauses) | Contracts | 22/50 | 16/20 | 16/30 | — | 54/100 |
| CB2 | Grim Tide Offshore Wind — FIDIC FM & concurrent delay | Delays | 26/35 | 28/40 | 16/25 | — | 70/100 |
| CB3 | Northbrook Solar 50MW — EPC CPM schedule | Schedules | 17/25 | 12/30 | 25/25 | 8/20 | 62/100 |
T1: Art10 correctly Rejected with DB3 cited ✓. CB1 re-run on the clean plain-text input: 54/100 (A=22 B=16 C=16). The earlier 77/100 was an answer-leaked docx run — the docx carries a "Notes / Flaws" column stating every clause's defect. C-002 returned "Requires Review (Inferred)" not Rejected — KEY DIS FAIL. With the answer key removed the model collapses to heavy Accept-bias: ~15/21 clauses marked Accepted, including most golden-Modification and golden-Requires-Review clauses. DB ID accuracy holds up better than status (B=16/20). C-015 DLP and C-017 CAR-insurance undervalue both missed.
All 4 delay events correctly identified with right activity, duration within ±10 wd, and correct responsibility assignment — first model in the programme to score 35/35 in T2 section A. DEL-001 Mobilization (employer, 5 wd), DEL-002 Concrete (contractor, 35 wd), DEL-003 Doors/Windows (employer, 52 wd), DEL-004 Ceramic (contractor, 18 wd) all captured. CB2 also strong (70/100): DEL-OW-002 Contractor correctly identified via FIDIC 4.15, EOT=35 cd within 35–75 range, Adyard principle cited with weather EOT maintained during concurrent period.
Backward pass complete for all 18 activities with all TF ≥ 0. Gemma's CB3 had negative TF (−5) for acts 8/14/15 from backward pass error. Ministral 14B avoids this. Input durations correctly carried through, ES+D=EF consistent throughout. The CPM structural validity section is the strongest individual sub-score in Ministral 14B's CB3 result.
T2: DEL-003 (Doors/Windows employer redesign) critical=true → KEY DIS fail; EOT=45 not 0. CB2: DEL-OW-001 weather classified as "neutral" (partially correct) but cost_entitlement=true — wrong per FIDIC 19.4 (FM = time only, no cost). Adyard application partially correct: weather EOT maintained during concurrent period but concurrent period reasoning confused. Consistent with delays weakness across the programme.
T3: 3-way circular dep 15↔16↔17 (Painting needs Elec Second Fix SS, Elec Second Fix needs Plumbing Second Fix FF, Plumbing Second Fix needs Painting FF) → C=0/25. Same second-fix chain failure as Phi4/Gemma/Qwen. All 18 durations in range and justified — generation quality fine, predecessor graph broken. CB3: SS+10 misread: ES8=165 (used Act7 start) instead of 175 (Act7 start + 10). Duration=290 vs 265 golden (25 wd over). Activity 8 TF=50, critical=false → KEY DIS fail. All outputs wrapped in markdown code fence (D=7/20 in T3).
FT trained locally on RTX 4090 16GB via Unsloth (raw transformers+PEFT+BNB crashed "CUDA driver error" mid-forward — Unsloth's native Mistral3 patching worked). 4-bit NF4 QLoRA, seq 4096, 10 epochs, 28.3 min, final train loss 1.2138. Stream merge (tensor-by-tensor, name-swap fix for adapter model.language_model ↔ base language_model.model). Q6_K GGUF (11.1 GB, matches base quant) via F16→llama-quantize two-step. Eval params: temp=0.6, top_p=0.9, min_p=0.06 (thinking). Scoring = LLM-as-judge holistic 0–100.
| Configuration | T2 | CB2 | Combined | Δ vs base |
|---|---|---|---|---|
| Base | 58 | 62 | 60 | — |
| Base + Prompt Engineering | 72 | 75 | 74 | +14 |
| Fine-Tune | 70 | 70 | 70 | +10 |
| Fine-Tune + Prompt Engineering | 78 | 80 | 79 | +19 |
First config across the entire 14B cycle to correctly compute the Adyard offset. Weather 35 FM EOT-yes ✓. Concurrent period → EOT yes, no cost ✓ (Adyard). Vessel breakdown net contractor LD = 42 − 26 = 16 days ✓. TP fire: granted full 77 days EOT (golden ~40 partial) — only the Gemma FT+PE called the partial correctly.
Same pattern as Gemma: PE = floor-raiser, FT alone respectable, FT+PE = the win. Base+PE (74) already beats FT-alone (70), reinforcing that this task responds more to better prompting than to small-dataset SFT for a strong 14B base. FT+PE (79) extends the win further — the two stack additively here.
D&W Installation marked critical (golden non-critical → 0 EOT). TP fire: 0 or 77, never the negotiated ~40 (only Gemma FT+PE got partial). PE-introduced regression: base+PE and FT+PE both flipped Concrete responsibility employer↔contractor (vanilla configs had it right). These exact same discriminators failed across both Gemma and Ministral 14B — strong signal they need explicit handling in training data, not in prompts.
The 14B Reasoning model fits 16 GB VRAM at 4-bit NF4 + seq 4096 + grad checkpointing (~14 GB used steady). Raw transformers + PEFT + BNB crashes "CUDA driver error: device not ready" in both BNB dequant and SDPA softmax — Unsloth's native Ministral3 patching bypasses this. Per-epoch eval disabled (accelerate fp32 conversion OOMs on logits [B, 4096, 131072]). Stream merge is the reliable path on locally 4-bit-trained adapters.
v3 = pure planner-pattern dataset (77 train + 11 val, harmony format). Local laptop training blocked twice (Unsloth fused-CE SM89 incompat + transformers MXFP4 training guard) — fell back to Lambda Labs 1× A100 40GB ($1-2 total). Unsloth path worked on A100 (compute 8.0, no SM89 issue). Final loss 9.43. MXFP4 GGUF (13 GB) pushed to HF private repo (AshrafMMahdy/gpt-oss-20b-schedules-ft-v3) then downloaded locally for LM Studio. Inference: temp=0.6, top_p=0.9, min_p=0.06, reasoning_effort=low, max_tokens=30k.
| Pathway | T3 (Cologne generation) | CB3 (Solar EPC CPM) | Combined | Δ vs Base+PE |
|---|---|---|---|---|
| FT v3 vanilla | 80/100 | 98/100 | 89/100 | +7 |
| FT + PE v3 | 87/100 | 98/100 | 92.5/100 | +10.5 |
| Base + PE v3 | 81/100 | 83/100 | 82/100 | — |
| Domain | T Score (standard) | CB Score (curveball) | Combined | T Score FT | CB Score FT | Combined FT | Δ |
|---|---|---|---|---|---|---|---|
| Contracts | 87/100 | 69/100 | 78/100 | — | — | — | — |
| Delays | 51/100 | 54/100 | 53/100 | — | — | — | — |
| Schedules | 81/100 | 74/100 | 78/100 | 80/100 | 98/100 | 89/100 | +11 |
| Test | Project | Domain | A | B | C | D | Total |
|---|---|---|---|---|---|---|---|
| T1 | Hamburg Tower — Contract Clause Analysis (14 articles) | Contracts | 42/50 | 19/20 | 26/30 | — | 87/100 |
| T2 | AB v AP Residential — Delay Attribution & EOT | Delays | 30/35 | 10/40 | 11/25 | — | 51/100 |
| T3 | Cologne Residential — 18-Activity Schedule Generation | Schedules | 25/25 | 24/30 | 12/25 | 20/20 | 81/100 |
Novel jurisdiction / sector / contract form — tests framework generalisation beyond training data.
| Test | Project | Domain | A | B | C | D | Total |
|---|---|---|---|---|---|---|---|
| CB1 | VAN-MIX-2025-011 — Finnish YSE 1998 mixed-use contracts | Contracts | 34/50 | 17/20 | 18/30 | — | 69/100 |
| CB2 | Grim Tide Offshore Wind — FIDIC FM & concurrent delay | Delays | 25/35 | 24/40 | 5/25 | — | 54/100 |
| CB3 | Northbrook Solar 50MW — EPC CPM schedule | Schedules | 25/25 | 12/30 | 25/25 | 12/20 | 74/100 |
T1: Art10 correctly Rejected with explicit Rule 3 invocation; reasoning quality strong (Art10/7/6/3 all 6/6 in C section). CB1 re-run on the clean plain-text input: C-002 returned "Requires Review (Inferred)" not Rejected — KEY DIS FAIL. On clean input GPT OSS no longer passes the CB1 discriminator; CB1 70→69, Contracts combined 79→78.
All 18 activities present with correct names. All 18 durations within benchmark ranges. No circular deps (only Granite, Nemotron, and GPT OSS 20B achieve this). Valid JSON output, no markdown fence penalty. All-FS predecessor chain is weak but avoids the 15↔16↔17 second-fix trap seen in Phi4, Gemma, Qwen, Ministral.
Model correctly applied SS+10 lag type: Act7 starts at 165, ES8=165+10=175, EF8=220. Project duration exactly matches golden (265 wd). Same forward pass accuracy as Gemma and Qwen. CB3 C section perfect (25/25): all positive TF, correct ES/EF consistency, backward pass populated.
DEL-003 (Doors and Windows, employer-caused) marked impact_on_completion=true. Golden: DEL-003 is non-critical → cost-only claim, zero EOT. Model recommends 1.5-month EOT to contractor. Only 88 reasoning tokens in T2 — insufficient analysis depth for CP reasoning.
Forward pass correct (ES8=175, EF8=220) but backward pass propagates LF8=260 instead of 220. Model used project-end-routed backward path ignoring Act14 actual constraint (LF14=240). Result: TF8=40, critical=false. Critical path reduced to [15,17,18] only (3 of 11 golden activities). Positive TF throughout (no negatives) — cleaner than Gemma's backward pass but critical path identification wrong.
CB2: 29 reasoning tokens, 758 chars output — no concurrent delay analysis (Nov 1-26 Adyard window not identified), FIDIC 19.4 FM cost rule wrong (cost=yes, should be time-only). T2: 88 reasoning tokens — DEL-003 critical path analysis skipped. reasoning_effort=low adequate for contract label lookup but insufficient for EOT and CP reasoning tasks requiring iterative float computation.
FT and FT+PE both produce perfect CB3 (98/100): PD=265 exact, CP=[1,2,3,5,6,7,8,14,15,17,18] exact golden match, A8 critical ✓, SS+10/SS+15 correctly applied, A1 critical=true (Gemma got TF=-15 here). T3 FT+PE 87/100 uses multi-DB blending (P1/P2/P4/P5/P8) with scale rationale per activity. The 20B MoE base + planner-pattern FT + PE prompt = best combination of all 3 models tested.
Local RTX 4090 Laptop (compute 8.9, SM89) hit two independent blockers: (1) Unsloth fused CE Triton kernel CUDA driver error on Ada Lovelace, (2) transformers MXFP4 quantizer's hard training guard. A100 (compute 8.0) bypassed both. Pipeline: SCP scripts+data → instance setup → download weights → train 100 steps/10 epochs (35.6 min) → MXFP4 native merge via Unsloth save_pretrained_merged save_method="mxfp4" → convert_hf_to_gguf.py → HF private repo upload → local download → LM Studio deploy. Total cycle ~60 min, ~$2 cost.
FT vanilla used mostly FS predecessors with no SS+lag parallelism, inflating duration. PE prompt fixed this: FT+PE T3 = 474 wd (-30% vs vanilla, +28% vs golden), with multi-DB blending (P1/P2/P4/P5/P8) and FF chains. Base+PE T3 = 302 wd (-18% under golden) — closest to target, but all-18-activities-critical CP overcall.
MXFP4 GGUF (13 GB): HF AshrafMMahdy/gpt-oss-20b-schedules-ft-v3 (private). LM Studio deploy: gpt-oss-20b-schedules-ft. Scripts: train_gpt_oss_schedules.py (Unsloth), convert_gpt_oss_schedules_data.py (harmony converter), eval_gpt_oss_schedules_pathways.py. See Methodology → Links section for reproducibility bundle.
| Domain | Description | Train | Val |
|---|---|---|---|
| Contracts | Hamburg Tower ground truth + CUAD open-source + cross-contract synthetic + status classification drills | 140 | 24 |
| Delays | TIA, float absorption, Force Majeure classification, concurrent delay, FIDIC/JCT/NEC analysis | 74 | 11 |
| Schedules (v3) | Pure planner pattern — activity selection, sequence logic (FS/SS/FF + lag), duration justification cited to historical reference projects. Deterministic CPM scaffold computes ES/EF/LS/LF/TF. | 77 | 11 |
| Total | 291 | 46 | |
All models queried via LM Studio local API (http://localhost:1234/v1/chat/completions). Full scenario inputs (complete contracts, delay schedules, CPM networks) submitted as-is. Responses evaluated by LLM judge against pre-computed golden answers with full reasoning traces. No keyword matching — judgement is against structured golden answer JSON.
| Test | Project | Domain | Description |
|---|---|---|---|
| T1 | Hamburg Tower — New Contract | Contracts | 14-article NEC3-style contract, 4-status clause classification, gap identification, modification recommendations |
| T2 | AB v AP Residential — Delay Schedule | Delays | As-planned vs as-built analysis, EOT entitlement per event, concurrent delay assessment, contractor/employer responsibility |
| T3 | Cologne Residential — 18-Activity Schedule Generation | Schedules | Create baseline schedule from historical DB: name all 18 activities, assign durations (justified from benchmark data), set predecessor relationships using P6-style FS/SS/FF notation. Output raw JSON. |
LLM-as-judge evaluation against golden answer JSON files. Criteria weighted by domain: Contracts (status classification, modification recommendations, gap identification); Delays (event classification, EOT quantum, concurrency analysis); Schedules (ES/EF/LS/LF/TF correctness, critical path accuracy). Partial credit for near-correct answers. Each scenario scored 0–100%.
| Component | Tool | Notes |
|---|---|---|
| Training | Unsloth + PyTorch 2.6.0 + Transformers 5.5.0 + PEFT | Local GPU, no cloud · 2–5× faster than vanilla PEFT |
| GGUF export | Unsloth built-in GGUF export | Q8_0 quantisation · KV tensor duplication for shared-layer models |
| Inference | LM Studio (local REST) | 100% on-device |
| Hardware | NVIDIA GPU 16 GB VRAM, CUDA 12.6 | Consumer grade |
| Precision | FP16 (Qwen) / BF16 (Nemotron) | Native model dtypes |
| OS | Windows | Limits Triton/mamba-ssm |
Below is a test-by-test breakdown of every evaluation task: what it tests, how the score is computed, why it matters, and known limitations that affect result interpretation.
T1–T3 are one comprehensive test per domain on realistic construction scenarios. Scores compared base vs fine-tuned across all pathway configurations.
| Attribute | Detail |
|---|---|
| What | Model receives a full 14-article NEC3-style construction contract. For each article, model must: (1) classify status (Accept / Accept with Modification / Flag for Careful Review / Reject / Gap Identified), (2) identify the at-risk party, (3) state the risk, and (4) recommend modifications where needed. |
| How scored | LLM judge vs golden answer JSON. Per article: status classification (1.0/0.5/0), party identification, risk description accuracy, modification recommendation quality. Weighted average across 14 articles. |
| Why | Contract review is the primary commercial use-case. Directly tests the model's ability to identify commercially unacceptable clauses before signing. |
| Calibration | Note: Art 9 (termination for convenience) appeared in training data — treat Art 9 scores with caution. All other articles are generalization. |
| Attribute | Detail |
|---|---|
| What | Model receives a 20-activity residential project schedule with baseline and actual dates, plus a narrative of 3 delay events. Model must: (1) attribute each event (Employer/Contractor/Neutral), (2) calculate critical path impact, (3) assess concurrent delay, and (4) state EOT entitlement. |
| How scored | LLM judge vs golden answer. Event attribution (1.0/0.5/0 each), EOT quantum (exact/±1 day/wrong), concurrency analysis (binary), critical path impact reasoning (qualitative). |
| Why | Delay attribution and EOT calculation are required for claims. Key skill: recognising that Employer delays to non-critical activities don't entitle EOT. |
| Calibration | The exact Munich Tower schedule data appeared in delays training data — this test is contaminated for any model fine-tuned on delays domain. Base model results are clean. |
| Attribute | Detail |
|---|---|
| What | Model receives 3 historical residential projects + benchmark summary and must create a complete baseline schedule for a new project: Cologne, Germany, 2022, Sand soil, EUR 35M, 1500 m², 4 floors. Must name all 18 standard activities, assign durations (justified from benchmarks), and set predecessor relationships using P6-style notation. |
| How scored | A (25pts) Activity completeness — all 18 standard names present. B (30pts) Duration validity — each within historical benchmark range. C (25pts) Predecessor logic — valid construction sequencing, no circular dependencies, mix of FS/SS/FF. D (20pts) Output format — valid parseable JSON with all required fields. Golden: 370 wd, CP 1→2→3→5→7→9→11→13→12→14. |
| Why | Schedule creation from benchmarks is the core skill the model is trained on. CPM arithmetic is tested separately in CB3. T3 isolates planning judgment: activity selection, duration calibration, sequencing logic. |
| Calibration | No contamination — Cologne project is synthetic, not present in training data. Historical DB (Projects 2/3/4) is embedded in the test message, same format as training. |
CB tests use completely different projects, jurisdictions, and contract forms from the training data. A model that only memorized training examples will fail here.
| Attribute | Detail |
|---|---|
| What | 21 contract clauses under Finnish YSE 1998 general conditions (EUR 45M mixed-use development, Vantaa). 3 critical omissions (performance bond, Force Majeure clause, IP ownership). Jurisdiction, terminology, and contract law entirely different from training data (Hamburg Tower was English-law NEC3). |
| How scored | LLM judge vs golden answer. Per clause: status classification (Accept/Modify/Reject), party at risk, risk description. Gap identification scored separately. Partial credit for adjacent status. |
| Calibration | Answer-leakage: the source VAN-MIX .docx carries a "Notes / Flaws" column stating each clause's defect — eval scripts that extracted the docx fed the model the answer key. All CB1 scores were re-run on the stripped plain-text input (cb1_test_file.txt); the docx-extraction runs are invalid and have been replaced. |
| Attribute | Detail |
|---|---|
| What | 3 delay events on a GBP 180M North Sea offshore wind farm: (1) exceptional marine weather (Force Majeure), (2) jack-up vessel dry-dock breakdown (Contractor risk), (3) transition piece supply chain delay from factory fire (arguable FM). Tests FIDIC Force Majeure Clause 19.1, concurrent delay analysis under English law (Adyard principle). |
| How scored | LLM judge vs golden answer. Per event: FM vs Contractor vs disputed classification, EOT entitlement, concurrent delay treatment, additional cost entitlement. Recommended EOT: 75 calendar days. |
| Attribute | Detail |
|---|---|
| What | 18-activity solar farm EPC schedule (Lincolnshire, UK, NEC3 Option A, GBP 35M). Non-building project type — tests model ability to reason about solar EPC logic rather than template-matching residential/building sequences. Critical path runs through PV installation workstream; Activity 8 (DC String Cabling, SS+10 relationship) is critical — counterintuitive key scoring point. |
| How scored | LLM judge vs golden answer. ES/EF/LS/LF/TF for all 18 activities, critical path identification (including Activity 8), project duration = 265 working days (inside 280 wd target). DNO grid connection float risk flagged. |
All models are evaluated across 4 configurations per domain: 2 baseline tests and 2 fine-tuned tests. Thinking models run with thinking enabled throughout.
Fixed parameters applied at evaluation time. Thinking models run at temperature 0.6; non-thinking at 0.15. All other parameters constant across models. Parameters are set in LM Studio before each evaluation session — not passed via API — ensuring the model's inference configuration is validated end-to-end.
| Temperature | 0.6 |
| Top P | 0.9 |
| Min P | 0.06 |
| Thinking | Enabled |
| Temperature | 0.15 |
| Top P | 0.9 |
| Min P | 0.06 |
| Thinking | Disabled |
| Inference server | LM Studio |
| Quantisation | Q8_0 GGUF |
| GPU | RTX 4090 Laptop 16GB |
| Streaming | SSE (no read timeout) |
Every test scenario is paired with a golden-answer JSON file authored before any model is evaluated. The golden answer specifies the correct output for every field, the scoring criteria for that field (full, partial, or zero credit), and the key discriminators — the items designed to separate genuine domain understanding from surface pattern-matching.
Scoring uses no keyword matching and no exact string comparison. A judge LLM reads the complete model output — including the full reasoning trace in the reasoning_content field — and scores it against the golden-answer criteria. A model can therefore earn partial credit for correct reasoning under an incorrect label, and lose credit for a correct label produced through circular or empty reasoning.
Each domain carries one or two items designated as key discriminators. These are heavily weighted because passing them is a binary signal: the model either understands the core concept or it does not. Examples include Article 10 in Contracts (the only Rejected article, requiring an explicit Rule 3 invocation), DEL-003 in Delays (employer-caused but non-critical, cost-only claim), and Activity 4 on the Critical Path in Schedules (a counterintuitive Finish-to-Finish chain dependency).
Two axes are scored independently per item: the correctness of the output label or numeric value, and the correctness of the supporting reasoning. A model that emits the wrong label but with sound reasoning earns partial credit, and a model that emits the correct label through circular or empty reasoning earns a reduced score. This separation prevents inflated scores from lucky guesses.
| Component | Weight | Scoring Method |
|---|---|---|
| A. Status Label Accuracy | 50 pts | Per-article weighted scores with partial credit ladder (see below) |
| B. DB Clause ID Accuracy | 20 pts | Correct ID = full; correct category wrong ID = 50%; wrong category = 0% |
| C. Reasoning Quality | 30 pts | 5 key articles × 6 pts: correct rule cited + clause element + DB reference |
| Article | Weight | Why |
|---|---|---|
| Art 10 ⭐ Key Discriminator | 8 pts | Only Rejected article — requires explicit Rule 3 invocation |
| Art 7 | 6 pts | Two-clause split, non-trivial classification |
| Art 6, 3, 9 | 4 pts each | Multi-condition or rare-label clauses |
| Other 9 articles | 2 pts each | Standard single-condition clauses |
Partial Credit Ladder (per article):
| Predicted | vs Accepted | vs Modification | vs Requires Review | vs Rejected |
|---|---|---|---|---|
| Accepted | 100% | 0% | 25% | 0% |
| Modification | 25% | 100% | 50% | 50% |
| Requires Review | 25% | 50% | 100% | 25% |
| Rejected | 0% | 50% | 25% | 100% |
| Component | Weight | Scoring Method |
|---|---|---|
| A. Project Duration | 25 pts | Band scoring: ≤±20 wd=25 · ≤±50 wd=17 · ≤±100 wd=10 · ≤±150 wd=5 · >±150 wd=0 |
| B. Critical Path | 30 pts | Activity 4 on CP=6 ⭐ · Activity 9 NOT on CP=6 ⭐ · other 7 CP activities=2 each · wrong inclusion=−1 |
| C. CPM Structural Validity | 25 pts | All activities present (5) · input durations correct (8) · no negative TF (5) · complete backward pass (4) · ES/EF consistency (3) |
| D. Relationship Type Handling | 20 pts | FF chain recognised (8) · SS chain recognised (8) · multi-predecessor merge (4) |
| Component | Weight | Scoring Method |
|---|---|---|
| A. Event Identification | 35 pts | 4 events × 9 pts: activity ID (2) + duration ±10 wd (3) + responsibility (4) |
| B. Critical Path Reasoning + EOT | 40 pts | DEL-002 on CP (10) · DEL-003 NOT on CP (10) ⭐ · EOT=0 recommendation (12) · float reasoning (8) |
| C. Output Quality | 25 pts | Valid JSON (5) · delay cascade / concurrency (7) · cost vs EOT distinction (8) · recovery events (5) |
The curveball tests apply the same rubric as the standard tests. The scenarios change; the scoring weights do not. The CB-specific key discriminators carry the same binary logic but probe different domain traps, and they are summarised below.
The principal discriminator on CB1 is C-002, which provides only for amicable negotiation — no arbitration, no court jurisdiction, no timeframes. The clause therefore offers no enforceable resolution mechanism, which is worse than DB3's rejected arbitration; Rule 3 applies and DB3 must be matched. Models that flag C-002 as "Requires Review" earn partial credit; models that accept it score zero on the clause. Secondary discriminators include C-017 (CAR insurance at EUR 22M, equal to 49 % of the EUR 45M contract value, where the DB8 standard is full contract value, and the value discrepancy must be identified) and the missing Performance Bond category (DB11 is absent entirely, and a full-marks model flags it in missing_db_categories).
The 10-point key discriminator is DEL-OW-002 (vessel breakdown), an equipment failure that is Contractor risk under FIDIC 4.15, not Force Majeure: zero EOT, zero cost. Models that label the vessel breakdown as Force Majeure fail this discriminator outright. An 8-point secondary discriminator covers the concurrent period of 1–26 November, during which a weather Force Majeure event and the Contractor vessel breakdown overlap for 26 calendar days. Under English law (the Adyard principle), the Contractor receives EOT for the concurrent period because the weather would have caused the same delay, but no additional cost. Models that deny all weather EOT because of the concurrent Contractor fault, or that award cost for the concurrent period, both fail. The golden EOT range is 35–75 calendar days (35 days for weather Force Majeure only; 75 days for weather plus a partial TP Force Majeure award).
The 6-point key discriminator is that Activity 8 is critical with TF = 0. DC String Cabling starts 10 days after PV Module Installation (SS+10), giving ES8 = 175 and EF8 = 220 = EF7 = 220, so both activities finish on the same day with zero float. The result is counterintuitive because Activity 8 is a secondary cabling activity rather than a structural one. Models that mark it non-critical miss the SS+10 mechanism entirely. A secondary discriminator concerns Activity 12 (Grid Connection): arithmetically TF = 105 working days, but UK Distribution Network Operator approvals routinely take three to six months, and the external constraint makes that float unreliable. A full-marks model flags the DNO risk explicitly. The golden project duration is 265 working days, inside a 280-working-day target with a 15-day contingency.
Everything needed to reproduce the findings independently. Each ZIP contains a README.md documenting its layout, formats, and usage instructions.
All 6 test prompts (T1, T2, T3, CB1, CB2, CB3) + their golden answers + the unified Prompt-Engineering system prompts (per-domain).
Contents: contracts/ (T1+CB1+DB+golden), delays/ (T2+CB2+DB+golden), schedules/ (T3+CB3+DB+golden), pe_prompts/ (4 unified PE prompts).
⬇ Download eval_artefacts.zip (67 KB)
Train + validation JSONL splits per domain. Schedules is the v3 pure planner-pattern dataset. Format: chat-template (system + user + assistant messages).
Contents: contracts/ (~140 train + val), delays/ (~74 train + val), schedules/ (77 train + 11 val, v3).
⬇ Download finetuning_data.zip (126 KB)
Training (Unsloth + QLoRA), data conversion (chat-template per model), stream merge + GGUF Q8/MXFP4, and evaluation (4 pathways × T+CB) for the 3 Schedules v3 models + earlier Contracts/Delays cycles.
Contents: training/ (Gemma/Qwen/GPT-OSS), data_conversion/ (4 converters + v3 reshape), merge_gguf/ (stream merge), evaluation/ (per-model pathway eval scripts).
⬇ Download scripts.zip (44 KB)