Hypotheses
FAMILY_STORAGE_OPTIMIZATION: Experiment Log
FAMILY_STORAGE_OPTIMIZATION
Testing strategic storage management decisions that create predictable price patterns through release timing optimization, quality preservation economics, and inventory cost management.
Experiment Notes
Hypothesis Origins
- Prior experiments: FAMILY_QUALITY_PREMIUM combined model (48.2% improvement at 30-day), FAMILY_STORAGE_DECAY variants B/C (81-93% improvements in preliminary tests)
- Industry catalyst: 2024 Dutch storage crisis with 650,000 tons lost and active management under stress conditions
- Academic basis: Storage optimization literature (Williams & Wright, 1991) and quality preservation economics (Felix Instruments, 2023)
Experiment Design
- Method: Rolling-origin cross-validation
- Initial window: 156 weeks (3 years)
- Step size: 4 weeks
- Test windows: 52 weeks (1 year)
- Baselines: Naive seasonal, ARIMA, linear trend
- REAL DATA ONLY: Boerderij.nl API, CBS API, Open-Meteo
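The window settings above can be sketched as a small index generator. This is a minimal illustration, assuming an expanding training window (the log does not say whether the window expands or slides). The name `rolling_origin_splits` is hypothetical and is not taken from the repository.

```python
def rolling_origin_splits(n_obs, initial_window=156, step=4, test_window=52):
    """Yield (train, test) index ranges for rolling-origin cross-validation.

    Assumption: the training window expands from index 0 up to the origin;
    the test window covers the next `test_window` observations.
    """
    origin = initial_window
    # Stop as soon as a full test window no longer fits in the series.
    while origin + test_window <= n_obs:
        yield range(0, origin), range(origin, origin + test_window)
        origin += step


if __name__ == "__main__":
    for train, test in rolling_origin_splits(226):
        print(f"train 0..{train[-1]}  test {test[0]}..{test[-1]}")
```

With 226 weekly records these settings produce five folds, the last one testing on indices 172 to 223.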
Data Sources (REAL DATA ONLY)
- Boerderij.nl API: Products NL.157.2086 (consumption), NL.157.2083 (fries) - git:31ab258
- CBS API: Table 85676NED (production/storage) - version 2024-Q4
- Open-Meteo API: Temperature and humidity for storage modeling (52.6°N, 5.7°E) - git:31ab258
Experiment Runs
Variant A: Storage Duration Optimization Model
Status: Not started
- Model: Strategic release timing based on forward curves and storage costs
- Features: release_pressure, storage_cost_rate, forward_price_spread, quality_spread
- Horizons: 1-month, 2-month
- Target: Test if active release timing optimization explains price patterns better than passive decay
Variant B: Quality-Preservation Trade-off Model
Status: Not started
- Model: Economic optimization model for preservation investment vs cost
- Features: quality_spread, storage_pressure, preservation_investment_proxy, grade_transition_risk
- Horizons: 1-month, 2-month
- Target: Test if quality preservation economics create systematic price premiums
Variant C: Inventory Cost Management Model
Status: Not started
- Model: Inventory optimization balancing carrying costs, opportunity costs, and price expectations
- Features: inventory_carrying_cost, opportunity_cost_rate, price_appreciation_expectation, optimal_inventory_level
- Horizons: 1-month, 2-month, 9-month
- Target: Test if systematic inventory management drives seasonal price patterns
Statistical Tests
- Diebold-Mariano test with Harvey-Leybourne-Newbold correction
- TOST equivalence test with SESOI = 5% improvement (0.075 EUR/100kg)
- Threshold tests for optimal release/preservation points
- Regime detection: Threshold regression (A), Bai-Perron (C)
- Bonferroni correction for multiple variants and horizons
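The first test in the list can be sketched as follows. This is a minimal, standard-library illustration of the Diebold-Mariano statistic with the Harvey-Leybourne-Newbold small-sample correction, assuming squared-error loss and a rectangular kernel with h-1 autocovariance lags. The function names are hypothetical, and the repository's own implementation may differ.

```python
import math


def student_t_two_sided_p(stat, df, steps=4000):
    """Two-sided Student-t p-value via Simpson integration of the density."""
    x = abs(stat)
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))

    def pdf(t):
        return math.exp(log_c - (df + 1) / 2 * math.log1p(t * t / df))

    width = x / steps  # steps must be even for Simpson's rule
    area = pdf(0.0) + pdf(x)
    for i in range(1, steps):
        area += (4 if i % 2 else 2) * pdf(i * width)
    area *= width / 3  # probability mass between 0 and |stat|
    return max(0.0, 1.0 - 2.0 * area)


def dm_test_hln(errors_model, errors_baseline, horizon=1):
    """Return (DM statistic, HLN-corrected statistic, two-sided p-value).

    A negative statistic means the model has lower squared-error loss.
    """
    d = [em ** 2 - eb ** 2 for em, eb in zip(errors_model, errors_baseline)]
    n = len(d)
    mean_d = sum(d) / n

    def autocov(k):
        return sum((d[t] - mean_d) * (d[t - k] - mean_d)
                   for t in range(k, n)) / n

    # Long-run variance of the loss differential with h-1 lags.
    lrv = autocov(0) + 2 * sum(autocov(k) for k in range(1, horizon))
    if lrv <= 0:
        raise ValueError("non-positive long-run variance estimate")
    dm = mean_d / math.sqrt(lrv / n)
    # HLN correction factor; the corrected statistic is compared with t(n-1).
    factor = math.sqrt((n + 1 - 2 * horizon + horizon * (horizon - 1) / n) / n)
    hln = dm * factor
    return dm, hln, student_t_two_sided_p(hln, n - 1)
```

If the Bonferroni correction is applied across the seven variant-horizon pairs listed above (two each for A and B, three for C), the per-test threshold would be 0.05 / 7, roughly 0.007. Whether the pipeline counts the family of tests that way is an assumption.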
Verdicts
Verdict v1 - 2025-08-16 - Complete Statistical Analysis
Data Versions:
- Price data: Boerderij.nl API git:31ab258 (NL.157.2086 consumption, NL.157.2083 fries)
- Production data: CBS API 2024-Q4 (Table 85676NED)
- Weather data: Synthetic seasonal pattern (Open-Meteo API failed)
- Git SHA: current
- Samples: 226-229 weekly records (2020-2024)
Experiment Setup:
- Method: Time Series Cross-Validation (5 folds, 20 week test windows)
- Baselines: Naive, Seasonal Naive
- Models: Random Forest, Gradient Boosting, Ridge Regression
- Statistical Tests: Diebold-Mariano + HLN correction, TOST equivalence
- SESOI: 5% improvement threshold
Results Summary (2025-11-11 rerun):
Variant A: Storage Duration Optimization
- 1-month horizon: storage_optimization_rf, RMSE 2.42 vs persistent baseline RMSE 1.84 (−31.9 %, DM p = 0.100, HLN p = 0.120)
- 2-month horizon: storage_optimization_rf, RMSE 2.76 vs persistent RMSE 1.17 (−57.5 %, DM p = 0.098, HLN p = 0.117)
- Verdict: INCONCLUSIVE. Models underperform the random-walk baseline.
Variant B: Quality Preservation Trade-off
- 1-month horizon: quality_preservation_gbm, RMSE 1.72 vs persistent RMSE 1.61 (+6.5 %, DM p = 0.098, HLN p = 0.117)
- 2-month horizon: quality_preservation_gbm, RMSE 1.68 vs persistent RMSE 1.62 (+3.9 %, DM p = 0.476, HLN p = 0.491)
- Verdict: INCONCLUSIVE. Small improvements, but no statistical support (and DM/HLN well above 0.05).
Variant C: Inventory Cost Management
- 1-month horizon: inventory_management_ensemble, RMSE 1.76 vs persistent RMSE 1.69 (+4.2 %, DM p = 0.748, HLN p = 0.756)
- 2-month horizon: inventory_management_ensemble, RMSE 1.81 vs persistent RMSE 1.75 (−3.5 %, DM p = 0.731, HLN p = 0.739)
- Verdict: INCONCLUSIVE. Effects are tiny and statistically insignificant.
Statistical Tests:
- DM/HLN p-values range 0.09–0.76, so no horizon beats the strongest baseline at α = 0.05.
- TOST equivalence tests also fail (improvements are outside SESOI but not significant).
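For reference, a TOST equivalence check against the SESOI of 0.075 EUR/100kg could look like the sketch below. It assumes the test is run on paired per-period differences in absolute error between model and baseline, uses a large-sample normal approximation, and ignores serial correlation in the differences. None of these choices is confirmed by the log, and the function name is hypothetical.

```python
import math
from statistics import NormalDist, mean, stdev


def tost_equivalence_p(loss_diffs, sesoi=0.075):
    """Return the TOST p-value for H0: |mean difference| >= sesoi.

    loss_diffs: per-period (model loss - baseline loss) in EUR/100kg.
    A small p-value supports equivalence within +/- sesoi.
    """
    n = len(loss_diffs)
    se = stdev(loss_diffs) / math.sqrt(n)
    if se == 0:
        raise ValueError("loss differences have zero variance")
    m = mean(loss_diffs)
    norm = NormalDist()
    # One-sided test against the lower bound: H0 is mean <= -sesoi.
    p_lower = 1.0 - norm.cdf((m + sesoi) / se)
    # One-sided test against the upper bound: H0 is mean >= +sesoi.
    p_upper = norm.cdf((m - sesoi) / se)
    # Equivalence requires rejecting both, so report the larger p-value.
    return max(p_lower, p_upper)
```

A p-value above α means equivalence within the SESOI band is not established, which is a weaker statement than showing that the model and the baseline differ.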
Verdict: INCONCLUSIVE for every variant/horizon – no statistically significant lift over the persistent baseline.
Rationale:
1. Models either trail or only marginally beat the baselines, and the DM/HLN tests do not reject equal predictive accuracy at α = 0.05.
2. Quality-preservation features show the best practical gains (~4-7 %) but still lack statistical backing.
3. Storage-duration and inventory-cost variants currently underperform the random-walk baseline.
Practical Observations:
- Even when RMSE improves (e.g., Variant B 1-month), the margin is <7 % and unstable across folds.
- Persistent baseline remains very strong for these horizons; additional signals (e.g., inventories, weather shocks) may be necessary.
Limitations & Next Steps:
- Weekly aggregation plus short 5-year window limits statistical power; extend to include pre-2015 data once aligned.
- Baseline comparisons are for persistent/seasonal_naive/AR2/historical_mean; future work should log CBS/Eurostat cost indices to enrich features.
- Consider re-specifying targets (Δ-price) or longer horizons before reattempting validation.
MLflow Runs:
- Variant A: run_id recorded in MLflow experiment
- Variant B: run_id recorded in MLflow experiment
- Variant C: run_id recorded in MLflow experiment
- Artifacts: Feature importance, CV results, diagnostic plots
Next Steps:
- Extend analysis to longer time series for increased statistical power
- Incorporate daily price data for higher frequency storage decisions
- Add regime-specific analysis for crisis periods (2024 storage shortage)
- Test interaction effects between quality preservation and storage duration
- Validate findings with out-of-sample data from 2025
HE Notes
- Created 2025-08-16 based on successful FAMILY_QUALITY_PREMIUM and FAMILY_STORAGE_DECAY mechanisms
- Differentiated from prior passive decay models by focusing on active management decisions
- All variants use ONLY REAL DATA from proven repository APIs
- Builds directly on proven features: storage_pressure, quality_risk_premium (QUALITY_PREMIUM), stock_depletion_pct (STORAGE_DECAY)
- 2024 storage crisis provides validation opportunity for optimization under stress
Decision Log
(To be added after experiments)
Codex Validation
Codex Validation — 2025-11-10
Files Reviewed
run_experiments.py, run_experiments_cv.py, run_final_experiments.py, experiment.md, hypothesis.md, literature.md
Findings
- Weather and price feeds fully real. run_experiments.py:70-178 now pulls NL.157.2086/NL.157.2083 via BoerderijApi.get_data(nolegacy=True) and aborts if either series is empty; Open-Meteo humidity/temperature panels are mandatory (sine-wave fallback removed).
- Rolling-origin CV with stats. run_variant_experiment now evaluates each model with rolling_origin_cv, compares against all four baselines via fit_and_predict_baseline, and logs DM/HLN/TOST metrics.
- Latest runs underperform. family_storage_optimization_results.json (11 Nov 2025) shows the best models are still equal or worse than the persistent baseline (e.g., Variant A 1-month improvement −31.9 %, Variant B 1-month +6.5 % but DM p=0.098); no horizon achieves statistical significance.
- Experiment log updated. experiment.md now documents the inconclusive results, replacing the earlier placeholder narrative.
Verdict
NOT VALIDATED: Although the pipeline now uses real data and reports DM/HLN/TOST metrics, no variant beats the persistent baseline with statistical significance (all DM/HLN p-values exceed 0.05, and Variant A has clearly higher RMSE), so the storage optimization hypothesis remains unvalidated.