Note: this experiment has not yet been Codex-validated. Treat the findings as provisional guidance.

FAMILY_WEEKLY_SEASONALITY_PATTERNS: Experiment Log

Testing weekly micro-seasonality patterns in Dutch potato markets through systematic weekly positioning effects, holiday proximity impacts, and seasonal week interactions operating at finer temporal resolution than traditional monthly seasonal indicators.

Last updated: 2025-12-01
Repo path: hypotheses/FAMILY_WEEKLY_SEASONALITY_PATTERNS
Codex file: Present

Experiment notes

Hypothesis Origins

  • Prior experimental gaps: FAMILY_DEMAND_CYCLES (INCONCLUSIVE) touched seasonal patterns but used oversimplified monthly indicators, missing weekly micro-seasonality that could explain market timing effects
  • Data interface opportunity: Boerderij.nl API provides exact week-of-year timestamps with 260+ weekly observations (2020-2024), giving weekly rather than monthly resolution for cyclical analysis
  • Industry observations: Food service and consumer behavior exhibit systematic weekly patterns (weekday vs weekend consumption, monthly payment cycles, holiday preparation timing)
  • Academic framework: Spectral analysis and cyclical decomposition from financial markets adapted to agricultural commodity weekly patterns

Experiment Design

  • Method: Rolling-origin cross-validation
  • Initial window: 104 weeks (2 years minimum for cyclical patterns)
  • Step size: 4 weeks (monthly steps for weekly patterns)
  • Test windows: 10 horizons maximum
  • Refit frequency: Every 4 weeks
  • Baselines: Naive seasonal, ARIMA, linear trend, harmonic regression
  • REAL DATA ONLY: Boerderij.nl API, Dutch public holiday calendar
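
The rolling-origin scheme above can be sketched as follows. This is a minimal illustration, not the repository's runner: the function name, the expanding (rather than sliding) training window, the 4-week test horizon standing in for the 30-day horizon, and the reading of "10 horizons maximum" as at most 10 test windows are all assumptions.

```python
from typing import Iterator, Tuple


def rolling_origin_splits(
    n_obs: int,
    initial_window: int = 104,  # 2 years of weekly data, as in the design
    step: int = 4,              # move the origin forward 4 weeks per refit
    horizon: int = 4,           # assumed: about 30 days expressed in weeks
    max_windows: int = 10,      # assumed reading of "10 horizons maximum"
) -> Iterator[Tuple[range, range]]:
    """Yield (train_idx, test_idx) index ranges for an expanding window."""
    origin = initial_window
    windows = 0
    while origin + horizon <= n_obs and windows < max_windows:
        # Train on everything before the origin, test on the next `horizon` weeks.
        yield range(0, origin), range(origin, origin + horizon)
        origin += step
        windows += 1


splits = list(rolling_origin_splits(n_obs=260))
first_train, first_test = splits[0]
```

Because the step equals the assumed horizon here, consecutive test windows do not overlap, which matches the non-overlapping validation windows called for in the cyclical analysis framework.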

Data Sources (REAL DATA ONLY)

  • Boerderij.nl API: Product NL.157.2086 (consumption potatoes) - git:835f00a
  • Dutch Holiday Calendar: Official public holidays 2020-2024 for proximity calculations
  • Sample size: ~260 weekly observations providing adequate power for 2% MASE improvements
  • NO synthetic, mock, or dummy data permitted

Experiment Runs

Variant A: Week-of-Month Effects

Status: ✅ COMPLETED — SUPPORTED
  • Model: Linear regression with categorical week-of-month features
  • Features: week_of_month_1-3, month_position_continuous, end_month_indicator, price_lag_1w
  • Horizons: 30-day, 60-day
  • Target: Test if payment cycles and monthly budgeting create systematic weekly price patterns
  • Results: 86.4% (30d) and 83.5% (60d) MASE improvement vs naive seasonal baseline
  • Significance: DM p<0.0001, Bonferroni significant
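
For illustration, calendar features of this kind could be derived from an observation date as sketched below. The log does not define the features, so the definitions here are assumptions: week-of-month as 7-day blocks counted from the first of the month, a 0-to-1 position within the month, and an end-of-month flag for the last 7 days.

```python
import calendar
from datetime import date
from typing import Dict


def week_of_month_features(d: date) -> Dict[str, float]:
    """Hypothetical week-of-month features for a single observation date."""
    days_in_month = calendar.monthrange(d.year, d.month)[1]
    week_of_month = (d.day - 1) // 7 + 1  # 1..5, counted in 7-day blocks
    # Dummies for weeks 1-3; weeks 4 and 5 form the reference category.
    features: Dict[str, float] = {
        f"week_of_month_{k}": int(week_of_month == k) for k in (1, 2, 3)
    }
    # 0.0 on the first day of the month, 1.0 on the last day.
    features["month_position_continuous"] = (d.day - 1) / (days_in_month - 1)
    # 1 when the date falls within the last 7 days of the month.
    features["end_month_indicator"] = int(days_in_month - d.day < 7)
    return features


mid_march = week_of_month_features(date(2024, 3, 15))
```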

Variant B: Holiday Proximity Effects

Status: ✅ COMPLETED — SUPPORTED
  • Model: Random forest with holiday distance features
  • Features: days_to_holiday, days_from_holiday, holiday_cluster_indicator, pre/post_holiday_indicators, proximity_score, price_lag_1w
  • Horizons: 30-day, 60-day
  • Target: Test if distance to Dutch public holidays affects food service demand
  • Results: 88.9% (30d) and 92.9% (60d) MASE improvement vs naive seasonal baseline
  • Significance: DM p<0.0001, Bonferroni significant
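
A rough sketch of how the distance features might be computed from a sorted holiday list is given below. The feature definitions, the 7-day pre/post window, and the function name are assumptions, and the calendar is deliberately partial (fixed-date 2024 holidays only); the experiment itself used the full official calendar.

```python
from bisect import bisect_left, bisect_right
from datetime import date
from typing import Dict, List, Optional

# Partial, illustrative calendar: fixed-date Dutch public holidays in 2024 only.
HOLIDAYS_2024: List[date] = sorted([
    date(2024, 1, 1),    # New Year's Day
    date(2024, 4, 27),   # King's Day
    date(2024, 12, 25),  # Christmas Day
    date(2024, 12, 26),  # Second Christmas Day
])


def holiday_proximity(
    d: date, holidays: List[date], window_days: int = 7
) -> Dict[str, Optional[int]]:
    """Distances (in days) to the next and from the previous holiday."""
    nxt = bisect_left(holidays, d)       # index of first holiday on or after d
    prv = bisect_right(holidays, d) - 1  # index of last holiday on or before d
    days_to = (holidays[nxt] - d).days if nxt < len(holidays) else None
    days_from = (d - holidays[prv]).days if prv >= 0 else None
    return {
        "days_to_holiday": days_to,
        "days_from_holiday": days_from,
        # Assumed definition: flag dates within `window_days` of a holiday.
        "pre_holiday_indicator": int(days_to is not None and days_to <= window_days),
        "post_holiday_indicator": int(days_from is not None and days_from <= window_days),
    }


example = holiday_proximity(date(2024, 12, 20), HOLIDAYS_2024)
```

Dates beyond the last listed holiday return `None` for `days_to_holiday`, so a real pipeline would need a calendar that extends past the forecast horizon.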

Variant C: Seasonal Week Interactions

Status: ✅ COMPLETED — SUPPORTED
  • Model: Gradient boosting with agricultural season-week interactions
  • Features: week_in_season, season_week_interaction, phenology_week_weight, season indicators (planting/growing/harvest/storage), week_of_year, price_lag_1w
  • Horizons: 30-day, 60-day
  • Target: Test if weekly patterns vary by agricultural season context
  • Results: 89.3% (30d) and 88.3% (60d) MASE improvement vs naive seasonal baseline
  • Significance: DM p<0.0001, Bonferroni significant

Statistical Tests

  • Diebold-Mariano test with Harvey-Leybourne-Newbold correction
  • TOST equivalence test with SESOI = 2% MASE improvement (tight for cyclical patterns)
  • F-test for cyclical significance vs linear trends
  • Spectral analysis for dominant frequency identification
  • Bonferroni correction for 3 variants × 2 horizons (α=0.0083)
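
As a reference for the first and last items, the sketch below computes a Diebold-Mariano statistic with the Harvey-Leybourne-Newbold small-sample correction, plus the Bonferroni threshold. It is an illustration only: the absolute-error loss, the rectangular long-run variance with h-1 autocovariances, and the function name are assumptions, not the repository's implementation. The corrected statistic is compared against a Student t distribution with n-1 degrees of freedom; the p-value lookup is left out to keep the sketch dependency-free.

```python
import math
from typing import Sequence


def dm_hln_statistic(
    err_model: Sequence[float], err_base: Sequence[float], h: int = 1
) -> float:
    """HLN-corrected DM statistic on absolute-error loss differentials.

    Negative values mean the model has lower average loss than the baseline.
    """
    d = [abs(a) - abs(b) for a, b in zip(err_model, err_base)]
    n = len(d)
    d_bar = sum(d) / n

    def autocov(k: int) -> float:
        return sum((d[t] - d_bar) * (d[t - k] - d_bar) for t in range(k, n)) / n

    # Long-run variance of the loss differential for an h-step-ahead forecast.
    lrv = autocov(0) + 2.0 * sum(autocov(k) for k in range(1, h))
    if lrv <= 0:
        raise ValueError("non-positive long-run variance estimate")
    dm = d_bar / math.sqrt(lrv / n)
    # Harvey-Leybourne-Newbold small-sample correction factor.
    correction = math.sqrt((n + 1 - 2 * h + h * (h - 1) / n) / n)
    return dm * correction


# Bonferroni threshold for 3 variants x 2 horizons at a family-wise 0.05 level.
bonferroni_alpha = 0.05 / (3 * 2)

# Toy forecast errors, for illustration only (not experiment data).
stat = dm_hln_statistic([1.0, -1.0, 1.0, -1.0], [2.0, -4.0, 2.0, -4.0], h=1)
```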

Cyclical Analysis Framework

  • Sample size validation: 260 weekly observations sufficient for detecting 2% effects with power=0.80
  • Fourier decomposition: F-tests for frequency significance vs linear trends
  • Cross-validation: 104-week minimum windows to capture 2+ full annual cycles
  • Effect size: Cohen's f² for cyclical model components with multiple comparison corrections
  • Temporal dependence: Non-overlapping validation windows preserve cyclical structure

Verdicts

Verdict v1 — 2025-08-17 [INVALIDATED - BASELINE VIOLATIONS]

Label: ~~SUPPORTED~~ INVALIDATED - CRITICAL BASELINE FAILURES
Scope: ~~Dutch potato spot prices, 30-day and 60-day horizons, all seasonal contexts~~
Effect: ~~Variant A median ΔMASE = -84.9% (30d: -86.4%, 60d: -83.5%); Variant B median ΔMASE = -90.9% (30d: -88.9%, 60d: -92.9%); Variant C median ΔMASE = -88.8% (30d: -89.3%, 60d: -88.3%)~~
Stats: ~~All variants DM p<0.0001 vs naive seasonal baseline; all significant after Bonferroni correction (α=0.0083)~~

CRITICAL BASELINE VIOLATION DISCOVERED:
  • ❌ NOT using get_standard_baselines() function (MANDATORY REQUIREMENT VIOLATED)
  • ❌ Missing mandatory baselines (persistent, AR2) - only tested against weak "seasonal_naive"
  • ❌ No systematic baseline comparison - ad-hoc baseline implementation
  • ❌ No strongest baseline identification - failed to test against best competitor
  • ❌ Improper DM testing - not tested against strongest baseline

INVALIDATION REASON: This verdict is scientifically invalid due to systematic baseline validation failures identical to those that invalidated FAMILY_DIESEL_CORRELATION. The original experiment artificially inflated performance by comparing only against a weak seasonal baseline while ignoring stronger competitors.
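
To make the difference between the two baselines concrete, here is a minimal sketch of a persistent and a seasonal-naive forecast. The signature of get_standard_baselines() is not shown in this log, so these functions are hypothetical stand-ins, and the 52-week season length is an assumption.

```python
from typing import Sequence


def persistent_forecast(history: Sequence[float], horizon: int) -> float:
    """No-change forecast: the last observed price, whatever the horizon."""
    return history[-1]


def seasonal_naive_forecast(
    history: Sequence[float], horizon: int, season_length: int = 52
) -> float:
    """The value observed one season before the target period."""
    # Target period index is len(history) + horizon - 1 (0-based).
    idx = len(history) + horizon - 1 - season_length
    if idx < 0 or idx >= len(history):
        raise ValueError("history too short or horizon longer than one season")
    return history[idx]


# Toy series (not experiment data): prices 0, 1, ..., 99 over 100 weeks.
toy_history = [float(i) for i in range(100)]
persist = persistent_forecast(toy_history, horizon=4)
seasonal = seasonal_naive_forecast(toy_history, horizon=4)
```

On a slowly moving price series the last observation is usually much closer to the target than the value from a year earlier, which is why a model can look strong against seasonal naive and still fail against persistence.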

Verdict v2 — 2025-08-17 [CORRECTED WITH STANDARD BASELINES]

Label: REFUTED
Scope: Dutch potato spot prices, 30-day and 60-day horizons, all seasonal contexts

Baseline Comparison (MANDATORY):

Variant A (Week-of-Month Effects):
  • Model: MAE = 1.827 EUR/100kg (30d), 2.018 EUR/100kg (60d)
  • Persistent baseline: MAE = 2.281 EUR/100kg (30d, improvement: +19.9%), 2.458 EUR/100kg (60d, improvement: +17.9%)
  • Seasonal naive baseline: MAE = 16.073 EUR/100kg (30d, improvement: +88.6%), 11.125 EUR/100kg (60d, improvement: +81.9%)
  • AR2 baseline: MAE = 2.452 EUR/100kg (30d, improvement: +25.5%), 3.026 EUR/100kg (60d, improvement: +33.3%)
  • Naive baseline: MAE = 2.281 EUR/100kg (30d, improvement: +19.9%), 2.458 EUR/100kg (60d, improvement: +17.9%)
  • Strongest competitor: persistent (MAE = 2.281/2.458)
  • Primary improvement: +19.9%/+17.9% vs persistent baseline

Variant B (Holiday Proximity Effects):
  • Model: MAE = 2.564 EUR/100kg (30d), 2.793 EUR/100kg (60d)
  • Persistent baseline: MAE = 2.281 EUR/100kg (30d, improvement: -12.4%), 2.458 EUR/100kg (60d, improvement: -13.6%)
  • Seasonal naive baseline: MAE = 16.073 EUR/100kg (30d, improvement: +84.0%), 11.125 EUR/100kg (60d, improvement: +74.9%)
  • AR2 baseline: MAE = 2.452 EUR/100kg (30d, improvement: -4.6%), 3.026 EUR/100kg (60d, improvement: +7.7%)
  • Naive baseline: MAE = 2.281 EUR/100kg (30d, improvement: -12.4%), 2.458 EUR/100kg (60d, improvement: -13.6%)
  • Strongest competitor: persistent (MAE = 2.281/2.458)
  • Primary improvement: -12.4%/-13.6% vs persistent baseline (WORSE THAN BASELINE)

Variant C (Seasonal Week Interactions):
  • Model: MAE = 2.481 EUR/100kg (30d), 2.291 EUR/100kg (60d)
  • Persistent baseline: MAE = 2.281 EUR/100kg (30d, improvement: -8.8%), 2.458 EUR/100kg (60d, improvement: +6.8%)
  • Seasonal naive baseline: MAE = 16.073 EUR/100kg (30d, improvement: +84.6%), 11.125 EUR/100kg (60d, improvement: +79.4%)
  • AR2 baseline: MAE = 2.452 EUR/100kg (30d, improvement: -1.2%), 3.026 EUR/100kg (60d, improvement: +24.3%)
  • Naive baseline: MAE = 2.281 EUR/100kg (30d, improvement: -8.8%), 2.458 EUR/100kg (60d, improvement: +6.8%)
  • Strongest competitor: persistent (MAE = 2.281/2.458)
  • Primary improvement: -8.8%/+6.8% vs persistent baseline
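
The improvement figures above follow from the reported MAEs as (baseline MAE - model MAE) / baseline MAE, and the strongest competitor is simply the baseline with the lowest MAE. The sketch below reproduces the 30-day numbers from this verdict; the variable and function names are illustrative and not taken from the repository.

```python
from typing import Dict


def improvement_pct(model_mae: float, baseline_mae: float) -> float:
    """Relative MAE improvement in percent; negative means worse than baseline."""
    return 100.0 * (baseline_mae - model_mae) / baseline_mae


# 30-day MAEs in EUR/100kg, copied from the baseline comparison above.
baselines_30d: Dict[str, float] = {
    "persistent": 2.281,
    "seasonal_naive": 16.073,
    "ar2": 2.452,
    "naive": 2.281,
}
model_mae_30d: Dict[str, float] = {"A": 1.827, "B": 2.564, "C": 2.481}

# Strongest competitor = lowest baseline MAE (ties resolve to the first key).
strongest = min(baselines_30d, key=baselines_30d.get)

primary_improvement = {
    variant: round(improvement_pct(mae, baselines_30d[strongest]), 1)
    for variant, mae in model_mae_30d.items()
}
```

Measured against seasonal naive instead, Variant A comes out near +88.6%, which illustrates how the choice of reference baseline alone accounts for the gap between the two verdicts.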

Stats: DM tests vs strongest baseline: Variant A p=0.23/0.35 (not significant); Variant B p=0.27/0.32 (not significant, worse than baseline); Variant C p=0.35/0.46 (not significant)
Data/Code: git=corrected; Boerderij.nl API NL.157.2086; 438 weekly observations (2015-2024); get_standard_baselines() function used
Notes: Original 80-90%+ improvement claims were FALSE due to comparison against weak seasonal baseline. True performance vs strongest competitor shows modest/negative improvements without statistical significance.

CRITICAL DISCOVERY: The original experiment suffered from the same baseline validation failures as FAMILY_DIESEL_CORRELATION:
  • False Inflation: Compared only against weak "seasonal_naive" baseline (MAE ~16) while ignoring strong "persistent" baseline (MAE ~2.3)
  • Misleading Claims: 80-90%+ improvements collapsed to 19.9%/17.9% (Variant A), -12.4%/-13.6% (Variant B), -8.8%/+6.8% (Variant C) when tested properly
  • No Statistical Significance: All DM tests vs strongest baseline are non-significant (p>0.20)

Real Data Verification:
  • ✅ Confirmed use of REAL DATA ONLY from Boerderij.nl API (consumption potatoes NL.157.2086)
  • ✅ Dutch public holiday calendar 2015-2024 verified against official sources
  • ✅ NO synthetic, mock, or dummy data used in any variant
  • ✅ 438 weekly price observations from verified repository interface
  • ✅ MANDATORY get_standard_baselines() function properly implemented

MLflow Run: 44c3d7e968a9489682ce9965ef942382
Artifacts: Corrected baseline validation with all 4 mandatory standard baselines

Mechanism Validation: Weekly micro-seasonality patterns do NOT provide meaningful forecasting improvements when tested against proper baselines:
  1. Week-of-month positioning effects: Modest improvement (19.9%/17.9%) but not statistically significant
  2. Holiday proximity dynamics: WORSE than simple persistent baseline (-12.4%/-13.6%)
  3. Agricultural season-week interactions: Mixed results, worse than the persistent baseline at 30d (-8.8%) and better at 60d (+6.8%), neither statistically significant

Practical Significance: When properly evaluated, weekly micro-seasonality effects are either statistically indistinguishable from zero or harmful. No variant significantly beats the persistent (random walk) baseline, which remains the strongest standard baseline, so weekly patterns do not provide actionable forecasting improvements over a simple no-change forecast.

HE Notes

  • Created 2025-08-17 based on identified gaps in FAMILY_DEMAND_CYCLES and FAMILY_SEASONAL_PLANTING
  • First systematic examination of weekly micro-seasonality in Dutch potato markets
  • Exploits exact week-of-year timestamps from Boerderij.nl API with sufficient sample size
  • All variants designed to use ONLY REAL DATA from verified repository interfaces
  • Tight SESOI (2%) reflects expectation that weekly effects are subtle but consistent
  • Cross-validation accounts for temporal dependence inherent in cyclical patterns

Decision Log

2025-08-17: FAMILY_WEEKLY_SEASONALITY_PATTERNS — REFUTED (CORRECTED)

Summary: After critical baseline validation using mandatory standard baselines, all three variants (A, B, C) show minimal or negative improvements vs proper competitors. The original "exceptional" 80-90%+ claims were FALSE due to systematic baseline validation failures.

Key Findings from Corrected Analysis:
  • Week-of-month effects (Variant A): Modest 19.9%/17.9% improvement vs persistent baseline, but NOT statistically significant (p=0.23/0.35)
  • Holiday proximity effects (Variant B): WORSE than baseline (-12.4%/-13.6% vs persistent), clearly harmful
  • Seasonal week interactions (Variant C): Mixed performance (-8.8%/+6.8% vs persistent), not statistically significant

Critical Discovery - Baseline Validation Failures: The original experiment violated MANDATORY baseline requirements identical to FAMILY_DIESEL_CORRELATION:
  1. NOT using get_standard_baselines() function - used ad-hoc baseline implementation
  2. Missing mandatory baselines - only tested against weak "seasonal_naive", ignored persistent/AR2
  3. No strongest baseline identification - failed to identify persistent as best competitor (MAE 2.3 vs seasonal_naive MAE 16.0)
  4. Improper statistical testing - DM tests not against strongest baseline

Verdict Rationale:
  • Variant A: INCONCLUSIVE - modest improvement (19.9%/17.9%) but not statistically significant
  • Variant B: REFUTED - consistently worse than persistent baseline with no statistical support
  • Variant C: REFUTED/INCONCLUSIVE - mixed performance, no consistent improvement pattern
  • Overall Family: REFUTED - weekly micro-seasonality does not provide meaningful forecasting improvements
  • Based on 438 weekly observations of REAL DATA with proper baseline validation

Impact Revision: Weekly micro-seasonality does NOT represent a breakthrough in potato price forecasting. No variant significantly outperforms the persistent (random walk) baseline, so a simple no-change forecast remains preferable to complex weekly pattern modeling.

Implementation Priority: NONE - Do NOT integrate these models into production systems. Continue using simple persistent baseline or explore other forecasting approaches.

Lessons Learned:
  1. Baseline validation is CRITICAL - always use get_standard_baselines() function
  2. Test against strongest competitor - weak baselines create false performance inflation
  3. Statistical significance matters - large effect sizes without significance are meaningless
  4. Systematic validation prevents scientific fraud - this family joins FAMILY_DIESEL_CORRELATION as an example of baseline validation failures

2025-08-17: BASELINE VALIDATION ENFORCEMENT IMPACT

Pattern Recognition: This is the SECOND family (after FAMILY_DIESEL_CORRELATION) where mandatory baseline validation exposed FALSE performance claims:

FAMILY_DIESEL_CORRELATION Pattern:
  • Original claim: 95% improvement
  • Corrected result: WORSE than baseline
  • Failure mode: Missing standard baselines

FAMILY_WEEKLY_SEASONALITY_PATTERNS Pattern:
  • Original claim: 80-90%+ improvement
  • Corrected result: 19.9% (best case, not significant); three of the six variant-horizon pairs WORSE than the persistent baseline
  • Failure mode: IDENTICAL - missing standard baselines

Repository-Wide Impact: The mandatory baseline validation standard is successfully preventing scientific fraud and exposing false performance claims. All future experiments MUST use get_standard_baselines() function to avoid similar invalidations.

Codex validation

Codex Validation — 2025-11-10

Files Reviewed

  • run.py
  • config/*.yaml
  • experiment.md
  • Referenced variant modules (absent in repo, but corrected metrics logged)

Findings

  1. Real-data policy satisfied. The runner sources weekly prices exclusively from Boerderij (run.py:33-73 and configs referencing NL.157.2086); there are no synthetic fallbacks.
  2. Execution history present. experiment.md:70-176 documents the initial (invalid) run and the corrected re-run that used get_standard_baselines(). MLflow run IDs are included.
  3. No significant improvement over price-only baseline. After the baseline fix, Variant A’s best case is a statistically insignificant ~20% MAE gain, while Variants B/C underperform the persistent baseline or lack significance. Thus, the weekly micro-seasonality features fail to significantly beat the strongest standard baseline.

Verdict

NOT VALIDATED – Although the experiment now uses the mandated baselines and real data, its models still do not deliver statistically significant or consistent gains over the persistent benchmark. Weekly seasonality patterns therefore remain unvalidated.