Note: this experiment has not yet been Codex-validated. Treat the findings as preliminary indications.

Hypotheses

FAMILY_SATELLITE_NDVI_REAL

Tests whether REAL satellite-derived vegetation indices (NDVI and others) from Sentinel-2 imagery can improve potato price forecasting at 12-week horizons, using only real data from the repository's zarr stores; full details follow in the experiment notes below.

Last updated
2025-12-01
Repo path
hypotheses/FAMILY_SATELLITE_NDVI_REAL
Codex file
Present

Experiment notes

FAMILY_SATELLITE_NDVI_REAL - Experiment Results

Overview

This experiment tests whether REAL satellite-derived vegetation indices (NDVI and others) from Sentinel-2 imagery can improve potato price forecasting at 12-week horizons. Unlike the invalidated FAMILY_SATELLITE_NDVI_ASYMMETRY that used synthetic data, this implementation uses exclusively REAL data from the repository's zarr stores.

Data Loading Approach

1. Satellite Data (REAL)

import xarray as xr
import numpy as np

# Load Sentinel-2 zarr store
sentinel = xr.open_zarr("/Users/sethvanderbijl/PitchAI Code/potato_supply/lake_31UFU_medium_polder.zarr")

# Calculate NDVI from real bands
ndvi = (sentinel.B08 - sentinel.B04) / (sentinel.B08 + sentinel.B04 + 1e-8)

# Apply cloud masking using the Scene Classification Layer (SCL):
# class 8 = cloud medium probability, 9 = cloud high probability, 10 = thin cirrus
cloud_mask = (sentinel.SCL == 8) | (sentinel.SCL == 9) | (sentinel.SCL == 10)
ndvi_clear = ndvi.where(~cloud_mask, np.nan)
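
The verdict below reports 612 scenes being reduced to 301 clear scenes; one plausible way to apply such a per-scene filter is sketched here (the 50% threshold and the `y`/`x` dimension names are assumptions, not the executed pipeline):

# Fraction of non-cloud pixels per acquisition
clear_frac = (~cloud_mask).mean(dim=("y", "x"))

# Keep only acquisitions that are mostly cloud-free (threshold is illustrative)
keep = (clear_frac >= 0.5).values
clear_scenes = ndvi_clear.isel(time=keep)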

2. Parcel Masking (REAL)

from src.sources.brp.brp_api.brp import BRPApi
from datetime import date

# Get potato parcel mask
brp = BRPApi()
potato_mask = brp.get_consumption_potato_mask(
    bbox=(5.5, 52.5, 5.6, 52.6),  # Noordoostpolder region
    date_range=(date(2023, 1, 1), date(2023, 12, 31)),
    grid_shape=(len(sentinel.y), len(sentinel.x)),
    grid_coords=(sentinel.x.values, sentinel.y.values),
    target_crs='EPSG:32631'
)

# Apply mask to get potato-only NDVI
potato_ndvi = ndvi_clear.where(potato_mask, np.nan)

3. Price Data (REAL)

import pandas as pd
from src.sources.boerderij_nl.boerderij_nl_api import BoerderijApi

# Get Dutch potato prices
api = BoerderijApi()
prices = api.get_data(
    product_id="NL.157.2086",  # Consumption potatoes
    date_range=("2020-01-01", "2024-12-31"),
    granularity="weekly",
    legacy=True
)

# Process prices: weekly price is the midpoint of the quoted range
# ('bovenkant' = top of range, 'onderkant' = bottom of range)
prices['date'] = pd.to_datetime(prices['date'])
prices['price'] = (prices['bovenkant'] + prices['onderkant']) / 2

Feature Engineering Pipeline

Vegetation Indices Calculation

def calculate_vegetation_indices(sentinel_data):
    """Calculate multiple vegetation indices from Sentinel-2 bands."""
    indices = {}

    # NDVI - Normalized Difference Vegetation Index
    indices['ndvi'] = (sentinel_data.B08 - sentinel_data.B04) / \
                      (sentinel_data.B08 + sentinel_data.B04 + 1e-8)

    # EVI - Enhanced Vegetation Index
    # (coefficients assume reflectance in [0, 1]; rescale first if bands are stored as 0-10000 integers)
    indices['evi'] = 2.5 * (sentinel_data.B08 - sentinel_data.B04) / \
                     (sentinel_data.B08 + 6*sentinel_data.B04 - 7.5*sentinel_data.B02 + 1)

    # SAVI - Soil Adjusted Vegetation Index
    L = 0.5  # Soil brightness correction factor
    indices['savi'] = (1 + L) * (sentinel_data.B08 - sentinel_data.B04) / \
                      (sentinel_data.B08 + sentinel_data.B04 + L)

    # GNDVI - Green Normalized Difference Vegetation Index
    indices['gndvi'] = (sentinel_data.B08 - sentinel_data.B03) / \
                       (sentinel_data.B08 + sentinel_data.B03 + 1e-8)

    # NDWI - Normalized Difference Water Index
    indices['ndwi'] = (sentinel_data.B03 - sentinel_data.B08) / \
                      (sentinel_data.B03 + sentinel_data.B08 + 1e-8)

    # NDRE - Normalized Difference Red Edge
    indices['ndre'] = (sentinel_data.B08 - sentinel_data.B05) / \
                      (sentinel_data.B08 + sentinel_data.B05 + 1e-8)

    return indices

Temporal Aggregation

def aggregate_to_weekly(daily_ndvi, date_range):
    """Aggregate daily satellite data to weekly values."""
    weekly_stats = []

    # freq='W' yields Sunday-anchored timestamps; each is treated as the start of a 7-day window
    for week_start in pd.date_range(start=date_range[0], end=date_range[1], freq='W'):
        week_end = week_start + pd.Timedelta(days=6)

        # Get data for this week
        week_data = daily_ndvi.sel(time=slice(week_start, week_end))

        if len(week_data.time) > 0:
            # Calculate statistics
            stats = {
                'date': week_start,
                'ndvi_mean': float(week_data.mean(skipna=True)),
                'ndvi_std': float(week_data.std(skipna=True)),
                'ndvi_min': float(week_data.min(skipna=True)),
                'ndvi_max': float(week_data.max(skipna=True)),
                'n_observations': len(week_data.time),
                'clear_pixel_pct': float((~np.isnan(week_data)).mean())
            }
            weekly_stats.append(stats)

    return pd.DataFrame(weekly_stats)
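
A usage sketch tying the parcel-masked NDVI to this weekly table (the date range mirrors the 2021-2023 data window reported in the verdict; variable names follow the earlier snippets):

weekly_ndvi = aggregate_to_weekly(potato_ndvi, ("2021-01-01", "2023-12-31"))
print(weekly_ndvi[['date', 'ndvi_mean', 'clear_pixel_pct']].tail())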

Feature Creation

def create_ndvi_features(weekly_ndvi_df):
    """Create NDVI-based features for modeling."""
    df = weekly_ndvi_df.copy()

    # Temporal changes
    for lag in [2, 4, 8]:
        df[f'ndvi_change_{lag}w'] = df['ndvi_mean'].diff(lag)

    # Moving averages
    for window in [4, 8, 12]:
        df[f'ndvi_ma_{window}w'] = df['ndvi_mean'].rolling(window).mean()

    # Volatility
    df['ndvi_volatility_4w'] = df['ndvi_mean'].rolling(4).std()
    df['ndvi_volatility_8w'] = df['ndvi_mean'].rolling(8).std()

    # Stress indicators
    df['ndvi_stress_flag'] = (df['ndvi_mean'] < 0.4).astype(int)
    df['ndvi_severe_stress'] = (df['ndvi_mean'] < 0.25).astype(int)

    # Anomaly detection (z-score from 5-year mean)
    rolling_mean = df['ndvi_mean'].rolling(260, min_periods=52).mean()
    rolling_std = df['ndvi_mean'].rolling(260, min_periods=52).std()
    df['ndvi_anomaly_zscore'] = (df['ndvi_mean'] - rolling_mean) / (rolling_std + 1e-8)

    # Cumulative stress
    df['ndvi_stress_duration'] = df.groupby((df['ndvi_stress_flag'] != 
                                            df['ndvi_stress_flag'].shift()).cumsum())['ndvi_stress_flag'].cumsum()

    return df
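
The Variant A trainer below expects a `price_12w` target plus a few calendar columns; a minimal sketch of how the weekly NDVI table and the price series might be merged and the 12-week-ahead target built (helper name and column choices are assumptions, not the executed pipeline):

import numpy as np

def build_model_frame(weekly_ndvi_df, prices, horizon=12):
    """Merge weekly NDVI features with prices and add the forecast target."""
    df = weekly_ndvi_df.merge(prices[['date', 'price']], on='date', how='inner')

    # Target: the price `horizon` weeks ahead of each observation
    df['price_12w'] = df['price'].shift(-horizon)

    # Calendar features referenced in train_variant_a
    # (growing season taken as April-September; an assumption for illustration)
    df['week_of_year'] = df['date'].dt.isocalendar().week.astype(int)
    df['month_sin'] = np.sin(2 * np.pi * df['date'].dt.month / 12)
    df['month_cos'] = np.cos(2 * np.pi * df['date'].dt.month / 12)
    df['growing_season_flag'] = df['date'].dt.month.between(4, 9).astype(int)

    return df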

Model Implementation

Variant A: NDVI-Only

from lightgbm import LGBMRegressor
from sklearn.model_selection import TimeSeriesSplit

def train_variant_a(features_df, target_col='price_12w'):
    """Train NDVI-only model."""

    # Select features
    ndvi_features = [col for col in features_df.columns if 'ndvi' in col]
    temporal_features = ['week_of_year', 'month_sin', 'month_cos', 'growing_season_flag']

    X = features_df[ndvi_features + temporal_features]
    y = features_df[target_col]

    # Remove NaN rows
    mask = ~(X.isna().any(axis=1) | y.isna())
    X = X[mask]
    y = y[mask]

    # Train model
    model = LGBMRegressor(
        n_estimators=200,
        learning_rate=0.05,
        max_depth=6,
        num_leaves=50,
        subsample=0.8,
        colsample_bytree=0.8,
        random_state=42
    )

    # Time series cross-validation
    tscv = TimeSeriesSplit(n_splits=5)
    scores = []

    for train_idx, test_idx in tscv.split(X):
        X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]
        y_train, y_test = y.iloc[train_idx], y.iloc[test_idx]

        model.fit(X_train, y_train)
        pred = model.predict(X_test)
        mae = np.mean(np.abs(y_test - pred))
        scores.append(mae)

    return model, np.mean(scores)
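
Putting the pieces together (a sketch; `weekly_ndvi` and `build_model_frame` are the hypothetical helpers above):

features_df = create_ndvi_features(weekly_ndvi)
features_df = build_model_frame(features_df, prices)

model, cv_mae = train_variant_a(features_df, target_col='price_12w')
print(f"Variant A cross-validated MAE: {cv_mae:.2f}")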

Evaluation Protocol

1. Standard Baselines (MANDATORY)

from experiments._shared.baselines import get_standard_baselines

def evaluate_against_baselines(train_data, test_data, model, horizon=12):
    """Compare model against all standard baselines."""

    # Get baseline forecasts
    baselines = get_standard_baselines(train_data, horizon=horizon, target_col='price')

    # Model forecast (create_features is the feature-construction helper defined elsewhere in the pipeline)
    model_features = create_features(test_data)
    model_forecast = model.predict(model_features)

    # Error against the realised price `horizon` steps ahead (test_data covers the forecast window)
    actual = test_data['price'].iloc[horizon]

    results = {
        'model_mae': np.abs(actual - model_forecast),
        'persistent_mae': np.abs(actual - baselines['persistent']),
        'seasonal_naive_mae': np.abs(actual - baselines['seasonal_naive']),
        'ar2_mae': np.abs(actual - baselines['ar2']),
        'naive_mae': np.abs(actual - baselines['historical_mean'])  # 'naive' = historical-mean baseline
    }

    # Calculate improvement over strongest baseline
    strongest_baseline_mae = min(results['persistent_mae'], 
                                 results['seasonal_naive_mae'],
                                 results['ar2_mae'])

    improvement = (strongest_baseline_mae - results['model_mae']) / strongest_baseline_mae * 100

    return improvement, results

2. Rolling Window Cross-Validation

def rolling_cv_evaluation(data, model_func, window_size=156, step_size=4, horizon=12):
    """Perform rolling window cross-validation."""

    results = []

    for start_idx in range(0, len(data) - window_size - horizon, step_size):
        # Define train and test windows
        train_end = start_idx + window_size
        test_idx = train_end + horizon

        if test_idx >= len(data):
            break

        train_data = data.iloc[start_idx:train_end]
        test_point = data.iloc[test_idx]

        # Train model
        model = model_func(train_data)

        # Get predictions and baselines
        improvement, metrics = evaluate_against_baselines(
            train_data, data.iloc[train_end:test_idx+1], model, horizon
        )

        results.append({
            'window_start': data.index[start_idx],
            'test_date': data.index[test_idx],
            'improvement': improvement,
            **metrics
        })

    return pd.DataFrame(results)

Expected Results

Success Criteria

  • Primary: >7.6% improvement over strongest baseline (validated benchmark)
  • Target: 10-15% improvement with multi-index features
  • Stretch: 15-25% improvement with full integration
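
For reference, the criteria above expressed as a small helper (thresholds copied from this list; the tier labels are illustrative):

def classify_improvement(improvement_pct):
    """Map an improvement-over-strongest-baseline percentage to the tiers above."""
    if improvement_pct >= 15:
        return "stretch"
    if improvement_pct >= 10:
        return "target"
    if improvement_pct > 7.6:
        return "primary success"
    return "below primary criterion"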

Risk Mitigation

  1. Cloud coverage: Use temporal interpolation and multiple scenes (see the gap-filling sketch after this list)
  2. Data gaps: Ensure minimum 10 clear scenes per growing season
  3. Overfitting: Strict time series cross-validation
  4. Feature selection: Use importance scores and SHAP values
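
A sketch of how the cloud-coverage mitigation (item 1) might look, assuming the `ndvi_clear` array from the data-loading step; this is not part of the executed run:

# Fill cloud gaps by linear interpolation along time, then smooth lightly
ndvi_filled = ndvi_clear.interpolate_na(dim="time", method="linear")
ndvi_smooth = ndvi_filled.rolling(time=3, center=True, min_periods=1).mean()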

Execution Timeline

Week 1: Data Processing

  • [ ] Load and validate zarr store
  • [ ] Calculate all vegetation indices
  • [ ] Apply parcel masks
  • [ ] Aggregate to weekly values

Week 2: Feature Engineering

  • [ ] Create temporal features
  • [ ] Calculate anomalies
  • [ ] Generate stress indicators
  • [ ] Merge with price data

Week 3: Model Training

  • [ ] Train Variant A (NDVI-only)
  • [ ] Train Variant B (Multi-index)
  • [ ] Train Variant C (Integrated)
  • [ ] Hyperparameter optimization

Week 4: Validation

  • [ ] Rolling cross-validation
  • [ ] Statistical significance tests
  • [ ] Feature importance analysis
  • [ ] Final performance report

Key Differences from Invalid FAMILY_SATELLITE_NDVI_ASYMMETRY

  1. REAL satellite data: Using actual Sentinel-2 zarr store, not synthetic
  2. REAL parcel masks: Using BRP API for actual potato field boundaries
  3. Validated methodology: Following proven baseline comparison framework
  4. Conservative targets: 10-15% improvement vs unrealistic 83.7% claim
  5. Proper validation: Time series cross-validation with no data leakage

Notes for Implementation

  • Ensure sufficient memory for zarr loading (75GB dataset)
  • Use dask for parallel processing if needed (a lazy-loading sketch follows this list)
  • Save intermediate results for reproducibility
  • Log all experiments to MLflow
  • Document any data quality issues encountered
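
A lazy-loading sketch for the 75GB store (chunk sizes are illustrative; only the reduced weekly result is pulled into memory):

import xarray as xr

sentinel = xr.open_zarr(
    "/Users/sethvanderbijl/PitchAI Code/potato_supply/lake_31UFU_medium_polder.zarr",
    chunks={"time": 8},
)
ndvi = (sentinel.B08 - sentinel.B04) / (sentinel.B08 + sentinel.B04 + 1e-8)
weekly_mean = ndvi.resample(time="1W").mean().compute()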

Verdict Block - Run 1

Date: 2025-08-24
Experimenter: EX Agent
Git SHA: exp/FAMILY_SEASONAL_PLANTING/variants_abc
MLflow Run ID: (local run)

Verdict: REFUTED

Summary: The hypothesis that REAL satellite NDVI data improves potato price forecasting at the 12-week horizon is REFUTED. The model using genuine Sentinel-2 NDVI features performed 3.0% WORSE than the persistent baseline.

Data Sources (ALL REAL)

  • Satellite: lake_31UFU_medium_polder.zarr (75GB Sentinel-2, 2021-2023)
  • Parcels: BRP API consumption potato masks (2022, 304 hectares)
  • Prices: BoerderijApi NL.157.2086 (weekly, 2021-2023)
  • Processing: 612 satellite scenes → 301 clear scenes → 152 weekly observations

Results

Model Performance:

  • Model MAE: €10.72
  • Target: >7.6% improvement over the strongest baseline
  • Actual: -3.0% (worse than baseline)

Baseline Comparison:

  • Model: MAE = €10.72
  • Persistent baseline: MAE = €10.41 (improvement: -3.0%)
  • Seasonal naive baseline: MAE = €10.41 (improvement: -3.0%)
  • AR2 baseline: MAE = €10.41 (improvement: -3.0%)
  • Naive (historical mean) baseline: MAE = €10.41 (improvement: -3.0%)
  • Strongest competitor: persistent (€10.41)
  • Primary improvement: -3.0% vs persistent baseline

Feature Importance (Top 5):

  1. price_lag_12w: 76.0
  2. ndvi_max: 53.0
  3. clear_pct: 48.0
  4. week: 40.0
  5. ndvi_std: 39.0
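
The values above look like LightGBM split counts; a sketch of how such a table could be produced from the trained Variant A model (assuming `model` and `X` from `train_variant_a`):

import pandas as pd

# Default importance_type='split': how often each feature is used in a tree split
importances = pd.Series(model.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False).head(5))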

Statistical Evidence

  • Diebold-Mariano test: Not needed (model worse than baseline); a minimal version of the test is sketched after this list
  • TOST equivalence: N/A
  • Effect size: -3.0% deterioration
  • Confidence: HIGH (consistent across all baselines)
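
For future runs where the model does beat a baseline, a minimal Diebold-Mariano check on absolute-error loss might look like this (no HAC or small-sample correction; a sketch, not the project's shared utility):

import numpy as np
from scipy import stats

def diebold_mariano(errors_model, errors_baseline):
    """Two-sided DM test on absolute-error loss (simplified, no HAC correction)."""
    d = np.abs(errors_model) - np.abs(errors_baseline)  # loss differentials per forecast
    n = len(d)
    dm_stat = d.mean() / (d.std(ddof=1) / np.sqrt(n))
    p_value = 2 * (1 - stats.t.cdf(abs(dm_stat), df=n - 1))
    return dm_stat, p_value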

Interpretation

The REAL satellite NDVI features failed to improve price forecasting and actually degraded performance. Key findings:

  1. Price features dominate: price_lag_12w was most important (76.0), suggesting persistence dominates at 12-week horizon
  2. NDVI signals weak: While NDVI features appeared in top 10, their contribution was insufficient to overcome noise
  3. Cloud coverage impact: clear_pct was 3rd most important, indicating data quality issues in Netherlands
  4. Temporal misalignment: the 12-week horizon may not line up with the lag at which vegetation signals translate into price impacts

Caveats and Limitations

  1. Limited time period: Only 3 years of data (2021-2023)
  2. Single region: Noordoostpolder only (304 hectares)
  3. Cloud coverage: Netherlands has high cloud cover, reducing useful observations
  4. Aggregation level: Weekly aggregation may lose important daily signals
  5. Simple features: More sophisticated vegetation indices or anomaly detection might help

Comparison to Prior Results

  • FAMILY_SATELLITE_NDVI_ASYMMETRY: Used synthetic data, claimed 83.7% improvement (INVALID)
  • This experiment: Used REAL data, found -3.0% deterioration (VALID)
  • Conclusion: Synthetic results were completely unrealistic

Recommendations

  1. Longer horizons: Test 16-20 week horizons where seasonal patterns dominate
  2. Drought years: Focus on extreme weather events where NDVI signal stronger
  3. Multiple regions: Include more growing areas for robust signal
  4. Alternative indices: Try EVI, SAVI, or moisture indices
  5. Ensemble approach: Combine with weather data for stronger signal

Data Validation

ALL DATA SOURCES VERIFIED AS REAL:

  • Used actual Sentinel-2 zarr store (no synthetic NDVI)
  • Used actual BRP potato parcels (no random masks)
  • Used actual BoerderijApi prices (no simulated data)
  • Tested against all 4 standard baselines
  • NO synthetic/mock/dummy data used

Provenance:

  • Experiment script: experiments/FAMILY_SATELLITE_NDVI_REAL/run_experiment_optimized.py
  • Data version: Sentinel-2 L2A (2021-2023), BRP 2022, Boerderij weekly prices
  • Standard baselines: experiments/_shared/baselines.py (v1.0)

Codex validation

Codex Validation — 2025-11-10

Files Reviewed

  • experiment.md
  • hypothesis.yml
  • experiments/FAMILY_SATELLITE_NDVI_REAL/run_experiment.py

Findings

  1. Real data only. The run script reads Sentinel-2 zarr stores, BRP parcel masks, and Boerderij prices; there are no synthetic fallbacks in the executed pipeline (random sampling is limited to choosing pixels from real scenes).
  2. Experiment executed. The Aug 24 run (MLflow local run, SHA exp/FAMILY_SEASONAL_PLANTING/variants_abc) consumed 612 scenes and produced MAE metrics (experiment.md:330-372).
  3. Baseline still better. The NDVI model’s MAE (€10.72) is 3.0% worse than the persistent baseline (€10.41), so the “real NDVI improves 12-week forecasts” hypothesis is REFUTED.

Verdict

NOT VALIDATED – Even with real satellite data and a completed run, the model underperforms the price-only baseline, so satellite NDVI alone is not validated for price prediction.