Vertex AI AutoML Guide (Advanced / Optional) (Needs Testing)

This guide is optional. The main pipeline uses BQML (BigQuery ML) with a regression model for all ML tasks — no additional services required. This Vertex AI path is for advanced users who want Google's fully managed AutoML platform.

Note: The main BQML pipeline uses a regressor (predicts outage_ratio_6h_max, a continuous 0.0–1.0 value). This Vertex AI guide demonstrates the classification approach (binary outage_event) as an alternative. Both targets are available in bqml_training_data.

This guide covers training an outage prediction model using Vertex AI AutoML Tables as an alternative to BigQuery ML.

When to Use Vertex AI vs BQML

Use BQML (the default in this repo) when you want everything in SQL with no additional infrastructure. BQML boosted trees perform well and the entire workflow stays inside BigQuery.

Use Vertex AI AutoML when you want Google's fully managed model search, which tries hundreds of architectures and hyperparameter combinations. It typically produces the best performance but costs more and requires additional GCP setup.

Prerequisites

  • Vertex AI API enabled in your GCP project
  • BigQuery dataset with training data already built (run correlation + --phase ml-data)
  • Service account with roles: roles/aiplatform.user, roles/bigquery.dataViewer

Step 1: Export Training Data

The BQML training table (bqml_training_data) works directly as a Vertex AI data source. No export needed — Vertex AI can read from BigQuery.
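Vertex AI references BigQuery sources with a `bq://` URI of the form `bq://PROJECT.DATASET.TABLE`. A trivial helper for building it (the project and dataset names below are placeholders):

```python
# Build the bq:// URI that Vertex AI accepts as a BigQuery data source.
def bq_uri(project: str, dataset: str, table: str) -> str:
    return f"bq://{project}.{dataset}.{table}"

# Placeholder project/dataset names:
uri = bq_uri("my-project", "outage_ds", "bqml_training_data")
# -> "bq://my-project.outage_ds.bqml_training_data"
```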

The table contains both target columns:

  • outage_ratio_6h_max — for regression (continuous 0.0–1.0)
  • outage_event — for classification (binary 0/1)

Step 2: Create Dataset in Vertex AI

# Using gcloud CLI
gcloud ai datasets create \
  --display-name="outage-prediction-training" \
  --metadata-schema-uri=gs://google-cloud-aiplatform/schema/dataset/metadata/tabular_1.0.0.yaml \
  --project=$GCP_PROJECT \
  --region=us-central1

Then import from BigQuery:

gcloud ai datasets import $DATASET_ID \
  --source=bq://$GCP_PROJECT.$DATASET_NAME.bqml_training_data \
  --import-schema-uri=gs://google-cloud-aiplatform/schema/dataset/ioformat/tabular_io_format_1.0.0.yaml
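If your gcloud release does not include these dataset commands, the Vertex AI Python SDK (`google-cloud-aiplatform`) can do the same. This is a sketch only, not tested against a live project; the project, region, and URI values are placeholders:

```python
def create_tabular_dataset(project: str, region: str, bq_table_uri: str):
    """Create a Vertex AI tabular dataset from a BigQuery table (sketch).

    bq_table_uri: e.g. "bq://my-project.my_dataset.bqml_training_data".
    Requires google-cloud-aiplatform and GCP credentials at call time.
    """
    # Imported inside the function so the sketch stays importable without the SDK.
    from google.cloud import aiplatform

    aiplatform.init(project=project, location=region)
    return aiplatform.TabularDataset.create(
        display_name="outage-prediction-training",
        bq_source=bq_table_uri,
    )
```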

Step 3: Train AutoML Model

For classification (binary outage yes/no):

gcloud ai models create \
  --display-name="outage-predictor-automl-classifier" \
  --dataset=$DATASET_ID \
  --target-column=outage_event \
  --training-budget-milli-node-hours=1000 \
  --model-type=CLOUD \
  --predefined-split-column-name=data_split \
  --project=$GCP_PROJECT \
  --region=us-central1

For regression (continuous outage severity):

gcloud ai models create \
  --display-name="outage-predictor-automl-regressor" \
  --dataset=$DATASET_ID \
  --target-column=outage_ratio_6h_max \
  --training-budget-milli-node-hours=1000 \
  --model-type=CLOUD \
  --predefined-split-column-name=data_split \
  --project=$GCP_PROJECT \
  --region=us-central1
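The same two training jobs can also be launched from the Vertex AI Python SDK. A hedged sketch (untested here; pass in the dataset object created in Step 2):

```python
def train_automl(dataset, target_column: str, prediction_type: str):
    """Launch an AutoML tabular training job (sketch; needs GCP credentials).

    prediction_type: "classification" (target outage_event) or
    "regression" (target outage_ratio_6h_max).
    """
    from google.cloud import aiplatform

    job = aiplatform.AutoMLTabularTrainingJob(
        display_name=f"outage-predictor-automl-{prediction_type}",
        optimization_prediction_type=prediction_type,
    )
    # 1000 milli-node-hours = 1 node-hour, matching the gcloud budget above.
    return job.run(
        dataset=dataset,
        target_column=target_column,
        budget_milli_node_hours=1000,
        predefined_split_column_name="data_split",
    )
```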

Alternatively, use the Console UI:

  1. Go to Vertex AI → Training in the Cloud Console
  2. Click Create and select AutoML
  3. Choose the BigQuery table as your data source
  4. Set target column: outage_event (classification) or outage_ratio_6h_max (regression)
  5. Set data_split as the predefined split column
  6. Set budget to 1 node-hour (sufficient for this dataset size)
  7. Click Start Training

Training typically takes 1–3 hours.

Step 4: Evaluate the Model

In the Console, navigate to the trained model and view the Evaluate tab.

For classification — key metrics:

| Metric | Target |
| --- | --- |
| Precision | ≥ 0.65 |
| Recall | ≥ 0.85 |
| F1 | ≥ 0.75 |
| AUC-ROC | ≥ 0.80 |
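For reference, the relationship between these metrics and the confusion-matrix counts (plain arithmetic, independent of Vertex AI):

```python
def precision_recall_f1(tp: int, fp: int, fn: int):
    """Precision, recall, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical counts: 85 true positives, 35 false positives, 15 false negatives.
p, r, f = precision_recall_f1(85, 35, 15)
# p ≈ 0.708, r = 0.85, f ≈ 0.773 -> meets the precision, recall, and F1 targets
```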

For regression — key metrics:

| Metric | What it means |
| --- | --- |
| MAE | Average prediction error in outage-ratio units |
| R² | Variance explained (closer to 1.0 is better) |
| RMSE | Root mean squared error |
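MAE and RMSE are worth sanity-checking by hand on a few rows; both are in outage-ratio units:

```python
import math

def mae_rmse(y_true, y_pred):
    """Mean absolute error and root mean squared error."""
    errs = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errs) / len(errs)
    rmse = math.sqrt(sum(e * e for e in errs) / len(errs))
    return mae, rmse

# Hypothetical actual vs predicted outage ratios for three county-hours:
mae, rmse = mae_rmse([0.00, 0.05, 0.20], [0.01, 0.03, 0.15])
# mae ≈ 0.027, rmse ≈ 0.032
```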

Also check:

  • Feature importance — Wind speed and shear should rank highest
  • Confusion matrix (classifier) — False negatives are more costly than false positives (missed outages vs unnecessary crew staging)
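That cost asymmetry can be turned into a concrete alert threshold: for well-calibrated probabilities, expected cost is minimized by alerting when p ≥ c_FP / (c_FP + c_FN). The cost ratio below is purely hypothetical:

```python
def optimal_threshold(cost_fp: float, cost_fn: float) -> float:
    """Expected-cost-minimizing alert threshold for calibrated probabilities."""
    return cost_fp / (cost_fp + cost_fn)

# If a missed outage (FN) is, hypothetically, four times as costly as
# unnecessary crew staging (FP), alert at p >= 0.2 rather than 0.5.
t = optimal_threshold(cost_fp=1.0, cost_fn=4.0)
# t == 0.2
```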

Step 5: Deploy for Batch Prediction

For this use case, batch prediction is more appropriate than online endpoints (forecasts run daily, not per-request):

gcloud ai batch-prediction-jobs create \
  --display-name="outage-prediction-$(date +%Y%m%d)" \
  --model=$MODEL_ID \
  --instances-format=bigquery \
  --bigquery-source-input-uri=bq://$GCP_PROJECT.$DATASET_NAME.bqml_training_data \
  --predictions-format=bigquery \
  --bigquery-destination-output-uri=bq://$GCP_PROJECT.$DATASET_NAME.vertex_predictions \
  --project=$GCP_PROJECT \
  --region=us-central1
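The SDK equivalent, in case the gcloud flags differ in your CLI version (a sketch; the model resource name and URIs are placeholders, and this has not been run against a live project):

```python
def run_batch_prediction(model_resource_name: str, source_uri: str, dest_prefix: str):
    """Start a BigQuery-to-BigQuery batch prediction job (sketch).

    model_resource_name: "projects/.../locations/.../models/MODEL_ID".
    source_uri: "bq://project.dataset.table"; dest_prefix: "bq://project.dataset".
    """
    from google.cloud import aiplatform

    model = aiplatform.Model(model_resource_name)
    return model.batch_predict(
        job_display_name="outage-prediction-batch",
        bigquery_source=source_uri,
        bigquery_destination_prefix=dest_prefix,
        instances_format="bigquery",
        predictions_format="bigquery",
    )
```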

Step 6: Join Predictions Back to Pipeline

For classifier output:

CREATE OR REPLACE VIEW `your-project.your_dataset.view_vertex_risk` AS
SELECT
  p.county_fips,
  p.valid_ts,
  p.predicted_outage_event.scores[OFFSET(1)] AS outage_probability,
  CASE
    WHEN p.predicted_outage_event.scores[OFFSET(1)] >= 0.7 THEN 'HIGH'
    WHEN p.predicted_outage_event.scores[OFFSET(1)] >= 0.3 THEN 'MEDIUM'
    ELSE 'LOW'
  END AS ml_risk_tier
FROM `your-project.your_dataset.vertex_predictions` p;

For regressor output:

CREATE OR REPLACE VIEW `your-project.your_dataset.view_vertex_risk` AS
SELECT
  p.county_fips,
  p.valid_ts,
  p.predicted_outage_ratio_6h_max AS predicted_ratio,
  CASE
    WHEN p.predicted_outage_ratio_6h_max >= 0.15 THEN 'SEVERE'
    WHEN p.predicted_outage_ratio_6h_max >= 0.05 THEN 'OUTAGE'
    ELSE 'NORMAL'
  END AS ml_risk_tier
FROM `your-project.your_dataset.vertex_predictions` p;
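The two CASE expressions above, restated as plain functions (same thresholds) so the tiering logic can be unit-tested outside BigQuery:

```python
def classifier_tier(outage_probability: float) -> str:
    """Risk tier from classifier probability (view_vertex_risk, classifier)."""
    if outage_probability >= 0.7:
        return "HIGH"
    if outage_probability >= 0.3:
        return "MEDIUM"
    return "LOW"

def regressor_tier(predicted_ratio: float) -> str:
    """Risk tier from predicted outage ratio (view_vertex_risk, regressor)."""
    if predicted_ratio >= 0.15:
        return "SEVERE"
    if predicted_ratio >= 0.05:
        return "OUTAGE"
    return "NORMAL"

# classifier_tier(0.82) -> "HIGH"; regressor_tier(0.06) -> "OUTAGE"
```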

Cost Estimates

| Component | Approximate cost |
| --- | --- |
| AutoML training (1 node-hour) | $20 |
| Batch prediction (2-county, 10 days) | < $1 |
| Batch prediction (national, daily) | $5–15/run |
| Model storage (per month) | < $1 |
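Note that training budgets are specified in milli-node-hours (1000 milli-node-hours = 1 node-hour, the budget used throughout this guide). A back-of-envelope cost check, assuming the ~$20 per node-hour figure from the table:

```python
def training_cost_usd(budget_milli_node_hours: int, rate_per_node_hour: float = 20.0) -> float:
    """Rough AutoML training cost; the default rate is an assumption
    taken from the cost table above, not an official price."""
    return budget_milli_node_hours / 1000 * rate_per_node_hour

# training_cost_usd(1000) -> 20.0 (the 1 node-hour budget used in this guide)
```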

Comparison: BQML vs Vertex AI AutoML

| Aspect | BQML Boosted Tree | Vertex AI AutoML |
| --- | --- | --- |
| Setup complexity | Low (SQL only) | Medium (API/Console) |
| Training cost | Free tier available | ~$20/training run |
| Performance | Good | Best (tries hundreds of architectures) |
| Explainability | GLOBAL_EXPLAIN, feature importance | Feature importance, partial dependence |
| Inference | ML.PREDICT in SQL | Batch prediction or endpoint |
| Retraining | Re-run SQL | Re-run training job |
| Best for | Prototyping, SQL-first teams | Production, maximum accuracy |