Provider: SageMaker

The sagemaker provider runs the pipeline as a multi-step SageMaker Pipeline on AWS. It scales from thousands to millions of rows with the same YAML: only the provider value changes.

Prerequisites

pip install godml[aws]
# AWS credentials
export AWS_ACCESS_KEY_ID=AKIA...
export AWS_SECRET_ACCESS_KEY=...
export AWS_DEFAULT_REGION=us-east-1
export SAGEMAKER_ROLE_ARN=arn:aws:iam::123456789012:role/SageMakerRole
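
Before the first run it can help to confirm that the variables above are actually set. The following is a minimal pre-flight sketch, not part of godml: the `check_aws_env` name and the ARN pattern are illustrative assumptions, and it only covers the environment-variable setup shown here (AWS credentials can also come from profiles or instance roles).

```python
import os
import re

# Variables exported in the prerequisites above.
REQUIRED_VARS = (
    "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY",
    "AWS_DEFAULT_REGION",
    "SAGEMAKER_ROLE_ARN",
)

# Loose shape check for an IAM role ARN (assumption: standard "aws" partition).
ROLE_ARN_RE = re.compile(r"^arn:aws:iam::\d{12}:role/.+$")


def check_aws_env(env=None):
    """Return a list of human-readable problems; an empty list means OK."""
    env = os.environ if env is None else env
    problems = [f"{name} is not set" for name in REQUIRED_VARS if not env.get(name)]
    role_arn = env.get("SAGEMAKER_ROLE_ARN", "")
    if role_arn and not ROLE_ARN_RE.match(role_arn):
        problems.append("SAGEMAKER_ROLE_ARN does not look like an IAM role ARN")
    return problems
```

Calling `check_aws_env()` with no arguments inspects the current process environment; passing a dict makes the check easy to try without touching real credentials.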

Minimal configuration

name: customer-churn
version: 1.0.0
provider: sagemaker

dataset:
  uri: s3://mi-bucket/data/churn.csv  # must be an S3 URI
  target: churned

aws:
  role_arn: ${SAGEMAKER_ROLE_ARN}
  region: us-east-1
  s3_bucket: mi-bucket

model:
  type: xgboost
  hyperparameters:
    max_depth: 6
    eta: 0.3

metrics:
  - name: auc
    threshold: 0.80
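
The `${SAGEMAKER_ROLE_ARN}` placeholder and the S3-only rule for the dataset URI can be illustrated in a few lines. How godml resolves placeholders internally is not described on this page, so this is a sketch that assumes shell-style substitution from the environment; `expand_placeholders` and `split_s3_uri` are hypothetical helper names.

```python
import os
from string import Template
from urllib.parse import urlparse


def expand_placeholders(value, env=None):
    """Replace ${VAR} placeholders with values from env (KeyError if a variable is unset)."""
    env = os.environ if env is None else env
    return Template(value).substitute(env)


def split_s3_uri(uri):
    """Split s3://bucket/key into (bucket, key); reject anything that is not an S3 URI."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"dataset.uri must be an S3 URI, got: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")
```

Failing early on an unset variable or a local path is usually cheaper than discovering the problem after a pipeline execution has already started.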

Full configuration

provider: sagemaker

aws:
  role_arn: ${SAGEMAKER_ROLE_ARN}
  region: us-east-1
  s3_bucket: mi-bucket
  s3_prefix: godml            # default: "godml"
  kms_key_id: ${KMS_KEY_ID}   # optional: KMS encryption

compute:
  preprocessing: ml.m5.large  # default
  training: ml.m5.2xlarge     # default
  evaluation: ml.m5.large     # default

registry:
  model_package_group: godml-churn
  approval: manual            # manual | auto
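
One way to read the defaults noted in the comments above: any key left out of the YAML falls back to the documented value. The sketch below layers a user config over those defaults. The merge logic and the `resolve_config` name are assumptions for illustration; only the default values themselves come from this page.

```python
import copy

# Defaults taken from the comments in the full configuration above.
DEFAULTS = {
    "aws": {"s3_prefix": "godml"},
    "compute": {
        "preprocessing": "ml.m5.large",
        "training": "ml.m5.2xlarge",
        "evaluation": "ml.m5.large",
    },
}


def resolve_config(user_config):
    """Return a new dict with user values layered over the defaults."""
    resolved = copy.deepcopy(DEFAULTS)
    for section, values in user_config.items():
        if isinstance(values, dict):
            # Merge key by key so unspecified keys keep their default.
            resolved.setdefault(section, {}).update(values)
        else:
            resolved[section] = values
    return resolved
```

With this reading, a config that sets only `compute.training` still ends up with `ml.m5.large` for preprocessing and evaluation.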

Generated pipeline

godml automatically creates this pipeline in SageMaker:

s3://bucket/data/churn.csv
        ↓
┌──────────────────────────────────────────────┐
│ PreprocessingStep [ml.m5.large]              │
│   • compliance + PII masking                 │
│   • train/test split 80/20                   │
│   → s3://bucket/godml/pipeline/preprocessed/ │
└──────────────────────────────────────────────┘
        ↓
┌──────────────────────────────────────────────┐
│ TrainingStep [ml.m5.2xlarge]                 │
│   • XGBoost built-in container               │
│   → s3://bucket/godml/pipeline/model/        │
└──────────────────────────────────────────────┘
        ↓
┌──────────────────────────────────────────────┐
│ EvaluationStep [ml.m5.large]                 │
│   • AUC, F1, Precision, Recall               │
│   → evaluation.json                          │
└──────────────────────────────────────────────┘
        ↓ (only if AUC ≥ threshold)
┌──────────────────────────────────────────────┐
│ RegisterModel (conditional)                  │
│   • SageMaker Model Package Group            │
│   • PendingManualApproval / Approved         │
└──────────────────────────────────────────────┘
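
The conditional edge before RegisterModel boils down to comparing the evaluated metric with the threshold from the `metrics` section. A minimal sketch, assuming `evaluation.json` is a flat mapping of metric name to value (its real layout is not shown on this page) and using a hypothetical `should_register` helper:

```python
import json


def should_register(evaluation, metrics):
    """True only if every configured metric is present and meets its threshold."""
    for metric in metrics:
        value = evaluation.get(metric["name"])
        if value is None or value < metric["threshold"]:
            return False
    return True


# Thresholds as in the YAML `metrics` list; results as parsed JSON (made-up values).
metrics = [{"name": "auc", "threshold": 0.80}]
evaluation = json.loads('{"auc": 0.86, "f1": 0.71}')
print(should_register(evaluation, metrics))
```

A value exactly equal to the threshold passes, which matches the "≥" on the diagram edge; a missing metric is treated as a failure rather than silently registering the model.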

Recommended instance types

| Dataset | Recommended training instance | Approx. cost |
| --- | --- | --- |
| < 100K rows | ml.m5.large | $0.12/h |
| 100K–1M rows | ml.m5.2xlarge | $0.46/h |
| > 1M rows | ml.m5.4xlarge | $0.92/h |
| With GPU | ml.g4dn.xlarge | $0.74/h |

Steps start up and shut down on their own, so you pay only for the time they run.
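
To turn the table into a quick estimate, the sketch below picks a training instance from the row count and multiplies the hourly price by the run time. The function names are hypothetical, the prices are the approximate figures from the table (actual pricing varies by region and over time), and treating exactly 1M rows as part of the middle tier is an assumption.

```python
# Hourly prices copied from the table above; treat them as rough estimates.
HOURLY_USD = {
    "ml.m5.large": 0.12,
    "ml.m5.2xlarge": 0.46,
    "ml.m5.4xlarge": 0.92,
    "ml.g4dn.xlarge": 0.74,
}


def recommend_training_instance(rows, gpu=False):
    """Map a row count to the training instance suggested in the table."""
    if gpu:
        return "ml.g4dn.xlarge"
    if rows < 100_000:
        return "ml.m5.large"
    if rows <= 1_000_000:
        return "ml.m5.2xlarge"
    return "ml.m5.4xlarge"


def estimate_cost(instance_type, minutes):
    """Approximate cost in USD for a step that runs for `minutes`."""
    return round(HOURLY_USD[instance_type] * minutes / 60, 4)
```

For example, a 30-minute training step on the instance recommended for 500K rows comes to roughly 0.23 USD at the listed rate.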

Required IAM role

{
  "Effect": "Allow",
  "Action": [
    "sagemaker:*",
    "s3:GetObject",
    "s3:PutObject",
    "s3:ListBucket",
    "iam:PassRole",
    "logs:CreateLogGroup",
    "logs:CreateLogStream",
    "logs:PutLogEvents",
    "ecr:GetDownloadUrlForLayer",
    "ecr:BatchGetImage"
  ],
  "Resource": "*"
}

In the AWS console: IAM → Roles → Create Role → SageMaker → AmazonSageMakerFullAccess.
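
The JSON above is a single statement; IAM expects it inside a policy document with `Version` and `Statement` keys. The sketch below wraps it accordingly and prints the result. `build_policy` is a hypothetical helper, and the permissions are copied as-is from this page, not tightened.

```python
import json

# The statement shown above, unchanged.
STATEMENT = {
    "Effect": "Allow",
    "Action": [
        "sagemaker:*",
        "s3:GetObject",
        "s3:PutObject",
        "s3:ListBucket",
        "iam:PassRole",
        "logs:CreateLogGroup",
        "logs:CreateLogStream",
        "logs:PutLogEvents",
        "ecr:GetDownloadUrlForLayer",
        "ecr:BatchGetImage",
    ],
    "Resource": "*",
}


def build_policy(statement):
    """Wrap one statement (or a list of them) in a complete IAM policy document."""
    statements = statement if isinstance(statement, list) else [statement]
    return {"Version": "2012-10-17", "Statement": statements}


print(json.dumps(build_policy(STATEMENT), indent=2))
```

The printed document can be pasted into the JSON policy editor when attaching an inline policy to the role.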

Supported models

| model.type | AWS container |
| --- | --- |
| xgboost | SageMaker XGBoost built-in |
| random_forest | SageMaker SKLearn |
| logistic_regression | SageMaker SKLearn |
| lightgbm | SageMaker SKLearn + lightgbm |
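
The same mapping can be expressed as a lookup that rejects unknown values up front. This is an illustrative sketch with a hypothetical `container_for` helper; the container labels are the descriptive names from the table, not image URIs.

```python
# model.type -> container family, as listed in the table above.
SUPPORTED_MODELS = {
    "xgboost": "SageMaker XGBoost built-in",
    "random_forest": "SageMaker SKLearn",
    "logistic_regression": "SageMaker SKLearn",
    "lightgbm": "SageMaker SKLearn + lightgbm",
}


def container_for(model_type):
    """Return the container family for a model.type, or fail listing the valid options."""
    try:
        return SUPPORTED_MODELS[model_type]
    except KeyError:
        valid = ", ".join(sorted(SUPPORTED_MODELS))
        raise ValueError(
            f"unsupported model.type {model_type!r}; expected one of: {valid}"
        ) from None
```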

Run

godml run -f godml.yml

godml:

  1. Builds the Pipeline definition
  2. Upserts it in SageMaker (creates or updates)
  3. Starts the execution
  4. Waits and shows the status of each step
  5. Reports the execution ARN for traceability
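
Steps 4 and 5 can also be reproduced outside godml once you have the execution ARN. The sketch below polls an execution through a SageMaker client and prints step status changes; it is not godml's implementation. `wait_for_execution` is a hypothetical name, while `describe_pipeline_execution` and `list_pipeline_execution_steps` are the boto3 SageMaker client calls it relies on.

```python
import time

# Execution states after which polling can stop.
TERMINAL_STATES = {"Succeeded", "Failed", "Stopped"}


def wait_for_execution(client, execution_arn, poll_seconds=30, sleep=time.sleep):
    """Poll a pipeline execution until it ends, printing each step status change."""
    seen = {}
    while True:
        status = client.describe_pipeline_execution(
            PipelineExecutionArn=execution_arn
        )["PipelineExecutionStatus"]
        # Only the first page of steps is read; enough for a short pipeline.
        steps = client.list_pipeline_execution_steps(
            PipelineExecutionArn=execution_arn
        )["PipelineExecutionSteps"]
        for step in steps:
            name, step_status = step["StepName"], step["StepStatus"]
            if seen.get(name) != step_status:
                seen[name] = step_status
                print(f"{name}: {step_status}")
        if status in TERMINAL_STATES:
            return status
        sleep(poll_seconds)
```

With boto3 installed and credentials configured, `client` would be `boto3.client("sagemaker")` and `execution_arn` the ARN reported at the end of the run. Passing the client and the sleep function as parameters keeps the loop easy to exercise with a stub.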
