
Phase 3: Production Hardening



Phase Objective

Prepare the system for production with retry logic, monitoring, health checks, a systemd service, and complete documentation. This phase adds resilience, observability, and operability.

Main deliverables:

  • Retry with exponential backoff
  • Automatic cleanup of old jobs
  • Admin endpoints for managing failed jobs
  • Prometheus metrics and health checks
  • Structured logging and alerting
  • Systemd service with auto-restart
  • Complete technical documentation

Subphases

3.1 Retry + Error Handling

Tasks: 4 cards
Estimate: 14 hours
Goal: Implement retry logic, cleanup, and admin endpoints

Main tasks:

  • Retry columns migration (2h)
  • Retry logic with exponential backoff (5h)
  • Cron job to clean up stale jobs (3h)
  • Admin endpoint for failed jobs (4h)

Dependencies: Requires all of Phase 2 to be complete


3.2 Monitoring + Health Checks

Tasks: 4 cards
Estimate: 14 hours
Goal: Metrics, health checks, structured logging, and alerting

Main tasks:

  • Prometheus metrics endpoint (4h)
  • Health check endpoint (3h)
  • Structured logging (3h)
  • Alert triggers (4h)

Dependencies: Requires Phase 3.1 to be complete


3.3 Systemd Service

Tasks: 3 cards
Estimate: 5 hours
Goal: Systemd service with auto-restart and logs available through journalctl

Main tasks:

  • Systemd service file (2h)
  • Auto-restart config (1h)
  • journalctl logs (2h)

Dependencies: Requires Phase 3.2 to be complete


3.4 Documentation

Tasks: 5 cards
Estimate: 15 hours
Goal: Complete documentation (CLAUDE.md, technical, architecture, OpenAPI, CHANGELOG)

Main tasks:

  • Update server/CLAUDE.md (3h)
  • Create docs/backend/background-jobs-system.md (4h)
  • Create docs/architecture/background-jobs-architecture.md (3h)
  • Update OpenAPI (3h)
  • Update CHANGELOG.md (2h)

Dependencies: Requires Phases 3.1-3.3 to be complete


Dependencies Between Subphases

Phase 2 SSE complete
  ↓
3.1 Retry + Error Handling
  ↓
3.2 Monitoring + Health Checks
  ↓
3.3 Systemd Service
  ↓
3.4 Documentation

Recommended sequence:

  1. Confirm that Phase 2 is complete and its tests pass
  2. Implement 3.1 (retry and error handling)
  3. Implement 3.2 (monitoring and observability)
  4. Implement 3.3 (systemd for deployment)
  5. Implement 3.4 (complete documentation)

Total Estimate for the Phase

Total cards: 16
Total hours: ~48 hours
Estimated duration (1 dev full-time): 6-7 working days
Estimated duration (1 dev part-time at 50%): 12-14 working days


Completion Criteria

Phase 3 is considered complete when:

3.1 Retry

  • [ ] Columns retry_count and max_retries added
  • [ ] Exponential backoff implemented (delay = base * 2^attempt, capped at 1 hour)
  • [ ] Cleanup cron job runs every hour
  • [ ] Admin endpoint allows viewing, retrying, and deleting failed jobs
  • [ ] Tests verify the complete retry flow

3.2 Monitoring

  • [ ] Endpoint /jobs/metrics returns Prometheus format
  • [ ] Health check verifies worker, DB, and queue size
  • [ ] Structured JSON logging to STDOUT and to a file
  • [ ] Alert service sends notifications to an external webhook
  • [ ] Tests verify format and thresholds
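
The structured-logging criterion can be illustrated with a small formatter that emits one JSON object per line. This is only a sketch: the function name formatLogLine and the field names (timestamp, level, message, context) are assumptions made for illustration, not the project's actual logger.

```php
<?php

/**
 * Hypothetical helper: formats one log record as a single JSON line.
 * Field names are illustrative assumptions, not the project's schema.
 */
function formatLogLine(string $level, string $message, array $context, DateTimeImmutable $now): string
{
    $record = [
        'timestamp' => $now->format(DATE_ATOM),   // ISO 8601 with timezone offset
        'level'     => strtolower($level),
        'message'   => $message,
        'context'   => (object) $context,         // encode as a JSON object, even when empty
    ];

    // JSON_THROW_ON_ERROR surfaces encoding problems instead of returning false.
    return json_encode(
        $record,
        JSON_UNESCAPED_SLASHES | JSON_UNESCAPED_UNICODE | JSON_THROW_ON_ERROR
    ) . "\n";
}
```

The same line could then be written both to STDOUT (for journald) and appended to a log file, which matches the "STDOUT and file" requirement without formatting the record twice.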

3.3 Systemd

  • [ ] Service file installed and enabled
  • [ ] Auto-restart configured (max 5 restarts in 600s)
  • [ ] Logs visible in journalctl
  • [ ] Graceful shutdown on SIGTERM
  • [ ] Tests simulate crash/restart

3.4 Documentation

  • [ ] server/CLAUDE.md updated with a Background Jobs section
  • [ ] docs/backend/background-jobs-system.md complete
  • [ ] docs/architecture/background-jobs-architecture.md complete
  • [ ] OpenAPI spec updated
  • [ ] CHANGELOG.md updated with the new version
  • [ ] All docs reviewed and free of dead links

Technical Notes

Exponential Backoff

php
// RetryStrategy.php
public function calculateNextAttempt(int $retryCount): DateTime
{
    $baseDelay = 60; // 1 minute
    $delay = min($baseDelay * (2 ** $retryCount), 3600); // capped at 1 hour

    return new DateTime("+{$delay} seconds");
}

Resulting delays:

Retry 0: 60s (1 min)
Retry 1: 120s (2 min)
Retry 2: 240s (4 min)
Retry 3: 480s (8 min)
Retry 4: 960s (16 min)
Retry 5: 1920s (32 min)
Retry 6+: 3600s (1 hour) ← max cap
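
To show how the backoff could combine with the retry_count and max_retries columns, here is a minimal sketch. The function names retryDelaySeconds and nextAttemptAt are hypothetical, and returning null to signal "mark the job as failed" is an assumption about the flow, not the project's confirmed behavior.

```php
<?php

/**
 * Delay in seconds before the given retry (0-based): doubles from $base, capped at $cap.
 */
function retryDelaySeconds(int $retryCount, int $base = 60, int $cap = 3600): int
{
    // Clamp the exponent so very large counts cannot overflow the multiplication.
    $exponent = min(max($retryCount, 0), 20);

    return min($base * (2 ** $exponent), $cap);
}

/**
 * Hypothetical decision helper: next attempt time, or null when retries are exhausted.
 */
function nextAttemptAt(int $retryCount, int $maxRetries, DateTimeImmutable $now): ?DateTimeImmutable
{
    if ($retryCount >= $maxRetries) {
        return null; // caller would mark the job as permanently failed
    }

    return $now->modify('+' . retryDelaySeconds($retryCount) . ' seconds');
}
```

Keeping the delay calculation as a pure function of the retry count makes the delays listed above easy to verify in a unit test without touching the clock or the database.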

Cleanup Stale Jobs

bash
# Crontab: runs at minute 0 of every hour
0 * * * * /usr/bin/php /var/www/Bautista/server/bin/cleanup-stale-jobs.php >> /var/log/background-jobs-cleanup.log 2>&1

php
// cleanup-stale-jobs.php
// Delete finished jobs (completed or failed) that completed more than 30 days ago
$deleted = $jobModel->deleteWhere([
    'status' => ['completed', 'failed'],
    'completed_at < NOW() - INTERVAL \'30 days\''
]);

$logger->info("Cleanup: deleted {$deleted} jobs");
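
If the cleanup were written against plain PDO instead of the model helper, the cutoff could be computed in PHP and bound as a parameter. The sketch below assumes a table named background_jobs with status and completed_at columns; buildCleanupQuery is a hypothetical name.

```php
<?php

/**
 * Hypothetical helper: builds a parameterized DELETE for finished jobs
 * older than the retention window. Table and column names are assumptions.
 *
 * @return array{0: string, 1: array<string, string>} SQL and bound parameters
 */
function buildCleanupQuery(DateTimeImmutable $now, int $retentionDays = 30): array
{
    if ($retentionDays < 1) {
        throw new InvalidArgumentException('retentionDays must be >= 1');
    }

    // Everything completed before this instant is eligible for deletion.
    $cutoff = $now->sub(new DateInterval('P' . $retentionDays . 'D'));

    $sql = "DELETE FROM background_jobs"
         . " WHERE status IN ('completed', 'failed')"
         . " AND completed_at < :cutoff";

    return [$sql, [':cutoff' => $cutoff->format('Y-m-d H:i:s')]];
}
```

A caller would pass the result to `$pdo->prepare($sql)` and `$stmt->execute($params)`, then read `$stmt->rowCount()` for the log message. Binding the cutoff keeps the retention period configurable and makes the query testable with a fixed clock.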

Prometheus Metrics

# HELP background_jobs_pending Number of pending jobs
# TYPE background_jobs_pending gauge
background_jobs_pending{schema="suc0001"} 15
background_jobs_pending{schema="suc0002"} 3

# HELP background_jobs_processing Number of processing jobs
# TYPE background_jobs_processing gauge
background_jobs_processing{schema="suc0001"} 2

# HELP background_jobs_completed_total Total completed jobs
# TYPE background_jobs_completed_total counter
background_jobs_completed_total{schema="suc0001"} 1543

# HELP background_jobs_failed_total Total failed jobs
# TYPE background_jobs_failed_total counter
background_jobs_failed_total{schema="suc0001"} 12

# HELP background_jobs_execution_seconds Job execution time in seconds
# TYPE background_jobs_execution_seconds histogram
background_jobs_execution_seconds_bucket{schema="suc0001",le="1"} 450
background_jobs_execution_seconds_bucket{schema="suc0001",le="5"} 890
background_jobs_execution_seconds_bucket{schema="suc0001",le="10"} 1200
# A complete histogram also exposes the le="+Inf" bucket and the _sum and _count series (omitted here)
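
As an illustration of how the gauge lines in this output might be produced, the following sketch renders per-schema values in the Prometheus text exposition format. The function renderGauge and its input shape are assumptions; the project may well use a client library instead.

```php
<?php

/**
 * Hypothetical renderer: per-schema gauge values in Prometheus text format.
 *
 * @param array<string, int|float> $valuesBySchema e.g. ['suc0001' => 15]
 */
function renderGauge(string $name, string $help, array $valuesBySchema): string
{
    $lines = [
        sprintf('# HELP %s %s', $name, $help),
        sprintf('# TYPE %s gauge', $name),
    ];

    foreach ($valuesBySchema as $schema => $value) {
        // Label values must escape backslash, double quote, and newline.
        $label = str_replace(
            ["\\", "\"", "\n"],
            ["\\\\", "\\\"", "\\n"],
            (string) $schema
        );
        $lines[] = sprintf('%s{schema="%s"} %s', $name, $label, $value);
    }

    // The format requires every line, including the last, to end with a newline.
    return implode("\n", $lines) . "\n";
}
```

The response would typically be served with the content type `text/plain; version=0.0.4`, which is what Prometheus expects for this format.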

Health Check

json
GET /jobs/health

{
  "status": "healthy",
  "timestamp": "2025-01-15T10:30:00Z",
  "checks": {
    "worker": {
      "status": "up",
      "last_heartbeat": "2025-01-15T10:29:55Z",
      "latency_ms": 850
    },
    "database": {
      "status": "up",
      "connection_pool": "5/10"
    },
    "queue": {
      "status": "healthy",
      "pending_count": 15,
      "threshold": 1000
    }
  }
}
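
One way the overall status could be derived from the individual checks is sketched below. The names summarizeHealth and queueStatus, the "unhealthy" and "overloaded" labels, and the 503 response code for a failing check are all assumptions; the source only states that a healthy system returns 200 OK.

```php
<?php

/**
 * Hypothetical aggregation: overall status and HTTP code from component checks.
 *
 * @param array<string, array{status?: string}> $checks keyed by component name
 * @return array{status: string, http_code: int}
 */
function summarizeHealth(array $checks): array
{
    foreach ($checks as $check) {
        // A missing status is treated as a failure.
        if (!in_array($check['status'] ?? 'down', ['up', 'healthy'], true)) {
            return ['status' => 'unhealthy', 'http_code' => 503];
        }
    }

    return ['status' => 'healthy', 'http_code' => 200];
}

/**
 * Hypothetical queue check: healthy while the pending count is below the threshold.
 */
function queueStatus(int $pendingCount, int $threshold): string
{
    return $pendingCount < $threshold ? 'healthy' : 'overloaded';
}
```

Returning a non-2xx code when any check fails lets load balancers and uptime monitors react without parsing the JSON body.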

Systemd Service

ini
# /etc/systemd/system/background-jobs-worker.service
[Unit]
Description=Background Jobs Worker - Sistema Bautista
After=postgresql.service
Requires=postgresql.service
# Start rate limiting (max 5 starts in 600s); these keys belong in [Unit]
StartLimitIntervalSec=600
StartLimitBurst=5

[Service]
Type=simple
User=www-data
WorkingDirectory=/var/www/Bautista/server
ExecStart=/usr/bin/php /var/www/Bautista/server/bin/worker.php
Restart=on-failure
RestartSec=5
StandardOutput=journal
StandardError=journal

[Install]
WantedBy=multi-user.target

bash
# Installation
sudo cp server/systemd/background-jobs-worker.service /etc/systemd/system/
sudo systemctl daemon-reload
sudo systemctl enable background-jobs-worker
sudo systemctl start background-jobs-worker

# Monitoring
sudo systemctl status background-jobs-worker
journalctl -u background-jobs-worker.service -f
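
On the worker side, graceful shutdown on SIGTERM could look roughly like the sketch below. GracefulWorker is a hypothetical class, not the project's worker.php, and it assumes the pcntl extension is available on the CLI; without it, the handlers are simply not registered.

```php
<?php

/**
 * Hypothetical worker loop illustrating graceful shutdown on SIGTERM/SIGINT.
 */
final class GracefulWorker
{
    private bool $stopRequested = false;

    public function registerSignalHandlers(): void
    {
        // pcntl is optional; without it the loop stops only via requestStop().
        if (!function_exists('pcntl_async_signals') || !function_exists('pcntl_signal')) {
            return;
        }

        pcntl_async_signals(true); // deliver signals without manual dispatch calls
        pcntl_signal(SIGTERM, function (int $signo): void { $this->requestStop(); });
        pcntl_signal(SIGINT, function (int $signo): void { $this->requestStop(); });
    }

    public function requestStop(): void
    {
        $this->stopRequested = true;
    }

    /**
     * Runs $processNextJob until a stop is requested; returns the iteration count.
     * The flag is checked only between jobs, so an in-flight job always finishes.
     */
    public function run(callable $processNextJob): int
    {
        $iterations = 0;
        while (!$this->stopRequested) {
            $processNextJob();
            $iterations++;
        }

        return $iterations;
    }
}
```

By default systemd sends SIGTERM on stop and escalates to SIGKILL once TimeoutStopSec elapses, so a job that runs longer than that timeout would still be killed. A clean exit also does not trigger Restart=on-failure, which is the desired behavior for a deliberate stop.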

Alert Triggers

php
// AlertService.php
$alerts = [
    'high_failure_rate' => [
        'condition' => 'failed jobs > 5 in 5 minutes',
        'threshold' => 5,
        'window' => 300, // 5 min
        'cooldown' => 1800, // 30 min
    ],
    'queue_overflow' => [
        'condition' => 'pending jobs > 500',
        'threshold' => 500,
    ],
    'worker_inactive' => [
        'condition' => 'no heartbeat in 10 minutes',
        'threshold' => 600,
    ],
];

// Send via webhook to Slack/email
$this->sendAlert([
    'level' => 'critical',
    'title' => 'High failure rate detected',
    'message' => '8 jobs failed in the last 5 minutes',
    'schema' => 'suc0001',
    'timestamp' => date(DATE_ATOM), // ISO 8601
]);
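
The high_failure_rate rule combines a threshold, a time window, and a cooldown. A minimal sketch of that evaluation follows; the function name and the idea of passing failure timestamps as an array are assumptions, since the source does not show how AlertService gathers its data.

```php
<?php

/**
 * Hypothetical evaluator for the high_failure_rate rule.
 *
 * @param int[]    $failureTimestamps Unix timestamps of failed jobs
 * @param int|null $lastAlertAt       Unix timestamp of the previous alert, or null
 */
function shouldAlertHighFailureRate(
    array $failureTimestamps,
    int $now,
    ?int $lastAlertAt,
    int $threshold = 5,
    int $window = 300,
    int $cooldown = 1800
): bool {
    // Suppress repeated alerts while the cooldown is still active.
    if ($lastAlertAt !== null && ($now - $lastAlertAt) < $cooldown) {
        return false;
    }

    // Keep only failures that fall inside the sliding window ending at $now.
    $recent = array_filter(
        $failureTimestamps,
        static fn (int $ts): bool => $ts > $now - $window && $ts <= $now
    );

    return count($recent) > $threshold; // "failed jobs > 5 in 5 minutes"
}
```

The cooldown check comes first so that a sustained failure burst produces one alert every 30 minutes rather than one per evaluation cycle.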

Deployment Checklist

Pre-Deployment

  • [ ] All migrations executed
  • [ ] Tests passing (unit + integration + e2e)
  • [ ] Coverage >= 85%
  • [ ] Feature flag ENABLE_BACKGROUND_JOBS configured
  • [ ] Environment variables set in .env
  • [ ] Systemd service installed

Deployment

  • [ ] Run migrations in production
  • [ ] Start the systemd service
  • [ ] Verify the health check returns 200 OK
  • [ ] Verify the metrics endpoint works
  • [ ] Configure Prometheus scraping (if applicable)
  • [ ] Configure alerting webhooks

Post-Deployment

  • [ ] Monitor journalctl logs during the first hours
  • [ ] Verify jobs complete successfully
  • [ ] Verify SSE works in production
  • [ ] Verify retry logic works
  • [ ] Configure automatic backup of the background_jobs table
  • [ ] Document the runbook in the wiki
