Phase 3: Production Hardening

Phase Goal

Prepare the system for production with retry logic, monitoring, health checks, a systemd service, and complete documentation. This phase adds resilience, observability, and operability.

Main deliverables:

Retry with exponential backoff
Automatic cleanup of old jobs
Admin endpoints for managing failed jobs
Prometheus metrics and health checks
Structured logging and alerting
Systemd service with auto-restart
Complete technical documentation

Subphases

3.1 Retry + Error Handling

Tasks: 4 cards
Estimate: 14 hours
Goal: Implement retry logic, cleanup, and admin endpoints

Main tasks:

Retry columns migration (2h)
Retry logic with exponential backoff (5h)
Cron job to clean up stale jobs (3h)
Admin endpoint for failed jobs (4h)

Dependencies: Requires all of Phase 2 to be complete

3.2 Monitoring + Health Checks

Tasks: 4 cards
Estimate: 14 hours
Goal: Metrics, health checks, structured logging, and alerting

Main tasks:

Prometheus metrics endpoint (4h)
Health check endpoint (3h)
Structured logging (3h)
Alert triggers (4h)

Dependencies: Requires Phase 3.1 to be complete

3.3 Systemd Service

Tasks: 3 cards
Estimate: 5 hours
Goal: Systemd service with auto-restart and logs available through journalctl

Main tasks:

Systemd service file (2h)
Auto-restart config (1h)
journalctl logs (2h)

Dependencies: Requires Phase 3.2 to be complete

3.4 Documentation

Tasks: 5 cards
Estimate: 15 hours
Goal: Complete documentation (CLAUDE.md, technical, architecture, OpenAPI, CHANGELOG)

Main tasks:

Update server/CLAUDE.md (3h)
Create docs/backend/background-jobs-system.md (4h)
Create docs/architecture/background-jobs-architecture.md (3h)
Update OpenAPI (3h)
Update CHANGELOG.md (2h)

Dependencies: Requires Phases 3.1-3.3 to be complete

Dependencies Between Subphases

Phase 2 SSE complete
    ↓
3.1 Retry + Error Handling
    ↓
3.2 Monitoring + Health Checks
    ↓
3.3 Systemd Service
    ↓
3.4 Documentation

Recommended sequence:

Validate that Phase 2 is complete and its tests pass
Implement 3.1 (retry and error handling)
Implement 3.2 (monitoring and observability)
Implement 3.3 (systemd for deployment)
Implement 3.4 (complete documentation)

Total Phase Estimate

Total cards: 16
Total hours: ~48 hours
Estimated duration (1 full-time dev): 6-7 working days
Estimated duration (1 part-time dev at 50%): 12-14 working days

Completion Criteria

Phase 3 is considered complete when:

3.1 Retry

[ ] retry_count and max_retries columns added
[ ] Exponential backoff implemented (delay = base * 2^attempt, capped at 3600s)
[ ] Cleanup cron job runs every hour
[ ] Admin endpoint allows viewing, retrying, and deleting failed jobs
[ ] Tests verify the complete retry flow

3.2 Monitoring

[ ] The /jobs/metrics endpoint returns Prometheus format
[ ] Health check verifies worker, DB, and queue size
[ ] Structured JSON logging to STDOUT and to a file
[ ] Alert service sends notifications to an external webhook
[ ] Tests verify format and thresholds

3.3 Systemd

[ ] Service file installed and enabled
[ ] Auto-restart configured (max 5 restarts in 600s)
[ ] Logs visible in journalctl
[ ] Graceful shutdown on SIGTERM
[ ] Tests simulate crash/restart

3.4 Documentation

[ ] server/CLAUDE.md updated with a Background Jobs section
[ ] docs/backend/background-jobs-system.md complete
[ ] docs/architecture/background-jobs-architecture.md complete
[ ] OpenAPI spec updated
[ ] CHANGELOG.md updated with the new version
[ ] All docs reviewed, with no dead links

Technical Notes

Exponential Backoff

php

// RetryStrategy.php
public function calculateNextAttempt(int $retryCount): DateTime
{
    $baseDelay = 60; // 1 minute
    $delay = min($baseDelay * (2 ** $retryCount), 3600); // capped at 1 hour

    return new DateTime("+{$delay} seconds");
}

Retry 0: 60s (1 min)
Retry 1: 120s (2 min)
Retry 2: 240s (4 min)
Retry 3: 480s (8 min)
Retry 4: 960s (16 min)
Retry 5: 1920s (32 min)
Retry 6+: 3600s (1 hour) ← max cap
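
The schedule above only covers the delay. The retry_count and max_retries columns from the completion criteria also have to decide whether a job is retried at all. The following is a minimal sketch of that decision, not the project's actual code: the function names `backoffDelaySeconds` and `planRetry` are hypothetical, and it assumes a job stops being retried once retry_count reaches max_retries.

```php
<?php
// Hypothetical helpers (names are assumptions, not part of the codebase).

// Delay in seconds for a given retry, using the same formula as the
// schedule above: 60s base, doubling per retry, capped at 3600s.
function backoffDelaySeconds(int $retryCount, int $baseDelay = 60, int $maxDelay = 3600): int
{
    $retryCount = max(0, $retryCount); // guard against negative counters
    return (int) min($baseDelay * (2 ** $retryCount), $maxDelay);
}

// Returns when the next attempt should run, or null when retries are
// exhausted and the job should stay in the failed state.
function planRetry(int $retryCount, int $maxRetries, DateTimeImmutable $failedAt): ?DateTimeImmutable
{
    if ($retryCount >= $maxRetries) {
        return null;
    }

    return $failedAt->modify('+' . backoffDelaySeconds($retryCount) . ' seconds');
}
```

With max_retries set to 5, a job that fails with retry_count = 2 would be rescheduled 240 seconds later, and a job that fails with retry_count = 5 would get `null` and remain failed until an admin retries or deletes it.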

Cleanup Stale Jobs

bash

# Crontab
0 * * * * /usr/bin/php /var/www/Bautista/server/bin/cleanup-stale-jobs.php >> /var/log/background-jobs-cleanup.log 2>&1

php

// cleanup-stale-jobs.php
// Delete completed and failed jobs that finished more than 30 days ago
$deleted = $jobModel->deleteWhere([
    'status' => ['completed', 'failed'],
    'completed_at < NOW() - INTERVAL \'30 days\''
]);

$logger->info("Cleanup: deleted {$deleted} jobs");
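
For reference, the same cleanup can be expressed as a parameterized query. This is only a sketch under assumptions: it uses plain PDO rather than the project's `$jobModel`, the function names are hypothetical, and the table name background_jobs with its status and completed_at columns is inferred from this document rather than confirmed against the schema.

```php
<?php
// Hypothetical sketch; table/column names are assumptions taken from this page.

// Computes the cutoff timestamp: jobs completed before it are deleted.
function cleanupCutoff(DateTimeImmutable $now, int $retentionDays = 30): string
{
    return $now->sub(new DateInterval("P{$retentionDays}D"))->format('Y-m-d H:i:s');
}

// Deletes finished jobs older than the retention period and returns
// the number of rows removed.
function deleteOldJobs(PDO $pdo, DateTimeImmutable $now, int $retentionDays = 30): int
{
    $stmt = $pdo->prepare(
        "DELETE FROM background_jobs
          WHERE status IN ('completed', 'failed')
            AND completed_at < :cutoff"
    );
    $stmt->execute([':cutoff' => cleanupCutoff($now, $retentionDays)]);

    return $stmt->rowCount();
}
```

Computing the cutoff in PHP and binding it as a parameter keeps the SQL portable and makes the retention period easy to unit-test without a database.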

Prometheus Metrics

# HELP background_jobs_pending Number of pending jobs
# TYPE background_jobs_pending gauge
background_jobs_pending{schema="suc0001"} 15
background_jobs_pending{schema="suc0002"} 3

# HELP background_jobs_processing Number of processing jobs
# TYPE background_jobs_processing gauge
background_jobs_processing{schema="suc0001"} 2

# HELP background_jobs_completed_total Total completed jobs
# TYPE background_jobs_completed_total counter
background_jobs_completed_total{schema="suc0001"} 1543

# HELP background_jobs_failed_total Total failed jobs
# TYPE background_jobs_failed_total counter
background_jobs_failed_total{schema="suc0001"} 12

# HELP background_jobs_execution_seconds Job execution time in seconds
# TYPE background_jobs_execution_seconds histogram
# Note: a complete histogram also exposes an le="+Inf" bucket plus _sum and _count series
background_jobs_execution_seconds_bucket{schema="suc0001",le="1"} 450
background_jobs_execution_seconds_bucket{schema="suc0001",le="5"} 890
background_jobs_execution_seconds_bucket{schema="suc0001",le="10"} 1200
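
The gauge lines above follow the Prometheus text exposition format: a HELP line, a TYPE line, then one sample per label set. A small sketch of how one gauge could be rendered is shown below. The function name `renderGauge` is hypothetical, and the sketch assumes the per-schema counts have already been queried.

```php
<?php
// Hypothetical helper (name is an assumption): renders one gauge metric,
// labeled by schema, in the Prometheus text exposition format.
function renderGauge(string $name, string $help, array $valuesBySchema): string
{
    $lines = [
        "# HELP {$name} {$help}",
        "# TYPE {$name} gauge",
    ];

    foreach ($valuesBySchema as $schema => $value) {
        // Label values must escape backslashes, double quotes and newlines.
        $label = str_replace(['\\', '"', "\n"], ['\\\\', '\\"', '\\n'], (string) $schema);
        $lines[] = sprintf('%s{schema="%s"} %d', $name, $label, $value);
    }

    return implode("\n", $lines) . "\n";
}
```

Calling `renderGauge('background_jobs_pending', 'Number of pending jobs', ['suc0001' => 15, 'suc0002' => 3])` would produce the first four lines of the sample output above.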

Health Check

json

GET /jobs/health

{
  "status": "healthy",
  "timestamp": "2025-01-15T10:30:00Z",
  "checks": {
    "worker": {
      "status": "up",
      "last_heartbeat": "2025-01-15T10:29:55Z",
      "latency_ms": 850
    },
    "database": {
      "status": "up",
      "connection_pool": "5/10"
    },
    "queue": {
      "status": "healthy",
      "pending_count": 15,
      "threshold": 1000
    }
  }
}
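
One way the three checks might be combined into the overall status is sketched below. Several details are assumptions rather than documented behavior: the function name `evaluateHealth`, the "down", "overloaded" and "unhealthy" labels, the 503 status code for the unhealthy case, and the reuse of the 600-second heartbeat limit from the worker_inactive alert.

```php
<?php
// Sketch only: labels other than "up"/"healthy" and the 503 code are assumptions.
function evaluateHealth(
    int $now,
    int $lastHeartbeat,
    bool $databaseUp,
    int $pendingCount,
    int $maxHeartbeatAge = 600,
    int $queueThreshold = 1000
): array {
    $workerUp = ($now - $lastHeartbeat) <= $maxHeartbeatAge;
    $queueOk = $pendingCount <= $queueThreshold;
    // The endpoint is healthy only when every individual check passes.
    $healthy = $workerUp && $databaseUp && $queueOk;

    return [
        'http_status' => $healthy ? 200 : 503,
        'body' => [
            'status' => $healthy ? 'healthy' : 'unhealthy',
            'timestamp' => gmdate('Y-m-d\TH:i:s\Z', $now),
            'checks' => [
                'worker' => [
                    'status' => $workerUp ? 'up' : 'down',
                    'last_heartbeat' => gmdate('Y-m-d\TH:i:s\Z', $lastHeartbeat),
                ],
                'database' => ['status' => $databaseUp ? 'up' : 'down'],
                'queue' => [
                    'status' => $queueOk ? 'healthy' : 'overloaded',
                    'pending_count' => $pendingCount,
                    'threshold' => $queueThreshold,
                ],
            ],
        ],
    ];
}
```

The `body` element can be passed to `json_encode` to produce a response shaped like the example above, and `http_status` lets load balancers and the deployment checklist rely on the status code alone.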

Systemd Service

ini

# /etc/systemd/system/background-jobs-worker.service
[Unit]
Description=Background Jobs Worker - Sistema Bautista
After=postgresql.service
Requires=postgresql.service
# Start rate limiting is a [Unit] setting: at most 5 starts within 600s
StartLimitIntervalSec=600
StartLimitBurst=5

[Service]
Type=simple
User=www-data
WorkingDirectory=/var/www/Bautista/server
ExecStart=/usr/bin/php /var/www/Bautista/server/bin/worker.php
Restart=on-failure
RestartSec=5
StandardOutput=journal
StandardError=journal

[Install]
WantedBy=multi-user.target

bash

# Installation
sudo cp server/systemd/background-jobs-worker.service /etc/systemd/system/
sudo systemctl daemon-reload
sudo systemctl enable background-jobs-worker
sudo systemctl start background-jobs-worker

# Monitoring
sudo systemctl status background-jobs-worker
journalctl -u background-jobs-worker.service -f
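
When the service is stopped, systemd sends SIGTERM to the worker by default, so the graceful-shutdown criterion depends on worker.php handling that signal. The sketch below shows one possible shape for that loop. It is not the actual worker: `ShutdownFlag`, `installSignalHandlers` and `runWorker` are hypothetical names, and the sketch assumes the pcntl extension is available on the server.

```php
<?php
// Hypothetical sketch of a worker loop that finishes the current job
// before exiting when SIGTERM arrives. All names are assumptions.

final class ShutdownFlag
{
    private bool $requested = false;

    public function request(): void
    {
        $this->requested = true;
    }

    public function isRequested(): bool
    {
        return $this->requested;
    }
}

// Registers SIGTERM/SIGINT handlers; returns false if pcntl is missing.
function installSignalHandlers(ShutdownFlag $flag): bool
{
    if (!function_exists('pcntl_signal') || !function_exists('pcntl_async_signals')) {
        return false;
    }

    pcntl_async_signals(true); // deliver signals without manual dispatch calls
    $handler = function (int $signal) use ($flag): void {
        $flag->request();
    };
    pcntl_signal(SIGTERM, $handler);
    pcntl_signal(SIGINT, $handler);

    return true;
}

// Processes jobs until shutdown is requested. $processNextJob returns
// true when it handled a job and false when the queue was empty.
function runWorker(ShutdownFlag $flag, callable $processNextJob, int $idleSleepMicros = 0): int
{
    $processed = 0;
    while (!$flag->isRequested()) {
        if ($processNextJob()) {
            $processed++;
        } elseif ($idleSleepMicros > 0) {
            usleep($idleSleepMicros);
        }
    }

    return $processed;
}
```

Because the flag is only checked between jobs, a job that is already running completes before the process exits, which avoids leaving rows stuck in the processing state after a restart.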

Alert Triggers

php

// AlertService.php
$alerts = [
    'high_failure_rate' => [
        'condition' => 'failed jobs > 5 in 5 minutes',
        'threshold' => 5,
        'window' => 300, // 5 min
        'cooldown' => 1800, // 30 min
    ],
    'queue_overflow' => [
        'condition' => 'pending jobs > 500',
        'threshold' => 500,
    ],
    'worker_inactive' => [
        'condition' => 'no heartbeat in 10 minutes',
        'threshold' => 600,
    ],
];

// Send the alert via webhook to Slack/email
$this->sendAlert([
    'level' => 'critical',
    'title' => 'High failure rate detected',
    'message' => '8 jobs failed in the last 5 minutes',
    'schema' => 'suc0001',
    'timestamp' => (new DateTimeImmutable())->format(DATE_ATOM),
]);
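
The threshold, window and cooldown values of high_failure_rate interact, so a short sketch of the firing decision may help. It is an illustration under assumptions: the function name `shouldFireAlert` is hypothetical, timestamps are Unix seconds, and the condition is read literally as "more than 5 failures within the window".

```php
<?php
// Hypothetical helper (name is an assumption): decides whether the
// high_failure_rate alert should fire right now.
function shouldFireAlert(
    array $failureTimes, // Unix timestamps of failed jobs
    int $now,
    int $threshold,      // fire when failures in the window exceed this
    int $window,         // seconds to look back
    ?int $lastAlertAt,   // when this alert last fired, or null if never
    int $cooldown        // minimum seconds between two alerts
): bool {
    // Suppress repeats while the previous alert is still cooling down.
    if ($lastAlertAt !== null && ($now - $lastAlertAt) < $cooldown) {
        return false;
    }

    $recent = array_filter(
        $failureTimes,
        fn (int $t): bool => $t > $now - $window && $t <= $now
    );

    return count($recent) > $threshold;
}
```

With the configuration above (threshold 5, window 300, cooldown 1800), eight failures in the last five minutes fire the alert once, and further failures stay silent for the next 30 minutes.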

Deployment Checklist

Pre-Deployment

[ ] All migrations executed
[ ] Tests passing (unit + integration + e2e)
[ ] Coverage >= 85%
[ ] Feature flag ENABLE_BACKGROUND_JOBS configured
[ ] Environment variables set in .env
[ ] Systemd service installed

Deployment

[ ] Run migrations in production
[ ] Start the systemd service
[ ] Verify the health check returns 200 OK
[ ] Verify the metrics endpoint works
[ ] Configure Prometheus scraping (if applicable)
[ ] Configure alerting webhooks

Post-Deployment

[ ] Monitor journalctl logs during the first hours
[ ] Verify jobs complete successfully
[ ] Verify SSE works in production
[ ] Verify retry logic works
[ ] Configure automatic backup of the background_jobs table
[ ] Document the runbook in the wiki
