Retry Pattern: Handling Transient Failures in Distributed Systems

Published at
11/13/2024
Categories
spanish
devops
sre
cloud
Author
diek

In distributed environments, transient failures are inevitable: network latency, timeouts, temporarily unavailable services. The Retry pattern provides a robust strategy for handling these temporary failures, letting applications recover automatically from errors that can resolve on their own.

Understanding the Retry Pattern

The Retry pattern automatically retries an operation when it fails, on the assumption that the cause is temporary and will clear without manual intervention. The key is to distinguish transient failures from permanent ones and to apply a retry strategy suited to each case.
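One way to make the transient/permanent distinction explicit is a small classifier that the retry logic consults before trying again. This is a minimal sketch: the names `TRANSIENT_HTTP_STATUS`, `is_transient_status`, and `is_transient_error` are hypothetical, and the choice of which status codes and exception types count as transient is an assumption that depends on the service being called.

```python
# Assumption: these HTTP status codes usually signal a temporary condition
# (request timeout, rate limiting, gateway or availability problems).
TRANSIENT_HTTP_STATUS = {408, 429, 502, 503, 504}

# Assumption: built-in network-level errors are worth retrying.
TRANSIENT_EXCEPTIONS = (ConnectionError, TimeoutError)


def is_transient_status(status_code: int) -> bool:
    """Return True if the HTTP status suggests the request may succeed later."""
    return status_code in TRANSIENT_HTTP_STATUS


def is_transient_error(exc: BaseException) -> bool:
    """Return True if the exception type is one we treat as temporary."""
    return isinstance(exc, TRANSIENT_EXCEPTIONS)
```

A 404 or a validation error would fall through as permanent, so the caller can fail fast instead of spending its retry budget on a request that cannot succeed.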

Common Strategies

  1. Immediate Retry: Retries the operation right away, with no wait.
  2. Retry with Backoff: Increases the wait time between retries.
  3. Exponential Retry: Doubles the wait time after each attempt.
  4. Retry with Jitter: Adds randomness to the wait to avoid a thundering herd.
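The four strategies above differ only in how the wait before each retry is computed. The sketch below generates the delay schedule for each one. The function names and default values are illustrative assumptions, "backoff" is interpreted here as a linear increase, and the jitter variant shown is "full jitter" (a uniform draw between zero and the exponential delay), which is one of several possible jitter schemes.

```python
import random
from typing import List, Optional


def immediate(attempts: int) -> List[float]:
    """Immediate retry: no wait between attempts."""
    return [0.0] * attempts


def linear_backoff(attempts: int, step: float = 1.0) -> List[float]:
    """Backoff interpreted as a linear increase: step, 2*step, 3*step, ..."""
    return [step * (i + 1) for i in range(attempts)]


def exponential_backoff(
    attempts: int, base: float = 1.0, cap: float = 60.0
) -> List[float]:
    """Exponential: double the wait each time, never exceeding the cap."""
    return [min(base * 2 ** i, cap) for i in range(attempts)]


def full_jitter(
    attempts: int,
    base: float = 1.0,
    cap: float = 60.0,
    rng: Optional[random.Random] = None,
) -> List[float]:
    """Full jitter: a uniform random wait between 0 and the exponential delay."""
    rng = rng or random.Random()
    return [rng.uniform(0.0, d) for d in exponential_backoff(attempts, base, cap)]


print(exponential_backoff(4))  # [1.0, 2.0, 4.0, 8.0]
```

Spreading the waits randomly matters most when many clients fail at the same moment: without jitter they would all retry on the same schedule and hit the recovering service together.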

Practical Implementation

Let's look at several implementations of the Retry pattern in Python:

1. Simple Retry with a Decorator

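import time
from functools import wraps
from typing import Callable, Tuple, Type

import requests

def retry(
    exceptions: Tuple[Type[Exception], ...] = (Exception,),
    max_attempts: int = 3,
    delay: float = 1
):
    """Retry the decorated function up to max_attempts times with a fixed delay."""
    def decorator(func: Callable):
        @wraps(func)
        def wrapper(*args, **kwargs):
            attempts = 0
            while attempts < max_attempts:
                try:
                    return func(*args, **kwargs)
                except exceptions:
                    attempts += 1
                    if attempts == max_attempts:
                        raise  # out of attempts: re-raise the original error
                    time.sleep(delay)
            return None  # only reached when max_attempts < 1
        return wrapper
    return decorator

# requests raises its own exception classes, which do not inherit from the
# built-in ConnectionError and TimeoutError
@retry(
    exceptions=(requests.exceptions.ConnectionError, requests.exceptions.Timeout),
    max_attempts=3
)
def fetch_data(url: str):
    # Simulated API call; the timeout keeps a hung request from blocking forever
    return requests.get(url, timeout=5)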

2. Retry with Exponential Backoff

import asyncio
import random
from typing import Callable, Optional

class ExponentialBackoff:
    def __init__(
        self,
        initial_delay: float = 1.0,
        max_delay: float = 60.0,
        max_attempts: int = 5,
        jitter: bool = True
    ):
        self.initial_delay = initial_delay
        self.max_delay = max_delay
        self.max_attempts = max_attempts
        self.jitter = jitter
        self.attempt = 0

    def next_delay(self) -> Optional[float]:
        """Return the next delay in seconds, or None once attempts are exhausted."""
        if self.attempt >= self.max_attempts:
            return None

        delay = min(
            self.initial_delay * (2 ** self.attempt),
            self.max_delay
        )

        if self.jitter:
            # Scale by a random factor in [0.5, 1.5), then re-apply the cap
            delay = min(delay * (0.5 + random.random()), self.max_delay)

        self.attempt += 1
        return delay

async def retry_operation(operation: Callable, backoff: ExponentialBackoff):
    last_exception = None

    # Each call to next_delay() consumes one attempt from the budget
    while (delay := backoff.next_delay()) is not None:
        try:
            return await operation()
        except Exception as e:
            last_exception = e
            if backoff.attempt < backoff.max_attempts:
                await asyncio.sleep(delay)  # no wait after the final attempt

    if last_exception is None:
        raise ValueError("max_attempts must be at least 1")
    raise last_exception

3. Retry with a Circuit Breaker

import asyncio
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Callable

@dataclass
class CircuitBreakerConfig:
    failure_threshold: int = 5
    reset_timeout: timedelta = timedelta(minutes=1)
    retry_timeout: timedelta = timedelta(seconds=10)

class CircuitBreaker:
    def __init__(self, config: CircuitBreakerConfig):
        self.config = config
        self.failures = 0
        self.last_failure = None
        self.state = "CLOSED"

    def can_retry(self) -> bool:
        if self.state == "CLOSED":
            return True

        if self.state == "OPEN":
            # After reset_timeout, allow a trial call to probe the service
            if datetime.now() - self.last_failure > self.config.reset_timeout:
                self.state = "HALF_OPEN"
                return True
            return False

        return True  # HALF_OPEN

    def record_failure(self):
        self.failures += 1
        self.last_failure = datetime.now()

        # Also re-opens the circuit when a HALF_OPEN trial call fails,
        # because the failure count is still at or above the threshold
        if self.failures >= self.config.failure_threshold:
            self.state = "OPEN"

    def record_success(self):
        if self.state == "HALF_OPEN":
            self.state = "CLOSED"
        self.failures = 0
        self.last_failure = None

# ExponentialBackoff is the class defined in the previous example
async def retry_with_circuit_breaker(
    operation: Callable,
    circuit_breaker: CircuitBreaker,
    backoff: ExponentialBackoff
):
    while True:
        if not circuit_breaker.can_retry():
            raise Exception("Circuit breaker is open")

        try:
            result = await operation()
            circuit_breaker.record_success()
            return result
        except Exception:
            circuit_breaker.record_failure()
            if (delay := backoff.next_delay()) is None:
                raise  # backoff budget exhausted: propagate the last error
            await asyncio.sleep(delay)

Cloud Applications

The Retry pattern is especially useful in cloud scenarios:

1. Communication Between Microservices

import httpx
from fastapi import FastAPI, HTTPException
from tenacity import (
    retry,
    retry_if_exception_type,
    stop_after_attempt,
    wait_exponential,
)

app = FastAPI()

@retry(
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, min=4, max=10),
    # httpx reports connection failures and timeouts as httpx.TransportError
    # subclasses, not as the built-in ConnectionError
    retry=retry_if_exception_type(httpx.TransportError)
)
async def call_dependent_service(data: dict):
    async with httpx.AsyncClient() as client:
        response = await client.post(
            "http://dependent-service/api/v1/process",
            json=data,
            timeout=5.0
        )
        return response.json()

@app.post("/process")
async def process_request(data: dict):
    try:
        return await call_dependent_service(data)
    except Exception:
        # Covers tenacity's RetryError once all attempts are exhausted
        raise HTTPException(
            status_code=503,
            detail="Service temporarily unavailable"
        )

2. Database Operations

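import time
from contextlib import contextmanager

from sqlalchemy import create_engine
from sqlalchemy.exc import OperationalError

class DatabaseRetry:
    def __init__(self, url: str, max_attempts: int = 3):
        self.engine = create_engine(url)
        self.max_attempts = max_attempts

    def _connect(self):
        """Open a connection, retrying OperationalError with exponential backoff."""
        attempt = 0
        while True:
            try:
                return self.engine.connect()
            except OperationalError:
                attempt += 1
                if attempt >= self.max_attempts:
                    raise
                time.sleep(2 ** attempt)

    @contextmanager
    def session(self):
        # Only connection setup is retried. A generator-based context manager
        # may yield once, so errors raised inside the with-body propagate.
        connection = self._connect()
        try:
            yield connection
        finally:
            connection.close()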

Benefits of the Retry Pattern

  1. Resilience: Handles transient failures automatically.
  2. Availability: Improves the overall availability of the system.
  3. Transparency: Retries are transparent to the user.
  4. Flexibility: Supports different strategies depending on the use case.

Design Considerations

When implementing the Retry pattern, consider:

  1. Idempotency: Operations must be safe to retry.
  2. Timeouts: Set clear limits on retries.
  3. Logging: Log retries for monitoring.
  4. Backoff: Use strategies that avoid overloading the system.
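Several of these considerations can be combined in one helper. The sketch below is an illustration, not a definitive implementation: `retry_with_deadline` and all of its parameters are hypothetical names, the operation passed in is assumed to be idempotent, and the overall time budget is enforced only between attempts (it does not interrupt a call that is already running).

```python
import logging
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")
logger = logging.getLogger("retry")


def retry_with_deadline(
    operation: Callable[[], T],
    retry_on: Tuple[Type[Exception], ...] = (ConnectionError, TimeoutError),
    max_attempts: int = 5,
    base_delay: float = 0.5,
    deadline_seconds: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,  # injectable for testing
) -> T:
    """Retry an idempotent operation with exponential backoff and a time budget."""
    start = time.monotonic()
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except retry_on as exc:
            delay = base_delay * 2 ** (attempt - 1)
            elapsed = time.monotonic() - start
            # Stop when attempts run out or the next wait would exceed the budget
            if attempt == max_attempts or elapsed + delay > deadline_seconds:
                logger.error("giving up after %d attempt(s): %r", attempt, exc)
                raise
            logger.warning(
                "attempt %d failed (%r); retrying in %.2fs", attempt, exc, delay
            )
            sleep(delay)
    raise ValueError("max_attempts must be at least 1")
```

Logging each retry at warning level and the final failure at error level keeps recurring transient faults visible in monitoring, so retries do not quietly hide a dependency that is degrading.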

Conclusion

The Retry pattern is essential in modern distributed architectures. A careful implementation that accounts for idempotency and backoff strategies can significantly improve your system's resilience. It should be used judiciously, however, so that it does not mask systemic problems that need attention.
