Kubernetes & scaling¶

How to run an SMSC built on siphon-smpp with high availability on Kubernetes — and, just as important, what "scaling" can and can't mean for a stateful protocol like SMPP. Read the failover model before you touch replicas.

The manifests live in deploy/k8s/ and are a template for your SIPhon binary (the one that registers the smpp addon) — see Deployment for why siphon-smpp ships templates rather than a runnable image.

kubectl apply -f configmap.yaml     # addon config + handler script + secrets
kubectl apply -f deployment.yaml    # the SMSC pods
kubectl apply -f service.yaml       # L4 load balancer on :2775
kubectl apply -f pdb.yaml           # keep a survivor during drains
kubectl apply -f hpa.yaml           # optional autoscaler (read the caveats!)

The failover model (read this first)¶

SMPP is stateful per TCP session. That single fact shapes everything about HA and scaling:

Inbound (ESMEs → you)¶

Each ESME binds to exactly one replica over a long-lived TCP connection. There is no session migration. If that replica dies:

the connection resets;
the ESME must rebind;
the load balancer steers the new connection to a surviving replica.

Well-behaved ESMEs reconnect with backoff, so the practical SLA is "rebind within a few seconds". The manifests optimise for exactly this:

spread replicas across nodes (topologySpreadConstraints) so one node loss can't take the whole SMSC down;
a PodDisruptionBudget so voluntary drains never remove the last replica;
maxUnavailable: 0 rolling updates so you never dip below desired capacity mid-roll.
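The "reconnect with backoff" behaviour expected of ESMEs can be sketched as follows. This is a minimal illustration of capped exponential backoff with jitter, not code from siphon-smpp or from any particular ESME; `connect` is a hypothetical callable that either returns a session or raises `OSError`, and the base, cap and attempt count are assumed values.

```python
import random


def backoff_delays(base=0.5, cap=30.0, attempts=6, rng=None):
    """Yield one capped, jittered delay (in seconds) per rebind attempt."""
    rng = rng or random.Random()
    for attempt in range(attempts):
        ceiling = min(cap, base * (2 ** attempt))
        # Full jitter: spread clients out so they do not all rebind at once.
        yield rng.uniform(0, ceiling)


def rebind_with_backoff(connect, sleep, **kwargs):
    """Call connect() until it succeeds, sleeping a jittered delay after each failure.

    connect: hypothetical callable returning a session or raising OSError.
    sleep:   injected so the loop can be exercised without waiting
             (pass time.sleep in real use).
    """
    last_error = None
    for delay in backoff_delays(**kwargs):
        try:
            return connect()
        except OSError as exc:
            last_error = exc
            sleep(delay)
    raise ConnectionError("rebind failed after all attempts") from last_error
```

The jitter matters after a replica loss: every ESME bound to that replica sees the reset at the same moment, and without jitter they would all hit the surviving replicas in lockstep.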

Outbound (you → upstream)¶

Each replica opens its own outbound binds (the supervisor in siphon-smpp reconnects with backoff). With N replicas you present N binds to each upstream. Before you scale past one replica, confirm two things:

The two questions to answer before scaling out

Does the upstream allow multiple concurrent binds for your system_id? Many aggregators do; some permit exactly one. If yours is single-bind, N replicas will fight over the one allowed session.
Is your DLR correlation store shared across replicas? A delivery receipt can come back on any replica's outbound bind — not necessarily the one that sent the message. That's why the gateway example keys correlation in siphon.cache (a shared store), not a per-process dict. In Kubernetes that store must be shared across pods (e.g. a shared cache/DB backing siphon.cache), or receipts will be dropped as "unknown id" on the wrong replica.
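The second question can be made concrete with a small simulation. The sketch below does not use the real siphon.cache API; `InMemorySharedStore` and `Replica` are hypothetical stand-ins that only show why the correlation map has to be reachable from every replica.

```python
class InMemorySharedStore:
    """Stand-in for a store every pod can reach (hypothetical, not siphon.cache)."""

    def __init__(self):
        self._data = {}

    def set(self, key, value):
        self._data[key] = value

    def pop(self, key, default=None):
        return self._data.pop(key, default)


class Replica:
    """One SMSC pod, reduced to its DLR correlation behaviour."""

    def __init__(self, name, store):
        self.name = name
        self.store = store

    def on_submit_accepted(self, upstream_message_id, esme_session_id):
        # Remember which ESME session is waiting for this upstream message id.
        self.store.set(f"dlr:{upstream_message_id}", esme_session_id)

    def on_delivery_receipt(self, upstream_message_id):
        # Returns the waiting ESME session id, or None for an "unknown id".
        return self.store.pop(f"dlr:{upstream_message_id}")
```

If replicas `a` and `b` are constructed with the same store, a receipt arriving on `b` for a message sent by `a` resolves. If each replica gets its own store, the same receipt returns `None`, which is the dropped-receipt failure described above.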

Two topologies¶

Topology	How	When to use it
Active/active (these manifests)	≥2 replicas behind an L4 LB; ESMEs rebind on failover	Default. Upstream allows multiple binds; DLR correlation is shared.
Active/standby	`replicas: 1` + fast reschedule (PDB + spread), or a leader-elected single binder	Upstream permits only one bind per `system_id`, or you need strict single-egress ordering.

If you can't share DLR correlation or the upstream is single-bind, prefer active/standby and accept the failover gap rather than double-binding.
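The decision rule in the table can be written down as a tiny function. This is only a restatement of the guidance above; the function and parameter names are made up for illustration.

```python
def choose_topology(
    upstream_allows_multiple_binds: bool,
    dlr_store_is_shared: bool,
    needs_single_egress_ordering: bool = False,
) -> str:
    """Return the topology the guidance above recommends for these answers."""
    if needs_single_egress_ordering:
        return "active/standby"
    if upstream_allows_multiple_binds and dlr_store_is_shared:
        return "active/active"
    # Either prerequisite missing: accept the failover gap instead of double-binding.
    return "active/standby"
```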

The Deployment¶

Key fields (full file in deployment.yaml):

spec:
  replicas: 2
  strategy:
    rollingUpdate:
      maxUnavailable: 0        # never drop below desired during a roll
      maxSurge: 1
  template:
    spec:
      topologySpreadConstraints:      # one node loss ≠ whole SMSC down
        - maxSkew: 1
          topologyKey: kubernetes.io/hostname
          whenUnsatisfiable: DoNotSchedule
          labelSelector: { matchLabels: { app: smsc } }
      terminationGracePeriodSeconds: 45
      containers:
        - name: smsc
          image: your-registry/your-smsc:latest
          ports: [{ name: smpp, containerPort: 2775 }]
          readinessProbe:                # gate LB traffic on "can accept binds"
            tcpSocket: { port: smpp }
          livenessProbe:                 # restart a wedged replica
            tcpSocket: { port: smpp }
          lifecycle:
            preStop:
              exec: { command: ["sleep", "10"] }   # let the LB stop sending binds

Readiness gates the Service on the SMPP port accepting connections, so the LB only routes to replicas that can actually bind an ESME.
Liveness restarts a wedged replica.
preStop sleep + terminationGracePeriodSeconds give the pod time to stop receiving new binds and drain in-flight responses before SIGKILL. Tune the grace period to your drain time.
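For intuition about what the tcpSocket probes verify, the check is roughly equivalent to the following sketch: the probe passes if a TCP connection to the port can be opened. The function name and timeout are illustrative choices.

```python
import socket


def tcp_probe(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to (host, port) can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, unreachable, or timed out: the probe counts as failed.
        return False
```

A check like this only proves the listener is up. It does not prove that a bind_transceiver would be authorised or that upstream binds are healthy, so treat it as a floor rather than a full health check.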

The Service (L4 load balancer)¶

SMPP is a long-lived TCP session, so the LB just needs to pick a healthy replica at bind time and keep the connection pinned to it — see service.yaml:

spec:
  type: LoadBalancer
  externalTrafficPolicy: Local     # preserve client IP for allow-listing
  ports:
    - { name: smpp, port: 2775, targetPort: smpp, protocol: TCP }

Don't let the LB rebalance mid-connection

Most cloud L4 LBs pin a TCP connection to one backend for its lifetime. If yours pools or rebalances mid-connection, disable that for this Service — an SMPP session can't survive being moved to another replica. Set a long idle timeout too: binds stay open for hours or days.

externalTrafficPolicy: Local preserves the client source IP so your @smpp.on_bind handler can allow-list by client_addr.
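An allow-list check of the kind an on_bind handler might call can be sketched with the standard-library ipaddress module. This assumes client_addr is a bare IP string; if your build reports "ip:port", strip the port first. The networks listed are documentation-only example ranges, and `is_allowed` is a hypothetical helper, not part of siphon-smpp.

```python
import ipaddress

# Example ranges only (reserved documentation prefixes); replace with your ESMEs' networks.
ALLOWED_NETWORKS = [
    ipaddress.ip_network("203.0.113.0/24"),
    ipaddress.ip_network("2001:db8::/32"),
]


def is_allowed(client_addr: str, networks=ALLOWED_NETWORKS) -> bool:
    """True if client_addr falls inside any allowed network.

    Malformed input (including an unexpected "ip:port" form) is rejected.
    """
    try:
        ip = ipaddress.ip_address(client_addr)
    except ValueError:
        return False
    return any(ip in net for net in networks)
```

Rejecting malformed addresses rather than raising keeps the handler's failure mode simple: anything that cannot be positively matched is refused.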

PodDisruptionBudget¶

Keep at least one replica serving during voluntary disruptions (node drain, cluster upgrade). With replicas: 2, minAvailable: 1 means drains take one replica at a time, so ESMEs always have a survivor to rebind to — pdb.yaml:

spec:
  minAvailable: 1
  selector: { matchLabels: { app: smsc } }
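The arithmetic behind minAvailable is simple enough to spell out. The sketch below is a simplification of how a PodDisruptionBudget limits voluntary evictions (it ignores unhealthy-pod policies and percentage values); the function name is illustrative.

```python
def allowed_disruptions(healthy_pods: int, min_available: int) -> int:
    """Voluntary evictions permitted right now under a minAvailable budget."""
    return max(0, healthy_pods - min_available)
```

With two healthy replicas and minAvailable: 1 the result is 1, so a drain proceeds one pod at a time; with one healthy replica it is 0 and the drain waits. Note that with six healthy replicas the same budget would allow five simultaneous evictions, so consider raising minAvailable if you scale out.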

Autoscaling (HPA) — with caveats¶

SMPP throughput is usually CPU-bound on the script side (routing, DLR correlation, persistence), so CPU is a reasonable HPA signal. But autoscaling changes the replica set, and every new replica opens its own outbound binds — so the two questions above apply on every scale event, automatically. Only enable the HPA once you're sure the upstream tolerates a variable number of binds:

spec:
  minReplicas: 2
  maxReplicas: 6
  metrics:
    - type: Resource
      resource: { name: cpu, target: { type: Utilization, averageUtilization: 70 } }
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300   # don't thrash binds

The scaleDown stabilization window keeps the autoscaler from repeatedly tearing down and re-establishing upstream binds.
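To see what a scale event means for upstream binds, the sketch below applies a simplified form of the documented HPA rule, desired = ceil(current × currentMetric / targetMetric), clamped to the min/max above. It ignores stabilization windows, pod readiness and missing metrics; the 0.1 tolerance is the Kubernetes default, and `binds_per_replica` is an assumed parameter.

```python
import math


def desired_replicas(current, cpu_utilization, target=70,
                     min_replicas=2, max_replicas=6, tolerance=0.1):
    """Simplified HPA calculation for a single CPU utilization metric."""
    ratio = cpu_utilization / target
    if abs(ratio - 1.0) <= tolerance:
        return current  # close enough to target: no scaling
    return max(min_replicas, min(max_replicas, math.ceil(current * ratio)))


def upstream_binds(replicas, binds_per_replica=1):
    """Concurrent binds each upstream sees when every replica opens its own."""
    return replicas * binds_per_replica
```

For example, two replicas at 140% average utilization scale to four, which means four concurrent binds per upstream; a burst that pushes the HPA to maxReplicas means six. The upstream's bind limit therefore needs to cover maxReplicas, not the usual replica count.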

Scale up for redundancy first, throughput second

A single node already does tens of thousands of submit_sm/s through one bind (Performance). On a standard (GIL) CPython build, aggregate throughput is capped by the per-message Python handler running on one core, so adding replicas buys you redundancy and node-failure tolerance more than raw throughput. The real throughput unlock is free-threaded CPython (see Performance), not more pods.
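To check which kind of interpreter a replica is actually running, something like the following sketch can be logged at startup. It relies on `sysconfig.get_config_var("Py_GIL_DISABLED")` and on `sys._is_gil_enabled()`, which exists on CPython 3.13 and later; on older interpreters the sketch assumes the GIL is on. The function name is illustrative.

```python
import sys
import sysconfig


def gil_status() -> dict:
    """Report whether this is a free-threaded build and whether the GIL is active."""
    free_threaded_build = bool(sysconfig.get_config_var("Py_GIL_DISABLED"))
    # sys._is_gil_enabled() is only present on CPython 3.13+.
    probe = getattr(sys, "_is_gil_enabled", None)
    gil_enabled = probe() if probe is not None else True
    return {"free_threaded_build": free_threaded_build, "gil_enabled": gil_enabled}
```

Both values are worth reporting because a free-threaded build can still run with the GIL re-enabled at runtime.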

Configuration: ConfigMap + Secret¶

Mount the addon config and the handler script from a ConfigMap; keep upstream credentials in a Secret referenced via envFrom (configmap.yaml):

apiVersion: v1
kind: ConfigMap
metadata: { name: smsc-config }
data:
  smpp.yaml: |
    server: { bind_address: "0.0.0.0", port: 2775 }
    binds:
      - name: alpha
        host: smsc-a.example.net
        port: 2775
        system_id: ${SMPP_ALPHA_SYSTEM_ID}   # from the Secret via envFrom
        password: ${SMPP_ALPHA_PASSWORD}
        bind_type: transceiver
        max_msg_per_sec: 50
    routing: { default_chain: ["bind:alpha"] }
  smpp_script.py: |
    from siphon import smpp
    @smpp.on_bind
    async def authorise(bind):
        return bind.accept()          # replace with your credential policy
    @smpp.on_pdu("submit_sm")
    async def on_submit(pdu, session):
        return pdu.reply(message_id="replace-me")
---
apiVersion: v1
kind: Secret
metadata: { name: smsc-secrets }
type: Opaque
stringData:
  SMPP_ALPHA_SYSTEM_ID: alpha_esme
  SMPP_ALPHA_PASSWORD: changeme        # use a real secrets manager in prod

The ${VAR} references in smpp.yaml are filled at load time from environment variables, which envFrom populates from the Secret; alternatively, declare whole binds via SMPP_BIND_<NAME>_*.
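The substitution step can be pictured with the standard-library string.Template, which understands the same ${VAR} syntax. This is only an illustration of the idea, not how SIPhon's loader is implemented; `expand_env` is a hypothetical helper.

```python
import string


def expand_env(text: str, env: dict) -> str:
    """Replace ${VAR} placeholders with values from env.

    Raises KeyError if a referenced variable is missing, so a misconfigured
    Secret fails loudly instead of producing an empty credential.
    """
    return string.Template(text).substitute(env)
```

In a pod you would pass `os.environ` as `env`. Failing on a missing variable is a deliberate choice in this sketch: a bind attempted with a blank password is harder to diagnose than a startup error.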

Hot reload in-cluster¶

Because smpp_script.py is mounted from the ConfigMap, you can edit it, run kubectl apply, and let SIPhon hot-reload the handlers: no image rebuild, no rebind. The kubelet propagates ConfigMap changes to the mounted file within about a minute (a file mounted via subPath does not receive updates); SIPhon picks up the new script on the next PDU. Keep handlers free of import-time side effects so a reload mid-traffic is safe, and keep cross-message state in the shared store.
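Why cross-message state belongs in the shared store can be shown with a toy reload. The sketch below is not SIPhon's reload mechanism: `load_handlers` simply executes the script source in a fresh namespace, and a plain dict stands in for the shared store.

```python
def load_handlers(source: str, store: dict) -> dict:
    """Execute handler source in a fresh namespace, as a reload would."""
    namespace = {"store": store}
    exec(compile(source, "smpp_script.py", "exec"), namespace)
    return namespace


SCRIPT = '''
local_counter = {"seen": 0}   # module-level state: recreated on every reload

def on_submit(message_id):
    local_counter["seen"] += 1
    store["seen"] = store.get("seen", 0) + 1   # shared state: survives reloads
    return local_counter["seen"], store["seen"]
'''
```

Loading the script, handling two messages, then loading it again and handling a third yields a module-level count of 1 but a shared count of 3: anything kept in module globals silently restarts on reload.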

Graceful shutdown¶

On rollout or scale-down, Kubernetes runs the preStop hook and then sends SIGTERM. Your binary should stop accepting new binds, unbind its outbound binds, then exit. The Deployment gives it room: a preStop sleep so the LB stops sending new binds first, and terminationGracePeriodSeconds: 45 before SIGKILL. The grace period starts counting before the preStop hook runs, so with a 10-second sleep about 35 seconds remain after SIGTERM. Tune the grace period to your actual drain time.
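The shutdown sequence and its time budget can be sketched with asyncio. `stop_accepting` and `unbind_all` are hypothetical async callables that your binary would supply, and the 5-second safety margin is an assumption; the other two constants mirror the Deployment above.

```python
import asyncio

GRACE_PERIOD_S = 45    # terminationGracePeriodSeconds from the Deployment
PRESTOP_SLEEP_S = 10   # preStop sleep from the Deployment


def drain_budget(grace=GRACE_PERIOD_S, prestop=PRESTOP_SLEEP_S, margin=5):
    """Seconds available for draining after SIGTERM, leaving a safety margin.

    The preStop hook runs inside the grace period, so it is subtracted.
    """
    return max(0, grace - prestop - margin)


async def shutdown(stop_accepting, unbind_all, budget_s):
    """Stop taking new binds, then unbind upstream within budget_s seconds.

    Returns True if the drain finished in time, False if it was cut short.
    """
    await stop_accepting()
    try:
        await asyncio.wait_for(unbind_all(), timeout=budget_s)
        return True
    except asyncio.TimeoutError:
        return False
```

On Unix, one way to trigger this is to register a SIGTERM handler with `loop.add_signal_handler` that sets an `asyncio.Event` the main task waits on, then call `shutdown(..., drain_budget())`.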

Checklist before scaling out¶

[ ] Upstream allows N concurrent binds for your system_id.
[ ] DLR correlation / session maps are in a store shared across pods.
[ ] The LB pins each TCP connection to one backend for its lifetime.
[ ] ESMEs reconnect with backoff (they must, to survive failover).
[ ] Your binary unbinds cleanly on SIGTERM within the grace period.

If any box is unchecked, use active/standby, not active/active.