The difference between "it got slow" and "it was an incident" is when you find out.
If you wait for the user to complain, you're already late. The good news: in an Odoo → PgBouncer → PostgreSQL stack, saturation leaves clear traces minutes in advance.
This post gives you a simple system: 4 early signals + practical thresholds + what to do based on the pattern.
1) Early signal #1: queue appears in PgBouncer
What to look for
In the PgBouncer admin console:
SHOW POOLS;
The two indicators that signal the problem are:
cl_waiting > 0 sustained
maxwait rising (how long the oldest client has been waiting)
Interpretation: your app is trying to start transactions, but PgBouncer cannot assign them a "server" connection quickly enough. That is saturation in real time, even if there are no visible errors yet.
Useful threshold
Warning: cl_waiting > 0 for 60–120s
Critical: maxwait > 1s sustained (already affects UX); > 5s is an incident
Immediate action
If there is a queue: don't guess. Jump to section 5 and classify the cause.
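To see who is actually queuing, you can drill down in the same admin console. A minimal sketch (column sets and client states vary a bit by PgBouncer version): SHOW POOLS tells you which pools are queuing, SHOW CLIENTS lists the client connections (including the ones stuck waiting, per user/database), and SHOW SERVERS shows what the assigned server connections are busy with.
SHOW POOLS;
SHOW CLIENTS;
SHOW SERVERS;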
2) Early signal #2: the percentage of long transactions is increasing
What to look for (PostgreSQL)
Quick query to see long sessions and transactions:
SELECT pid, usename, state, xact_start, query_start,
       wait_event_type, wait_event, query
FROM pg_stat_activity
WHERE datname = current_database()
ORDER BY xact_start NULLS LAST;
Warning signs
old xact_start (transactions > 60–120s during peak hours)
many sessions with wait_event_type = Lock
Interpretation: even if the CPU is "ok", long transactions hijack concurrency (and with PgBouncer, they hijack server connections).
Useful threshold
Warning: 1–3 transactions > 2 min during load hours
Critical: transactions > 5–10 min (almost always blocking/monster cron)
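When the query above shows wait_event_type = Lock, the next question is who is blocking whom. A minimal sketch using pg_blocking_pids() (available since PostgreSQL 9.6):
-- blocked sessions, who blocks them, and how long the transaction has been open
SELECT pid,
       pg_blocking_pids(pid) AS blocked_by,
       now() - xact_start AS xact_age,
       wait_event_type,
       wait_event,
       left(query, 80) AS query
FROM pg_stat_activity
WHERE cardinality(pg_blocking_pids(pid)) > 0
ORDER BY xact_age DESC;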
3) Early signal #3: throughput drops but demand does not (the "slow death")
What to look for
Requests/second (or jobs/second) vs latency
PgBouncer SHOW STATS; (if you are collecting it)
Odoo: p95/p99 latency by endpoint (login, listing, write, confirmations, reports)
Classic pattern before the incident
p95 rises slowly
p99 spikes first
throughput does not rise (or falls) even with normal traffic
Interpretation: you are no longer "scaling" with load. You are in contention.
Useful threshold
Warning: p99 > 2–3x your baseline
Critical: timeout errors or massive retries
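If you are not collecting app-level latency yet, PgBouncer gives you a rough DB-side proxy. In recent PgBouncer versions, SHOW STATS reports per-database averages such as avg_query_time and avg_wait_time (in microseconds); avg_wait_time moving from ~0 to a steadily rising value is the same queue signal as maxwait, visible before users complain.
SHOW STATS;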
4) Early signal #4: crons start to overlap (and no one is watching)
This is brutal in Odoo.
What to look for
duration of heavy crons
actual execution time vs expected
whether they overlap (especially if you have max_cron_threads > 1)
Pattern before the incident
cron A takes longer → cron B starts anyway → both compete for locks and the DB
PgBouncer starts queuing
users notice slowness "in waves"
Useful threshold
Warning: cron that goes from X min to 2X min repeatedly
Critical: backlog (crons do not finish before their next run)
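A quick way to spot backlog from the database side is Odoo's ir_cron table: active jobs whose next scheduled run is already in the past are jobs that are not keeping up. A minimal sketch (nextcall and lastcall have been stable field names across Odoo versions, but check yours; Odoo stores datetimes as naive UTC, and the 10-minute margin is just an example):
-- active crons that are overdue by more than 10 minutes
SELECT id, nextcall, lastcall
FROM ir_cron
WHERE active
  AND nextcall < (now() AT TIME ZONE 'UTC') - interval '10 minutes'
ORDER BY nextcall;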
5) The key classification: 3 types of saturation (and what to do)
When you detect early saturation, classify it into one of these 3.
This avoids the typical mistake of "just increase pool_size and that's it".
Type A — Saturation by pool (config/concurrency)
Symptoms
cl_waiting rises
sv_idle ~ 0
Postgres is NOT at 100%
there are no major locks, just "a lot of movement"
Actions
carefully increase default_pool_size
add reserve_pool_size for spikes
check if max_client_conn or max_db_connections are limiting you
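Before touching anything, confirm what the current limits actually are, straight from the admin console: SHOW CONFIG lists the effective settings (default_pool_size, max_client_conn, max_db_connections, reserve_pool_size), SHOW DATABASES shows per-database pool sizes, and after editing pgbouncer.ini a RELOAD applies the change. A minimal sketch:
SHOW CONFIG;
SHOW DATABASES;
RELOAD;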
Type B — Saturation by locks / long transactions
Symptoms
cl_waiting rises
maxwait keeps rising
in Postgres you see wait_event_type = Lock or very old xact_start
CPU not necessarily high (it's contention)
Actions
identify the long transaction (job/cron/user action)
shorten its duration: batching, per-batch commits, avoid external I/O inside the transaction
add lock_timeout, statement_timeout, and idle_in_transaction_session_timeout (according to policy)
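As a concrete sketch of that last point, the timeouts can be scoped to the application role instead of the whole cluster, so a psql session or a migration script is not hit by the same limits. The role name odoo and the values below are placeholders, not a recommendation; in particular, be careful with statement_timeout in Odoo, since long reports and heavy crons can legitimately exceed it:
-- placeholders: adjust role name and values to your own policy
ALTER ROLE odoo SET idle_in_transaction_session_timeout = '5min';
ALTER ROLE odoo SET lock_timeout = '10s';
ALTER ROLE odoo SET statement_timeout = '60s';
-- new sessions pick these up; existing sessions keep their old settings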
Type C — Saturation by resources (CPU/RAM/I/O)
Symptoms
CPU at 100% or high I/O wait
latency rises everywhere
PgBouncer may show a queue, but the root problem is the host/DB
Actions
optimize queries/indexes
reduce concurrency (workers/crons) to decrease contention
improve disk/IOPS
check for bloat/autovacuum if performance drops over time
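For the bloat/autovacuum check, pg_stat_user_tables is usually enough to spot the usual suspects. A minimal sketch (dead-tuple counters are an approximation, not an exact bloat measure):
-- tables with the most dead tuples and when autovacuum last touched them
SELECT relname,
       n_live_tup,
       n_dead_tup,
       round(100.0 * n_dead_tup / greatest(n_live_tup, 1), 1) AS dead_pct,
       last_autovacuum,
       last_autoanalyze
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 20;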
6) A minimum set of "fire-fighting" alerts
If you could only create 6 alerts, they would be these:
PgBouncer
cl_waiting > 0 for 2 min
maxwait > 1s for 2 min
PostgreSQL
transactions > 2 min (count > N)
sessions waiting for locks > N
Odoo / app
p99 latency > 2–3x baseline
error rate (timeouts/5xx) > baseline
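The two PostgreSQL alerts fit in a single query that any monitoring agent can run on a schedule; a sketch, with the thresholds from above hard-coded as examples:
-- how many long transactions and how many sessions waiting on locks, right now
SELECT count(*) FILTER (WHERE xact_start < now() - interval '2 minutes') AS long_transactions,
       count(*) FILTER (WHERE wait_event_type = 'Lock') AS waiting_on_locks
FROM pg_stat_activity
WHERE datname = current_database()
  AND state <> 'idle';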
7) The trick that buys you time: "alert on trend", not on drop
Many monitor "CPU > 90%". That comes too late.
What buys you time is alerting on behavior change:
p99 rises 30–50% compared to the baseline
maxwait goes from 0 to 0.5s and keeps rising
crons start taking 2x as long
This happens before the user feels the pain.
Closing
If you want to detect saturation before the user does:
measure queue in PgBouncer,
measure long transactions and locks in Postgres,
measure p95/p99 in Odoo,
and monitor crons as if they were users (because they are, but more dangerous).