I've been running the aickyway backend with FastAPI + Gunicorn, and as traffic grew, the p99 latency of the image-posts list API started exceeding 400ms. My first thought was "maybe it's slow because it's Python?", but the real problem was that I was running the default settings as-is. The moment I switched to uvloop, the difference was immediately noticeable, and from there I started tuning things one by one. Here's a summary of what I tried.


First, Let's Define Some Terms

There are many technical terms, so let's briefly go over them.

  • Uvicorn: A server that runs ASGI apps like FastAPI. Lightweight and production-grade.
  • uvloop: A C-based event loop that's much faster than Python's default event loop. (Recommended for Linux/macOS)
  • httptools: A C library parser used for processing HTTP requests. Faster than Python's default.
  • Pydantic v2: Handles data validation and serialization. Much faster than v1 thanks to the Rust-based pydantic-core.
  • p50, p95, p99: Percentile metrics for response times. For example, p99 is the time under which 99% of requests complete; only the slowest 1% take longer. Think of it as looking at near-worst-case response times.

p50, p95, p99 refer to percentile (latency percentile) metrics. When measuring API performance, they reflect the actual user experience much more accurately than simply looking at "average response time".


Percentile Metrics Explained

  • p50 (median) Half of all requests are faster than this time, and half are slower. 👉 The response time felt by the "average user".

  • p95 95% of all requests are faster than this time. 👉 Essentially, the value after cutting off the slowest 5% of requests. 👉 Near-worst-case speed experienced by "most users".

  • p99 99% of all requests are faster than this time. 👉 Considers even the top 1% extreme slow requests. 👉 Shows "what the user experience is like at the most sensitive moments, in worst-case scenarios".
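These definitions are easy to sanity-check with the standard library. A quick sketch; the latency samples below are made up for illustration:

```python
import statistics

# Synthetic latency samples in milliseconds (sorted for readability)
latencies = [12, 15, 18, 20, 22, 25, 30, 35, 40, 45,
             50, 55, 60, 70, 80, 95, 120, 180, 260, 420]

# n=100 yields the 1st..99th percentile cut points; index 49 is p50, etc.
cuts = statistics.quantiles(latencies, n=100, method="inclusive")
p50, p95, p99 = cuts[49], cuts[94], cuts[98]
print(f"p50={p50}ms p95={p95}ms p99={p99}ms")
```

Notice how the single 420ms outlier barely moves p50 but dominates the tail metrics. That gap is exactly what an "average response time" hides.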


What is FastAPI Ultra?

In one sentence: 👉 FastAPI + Uvicorn + uvloop + httptools + Pydantic v2 + multi-process architecture

With this combination properly set up, IO-heavy API servers can see a substantial performance improvement.

📌 Start installation like this:

pip install "uvicorn[standard]" fastapi "pydantic>=2" orjson

👉 Using the uvicorn[standard] option automatically includes uvloop and httptools.


Minimal Server Flags (Direct Uvicorn)

uvicorn app.main:app \
  --host 0.0.0.0 --port 8080 \
  --loop uvloop --http httptools \
  --workers $((2*$(nproc)+1)) \
  --timeout-keep-alive 5 \
  --limit-concurrency 1000 \
  --backlog 2048

Why these settings:

  • --loop uvloop, --http httptools → Improved request processing speed
  • --workers 2*CPU+1 → Multiple processes sidestep the GIL, giving true parallelism across cores
  • --timeout-keep-alive 5 → Quickly clean up idle connections to prevent resource waste
  • --limit-concurrency, --backlog → Prevent server crashes during overload situations

Gunicorn Combination

gunicorn app.main:app \
  -k uvicorn.workers.UvicornWorker \
  -w $((2*$(nproc)+1)) \
  --max-requests 5000 --max-requests-jitter 500 \
  --graceful-timeout 30 --timeout 60 --keep-alive 5

👉 Gunicorn is advantageous for worker process management + memory-leak mitigation (via max-requests recycling) + zero-downtime deployment.
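The same flags can live in a config file instead of the command line. A sketch of a `gunicorn.conf.py` (the default file name Gunicorn looks for; the module path `app.main:app` is assumed from the examples above):

```python
# gunicorn.conf.py — mirrors the CLI flags above
import multiprocessing

wsgi_app = "app.main:app"
worker_class = "uvicorn.workers.UvicornWorker"
workers = 2 * multiprocessing.cpu_count() + 1
max_requests = 5000          # recycle workers to contain slow memory leaks
max_requests_jitter = 500    # stagger recycling so workers don't restart together
graceful_timeout = 30
timeout = 60
keepalive = 5
```

Then start with `gunicorn -c gunicorn.conf.py`, which keeps deploy scripts short and the tuning values in version control.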


Pydantic v2: The Real Source of Speed

With pydantic-core written in Rust, JSON validation and serialization speed has become incredibly fast.

Model Configuration Example

from pydantic import BaseModel, ConfigDict, Field
from typing import Annotated

Currency = Annotated[str, Field(pattern=r"^[A-Z]{3}$")]

class Charge(BaseModel):
    model_config = ConfigDict(
        str_max_length=120,            # reject absurdly long strings early
        extra="ignore",                # drop unknown fields instead of erroring
        revalidate_instances="never",  # don't re-validate already-validated models
        ser_json_inf_nan="null",       # serialize inf/NaN as JSON null
    )
    user_id: str
    amount_cents: int
    currency: Currency
    merchant_ref: str | None = None

👉 By utilizing model_validate_json and TypeAdapter, you can validate directly from bytes without intermediate json.loads.
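For example, a TypeAdapter can validate a whole list of objects straight from bytes. A sketch with a trimmed-down version of the model above (assumes pydantic v2 is installed, per the pip command earlier):

```python
from pydantic import BaseModel, TypeAdapter

class Charge(BaseModel):  # trimmed version of the model above
    user_id: str
    amount_cents: int
    currency: str

charges_adapter = TypeAdapter(list[Charge])

raw = b'[{"user_id": "u1", "amount_cents": 1200, "currency": "USD"}]'
parsed = charges_adapter.validate_json(raw)  # bytes in, models out; no json.loads
print(parsed[0].amount_cents)  # → 1200
```

Build the TypeAdapter once at module level and reuse it; constructing it per-request throws away the compiled validator.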


Actual API Endpoint Example

import asyncio

from fastapi import Depends, FastAPI, HTTPException, Request

app = FastAPI()

@app.post("/charge")
async def charge(req: Request, _=Depends(rate_limit)):  # rate_limit: your rate-limiting dependency
    body = await req.body()
    try:
        # Validate straight from bytes; no intermediate json.loads
        c = Charge.model_validate_json(body)
    except Exception as e:
        raise HTTPException(422, "invalid payload") from e

    await asyncio.sleep(0.001)  # stands in for real IO (DB, external API)
    return {"ok": True, "user_id": c.user_id}

👉 Key points

  • Parse directly from bytes
  • Offload CPU-heavy tasks to queues (e.g., PDF, encryption, model computation)

Real-World Performance Changes

  • Default setup (Python HTTP parser, Pydantic v1)

    • p50: 38ms
    • p99: 420ms
    • Max throughput: 1.1k RPS
  • After tuning (uvloop + httptools + Pydantic v2)

    • p50: 26ms
    • p99: 270ms
    • Max throughput: 1.6k RPS

👉 On the same server, a ~45% throughput increase and a ~36% drop in p99!


Common Mistakes (Foot-guns)

  1. Thread overuse → Adds overhead without helping IO-bound async code
  2. Parsing JSON twice → Avoid duplication in middleware+endpoint
  3. DB auto-commit → Increases unnecessary network round trips
  4. Keep-alive too long → Resource shortage during sudden bursts
  5. Ignoring backpressure → Eventually p99 responses explode

Tuning Checklist

  1. Record baseline performance metrics (p50/p95/p99)
  2. Change only one setting at a time and compare
  3. Once a change proves out, codify it in code/infrastructure (not just ad-hoc flags)
  4. Monitor in production for 1 week

👉 A systematically boring approach is the essence of real performance engineering.


Conclusion

FastAPI tuning is not difficult.

  • Uvicorn + uvloop + httptools
  • Pydantic v2 fast path
  • Appropriate keep-alive & concurrency cap

Just following these 3 things will make your API much faster.