2024.12.11·9 min·kafka

Kafka gotchas I wish someone had told me

Kafka is powerful and Kafka is subtle. I've been running it in production for four years and I still learn new things the hard way. Here are the gotchas that caught me, so they don't catch you.

Consumer rebalancing is the villain

You set up a consumer group. Everything hums. Then at 2pm on a Tuesday, a consumer takes too long to process a batch, blows past max.poll.interval.ms, and Kafka triggers a rebalance. Now every consumer in the group stops processing, revokes its partitions, and re-joins. While the rebalance runs, nothing is consumed. Your lag spikes. Your alerts fire. Your dashboards turn red.

The fix: tune max.poll.interval.ms to exceed your worst-case batch processing time, and keep session.timeout.ms well below it. Since KIP-62, heartbeats are sent from a background thread, so the session timeout detects dead processes while the poll interval tells Kafka how long processing between polls may take. Get these wrong and you're in rebalance hell.

# Consumer config that actually works
max.poll.interval.ms = 300000   # 5 min — must exceed processing time
session.timeout.ms   = 30000    # 30s — heartbeat keeps session alive
heartbeat.interval.ms = 10000   # 10s — send heartbeats frequently
max.poll.records     = 500      # don't bite off more than you can chew
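These settings rely on an ordering invariant that is easy to check mechanically. A minimal sketch, assuming nothing beyond the post's own numbers (the helper name and the plain-dict config format are mine, not a Kafka API):

```python
def check_consumer_timeouts(cfg):
    """Check the ordering the settings above rely on:
    heartbeat.interval.ms <= session.timeout.ms / 3 < max.poll.interval.ms.
    (Kafka's docs recommend heartbeating at no more than 1/3 of the
    session timeout; this helper just encodes that rule of thumb.)"""
    hb = cfg["heartbeat.interval.ms"]
    session = cfg["session.timeout.ms"]
    poll = cfg["max.poll.interval.ms"]
    problems = []
    if hb * 3 > session:
        problems.append("heartbeat.interval.ms should be <= session.timeout.ms / 3")
    if session >= poll:
        problems.append("session.timeout.ms must be well below max.poll.interval.ms")
    return problems

cfg = {
    "max.poll.interval.ms": 300_000,
    "session.timeout.ms": 30_000,
    "heartbeat.interval.ms": 10_000,
}
```

check_consumer_timeouts(cfg) returns an empty list for the config above; flip poll and session and it complains.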

Offset management traps

Auto-commit is convenient and dangerous. If you commit offsets before processing, a crash means lost messages (at-most-once). If you commit after processing, a crash means duplicate processing (at-least-once). Kafka's transactions buy you exactly-once for Kafka-to-Kafka pipelines, but the moment your processing has external side effects, you pick your guarantee.

I use manual synchronous commits at the end of each successful batch. It's slower than auto-commit but it means every committed offset corresponds to actually processed data. The throughput cost is minimal; the correctness gain is enormous.

One more thing: enable.auto.commit is true by default in most client libraries. Turn it off. You don't want Kafka deciding when your data is "processed."
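The commit-after-processing trade-off is easy to see in a toy simulation. This is not real client code, just an in-memory model (all names are made up): a crash mid-batch never loses a message, it only reprocesses some.

```python
# Toy in-memory model of commit-after-processing (no real broker).
# A crash mid-batch means the restart resumes from the last committed
# offset: duplicates, never losses -- at-least-once.

def consume(log, committed, batch_size, crash_at=None):
    """Process messages from `committed` onward, committing after each
    full batch. Returns (processed_messages, new_committed_offset)."""
    processed = []
    offset = committed
    while offset < len(log):
        batch = log[offset:offset + batch_size]
        for i, msg in enumerate(batch):
            if crash_at is not None and offset + i == crash_at:
                return processed, offset  # crash: batch never committed
            processed.append(msg)
        offset += len(batch)  # commit only after the full batch succeeded
    return processed, offset

log = ["m0", "m1", "m2", "m3", "m4", "m5"]

# First run crashes while handling offset 4, mid-second-batch.
# Offsets 0-2 were committed; offset 3 was processed but not committed.
seen, committed = consume(log, committed=0, batch_size=3, crash_at=4)

# The restart resumes from the committed offset and reprocesses offset 3.
seen2, committed2 = consume(log, committed=committed, batch_size=3)
```

Message m3 shows up in both runs: that duplicate is the price of never losing a message, and it is why at-least-once consumers need idempotent processing.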

Partition count is a one-way door

You can increase partitions in Kafka. You cannot decrease them. Ever. Choose your partition count based on your projected maximum throughput, not your current throughput. I've seen teams start with 6 partitions, hit throughput limits, add 50, and now they have 56 partitions for a topic that needs 12. There's no going back. Adding partitions has a quieter cost too: keys hash to different partitions afterwards, so per-key ordering breaks at the boundary.

The practical limit for a single partition is roughly 10MB/s or 10k messages/s. Plan accordingly, add headroom, and accept that over-partitioning is cheaper than under-partitioning.

A good rule of thumb: start with expected_peak_throughput / 5MB partitions, rounded up. If that's less than 6, use 6. If it's more than 100, reconsider whether Kafka is the right tool for that particular use case.
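That rule of thumb is mechanical enough to write down. A sketch (the function name, and the 5 MB/s, 6, and 100 constants, are just this post's numbers, not Kafka defaults):

```python
import math

def suggest_partitions(peak_mb_per_s, per_partition_mb_per_s=5,
                       floor=6, ceiling=100):
    """Expected peak throughput / 5 MB/s per partition, rounded up,
    clamped to at least 6; above 100, reconsider the design."""
    n = max(math.ceil(peak_mb_per_s / per_partition_mb_per_s), floor)
    if n > ceiling:
        raise ValueError(f"{n} partitions suggested; is Kafka the right tool here?")
    return n
```

suggest_partitions(12) gives 6 (the floor wins); suggest_partitions(250) gives 50; suggest_partitions(600) raises, which is the "reconsider" branch.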

The lag myth

Consumer lag is the most watched metric in Kafka. It's also the most misleading. Lag going up doesn't always mean your consumer is slow — it might mean your producer just spiked. Lag going down doesn't always mean you're healthy — you might be skipping messages.

Watch lag per partition, not aggregate. Watch processing rate alongside it. And always correlate lag spikes with deployment events — a bad deploy is the most common cause of sudden lag.
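Per-partition lag is just log-end offset minus committed offset, partition by partition. A sketch with made-up numbers (the dicts stand in for what you'd fetch from the broker; nothing here is a real client call):

```python
def per_partition_lag(end_offsets, committed_offsets):
    """Lag per partition: log-end offset minus committed consumer offset.
    The aggregate sum hides a single stuck partition; the dict doesn't."""
    return {p: end_offsets[p] - committed_offsets.get(p, 0)
            for p in end_offsets}

end_offsets = {0: 1_000, 1: 1_000, 2: 5_000}   # made-up offsets
committed   = {0: 990,   1: 995,   2: 1_200}
lag = per_partition_lag(end_offsets, committed)
total = sum(lag.values())
```

The aggregate (3,815 here) looks like a general slowdown; the per-partition view shows partition 2 alone is 3,800 behind — one stuck consumer or a hot key, which is a very different incident.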

Replication isn't backup

I've seen teams treat Kafka's replication factor as a backup strategy. It's not. Replication protects against broker failure, not against human error. If you accidentally delete a topic, replication makes sure it's deleted on all replicas simultaneously. Use proper backups: kafka-console-consumer with --from-beginning piped to a file, or a dedicated backup tool like kafka-backup.

Kafka is a tool, not a religion. Use it when you need durable, ordered, replayable event streams. Don't use it as a database, a queue, or a service mesh.

thanks for reading