Simple Checklist Before Deploying Kafka to Production

A complete Kafka production checklist covering infrastructure setup, security, topic design, and monitoring.

Johan Stavik

Running Apache Kafka in production is a different challenge altogether. While Kafka excels at handling massive streams of records across clusters, deploying it without following production best practices can lead to performance bottlenecks and security vulnerabilities. This guide walks through the essential steps, including broker selection, security implementation, and monitoring setup, to ensure your Kafka deployment is scalable, secure, and enterprise-ready.

Infrastructure and Cluster Configuration

Proper infrastructure configuration forms the foundation of a stable Kafka deployment.

Choosing the Right Number of Brokers

Production Kafka clusters require at least three brokers to maintain high availability. This configuration supports a replication factor of 3 with min.insync.replicas set to 2, allowing the cluster to ride through a single broker failure without affecting producers that use acks=all.

Hardware specifications matter as much as broker count. Each broker should have 64 GB RAM, with Kafka's heap size not exceeding 6 GB. The remaining memory serves as file system cache, which Kafka leverages heavily. Allocate 24 cores per broker, prioritizing core count over clock speed since Kafka benefits more from concurrency than raw processing power.

Network throughput often becomes the bottleneck before disk, particularly with multiple consumer groups reading from the cluster. Modern production clusters benefit from 25 GbE minimum network interfaces, with 40 GbE or 100 GbE appropriate for large-scale deployments.

Setting Up Storage and Retention

Use multiple drives to maximize throughput, avoiding shared drives with application logs or OS activity. RAID 10 provides the best balance for production, offering improved read and write performance with data protection. Run Kafka on XFS or ext4, and use NVMe drives for clusters handling over 100 MB/sec per broker.

Retention policies determine how long Kafka keeps messages before deletion. Time-based retention uses log.retention.hours, defaulting to 168 hours (7 days). Size-based retention through log.retention.bytes caps partition size, with a default of -1 indicating no limit.

Storage capacity planning requires accounting for replication: storage per broker = (daily_throughput x retention_days x replication_factor) / broker_count + 20% overhead.
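Plugged into that formula, a hypothetical sizing run looks like this; every number below is illustrative, not a recommendation:

```python
# Storage-per-broker estimate from the formula above.
def storage_per_broker_gb(daily_throughput_gb, retention_days,
                          replication_factor, broker_count,
                          overhead=0.20):
    raw = daily_throughput_gb * retention_days * replication_factor
    return raw / broker_count * (1 + overhead)

# 100 GB/day of producer traffic, 7-day retention, RF 3, 3 brokers:
print(round(storage_per_broker_gb(100, 7, 3, 3)))  # 840 GB per broker
```

Note that replication multiplies the raw footprint across the cluster, so tripling retention or replication factor triples the disk you must provision.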

KRaft Mode vs ZooKeeper

KRaft mode eliminates ZooKeeper dependency by integrating metadata management directly into Kafka using the Raft consensus protocol. Production KRaft deployments require at least three controllers, each needing minimum 4 GB RAM and an SSD of at least 64 GB.

KRaft offers several advantages: metadata operations become more efficient through direct integration, controller election during failures completes faster, and the system removes the operational overhead of maintaining a separate ZooKeeper ensemble.

Kafka Topic Design Best Practices

Topic design decisions directly influence Kafka's ability to handle your workload effectively.

Planning Partition Count

Partition count affects throughput and consumer parallelism. The formula for calculating partitions starts with throughput requirements: max(t/p, t/c), where t is target throughput, p is producer throughput per partition, and c is consumer throughput per partition.

A practical example: with a 500 MB/s target and 50 MB/s per-partition throughput on both the producer and consumer side, t/p = t/c = 10. But a group of 30 consumers needs at least 30 partitions to keep every member busy, so the answer is max(10, 30) = 30 partitions. Starting with 40-50 partitions leaves room for growth.

Remember that Kafka assigns each partition to one consumer thread within a consumer group, creating a hard parallelism ceiling. Having more consumers than partitions leaves some idle.
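The two constraints above (per-partition throughput and the consumer-parallelism ceiling) can be combined in a small sizing sketch; the numbers are illustrative:

```python
import math

# Partition sizing from max(t/p, t/c), plus the consumer-parallelism
# floor: a group of N consumers needs at least N partitions.
def partition_count(target_mb_s, producer_mb_s_per_partition,
                    consumer_mb_s_per_partition, consumer_count):
    by_producer = math.ceil(target_mb_s / producer_mb_s_per_partition)
    by_consumer = math.ceil(target_mb_s / consumer_mb_s_per_partition)
    return max(by_producer, by_consumer, consumer_count)

# 500 MB/s target, 50 MB/s per partition on both sides, 30 consumers:
print(partition_count(500, 50, 50, 30))  # 30
```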

Setting Replication Factor

Production environments should use a replication factor of 3 for all topics. Three replicas mean acknowledged data survives the loss of up to 2 brokers, though with min.insync.replicas set to 2 the topic stays writable through only one failure. Pair these settings with producers using acks=all so a majority of replicas persist each write before the broker acknowledges success.
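The failure-tolerance arithmetic behind those settings is worth making explicit:

```python
# Failure tolerance implied by RF 3 + min.insync.replicas 2 + acks=all.
# With acks=all, writes succeed only while the in-sync replica set is
# at least min.insync.replicas large.
replication_factor = 3
min_insync_replicas = 2

# Brokers that can be down while the topic stays writable:
writable_through = replication_factor - min_insync_replicas
# Brokers that can be lost before acknowledged data disappears:
durable_through = replication_factor - 1

print(writable_through, durable_through)  # 1 2
```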

Naming and Retention Conventions

Consistent naming improves discoverability. The recommended pattern follows domain.subdomain.data structure, such as sales.ecommerce.shoppingcarts. Avoid version numbers in topic names and disable auto.create.topics.enable to enforce manual creation through approval processes.
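As a sketch, the naming pattern can be enforced with a simple check before topic creation; the regex below is an assumption made for illustration, not an official Kafka rule:

```python
import re

# Accepts lowercase dot-separated names with at least three segments,
# matching the domain.subdomain.data convention described above.
TOPIC_PATTERN = re.compile(r"^[a-z][a-z0-9]*(\.[a-z][a-z0-9]*){2,}$")

def valid_topic_name(name: str) -> bool:
    return bool(TOPIC_PATTERN.match(name))

print(valid_topic_name("sales.ecommerce.shoppingcarts"))  # True
print(valid_topic_name("shoppingcarts-v2"))               # False
```

A check like this fits naturally into whatever approval workflow replaces auto-creation.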

For retention, time-based settings use retention.ms at the topic level, defaulting to 7 days. Log compaction offers an alternative by setting cleanup.policy=compact, retaining only the latest value per key.

Security and Access Control Setup

Securing Kafka protects against unauthorized access and data breaches. While Kafka ships with security features disabled by default, production environments require multiple layers of protection.

SSL/TLS Encryption

SSL encrypts data in transit between clients and brokers. Implementation starts with generating public/private keypairs for every broker. The keystore contains each broker's private and public keys and must be kept secure. The truststore stores trusted certificates and can be shared across all clients and brokers.

From Kafka 2.7.0 onwards, SSL keystores and truststores can be configured directly in PEM format within the configuration file. Hostname verification is enabled by default from Kafka 2.0.0, verifying the server's FQDN against the certificate's Common Name or Subject Alternative Name fields.
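For illustration, broker-side PEM settings might look like the following, shown here as a Python dict of server.properties entries; the file paths and listener address are placeholders:

```python
# Broker-side SSL with PEM files (supported from Kafka 2.7.0).
# Paths and the listener address below are placeholders.
broker_ssl_config = {
    "listeners": "SSL://0.0.0.0:9093",
    "ssl.keystore.type": "PEM",
    "ssl.keystore.location": "/etc/kafka/ssl/broker.pem",
    "ssl.truststore.type": "PEM",
    "ssl.truststore.location": "/etc/kafka/ssl/ca.pem",
    # Hostname verification is on by default; leave this set to https.
    "ssl.endpoint.identification.algorithm": "https",
}
```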

SASL Authentication

SASL provides authentication for verifying client identity. Production environments should always combine SASL with SSL encryption. Of the available mechanisms, SASL/SCRAM (SHA-256 or SHA-512) provides strong security using salted hashes with a minimum iteration count of 4096. GSSAPI implements Kerberos for enterprise environments, while OAUTHBEARER enables OAuth2 integration for cloud-native deployments.
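A client configured for SCRAM over TLS might carry properties like these, shown as a Python dict; the username, password, and truststore path are placeholders:

```python
# Client-side SASL/SCRAM over TLS, using standard Kafka client
# property names. Credentials below are placeholders.
scram_client_config = {
    "security.protocol": "SASL_SSL",
    "sasl.mechanism": "SCRAM-SHA-512",
    # JAAS string for the Java client; librdkafka-based clients use
    # sasl.username / sasl.password instead.
    "sasl.jaas.config": (
        "org.apache.kafka.common.security.scram.ScramLoginModule "
        'required username="app-user" password="change-me";'
    ),
    "ssl.truststore.location": "/etc/kafka/ssl/truststore.jks",
}
```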

Access Control Lists

ACLs control which principals can perform operations on specific resources. Without ACLs defined for a resource, Kafka restricts access to super users only. Create one principal per application and grant only required permissions. The kafka-acls.sh CLI manages ACL operations, with --producer and --consumer flags providing convenient shortcuts for common permission sets.

Beyond ACLs, harden the operating environment: configure ulimit to allow 128,000 or more open files per broker, since each partition holds open file handles for the index and data files of every log segment.

Producer and Consumer Configuration

Client configuration determines how producers write data and consumers read it.

Acknowledgment Levels and Batching

The acks parameter controls delivery guarantees. acks=0 offers maximum throughput but zero durability. acks=1 provides basic protection, while acks=all requires all in-sync replicas to acknowledge writes. Production environments should combine acks=all with replication factor 3 and min.insync.replicas=2.

Batching groups multiple records destined for the same partition into single requests. The default batch.size is 16 KB. The linger.ms property (defaulting to 5ms in Kafka 4.0) introduces deliberate delay to allow more records to accumulate, improving efficiency without significantly impacting latency.

Retry and Error Handling

Producers automatically retry transient failures. Enable enable.idempotence=true to prevent duplicate messages during retries, ensuring exactly-once semantics. Idempotence requires acks=all and max.in.flight.requests.per.connection limited to 5 or fewer.
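Pulling the producer guidance together, a durable configuration might look like this (standard Java-client property names, values as discussed above):

```python
# Producer settings combining the durability and batching guidance above.
producer_config = {
    "acks": "all",                                # wait for all ISRs
    "enable.idempotence": True,                   # safe retries, no dupes
    "max.in.flight.requests.per.connection": 5,   # must be <= 5
    "batch.size": 16384,                          # 16 KB default
    "linger.ms": 5,                               # let batches fill
}
```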

For consumers, auto-commit creates risk of message loss. If a consumer crashes after auto-commit but before processing completes, those messages are skipped on restart. Manual commits via commitSync() block until broker acknowledgment for reliability, while commitAsync() improves throughput but requires explicit error handling.
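The commit-after-processing ordering can be illustrated without a broker, using a small in-memory stand-in (FakeLog is invented for this sketch; it is not a Kafka API):

```python
# Toy illustration of manual commits: offsets are committed only after
# processing, so a crash mid-batch causes reprocessing (at-least-once)
# rather than silent loss.
class FakeLog:
    def __init__(self, records):
        self.records = records
        self.committed = 0          # next offset to read after restart

    def poll(self, n):
        return list(enumerate(self.records[self.committed:],
                              start=self.committed))[:n]

    def commit(self, offset):
        self.committed = offset + 1

log = FakeLog(["a", "b", "c", "d"])
processed = []

for offset, value in log.poll(3):
    processed.append(value)
    log.commit(offset)              # commit only after processing

# "Crash" and restart: resume from the last committed offset.
for offset, value in log.poll(10):
    processed.append(value)

print(processed)  # ['a', 'b', 'c', 'd']
```

Reversing the order (commit before processing) is exactly the auto-commit failure mode described above: the restarted consumer would skip records it never finished.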

Monitoring and Performance Optimization

Operational visibility becomes critical once cluster and client configurations are in place.

Key Metrics to Watch

Track these broker metrics continuously: request rate and latency, under-replicated partitions (should remain at zero), leader election rate, producer record send rate, and consumer lag. Consumer lag is your most important early warning signal since it shows whether consumers are keeping pace with producers.
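Lag itself is simple arithmetic, the log-end offset minus the committed offset per partition; a minimal sketch with illustrative numbers:

```python
# Per-partition consumer lag: log-end offset minus committed offset.
def consumer_lag(log_end_offsets, committed_offsets):
    return {p: log_end_offsets[p] - committed_offsets.get(p, 0)
            for p in log_end_offsets}

lag = consumer_lag({0: 1500, 1: 1480}, {0: 1500, 1: 1200})
print(lag)  # {0: 0, 1: 280}
```

A steadily growing value for any partition means consumers are falling behind and is worth alerting on.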

Prometheus with JMX exporter is the standard monitoring stack for Kafka. Kafka brokers expose metrics in Prometheus-compatible format using JMX exporter JAR files, enabling real-time dashboards for CPU, memory, network I/O, and throughput.

Log Management and Testing

Log retention defaults to 168 hours via log.retention.hours. Segment size defaults to 1 GB through log.segment.bytes. Conduct retention audits monthly for high-volume topics, quarterly for operational telemetry, and annually for regulatory data.

Before going live, simulate production workloads including message keys, compression settings, and partitioning schemes. Tools like Gatling provide Kafka plugins for protocol-level testing, and tests written as code can be integrated directly into CI/CD pipelines.

Conclusion

Deploying Kafka to production requires careful attention to multiple moving parts. Start with the fundamentals: set up at least three brokers, configure a replication factor of 3, and enable SSL/TLS encryption. Once those are stable, focus on topic design and access controls. Then establish monitoring before handling real production traffic.

To ensure long-term success, many organizations also bring in specialist Kafka development services to optimize architecture, implement best practices, and accelerate deployment with expert guidance.

Each component in this checklist directly impacts your cluster's reliability and performance. You don’t need to implement everything at once, but every item you skip is a risk you’re accepting. Use this checklist as your foundation—and expert support where needed—for running Kafka reliably at scale.
