Thanks for the response. I used top yesterday and determined that part of
my problem was that the Kafka shell script is pre-configured to use only
512M of RAM, and thus it wasn't using memory efficiently. Raising that limit
has helped out tremendously. An echo at the start of the script warning that
it was defaulting to such a low value probably would have saved me some time. In
the same vein, I should have inspected the launch command more closely.
The virtualization of AWS makes it difficult to truly know what your
performance is, IMHO. There are lots of people arguing on the web about
the value of bare metal versus virtualization. I am still baffled how
companies like Urban Airship are purportedly seeing bursts of 750,000
messages per second on a 3-machine cluster, but by playing with the knobs
in a controlled manner, I'm starting to better understand the relationship
and effect on the overall system.
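For anyone else who hits this, the fix is just to give the JVM a bigger heap before launching the broker. A minimal sketch of the idea (the KAFKA_HEAP_OPTS variable name and the 4G figure are assumptions; in some 0.7.x distributions the -Xmx512M is hard-coded inside kafka-server-start.sh and has to be edited there directly):

```shell
# Give the broker a larger heap before starting it.
# KAFKA_HEAP_OPTS and the 4G sizing are assumptions, not a recommendation;
# older 0.7.x start scripts hard-code -Xmx512M and must be edited in place.
export KAFKA_HEAP_OPTS="-Xms4G -Xmx4G"
echo "broker heap: ${KAFKA_HEAP_OPTS}"
# ./bin/kafka-server-start.sh config/server.properties
```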
On 5/21/13 11:44 AM, "Philip O'Toole" wrote:
As a test, why not just use a disk with provisioned IOPS of 4000? Just as
a test - see if it improves.
Also, you have not supplied any metrics regarding the VM's performance.
Is the CPU busy? Is IO maxed out? Network? Disk? Use a tool like atop,
and tell us what you find.
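If atop isn't installed on the brokers, a quick first pass with stock tools might look like this (the sampling counts are arbitrary; the point is to see whether CPU, memory, or disk is the wall):

```shell
# Sample CPU, memory, and disk utilization on a broker host.
uptime                                  # load averages
vmstat 1 3 || true                      # CPU/memory, three 1-second samples
iostat -x 1 3 2>/dev/null \
  || echo "iostat not installed (sysstat package)"
```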
On May 20, 2013, at 6:43 PM, Ken Krugler wrote:
On May 20, 2013, at 10:01am, Jason Weiss wrote:
I'm using Kafka 0.7.2. I am using the default replication factor; I
don't recall changing that configuration at all.
I'm using provisioned IOPS, which from attending the AWS event in NYC a
few weeks ago was presented as the "fastest storage option" for EC2. A
number of partners presented success stories in terms of throughput
with provisioned IOPS. I've tried to follow that model.
In my experience directly hitting an ephemeral drive on m1.large is
faster than using EBS.
I've seen some articles where RAIDing multiple EBS volumes can exceed
the performance of ephemeral drives, but with high variability.
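A crude way to compare the two is a sequential-write test against each mount point (the output path and size here are placeholders, and dd only approximates Kafka's append-heavy write pattern, so treat the numbers as a rough signal):

```shell
# Rough sequential-write throughput check: point the output file at the
# ephemeral-disk mount, then at the EBS mount, and compare the reported rates.
dd if=/dev/zero of=/tmp/ddtest bs=1M count=64 conv=fdatasync 2>&1 | tail -n 1
rm -f /tmp/ddtest
```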
If you want to maximize performance, set up a (smaller) cluster of
SSD-backed instances with 10Gb Ethernet in the same placement group.
E.g. test with three cr1.8xlarge instances.
On 5/20/13 12:56 PM, "Scott Clasen" wrote:
My guess: EBS is likely your bottleneck. Try running on instance
disks, and compare your results. Is this 0.8? What replication factor?
On Mon, May 20, 2013 at 8:11 AM, Jason Weiss <email@example.com> wrote:
I'm trying to maximize my throughput and seem to have hit a ceiling.
Everything described below is running in AWS.
I have configured a Kafka cluster with 5 machines, m1.large, with 600
provisioned IOPS storage for each EC2 instance. I have a single ZooKeeper
node (we aren't in production yet, so I didn't take the time to set up a ZK
cluster). Publishing to a single topic from 7 different clients, I
max out at around 20,000 eps with a fixed 2K message size. Each broker
defines 10 file segments, with a 25,000 message / 5 second flush
configuration in server.properties. I have stuck with 8 threads. My
producers (Java) are configured with batch.num.messages at 50, and
queue.buffering.max.messages at 100.
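One thing that stands out in the settings above: with batch.num.messages at 50 and queue.buffering.max.messages at 100, the async queue only holds two batches, so producers may spend time blocked on a full queue. A hypothetical producer config with more headroom might look like this (the values are illustrative guesses to tune from, using the property names quoted above):

```shell
# Write a hypothetical async-producer config with a deeper buffer.
# These values are illustrative assumptions, not measured recommendations.
cat > /tmp/producer.properties <<'EOF'
producer.type=async
batch.num.messages=200
queue.buffering.max.messages=10000
EOF
cat /tmp/producer.properties
```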
When I went from 4 servers in the cluster to 5 servers, I only saw an
increase of about 500 events per second in throughput. In sharp contrast,
when I run a complete environment on my MacBook Pro, tuned as above but
with a single ZK and a single Kafka broker, I am seeing [...] events per
second. I don't think I'm network constrained in the AWS environment
(producer side), because when I add one more client, my MacBook Pro, I see
a proportionate decrease in EC2 client throughput, and the aggregate
result is an identical 20,000 eps. Stated differently, my EC2 clients give
up throughput when my local MacBook Pro joins the array of producers, such
that the total throughput is exactly the same.
Does anyone have any additional suggestions on what else I could try to
hit our goal of 50,000 eps with a 5-machine cluster? Based on published
whitepapers, LinkedIn describes a peak of 170,000 events per second
across their cluster. My 20,000 seems so far away from their numbers.
What is the relationship, in terms of performance, between ZK and Kafka?
Do I need to have a more performant ZK cluster, the same, or does it not
matter in terms of maximizing throughput?
Thanks for any suggestions; I've been pulling and turning knobs on this
for several days now.