Folks:

Every couple of years I run into a problem where some Python code that

worked well at small scales starts burning up my CPU at larger scales,

and the underlying issue turns out to be the idiom of accumulating

data by string concatenation. It just happened again

(http://foolscap.lothar.com/trac/ticket/149 ), and as usual it is hard

to make the data accumulator efficient without introducing a bunch of

bugs into the surrounding code. So this time around I decided to try

encapsulating my preferred more efficient idiom into a reusable class.

So I present to you StringChain, which is an efficient way to

accumulate and process data in many chunks:

http://tahoe-lafs.org/trac/stringchain

Here are some benchmarks generated by running python -OOu -c 'from

stringchain.bench import bench; bench.quick_bench()' as instructed by

the README.txt file.

The N: in the left-hand column is how many bytes were in the test

dataset. The ave rate: number in the right-hand column is how many

bytes per second were processed. "naive" means the string-based idiom

sketched above and "strch" means using the StringChain class.

_buildup init_naive

N: 65536, time: best: 0.00, 2th-best: 0.00, ave: 0.00,

2th-worst: 0.00, worst: 0.00 (of 5), reps/s: 890, ave

rate: 58350579

N: 131072, time: best: 0.00, 2th-best: 0.00, ave: 0.00,

2th-worst: 0.00, worst: 0.00 (of 5), reps/s: 265, ave

rate: 34800398

N: 262144, time: best: 0.01, 2th-best: 0.01, ave: 0.01,

2th-worst: 0.01, worst: 0.01 (of 5), reps/s: 79, ave

rate: 20745346

N: 524288, time: best: 0.05, 2th-best: 0.05, ave: 0.05,

2th-worst: 0.05, worst: 0.05 (of 5), reps/s: 20, ave

rate: 10823850

_buildup init_strch

N: 65536, time: best: 0.00, 2th-best: 0.00, ave: 0.00,

2th-worst: 0.00, worst: 0.00 (of 5), reps/s: 25543, ave

rate: 1674043282

N: 131072, time: best: 0.00, 2th-best: 0.00, ave: 0.00,

2th-worst: 0.00, worst: 0.00 (of 5), reps/s: 14179, ave

rate: 1858538925

N: 262144, time: best: 0.00, 2th-best: 0.00, ave: 0.00,

2th-worst: 0.00, worst: 0.00 (of 5), reps/s: 8016, ave

rate: 2101513050

N: 524288, time: best: 0.00, 2th-best: 0.00, ave: 0.00,

2th-worst: 0.00, worst: 0.00 (of 5), reps/s: 4108, ave

rate: 2154215572

_consume init_naive_loaded

N: 65536, time: best: 0.00, 2th-best: 0.00, ave: 0.00,

2th-worst: 0.00, worst: 0.00 (of 5), reps/s: 931, ave

rate: 61037862

N: 131072, time: best: 0.00, 2th-best: 0.00, ave: 0.00,

2th-worst: 0.00, worst: 0.00 (of 5), reps/s: 270, ave

rate: 35454393

N: 262144, time: best: 0.01, 2th-best: 0.01, ave: 0.01,

2th-worst: 0.01, worst: 0.01 (of 5), reps/s: 74, ave

rate: 19471963

N: 524288, time: best: 0.05, 2th-best: 0.05, ave: 0.05,

2th-worst: 0.05, worst: 0.06 (of 5), reps/s: 19, ave

rate: 10146747

_consume init_strch_loaded

N: 65536, time: best: 0.00, 2th-best: 0.00, ave: 0.00,

2th-worst: 0.00, worst: 0.00 (of 5), reps/s: 4309, ave

rate: 282447500

N: 131072, time: best: 0.00, 2th-best: 0.00, ave: 0.00,

2th-worst: 0.00, worst: 0.00 (of 5), reps/s: 2313, ave

rate: 303263357

N: 262144, time: best: 0.00, 2th-best: 0.00, ave: 0.00,

2th-worst: 0.00, worst: 0.00 (of 5), reps/s: 1186, ave

rate: 311159052

N: 524288, time: best: 0.00, 2th-best: 0.00, ave: 0.00,

2th-worst: 0.00, worst: 0.00 (of 5), reps/s: 606, ave

rate: 317814669

_randomy init_naive

N: 65536, time: best: 0.00, 2th-best: 0.00, ave: 0.00,

2th-worst: 0.00, worst: 0.00 (of 5), reps/s: 479, ave

rate: 31450561

N: 131072, time: best: 0.01, 2th-best: 0.01, ave: 0.01,

2th-worst: 0.01, worst: 0.01 (of 5), reps/s: 140, ave

rate: 18461191

N: 262144, time: best: 0.02, 2th-best: 0.02, ave: 0.02,

2th-worst: 0.03, worst: 0.03 (of 5), reps/s: 42, ave

rate: 11127714

N: 524288, time: best: 0.06, 2th-best: 0.07, ave: 0.08,

2th-worst: 0.08, worst: 0.09 (of 5), reps/s: 13, ave

rate: 6906341

_randomy init_strch

N: 65536, time: best: 0.00, 2th-best: 0.00, ave: 0.00,

2th-worst: 0.00, worst: 0.00 (of 5), reps/s: 973, ave

rate: 63827127

N: 131072, time: best: 0.00, 2th-best: 0.00, ave: 0.00,

2th-worst: 0.00, worst: 0.00 (of 5), reps/s: 495, ave

rate: 64970669

N: 262144, time: best: 0.00, 2th-best: 0.00, ave: 0.00,

2th-worst: 0.00, worst: 0.00 (of 5), reps/s: 239, ave

rate: 62913360

N: 524288, time: best: 0.01, 2th-best: 0.01, ave: 0.01,

2th-worst: 0.01, worst: 0.01 (of 5), reps/s: 121, ave

rate: 63811569

The naive approach is slower than the StringChain class, and the

bigger the dataset the slower it goes. The StringChain class is fast

and also it is scalable (with regard to these benchmarks at least...).

Thanks!

Regards,

Zooko