"John Salerno" <johnjsal at NOSPAMgmail.com> writes:

"Diez B. Roggisch" <deets at nospam.web.de> wrote in message

news:69g2bpF31ttuuU1 at mid.uni-berlin.de...

the above code is pretty much of a no-no because it has quadratic runtime

behavior.

What's that mean, exactly?

It means that running time will increase with the square of the
problem size: 2.000.000 items will take 4 times as long as 1.000.000
items, and 5.000.000 items will take 25 times as long as 1.000.000
items.
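You can see this without a stopwatch by counting the element
comparisons the list-based approach performs (a sketch;
`dedup_comparisons` is a name I made up, not from the original code):

```python
def dedup_comparisons(items):
    """Dedup via linear list search, counting element comparisons.

    For an all-unique input the count is exact: each failed
    'item not in new' test compares against all len(new) elements.
    """
    new = []
    comparisons = 0
    for item in items:
        comparisons += len(new)   # cost of the 'not in' scan below
        if item not in new:
            new.append(item)
    return comparisons

# With all-unique input, doubling the size roughly quadruples the work:
small = dedup_comparisons(list(range(1000)))   # 0+1+...+999
big = dedup_comparisons(list(range(2000)))     # 0+1+...+1999
```

Doubling the input here makes the comparison count grow by a factor of
about 4, which is exactly the quadratic behavior described above.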

> Are you referring to both examples, or just the second one?

The problem is that the lookup you did to see if an item had already
been seen is done by a linear search of a list. The search time is
linearly proportional to the number of items in the list 'new', but
since the number of times you do that search also increases with the
length of the input, the total runtime for your function increases
even more. This thus applies to both your examples, since they really
do the same thing.
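Your original code isn't quoted above, but a list-based dedup of this
shape (a hypothetical reconstruction, not your exact code) is what is
being described:

```python
def unique(items):
    """Return items with duplicates removed, keeping first-seen order."""
    new = []
    for item in items:
        if item not in new:   # linear scan of 'new': O(len(new)) per check
            new.append(item)
    return new

print(unique([3, 1, 3, 2, 1]))   # [3, 1, 2]
```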

However, it actually depends on what your input is. For the runtime to
increase with the square of the input length in your function, the
number of *unique* items in the input must be a constant fraction of
the input length. By that I mean that, for example, 5% of the input
items are unique, so that if your input is 100 items, then the list
'new' will be 5 items at the end of the function, and if your input is
5.000.000 items, then 'new' will become 250.000 items long.

This might not be the case in your use. If, for example, you know that
the input only consists of integers between 0 and 10, then 'new' will
never become longer than 11 items, regardless of whether the input is
20 or 20.000.000.000 items. The runtime of your function then suddenly
becomes linear in the problem size instead of quadratic.

Even so, maintaining the set of already seen items as a set is likely
to be faster than keeping them as a list, even for a fairly small
number of unique items.
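A sketch of that approach (the names are mine): the set handles the
membership tests in O(1) on average, while the list preserves the
first-seen order:

```python
def unique(items):
    """Return items with duplicates removed, keeping first-seen order."""
    seen = set()   # fast membership tests (hashing, O(1) on average)
    new = []       # keeps the first-seen order
    for item in items:
        if item not in seen:
            seen.add(item)
            new.append(item)
    return new

print(unique([3, 1, 3, 2, 1]))   # [3, 1, 2]
```

Note that this requires the items to be hashable, which the pure
list-based version does not.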

One small exception to this, though. If maintaining only one data
structure (the ordered list 'new' in your code) makes your working set
fit within the processor cache, while maintaining two data structures
(a list for keeping the order, and a set for searching) makes it *not*
fit within the cache, then it *might* actually be faster to just
maintain the list. Unlikely, but not entirely impossible, and just a
small change of the problem size can change the balance again.

--
Thomas Bellman, Lysator Computer Club, Linköping University, Sweden
"What sane person could live in this world !  bellman @ lysator.liu.se
 and not be crazy?" -- Ursula K LeGuin     !  Make Love -- Nicht Wahr!