FAQ
Dear All,

I am currently trying to write a simple Agglomerative Clustering
algorithm which sorts through my MP3 collection and uses associated
Last.FM tags to cluster files into 'genres'. Unfortunately, I'm having
some trouble with my algorithm and some tracks are ending up in multiple
clusters. I have a feeling this is because of (my lack of understanding
of) list behaviours within Python. This is all purely a learning
exercise, so any input would be greatly appreciated

The actual clustering method can be seen described here. Each Cluster is
stored within the 'clusters' array and has two elements: the data
containing song metadata such as artist, title and most importantly
tags, and a weight: that is, the overall Euclidean weight of the cluster
(based on the song data within).

In theory, the algorithm runs in a loop until all clusters have been
merged into a hierarchy. It takes the weight of each cluster and merges
each cluster with their closest counterpart. My code has a lot of
comments so it should be fairly easy to understand.

#-----------------------------Code Snippet
-----------------------------------------------#

def cluster(self):
'''Run the clustering operation on the files
'''

#add a cluster for each track to clusters
for song in self.music.keys():
self.clusters.append({ 'data' : [song], 'weight' : long(0) })

currentLevel = 0 #current level of hierarchy

#while there is more than one cluster in the sorting bank
# i.e. we haven't achieved hierachy yet, run the algorithm
while( len(self.clusters) > 1):

print "Currently at Level %d" % currentLevel

print "There are %d clusters at this level" % len(self.clusters)

#store the level in the levels array
self.levels.append(self.clusters)

#empty the clusters list
self.clusters = []

#find the weight of each current cluster
for c in self.levels[currentLevel]:

self.clusters.append({'data' : c['data'],
'weight' :
self.__getClusterWeight(c['data'])})

#do the actual clustering

tmp = []

for c in self.clusters:

closestCluster = None
closestDelta = float('inf')

for c2 in self.clusters:

#skip if this is the same node
if(c == c2):
continue

delta = abs(c2['weight'] - c['weight'])

if(delta < closestDelta):
closestCluster = c2
closestDelta = delta

print "Merging clusters %(c1)d and %(c2)d" % {'c1' :
self.clusters.index(c),
'c2' :
self.clusters.index(closestCluster)}
#now merge the two clusters
self.clusters.remove(closestCluster)
self.clusters.remove(c)

c['data'].extend(closestCluster['data'])

tmp.append(c)

#increment the level of the hierarchy

self.clusters = tmp

currentLevel += 1

#--------------------------------End Code Snippet
----------------------------#

Thanks,

James

## Search Discussions

•  at Jan 24, 2011 at 11:43 am ⇧

James Ravenscroft wrote:

Dear All,

I am currently trying to write a simple Agglomerative Clustering
algorithm which sorts through my MP3 collection and uses associated
Last.FM tags to cluster files into 'genres'. Unfortunately, I'm having
some trouble with my algorithm and some tracks are ending up in multiple
clusters. I have a feeling this is because of (my lack of understanding
of) list behaviours within Python. This is all purely a learning
exercise, so any input would be greatly appreciated

The actual clustering method can be seen described here. Each Cluster is
stored within the 'clusters' array and has two elements: the data
containing song metadata such as artist, title and most importantly
tags, and a weight: that is, the overall Euclidean weight of the cluster
(based on the song data within).

In theory, the algorithm runs in a loop until all clusters have been
merged into a hierarchy. It takes the weight of each cluster and merges
each cluster with their closest counterpart. My code has a lot of
comments so it should be fairly easy to understand.
tmp = []

for c in self.clusters:

closestCluster = None
closestDelta = float('inf')

for c2 in self.clusters:

#skip if this is the same node
if(c == c2):
continue

delta = abs(c2['weight'] - c['weight'])

if(delta < closestDelta):
closestCluster = c2
closestDelta = delta

print "Merging clusters %(c1)d and %(c2)d" % {'c1' :
self.clusters.index(c),
'c2' :
self.clusters.index(closestCluster)}
#now merge the two clusters
self.clusters.remove(closestCluster)
self.clusters.remove(c)

c['data'].extend(closestCluster['data'])

tmp.append(c)
I can't run your code because you didn't make it standalone, but I believe
that the problem (at least one of them) is that you iterate over
self.clusters and modify it from within the loop. That's a big no-no in
python. A simple example to demonstrate the effects:
import random
old = range(10)
new = []
for item in old:
... closest = random.choice(old)
... new.append((item, closest))
... old.remove(item)
... old.remove(closest)
...
old
[3, 4]
new
[(0, 8), (2, 1), (5, 7), (9, 6)]

The remedy is normally to iterate over a copy

for item in list(old):
...

but in your case that is probably not enough.
Try something along these lines:

# untested
while len(self.clusters) > 1:
c = self.clusters.pop()
# find closest index
for i, c2 in enumerate(self.clusters):
...
if ...:
closest_index = i
closest = self.clusters.pop(closest_index)
tmp.append(c + closest)
if self.clusters:
tmp.append(self.clusters[0])

Peter
•  at Jan 24, 2011 at 9:15 pm ⇧
Peter

I can't run your code because you didn't make it standalone,
Thanks for the heads up, I've made a simple version of the clusterer
which you can view on pastebin: http://pastebin.com/7HmAkmfj If you have time to look through
my code I would be very grateful!

but in your case that is probably not enough.
Try something along these lines:

# untested
while len(self.clusters) > 1:
c = self.clusters.pop()
# find closest index
for i, c2 in enumerate(self.clusters):
...
if ...:
closest_index = i
closest = self.clusters.pop(closest_index)
tmp.append(c + closest)
if self.clusters:
tmp.append(self.clusters[0])
I had a go at implementing something along the lines above and I'm still
getting very bizarre results. There does appear to be some sort of logic
to it though, if you look at the graph chart, you can see that It seems
to be doing the clustering right and then forgetting to remove the old
groupings providing this "cloning" effect for some cluster groups.

Chart: http://img826.imageshack.us/i/clusters.png/

Thanks,

James

-- James Ravenscroft Funky Monkey Software - Bespoke Web and Software
Solutions http://www.funkymonkeysoftware.com/

-------------- next part --------------
An HTML attachment was scrubbed...
•  at Jan 26, 2011 at 9:43 am ⇧

James Ravenscroft wrote:

I can't run your code because you didn't make it standalone,
Thanks for the heads up, I've made a simple version of the clusterer
which you can view on pastebin: http://pastebin.com/7HmAkmfj If you have
time to look through my code I would be very grateful!

but in your case that is probably not enough.
Try something along these lines:

# untested
while len(self.clusters) > 1:
c = self.clusters.pop()
# find closest index
for i, c2 in enumerate(self.clusters):
...
if ...:
closest_index = i
closest = self.clusters.pop(closest_index)
tmp.append(c + closest)
if self.clusters:
tmp.append(self.clusters[0])
I had a go at implementing something along the lines above and I'm still
getting very bizarre results. There does appear to be some sort of logic
to it though, if you look at the graph chart, you can see that It seems
to be doing the clustering right and then forgetting to remove the old
groupings providing this "cloning" effect for some cluster groups.
I'm sorry I can't infer the intended algorithm from the code you provide.
Perhaps you can give a short description in plain English?

However, here's the implementation of the algorithm you mention as described
on wikipedia:

http://en.wikipedia.org/wiki/Cluster_analysis#Agglomerative_hierarchical_clustering

Repeatedly merge the two closest clusters until there's only one left.

To keep track of the two merged clusters I've added a "children" key to your
node dictionaries. The intermediate states of the tree are also put into the
levels variable (I suppose that's the purpose of your levels attribute).

The main difference to your code is that

(1) list modifications occur only in the outer while loop, so both inner
loops can become for-loops again.
(2) test all pairs before choosing the pair

debug = True
def cluster(self):
'''Run the clustering operation on the files
'''
clusters = []
#add a cluster for each track to clusters
for song in self.music:
clusters.append({'data' : [song]})
levels = []
while len(clusters) > 1:
# calculate weights
for c in clusters:
c["weight"] = self.__getClusterWeight(c['data'])

# find closest pair
closestDelta = float('inf')
closestPair = None
for i, a in enumerate(clusters):
for k, b in enumerate(clusters):
if i == k:
break
delta = abs(a['weight'] - b['weight'])
if delta < closestDelta:
closestPair = i, k
closestDelta = delta

# merge closest pair
hi, lo = closestPair
left = clusters[lo]
right = clusters.pop(hi)
clusters[lo] = {"data": left["data"] + right["data"],
"children": [left, right]}

# keep track of the intermediate tree states
levels.append(list(clusters))

if self.debug:
print ("stage %d" % len(levels)).center(40, "-")
for c in clusters:
print c["data"]

# store tree
self.clusters = clusters
self.levels = levels

Note that there's some potential for optimisation. For example, you could
move the weight calculation out of the while-loop and just calculate the
weight for the newly merged node.

Peter

## Related Discussions

Discussion Overview
 group python-list categories python posted Jan 24, '11 at 10:02a active Jan 26, '11 at 9:43a posts 4 users 2 website python.org

### 2 users in discussion

Content

People

Support

Translate

site design / logo © 2022 Grokbase