Some tools or processes generate output, usually in some form of CSV output which holds a line entry for a given type of object, and any number of associated properties, all of the same type. This can be viewed as a compressed form of a table describing a many to many relationship, sometimes called a junction table or as I prefer in the context of this example, a pairing table.
Consider a list of people and the operating systems they work with. I have included the trailing empty entries that can occur in csv files.
example = [
['Tom', 'Windows', 'Linux', ''],
['Dick', 'Windows', 'OSX', 'AIX'],
['Harry', 'Linux', 'BSD', '']
]
Although this might originally be stored somehwere as a pairing table already, in a database somewhere, if we do not control the source, or whatever mechanism generates the output, how do we extract the underlying structure? Let's say we want to end up with a representation like this:
Windows | Linux | OSX | BSD | AIX | |
---|---|---|---|---|---|
Tom | X | X | |||
Dick | X | X | X | ||
Harry | X | X |
Because the columns in the original data are not aligned we need to extract the original pairing table to enable a pivot transformation.
def pairinglist(csvlist):
# clear out the empty strings from csv conversion
nonemptylist = map(lambda line: filter(None, line), csvlist)
# generate all combinations as sublists of tuples
pairings = map(lambda line: [(line[0], pairing) for pairing in line[1:]], nonemptylist)
# flatten the sublists
return [pairing for line in pairings for pairing in line]
for entry in example:
print entry
['Tom', 'Windows', 'Linux', ''] ['Dick', 'Windows', 'OSX', 'AIX'] ['Harry', 'Linux', 'BSD', '']
for entry in pairinglist(example):
print entry
('Tom', 'Windows') ('Tom', 'Linux') ('Dick', 'Windows') ('Dick', 'OSX') ('Dick', 'AIX') ('Harry', 'Linux') ('Harry', 'BSD')
The list of tuples generated by the function can be fed to a csvwriter, or some other tool. I tend to use Excel pivottables for this kind of data, but some python visualization could also be used.