Now that we have a more or less correct list of road names, we need to decide what features the classifier will need and how we are going to extract them. Here are some features I'm planning to use:
So it looks like we'll need the "road name" and the "road tag", dropping all modifiers like "North" or "First". We'll split the dataframe of full road names accordingly.
(Note: this is fairly pedestrian stuff. You may wish to skip ahead to the classification steps.)
import pandas as pd
df = pd.read_csv("singapore-roadnames-final.csv")
# drop the column of numbers
df.drop("Unnamed: 0", inplace=True, axis=1)
df
# we'll be using final_name: the name column will be for combining
# our final classification info back with the geojson file
# with all the geographic data
 | name | final_name |
---|---|---|
0 | Orchard Road | Orchard Road |
1 | Hougang Avenue 1 | Hougang Avenue 1 |
2 | Scotts Road | Scotts Road |
3 | Keng Lee Road | Keng Lee Road |
4 | Newton Road | Newton Road |
5 | Sarkies Road | Sarkies Road |
6 | Patterson Road | Paterson Road |
7 | Orchard Boulevard | Orchard Boulevard |
8 | Grange Road | Grange Road |
9 | Paterson Hill | Paterson Hill |
10 | River Valley Road | River Valley Road |
11 | Unity Street | Unity Street |
12 | Merbau Road | Merbau Road |
13 | Mohamed Sultan Road | Mohamed Sultan Road |
14 | Saiboo Street | Saiboo Street |
15 | Merchant Loop | Merchant Loop |
16 | Clemenceau Avenue | Clemenceau Avenue |
17 | Merchant Road | Merchant Road |
18 | Read Cresent | Read Crescent |
19 | Tampines Expressway | Tampines Expressway |
20 | Seletar Expressway | Seletar Expressway |
21 | Central Expressway | Central Expressway |
22 | Telok Blangah Road | Telok Blangah Road |
23 | Ayer Rajah Expressway | Ayer Rajah Expressway |
24 | Turf Club Avenue | Turf Club Avenue |
25 | Kranji Expressway | Kranji Expressway |
26 | Prinsep Street | Prinsep Street |
27 | Tanglin Road | Tanglin Road |
28 | Alexandra Road | Alexandra Road |
29 | Nicoll Highway | Nicoll Highway |
... | ... | ... |
3404 | Seletar North Link | Seletar North Link |
3405 | Ghim Moh Link | Ghim Moh Link |
3406 | Hougang Street 31 | Hougang Street 31 |
3407 | Hougang Street 32 | Hougang Street 32 |
3408 | Serangoon Lane | Serangoon Lane |
3409 | Gambir Walk | Gambir Walk |
3410 | Upper Serangoon Crescent | Upper Serangoon Crescent |
3411 | Ubi Close | Ubi Close |
3412 | Sin Ming Lane | Sin Ming Lane |
3413 | Compassvale Lane | Compassvale Lane |
3414 | Lorong 5 Realty Park | Lorong 5 Realty Park |
3415 | Wee Nam Road | Wee Nam Road |
3416 | Tampines Street 72 | Tampines Street 72 |
3417 | Changi South Lane | Changi South Lane |
3418 | Telegraph Street | Telegraph Street |
3419 | Biopolis Street | Biopolis Street |
3420 | Biopolis Link | Biopolis Link |
3421 | Plymouth Avenue | Plymouth Avenue |
3422 | Gentle Road | Gentle Road |
3423 | Leicester Road | Leicester Road |
3424 | Simon Walk | Simon Walk |
3425 | Joo Hong Road | Joo Hong Road |
3426 | Florence Close | Florence Close |
3427 | Hoot Kiam Road | Hoot Kiam Road |
3428 | Yishun Avenue 8 | Yishun Avenue 8 |
3429 | Choa Chu Kang Avenue 6 | Choa Chu Kang Avenue 6 |
3430 | Clarke Quay | Clarke Quay |
3431 | Countryside Walk | Countryside Walk |
3432 | PIE | Pan-Island Expressway |
3433 | Nepal Park | Nepal Park |
3434 rows × 2 columns
# to get an idea of the road tags/modifiers that we should eliminate,
# let's do a word frequency table for the full road names we do have
from collections import Counter
c = Counter()
for name in df.final_name:
    for word in name.split():
        c[word] += 1
c
Counter({'Road': 831, 'Avenue': 396, 'Jalan': 381, 'Street': 304, 'Drive': 222, 'Lane': 136, 'Lorong': 131, 'Crescent': 116, 'Walk': 108, 'Park': 94, 'West': 82, 'Link': 70, 'Woodlands': 68, 'Terrace': 68, 'Tuas': 65, 'Bukit': 62, 'Tampines': 60, 'Place': 59, '1': 58, 'Close': 57, '2': 55, 'View': 53, 'Jurong': 52, 'Geylang': 49, 'North': 49, 'Way': 48, '3': 47, 'Kang': 46, 'Chu': 44, 'East': 44, 'Hill': 43, 'Grove': 43, 'Bedok': 40, 'Central': 39, 'Pasir': 37, 'Mo': 37, 'Kio': 37, 'Ang': 37, '4': 36, 'Rise': 35, 'South': 34, 'Changi': 33, 'Ris': 32, 'Coast': 32, 'Industrial': 31, 'Batok': 30, 'Seletar': 28, '5': 28, 'Yishun': 26, 'Upper': 26, '6': 25, 'Choa': 24, 'Lim': 23, 'Green': 23, 'Telok': 22, 'Serangoon': 21, 'Tai': 21, 'Toh': 21, 'Hougang': 21, 'Gardens': 20, 'Eunos': 20, 'Mount': 19, 'Siglap': 18, 'Teck': 18, 'Toa': 17, 'Kim': 17, '8': 17, 'Payoh': 17, 'Merah': 17, 'Hwan': 16, 'Seng': 16, 'Heights': 15, '7': 15, 'Lentor': 15, 'Chuan': 15, 'Sungei': 14, 'St': 14, 'Bridge': 14, 'Old': 14, 'Holland': 14, 'Garden': 14, 'Marsiling': 13, 'Ubi': 13, 'Loop': 13, 'Gul': 13, 'Pioneer': 13, 'Boon': 12, 'Sunset': 12, 'Joo': 12, '11': 12, '12': 12, 'Limau': 12, 'Farmway': 12, 'Clementi': 12, 'Soon': 12, '9': 11, 'Thomson': 11, 'Sembawang': 11, 'Keng': 11, 'Nanyang': 11, 'Chin': 11, 'Kadut': 11, 'Blangah': 11, 'Tanah': 10, 'Taman': 10, 'Tanjong': 10, 'Ayer': 10, 'Kallang': 10, 'Faber': 10, 'Sengkang': 10, 'Kampong': 10, 'Loyang': 10, '10': 10, 'Bishan': 10, 'Tuck': 10, 'Springleaf': 10, 'Defu': 10, 'Eng': 10, 'Club': 10, 'Punggol': 10, 'Expressway': 10, 'Circus': 10, 'Guan': 10, 'Namly': 9, 'Kaki': 9, 'Tong': 9, 'Simei': 9, 'Valley': 9, 'Circle': 9, 'Marina': 9, 'Square': 9, '31': 9, 'Eastwood': 9, 'Poh': 9, 'Mayflower': 9, 'Anchorvale': 8, 'Prince': 8, 'Mimosa': 8, 'Bee': 8, 'Boulevard': 8, 'Kew': 8, 'Kurau': 8, 'Kong': 8, 'Pandan': 8, 'Lebar': 8, 'Compassvale': 8, '21': 8, '22': 8, 'Penjuru': 8, 'Sinai': 8, 'Soo': 8, 'Greenleaf': 8, 'Sector': 8, 'Yunnan': 8, 
'Senoko': 7, 'Yang': 7, 'Vista': 7, 'Lee': 7, 'Paya': 7, 'Sennett': 7, '13': 7, '14': 7, 'Bahru': 7, 'Chestnut': 7, 'Island': 7, 'Frankel': 7, 'Saraca': 7, '24': 7, 'Pisang': 7, 'Marine': 7, 'Business': 7, 'Sims': 7, 'Commonwealth': 7, '32': 7, 'Chow': 7, 'Vale': 7, 'Novena': 7, 'Lok': 7, 'Stadium': 7, '41': 7, 'Buangkok': 7, 'Sunrise': 6, 'Li': 6, 'Countryside': 6, 'Goldhill': 6, 'Ridge': 6, 'Ria': 6, 'Ming': 6, '51': 6, '52': 6, 'Tan': 6, 'Batu': 6, 'Watten': 6, 'Kranji': 6, 'Rivervale': 6, 'Kent': 6, 'Admiralty': 6, 'Bay': 6, 'Hillview': 6, 'Koon': 6, 'Kechil': 6, 'Neo': 6, '23': 6, 'Begonia': 6, 'Lengkok': 6, 'Lengkong': 6, 'Ching': 6, 'Kian': 6, 'Tari': 6, 'Westwood': 6, 'Huat': 5, 'Whampoa': 5, 'Mariam': 5, 'Keat': 5, 'Benoi': 5, 'Chancery': 5, "King's": 5, 'Ho': 5, 'Hong': 5, 'Biopolis': 5, 'Estate': 5, 'Springwood': 5, 'Second': 5, 'Sun': 5, 'Keppel': 5, 'Tagore': 5, 'Whye': 5, 'Aerospace': 5, 'Chee': 5, 'Parry': 5, 'Quay': 5, '61': 5, 'Vanda': 5, 'Mugliston': 5, 'Sin': 5, 'Orchard': 5, 'Lay': 5, 'Clover': 5, 'Springside': 5, '71': 5, 'Realty': 5, 'Coronation': 5, 'Lucky': 5, 'Kismis': 5, 'Robin': 5, 'Tiew': 5, 'Stratton': 5, 'Beach': 5, 'Third': 5, 'Dover': 5, 'Harvey': 4, 'Mandai': 4, 'Phoenix': 4, 'Rhu': 4, 'Dedap': 4, 'Tiong': 4, 'Cluny': 4, 'Pier': 4, 'Parade': 4, 'Chai': 4, 'Sea': 4, 'Hospital': 4, 'Binjai': 4, 'Happy': 4, 'Panjang': 4, 'First': 4, 'Viaduct': 4, 'Rajah': 4, 'Republic': 4, 'Fernvale': 4, 'Rd': 4, 'Swee': 4, 'Kee': 4, 'Yung': 4, 'Woodgrove': 4, 'Pavilion': 4, '62': 4, '64': 4, 'New': 4, 'Seraya': 4, 'Mei': 4, 'Besar': 4, 'How': 4, 'Gateway': 4, 'Ring': 4, 'Canal': 4, 'Swiss': 4, 'Seah': 4, 'Aljunied': 4, '25': 4, 'Lian': 4, 'Farrer': 4, '72': 4, 'Pari': 4, 'College': 4, 'Palm': 4, 'Nim': 4, 'Pemimpin': 4, 'Simon': 4, 'Siang': 4, 'Ah': 4, 'Yio': 4, 'Plain': 4, 'Sing': 4, '81': 4, "George's": 4, 'Manis': 4, 'River': 4, 'Chiat': 4, 'Peakville': 4, 'Cove': 4, 'Jervois': 4, 'Lornie': 4, 'Airport': 4, 'Tanglin': 4, 'Fourth': 4, 'Cashew': 4, 
'Cresent': 4, 'Timah': 4, '42': 4, 'Kuning': 3, 'Haig': 3, 'Queen': 3, '91': 3, '93': 3, '92': 3, 'Stangee': 3, 'Ocean': 3, 'Hills': 3, 'Imbiah': 3, 'Cairnhill': 3, 'Lilac': 3, 'Bank': 3, 'Bin': 3, '16': 3, "Queen's": 3, '53': 3, 'Binchang': 3, 'Thong': 3, 'Still': 3, 'Sommerville': 3, 'Surin': 3, 'Claymore': 3, 'Kilang': 3, 'Ind': 3, 'Reservoir': 3, 'Sultan': 3, 'Kandis': 3, 'Duku': 3, 'Martin': 3, 'Yan': 3, 'Leedon': 3, 'Havelock': 3, 'Court': 3, 'Turf': 3, 'Guillemard': 3, 'Neram': 3, 'Balmoral': 3, 'Genting': 3, '63': 3, '65': 3, 'Meragi': 3, 'Marymount': 3, 'Raffles': 3, 'Highway': 3, 'Bunga': 3, 'Port': 3, 'Pine': 3, 'Buona': 3, 'Gerald': 3, 'Waringin': 3, 'Sireh': 3, 'Height': 3, 'Paterson': 3, 'Lower': 3, 'Stevens': 3, 'Ferry': 3, 'Eu': 3, '73': 3, 'Breeze': 3, 'King': 3, 'Ubin': 3, 'Wan': 3, 'Wak': 3, 'Astrid': 3, 'Hassan': 3, 'Almond': 3, 'Jambol': 3, 'Burgundy': 3, 'Chatsworth': 3, 'Tiga': 3, 'Ulu': 3, 'Cross': 3, 'Mas': 3, 'Gallop': 3, 'Outram': 3, 'Bartley': 3, 'Kuang': 3, '33': 3, '34': 3, 'Fernhill': 3, 'Clemenceau': 3, 'Harbourfront': 3, 'Adam': 3, 'Sam': 3, 'Cheng': 3, '82': 3, 'Centre': 3, 'Potong': 3, 'Moh': 3, 'Hock': 3, 'Kembangan': 3, 'Villas': 3, 'Haji': 3, 'Victoria': 3, 'Kayu': 3, 'Cantonment': 3, 'Jambu': 3, 'Science': 3, '44': 3, '43': 3, 'Elizabeth': 3, 'Fudu': 2, 'Buroh': 2, 'Lakme': 2, 'Sedap': 2, 'Puteh': 2, 'Pan-Island': 2, 'Lempeng': 2, 'Wales': 2, 'Basin': 2, 'Circular': 2, 'Vernon': 2, 'Brighton': 2, 'Draycott': 2, 'Gambas': 2, 'Munshi': 2, 'Gemala': 2, 'Sari': 2, 'Lin': 2, 'Aroozoo': 2, 'Terusan': 2, 'Maxwell': 2, 'Leonie': 2, 'Emerald': 2, "Monk's": 2, 'Shipyard': 2, 'Woo': 2, 'University': 2, 'Circuit': 2, 'Sandy': 2, 'Rama': 2, 'Dua': 2, 'Hindhede': 2, 'Canberra': 2, 'See': 2, 'Tembeling': 2, 'Sen': 2, 'Pickering': 2, 'Piring': 2, 'Hillcrest': 2, 'Timor': 2, 'Ghim': 2, 'Anderson': 2, 'Winstedt': 2, 'Nam': 2, 'Alexandra': 2, 'Conway': 2, 'Hoon': 2, 'Belilios': 2, 'Merryn': 2, 'Chapel': 2, 'Sakra': 2, 'Satu': 2, 'Chempaka': 2, 
'Sheares': 2, 'D': 2, 'Shenton': 2, '54': 2, 'Rochor': 2, 'Lew': 2, 'Amber': 2, 'Abdullah': 2, 'Rangoon': 2, 'Suffolk': 2, 'Inggu': 2, 'Track': 2, 'Andrew': 2, 'Cornwall': 2, 'Bulan': 2, 'Charles': 2, 'Highgate': 2, 'Rosyth': 2, 'Portsdown': 2, 'Woodleigh': 2, 'Tat': 2, 'Tay': 2, 'Duchess': 2, 'Fort': 2, 'Dunman': 2, 'York': 2, 'Montreal': 2, 'Batalong': 2, 'Hume': 2, 'Damai': 2, 'Hua': 2, 'Chwee': 2, 'Grace': 2, 'Kitchener': 2, 'Kay': 2, 'Angullia': 2, 'Bright': 2, 'Fusionopolis': 2, 'Gardenia': 2, '1A': 2, 'Sophia': 2, '15': 2, '17': 2, '19': 2, '18': 2, 'Grange': 2, 'Field': 2, 'Crawford': 2, 'Halt': 2, 'Senang': 2, 'Angsa': 2, 'Oriole': 2, 'Fifth': 2, 'Turn': 2, 'Kelulut': 2, 'Cres': 2, 'Peck': 2, 'Redhill': 2, 'Sherwood': 2, 'Marshall': 2, 'Edward': 2, 'Tock': 2, 'Sixth': 2, 'Leith': 2, 'Service': 2, 'Raya': 2, 'MacPherson': 2, 'Molek': 2, 'Melayu': 2, 'Pearl': 2, 'Auckland': 2, 'Mountbatten': 2, 'of': 2, 'Dairy': 2, 'Hoe': 2, 'Lada': 2, 'Heng': 2, 'Maple': 2, 'Temasek': 2, 'Penang': 2, 'Course': 2, 'Bencoolen': 2, 'Tengah': 2, 'Sunview': 2, 'Interchange': 2, 'Rampai': 2, 'Casuarina': 2, 'Beatty': 2, 'Jiak': 2, 'Raja': 2, 'Coleman': 2, 'Daisy': 2, '20': 2, 'Chong': 2, 'Elliot': 2, 'Newton': 2, 'Wajek': 2, 'Ampas': 2, 'Orchid': 2, 'Sinar': 2, 'Udang': 2, 'Anak': 2, 'Everitt': 2, 'Delta': 2, '8A': 2, 'Spring': 2, 'Ponggol': 2, 'Peng': 2, 'Langgar': 2, 'Recreation': 2, '75': 2, 'Hai': 2, 'Bilal': 2, 'Bintang': 2, 'Wilkie': 2, 'Mawar': 2, 'Richards': 2, 'Fulton': 2, 'Oxley': 2, 'Coastal': 2, 'Watt': 2, 'Elite': 2, 'Siloso': 2, 'Stagmont': 2, 'Market': 2, 'Dalvey': 2, 'Kerong': 2, 'Camp': 2, 'Clive': 2, 'Brookvale': 2, 'Woollerton': 2, 'Java': 2, 'Plains': 2, 'Tannery': 2, 'Yuan': 2, 'Tosca': 2, 'Florence': 2, 'International': 2, 'Choo': 2, 'Tyersall': 2, 'Malcolm': 2, 'Empress': 2, 'Tomlinson': 2, 'Piccadilly': 2, 'Sturdee': 2, 'Albert': 2, 'Limbok': 2, 'Angklong': 2, 'Zion': 2, 'Teban': 2, 'Klang': 2, 'Duxton': 2, 'Gentle': 2, 'Melati': 2, 'Leng': 2, 'Gambir': 2, 
'Nicoll': 2, 'Sirat': 2, 'Regent': 2, 'Merchant': 2, 'Rosie': 2, 'Yi': 2, 'Stirling': 2, 'Barat': 2, 'Ceylon': 2, 'Exchange': 2, 'Venus': 2, 'Evergreen': 2, 'Sussex': 2, 'Tua': 2, 'Charlton': 2, 'Kiam': 2, 'Rochester': 2, 'Farm': 2, 'Layang': 2, 'Tras': 2, 'Nassim': 2, 'Merbau': 2, 'Dunearn': 2, 'Dickson': 2, 'Kramat': 2, 'Enam': 2, 'Elias': 2, 'Siak': 2, 'Sayang': 2, 'Cleantech': 2, 'Race': 2, 'Chian': 2, 'Enduf': 2, 'Boundary': 2, 'Keris': 2, 'Flora': 2, 'Wellington': 2, 'Ann': 2, 'Strathmore': 2, 'Starlight': 2, 'Oxford': 2, 'Carmen': 2, 'Cassia': 2, 'Cactus': 2, 'Bayfront': 2, 'Verde': 2, 'Salam': 2, 'Kelantan': 2, 'Canning': 2, 'Prinsep': 2, 'Cuscaden': 2, 'Weld': 2, 'Teow': 2, 'Teo': 2, 'Lake': 2, '83': 2, '40': 2, 'Emas': 2, 'Fir': 1, 'Kensington': 1, 'Rutland': 1, 'Kwong': 1, 'Redop': 1, 'Bangkit': 1, 'Cheong': 1, 'Waterloo': 1, 'Penaga': 1, 'Catterick': 1, 'Finlayson': 1, 'Exeter': 1, 'Thomas': 1, 'Tenteram': 1, 'Kinta': 1, 'Sarkies': 1, 'Windsor': 1, 'MacTaggart': 1, 'Salle': 1, 'Kilat': 1, 'Kingsmead': 1, 'Durain': 1, 'Kemajuan': 1, 'Kechot': 1, 'Hamilton': 1, 'Kesoma': 1, 'Hooper': 1, 'Carisbrooke': 1, 'Playfair': 1, 'Lapang': 1, '108': 1, '102': 1, '101': 1, '106': 1, '107': 1, '104': 1, '105': 1, '27A': 1, 'Gelegar': 1, 'Jeruju': 1, 'MacRitchie': 1, 'Peirce': 1, 'Cranborne': 1, 'Koh': 1, 'Saujana': 1, 'Teng': 1, 'Belimbing': 1, 'Brompton': 1, 'Hythe': 1, '9A': 1, 'Tenang': 1, 'Tudor': 1, 'Kasturi': 1, 'Siew': 1, 'Hemmant': 1, '94': 1, 'Fraser': 1, 'Lekar': 1, 'La': 1, 'Corporation': 1, 'Swan': 1, 'Abu': 1, 'Lembah': 1, 'Hampstead': 1, 'Irving': 1, 'Maritime': 1, 'Capricorn': 1, 'Rodyk': 1, 'Westerhout': 1, 'Pekan': 1, 'A': 1, 'Alias': 1, 'Matlock': 1, 'Napiri': 1, 'Nepal': 1, 'Xilin': 1, 'Japanese': 1, 'Miltonia': 1, 'Lloyd': 1, 'Selamat': 1, 'Robertson': 1, 'McNair': 1, 'Pennefather': 1, 'Armenian': 1, 'Syed': 1, 'Banda': 1, 'Gendang': 1, 'Ashwood': 1, 'Beechwood': 1, 'Burong': 1, 'Omar': 1, 'Bachok': 1, 'Yam': 1, 'Saunders': 1, 'Sempadan': 1, 
"Helier's": 1, 'Rosewood': 1, 'Karikal': 1, 'Kingswear': 1, 'Gopeng': 1, 'Prome': 1, '24A': 1, 'Pesawat': 1, 'Holt': 1, 'Meyer': 1, 'Mambong': 1, 'Seaview': 1, 'MacKenzie': 1, 'Nutmeg': 1, 'Cambridge': 1, 'Jelebu': 1, 'Walshe': 1, 'Chay': 1, 'Bristol': 1, 'Chulia': 1, 'Killiney': 1, 'Petain': 1, 'Chantek': 1, 'Pelajau': 1, 'Jelapang': 1, 'Fajar': 1, 'Katong': 1, 'Mesin': 1, 'Haigsville': 1, 'Lantana': 1, 'Craig': 1, 'Tenon': 1, 'Hillside': 1, 'Sajak': 1, 'Sek': 1, 'Denham': 1, 'Depot': 1, 'Berrima': 1, 'Penhas': 1, 'Truro': 1, 'Jasmine': 1, 'Pegu': 1, 'Ash': 1, 'Ludlow': 1, 'Dinding': 1, 'Buckley': 1, 'Ardmore': 1, 'Lakeside': 1, 'Lewis': 1, 'Berjaya': 1, 'Pelatina': 1, 'Mayne': 1, 'Concourse': 1, 'Cherpen': 1, 'Paradise': 1, 'Empat': 1, 'Dalhousie': 1, 'Sukachita': 1, 'Upavon': 1, 'Brani': 1, 'Tian': 1, 'Nanson': 1, 'Cashin': 1, 'Serenade': 1, 'Pakis': 1, 'Greja': 1, 'Andover': 1, 'Petir': 1, 'Terang': 1, 'Thiam': 1, 'Jupiter': 1, 'Koek': 1, 'Samulun': 1, 'Greenridge': 1, 'Pei': 1, 'Carnation': 1, 'one-north': 1, 'Leyden': 1, 'John': 1, 'Kubor': 1, 'Oei': 1, 'Minden': 1, 'Tunnel': 1, 'Tree': 1, 'Pendek': 1, 'Pender': 1, 'Quemoy': 1, 'Arthur': 1, 'Zapin': 1, 'Derum': 1, 'Machang': 1, 'Mohamed': 1, 'Carlisle': 1, 'Penjara': 1, 'Mangis': 1, 'Fullerton': 1, 'Papan': 1, 'Telegraph': 1, 'Jacaranda': 1, 'Muscat': 1, 'Sumang': 1, 'Telipok': 1, 'Cuppage': 1, 'Serasi': 1, 'Malan': 1, 'Senin': 1, 'Kasai': 1, 'Aruan': 1, 'Pinewood': 1, 'Kasau': 1, 'Labrador': 1, 'Elgin': 1, '25A': 1, 'Hoot': 1, 'Yasin': 1, 'Harrison': 1, 'Campbell': 1, 'Lynwood': 1, 'Tamarind': 1, 'Hood': 1, 'Lekub': 1, 'Ellis': 1, 'Arif': 1, 'Wangi': 1, 'Meng': 1, 'Bah': 1, 'Allenby': 1, 'Lothian': 1, 'Lyndhurst': 1, 'Lompang': 1, 'Biggin': 1, 'Gymkhana': 1, 'Portchester': 1, 'Leicester': 1, 'Spottiswoode': 1, 'Awang': 1, 'Roseburn': 1, 'Coniston': 1, 'Purut': 1, 'Fiji': 1, 'Russels': 1, '5A': 1, 'Wenya': 1, 'Baghdad': 1, 'Flint': 1, 'Philip': 1, 'Tenggiri': 1, 'Roberts': 1, 'Jansen': 1, 'Wing': 1, 'Banyan': 
1, '58': 1, 'Scotts': 1, '50': 1, 'Lye': 1, 'Janggus': 1, 'Tawas': 1, 'Bain': 1, 'Biawak': 1, 'Mosque': 1, 'Ewe': 1, 'Fairways': 1, 'Echo': 1, 'Dusun': 1, 'Liput': 1, 'Bengkok': 1, 'Suan': 1, 'Riang': 1, 'Hu': 1, 'Wimborne': 1, 'Gumilang': 1, 'Gray': 1, 'Arcadia': 1, 'Poyan': 1, 'Whitley': 1, 'Tin': 1, 'Dakota': 1, 'Rienzi': 1, 'AVe': 1, 'Rambai': 1, 'Tyrwhitt': 1, 'Dyson': 1, 'Veerasamy': 1, 'Lavender': 1, 'Semangka': 1, 'Olive': 1, 'Kelopak': 1, 'Lemon': 1, 'Hang': 1, 'Hobart': 1, 'Blandford': 1, 'Barnabas': 1, 'Ratus': 1, 'Arguilla': 1, 'Joon': 1, 'Selimang': 1, 'Pipit': 1, 'Oakwood': 1, 'Purvis': 1, 'Keli': 1, 'Malta': 1, 'Minyak': 1, 'Sim': 1, 'Bena': 1, 'Kechubong': 1, 'Mayo': 1, 'Senandong': 1, 'Butterfly': 1, 'Alnwick': 1, 'Sembong': 1, 'Pasiran': 1, 'Worthing': 1, 'Eber': 1, 'Lokam': 1, 'Clacton': 1, 'Daliah': 1, 'Clifton': 1, 'Nipah': 1, 'Cranwell': 1, 'Emily': 1, 'Chorak': 1, 'Madras': 1, 'Braddell': 1, 'Todak': 1, 'Falkland': 1, 'Allanbrooke': 1, 'Braemar': 1, 'Peel': 1, 'Burn': 1, "Patrick's": 1, 'Netheravon': 1, 'Bayshore': 1, 'Figaro': 1, 'Bury': 1, 'Intan': 1, 'Pelikat': 1, 'Tech': 1, 'Edgefield': 1, 'Tractor': 1, 'Surrey': 1, 'Selangat': 1, 'Kerayong': 1, 'Gladiola': 1, 'Arnap': 1, 'Tao': 1, 'Tah': 1, 'Bideford': 1, 'Wilton': 1, 'Kebaya': 1, 'Corfe': 1, 'Edgedale': 1, 'Orange': 1, 'Talib': 1, 'Blackmore': 1, 'Ford': 1, 'Gajus': 1, 'Paku': 1, 'Francis': 1, 'Insaf': 1, 'Normanton': 1, 'Moonstone': 1, 'Makeway': 1, 'Lotus': 1, 'Peradun': 1, 'Mattar': 1, 'Malacca': 1, 'Technology': 1, 'Kakatua': 1, 'Said': 1, 'Mempurong': 1, 'Saik': 1, 'Hullet': 1, 'Barracks': 1, 'Davidson': 1, 'Minbu': 1, 'Greenpark': 1, 'Bras': 1, 'Wilmonar': 1, 'Pitt': 1, 'Sago': 1, 'Saga': 1, 'Basong': 1, 'Lien': 1, 'Lilin': 1, 'Daud': 1, 'Daun': 1, 'Bangau': 1, 'Danau': 1, 'Kupang': 1, 'Kerbau': 1, 'Menarong': 1, 'Essex': 1, 'Rimau': 1, 'Stokesay': 1, 'Dermawan': 1, 'Carmichael': 1, 'Besut': 1, 'Tank': 1, 'Tani': 1, 'Muswell': 1, 'Martia': 1, 'Yarwood': 1, 'Yeang': 1, 'Puay': 1, 
'Crowhurst': 1, 'E8/E9': 1, 'White': 1, 'Lakeshore': 1, 'Cypress': 1, 'Anson': 1, 'Geck': 1, 'Avon': 1, 'Norfolk': 1, 'Cable': 1, 'Balestier': 1, 'Serai': 1, 'Payong': 1, 'Kelichap': 1, 'Marlene': 1, 'Hijau': 1, 'Sejarah': 1, 'Stanley': 1, 'Parsi': 1, 'Bermuda': 1, 'Halus': 1, 'Lichi': 1, 'Commerce': 1, 'G': 1, 'Rindu': 1, 'Foch': 1, 'Folkestone': 1, 'Battery': 1, 'Harbour': 1, 'Ontario': 1, 'Belibas': 1, 'Caldecott': 1, 'Royal': 1, 'Temechut': 1, "Margaret's": 1, 'Hemsley': 1, 'Fowlie': 1, 'Bowmont': 1, 'Palmer': 1, 'Synagogue': 1, 'Marne': 1, 'Ophir': 1, 'Burghley': 1, 'Enterprise': 1, 'Ali': 1, 'CleanTech': 1, 'Farleigh': 1, 'Setia': 1, 'Shunfu': 1, 'Cherry': 1, 'Seruling': 1, 'Maju': 1, 'Patong': 1, 'Da': 1, 'Hall': 1, "D'almeida": 1, 'Somme': 1, 'Tekka': 1, 'Ridley': 1, 'Dido': 1, 'Bendemeer': 1, 'Chew': 1, 'Balmeg': 1, 'Irwell': 1, 'George': 1, 'Selaseh': 1, 'Gelyang': 1, 'Khiang': 1, 'Wee': 1, 'Causeway': 1, 'Gate': 1, 'Sampan': 1, 'Minaret': 1, "Ma'mor": 1, 'Serimpi': 1, 'Adis': 1, 'Jellicoe': 1, 'H': 1, 'Ahmad': 1, 'Puntong': 1, 'Magazine': 1, 'Turl': 1, 'Villa': 1, "Michael's": 1, 'Swettenham': 1, 'Alley': 1, 'Straits': 1, 'Keong': 1, 'Leuchars': 1, 'Chepstow': 1, 'Jentera': 1, 'Tavistock': 1, 'Helena': 1, 'Parbury': 1, 'Tham': 1, 'Mistri': 1, 'Northumberland': 1, '27': 1, 'Carver': 1, 'Caseen': 1, 'Parkstone': 1, 'Larut': 1, 'Bangsawan': 1, 'Silva': 1, '60': 1, 'Gemmill': 1, 'Chartwell': 1, 'Chencharu': 1, "Andrew's": 1, 'Harlyn': 1, 'Bahasa': 1, '6C': 1, '6D': 1, '6F': 1, 'Kuak': 1, 'Perak': 1, 'Attap': 1, 'Sudan': 1, 'Kalidasa': 1, 'Merbok': 1, 'Charn': 1, 'Brickland': 1, 'Expo': 1, 'Camphor': 1, 'Rebana': 1, 'Butik': 1, 'Aida': 1, 'Lagos': 1, 'Noordin': 1, 'Bahar': 1, 'Woodland': 1, 'Baboo': 1, 'Liang': 1, 'Chiltern': 1, 'Chitty': 1, 'Sandwich': 1, 'Kledek': 1, 'Rochalie': 1, 'Crane': 1, 'Minggu': 1, 'Malu-Malu': 1, 'Sindor': 1, 'Drake': 1, 'Mulia': 1, 'Grande': 1, 'Upp': 1, 'Pin': 1, 'Artillery': 1, 'Calshot': 1, 'Jendela': 1, 'Nangka': 1, 'Low': 1, 
'Moonbeam': 1, 'Nee': 1, 'Kuo': 1, 'Cedarwood': 1, 'Mesra': 1, 'Muhibbah': 1, 'Zehnder': 1, 'Anggerek': 1, "Martin's": 1, 'Ruby': 1, 'Tepong': 1, 'Ripley': 1, 'Kemuning': 1, 'Pahang': 1, 'Hyderabad': 1, 'Lateh': 1, 'Rajawali': 1, 'Norma': 1, 'Av.': 1, 'Basah': 1, 'Azam': 1, 'Tinggi': 1, 'J': 1, 'Resak': 1, 'Canterbury': 1, 'Derbyshire': 1, 'Sembilang': 1, 'Ampang': 1, 'House': 1, 'Sikudangan': 1, 'kukoh': 1, 'Gaharu': 1, 'Ava': 1, 'Boat': 1, 'Jitong': 1, 'Shaer': 1, 'Khayyam': 1, 'Verdun': 1, 'Hoy': 1, 'Lam': 1, 'Lan': 1, 'Niven': 1, 'Chua': 1, 'Read': 1, 'Huddington': 1, 'Lady': 1, 'Tambur': 1, 'Towner': 1, 'Plymouth': 1, 'Utara': 1, 'Rowell': 1, 'Aliwal': 1, 'Canada': 1, 'Station': 1, 'Rochdale': 1, 'Kenanga': 1, 'Ean': 1, 'Boscombe': 1, 'Nallur': 1, 'E5': 1, 'Kling': 1, 'Pernama': 1, 'Wilby': 1, 'Ismail': 1, 'Saint': 1, 'Willow': 1, 'Youngberg': 1, 'Unak': 1, 'Pagar': 1, 'Marigold': 1, 'Gin': 1, 'Safari': 1, 'Short': 1, 'K': 1, 'Amoy': 1, 'Leo': 1, 'Ipoh': 1, 'Refinery': 1, 'Ascot': 1, 'Iqbal': 1, 'Joan': 1, 'Clarke': 1, 'Tow': 1, 'Canton': 1, 'Mastuli': 1, 'Unggas': 1, 'Rengas': 1, 'Chiang': 1, 'Merlimau': 1, 'Greenfield': 1, 'Plaza': 1, 'Birdcage': 1, 'Maidstone': 1, 'Neythal': 1, 'Bloxhome': 1, 'Chermin': 1, 'Naung': 1, 'Dunsfold': 1, 'Maida': 1, 'Glasgow': 1, 'Parliament': 1, 'Langsat': 1, 'Tye': 1, 'Carpenter': 1, 'Dalkeith': 1, 'Birch': 1, 'Geneng': 1, 'Tempua': 1, 'Broadrick': 1, 'Lermit': 1, 'Tapisan': 1, 'Woodsville': 1, 'Laut': 1, 'Fishery': 1, 'Ridgewood': 1, 'Sankam': 1, 'Soong': 1, '26': 1, 'Pintau': 1, 'Buyong': 1, '28': 1, '29': 1, 'Wareham': 1, 'Pelatok': 1, 'Kapor': 1, 'L': 1, 'Bournemouth': 1, '2A': 1, 'Lepas': 1, 'Sendudok': 1, 'Membina': 1, 'Garlick': 1, 'Liak': 1, 'Lambeth': 1, 'Oldham': 1, 'Cavan': 1, 'Halton': 1, 'Tunggal': 1, 'Meranti': 1, 'Raglan': 1, 'Mui': 1, 'Ross': 1, 'Nemesu': 1, 'Collyer': 1, 'Ishak': 1, 'E4': 1, 'E7': 1, 'E6': 1, 'E1': 1, 'E3': 1, 'Town': 1, 'Tuah': 1, 'Tuan': 1, 'Gangsa': 1, 'Medway': 1, 'Mall': 1, 'Maude': 1, 
'Chang': 1, 'Ampat': 1, 'Pasoh': 1, 'Evans': 1, 'Ee': 1, 'Hartley': 1, 'Buloh': 1, 'Pasu': 1, 'Kapal': 1, 'Elok': 1, 'One-North': 1, 'Benaan': 1, 'Bahtera': 1, 'Joran': 1, 'Pleasant': 1, 'Everton': 1, 'Enggor': 1, '18A': 1, 'Chelagi': 1, 'Ewart': 1, 'Ottawa': 1, 'Venture': 1, 'Parkway': 1, 'Tessensohn': 1, 'Greenmead': 1, 'Cecil': 1, 'Maryland': 1, '7A': 1, 'Jermin': 1, 'Purmei': 1, 'Kit': 1, 'Arab': 1, 'Im': 1, 'Bussorah': 1, 'Remis': 1, 'M': 1, 'Tho': 1, '76': 1, '74': 1, '70': 1, 'The': 1, 'Jago': 1, 'Moulmein': 1, 'Angsana': 1, 'Loke': 1, 'Robinson': 1, 'Kempas': 1, 'Ham': 1, 'Norris': 1, 'Pokok': 1, 'Hiang': 1, 'Gombak': 1, 'Hay': 1, 'Stockport': 1, 'Henderson': 1, 'Laba': 1, 'Zamrud': 1, 'French': 1, 'Kathi': 1, 'Lasia': 1, 'Wilfred': 1, 'Cosford': 1, 'Labu': 1, 'Kembang': 1, 'Bedford': 1, 'Chettiar': 1, 'Kelabu': 1, 'Eaton': 1, 'Kung': 1, 'Laurelwood': 1, 'Pereira': 1, 'Nuri': 1, 'Westlake': 1, 'Pacheli': 1, 'Kandahar': 1, 'Morse': 1, 'Create': 1, 'Simpang': 1, 'Foo': 1, 'Rose': 1, 'Blair': 1, 'Colchester': 1, 'Butterworth': 1, 'Choon': 1, 'Bugis': 1, 'Pudding': 1, 'Pheng': 1, 'Leong': 1, 'Neil': 1, 'Marang': 1, 'Hari': 1, 'Yen': 1, 'Dawson': 1, 'Yew': 1, 'Wah': 1, 'Tekukor': 1, 'Bumbong': 1, 'Gibraltar': 1, 'Penshurst': 1, 'Wat': 1, 'Serengam': 1, 'Robey': 1, 'Margaret': 1, 'Students': 1, 'Ibrahim': 1, 'Jintan': 1, 'Cedar': 1, 'Salleh': 1, 'Edgware': 1, 'N': 1, 'Mayfield': 1, 'Connaught': 1, 'Tukang': 1, 'Yoong': 1, 'Balli': 1, 'Sotong': 1, 'Tekad': 1, 'Tahar': 1, 'Mornington': 1, 'Kikis': 1, 'Tekong': 1, 'Angin': 1, 'Church': 1, 'Pebble': 1, 'Goodlink': 1, 'Cavenagh': 1, 'Quee': 1, 'Primrose': 1, 'Tumpu': 1, 'Binja': 1, 'Tanggam': 1, 'Sheng': 1, 'Padang': 1, 'Mashhor': 1, 'Rahmat': 1, 'Asuhan': 1, 'Phillip': 1, 'Hajijah': 1, 'Bernam': 1, 'Tenaga': 1, 'Samak': 1, 'Bakar': 1, 'Dafne': 1, 'Cheviot': 1, 'Kelempong': 1, 'Kebun': 1, 'Sampurna': 1, 'Somapah': 1, 'Miller': 1, 'Nipis': 1, 'Kolam': 1, 'Gembira': 1, 'Quality': 1, 'Segam': 1, 'Keruing': 1, 'Kemaman': 
1, 'Segar': 1, 'Pinang': 1, 'Handy': 1, 'Mega': 1, 'Chiku': 1, 'Opal': 1, 'Kellock': 1, 'Binkiang': 1, 'Pakistan': 1, 'Chander': 1, 'Bishopsgate': 1, 'Little': 1, 'Barker': 1, 'Wakaff': 1, 'Teliti': 1, 'Melrose': 1, 'Ridout': 1, 'Setangkai': 1, 'Aviation': 1, 'Beringin': 1, 'Riviera': 1, 'Penchalak': 1, 'Bow': 1, 'Boh': 1, 'Boo': 1, 'Asas': 1, 'Ernani': 1, 'Pong': 1, 'Punai': 1, 'Sealand': 1, 'Shaw': 1, 'Elm': 1, 'Hertford': 1, 'Shan': 1, 'Queensway': 1, 'Anthony': 1, 'Merpati': 1, 'Ettrick': 1, 'link': 1, 'Lakum': 1, 'Office': 1, 'Preston': 1, 'Murray': 1, 'Wolskel': 1, 'Gapis': 1, 'Jelutong': 1, 'Mandalay': 1, 'MacKerrow': 1, 'Puaka': 1, 'Tu': 1, 'Hampshire': 1, 'Terminal': 1, 'Redwood': 1, 'Singa': 1, 'Hitam': 1, 'Mar': 1, 'Mat': 1, 'Harding': 1, 'Promenade': 1, 'Mulberry': 1, 'Andrews': 1, 'Chiselhurst': 1, 'Kelawar': 1, 'Gosport': 1, 'Borthwick': 1, 'Onan': 1, 'Makepeace': 1, 'An': 1, 'Stamford': 1, 'Shelford': 1, 'Village': 1, 'Cardiff': 1, 'Rotan': 1, 'Alwi': 1, 'Harper': 1, 'Cooling': 1, 'Libra': 1, 'Melor': 1, 'Ellington': 1, 'Nguan': 1, 'Rifle': 1, 'Deal': 1, 'Walmer': 1, 'Rendang': 1, 'Sundrige': 1, 'Arumugam': 1, 'Murai': 1, 'Sime': 1, 'Isnin': 1, 'Sentosa': 1, 'Winchester': 1, 'Howard': 1, 'Greendale': 1, 'Rumah': 1, 'Ganges': 1, 'Sesuai': 1, 'Sinaran': 1, 'Somerset': 1, 'Turut': 1, 'Berseh': 1, 'Brockhampton': 1, 'Hokien': 1, 'Chye': 1, 'Ling': 1, 'Yuk': 1, 'Hussein': 1, '39': 1, '38': 1, '30': 1, '37': 1, '36': 1, '35': 1, 'Kheam': 1, 'Greenwich': 1, 'Manila': 1, 'Songket': 1, 'Branksome': 1, 'Fu': 1, 'Asap': 1, 'Sandilands': 1, 'Hun': 1, 'Nakhoda': 1, 'Asam': 1, 'MacAlister': 1, 'Choe': 1, 'Arang': 1, 'Flanders': 1, 'Coldstream': 1, "Pearl's": 1, 'Esplanade': 1, 'Carpark': 1, 'Pergam': 1, 'Tupai': 1, 'Townshend': 1, 'Bassein': 1, 'Carpmael': 1, 'Senyum': 1, 'Nathan': 1, 'Jarak': 1, 'Horne': 1, 'Ong': 1, 'One': 1, 'Middle': 1, 'Nilam': 1, 'Dunbar': 1, 'Smith': 1, 'Remaja': 1, 'Phillips': 1, 'Cotswold': 1, 'Warwick': 1, 'Selegie': 1, 'Bird': 1, 
'Wallace': 1, 'Rambutan': 1, 'Meyappa': 1, 'Hiboran': 1, 'Brooke': 1, 'Pepys': 1, 'Adat': 1, 'Hendon': 1, 'Payah': 1, 'Bridport': 1, 'Crichton': 1, 'Devonshire': 1, 'San': 1, 'Talma': 1, 'Yarrow': 1, 'Sarhad': 1, 'Semerbak': 1, 'Wallich': 1, 'Fidelio': 1, 'Zubir': 1, 'Treasure': 1, 'Legundi': 1, 'Biomedical': 1, 'Kreta': 1, 'Anamalai': 1, '84': 1, '85': 1, 'Junction': 1, 'China': 1, 'Tasmania': 1, 'Urai': 1, 'Girang': 1, 'Brizay': 1, 'Buffalo': 1, 'Durban': 1, 'Tebing': 1, 'Owen': 1, 'Saiboo': 1, 'McNally': 1, 'Palawan': 1, 'Burnfoot': 1, 'Samarinda': 1, 'Napier': 1, 'Asimont': 1, 'Lincoln': 1, 'Lebat': 1, 'Datoh': 1, 'Suasa': 1, 'Pesari': 1, 'Twenty-Fourth': 1, 'Hastings': 1, 'Limu': 1, 'Lima': 1, 'Alps': 1, 'Antoi': 1, 'Goodman': 1, 'Balam': 1, 'Pah': 1, 'Gilstead': 1, 'Dock': 1, 'Tram': 1, 'Uji': 1, 'Gelam': 1, 'Lock': 1, 'Slim': 1, 'Permata': 1, 'High': 1, 'Mohamad': 1, 'Margoliouth': 1, 'Lange': 1, 'Metropole': 1, 'Irrawaddy': 1, 'Mata': 1, 'Wilkinson': 1, 'Bhamo': 1, 'Walton': 1, 'Mon': 1, 'Cottage': 1, 'Serin': 1, 'Thrift': 1, 'Desker': 1, 'Asrama': 1, 'Indus': 1, 'Paras': 1, 'Harum': 1, 'Marican': 1, 'Camden': 1, 'Trevose': 1, 'Seni': 1, 'Siap': 1, 'Goodwood': 1, 'Unity': 1, 'Sian': 1, 'Siam': 1, 'Po': 1, 'Linden': 1, 'Hindoo': 1, 'Terubok': 1, 'Tamban': 1, 'Bahagia': 1, 'Salang': 1, 'Siok': 1, 'Vaughan': 1, 'Suka': 1, 'Hee': 1, 'Khamis': 1, 'Lily': 1, 'Temple': 1, 'Pending': 1, 'Toronto': 1, 'Mutiara': 1, 'Cheow': 1, 'Yong': 1, 'Kemboja': 1, 'Warna': 1, 'Ying': 1, 'Cyprus': 1, 'Klapa': 1, 'Inglewood': 1, 'Perindu': 1, 'Shanghai': 1, 'Chegar': 1, 'Pelangi': 1, 'Serunai': 1, 'Greenbank': 1, 'Kovan': 1, 'Lembu': 1, 'Coral': 1, 'Jelita': 1, 'Cuff': 1, 'Merak': 1, "John's": 1, 'Dryburgh': 1, 'Evelyn': 1, 'Selatan': 1, 'Chun': 1, 'Seventeenth': 1, 'Flower': 1, 'Bonham': 1, 'Erskine': 1, 'Mydin': 1, 'Mapletree': 1, 'Seagull': 1, 'Lutheran': 1, 'Martaban': 1, 'Ramsgate': 1, 'Bo': 1, 'Fan': 1, 'Rengkam': 1, 'Dorset': 1, 'Institution': 1, 'Greenwood': 1, 'Summer': 
1, 'Sallim': 1, 'Nelson': 1, 'Keria': 1, 'Jamaica': 1, 'Kenarah': 1, 'Shrewsbury': 1, 'Naga': 1, 'Rabu': 1, 'Rasok': 1, 'Wishart': 1, '110': 1, 'Cowdray': 1, 'Galistan': 1, 'Ketumbit': 1, 'Sarina': 1, 'Sentul': 1, 'Dulang': 1, 'Jebat': 1, 'Eden': 1, 'Baker': 1, 'Range': 1, 'Kenya': 1, 'Bingka': 1, 'Margate': 1, 'Clarence': 1, 'Senja': 1, 'Swanage': 1, 'Lowland': 1, 'Pulasan': 1, 'Jedburgh': 1, 'Kemunchup': 1, 'Dunlop': 1, 'Bali': 1, 'Ama': 1, 'Selanting': 1, 'Ikan': 1, 'Champions': 1, 'Rusuk': 1, 'Abingdon': 1, 'Poole': 1, 'Chengkek': 1, 'Highland': 1, 'Vigilante': 1, 'Kallang-Paya': 1, 'Minto': 1, 'Berkshire': 1, 'Nira': 1, 'Thoma': 1, 'Electronics': 1, 'Dunkirk': 1, 'McCallum': 1, 'Marzuki': 1, 'Chermat': 1, 'Platina': 1, 'Lichfield': 1, '45': 1, 'Dickenson': 1, 'Bangket': 1, 'Fatt': 1, 'Berwick': 1, 'Istimewa': 1, 'Chermai': 1, 'Maria': 1, 'Harom': 1, 'Tembusu': 1, 'Belangkas': 1, 'Demak': 1, 'Route': 1, 'Daffodil': 1, 'Pillai': 1, 'Hikayat': 1, 'Siantan': 1})
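The full Counter is long to scan by eye. As a minimal sketch, `Counter.most_common` returns only the most frequent words, which is usually enough to spot candidate road tags and modifiers. The `sample_names` list here is a hypothetical stand-in for `df.final_name`:

```python
from collections import Counter

# Hypothetical sample; in the notebook this would be df.final_name
sample_names = [
    "Orchard Road",
    "Hougang Avenue 1",
    "Upper Serangoon Crescent",
    "Jalan Besar",
    "Serangoon Lane",
    "Scotts Road",
    "Newton Road",
]

# Count every whitespace-separated word across all names in one pass
word_counts = Counter(word for name in sample_names for word in name.split())

# most_common(n) returns the n most frequent (word, count) pairs, highest first
top_words = word_counts.most_common(2)
print(top_words)
```

On the real data, something like `c.most_common(50)` would show the head of the distribution without the long tail of words that occur once.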
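# matching is case-sensitive, so spellings must match the data exactly
# (the frequency table has 'Twenty-Fourth', with a capital F)
modifiers = ["North", "South", "East", "West", "Central", "Upper", "Lower", "Old", "New",
             "First", "Second", "Third", "Fourth", "Fifth", "Sixth", "Seventh",
             "Eighth", "Ninth", "Tenth", "Seventeenth", "Twenty-Fourth"]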
def remove_modifiers(words):
    """
    Removes modifiers such as numbers "First ?? Avenue"
    and directions/descriptors "?? Avenue North".
    The first word is always kept, even if it is a modifier.
    """
    # remove numbers (e.g. "1", "27A"), codes like "E5", and single letters
    words = [word for word in words if not word[0].isdigit()
             and not (len(word) > 1 and word[1].isdigit())
             and not len(word) == 1]
    # remove "North/South/East/West/Central/Upper/etc" unless it is the first word
    words = [word for i, word in enumerate(words) if i == 0 or word not in modifiers]
    return words
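The membership test against `modifiers` is an exact, case-sensitive comparison, so an entry whose capitalisation differs from the data will silently never match. A minimal sanity check, assuming a word-frequency Counter like the one built earlier; `vocab` and `candidates` are hypothetical miniatures, not the real data:

```python
from collections import Counter

# Hypothetical miniature of the word-frequency Counter built above
vocab = Counter({"North": 49, "Seventeenth": 1, "Twenty-Fourth": 1, "Road": 831})

# Hypothetical candidate modifiers, one with mismatched capitalisation
candidates = ["North", "Seventeenth", "Twenty-fourth", "Tenth"]

# Exact-match misses: these entries would never be removed from any road name
missing = [m for m in candidates if m not in vocab]

# Of those, which differ from a real word only by letter case?
lowercase_vocab = {word.lower(): word for word in vocab}
case_mismatches = {m: lowercase_vocab[m.lower()]
                   for m in missing if m.lower() in lowercase_vocab}

print(missing)
print(case_mismatches)
```

Entries that are merely absent from the data (like "Tenth" here) are harmless; the case mismatches are the ones worth correcting.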
road_tags = ["Road", "Avenue", "Street", "Drive", "Lane",
"Crescent", "Walk", "Park", "Terrace", "Close", "Link",
"Place", "Way", "Grove", "Rise", "View", "Hill", "Estate",
"Farmway", "Green", "Garden", "Gardens", "Junction", "Boulevard",
"Central", "Circle", "Court", "Loop", "Track", "Square",
"Heights", "Village", "Promenade", "Vale", "Vista",
"Sector", "Circus", "Bridge", "Gate", "Valley", "Turn",
"Interchange", "Plaza", "Little", "Mount", "Highway", "Quay",
"Mall", "Bank", "Plain", "Beach", "Height", "Wood", "Ring",
"Ridge", "Island", "Ind", "Industrial", "Terminal", "Coast",
"Centre", "Northview", "Reservoir", "Alley", "Plains", "Parkway", "Viaduct",
"Expressway", "Tunnel", "Bow", "Concourse", "Grande", "Field", "Route",
"link", "Cresent", "Rd", "St", "AVe", "Cres", "Av.", "Carpark"] # residual typos and abbrevs
malay_prefix_tags = ["Jalan", "Lorong", "Bukit", "Lengkok", "Taman", "Kampong", "Lengkong"]
def split_tag(words):
    """
    Splits road into tuple of name, tag, and an indicator of whether the road tag is Malay
    """
    # split road tag from the actual name
    tags = list()
    # occasionally there may be >1 tag e.g. "Ring Road", so repeat until we're down to the name
    while len(words) >= 2 and words[-1] in road_tags:
        tags.append(words.pop())  # wrong order, we'll reverse them later
    if tags:
        return (' '.join(words), ' '.join(reversed(tags)), 0)  # remember to reverse!
    # the above assumed the road tags would be at the end -
    # in case it's Malay and the road tags are at the beginning,
    # or it contains no road tag at all:
    else:
        if words[0] in malay_prefix_tags:
            return (' '.join(words[1:]), words[0], 1)
        else:
            return (' '.join(words), '', 0)
def remove_residual_modifiers(data):
    """
    Final round of modifier removal, ensuring that at least one word is left.
    e.g. the name "North" (from "North Road") keeps its modifier,
    but "Upper Thomson" (from "Upper Thomson Road") becomes "Thomson".
    """
    # remove any remaining "North/South/East/West/Central/Upper/etc"
    # from multi-word names, as long as something is still left afterwards
    road_name, road_tag, has_malay_road_tag = data
    road_name_words = road_name.split()
    if len(road_name_words) > 1:
        words = [word for word in road_name_words if word not in modifiers]
        if words:
            return (' '.join(words), road_tag, has_malay_road_tag)
    return data
def process_roadname(roadname):
    """
    Perform 3 steps of cleaning: initial modifier removal,
    tag splitting, removal of residual modifiers, and return the tuple
    (road_name, road_tag, has_malay_road_tag)
    """
    return remove_residual_modifiers(split_tag(remove_modifiers(roadname.split())))
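To see how the three steps interact, here is a condensed, self-contained restatement of the pipeline run on a few names. It is a sketch only: the tag and modifier sets are trimmed for illustration, `sketch_process` is a hypothetical name, and the empty-name guard in step 3 is an assumption of this sketch.

```python
# Trimmed-down sets for illustration only; the notebook's full lists are longer
MODIFIERS = {"North", "South", "East", "West", "Central", "Upper", "Lower", "Old", "New"}
ROAD_TAGS = {"Road", "Avenue", "Street", "Crescent", "Valley", "Lane"}
MALAY_PREFIX_TAGS = {"Jalan", "Lorong", "Bukit"}


def sketch_process(roadname):
    """Condensed restatement of the three cleaning steps for one road name."""
    words = roadname.split()

    # Step 1: drop numbers, single letters, and non-initial modifiers
    words = [w for w in words
             if not w[0].isdigit()
             and not (len(w) > 1 and w[1].isdigit())
             and len(w) != 1]
    words = [w for i, w in enumerate(words) if i == 0 or w not in MODIFIERS]

    # Step 2: peel trailing road tags, always leaving at least one word as the name
    tags = []
    while len(words) >= 2 and words[-1] in ROAD_TAGS:
        tags.append(words.pop())
    if tags:
        name, tag, is_malay = " ".join(words), " ".join(reversed(tags)), 0
    elif words[0] in MALAY_PREFIX_TAGS:
        name, tag, is_malay = " ".join(words[1:]), words[0], 1
    else:
        name, tag, is_malay = " ".join(words), "", 0

    # Step 3: drop leftover modifiers from multi-word names
    # (guard against emptying the name; an assumption of this sketch)
    name_words = name.split()
    if len(name_words) > 1:
        kept = [w for w in name_words if w not in MODIFIERS]
        if kept:
            name = " ".join(kept)
    return (name, tag, is_malay)


for road in ["Hougang Avenue 1", "River Valley Road",
             "Upper Serangoon Crescent", "Jalan Besar"]:
    print(road, "->", sketch_process(road))
```

Note that "River Valley Road" splits into the name "River" and the tag "Valley Road", because "Valley" is itself in the tag list and tags are peeled off until a single word remains.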
# put the results into a dataframe
split_roads = pd.DataFrame([process_roadname(road) for road in df['final_name'].values])
# rename columns
split_roads.columns = ['road_name', 'road_tag', 'has_malay_road_tag']
# which we then concatenate with the rest (there are other ways to do this too)
final = pd.concat([df, split_roads], axis=1)
final
| | name | final_name | road_name | road_tag | has_malay_road_tag |
|---|---|---|---|---|---|
0 | Orchard Road | Orchard Road | Orchard | Road | 0 |
1 | Hougang Avenue 1 | Hougang Avenue 1 | Hougang | Avenue | 0 |
2 | Scotts Road | Scotts Road | Scotts | Road | 0 |
3 | Keng Lee Road | Keng Lee Road | Keng Lee | Road | 0 |
4 | Newton Road | Newton Road | Newton | Road | 0 |
5 | Sarkies Road | Sarkies Road | Sarkies | Road | 0 |
6 | Patterson Road | Paterson Road | Paterson | Road | 0 |
7 | Orchard Boulevard | Orchard Boulevard | Orchard | Boulevard | 0 |
8 | Grange Road | Grange Road | Grange | Road | 0 |
9 | Paterson Hill | Paterson Hill | Paterson | Hill | 0 |
10 | River Valley Road | River Valley Road | River | Valley Road | 0 |
11 | Unity Street | Unity Street | Unity | Street | 0 |
12 | Merbau Road | Merbau Road | Merbau | Road | 0 |
13 | Mohamed Sultan Road | Mohamed Sultan Road | Mohamed Sultan | Road | 0 |
14 | Saiboo Street | Saiboo Street | Saiboo | Street | 0 |
15 | Merchant Loop | Merchant Loop | Merchant | Loop | 0 |
16 | Clemenceau Avenue | Clemenceau Avenue | Clemenceau | Avenue | 0 |
17 | Merchant Road | Merchant Road | Merchant | Road | 0 |
18 | Read Cresent | Read Crescent | Read | Crescent | 0 |
19 | Tampines Expressway | Tampines Expressway | Tampines | Expressway | 0 |
20 | Seletar Expressway | Seletar Expressway | Seletar | Expressway | 0 |
21 | Central Expressway | Central Expressway | Central | Expressway | 0 |
22 | Telok Blangah Road | Telok Blangah Road | Telok Blangah | Road | 0 |
23 | Ayer Rajah Expressway | Ayer Rajah Expressway | Ayer Rajah | Expressway | 0 |
24 | Turf Club Avenue | Turf Club Avenue | Turf Club | Avenue | 0 |
25 | Kranji Expressway | Kranji Expressway | Kranji | Expressway | 0 |
26 | Prinsep Street | Prinsep Street | Prinsep | Street | 0 |
27 | Tanglin Road | Tanglin Road | Tanglin | Road | 0 |
28 | Alexandra Road | Alexandra Road | Alexandra | Road | 0 |
29 | Nicoll Highway | Nicoll Highway | Nicoll | Highway | 0 |
... | ... | ... | ... | ... | ... |
3404 | Seletar North Link | Seletar North Link | Seletar | Link | 0 |
3405 | Ghim Moh Link | Ghim Moh Link | Ghim Moh | Link | 0 |
3406 | Hougang Street 31 | Hougang Street 31 | Hougang | Street | 0 |
3407 | Hougang Street 32 | Hougang Street 32 | Hougang | Street | 0 |
3408 | Serangoon Lane | Serangoon Lane | Serangoon | Lane | 0 |
3409 | Gambir Walk | Gambir Walk | Gambir | Walk | 0 |
3410 | Upper Serangoon Crescent | Upper Serangoon Crescent | Serangoon | Crescent | 0 |
3411 | Ubi Close | Ubi Close | Ubi | Close | 0 |
3412 | Sin Ming Lane | Sin Ming Lane | Sin Ming | Lane | 0 |
3413 | Compassvale Lane | Compassvale Lane | Compassvale | Lane | 0 |
3414 | Lorong 5 Realty Park | Lorong 5 Realty Park | Lorong Realty | Park | 0 |
3415 | Wee Nam Road | Wee Nam Road | Wee Nam | Road | 0 |
3416 | Tampines Street 72 | Tampines Street 72 | Tampines | Street | 0 |
3417 | Changi South Lane | Changi South Lane | Changi | Lane | 0 |
3418 | Telegraph Street | Telegraph Street | Telegraph | Street | 0 |
3419 | Biopolis Street | Biopolis Street | Biopolis | Street | 0 |
3420 | Biopolis Link | Biopolis Link | Biopolis | Link | 0 |
3421 | Plymouth Avenue | Plymouth Avenue | Plymouth | Avenue | 0 |
3422 | Gentle Road | Gentle Road | Gentle | Road | 0 |
3423 | Leicester Road | Leicester Road | Leicester | Road | 0 |
3424 | Simon Walk | Simon Walk | Simon | Walk | 0 |
3425 | Joo Hong Road | Joo Hong Road | Joo Hong | Road | 0 |
3426 | Florence Close | Florence Close | Florence | Close | 0 |
3427 | Hoot Kiam Road | Hoot Kiam Road | Hoot Kiam | Road | 0 |
3428 | Yishun Avenue 8 | Yishun Avenue 8 | Yishun | Avenue | 0 |
3429 | Choa Chu Kang Avenue 6 | Choa Chu Kang Avenue 6 | Choa Chu Kang | Avenue | 0 |
3430 | Clarke Quay | Clarke Quay | Clarke | Quay | 0 |
3431 | Countryside Walk | Countryside Walk | Countryside | Walk | 0 |
3432 | PIE | Pan-Island Expressway | Pan-Island | Expressway | 0 |
3433 | Nepal Park | Nepal Park | Nepal | Park | 0 |
3434 rows × 5 columns
final.to_csv("singapore-roadnames-final-split.csv")
We don't need to classify all these roads individually, since there are repeats like "Ang Mo Kio $roadtag$ $n$" for various values of $roadtag$ and $n$. So let's do a groupby to collate the information.
# we're really only interested in the max value of has_malay_road_tag
# as we're not using the rest of the info, so aggregate just that column
gb = final.groupby('road_name')['has_malay_road_tag'].max()
# flatten the groupby into a regular df with the only columns we really want
gb2 = gb.reset_index()
gb2
| | road_name | has_malay_road_tag |
|---|---|---|
0 | Abingdon | 0 |
1 | Abu Talib | 1 |
2 | Adam | 0 |
3 | Adat | 1 |
4 | Adis | 0 |
5 | Admiralty | 0 |
6 | Ah Hood | 0 |
7 | Ah Soo | 1 |
8 | Ahmad Ibrahim | 1 |
9 | Aida | 0 |
10 | Airport | 0 |
11 | Alexandra | 0 |
12 | Aliwal | 0 |
13 | Aljunied | 0 |
14 | Allanbrooke | 0 |
15 | Allenby | 0 |
16 | Almond | 0 |
17 | Alnwick | 0 |
18 | Alps | 0 |
19 | Ama Keng | 0 |
20 | Amber | 0 |
21 | Amoy | 0 |
22 | Ampang | 1 |
23 | Ampas | 1 |
24 | Ampat | 1 |
25 | Anak Bukit | 1 |
26 | Anak Patong | 1 |
27 | Anamalai | 0 |
28 | Anchorvale | 0 |
29 | Anderson | 0 |
... | ... | ... |
1721 | Woodgrove | 0 |
1722 | Woodland | 0 |
1723 | Woodlands | 0 |
1724 | Woodleigh | 0 |
1725 | Woodsville | 0 |
1726 | Woollerton | 0 |
1727 | Worthing | 0 |
1728 | Xilin | 0 |
1729 | Yan Kit | 0 |
1730 | Yarrow | 0 |
1731 | Yarwood | 0 |
1732 | Yasin | 1 |
1733 | Yio Chu Kang | 0 |
1734 | Yishun | 0 |
1735 | Yong Siak | 0 |
1736 | York | 0 |
1737 | Youngberg | 0 |
1738 | Yuan Ching | 0 |
1739 | Yuk Tong | 0 |
1740 | Yung An | 0 |
1741 | Yung Ho | 0 |
1742 | Yung Kuang | 0 |
1743 | Yung Sheng | 0 |
1744 | Yunnan | 0 |
1745 | Zamrud | 1 |
1746 | Zehnder | 0 |
1747 | Zion | 0 |
1748 | Zubir Said | 0 |
1749 | kukoh | 1 |
1750 | one-north Gateway | 0 |
1751 rows × 2 columns
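To see what the groupby is doing on a small scale, here's a toy sketch. The `demo` dataframe is made up for illustration ("Besar Road" in particular is a hypothetical entry), but the column names match the ones above. Taking the max of `has_malay_road_tag` means a road name is flagged as soon as any of its variants had a Malay road tag.

```python
import pandas as pd

# Toy data for illustration; not taken from the real CSV
demo = pd.DataFrame({
    "final_name": ["Ang Mo Kio Avenue 1", "Ang Mo Kio Street 11",
                   "Jalan Besar", "Besar Road"],
    "road_name": ["Ang Mo Kio", "Ang Mo Kio", "Besar", "Besar"],
    "has_malay_road_tag": [0, 0, 1, 0],
})

# one row per distinct road_name, flagged 1 if any variant was flagged
collapsed = (
    demo.groupby("road_name")["has_malay_road_tag"]
        .max()
        .reset_index()
)
print(collapsed)
```

The four full names collapse to two rows: "Ang Mo Kio" stays at 0, and "Besar" becomes 1 because one of its two variants carried the "Jalan" tag.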
This is a tad backwards, but since I created these notebooks to illustrate the process after doing the classification, I'm going to go ahead and put the "gold standard" classifications into the dataframe. We'll use them both as training data and as the reference against which we compare the classifier's predictions.
gold = pd.read_csv('final-classification.csv')
# strip out whitespace
gold.road_name = gold.road_name.str.strip(' ')
gold
| | road_name | classification | comment |
|---|---|---|---|
0 | Abingdon | British | NaN |
1 | Abu Talib | Malay | NaN |
2 | Adam | British | NaN |
3 | Adat | Malay | NaN |
4 | Adis | Other | Indian Jewish |
5 | Admiralty | British | NaN |
6 | Afifi | Malay | NaN |
7 | Ah Hood | Chinese | NaN |
8 | Ah Soo | Chinese | NaN |
9 | Ah Thia | Chinese | NaN |
10 | Ahmad Ibrahim | Malay | NaN |
11 | Aida | Other | NaN |
12 | Airline | Generic | NaN |
13 | Airport | Generic | NaN |
14 | Airport Cargo | Generic | NaN |
15 | Akyab | Other | Burmese |
16 | Albert | British | NaN |
17 | Alexandra | British | NaN |
18 | Aliwal | Indian | Battle of Aliwal in the Indo-Sikh war |
19 | Aljunied | Other | Arab |
20 | Alkaff | Other | Arab |
21 | Allamanda | Malay | NaN |
22 | Allanbrooke | British | NaN |
23 | Allenby | British | NaN |
24 | Almond | Generic | NaN |
25 | Alnwick | British | NaN |
26 | Ama Keng | Chinese | NaN |
27 | Amber | Other | after the Amber Trust fund established for poo... |
28 | Amberwood | British | NaN |
29 | Amoy | Chinese | NaN |
... | ... | ... | ... |
2159 | Sheares | British | NaN |
2160 | Slim Barracks | British | NaN |
2161 | South | Generic | NaN |
2162 | Straits | Generic | NaN |
2163 | Sumang | Malay | NaN |
2164 | Sunview | Generic | NaN |
2165 | Tanah Merah | Malay | NaN |
2166 | Tasmania | British | NaN |
2167 | Tebing | Malay | NaN |
2168 | Tekka | Chinese | NaN |
2169 | Tekong | Malay | NaN |
2170 | Terrace | Generic | NaN |
2171 | Tong Soon | Chinese | NaN |
2172 | Tram Safari | Generic | NaN |
2173 | Treasure | Generic | NaN |
2174 | Tuas Bay | Malay | NaN |
2175 | Tuas View Circuit | Malay | NaN |
2176 | Turl | British | NaN |
2177 | Ubin | Malay | NaN |
2178 | Upp Toh Tuck | Chinese | NaN |
2179 | Upper | Generic | NaN |
2180 | Venture | Generic | NaN |
2181 | Verde | Other | NaN |
2182 | Vista Exchange | Generic | NaN |
2183 | Wak Hassan | Malay | NaN |
2184 | Wat Siam | Other | NaN |
2185 | Wenya | Other | NaN |
2186 | West | Generic | NaN |
2187 | Woodland | Generic | NaN |
2188 | one-north Gateway | Generic | NaN |
2189 rows × 3 columns
gb3 = gb2.merge(gold, on='road_name', how='left')
gb3
# this is the file that we'll actually do classification on
# the rest will be used for merging the actual GeoJSON file with full road names
# and linestring data to the final classification
gb3.to_csv('singapore-roadnames-final-classified.csv')
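Since this is a left merge, any road name without a match in the gold file silently ends up with a NaN classification. One possible sanity check, sketched on made-up data below, is to pass `indicator=True` to `merge` and list the rows marked `left_only`. The dataframes `names` and `gold_demo` and the entry "Zzz Demo" are hypothetical. The trailing space in "Adam " is there to show why stripping whitespace from the key column matters.

```python
import pandas as pd

# Toy stand-ins for gb2 and gold; values are illustrative only
names = pd.DataFrame({
    "road_name": ["Abingdon", "Adam", "Zzz Demo"],
    "has_malay_road_tag": [0, 0, 0],
})
gold_demo = pd.DataFrame({
    "road_name": ["Abingdon", "Adam ", "Afifi"],  # note the trailing space
    "classification": ["British", "British", "Malay"],
})

# without this, "Adam " would not match "Adam"
gold_demo["road_name"] = gold_demo["road_name"].str.strip()

# indicator=True adds a _merge column saying where each row came from
checked = names.merge(gold_demo, on="road_name", how="left", indicator=True)
unmatched = checked.loc[checked["_merge"] == "left_only", "road_name"].tolist()
print(unmatched)
```

Anything that shows up in `unmatched` is a name that still needs a gold-standard label (or has a spelling mismatch between the two files).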