#!/usr/bin/env python
# coding: utf-8

# ## Cleaning the Highscores
#
# A lot of cleaning needs to be done before the highscores can be read into a pandas dataframe, and not all of it can be done programmatically.
#
# I started by taking a copy of the HighScores.vcd file (it may appear as a .dat file), opening it with Notepad (PSPad will open it in the hex editor) and saving it as HighScores.txt. I then manually deleted everything down as far as the first occurrence of ===. The text to delete looks something like: Assembly-CSharpXRL.Core.ScoreboardScores�System.Collections.Generic.List`1[[XRL.Core.ScoreEntry, Assembly-CSharp....
#
# A number of symbols will prevent the file from being read into ipython fully, or will cause trouble when writing the data to file, and these need to be deleted manually as well. The symbols are listed in notes.txt, which should be opened in Notepad, and they can be removed using Edit -> Replace. There is also a circle shape that needs to be removed, which can be hard to find. It is usually the symbol for the bits that make up an artifact, so it may be best to delete everything between a < and a >. If you are going through this code with your own highscores you may have to manually delete more lines or symbols.
#
# After all that the file can finally be read in!

# In[2]:

#This is what the HighScores file now looks like
qud = open("HighScores.txt", "r")
print qud.read()
qud.close()


# There are still a number of symbols that cannot be read. I have found that the following two blocks work in getting rid of these. If anyone has a better solution please suggest it.
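
# A possible alternative to the two blocks below, offered only as a sketch: decode the raw bytes
# as latin-1 (which accepts every byte value) and keep printable ASCII characters only.
# This assumes "unreadable symbols" means anything outside printable ASCII. It is not equivalent
# to the unicode_escape round trip used below, which also interprets backslash escapes.
# The helper name keep_printable_ascii is hypothetical and not part of the original notebook.

# In[ ]:

import string

#every character Python treats as printable ASCII (letters, digits, punctuation, whitespace)
PRINTABLE = set(string.printable)

def keep_printable_ascii(raw_bytes):
    """Decode bytes as latin-1 and drop every character outside printable ASCII."""
    text = raw_bytes.decode("latin-1")
    return "".join(ch for ch in text if ch in PRINTABLE)

#Usage would be along the lines of keep_printable_ascii(open("HighScores.txt", "rb").read()),
#with the result written out to HighScores_clean.txt.
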
# In[41]:

import codecs
qud = codecs.open("HighScores.txt", encoding='latin-1') #open and decode as latin-1


# In[42]:

clean_qud = open("HighScores_clean.txt", "w") #open file to save to
for line in qud:
    line = line.encode('utf8')
    line = line.decode('unicode_escape').encode('ascii','ignore')
    clean_qud.write(line)

clean_qud.close()
qud.close()


# If you open both HighScores.txt and HighScores_clean.txt you will see that a number of the unreadable symbols have been removed and that the clean text is now much more readable.
#
# Next, the real cleaning begins. This will be done in two major steps. First I will completely remove all unreadable text and save the result as a human readable file. Then I will use this cleaned text to create a file which fills in missing values and adds separators between each "column" so it can be read into pandas.
#
# For the first step a list of tags such as &y, &r etc. has to be removed. These tags seem to determine the color of the next word or symbol on the highscores screen and all need to be removed. Also, sometimes a highscore does not contain a description of how the character died. We need to be able to distinguish between a blank line where this description should be and a blank line which occurs between highscores.

# In[3]:

import re

#list of text to remove.
remove_list = ["&W", "&y", "&w", "&r", "&M", "&Y", "&C", "&c", "&b", "&B", "&K", "&R", "&G", "&g", "\r", "\n", "\t"]

cleaned_qud = open("Cleaned_Qud_HighScores.txt", "w") #file to write to
clean_highscores = open("HighScores_clean.txt", "r")

#flag used to determine if there is a blank line in the data instead of a line describing how the character died.
#it records whether we have reached the line that says "Visited x Zones", which is always present and always occurs after the character death description
visited = False

#flag used to determine if this is the first line in the file
first_line = True

for line in clean_highscores.readlines():
    line = line.replace("`", "'").replace("\n", " ").strip() #some lines have a different ' which was causing havoc! Remove all linebreaks, strip away all whitespace
    for remove_word in remove_list:
        line = line.replace(remove_word, "") #go down through all words in the remove list and replace them with ""
    if "===" in line: #check if this is the first line of a highscore (===Game summary for )
        visited = False #set the visited variable to false
        qud_search = re.search(r"=== ((\w*\'*\w*\s*)*) ===", line) #pull out everything between === and ===
        #if this is the first line in the file write as is, otherwise put a \n at the start. Prevents a blank line at the start of the file
        if first_line == True:
            line = str(qud_search.group(1))
            first_line = False
        else:
            line = "\n" + str(qud_search.group(1))
    if len(line) == 0 and visited == False: #if a line is blank and we haven't hit the end of the highscore this is where a death description should be
        line = "blank"
    if "Visited" in line:
        visited = True #set to true to indicate we have passed the death description. Blank lines after this will be stripped out

    #Even after all the cleaning some unwanted symbols were still getting through. The following line works, but is messy. But works. Did I mention it works?...well, it works so far...
    #If we have passed the death description (visited == True) any line that, stripped of ALL spaces (even those between words), is less than 10 characters long can be assumed to be trash that has made it through the cleaning process. Delete.
    if visited == True:
        if len(line.replace(" ", "")) < 10:
            #continue
            print line #print out the line
    cleaned_qud.write(line + "\n") #write the line

cleaned_qud.close()
clean_highscores.close()


# Wow, that was tough and we're still not near Golgotha. The human readable file created above will now be used to create a pandas readable file. This could all have been done in one step, but it is done in two for my sanity, which I was in danger of losing during the above process, and in case a user would rather change the step below to clean the file in a different way.
#
# Now we need to delete a lot of the filler text ("Game summary for ", "x died on the " etc) so that we are left with only categorical (character name, artifact name) or integer values (score, zones).
#
# There is also the issue of uneven highscore descriptions. Some of them contain data that the others do not. If I found a "storied item" (I remember finding a shield called "Stopslavin") then a row "Generated 1 storied items." is added. If I did not find a storied item, this line is not there. The same goes for artifacts. So some scores can have (at least) two more lines than other scores, and a number of flags are used to check for this.
#
# If going through this code using your own highscores data you will more than likely have to make adjustments/additions to the lines determining how the character died.
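
# The padding rule described above can be sketched in isolation, assuming the two optional
# lines ("Generated x storied items" and "Most advanced artifact in possession") are the only
# ones that may be missing. The function name pad_optional_fields and its arguments are
# hypothetical; the cell that follows tracks the same information with the generated and
# artifact flags instead.

# In[ ]:

def pad_optional_fields(storied_items=None, artifact=None):
    """Return the two optional columns, using the defaults when a line was absent."""
    #a missing "Generated" line becomes a count of 0
    items = "0" if storied_items is None else str(storied_items)
    #a missing artifact line becomes the placeholder text used in the cell below
    name = "no artifact" if artifact is None else artifact
    return [items, name]

#Every highscore then ends with exactly two extra fields, whichever lines were present.
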
# In[4]:

import re #left behind as I often started the notebook from this point, content with the cleaning in the above step from earlier

cleaned_qud = open("Cleaned_Qud_HighScores_1.txt", "w")
clean_highscores = open("Cleaned_Qud_HighScores.txt", "r")

first_line = True
name = " "
#flags for checking if storied items or artifacts are present in the highscore
visited = False
generated = False
artifact = False

for line in clean_highscores.readlines():
    line = line.replace("`", "'").replace("\n", " ").replace(".", "").strip()
    if "summary" in line: #If this is the first line of a highscore
        visited = False
        generated = False #set all flags to false
        artifact = False
        line = line.replace("Game summary for ", "") #remove everything but the character's name
        name = line #save the character's name to be used in a later deletion ("name died on ")
        line = line.strip() #strip blank space. This is from an attempt to parse a line where the character name was " "
        #If this is the first line saved to the file add as is, otherwise add a \n to the start
        if first_line == True:
            first_line = False
        else:
            line = "\n"+str(line)
    if "Game ended" in line:
        #Remove "Game ended" and the standalone word "at", leaving behind only the date.
        #A plain replace("at", "") would also eat the "at" inside "Saturday"
        line = re.sub(r"\bat\b", "", line.replace("Game ended", "")).strip()
    if "died on" in line:
        line = line.replace("%s died on the" % name, "").strip() #remove "name died on the " leaving behind only the Game date

    #Code to figure out what caused the player's death
    if " hits (" in line:
        #The chute crab hits (x1) for 2 damage with his crab claw ->7 1d2! [7]
        if "->" in line:
            death_search = re.search(r"((\w*\,?\-?\s*)+) hits \(x(\d*)\) for (\d*) damage with \w{3} ((\w*\,?\-?\s*)+) ->(\d+) (\d*d\d*)!?", line.replace("The", "").replace("bloody", "").strip())
            #name, times hit, damage, weapon, PV, pos damage
            line = str(death_search.group(1)) + "\t" + str(death_search.group(3)) + "\t" + str(death_search.group(4)) + "\t" + str(death_search.group(5)) + "\t" + str(death_search.group(7)) + "\t" + str(death_search.group(8))
        else:
            #Umchuum hits (x2) for 4 damage with his Umumerchacal! [9]
            death_search = re.search(r"((\w*\,?\-?\s*)+) hits \(x(\d*)\) for (\d*) damage with \w{3} ((\w*\,?\-?\s*)+)!?", line.replace("The", "").replace("bloody", "").strip())
            #name, times hit, damage, weapon; there is no PV or pos damage here so both are written as 0
            line = str(death_search.group(1)) + "\t" + str(death_search.group(3)) + "\t" + str(death_search.group(4)) + "\t" + str(death_search.group(5)) + "\t0" + "\t0"
    if "blank" in line:
        line = "unknown\t0\t0\tunknown\t0\t0"

    #lines that contain 'from' are generally short descriptions. A more effective regex could be written at a later time.
    if "from" in line:
        if line.strip() == "from bleeding!":
            line = "bleeding\t0\t0\tbleeding\t0\t0"
        elif line.strip() == "from the scalding steam!":
            line = "scalding steam\t0\t0\tscalding steam\t0\t0"
        elif line.strip() == "from the explosion!":
            line = "explosion\t0\t0\texplosion\t0\t0"
        elif "from the fire started by" in line:
            foe = line.replace("from the fire started by ", "").strip("!").strip()
            line = "%s\t0\t0\tfire\t0\t0" % foe
        elif "'s" in line:
            #from Wahmahcalcalit's lase beam!
            death_search = re.search(r"from ((\w*\,?\-?\s*)*(\w*\,?\-?(\'s){1}\s*)) ((\w*\,?\-?(\'s){0}\s*)+)", line)
            #drop only the trailing 's. strip("'s'") removes any leading or trailing ' and s characters, so it would also eat the first letter of a name starting with s
            foe = re.sub(r"'s\s*$", "", death_search.group(1))
            line = "%s\t0\t0\t%s\t0\t0\t" % (foe, str(death_search.group(5)))
    if line.strip() == "Abandoned all hope":
        line = "quit\t0\t0\tquit\t0\t0"
    if "Scored" in line:
        line = line.replace("Scored", "").replace("points", "").strip() #remove all bar the points figure
    if "Survived " in line:
        line = line.replace("Survived for", "").replace("turns", "").strip() #remove all bar the turns figure
    if "Visited " in line:
        visited = True #set visited flag to true
        line = line.replace("Visited", "").replace("zones", "").replace("zone", "").strip() #remove all bar the zones figure
    if "Generated" in line:
        generated = True #set generated flag to true
        line = line.replace("Generated", "").replace("storied items", "").strip() #remove all bar the storied items figure
    if "Most advanced artifact" in line:
        artifact = True #set artifact flag to true
        gen_check = "" #create a string for checking if a storied items figure exists
        if generated == False:
            gen_check = "0\t"
            generated = True
        #if there is an artifact but no storied item this will read "0\t" + artifactname. If there is a storied item this will be "" + artifactname
        line = gen_check + str(line.replace("Most advanced artifact in possession:", "").strip())
    if len(line) == 0 and visited == True: #if we are on a blank line and we have passed the visited line this will add "0 no artifact" to the end of the line
        if generated == False:
            line = str(line) + "0\t"
        if artifact == False:
            line = str(line) + "no artifact\t"
    print line
    cleaned_qud.write(line + "\t")

cleaned_qud.write("0\tno artifact") #insert into final row
cleaned_qud.close()
clean_highscores.close()


# The text is now cleaned and can be read into a pandas dataframe. The above code works with my current highscores, but a lot of work would be needed to make it compatible with other players' highscores. There are many ways to die in Qud and my parsing only takes into consideration the few ways my characters have died. I am sure there are many ways to break the above code and I would greatly appreciate any suggestions or improvements.
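
# Because a death description that was not parsed leaves a row with a different number of
# fields, a quick width check before loading the file into pandas may help locate such rows.
# This is a minimal sketch, assuming tab separated rows like the ones written above.
# find_ragged_rows is a hypothetical helper and not part of the original notebook. Trailing
# tabs are ignored, so a row whose last field is genuinely empty would be undercounted.

# In[ ]:

from collections import Counter

def find_ragged_rows(lines, sep="\t"):
    """Return (expected_width, [(line_number, width), ...]) for rows whose width differs."""
    widths = []
    for number, line in enumerate(lines, 1):
        stripped = line.rstrip("\r\n")
        if not stripped.strip():
            continue #ignore completely blank lines
        #trailing separators would produce empty fields, so drop them before counting
        fields = stripped.rstrip(sep).split(sep)
        widths.append((number, len(fields)))
    if not widths:
        return 0, []
    #treat the most common width as the expected one
    expected = Counter(width for _, width in widths).most_common(1)[0][0]
    return expected, [(number, width) for number, width in widths if width != expected]

#Usage would be along the lines of:
#expected, ragged = find_ragged_rows(open("Cleaned_Qud_HighScores_1.txt"))
#Each entry in ragged points at a highscore whose death line probably needs its own parsing rule.
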