#!/usr/bin/env python
# coding: utf-8

# ## Cleaning the Highscores
# 
# There is a lot of cleaning that needs to be done before the highscores can be read into a pandas dataframe, not all of which can be done programmatically. 
# 
# I started by taking a copy of the HighScores.vcd file (it may appear as a .dat file), opening it with Notepad (PSPad will open it in the hex editor) and saving it as HighScores.txt. I then manually delete everything down as far as the first occurrence of ===. The deleted header looks something like: Assembly-CSharpXRL.Core.ScoreboardScores�System.Collections.Generic.List`1[[XRL.Core.ScoreEntry, Assembly-CSharp....
# 
# There are a number of symbols which will prevent the file from being read fully into IPython, or will cause trouble when writing the data to file, and these need to be deleted manually as well. These symbols are listed in notes.txt, which should be opened in Notepad, and they can be removed using Edit -> Replace. There is also a circle shape that needs to be removed, which can be hard to find. This is usually the symbol for the bits that make up an artifact, so it may be best to delete everything between a < and a >. If going through this code using your own highscores you may have to manually delete more lines or symbols.
# 
# After all that the file can finally be read in!
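# 
# If deleting the < > spans by hand becomes tedious, the same step could be sketched in code. This is only a sketch: the helper name strip_angle_spans is hypothetical, and it assumes the artifact bits always sit between a < and the next > on the same line, which I have not checked against every highscore file.

```python
import re

# Assumption: an artifact's "bits" always sit between a "<" and the next ">" on one line.
ANGLE_SPAN = re.compile(r"<[^<>\n]*>")


def strip_angle_spans(text):
    """Remove every <...> span, brackets included, from the text."""
    return ANGLE_SPAN.sub("", text)
```

# Calling strip_angle_spans on the whole file contents before saving HighScores.txt would replace the manual deletion, provided the assumption above holds for your file.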

# In[2]:


#This is what the HighScores file now looks like
with open("HighScores.txt", "r") as qud: #the with block closes the file automatically
    print(qud.read())


# There are still a number of symbols that cannot be read. I have found that the following two blocks get rid of them. If anyone has a better solution, please suggest it.

# In[41]:


import codecs
qud = codecs.open("HighScores.txt", encoding='latin-1') #open and encode as latin-1


# In[42]:


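clean_qud = open("HighScores_clean.txt", "w") #open file to save to 

for line in qud: 
    line = line.encode('utf8') #turn the unicode line into utf-8 encoded bytes
    line = line.decode('unicode_escape').encode('ascii','ignore') #decode again, then drop every character that is not ASCII
    clean_qud.write(line)
clean_qud.close()
qud.close() #the input file opened in the previous cell is no longer needed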


# If you open both HighScores.txt and HighScores_clean.txt you will see that a number of the unreadable symbols have been removed and that the clean text is now much more readable.
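# 
# The two blocks above rely on Python 2 string handling. For anyone on Python 3, a rough equivalent might look like the sketch below. The helper names to_ascii and clean_file are hypothetical, and the sketch skips the unicode_escape step, so its output may differ slightly from the file produced above.

```python
def to_ascii(text):
    """Drop every character outside the ASCII range."""
    return text.encode("ascii", "ignore").decode("ascii")


def clean_file(src="HighScores.txt", dst="HighScores_clean.txt"):
    """Copy src to dst, keeping only ASCII characters."""
    # latin-1 maps every possible byte to a character, so reading never fails on odd bytes
    with open(src, encoding="latin-1") as infile, open(dst, "w") as outfile:
        for line in infile:
            outfile.write(to_ascii(line))
```

# Calling clean_file() with no arguments uses the same two file names as the blocks above.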
# 
# Next, the real cleaning begins. This will be done in two major steps. First I will completely remove all unreadable text and save the result as a human readable file. Then I will use this cleaned text to create a file which fills in missing values and adds separators between each "column" so it can be read into pandas.
# 
# For the first step a list of tags will have to be removed, such as &y, &r etc. These tags seem to determine the color of the next word or symbol on the highscores screen and all need to be removed. Also, sometimes a highscore does not contain a description of how the character died. We need to be able to distinguish between a blank line where this description should be and a blank line which occurs between highscores. 
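# 
# As an aside, the color tags could probably be removed with a single regular expression instead of a list. The sketch below assumes every tag is an & followed by exactly one letter, which matches the tags I have seen but may not cover all of them. The name strip_color_tags is hypothetical.

```python
import re

# Assumption: every colour tag is "&" followed by exactly one ASCII letter (&y, &r, &W ...).
COLOR_TAG = re.compile(r"&[A-Za-z]")


def strip_color_tags(line):
    """Remove colour tags such as &y or &R from a line."""
    return COLOR_TAG.sub("", line)
```

# Note that this would also eat a genuine "&" followed by a letter in a character or item name, so the explicit list is the safer choice if such names exist in your file.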

# In[3]:


import re

#list of text to remove.
remove_list = ["&W", "&y", "&w", "&r", "&M", "&Y", "&C", "&c", "&b", "&B", "&K", "&R", "&G", "&g", "\r", "\n", "\t"]

cleaned_qud = open("Cleaned_Qud_HighScores.txt", "w") #file to write to

clean_highscores = open("HighScores_clean.txt", "r")

#flag which will be used to determine if there is a blank line in the data instead of a line describing how the character died.
#this represents if we have reached a line that says "Visited x Zones" which is always present and always occurs after the character death description
visited = False
#flag which will be used to determine if this is the first line in the file
first_line = True

for line in clean_highscores.readlines():
    line = line.replace("`", "'").replace("\n", " ").strip() #some lines have a different ' which was causing havoc! Remove all linebreaks, strip away all whitespace
    
    for remove_word in remove_list:
        line = line.replace(remove_word, "") #go down through all words in the remove list and replace them with ""
    if "===" in line: #check if this is the first line of a highscore (===Game summary for )
        visited = False #set the visited variable to false
        qud_search = re.search(r"=== ((\w*\'*\w*\s*)*) ===", line) #pull out everything between === and ===
        #if this is the first line in the file write as is, otherwise put a \n at the start. Prevents a blank line at the start of the file
        if first_line == True:
            line = str(qud_search.group(1))
            first_line = False
        else:
            line = "\n" + str(qud_search.group(1))
    
    if len(line) == 0 and visited == False: #if a line is blank and we haven't hit the end of the highscore this is where a death description should be
        line = "blank"
        
    if "Visited" in line:
        visited = True #set to true to indicate we have passed the death description. Blank lines after this will be stripped out
        
    #Even after all the cleaning some unwanted symbols were still getting through. The following line works, but is messy. But works. Did I mention it works?...well, it works so far...
    #If we have passed the death description (visited == True), any line that is less than 10 characters long once ALL spaces are stripped, even those between words, can be assumed to be trash that has made it through the cleaning process. Skip it.
    if visited == True:
        if len(line.replace(" ", "")) < 10:
            continue
    
    print(line) #print out the line
    cleaned_qud.write(line + "\n") #write the line
cleaned_qud.close()


# Wow, that was tough and we're still not near Golgotha. The human readable file created above will now be used to create a pandas readable file. This could all have been done in one step, but I split it in two for my sanity, which I was in danger of losing during the above process, and so that a user can change the step below to clean the file in a different way.
# 
# Now we need to delete a lot of the filler text ("Game summary for ", "x died on the " etc) so that we are just left with categorical (character name, artifact name) or integer values (score, zones).
# 
# There is also the issue of uneven or unequal highscore descriptions. Some of them contain data that the others do not. If I found a "storied item" (I remember finding a shield called "Stopslavin") then a row "Generated 1 storied items." will be added. However, if I do not find a storied item then this line will not be there. Same with artifacts. So it is possible that some scores will have (at least) two lines more than other scores and a number of flags are used to check this.
# 
# If going through this code using your own highscores data you will more than likely have to make adjustments/additions to the lines determining how the character died.
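# 
# If the regular expressions below do not fit your own death messages, a single pattern with named groups may be easier to adjust. The sketch below is not the code used in this notebook. The names HIT_LINE and parse_hit_line are hypothetical, and the pattern has only been checked against the two sample lines quoted in the comments of the next cell.

```python
import re

# Hypothetical pattern covering both "hits" formats; the "->PV dice" part is optional.
HIT_LINE = re.compile(
    r"^(?:The )?(?P<killer>.+?) hits \(x(?P<hits>\d+)\) for (?P<damage>\d+) damage"
    r" with \w+ (?P<weapon>.+?)"
    r"(?: ->(?P<pv>\d+) (?P<dice>\d+d\d+))?(?:!|$)"
)


def parse_hit_line(line):
    """Return [killer, hits, damage, weapon, PV, dice], or None if the line does not match."""
    match = HIT_LINE.search(line)
    if match is None:
        return None
    return [
        match.group("killer"),
        match.group("hits"),
        match.group("damage"),
        match.group("weapon"),
        match.group("pv") or "0",    # 0 when the line has no "->PV" part
        match.group("dice") or "0",  # 0 when the line has no dice roll
    ]
```

# Joining the result with "\t".join(...) gives the same six tab separated fields that the cell below writes for a death line.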

# In[4]:


import re #left behind as I often started the notebook from this point, content with the cleaning in the above step from earlier

cleaned_qud = open("Cleaned_Qud_HighScores_1.txt", "w")

clean_highscores = open("Cleaned_Qud_HighScores.txt", "r")

first_line = True
name = " "
#flags for checking if storied items or artifacts are present in the highscore
visited = False
generated = False
artifact = False

for line in clean_highscores.readlines():
    
    line = line.replace("`", "'").replace("\n", " ").replace(".", "").strip() 
            
    if "summary" in line: #If this is the first line of a highscore
        visited = False
        generated = False #set all flags to false
        artifact = False
        line = line.replace("Game summary for ", "") #remove everything but the characters name
        name = line #save the characters name to be used in a later deletion ("name died on ")
        line = line.strip() #strip blank space. This is from an attempt to parse a line where the character name was " "
        if first_line == True:
            first_line = False
        else:
            line = "\n"+str(line) #If this is the first line saved to the file add as is, otherwise add a \n to the start
            
    if "Game ended" in line:
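        line = re.sub(r"\bat\b", "", line.replace("Game ended", "")).strip() #Remove "Game ended" and the standalone word "at", leaving behind only the date. A plain replace("at", "") would also mangle words such as "Saturday"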
        
        
    if "died on" in line:
        line = line.replace("%s died on the" % name, "").strip() #remove "name died on the " leaving behind only the Game date
        
       
    #Code to figure out what caused the players death
    if " hits (" in line:
        if "->" in line:
            #The chute crab hits (x1) for 2 damage with his crab claw ->7 1d2! [7]
            death_search = re.search(r"((\w*\,?\-?\s*)+) hits \(x(\d*)\) for (\d*) damage with \w{3} ((\w*\,?\-?\s*)+) ->(\d+) (\d*d\d*)!?", line.replace("The", "").replace("bloody", "").strip())
            #name, times hit, damage, weapon, PV, pos damage
            line = str(death_search.group(1)) + "\t" + str(death_search.group(3)) + "\t" + str(death_search.group(4)) + "\t" + str(death_search.group(5)) + "\t" + str(death_search.group(7)) + "\t" + str(death_search.group(8))
        else:
            #Umchuum hits (x2) for 4 damage with his Umumerchacal! [9]
            death_search = re.search(r"((\w*\,?\-?\s*)+) hits \(x(\d*)\) for (\d*) damage with \w{3} ((\w*\,?\-?\s*)+)!?", line.replace("The", "").replace("bloody", "").strip())
            #name, times hit, damage, weapon, PV, pos damage
            line = str(death_search.group(1)) + "\t" + str(death_search.group(3)) + "\t" + str(death_search.group(4)) + "\t" + str(death_search.group(5)) + "\t0" + "\t0"

    if "blank" in line:
        line = "unknown\t0\t0\tunknown\t0\t0"
    
    #lines that contain 'from' are generally short descriptions. A more effective regex could be written at a later time.
    if "from" in line:
        if line.strip() == "from bleeding!":
            line = "bleeding\t0\t0\tbleeding\t0\t0"
            
        elif line.strip() == "from the scalding steam!":
            line = "scalding steam\t0\t0\tscalding steam\t0\t0"
            
        elif line.strip() == "from the explosion!":
            line = "explosion\t0\t0\texplosion\t0\t0"
            
        elif "from the fire started by" in line:
            foe = line.replace("from the fire started by ", "").strip("!").strip()
            line = "%s\t0\t0\tfire\t0\t0" % foe
        
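        elif "'s" in line:
            #from Wahmahcalcalit's lase beam!
            death_search = re.search(r"from ((\w*\,?\-?\s*)*(\w*\,?\-?(\'s){1}\s*)) ((\w*\,?\-?(\'s){0}\s*)+)", line)
            foe = death_search.group(1).strip()
            if foe.endswith("'s"):
                foe = foe[:-2] #drop only the possessive suffix. strip("'s'") would also eat a leading or trailing s from the name itself
            line = "%s\t0\t0\t%s\t0\t0" % (foe, str(death_search.group(5))) #no trailing tab, to match the other death lines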
            
            
    if line.strip() == "Abandoned all hope":
        line = "quit\t0\t0\tquit\t0\t0"
            
        
    if "Scored" in line:
        line = line.replace("Scored", "").replace("points", "").strip() #remove all bar the points figure
        
           
    if "Survived " in line:
        line = line.replace("Survived for", "").replace("turns", "").strip() #remove all bar the turns figure
        

    if "Visited " in line:
        visited = True #set visited flag to true
        line = line.replace("Visited", "").replace("zones", "").replace("zone", "").strip() #remove all bar the zones figure
        
        
    if "Generated" in line:
        generated = True #set generated flag to true
        line = line.replace("Generated", "").replace("storied items", "").strip() #remove all bar the storied items figure
    
        
    if "Most advanced artifact" in line:
        artifact = True #set artifact flag to true
        gen_check = "" #create a string for checking if a storied items figure exists
        if generated == False: 
            gen_check = "0\t"
            generated = True
        line = gen_check + str(line.replace("Most advanced artifact in possession:", "").strip())
        #if there is an artifact but no storied item this will read "0\t" + artifactname. If there is a storied item this will be "" + artifactname
        
    
    if len(line) == 0 and visited == True: #if we are on a blank line and we have passed the visited line this will add "0 no artifact" to the end of the line
        if generated == False:
            line = str(line) + "0\t"    
    
        if artifact == False:
            line = str(line) + "no artifact\t"   
        
        
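    print(line)
    cleaned_qud.write(line + "\t")
#fill in the final row, which has no following blank line to trigger the check above
if generated == False:
    cleaned_qud.write("0\t")
if artifact == False:
    cleaned_qud.write("no artifact")
cleaned_qud.close()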


# The text is now cleaned and can be read into a pandas dataframe. The above code works with my current highscores, but a lot of work would be needed to make it compatible with other players' highscores. There are many ways to die in Qud and my parsing only takes into consideration the few ways my characters have died. I am sure there are many ways to break the above code and I would greatly appreciate any suggestions or improvements.