Notebook

Transliteration¶

Transliteration is the conversion of a text from one script to another. For instance, a Latin transliteration of the Greek phrase "Ελληνική Δημοκρατία", usually translated as 'Hellenic Republic', is "Ellēnikḗ Dēmokratía".

In [1]:

from polyglot.transliteration import Transliterator

Languages Coverage¶

In [2]:

from polyglot.downloader import downloader
print(downloader.supported_languages_table("transliteration2"))

  1. Haitian; Haitian Creole    2. Tamil                      3. Vietnamese               
  4. Telugu                     5. Croatian                   6. Hungarian                
  7. Thai                       8. Kannada                    9. Tagalog                  
 10. Armenian                  11. Hebrew (modern)           12. Turkish                  
 13. Portuguese                14. Belarusian                15. Norwegian Nynorsk        
 16. Norwegian                 17. Dutch                     18. Japanese                 
 19. Albanian                  20. Bulgarian                 21. Serbian                  
 22. Swahili                   23. Swedish                   24. French                   
 25. Latin                     26. Czech                     27. Yiddish                  
 28. Hindi                     29. Danish                    30. Finnish                  
 31. German                    32. Bosnian-Croatian-Serbian  33. Slovak                   
 34. Persian                   35. Lithuanian                36. Slovene                  
 37. Latvian                   38. Bosnian                   39. Gujarati                 
 40. Italian                   41. Icelandic                 42. Spanish; Castilian       
 43. Ukrainian                 44. Georgian                  45. Urdu                     
 46. Indonesian                47. Marathi (Marāṭhī)         48. Korean                   
 49. Galician                  50. Khmer                     51. Catalan; Valencian       
 52. Romanian, Moldavian, ...  53. Basque                    54. Macedonian               
 55. Russian                   56. Azerbaijani               57. Chinese                  
 58. Estonian                  59. Welsh                     60. Arabic                   
 61. Bengali                   62. Amharic                   63. Irish                    
 64. Malay                     65. Afrikaans                 66. Polish                   
 67. Greek, Modern             68. Esperanto                 69. Maltese

Downloading Necessary Models¶

In [3]:

%%bash
polyglot download embeddings2.en transliteration2.ar

[polyglot_data] Downloading package embeddings2.en to
[polyglot_data]     /home/rmyeid/polyglot_data...
[polyglot_data]   Package embeddings2.en is already up-to-date!
[polyglot_data] Downloading package transliteration2.ar to
[polyglot_data]     /home/rmyeid/polyglot_data...
[polyglot_data]   Package transliteration2.ar is already up-to-date!

Example¶

We tag each word in the text with one part of speech.

In [7]:

from polyglot.text import Text

In [8]:

blob = """We will meet at eight o'clock on Thursday morning."""
text = Text(blob)

We can query all the tagged words

In [9]:

for x in text.transliterate("ar"):
  print(x)

وي
ويل
ميت
ات
ييايت
أوكلوك
ون
ثورسداي
مورنينغ

Command Line Interface¶

In [20]:

!polyglot --lang en tokenize --input testdata/cricket.txt |  polyglot --lang en transliteration --target ar | tail -n 30

which           ويكه            
India           ينديا           
beat            بيت             
Bermuda         بيرمودا         
in              ين              
Port            بورت            
of              وف              
Spain           سباين           
in              ين              
2007                            
,                               
which           ويكه            
was             واس             
equalled        يكالليد         
five            فيفي            
days            دايس            
ago             اغو             
by              بي              
South           سووث            
Africa          افريكا          
in              ين              
their           ثير             
victory         فيكتوري         
over            وفير            
West            ويست            
Indies          يندييس          
in              ين              
Sydney          سيدني           
.

Citation¶

This work is a direct implementation of the research being described in the False-Friend Detection and Entity Matching via Unsupervised Transliteration paper. The author of this library strongly encourage you to cite the following paper if you are using this software.

       @article{chen2016false,
       title = {False-Friend Detection and Entity Matching via Unsupervised Transliteration},
       author = {Chen, Yanqing and Skiena, Steven},
       journal = {arXiv preprint arXiv:1611.06722},
       year = {2016}
       }