The goal of the Indic NLP Library is to build Python based libraries for common text processing and Natural Language Processing in Indian languages. Indian languages share a lot of similarity in terms of script, phonology, language syntax, etc. and this library is an attempt to provide a general solution to very commonly required toolsets for Indian language text.
The library provides the following functionalities:
The data resources required by the Indic NLP Library are hosted in a different repository. These resources are required for some modules. You can download from the Indic NLP Resources project.
----- Set these variables -----
# The path to the local git repo for Indic NLP library
INDIC_NLP_LIB_HOME=r"C:\Users\ankunchu\Documents\src\indic_nlp_library"
# The path to the local git repo for Indic NLP Resources
INDIC_NLP_RESOURCES=r"C:\Users\ankunchu\Documents\src\indic_nlp_resources"
Add Library to Python path
import sys
sys.path.append(r'{}'.format(INDIC_NLP_LIB_HOME))
Export environment variable
export INDIC_RESOURCES_PATH=<path>
OR
set it programmatically We will use that method for this demo
from indicnlp import common
common.set_resources_path(INDIC_NLP_RESOURCES)
Initialize the Indic NLP library
from indicnlp import loader
loader.load()
** Let's actually try out some of the API methods in the Indic NLP library **
Many of the API functions require a language code. We use 2-letter ISO 639-1 codes. Some languages do not have assigned 2-letter codes. We use the following two-letter codes for such languages:
Text written in Indic scripts display a lot of quirky behaviour on account of varying input methods, multiple representations for the same character, etc. There is a need to canonicalize the representation of text so that NLP applications can handle the data in a consistent manner. The canonicalization primarily handles the following issues:
- Non-spacing characters like ZWJ/ZWNL
- Multiple representations of Nukta based characters
- Multiple representations of two part dependent vowel signs
- Typing inconsistencies: e.g. use of pipe (|) for poorna virama
When data available is scarce, such normalization can help utilize the data more efficiently.
from indicnlp.normalize.indic_normalize import IndicNormalizerFactory
input_text="\u0958 \u0915\u093c"
remove_nuktas=False
factory=IndicNormalizerFactory()
normalizer=factory.get_normalizer("hi",remove_nuktas)
output_text=normalizer.normalize(input_text)
print(input_text)
print()
print('Before normalization')
print(' '.join([ hex(ord(c)) for c in input_text ] ))
print('Length: {}'.format(len(input_text)))
print()
print('After normalization')
print(' '.join([ hex(ord(c)) for c in output_text ] ))
print('Length: {}'.format(len(output_text)))
реШ рдХрд╝ Before normalization 0x958 0x20 0x915 0x93c Length: 4 After normalization 0x915 0x93c 0x20 0x915 0x93c Length: 5
A smart sentence splitter which uses a two-pass rule-based system to split the text into sentences. It knows of common prefixes in Indian languages.
from indicnlp.tokenize import sentence_tokenize
indic_string="""рддреЛ рдХреНрдпрд╛ рд╡рд┐рд╢реНрд╡ рдХрдк 2019 рдореЗрдВ рдореИрдЪ рдХрд╛ рдмреЙрд╕ рдЯреЙрд╕ рд╣реИ? рдпрд╛рдиреА рдореИрдЪ рдореЗрдВ рд╣рд╛рд░-рдЬреАрдд рдореЗрдВ \
рдЯреЙрд╕ рдХреА рднреВрдорд┐рдХрд╛ рдЕрд╣рдо рд╣реИ? рдЖрдк рдРрд╕рд╛ рд╕реЛрдЪ рд╕рдХрддреЗ рд╣реИрдВред рд╡рд┐рд╢реНрд╡рдХрдк рдХреЗ рдЕрдкрдиреЗ-рдЕрдкрдиреЗ рдкрд╣рд▓реЗ рдореИрдЪ рдореЗрдВ рдмреБрд░реА рддрд░рд╣ рд╣рд╛рд░рдиреЗ рд╡рд╛рд▓реА рдПрд╢рд┐рдпрд╛ рдХреА рджреЛ рдЯреАрдореЛрдВ \
рдкрд╛рдХрд┐рд╕реНрддрд╛рди рдФрд░ рд╢реНрд░реАрд▓рдВрдХрд╛ рдХреЗ рдХрдкреНрддрд╛рди рдиреЗ рд╣рд╛рд▓рд╛рдВрдХрд┐ рдЕрдкрдиреЗ рд╣рд╛рд░ рдХреЗ рдкреАрдЫреЗ рдЯреЙрд╕ рдХреА рджрд▓реАрд▓ рддреЛ рдирд╣реАрдВ рджреА, рд▓реЗрдХрд┐рди рдпрд╣ рдЬрд░реВрд░ рдХрд╣рд╛ рдерд╛ рдХрд┐ рд╡рд╣ рдПрдХ рдЕрд╣рдо рдЯреЙрд╕ рд╣рд╛рд░ рдЧрдП рдереЗред"""
sentences=sentence_tokenize.sentence_split(indic_string, lang='hi')
for t in sentences:
print(t)
рддреЛ рдХреНрдпрд╛ рд╡рд┐рд╢реНрд╡ рдХрдк 2019 рдореЗрдВ рдореИрдЪ рдХрд╛ рдмреЙрд╕ рдЯреЙрд╕ рд╣реИ? рдпрд╛рдиреА рдореИрдЪ рдореЗрдВ рд╣рд╛рд░-рдЬреАрдд рдореЗрдВ рдЯреЙрд╕ рдХреА рднреВрдорд┐рдХрд╛ рдЕрд╣рдо рд╣реИ? рдЖрдк рдРрд╕рд╛ рд╕реЛрдЪ рд╕рдХрддреЗ рд╣реИрдВред рд╡рд┐рд╢реНрд╡рдХрдк рдХреЗ рдЕрдкрдиреЗ-рдЕрдкрдиреЗ рдкрд╣рд▓реЗ рдореИрдЪ рдореЗрдВ рдмреБрд░реА рддрд░рд╣ рд╣рд╛рд░рдиреЗ рд╡рд╛рд▓реА рдПрд╢рд┐рдпрд╛ рдХреА рджреЛ рдЯреАрдореЛрдВ рдкрд╛рдХрд┐рд╕реНрддрд╛рди рдФрд░ рд╢реНрд░реАрд▓рдВрдХрд╛ рдХреЗ рдХрдкреНрддрд╛рди рдиреЗ рд╣рд╛рд▓рд╛рдВрдХрд┐ рдЕрдкрдиреЗ рд╣рд╛рд░ рдХреЗ рдкреАрдЫреЗ рдЯреЙрд╕ рдХреА рджрд▓реАрд▓ рддреЛ рдирд╣реАрдВ рджреА, рд▓реЗрдХрд┐рди рдпрд╣ рдЬрд░реВрд░ рдХрд╣рд╛ рдерд╛ рдХрд┐ рд╡рд╣ рдПрдХ рдЕрд╣рдо рдЯреЙрд╕ рд╣рд╛рд░ рдЧрдП рдереЗред
A trivial tokenizer which just tokenizes on the punctuation boundaries. This also includes punctuations for the Indian language scripts (the purna virama and the deergha virama). It returns a list of tokens.
from indicnlp.tokenize import indic_tokenize
indic_string='рд╕реБрдиреЛ, рдХреБрдЫ рдЖрд╡рд╛реЫ рдЖ рд░рд╣реА рд╣реИред рдлреЛрди?'
print('Input String: {}'.format(indic_string))
print('Tokens: ')
for t in indic_tokenize.trivial_tokenize(indic_string):
print(t)
Input String: рд╕реБрдиреЛ, рдХреБрдЫ рдЖрд╡рд╛реЫ рдЖ рд░рд╣реА рд╣реИред рдлреЛрди? Tokens: рд╕реБрдиреЛ , рдХреБрдЫ рдЖрд╡рд╛реЫ рдЖ рд░рд╣реА рд╣реИ ред рдлреЛрди ?
A de-tokenizer for Indian languages that can address punctuation in Indic languages. The de-tokenizer is useful when generating natural language output. It can be used as a post-processor.
from indicnlp.tokenize import indic_detokenize
indic_string='" рд╕реБрдиреЛ , рдХреБрдЫ рдЖрд╡рд╛реЫ рдЖ рд░рд╣реА рд╣реИ . " , рдЙрд╕рдиреЗ рдХрд╣рд╛ ред '
print('Input String: {}'.format(indic_string))
print('Detokenized String: {}'.format(indic_detokenize.trivial_detokenize(indic_string,lang='hi')))
Input String: " рд╕реБрдиреЛ , рдХреБрдЫ рдЖрд╡рд╛реЫ рдЖ рд░рд╣реА рд╣реИ . " , рдЙрд╕рдиреЗ рдХрд╣рд╛ ред Detokenized String: "рд╕реБрдиреЛ, рдХреБрдЫ рдЖрд╡рд╛реЫ рдЖ рд░рд╣реА рд╣реИ.", рдЙрд╕рдиреЗ рдХрд╣рд╛ред
Convert from one Indic script to another. This is a simple script which exploits the fact that Unicode points of various Indic scripts are at corresponding offsets from the base codepoint for that script. The following scripts are supported:
Devanagari (Hindi,Marathi,Sanskrit,Konkani,Sindhi,Nepali), Assamese, Bengali, Oriya, Gujarati, Gurumukhi (Punjabi), Sindhi, Tamil, Telugu, Kannada, Malayalam
from indicnlp.transliterate.unicode_transliterate import UnicodeIndicTransliterator
input_text='рд░рд╛рдЬрд╕реНрдерд╛рди'
# input_text='р┤░р┤╛р┤Ьр┤╕р╡Нр┤ер┤╛р┤и'
# input_text='р╢╗р╖Пр╢вр╖Гр╖Кр╢ор╖Пр╢▒'
print(UnicodeIndicTransliterator.transliterate(input_text,"hi","ta"))
ро░ро╛роЬро╕рпНродро╛рои
Convert script text to Roman text in the ITRANS notation
from indicnlp.transliterate.unicode_transliterate import ItransTransliterator
input_text='рд░рд╛рдЬрд╕реНрдерд╛рди'
# input_text='роЖроЪро┐ро░ро┐ропро░рпНроХро│рпН'
lang='hi'
print(ItransTransliterator.to_itrans(input_text,lang))
raajasthaana
Let's call conversion of ITRANS-transliteration to an Indic script as Indicization!
from indicnlp.transliterate.unicode_transliterate import ItransTransliterator
input_text='pAlakkAda'
# input_text='pitL^In'
lang='ml'
x=ItransTransliterator.from_itrans(input_text,lang)
print(x)
for y in x:
print('{:x}'.format(ord(y)))
р┤кр┤╛р┤▓р┤Хр╡Нр┤Хр┤╛р┤ж d2a d3e d32 d15 d4d d15 d3e d26
Indic scripts have been designed keeping phonetic principles in nature and the design and organization of the scripts makes it easy to obtain phonetic information about the characters.
With each script character, a phontic feature vector is associated, which encodes the phontic properties of the character. This is a bit vector which is can be obtained as shown below:
from indicnlp.script import indic_scripts as isc
c='рдХ'
lang='hi'
isc.get_phonetic_feature_vector(c,lang)
array([0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0], dtype=int64)
This fields in this bit vector are (from left to right):
sorted(isc.PV_PROP_RANGES.items(),key=lambda x:x[1][0])
[('basic_type', [0, 6]), ('vowel_length', [6, 8]), ('vowel_strength', [8, 11]), ('vowel_status', [11, 13]), ('consonant_type', [13, 18]), ('articulation_place', [18, 23]), ('aspiration', [23, 25]), ('voicing', [25, 27]), ('nasalization', [27, 29]), ('vowel_horizontal', [29, 32]), ('vowel_vertical', [32, 36]), ('vowel_roundness', [36, 38])]
Note: This interface below will be deprecated soon and a new interface will be available soon.
from indicnlp.langinfo import *
c='рдХ'
lang='hi'
print('Is vowel?: {}'.format(is_vowel(c,lang)))
print('Is consonant?: {}'.format(is_consonant(c,lang)))
print('Is velar?: {}'.format(is_velar(c,lang)))
print('Is palatal?: {}'.format(is_palatal(c,lang)))
print('Is aspirated?: {}'.format(is_aspirated(c,lang)))
print('Is unvoiced?: {}'.format(is_unvoiced(c,lang)))
print('Is nasal?: {}'.format(is_nasal(c,lang)))
Is vowel?: False Is consonant?: True Is velar?: True Is palatal?: False Is aspirated?: False Is unvoiced?: True Is nasal?: False
Using the phonetic feature vectors, we can define phonetic similarity between the characters (and underlying phonemes). The library implements some measures for phonetic similarity between the characters (and underlying phonemes). These can be defined using the phonetic feature vectors discussed earlier, so users can implement additional similarity measures.
The implemented similarity measures are:
** References **
Anoop Kunchukuttan, Pushpak Bhattacharyya, Mitesh Khapra. Substring-based unsupervised transliteration with phonetic and contextual knowledge. SIGNLL Conference on Computational Natural Language Learning ** (CoNLL 2016) **. 2016.
from indicnlp.script import indic_scripts as isc
from indicnlp.script import phonetic_sim as psim
c1='рдХ'
c2='рдЦ'
c3='рдн'
lang='hi'
print('Similarity between {} and {}'.format(c1,c2))
print(psim.cosine(
isc.get_phonetic_feature_vector(c1,lang),
isc.get_phonetic_feature_vector(c2,lang)
))
print()
print(u'Similarity between {} and {}'.format(c1,c3))
print(psim.cosine(
isc.get_phonetic_feature_vector(c1,lang),
isc.get_phonetic_feature_vector(c3,lang)
))
Similarity between рдХ and рдЦ 0.8333319444467593 Similarity between рдХ and рдн 0.4999991666680556
You may have figured out that you can also compute similarities of characters belonging to different scripts.
You can also get a similarity matrix which contains the similarities between all pairs of characters (within the same script or across scripts).
Let's see how we can compare the characters across Devanagari and Malayalam scripts
from indicnlp.script import indic_scripts as isc
from indicnlp.script import phonetic_sim as psim
slang='hi'
tlang='ml'
sim_mat=psim.create_similarity_matrix(psim.cosine,slang,tlang,normalize=False)
c1='рдХ'
c2='р┤Ц'
print('Similarity between {} and {}'.format(c1,c2))
print(sim_mat[isc.get_offset(c1,slang),isc.get_offset(c2,tlang)])
Similarity between рдХ and р┤Ц 0.8333319444467593
Some similarity functions like sim
do not generate values in the range [0,1] and it may be more convenient to have the similarity values in the range [0,1]. This can be achieved by setting the normalize
paramter to True
slang='hi'
tlang='ml'
sim_mat=psim.create_similarity_matrix(psim.sim1,slang,tlang,normalize=True)
c1='рдХ'
c2='р┤Ц'
print(u'Similarity between {} and {}'.format(c1,c2))
print(sim_mat[isc.get_offset(c1,slang),isc.get_offset(c2,tlang)])
Similarity between рдХ and р┤Ц 0.06860894001932027
from indicnlp.script import indic_scripts as isc
from indicnlp.transliterate.unicode_transliterate import UnicodeIndicTransliterator
lang1_str='рдкрд┐рдЫрд▓реЗ рджрд┐рдиреЛрдВ рд╣рдо рд▓реЛрдЧреЛрдВ рдиреЗ рдХрдИ рдЙрддреНрд╕рд╡ рдордирд╛рдпреЗ. рдХрд▓, рд╣рд┐рдиреНрджреБрд╕реНрддрд╛рди рднрд░ рдореЗрдВ рд╢реНрд░реА рдХреГрд╖реНрдг рдЬрдиреНрдо-рдорд╣реЛрддреНрд╕рд╡ рдордирд╛рдпрд╛ рдЧрдпрд╛.'
lang2_str='рк╡рлАркдрлЗрк▓рк╛ ркжрк┐рк╡рк╕рлЛркорк╛ркВ ркЖрккркгрлЗ ркХрлЗркЯрк▓рк╛ркп ркЙркдрлНрк╕рк╡рлЛ ркЙркЬрк╡рлНркпрк╛. рк╣ркЬрлА ркЧркЗркХрк╛рк▓рлЗ ркЬ рккрлВрк░рк╛ рк╣рк┐ркВркжрлБрк╕рлНркдрк╛ркиркорк╛ркВ рк╢рлНрк░рлАркХрлГрк╖рлНркг ркЬркирлНркорлЛркдрлНрк╕рк╡ ркЙркЬрк╡рк╡рк╛ркорк╛ркВ ркЖрк╡рлНркпрлЛ.'
lang1='hi'
lang2='gu'
lcsr, len1, len2 = isc.lcsr_indic(lang1_str,lang2_str,lang1,lang2)
print('{} string: {}'.format(lang1, lang1_str))
print('{} string: {}'.format(lang2, UnicodeIndicTransliterator.transliterate(lang2_str,lang2,lang1)))
print('Both strings are shown in Devanagari script using script conversion for readability.')
print('LCSR: {}'.format(lcsr))
hi string: рдкрд┐рдЫрд▓реЗ рджрд┐рдиреЛрдВ рд╣рдо рд▓реЛрдЧреЛрдВ рдиреЗ рдХрдИ рдЙрддреНрд╕рд╡ рдордирд╛рдпреЗ. рдХрд▓, рд╣рд┐рдиреНрджреБрд╕реНрддрд╛рди рднрд░ рдореЗрдВ рд╢реНрд░реА рдХреГрд╖реНрдг рдЬрдиреНрдо-рдорд╣реЛрддреНрд╕рд╡ рдордирд╛рдпрд╛ рдЧрдпрд╛. gu string: рд╡реАрддреЗрд▓рд╛ рджрд┐рд╡рд╕реЛрдорд╛рдВ рдЖрдкрдгреЗ рдХреЗрдЯрд▓рд╛рдп рдЙрддреНрд╕рд╡реЛ рдЙрдЬрд╡реНрдпрд╛. рд╣рдЬреА рдЧрдЗрдХрд╛рд▓реЗ рдЬ рдкреВрд░рд╛ рд╣рд┐рдВрджреБрд╕реНрддрд╛рдирдорд╛рдВ рд╢реНрд░реАрдХреГрд╖реНрдг рдЬрдиреНрдореЛрддреНрд╕рд╡ рдЙрдЬрд╡рд╡рд╛рдорд╛рдВ рдЖрд╡реНрдпреЛ. Both strings are shown in Devanagari script using script conversion for readability. LCSR: 0.5545454545454546
Orthographic Syllabification is an approximate syllabification process for Indic scripts, where CV+ units are defined to be orthographic syllables.
See the following paper for details:
Anoop Kunchukuttan, Pushpak Bhattacharyya. Orthographic Syllable as basic unit for SMT between Related Languages. Conference on Empirical Methods in Natural Language Processing (EMNLP 2016). 2016.
from indicnlp.syllable import syllabifier
w='рдЬрдЧрджреАрд╢рдЪрдВрджреНрд░'
lang='hi'
print(' '.join(syllabifier.orthographic_syllabify(w,lang)))
рдЬ рдЧ рджреА рд╢ рдЪ рдВрджреНрд░
Unsupervised morphological analysers for various Indian language. Given a word, the analyzer returns the componenent morphemes. The analyzer can recognize inflectional and derivational morphemes.
The following languages are supported:
Hindi, Punjabi, Marathi, Konkani, Gujarati, Bengali, Kannada, Tamil, Telugu, Malayalam
Support for more languages will be added soon.
from indicnlp.morph import unsupervised_morph
from indicnlp import common
analyzer=unsupervised_morph.UnsupervisedMorphAnalyzer('mr')
indic_string='рдЖрдкрд▓реНрдпрд╛ рд╣рд┐рд░рдбреНрдпрд╛рдВрдЪреНрдпрд╛ рдЖрдгрд┐ рджрд╛рддрд╛рдВрдЪреНрдпрд╛рдордзреНрдпреЗ рдЬреАрд╡рд╛рдгреВ рдЕрд╕рддрд╛рдд .'
analyzes_tokens=analyzer.morph_analyze_document(indic_string.split(' '))
for w in analyzes_tokens:
print(w)
рдЖрдкрд▓реНрдпрд╛ рд╣рд┐рд░рдбреНрдпрд╛ рдВрдЪреНрдпрд╛ рдЖрдгрд┐ рджрд╛рддрд╛ рдВрдЪреНрдпрд╛ рдордзреНрдпреЗ рдЬреАрд╡рд╛рдгреВ рдЕрд╕рддрд╛рдд .
We use the BrahmiNet REST API for transliteration.
import json
import requests
from urllib.parse import quote
text=quote('manish joe')
# text=quote('рдордирд┐рд╢реН рдЬреЛрдП')
url='http://www.cfilt.iitb.ac.in/indicnlpweb/indicnlpws/transliterate_bulk/en/hi/{}/statistical'.format(text)
print(url)
response = requests.get(url)
response.json()
http://www.cfilt.iitb.ac.in/indicnlpweb/indicnlpws/transliterate_bulk/en/hi/manish%20joe/rule
{'hi': ['рдордирд┐рд╢реН рдЬреЛрдП']}
Acronyms have a different behaviour while transliterating. Hence, a rule-based transliterator for transliterating English acronyms to Indian languages is available.
This can also be used to generate synthetic transliteration data to train a Indian language to English transliterator for acronyms.
from indicnlp.transliterate import acronym_transliterator
ack_transliterator=acronym_transliterator.LatinToIndicAcronymTransliterator()
ack_transliterator.transliterate('ICICI',lang='hi')
'рдЖрдИрд╕реАрдЖрдИрд╕реАрдЖрдИ'
We use the Shata-anuvaadak for translation. You can read more about Shata-anuvaadak here
import json
import requests
from urllib.parse import quote
text=quote('Mumbai is the capital of Maharashtra')
# text=quote('рдордирд┐рд╢реН рдЬреЛрдП')
url='http://www.cfilt.iitb.ac.in/indicnlpweb/indicnlpws/translate/en/mr/{}/'.format(text)
## Note the forward slash '/' at the end of the URL. It's should be there, but please live with it for now!
print(url)
response = requests.get(url)
response.json()
http://www.cfilt.iitb.ac.in/indicnlpweb/indicnlpws/translate/en/mr/Mumbai%20is%20the%20capital%20of%20Maharashtra/
{'mr': 'рд░рд╛рдЬрдзрд╛рдиреА рдорд╣рд╛рд░рд╛рд╖реНрдЯреНрд░ рдореБрдВрдмрдИ рдЖрд╣реЗ . '}