Bengali Language Model

Sagor Sarker
Jun 9, 2020

This Bengali language model is built with fastai’s ULMFiT and is ready for word prediction and classification tasks.

https://github.com/sagorbrur/bnlm

  • This tool mostly follows iNLTK
  • The Bengali part is split out into its own package, with better evaluation results

Installation

pip install bnlm

Dependencies

  • Use PyTorch >=1.0.0 and <=1.3.0
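
Since the supported range is narrow, it can help to check the installed version before loading the model. The snippet below is a minimal sketch and not part of bnlm: `is_supported_torch` is a hypothetical helper that reads the range above literally (1.0.0 to 1.3.0 inclusive) and ignores build suffixes such as `+cpu`.

```python
import re


def is_supported_torch(version: str) -> bool:
    """Return True if `version` lies in the documented range 1.0.0 to 1.3.0 inclusive.

    Hypothetical helper for illustration; bnlm does not ship this function.
    """
    # Keep only the leading numeric part, so "1.3.0+cpu" still parses.
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", version)
    if match is None:
        return False
    major, minor, patch = (int(part or 0) for part in match.groups())
    return (1, 0, 0) <= (major, minor, patch) <= (1, 3, 0)


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed")
    else:
        ok = is_supported_torch(torch.__version__)
        print(torch.__version__, "supported" if ok else "outside the documented range")
```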

Evaluation Result

Language Model

  • Accuracy: 48.26% on the validation dataset
  • Perplexity: ~22.79
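
To relate the perplexity figure to a training loss, here is a small sketch. It assumes the usual definition, in which perplexity is the exponential of the mean per-token cross-entropy measured in nats. The post does not state how the number was computed, so treat the result as an estimate. Under that assumption, a perplexity of about 22.79 corresponds to a validation loss of roughly 3.13.

```python
import math

perplexity = 22.79  # approximate value reported above

# Assumption: perplexity = exp(mean cross-entropy), so the loss is its natural log.
loss = math.log(perplexity)
print(f"implied cross-entropy: {loss:.2f} nats per token")
```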

Features and API

Download pretrained Model

To start, download the pretrained language model and the SentencePiece model:

from bnlm.bnlm import download_models
download_models()
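
The examples that follow read from a `model` directory and from `model/bn_spm.model`. A quick check that the download produced those files might look like the sketch below. It assumes the files land in `model` relative to the working directory, as the later examples imply. `missing_model_files` is a hypothetical helper, and `bn_spm.model` is the only file name this post mentions, so extend the list if your download contains more.

```python
from pathlib import Path


def missing_model_files(model_dir, expected=("bn_spm.model",)):
    """Return the expected file names that are not present in model_dir.

    Hypothetical helper for illustration; not part of bnlm.
    """
    root = Path(model_dir)
    return [name for name in expected if not (root / name).is_file()]


missing = missing_model_files("model")
if missing:
    print("Missing files:", missing, "- run download_models() first")
else:
    print("Model directory looks complete")
```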

Predict N Words

predict_n_words takes three parameters as input:

  • input_sen (your incomplete input text)
  • N (number of words to predict)
  • model_path (path to the pretrained model)

from bnlm.bnlm import BengaliTokenizer
from bnlm.bnlm import predict_n_words
model_path = 'model'
input_sen = "আমি বাজারে"
output = predict_n_words(input_sen, 3, model_path)
print("Word Prediction: ", output)

Get Sentence Encoding

from bnlm.bnlm import BengaliTokenizer
from bnlm.bnlm import get_sentence_encoding
model_path = 'model'
sp_model = "model/bn_spm.model"
input_sentence = "আমি ভাত খাই।"
encoding = get_sentence_encoding(input_sentence, model_path, sp_model)
print("sentence encoding is: ", encoding)

Get Embedding Vectors

from bnlm.bnlm import BengaliTokenizer
from bnlm.bnlm import get_embedding_vectors
model_path = 'model'
sp_model = "model/bn_spm.model"
input_sentence = "আমি ভাত খাই।"
embed = get_embedding_vectors(input_sentence, model_path, sp_model)
print("sentence embedding is : ", embed)

Sentence Similarity

from bnlm.bnlm import BengaliTokenizer
from bnlm.bnlm import get_sentence_similarity
model_path = 'model'
sp_model = "model/bn_spm.model"
sentence_1 = "সে খুব করে কথা বলে।"
sentence_2 = "তার কথা খুবেই মিষ্টি।"
sim = get_sentence_similarity(sentence_1, sentence_2, model_path, sp_model)
print("Similarity is: %0.2f"%sim)
# Output: 0.72
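
The post does not say how the similarity score is computed. A common choice, and only an assumption here, is cosine similarity between the two sentence encodings. The sketch below shows that measure on toy vectors standing in for encodings; `cosine_similarity` is an illustrative function, not bnlm's implementation.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy vectors standing in for two sentence encodings.
parallel = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
print("%0.2f" % parallel)    # same direction
print("%0.2f" % orthogonal)  # unrelated directions
```

A score near 1 means the two vectors point the same way, and a score near 0 means they are unrelated. Read that way, the 0.72 above would indicate fairly similar sentences.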

Get Similar Sentences

get_similar_sentences takes four parameters:

  • input sentence
  • N (number of similar sentences to return)
  • model_path (path to the pretrained model)
  • sp_model (path to the pretrained SentencePiece model)

from bnlm.bnlm import BengaliTokenizer
from bnlm.bnlm import get_similar_sentences
model_path = 'model'
sp_model = "model/bn_spm.model"
input_sentence = "আমি বাংলায় গান গাই।"
sen_pred = get_similar_sentences(input_sentence, 3, model_path, sp_model)
print(sen_pred)
# output: ['আমি বাংলায় গান গাই ।', 'আমি ইংরেজিতে গান গাই।', 'আমি বাংলায় গানও গাই।']
