Statistical language models for Persian are provided in three types: monogram (unigram), bigram and trigram. The models are extracted from a Persian text corpus of about 10 million words. The monogram model gives the number of occurrences of each word or POS tag in the corpus. The bigram model gives the number of occurrences of every pair of consecutive words, POS tags or classes, and the trigram model gives the same count for every sequence of three. These statistics are therefore available for sequences of words (word-based n-grams), POS tags (POS-based n-grams) and classes (class-based n-grams). The Persian text corpus is still under development, and the extracted statistics are updated as it grows. Further statistical models can be extracted from the corpus for other systems that rely on language models. The models can be used in speech recognition, intelligent typing and OCR systems.
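The counting described above can be illustrated with a minimal Python sketch. It is not the procedure actually used to build these models. The function name `ngram_counts`, the toy English sentence and its POS tags are all hypothetical and serve only to show how monogram, bigram and trigram counts are obtained from a tagged token sequence.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count every contiguous sequence of n items in `tokens`."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

# Hypothetical toy data as (word, POS tag) pairs; not taken from the corpus.
tagged = [("the", "DET"), ("cat", "N"), ("saw", "V"), ("the", "DET"), ("dog", "N")]
words = [word for word, _ in tagged]
tags = [tag for _, tag in tagged]

word_monograms = ngram_counts(words, 1)  # word-based monogram counts
word_bigrams = ngram_counts(words, 2)    # word-based bigram counts
pos_bigrams = ngram_counts(tags, 2)      # POS-based bigram counts
pos_trigrams = ngram_counts(tags, 3)     # POS-based trigram counts
```

The same function would apply to class-based n-grams once each word has been mapped to its class label, since it only requires a sequence of hashable items.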