Every system that uses language models needs a defined vocabulary, or lexicon. Defining such a set requires extracting the most frequent words of the language. We extracted the most frequent words of Persian from a Persian text corpus. The prepared vocabularies contain 5,000, 10,000 and 20,000 words. Each entry holds the word together with additional information: its phonetic transcriptions, covering its different pronunciations; its part-of-speech tags; and its number of occurrences in the corpus. In addition to the root of each word, these vocabularies include its more frequent inflected forms. Lexicons that contain only the roots of words were also prepared, with 10,000 and 20,000 entries.
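The frequency-based selection described above can be sketched as follows. This is a minimal illustration, not the procedure actually used to build the vocabularies: the `LexiconEntry` class, the `build_vocabulary` function, whitespace tokenization, the alphabetical tie-breaking rule, and the transliterated toy corpus are all assumptions made for the example. The pronunciation and part-of-speech fields are left empty because the source does not say how they were obtained.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class LexiconEntry:
    """One vocabulary entry (hypothetical layout)."""
    word: str
    count: int  # number of occurrences in the corpus
    pronunciations: list = field(default_factory=list)  # filled from another source
    pos_tags: list = field(default_factory=list)        # filled from another source


def build_vocabulary(sentences, size):
    """Return the `size` most frequent whitespace-separated tokens."""
    counts = Counter(
        token for sentence in sentences for token in sentence.split()
    )
    # Sort by descending count; break ties alphabetically so the result
    # is deterministic (an assumption, not part of the original method).
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [LexiconEntry(word=w, count=c) for w, c in ranked[:size]]


# Toy corpus of transliterated tokens, for illustration only.
corpus = [
    "ketab khub ast",
    "ketab ha khub hastand",
    "in ketab ast",
]

vocab = build_vocabulary(corpus, size=2)
for entry in vocab:
    print(entry.word, entry.count)
```

Calling `build_vocabulary` with `size=5000`, `size=10000` or `size=20000` on a full corpus would give lists of the sizes mentioned above. A root-only lexicon would additionally need a stemming or lemmatization step before counting, which is not shown here.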