Probabilistic Language Models

A popular idea in computational linguistics is to build a probabilistic model of language. Such a model assigns a probability to every English sentence, so that sentences that are more likely (in some sense) receive higher probability. When you are unsure which of two candidate sentences is correct, pick the one with the higher probability.

Comment: A "perfect" language model is only attainable with true intelligence. However, approximate language models are often easy to create and good enough for many applications.

Some models:

unigram: words generated one at a time, drawn from a fixed distribution.
bigram: probability of word depends on previous word.
tag bigram: probability of part of speech depends on previous part of speech; probability of word depends on its part of speech.
maximum entropy: many other, arbitrary features can contribute.
stochastic context free: words generated by a context-free grammar augmented with probabilistic rewrite rules.
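
To make the difference between the first two models concrete, here is a minimal sketch that estimates both from counts. The toy corpus, the variable names, and the helper train_bigram are all invented for illustration, and the estimates are plain relative frequencies with no smoothing, which is an assumption rather than anything these notes prescribe.

```python
from collections import Counter

# Toy corpus, invented for illustration only.
TOKENS = "the cat sat on the mat the dog sat on the log".split()

# Unigram: each word has one fixed probability, regardless of context.
unigram = {word: n / len(TOKENS) for word, n in Counter(TOKENS).items()}

def train_bigram(tokens):
    """Estimate P(word | previous word) as count(previous, word) / count(previous)."""
    pair_counts = Counter(zip(tokens, tokens[1:]))
    # Only tokens that have a successor can serve as a left context.
    context_counts = Counter(tokens[:-1])
    return {
        (prev, word): n / context_counts[prev]
        for (prev, word), n in pair_counts.items()
    }

bigram = train_bigram(TOKENS)

print(unigram["on"])          # probability of "on" with no context
print(bigram[("sat", "on")])  # probability of "on" given the previous word "sat"
```

In this corpus "on" is a fairly rare word overall, but it always follows "sat", so the bigram estimate is much higher than the unigram one. That extra context is what the bigram model buys, at the cost of many more parameters to estimate.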

We'll use unigrams.
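
A unigram model is small enough to sketch in a few lines. The following is an illustrative sketch only: the toy corpus, the function names, and the floor probability given to unseen words are assumptions made for this example, not part of the notes. It scores a word sequence by summing log word probabilities and picks the more probable of two candidates.

```python
import math
from collections import Counter

# Toy corpus, invented for illustration only.
CORPUS = "the cat sat on the mat the dog sat on the log".split()

def train_unigram(tokens):
    """Return a dict mapping each word to its relative frequency."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}

def log_prob(words, model, unseen=1e-6):
    """Sum of log word probabilities.

    Words missing from the model get the floor probability `unseen`
    (an assumption; it avoids taking log(0)).
    """
    return sum(math.log(model.get(w, unseen)) for w in words)

def pick_more_likely(a, b, model):
    """Return whichever candidate word sequence the model scores higher."""
    return a if log_prob(a, model) >= log_prob(b, model) else b

model = train_unigram(CORPUS)
best = pick_more_likely(["the", "cat", "sat"], ["the", "cats", "at"], model)
print(best)
```

Working in log space avoids numerical underflow when many small probabilities are multiplied together. The two candidates here stand in for two competing readings of the same input; the one containing words the model has never seen scores far lower.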

Guangwei Yuan
12/4/1999