Language I: Probabilities - Joints, Marginals and Conditionals.

Welcome to my first blog post in this series on text processing! In this series, I will explore the fascinating world of natural language processing, from the basics of language and words to more advanced concepts like text modeling and language generation.

In this first blog post, we will focus on the fundamental building blocks of language - letters - and how we can use them to understand probabilities. We will then move on to text modeling, which involves using statistical methods to analyze and understand natural language, and finally, we will try to build a simple language generator.

Data

For this exercise, I will use Webster’s Unabridged Dictionary as the dataset, downloaded using the R package gutenbergr. It is a 20th-century English dictionary (EBook release date: August 22, 2009), free to use, copy, and redistribute under the terms of the Project Gutenberg License. Interested readers should check EBook #29765 on the Project Gutenberg website.
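As a rough sketch, the download step might look something like this (assuming the gutenbergr and dplyr packages are installed; the object name dictionary_raw is my own choice):

```r
# Sketch: download the dictionary text from Project Gutenberg (EBook #29765)
library(gutenbergr)
library(dplyr)

dictionary_raw <- gutenberg_download(29765)

# gutenberg_download() returns a tibble with one row per line of text
glimpse(dictionary_raw)
```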

Since the English Dictionary contains all English words, it qualifies as a good sample dataset for learning about the structure of letters in English words. Every word is represented and contributes to the analysis.

Pre-processing

The dictionary’s sentences and paragraphs are broken down into words, a technique called unnesting. After that, pre-processing is done. The pre-processing performed for this analysis includes:

  • Omitting punctuation marks.
  • Omitting numbers.
  • Omitting duplicated words.

It is common practice in text processing to omit stop words 1; however, in this analysis we do not omit them, since the purpose of the analysis is to examine the structure of letters in English text, and stop words therefore add to the analysis.

The pre-processing reduces the dataset from 4,722,799 tokens (words) to 202,487 tokens (words).
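A minimal sketch of this pre-processing, assuming the dictionary_raw tibble from the download step and the tidytext, dplyr and stringr packages (words_clean is my own name), might look like:

```r
library(tidytext)
library(dplyr)
library(stringr)

words_clean <- dictionary_raw %>%
  # unnesting: one token (word) per row; unnest_tokens() also lower-cases
  # the text and strips most punctuation
  unnest_tokens(word, text) %>%
  # omit tokens containing numbers or other non-letter characters
  filter(str_detect(word, "^[a-z]+$")) %>%
  # omit duplicated words
  distinct(word)

nrow(words_clean)   # should be in the region of the ~202,487 tokens above
```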

We now dive into the world of probabilities:


Marginal probabilities

A marginal probability is a probability that is calculated for a single event without considering other events. Put simply, it is the probability of an event occurring regardless of the occurrence of any other event.

In this analysis, we calculate the marginal probability of a letter occurring in the text corpus2 by dividing the number of times that letter appears by the total number of letters in the corpus.

For example, if we have a text corpus that contains 100 letters and the letter “k” appears 30 times in the corpus, then the marginal probability of “k” is 30/100 = 0.3. This means that, in this corpus, the probability of any randomly chosen letter being “k” is 0.3, regardless of what other letters appear around it. The graph below shows the marginal probabilities of the English letters in the text corpus comprising words from the English Dictionary.

Marginal Probability
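As a sketch of how the marginal probabilities behind this plot might be computed (assuming the words_clean tibble from the pre-processing step):

```r
library(tidytext)
library(dplyr)

letter_marginals <- words_clean %>%
  # split every word into its individual letters
  unnest_tokens(letter, word, token = "characters") %>%
  count(letter) %>%
  # marginal probability = count of a letter / total number of letters
  mutate(prob = n / sum(n)) %>%
  arrange(desc(prob))

head(letter_marginals)   # 'e' should come out on top
```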

Interesting observations include:

  1. The most commonly occurring letter in the English Dictionary is ‘e’, followed by the letters ‘a’, ‘i’, and ‘r’.
  2. The rarest letter is ‘z’, followed by ‘j’ and ‘q’.

Joint probability

Joint probability is a statistical concept used to determine the likelihood of two or more events occurring together. In the context of text processing, we can use joint probability to determine the probability of two letters appearing together, as a consecutive pair, in a given text corpus. To calculate the joint probability of two letters, we divide the number of times they appear together in the corpus by the total number of letter pairs in the corpus. For instance, if we have a text corpus with 1000 letter pairs and the pair “st” appears 10 times, then the joint probability of “s” and “t” occurring together is 10/1000, which equals 0.01. This means that in this particular corpus, the probability of a randomly chosen letter pair being “st” is 0.01. (Note that this is not the same as the probability of “s” being followed by “t”; that is a conditional probability, discussed below.) The image below illustrates the joint probabilities:

Joint Probabilities
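A sketch of how these letter-pair (bigram) joint probabilities might be computed, again from words_clean; here pairs are taken within words only, which is an assumption on my part:

```r
library(dplyr)
library(purrr)
library(tidyr)

letter_pairs <- words_clean %>%
  mutate(pair = map(word, function(w) {
    chars <- strsplit(w, "")[[1]]
    if (length(chars) < 2) return(character(0))
    # consecutive letter pairs, e.g. "cat" -> "ca", "at"
    paste0(chars[-length(chars)], chars[-1])
  })) %>%
  unnest(pair) %>%
  count(pair) %>%
  # joint probability = count of a pair / total number of pairs
  mutate(joint_prob = n / sum(n))

letter_pairs %>% filter(pair == "st")   # joint probability of "s" then "t"
```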

Interesting observations include:

  1. The letter ‘q’, in the words found in the English Dictionary, is only followed by the letter ‘u’. An example of a word that comes to mind is “question”.

Conditional probability

Conditional probability refers to the probability of an event occurring, given that another event has already taken place. We can calculate the conditional probability of a letter occurring given another letter by dividing the joint probability of the pair by the marginal probability of the letter we are conditioning on. In this analysis, we use conditional probability to determine the likelihood of a letter appearing, given that another letter has just appeared in the text corpus. For instance, let’s say we have a text corpus in which the letter pair “op” appears five times and the letter “o” appears 10 times. The conditional probability of “p” given “o” would be calculated as 5/10, which equals 0.5. This means that in the given text corpus, there is a 50% chance of the letter “p” appearing after the letter “o”. The image below illustrates the conditional probabilities:

Conditional Probabilities
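A sketch of the conditional probabilities P(next letter | current letter), built on the letter_pairs counts from the joint-probability sketch above:

```r
library(dplyr)
library(stringr)

letter_conditionals <- letter_pairs %>%
  mutate(current = str_sub(pair, 1, 1),
         nxt     = str_sub(pair, 2, 2)) %>%
  group_by(current) %>%
  # P(next | current) = count of the pair / total count for the current letter
  mutate(cond_prob = n / sum(n)) %>%
  ungroup()

letter_conditionals %>% filter(current == "q")   # almost all mass on 'u'
```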

Notable examples include:

  1. Prob(Next = ‘u’ | Current = ‘q’) is higher than the rest, probably due to words like “question”.
  2. Prob(Current = ‘e’ | Next = ‘x’) is higher than the rest, probably due to words like “example”.

Word “starters” and “enders”

It is of interest to examine the distribution of letters commonly found at the start and end of English words.

To obtain the probability that a particular letter is a word “starter”, we take the number of times the letter was used as a starter, divided by the total number of word starters across all English letters (i.e., the total number of words).

For instance, if the letter ‘a’ was used as a word starter 60 out of 1000 times, then the probability of the letter ‘a’ being a word starter is 60/1000, which is 0.06. This means that in the given text corpus, there is a 6% chance of a randomly chosen word beginning with the letter ‘a’. The same logic applies to word “enders”.

The charts below show the probabilities for word “starters”, and “enders”:

Starters-Enders
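A sketch of how the “starter” and “ender” probabilities might be computed from words_clean:

```r
library(dplyr)
library(stringr)

starters <- words_clean %>%
  mutate(letter = str_sub(word, 1, 1)) %>%     # first letter of each word
  count(letter) %>%
  mutate(prob = n / sum(n))                    # share of all word starters

enders <- words_clean %>%
  mutate(letter = str_sub(word, -1, -1)) %>%   # last letter of each word
  count(letter) %>%
  mutate(prob = n / sum(n))                    # share of all word enders

starters %>% arrange(desc(prob)) %>% head()    # 's' should rank highly
```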

Interesting findings include:

  1. The letter ‘s’ is the most common letter at both the beginning and end of the English words found in the English Dictionary.

  1. Stop words are typically extremely common English words such as “the,” “of,” and “to.” ↩︎

  2. A corpus is a collection of written or spoken texts that are used as a source of data for linguistic analysis. A text corpus can be thought of as a large and diverse library of language samples, which may include books, articles, speeches, interviews, social media posts, and other written or spoken materials. ↩︎