The Poisson distribution is a probability distribution that describes the number of events in a given interval (e.g., of space or time) when these events occur randomly and independently of one another, but at a constant average rate. This post will explain what the Poisson distribution is, the benefits of using it for modeling purposes, and how to generate and visualize it with Python.
About the Poisson Distribution
Before we talk about the Poisson distribution itself and its applications, let’s first introduce the Poisson process. In short, the Poisson process is a model for a series of discrete events where the average time between events is known, but the exact timing of events is random. The occurrence of an event is also purely independent of the one that happened before.
Follows the Poisson Process
So let’s bring this theory to life with a real-world example. We all get frustrated when our internet connection is unstable. If we assume that one failure doesn’t influence the probability of the next one, we might say that it follows the Poisson process, where the event in question is “internet failure”. All we need to know is the average time between these failures. However, there is a set of criteria that needs to be met:
- The events of such a process are independent of each other.
- The average rate of event occurrences per unit of time (e.g. per month) is constant.
- Two events cannot occur at exactly the same instant (e.g. two internet failures cannot start simultaneously).
In our internet example, we assume that the events are independent and unrelated; that is, one instance of internet failure doesn’t affect the probability of the next instance. But sometimes, this might not be the case.
Another frequently given example of a Poisson process is bus arrivals. However, this is not a true Poisson process because the arrivals are not completely independent of one another. Even for buses that do not run on time, we cannot be sure that one bus’s late arrival doesn’t affect the arrival time of the next bus.
On the other hand, cases such as customers calling a help center or visitors landing on a website are more likely to be independent and would probably be considered a more solid example of the Poisson process.
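To make the idea concrete, here is a minimal sketch that simulates the internet-failure example. It assumes an average gap of 3.5 days between failures and a 28-day observation window (both values are made up for illustration), and it relies on a standard property of the Poisson process: the waiting times between consecutive events follow an exponential distribution. The function name simulate_poisson_process is hypothetical.

```python
import random


def simulate_poisson_process(mean_gap, duration, rng):
    """Return sorted event times in [0, duration) for a Poisson process.

    mean_gap is the average time between events; each individual gap
    is drawn independently from an exponential distribution.
    """
    times = []
    t = rng.expovariate(1.0 / mean_gap)  # waiting time until the first event
    while t < duration:
        times.append(t)
        t += rng.expovariate(1.0 / mean_gap)  # independent gap to the next event
    return times


rng = random.Random(42)  # fixed seed so the run is repeatable
failure_times = simulate_poisson_process(mean_gap=3.5, duration=28.0, rng=rng)

print(f"{len(failure_times)} failures in 28 days")
print([round(t, 1) for t in failure_times])
```

Running it with different seeds shows the key feature of the process: the average gap stays the same, but the exact timing of each failure is random.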
What is the Poisson distribution?
While the Poisson process is the model we use to describe events that occur independently of each other, the Poisson distribution allows us to turn these “descriptions” into meaningful insights. So, let’s now explain exactly what the Poisson distribution is.
The Poisson distribution is a discrete probability distribution
As you might have already guessed, the Poisson distribution is a discrete probability distribution which indicates how many times an event is likely to occur within a specific time period. But what is a discrete probability distribution?
Right, let’s first align on the concepts! A probability distribution is a mathematical function that gives the probabilities of the possible outcomes of an experiment. As you might already know, probability distributions are used to describe different types of random variables. These variables can be either discrete or continuous. When talking about the Poisson distribution, we’re looking at discrete variables, which may take on only a countable number of distinct values, such as the number of internet failures (to go back to our earlier example).
Given all that, the Poisson distribution is used to model a discrete random variable, which we can represent by the letter “k”. As in the Poisson process, our Poisson distribution only applies to independent events which occur at a constant average rate within a period of time. In other words, this distribution can be used to estimate the probability of something happening a certain number of times based on its event rate.
For example, if the average number of people who visit an exhibition on Saturday evening is 210, we can ask ourselves a question like “What is the probability that exactly 300 people will visit the exhibition next Saturday evening?”
Getting hands-on with Poisson distribution
So far, we’ve covered lots of theory. Now it’s time to delve into the mathematical side of the Poisson distribution.
First, let’s consider the formula used to calculate our probabilities. Discrete probability distributions are defined by probability mass functions (PMFs). In statistics, a probability mass function is a function that gives you the probability that a discrete random variable (i.e., “k”) is exactly equal to some value. So, the Poisson distribution PMF with a discrete random variable “k” is written as follows:
$$P(k \text{ events in interval}) = \frac{e^{-\lambda} \lambda^{k}}{k!}$$
Hang on, don’t run away just yet! Let’s break it down:
- P(k events in interval) stands for “the probability of observing k events in a given interval”; that’s what we’re trying to find out.
- “e” is Euler’s number, a mathematical constant with an approximate value of 2.71828.
- “λ” (lambda) is the expected number of occurrences in the interval. It is also sometimes called the rate parameter or event rate, and is calculated as follows: events/time * time period.
- “!” is the symbol used to represent the factorial function. The factorial of k is the product of all whole numbers from 1 to k. For example, if “k” is 4, then k! = 4! = 1 * 2 * 3 * 4 = 24.
To get a better grasp of how it works, let’s apply the formula to the following example.
The average number of internet failures in a household is 2 per week (“λ”). What is the probability of 3 (“k”) internet failures happening next week? Assuming that these are independent events that occur at a constant average rate and can’t happen simultaneously, let’s fill in the data we have:
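$$
\begin{aligned}
P(k; \lambda) &= \frac{e^{-\lambda} \lambda^{k}}{k!} \\
P(3; 2) &= \frac{2.71828^{-2} \cdot 2^{3}}{3!} \\
&= \frac{0.13534 \cdot 8}{6} \\
&\approx 0.18
\end{aligned}
$$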
Seems like the probability of 3 internet failures happening next week is around 18%, which is not that high.
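The same arithmetic can be checked with a few lines of Python. This is a minimal sketch using only the standard library; the function name poisson_pmf and the variable names are our own choices, not part of any library.

```python
import math


def poisson_pmf(k, lam):
    """Probability of exactly k events when lam events are expected."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


rate_per_week = 2             # average number of failures per week
weeks = 1                     # length of the interval we are asking about
lam = rate_per_week * weeks   # lambda = events/time * time period

p3 = poisson_pmf(3, lam)
print(f"P(3 failures next week) = {p3:.4f}")  # 0.1804

# Probabilities for 0 to 6 failures, to see where k = 3 sits
for k in range(7):
    print(k, f"{poisson_pmf(k, lam):.4f}")
```

Changing rate_per_week or weeks rescales λ, so the same function answers questions about other intervals, such as a two-week period.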
Calculating formulas manually can be a rather tedious process, and, as a data analyst or a data scientist, it’s highly unlikely that you’ll ever do it as we have above! There are certain tools and computer languages that enable you to analyze your data without having to go through such formulas manually.
One such language is Python, a programming language which is used to create algorithms (or sets of instructions) that can be read and implemented by a computer. We won’t go into detail about Python here; for the purpose of this post, you just need to know that it can be used to simplify the process of calculating a Poisson distribution for a given set of data. If you’d like to learn more about what Python is, we’ve covered it in detail here: What is Python? A Complete Guide.
With that in mind, we’re now going to do the following:
- Generate some random Poisson-distributed data with Python
- Visualize our data
Generating and visualizing a Poisson distribution with Python
Below, you’ll see a snippet of code which will allow you to generate Poisson-distributed data from two parameters: mu (another name for λ) and size (the number of samples to draw). In the code snippet itself, you’ll find explanations after the # sign, which is how comments are written in Python.
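A minimal sketch of such a snippet follows. It is an illustration rather than a definitive implementation: it sticks to Python’s standard library so that it also runs in a browser-based shell, which means the poisson_rvs helper is a hand-written stand-in (using Knuth’s multiplication method) for a library routine such as scipy.stats.poisson.rvs, and the “plot” is a simple text histogram. The seed and the sample size of 1,000 are arbitrary choices.

```python
import math
import random
from collections import Counter


def poisson_rvs(mu, size, rng):
    """Draw `size` Poisson-distributed integers with mean `mu`."""
    limit = math.exp(-mu)
    samples = []
    for _ in range(size):
        # Multiply uniform random numbers until the product drops to
        # e^(-mu) or below; the number of extra factors needed is the sample.
        k, product = 0, rng.random()
        while product > limit:
            k += 1
            product *= rng.random()
        samples.append(k)
    return samples


rng = random.Random(0)                        # fixed seed for repeatable output
data = poisson_rvs(mu=2, size=1000, rng=rng)  # mu is our lambda

# Visualize: one row per observed value, one '#' per 10 samples
counts = Counter(data)
for k in range(max(counts) + 1):
    bar = "#" * (counts[k] // 10)
    print(f"{k:2d} | {bar} {counts[k]}")
```

Most counts should cluster around mu = 2, with the bars thinning out quickly for larger values.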
You can run this code either in a shell on your local machine after installing Python, or simply by using the built-in shell on the official Python website.
What is the Poisson distribution used for?
Now that we know what the Poisson distribution is and what it looks like in action, it’s time to zoom out again and see where the Poisson distribution fits into the bigger picture.
As you know, data analytics is all about drawing meaningful insights from raw data; insights which can be used to make smart decisions. Poisson distributions are commonly used to find the probability that an event will happen a specific number of times based on how often it usually occurs. Based on these insights and future predictions, organizations can plan accordingly.
We have now covered a complete introduction to the Poisson distribution. There is certainly a lot more to be explored and plenty more exciting problems to solve, but hopefully this has given you a good starting point from which to continue your journey of discovery!
Before we finish, let’s summarize the main properties of the Poisson distribution and the key takeaways from what we’ve covered:
- Poisson distributions are used to find the probability that an event might happen a definite number of times based on how often it usually occurs.
- The average number of events per specific time interval is represented by λ and is called the event rate.
- The events are independent, meaning the number of events that occur in any interval of time is independent of the number of events that occur in any other interval.
- The expected number of events is proportional to the length of the time interval in question (e.g. a week or a month).
- The probability of an event in a particular time duration is the same for all equivalent time durations.