How to Use the Geometric Distribution in Python

The geometric random variable with parameter \(p\in\left(0,1\right)\) can be defined as the number of trials required to obtain a success, where the probability of success on each trial is \(p\). Thus,

\begin{eqnarray*} p\left(k;p\right) & = & \left(1-p\right)^{k-1}p\quad k\geq1\\ F\left(x;p\right) & = & 1-\left(1-p\right)^{\left\lfloor x\right\rfloor }\quad x\geq1\\ G\left(q;p\right) & = & \left\lceil \frac{\log\left(1-q\right)}{\log\left(1-p\right)}\right\rceil \\ \mu & = & \frac{1}{p}\\ \mu_{2} & = & \frac{1-p}{p^{2}}\\ \gamma_{1} & = & \frac{2-p}{\sqrt{1-p}}\\ \gamma_{2} & = & \frac{p^{2}-6p+6}{1-p}.\end{eqnarray*}

\begin{eqnarray*} M\left(t\right) & = & \frac{p}{e^{-t}-\left(1-p\right)}\end{eqnarray*}
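As a quick sanity check of these formulas, we can compare them against the moments reported by scipy.stats.geom (a sketch with an arbitrary p = 0.25):

```python
import numpy as np
from scipy.stats import geom

p = 0.25
mean, var, skew, kurt = geom.stats(p, moments='mvsk')

# Compare against the closed-form expressions above.
assert np.isclose(mean, 1 / p)                        # mu = 1/p
assert np.isclose(var, (1 - p) / p**2)                # mu_2 = (1-p)/p^2
assert np.isclose(skew, (2 - p) / np.sqrt(1 - p))     # gamma_1
assert np.isclose(kurt, (p**2 - 6*p + 6) / (1 - p))   # gamma_2

print(float(mean), float(var))  # 4.0 12.0
```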

Implementation:

The goal is to use Python to help us get intuition on complex concepts, empirically test theoretical proofs, or build algorithms from scratch. In this series, you will find articles covering topics such as random variables, sampling distributions, confidence intervals, significance tests, and more.

At the end of each article, you can find exercises to test your knowledge. The solutions will be shared in the article of the following week.

Articles published so far:

  • Bernoulli and Binomial Random Variables with Python
  • From Binomial to Geometric and Poisson Random Variables with Python
  • Sampling Distribution of a Sample Proportion with Python
  • Confidence Intervals with Python
  • Significance Tests with Python
  • Two-sample Inference for the Difference Between Groups with Python
  • Inference for Categorical Data
  • Advanced Regression
  • Analysis of Variance — ANOVA

As usual, the code is available on my GitHub.

Comparing Geometric Random Variables and Binomial Random Variables

Let’s start by recapping what a Binomial Random Variable (RV) is. Here is the checklist:

  • The trials are independent;
  • Each trial can be classified as either success or failure;
  • Fixed number of trials;
  • The probability of success on each trial is constant.

An example of such a variable could be X, the number of 5’s after 10 rolls of a fair die. Notice that the trials are independent; rolling the die in the first trial does not impact rolling the die in the second trial. We can clearly classify each trial as a success (we got a 5) or a failure (we got any number but a 5). We roll our die 10 times, so 10 is our fixed number of trials. Finally, the probability of success is constant, as we are always using the same die again and again, so the probability of getting a 5 is always 1/6.
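As a quick numerical sketch of this example (the numbers here are my own, computed with scipy.stats.binom):

```python
from scipy.stats import binom

# X = number of 5's in 10 rolls of a fair die: X ~ Binomial(n=10, p=1/6)
n, p = 10, 1/6

print(binom.pmf(2, n, p))  # P(X = 2), the chance of exactly two 5's, ≈ 0.29
print(binom.mean(n, p))    # expected number of 5's: n*p ≈ 1.67
```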

Then, what is the difference between a Binomial RV and a Geometric RV? See if you can spot the differences when I define the Geometric RV Y, the number of rolls until you get a 5 on a fair die. Think about it before continuing (refer to the checklist).

The main difference is the fixed number of trials; we no longer have one. It could take 1 roll or a googol (10^100) rolls as far as we can tell.

from IPython.display import Image
import numpy as np
import seaborn as sns
from scipy.stats import binom

Probability for a Geometric Random Variable

Cristiano Ronaldo has a penalty conversion rate of around 84%, being one of the best in the world.

Figure 1: Career penalty statistics of 22 of the best penalty takers in the world; source

Now imagine that Ronaldo is taking penalties until he misses one. Let X be the number of shots it takes to miss his first penalty shot. What is the probability that Ronaldo’s first missed shot occurs on his 4th attempt?

The probability that Ronaldo’s first miss occurs on the 4th attempt is the product of the probabilities of scoring the first 3 shots, 0.84³, times the probability of missing the 4th, 0.16:

\begin{eqnarray*} P\left(X=4\right) & = & \left(0.84\right)^{3}\times0.16\approx0.095\end{eqnarray*}

Recall that the attempts are independent; that is why we can multiply them like we do here.

So the probability of Ronaldo scoring 3 penalties in a row and then missing the 4th is around 10%.

0.84**3*0.16
0.09483263999999998
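scipy.stats.geom gives the same number directly; here X counts attempts until the first “success”, where a success is a miss with probability 0.16:

```python
from scipy.stats import geom

p_miss = 0.16
# P(X = 4): score the first 3 attempts, miss the 4th
print(geom.pmf(4, p_miss))  # ≈ 0.0948, matching 0.84**3 * 0.16
```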

We can generalize the Probability Mass Function of the Geometric distribution as follows:

\begin{eqnarray*} p\left(k;p\right) & = & \left(1-p\right)^{k-1}p\quad k\geq1\end{eqnarray*}
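A from-scratch version of this PMF is a one-liner; as a sketch, we can check it against scipy.stats.geom:

```python
from scipy.stats import geom

def geom_pmf(k, p):
    # P(X = k) = (1 - p)**(k - 1) * p, for k = 1, 2, 3, ...
    return (1 - p)**(k - 1) * p

# check against scipy for a few values of k
for k in range(1, 10):
    assert abs(geom_pmf(k, 0.16) - geom.pmf(k, 0.16)) < 1e-12

print(geom_pmf(4, 0.16))  # ≈ 0.0948
```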

Cumulative Geometric Probability

When calculating the cumulative geometric probability, we can view it from two different perspectives.

Let’s imagine that the probability of success p=0.1 and that we want to calculate P(X<5), i.e., the probability that the first success happens within the first 4 trials.

We can think of it in the following way: sum the probabilities that the first success arrives on trial 1, 2, 3, or 4,

\begin{eqnarray*} P\left(X<5\right) & = & \sum_{k=1}^{4}\left(1-p\right)^{k-1}p\end{eqnarray*}

Or we can think in the opposite direction: the only way to need 5 or more trials is that no success exists in the first 4 trials, and so:

\begin{eqnarray*} P\left(X<5\right) & = & 1-\left(1-p\right)^{4}\end{eqnarray*}

def geomcdf_1(p, x):
    # first approach: sum the PMF over k = 1, ..., x-1
    prob = 0
    for i in range(x-1):
        prob += p*(1-p)**i
    return prob

def geomcdf_2(p, x):
    # second approach: 1 - P(no success in the first x-1 trials)
    prob = 1-(1-p)**(x-1)
    return prob

geomcdf_1(0.1, 5)
0.34390000000000004

geomcdf_2(0.1, 5)
0.3439

We can see that we get the same result with either approach. However, the second one is significantly less computationally expensive, as we do not need to loop over all the cases.
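Both approaches also agree with scipy’s built-in CDF; since X is integer-valued, P(X<5) = P(X≤4):

```python
from scipy.stats import geom

p = 0.1
print(geom.cdf(4, p))  # P(X <= 4) = 1 - (1 - p)**4 = 0.3439
```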

Expected Value of a Geometric Random Variable

The expected value of any discrete RV is the sum of its probability-weighted outcomes.

In a Geometric RV, we already know how to calculate the probabilities. For example, P(X=1) is the probability of one success, therefore P(X=1)=p. In P(X=2), we have two trials, so necessarily one unsuccessful and one successful trial, therefore P(X=2)=(1-p)p and so on.

Writing the expected value out and multiplying both sides by (1-p) gives us two equations:

\begin{eqnarray*} E\left(X\right) & = & \sum_{k=1}^{\infty}kp\left(1-p\right)^{k-1}\\ \left(1-p\right)E\left(X\right) & = & \sum_{k=1}^{\infty}kp\left(1-p\right)^{k}\end{eqnarray*}

If we subtract these two equations, the terms telescope and we are left with

\begin{eqnarray*} pE\left(X\right) & = & p+p\left(1-p\right)+p\left(1-p\right)^{2}+\ldots\end{eqnarray*}

And there you go, we have a classic geometric series with first term p and common ratio (1-p). This is, in fact, one of the reasons why we call it a Geometric RV. The sum of a geometric series is a/(1-r), where a is the first term and r is the common ratio, so

\begin{eqnarray*} pE\left(X\right) & = & \frac{p}{1-\left(1-p\right)}=1\quad\Longrightarrow\quad E\left(X\right)=\frac{1}{p}.\end{eqnarray*}
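We can also sanity-check the series numerically by truncating the sum for E(X) (a sketch with p = 0.2; the tail beyond k = 1000 is negligible):

```python
import numpy as np

p = 0.2
k = np.arange(1, 1001)

# truncated version of E(X) = sum over k of k * p * (1 - p)**(k - 1)
series = np.sum(k * p * (1 - p)**(k - 1))

print(series, 1 / p)  # both ≈ 5.0
```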

Now that we have derived the expected value of our Geometric RV, let’s experimentally see if we can arrive at the same result.

p=0.2
X = np.random.geometric(p, 1000000)

print('Theoretical expected value: ' + str(1/p))
print('Empirically calculated expected value: ' + str(X.mean()))
Theoretical expected value: 5.0
Empirically calculated expected value: 5.00225

And there you go, we can feel comfortable with our derivation.
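The same experiment supports the variance formula (1-p)/p² listed at the top of the article:

```python
import numpy as np

p = 0.2
X = np.random.geometric(p, 1000000)

print('Theoretical variance: ' + str((1 - p) / p**2))  # 20.0
print('Empirically calculated variance: ' + str(X.var()))
```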

Comparing Poisson Random Variables and Binomial Random Variables

Can we approximate a Poisson Random Variable (RV) from a Binomial RV?

Let’s define our RV X as the number of UberEats motorcycles we see passing by our street in one hour.

If we try to build on this idea by splitting one hour of observation into n=60 minutes and treating each minute as a Bernoulli trial (either a motorcycle passes or it does not), we somehow get to an expected value of X that would look like:

\begin{eqnarray*} E\left(X\right) & = & np=\lambda\quad\Longrightarrow\quad p=\frac{\lambda}{60}\end{eqnarray*}

And our probability would look like this:

\begin{eqnarray*} P\left(X=k\right) & = & \binom{60}{k}\left(\frac{\lambda}{60}\right)^{k}\left(1-\frac{\lambda}{60}\right)^{60-k}\end{eqnarray*}

The problem is that we can easily get more than 1 motorcycle passing by our street every minute, and we cannot account for that with our current setup. But there is a way to do it, as we can be more granular.

Now, we are working with seconds, and the probability of seeing 1 motorcycle every second is quite small. But we can still face the problem of having 2 motorcycles pass by our street in half a second. So we could get even smaller to account for that. I think that you see where this is going.

We need to understand what happens when n goes to infinity. Before diving in on this, let’s revise three properties that we will need further in our proof. The first two are related with limits:

\begin{eqnarray*} \lim_{n\rightarrow\infty}\left(1-\frac{\lambda}{n}\right)^{n} & = & e^{-\lambda}\\ \lim_{n\rightarrow\infty}\left(1-\frac{\lambda}{n}\right)^{-k} & = & 1\end{eqnarray*}

And the third with combinatorics:

\begin{eqnarray*} \binom{n}{k} & = & \frac{n!}{k!\left(n-k\right)!}=\frac{n\left(n-1\right)\cdots\left(n-k+1\right)}{k!}\end{eqnarray*}
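A quick numerical check of the two limit properties (with the same λ = 9 used below and an arbitrary k = 3):

```python
import numpy as np

lam, k = 9, 3
for n in [100, 10_000, 1_000_000]:
    # (1 - lam/n)**n approaches e**(-lam); (1 - lam/n)**(-k) approaches 1
    print(n, (1 - lam / n)**n, (1 - lam / n)**(-k))

print('e**(-9) =', np.exp(-lam))
```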

Using these properties, we can get the intuition on how the Poisson RV is actually a derivation of a Binomial RV:

\begin{eqnarray*} \lim_{n\rightarrow\infty}\binom{n}{k}\left(\frac{\lambda}{n}\right)^{k}\left(1-\frac{\lambda}{n}\right)^{n-k} & = & \frac{\lambda^{k}}{k!}e^{-\lambda}\end{eqnarray*}

We just arrived at the Probability Mass Function of a Poisson distribution:

\begin{eqnarray*} P\left(X=k\right) & = & \frac{\lambda^{k}e^{-\lambda}}{k!}\end{eqnarray*}

Proving by computing values of Probability Mass Functions

We always want to double-check our proofs and derivations, so let’s first compute the Probability Mass Function of a Binomial RV for different values of n and then compare the results with a Poisson RV with the same rate.

from scipy.stats import binom, poisson
import matplotlib.pyplot as plt
n1 = 20
n2 = 60
n3 = 3600
λ = 9

x = np.arange(9)

X1 = binom.pmf(x, n1, λ/n1)
X2 = binom.pmf(x, n2, λ/n2)
X3 = binom.pmf(x, n3, λ/n3)

Z = poisson.pmf(x, 9)
print(f'P(X1<9) of a Binomial RV with n={n1}: ' + str(np.sum(X1)))
print(f'P(X2<9) of a Binomial RV with n={n2}: ' + str(np.sum(X2)))
print(f'P(X3<9) of a Binomial RV with n={n3}: ' + str(np.sum(X3)))
print(f'P(Z<9) of a Poisson RV with λ={λ}: ' + str(np.sum(Z)))
P(X1<9) of a Binomial RV with n=20: 0.41430623411611534
P(X2<9) of a Binomial RV with n=60: 0.44482436484547344
P(X3<9) of a Binomial RV with n=3600: 0.455487676774296
P(Z<9) of a Poisson RV with λ=9: 0.4556526...

Notice that our intuition was correct; as the value of n gets larger, we indeed approximate the probability given by a Poisson RV.

Proving by sampling

We can do the same exercise by sampling from the RVs that we defined above and then computing the probabilities.

X1 = binom.rvs(n=n1, p=λ/n1, size=100000)
X2 = binom.rvs(n=n2, p=λ/n2, size=100000)
X3 = binom.rvs(n=n3, p=λ/n3, size=100000)

Z = poisson.rvs(9, size=100000)
print(f'P(X1<9) of a Binomial RV with n={n1}: ' + str(X1[X1<9].shape[0]/X1.shape[0]))
print(f'P(X2<9) of a Binomial RV with n={n2}: ' + str(X2[X2<9].shape[0]/X2.shape[0]))
print(f'P(X3<9) of a Binomial RV with n={n3}: ' + str(X3[X3<9].shape[0]/X3.shape[0]))
print(f'P(Z<9) of a Poisson RV with λ={λ}: ' + str(Z[Z<9].shape[0]/Z.shape[0]))
P(X1<9) of a Binomial RV with n=20: 0.41375
P(X2<9) of a Binomial RV with n=60: 0.44721
P(X3<9) of a Binomial RV with n=3600: 0.45365
P(Z<9) of a Poisson RV with λ=9: 0.45409

We can also plot the sampled distribution of each RV; notice how the approximation when n=3600 is actually a very good one.

_, ax = plt.subplots(2, 2, figsize=(25, 10), sharey=True, sharex=True)
ax = ax.ravel()

λ = 9

ax[0].hist(binom.rvs(n=20, p=λ/20, size=100000), bins=50)
ax[0].set_title(f'Binomial RV with n=20, p={λ/20:.3f}')
ax[1].hist(binom.rvs(n=60, p=λ/60, size=100000), bins=50, color='darkorange')
ax[1].set_title(f'Binomial RV with n=60, p={λ/60:.3f}')
ax[2].hist(binom.rvs(n=3600, p=λ/3600, size=100000), bins=50, color='black')
ax[2].set_title(f'Binomial RV with n=3600, p={λ/3600:.4f}')
ax[3].hist(poisson.rvs(λ, size=100000), bins=50, color='red')
ax[3].set_title(f'Poisson RV with λ={λ}');

Figure 2: Approximating a Poisson RV using a Binomial RV with different values of n; notice that as n gets larger, the shape of the sampling distribution of the Binomial RV gets closer to the sampling distribution of the Poisson RV.

Exercises

You will get the solutions in next week’s article.

  1. You have a standard deck of cards, and you are picking cards until you get a Queen (you replace the cards if they are not Queens). What is the probability that you need to pick 5 cards? And less than 10? And more than 12?
  2. Jorge conducts inspections on freezers. He finds that 94% of the freezers successfully pass the inspection. Let C be the number of freezers Jorge inspects until a freezer fails an inspection. Assume that the results of each inspection are independent.
  3. Pedro makes 25% of the free kicks he attempts. For a warm-up, Pedro likes to shoot free kicks until he makes one. Let M be the number of shots it takes Pedro to make his first free kick. Assume that the results of each shot are independent. Find the probability that it takes Pedro fewer than 4 attempts to make his first shot.
  4. Build a function that computes the Poisson PMF without using any functions from external packages besides factorial from math. Choose some parameters and compare your result with the pmf function from scipy.stats.poisson.
  5. Build a function that computes the Poisson CDF without using any external package. Choose some parameters and compare your result with the cdf function from scipy.stats.poisson.

Answers from last week

def fact(n):
    x = 1
    for i in range(1, n+1):
        x *= i
    return x

def comb(n, k):
    return fact(n)/(fact(k)*fact(n-k))

def binompmf(prob, n, k):
    return comb(n, k)*prob**k*(1-prob)**(n-k)

def binomcdf(prob, n, x):
    result = 0
    for x_ in range(0, x+1):
        result += binompmf(prob, n, x_)
    return result

1. A company produces bottles of water. In its main factory, the number of defective bottles is 5%. A quality check consists of randomly selecting and testing 1000 bottles. What are the mean and standard deviation of the number of defective bottles in these samples?

print('μ_x={}'.format(1000*0.05))
print('σ_x={}'.format((1000*0.05*(1-0.05))**(1/2)))
μ_x=50.0
σ_x=6.892024376045111
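As a cross-check, scipy.stats.binom reports the same mean and standard deviation:

```python
from scipy.stats import binom

# mean and variance of Binomial(n=1000, p=0.05)
mu, var = binom.stats(n=1000, p=0.05)

print('μ_x={}'.format(float(mu)))        # 50.0
print('σ_x={}'.format(float(var)**0.5))  # ≈ 6.892
```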

2. A wine company is running a promotion that says 1-in-4 boxes of the wine contain a surprise. Suppose that you will buy 5 boxes of this wine, and let X represent the number of surprises you can win in these boxes. Assume that these boxes represent a random sample and assume that surprises are independent between boxes. What is the probability that you win at most 1 surprise in the 5 boxes?

(Can you solve this problem in 2 different ways? Tip: by sampling and using the theoretical derivation that we did above)