Learn practical skills, build real-world projects, and advance your career

MA2102

Probability and Statistics

Lecture-6

Say there is a school with nn students and you are asked to write a (database)software to maintain student records, and one should be able to retrieve student record quickly(almost in constant time) using your software. To achieve that fast access you should be using data structure called Hash table to store student records. Hash tables are array like data structures in which particular record gets stored in at particular index which is determined through hashing using some hash function. Here is process of getting index of record. Fist we feed key of the record to hash function which spits the number called hash code then this hash code may be converted to index(by taking modulo of size of array so that index fall in the range of array indices).

Suppose our keys are tuple (FullName,DOB). Say the following hash functions is in your mind;

(Name,DOB)hashfunction\overset{hash function}{\rightarrow} dd where dd is day of the years i.e d{1,2,3,...365}d\in\{1,2,3,...365\}(Assuming nobody born in leap year)


So our hash function maps student keys to {1,2,...,365}. This hash code can be used as index to store records in hash table. One problem with hashing is collisions (two students may have born on the same day of the year), one way to deal with collisions is chaining(keeping the linked list of records(instead of single record) who have born on same day of the year at their corresponding index). Let us assume that our hash function uniformly distribute keys over set of index values{1,2,...,365} so that we will not have clusters. Still we would like to have less collisions to reduce the average time to retrieve a record. For that we can't use too many students records to store in the hash table as we have only 365 slots in this set up. And less than 365 students doesn't mean that we will not have any collisions though our hash function uniformly distribute the keys over indices, But we can feel that more students we have more chances of collisions (so more number of collisions) that will increase average time to retrieve a record. But What is the trade-off between the number of students that we can store and probability of collision in the hash table?

Birthday Problem(Paradox):

If $n$ people are present in a room. What is the probability at least two of them celebrate birthday on the same day of the year

solution : Let us assume nobody born on leap year.
case i: n>365n>365
In this case the probability at least two people having same birthday is 1 (\because Pigeon hole principle)
case ii: n365n\le 365
Let us assume that a person could be born any day of the year with same probability( It is not true as there are seasonal effects, that means in some seasons more marriages takes place,so after nine months of that season more babies are born) But with this equally likely assumption the problem will become easy and the answer we get also very reasonable to reality.
Let us observe birthdays (DD/MM only) of a random class with n people.
Sample space Ω={b1b2b3...bn:bi{1,2,3...365}}\Omega=\{b_{1}b_{2}b_{3}...b_{n}:b_{i}\in \{1,2,3...365\}\}
(Here 1:= 0/1/01, 2:= 02/01,.....365:=31/12)
Ω=365n|\Omega|=365^{n} (\because outcomes in Ω\Omega are n-permutation of 365 objects with repetition )
Let EE denote the event that at least two people in class having same birthday, and P(E)=1P(Ec)P(E)=1-P(E^{c})
where EcE^{c} will denote the event that no two people in the class having same birthday
Ec= 365Pn|E^{c}|=~^{365}P_{n} (\because outcomes in EcE^{c} are n-permutation of 365 objects without repetition )
now, P(Ec)=EcΩ=365Pn365nP(E^{c})=\frac{|E^{c}|}{|\Omega|}=\frac{^{365}P_{n}}{365^{n}} (\because we assumed outcomes are equally likely )
P(E)=1365Pn365n\therefore P(E)= 1- \frac{^{365}P_{n}}{365^{n}}

"""Birth day Paradox"""
import matplotlib.pyplot as plt
from ipywidgets import interact #,interactive, fixed, interact_manual
import ipywidgets as widgets
from IPython.display import display
%matplotlib nbagg

def npr(n,r): #npr=n(n-1)(n-2)...(n-(r-1))
    p=1
    for i in range(0,r):
        p=p*(n-i)
    return p

def birthdayProblem(n):
    if n > 365:
        return 1
    else:
        return round(1-npr(365,n)/365**n,4)
def n_value_changed(change):
    prob.value = birthdayProblem(change.new)
    prob.description = "p={}".format(round(prob.value,6))

numberOfpeople=widgets.IntSlider(min=0.0,max=100,value=0,description="n:")
prob = widgets.FloatProgress(min=0.0,max=1.0,step=0.01,value=0.0,description="p={}".format(0),bar_style='info',orientation='vertical')

numberOfpeople.observe(n_value_changed, 'value')
widgets.HBox([numberOfpeople, prob])
HBox(children=(IntSlider(value=0, description='n:'), FloatProgress(value=0.0, bar_style='info', description='p…

Table:

np
230.5073
300.7063
400.8912
500.9704
600.9941
700.9992