MA2102
Probability and Statistics
Lecture-6
Say there is a school with students and you are asked to write a (database)software to maintain student records, and one should be able to retrieve student record quickly(almost in constant time) using your software. To achieve that fast access you should be using data structure called Hash table to store student records. Hash tables are array like data structures in which particular record gets stored in at particular index which is determined through hashing using some hash function. Here is process of getting index of record. Fist we feed key of the record to hash function which spits the number called hash code then this hash code may be converted to index(by taking modulo of size of array so that index fall in the range of array indices).
Suppose our keys are tuple (FullName,DOB). Say the following hash functions is in your mind;
(Name,DOB) where is day of the years i.e (Assuming nobody born in leap year)
So our hash function maps student keys to {1,2,...,365}. This hash code can be used as index to store records in hash table. One problem with hashing is collisions (two students may have born on the same day of the year), one way to deal with collisions is chaining(keeping the linked list of records(instead of single record) who have born on same day of the year at their corresponding index). Let us assume that our hash function uniformly distribute keys over set of index values{1,2,...,365} so that we will not have clusters. Still we would like to have less collisions to reduce the average time to retrieve a record. For that we can't use too many students records to store in the hash table as we have only 365 slots in this set up. And less than 365 students doesn't mean that we will not have any collisions though our hash function uniformly distribute the keys over indices, But we can feel that more students we have more chances of collisions (so more number of collisions) that will increase average time to retrieve a record. But What is the trade-off between the number of students that we can store and probability of collision in the hash table?
Birthday Problem(Paradox):
If $n$ people are present in a room. What is the probability at least two of them celebrate birthday on the same day of the year
solution : Let us assume nobody born on leap year.
case i:
In this case the probability at least two people having same birthday is 1 ( Pigeon hole principle)
case ii:
Let us assume that a person could be born any day of the year with same probability( It is not true as there are seasonal effects, that means in some seasons more marriages takes place,so after nine months of that season more babies are born) But with this equally likely assumption the problem will become easy and the answer we get also very reasonable to reality.
Let us observe birthdays (DD/MM only) of a random class with n people.
Sample space
(Here 1:= 0/1/01, 2:= 02/01,.....365:=31/12)
( outcomes in are n-permutation of 365 objects with repetition )
Let denote the event that at least two people in class having same birthday, and
where will denote the event that no two people in the class having same birthday
( outcomes in are n-permutation of 365 objects without repetition )
now, ( we assumed outcomes are equally likely )
"""Birth day Paradox"""
import matplotlib.pyplot as plt
from ipywidgets import interact #,interactive, fixed, interact_manual
import ipywidgets as widgets
from IPython.display import display
%matplotlib nbagg
def npr(n,r): #npr=n(n-1)(n-2)...(n-(r-1))
p=1
for i in range(0,r):
p=p*(n-i)
return p
def birthdayProblem(n):
if n > 365:
return 1
else:
return round(1-npr(365,n)/365**n,4)
def n_value_changed(change):
prob.value = birthdayProblem(change.new)
prob.description = "p={}".format(round(prob.value,6))
numberOfpeople=widgets.IntSlider(min=0.0,max=100,value=0,description="n:")
prob = widgets.FloatProgress(min=0.0,max=1.0,step=0.01,value=0.0,description="p={}".format(0),bar_style='info',orientation='vertical')
numberOfpeople.observe(n_value_changed, 'value')
widgets.HBox([numberOfpeople, prob])
HBox(children=(IntSlider(value=0, description='n:'), FloatProgress(value=0.0, bar_style='info', description='p…
Table:
n | p |
---|---|
23 | 0.5073 |
30 | 0.7063 |
40 | 0.8912 |
50 | 0.9704 |
60 | 0.9941 |
70 | 0.9992 |