Aviation incidents in Canada: 80 years of data


As stated by the International Civil Aviation Organization (ICAO), safety is the highest priority of all involved in aviation. The shared goal is for every flight to take-off and land safely, as happens more than 126,000 times every day. In 2018, the fatal accident rate was 0.28 per 1 million flights, the equivalent of one fatal accident for every 4.2 million flights.

The aviation industry is a complex collaboration of multiple fields, from manufacturers to commercial airlines. The air traffic is regulated and managed by different agencies, and service providers.

In Canada, the aviation industry is regulated by Transport Canada. Air traffic services are provided by Nav Canada.

Data collection and analysis is a key factor in the safety management systems. It helps investigations and defines efficient regulations and procedures.

Content of this notebook

In this notebook, we will analyze aviation occurrence data provided by Transport Canada. We will download the data, and after some cleaning and organizing, we will show plots of multiple aspects of these events.

We will investigate the following points:

  • Where did the incidents happen?
  • What are the principal causes of aviation events?
  • When did the incidents happen?
  • What are the categories and consequences of accidents and incidents?
  • What is the impact of environmental factors on aviation safety?
  • Which flight phases are more dangerous?
  • Which aircraft are more likely to be involved in events?
  • How do flight plans, search and rescue operations impact aviation safety?

Before diving into the data, let's define some terminology:

  • Occurrence: Any event which is irregular, unplanned, or non-routine, including any aircraft accident, incident, or other occurrences.

  • Accident: An occurrence associated with the operation of an aircraft which takes place between the time any person boards the aircraft with the intention of flight until all such persons have disembarked, in which:
    1- a person is fatally or seriously injured
    2- the aircraft sustains damage or structural failure
    3- the aircraft is missing or is completely inaccessible.

  • INCIDENT: An occurrence, other than an accident, associated with the operation of an aircraft that affects or could affect the safety of operation.

  • SERIOUS INCIDENT: An incident involving circumstances indicating that an accident nearly occurred.

# importing python libraries
import urllib.request
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib
import seaborn as sns
import os
import numpy as np
import plotly.express as px
# Setting plots parameters
%matplotlib inline
matplotlib.rcParams['font.size'] = 18
matplotlib.rcParams['figure.figsize'] = (18, 10)
matplotlib.rcParams['figure.facecolor'] = '#00000000'
import warnings

Datasets used in this study

The data is divided into five (5) data frames:

  • Occurrence table: This table contains data on the occurrence summary, including the date, time, and location of the occurrence, the occurrence type and category, the occurrence classification, the aircraft involved, the number of injuries/or fatalities, the weather conditions, and data relating to the landing and takeoff aerodrome or operating surface.
  • Aircraft table: This table contains data about the involved aircraft, including its type, make, model, registration, and country of registration, aircraft’s engine(s), propellers, and rotors, data relating to an explosion, fire, fumes and/or smoke, operator information including the type of operator, type of flight plan, flight number, departure and destination, and air traffic service involvement.
  • Injuries table: This table contains data on the number and severity of injuries resulting from the occurrence.
  • Events and phases table: This table contains data about the phases of the occurrence flight and the events during the flight.
  • Survivability table: This table contains data relating to the evacuation of the occurrence aircraft, the effectiveness of survival devices, and the systems for locating the occurrence aircraft.

All these tables are available as csv files. data_dictionary table contains the definitions for each column in the data frames.

1st step: download the datasets

We use the urllib library to download the csv files. We start by defining the different URLs used in this project:

ref_url = 'https://www.bst-tsb.gc.ca/eng/stats/aviation/data-5.html'
data_dictionary_url = 'https://www.bst-tsb.gc.ca/eng/stats/aviation/csv/ASISdb-dd.csv'
occurence_table_url = 'https://www.bst-tsb.gc.ca/includes/stats/csv/Air/ASISdb_MDOTW_VW_OCCURRENCE_PUBLIC.csv'
aircraft_table_url = 'https://www.bst-tsb.gc.ca/includes/stats/csv/Air/ASISdb_MDOTW_VW_AIRCRAFT_PUBLIC.csv'
injuries_table_url = 'https://www.bst-tsb.gc.ca/includes/stats/csv/Air/ASISdb_MDOTW_VW_INJURIES_PUBLIC.csv'
events_phases_table_url = 'https://www.bst-tsb.gc.ca/includes/stats/csv/Air/ASISdb_MDOTW_VW_EVENTS_AND_PHASES_PUBLIC.csv'
survivability_table_url = 'https://www.bst-tsb.gc.ca/includes/stats/csv/Air/ASISdb_MDOTW_VW_SURVIVABILITY_PUBLIC.csv'

In the next section, we download the data and tackle some encoding errors.

files = ['data_dict.csv','occurence.csv','aircraft.csv','injuries.csv','events_phases.csv','survivability.csv']
urls = [data_dictionary_url,occurence_table_url,aircraft_table_url,injuries_table_url,events_phases_table_url,survivability_table_url]
encoding = 'ISO-8859-1'
for filename,url in zip(files,urls):
    if filename not in os.listdir():
        print('Downloading', filename)
        urllib.request.urlretrieve(url, filename)
        # in order to avoid formatting errors, we have to replace all "–" by "-"
    print('formatting ',filename)    
    # Read in the file
    with open(filename, 'r', encoding=encoding) as file:
        filedata = file.read()

    #Replace the target string
    filedata = filedata.replace('â\x80\x93', '-')

    # Write the file out again
    with open(filename, 'w', encoding=encoding) as file:
formatting data_dict.csv formatting occurence.csv formatting aircraft.csv formatting injuries.csv formatting events_phases.csv formatting survivability.csv

The first csv we download describes the content of each column of the data frames.

data_dict_df = pd.read_csv('data_dict.csv',encoding = encoding)
print('Total number of columns',data_dict_df.shape[0])
Total number of columns 514

After looking into data_dict_df, we have decided to keep the following columns for this study:

occurence_sel_columns = ['OccNo','CountryID_DisplayEng',
'ICAO_DisplayEng', 'OccDate', 'OccRegionID_DisplayEng', 'OccTime', 'OccTypeID_DisplayEng',
'InjuriesEnum_DisplayEng', 'LightCondID_DisplayEng','SkyCondID_DisplayEng', 
'GeneralWeatherID_DisplayEng', 'VisibilyCeiling', 'Visibility', 'WeatherPhenomenaTypeID_DisplayEng', 'Summary',
'RunwaySurfaceID_DisplayEng', 'SurfaceContaminationID_DisplayEng','TimeZoneID_DisplayEng']
aircraft_sel_columns = ['OccNo','AircraftTypeID_DisplayEng', 'AircraftMakeID_DisplayEng',
'OperatorTypeID_DisplayEng', 'YearOfManuf','DamageLevelID_DisplayEng', 'FlightPlanTypeID_DisplayEng', 'AircraftModelID_DisplayEng', 'OrganizationID_DisplayEng']
survivability_sel_columns = ['OccNo', 'SurvivableEnum_DisplayEng', 'AircraftEvacTime', 'EvacEgressIssueEnum_DisplayEng', 'SurvEquiID_DisplayEng', 'EquipEffReasonID_DisplayEng', 'EquipEffectiveEnum_DisplayEng','EquipInfluenceEnum_DisplayEng', 'EvacHamperedID_DisplayEng']
injuries_sel_columns=['OccNo'] # place holder for future work

Using data_dict_df data frame, We can generate html that contains descriptions for each selected column. This will be added as a reference at the end of the notebook.

# uncomment this to generate html.
#for column in set(occurence_sel_columns + aircraft_sel_columns +injuries_sel_columns + events_phases_sel_columns + survivability_sel_columns) :
#    print('**'+column+'** ',':', data_dict_df[data_dict_df['Column name']==column]['Description English'].values[0],'<br>')

The next step is to load data using pandas library.

occurence_df = pd.read_csv('occurence.csv',encoding = encoding,usecols=occurence_sel_columns)
aircraft_df = pd.read_csv('aircraft.csv',encoding = encoding, usecols=aircraft_sel_columns)
injuries_df = pd.read_csv('injuries.csv', encoding=encoding, usecols=injuries_sel_columns)
events_phases_df = pd.read_csv('events_phases.csv', encoding=encoding, usecols=events_phases_sel_columns)
survivability_df = pd.read_csv('survivability.csv', encoding=encoding, usecols=survivability_sel_columns)

Occurrences have a unique number, stored in the column OccNo. This column is shared by all files and will be used as a reference. Some occurrences appeared multiple times, to report different pieces of information. We start first by creating two versions of the tables, a full version that contains duplicates, and a clean version without duplicates:

# remove duplicates from dataframe
occurence_df_full = occurence_df.copy()
occurence_df.drop_duplicates('OccNo', inplace=True)
survivability_df_full = survivability_df.copy()
survivability_df.drop_duplicates('OccNo', inplace=True)
aircraft_df_full = aircraft_df.copy()
aircraft_df.drop_duplicates('OccNo', inplace=True)
injuries_df_full = injuries_df.copy()
injuries_df.drop_duplicates('OccNo', inplace=True)
events_phases_df_full = events_phases_df.copy()
events_phases_df.drop_duplicates('OccNo', inplace=True)

The first information I was curious about was how many unique occurrences are presented in the datasets?

print('Number of aviation occurrences',occurence_df.shape[0])
Number of aviation occurrences 45277

Where did the incidents happen?

Aviation occurrences are classified by geographic area:

This plot will show the number of occurrences reported in every region:
matplotlib.rcParams['figure.figsize'] = (7,57)
matplotlib.rcParams['font.size'] = 14
column = 'OccRegionID_DisplayEng'
graph =  occurence_df.groupby(column)['OccNo'].count().sort_values(ascending=False).head(7);
fig = graph.plot.pie(autopct="%.1f%%");
fig.set(ylabel = '',Title='Breakdown of aviation occurences by region');

More events were reported in ONTARIO compared to the rest of Canada. This is mainly due to the higher traffic density in this province, which is the most populous province in Canada. FOREIGN region refers to occurrences that happened outside Canada, involving Canadian aircraft (manufactured, operated, etc). Let's look at the countries where occurrences were reported:

matplotlib.rcParams['figure.figsize'] = (10, 8)
matplotlib.rcParams['font.size'] = 19
column = 'CountryID_DisplayEng'
graph =  occurence_df.groupby(column)['OccNo'].count().sort_values(ascending=False).head(20)
fig = sns.barplot(graph.values, graph.index);
fig.set(xlabel = 'Number of occurrences [log scale] ', ylabel = '', Title='Number of events reported in each country');
canada_count = occurence_df[occurence_df[column]=='CANADA']['OccNo'].count()
total_count = occurence_df['OccNo'].count()
print('{} events happend in Canada out of {} occurrences ({:.2f}%).'.format(canada_count, total_count,100*canada_count/total_count))
41834 events happend in Canada out of 45277 occurrences (92.40%).

Most of the events (92%) happened in Canada.

What are the principal causes of aviation events?

Each event is categorized by ICAO, according to its principal cause. The following plot shows the breakdown of events:

matplotlib.rcParams['figure.figsize'] = (10, 10)
column = 'ICAO_DisplayEng'
graph =  occurence_df_full.groupby(column)['OccNo'].count().sort_values(ascending=False).head(10)
fig = sns.barplot(graph.values,graph.index);
fig.set(xlabel = 'Number of occurrences', ylabel = '', Title='Events reported in each ICAO category');

Note that the incident category data is missing in most of the records:

print(occurence_df_full['ICAO_DisplayEng'].isna().sum(),'missing records out of {}.'.format(total_count))
41703 missing records out of 45277.

Here is a breakdown of the top-10 occurrence categories (in %):

print('Events categories (%)')

System failures (related or not to engines) are the most common causes of occurrences (about 50% of events).

When did the incidents happen?

Date information requires some transformations to extract interesting insights.

occurence_df['OccDate'] = pd.to_datetime(occurence_df['OccDate'])
occurence_df['OccYear'] = occurence_df['OccDate'].dt.year
occurence_df['OccMonth'] = occurence_df['OccDate'].dt.month
occurence_df['OccWeekday'] = occurence_df['OccDate'].dt.weekday

First, we display the number of events reported every year.

matplotlib.rcParams['figure.figsize'] = (10, 7)
column = 'OccYear'
data = occurence_df[occurence_df[column]>1965].dropna(subset=[column])
data2= data.groupby(column)['OccNo'].count()
plt.fill_between(data2.index, data2.values);
plt.ylabel('Number of occurrences');
plt.title('Number of occurrences reported each year');

We have excluded events before 1965 due to the low number of reported events. The number of occurrences seems to be steady since the 90s, with a slight tendency to decrease. An interesting comparison would include the total number of aircraft movements. Using Statcan website, we can observe that the number of movements is not increasing (at least since 1997, if we exclude 2020 and 2021 due to the covid-19 crisis).

#ref: StatCan website : https://www150.statcan.gc.ca/t1/tbl1/en/tv.action?pid=2310001801&pickMembers%5B0%5D=1.1&cubeTimeFrame.startYear=1997&cubeTimeFrame.endYear=2018&referencePeriods=19970101%2C20180101
total_traffic_per_month = [295073,337146,380552,442854,450516,464973,510697,458663,447297,456912,407902,344482,325659,380045,417183,482710,491502,464962,533190,489821,463890,465243,400306,342608,336128,366646,452346,472217,502205,479946,519847,498412,477100,469594,422355,331220,345760,381111,437902,420143,454782,472125,488821,469541,446128,456177,398326,308714,367114,335254,418483,424343,460960,465684,507708,478667,398620,423309,387907,316658,322265,326231,362693,405125,433683,435156,468725,452305,424170,412831,363115,301364,307702,314836,365905,378116,417552,418059,452675,407577,402467,384931,350045,298217,267541,342664,363529,378121,382597,413927,418497,403773,398083,366234,352016,266931,283328,319662,358132,376230,407889,390620,427859,411896,396070,374974,319636,282289,291797,300182,366577,382264,400425,409285,441021,440333,400522,385201,343793,314044,323338,316990,394955,407534,473718,449994,490882,473606,451913,448830,395144,305836,336956,346748,402155,442058,467391,443741,492754,459748,467978,449438,358436,266203,289420,339228,394607,416342,447574,444682,462781,434142,423997,379867,363273,286269,321061,304147,413944,398521,437358,419919,473132,432291,386163,409364,348383,277661,290037,291483,366283,371370,404122,417210,445231,428915,404187,391091,351734,296863,287259,331536,366189,370297,421938,397728,447536,421029,400788,355873,355214,266787,293116,295953,363491,366338,396742,395670,459494,419687,387989,392543,341189,259238,280830,297214,352957,374535,416872,408643,454126,416991,397643,374009,321259,290319,294544,290853,371033,370288,408358,416629,451495,406442,399377,385502,338086,270894,299222,308524,360373,396811,410145,407532,445571,432552,405965,364946,343035,267840,304166,291847,344150,367705,414790,425777,474985,440303,400950,383760,336210,290225,290320,298983,373493,371989,454638,439791,475931,436955,412189,405096,348431,319111,310268,290654,412066,391156,450495,450060,491377,456956,412919,419319,352500,302368,294785,339424,285335,91600,157118,244049,304795,324274,282202,283235,258053,205465,204773,194127]
years = 1997+(np.arange(0,len(total_traffic_per_month))/12)
matplotlib.rcParams['figure.figsize'] = (10, 7);
plt.plot(years, total_traffic_per_month,linewidth=4);
plt.ylabel('Number of Aircraft movements');

Note that this graph only shows the number of movements at airports and not the number of passengers. Using the two plots, we could infer that the number of events is somewhat constant concerning traffic density.

In the following plot we answer the question: are there any periods of the year where more events take place?

matplotlib.rcParams['figure.figsize'] = (10, 5)
column = 'OccMonth'
fig = sns.countplot(x=column, data = occurence_df.dropna(subset=[column]));
fig.set(xlabel = '', ylabel = 'Number of occurrences', Title='Monthly breakdown of aviation occurrences');
months = ['Jan','Fev','Mar','Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec']

And the answer is Yes! It turns out that more events happen during the summer months. This is due to a higher density of traffic during summer.

Can we have similar deductions if we consider weekdays?

matplotlib.rcParams['figure.figsize'] = (10, 5)
column = 'OccWeekday'
fig = sns.countplot(x=column, data = occurence_df.dropna(subset=[column]));
fig.set(xlabel = '', ylabel = 'Number of occurrences', Title='Breakdown of aviation occurrences by weekdays');
days = ['Mon','Tue','Wed','Thu','Fri','Sat','Sun']

The result shows that there is no obvious correlation between the number of events and weekdays. Later on, we will show that some categories of incidents happen more frequently during weekends.

In aviation, time is provided in UTC (universal coordinated time or GMT) standard format. In our dataset, some of the events are not reported in UTC, but in local time. This can be useful to have an intuition about the time of the day (day/night). Let's take a look at the breakdown of these occurrences.

occurence_df['OccTime'] = pd.to_datetime(occurence_df['OccTime'])
occurence_df['OccHour'] = occurence_df['OccTime'].dt.hour
matplotlib.rcParams['figure.figsize'] = (15, 5)
column = 'OccHour'
fig = sns.countplot(x=column, data = occurence_df[occurence_df['TimeZoneID_DisplayEng'] !='UTC'].dropna(subset=[column]));
fig.set(xlabel = 'Local hour', ylabel = 'Number of occurrences', Title='Hourly breakdown of aviation occurrences (local time)');

It comes out that most occurrences happen during the daytime. Traffic density during the day is higher, and the risk of incidents is increased.

Accidents and incidents

In this part, we look at the incident/accident repartition.

matplotlib.rcParams['figure.figsize'] = (10, 1.5)
column = 'OccTypeID_DisplayEng'
fig = sns.countplot(y=column, data = occurence_df.dropna(subset=[column]));
fig.set(xlabel = 'Number of occurrences', ylabel = '', Title='Breakdown of aviation occurrences by event category');

Statistically, there are more incidents (not involving fatal injuries, or aircraft damage) than accidents. With that being said, about 40% of events are categorized as accidents. Let's see how this rate is evolving with time.

matplotlib.rcParams['figure.figsize'] = (10, 5)
column = 'OccYear'
sub_occ1 = occurence_df[occurence_df['OccTypeID_DisplayEng']=='ACCIDENT']
sub_occ2 = occurence_df[occurence_df['OccTypeID_DisplayEng']=='INCIDENT REPORTABLE']

data = sub_occ1[sub_occ1[column]>1980].dropna(subset=[column])
data2 = data.groupby(column)['OccNo'].count()

data3 = sub_occ2[sub_occ2[column]>1980].dropna(subset=[column])
data4 = data3.groupby(column)['OccNo'].count()

plt.plot(data2.index, data2.values, label='ACCIDENT',linewidth=6);
plt.plot(data4.index, data4.values, label = 'INCIDENT',linewidth=6);

plt.ylabel('Number of occurrences');
plt.title('Accidents and Incidents');

The result is encouraging! the rate of accidents among the reported events is decreasing. This means that there are fewer fatal injuries and substantial damages.

Let's focus now on the accidents, and plot the number of deaths reported every year:

matplotlib.rcParams['figure.figsize'] = (10, 5)
column = 'OccYear'
data = occurence_df[occurence_df[column]>1965].dropna(subset=[column])
data2= data.groupby(column)['TotalFatalCount'].sum()
plt.fill_between(data2.index, data2.values);
plt.title('Number of fatal injuries reported each year');

The general trend is downward (which is good news). There are still some peaks in this graph that require more investigation. Those values are due to crashes that caused a high number of fatal injuries. Let's look for more details by searching for the worst accidents, by death count:

# 10 worst accidents, by death count
data = occurence_df.sort_values('TotalFatalCount',ascending=False)[['OccDate','TotalFatalCount','OccNo','CountryID_DisplayEng']].head(10)
OccNos = data['OccNo'].values
for OccNo in OccNos:
    print(data[data['OccNo']==OccNo].values, aircraft_df[aircraft_df['OccNo']==OccNo][['AircraftModelID_DisplayEng','OrganizationID_DisplayEng']].values)
['OccDate', 'TotalFatalCount', 'OccNo', 'CountryID_DisplayEng'] [AircraftModelID_DisplayEng OrganizationID_DisplayEng] [[Timestamp('1991-07-11 00:00:00') 261 'A91F0011' 'SAUDI ARABIA']] [['DC-8-61' 'NATIONAIR (NOLISAIR)']] [[Timestamp('1985-12-12 00:00:00') 256 'A85H0902' 'CANADA']] [['DC-8-63' 'ARROW AIR']] [[Timestamp('1998-09-02 00:00:00') 229 'A98H0003' 'CANADA']] [['MD-11' 'SWISSAIR']] [[Timestamp('1976-01-01 00:00:00') 62 'A76P7962' 'CANADA']] [['DC-4M2' nan]] [[Timestamp('1989-09-08 00:00:00') 55 'A89H0011' 'DENMARK']] [['CV580' 'PARTNAIR']] [[Timestamp('1978-02-11 00:00:00') 43 'A78H0001' 'CANADA']] [['737-200' 'PACIFIC WESTERN AIRLINES']] [[Timestamp('1989-03-10 00:00:00') 24 'A89C0048' 'CANADA']] [['F-28 MK 1000' 'AIR ONTARIO']] [[Timestamp('1983-06-02 00:00:00') 23 'A83F0006' 'UNITED STATES']] [['DC-9-32' 'AIR CANADA']] [[Timestamp('1990-04-18 00:00:00') 20 'A90H0004' 'PANAMA']] [['DHC-6-200' 'AEROPERLAS S.A.']] [[Timestamp('1990-09-11 00:00:00') 18 'A90A0214' 'INTERNATIONAL WATERS']] [['727-200' 'FAUCETT PERU']]

We can see that the top 3 accidents happened in the years 1985, 1991 and 1998. Let's get more information about these accidents.

### let's consider the top 3
dates = [1985,1991,1998]
fatal_accidents = []
for date in dates:

for fatal in fatal_accidents:
    OccNo = occurence_df[occurence_df['TotalFatalCount']==fatal]['OccNo']
    event = occurence_df[occurence_df['OccNo']==OccNo.values[0]]['Summary'].values[0]
    aircraft = aircraft_df[aircraft_df['OccNo']==OccNo.values[0]]['AircraftModelID_DisplayEng'].values[0]
    date = occurence_df[occurence_df['OccNo']==OccNo.values[0]]['OccDate'].values[0]
    print('Date: {} - {} deaths - Aircraft type: {}.\nEvent# {}\n'.format(str(date)[:10],fatal,

Date: 1985-12-12 - 256 deaths - Aircraft type: DC-8-63. Event# A85H0902: AIRCRAFT CRASHED ON TAKE-OFF, APPROXIMATELY ONE-HALF MILE BEYOND THE DEPARTURE END OF RUNWAY 22. THE AIRCRAFT WAS DESTROYED BY IMPACT FORCES AND A POST-CRASH FIRE. Date: 1991-07-11 - 261 deaths - Aircraft type: DC-8-61. Event# A91F0011: THE NATIONAR DC-8 WAS RETURNING TO LAND SHORTLY AFTER TAKE-OFF FROM KING ABDUL AZZIZ INTERNATIONAL AIRPORT IN SAUDI ARABIA AND CRASHED ABOUT ONE MILE ON FINAL APPROACH FOR RUNWAY 34C. THE AIRCRAFT WAS DESTROYED IN THE CRASH AND THERE WERE NO SURVIVORS.THE PRESIDENCY OF CIVIL AVIATION OF SAUDI ARABIA IS CONDUCTING THE INVESTIGATION. RON COLEMAN HAS BEEN APPOINTED CANADIAN ACCREDITED REPRESENTATIVE. Date: 1998-09-02 - 229 deaths - Aircraft type: MD-11. Event# A98H0003: At 0018 UTC, a Swissair MD-11 (SR 111) departed New York, USA, for Geneva, Switzerland, with 215 passengers and 14 crew members on board. About 53 minutes after departure, the pilots detected an odour in the cockpit. Within about three minutes they assessed there was smoke in the cockpit and made a "panpan" urgency radio transmission. They assessed that the smoke originated from an air conditioning source, and decided to divert to Halifax International Airport, about 68 nm away. While the pilots were preparing for the landing in Halifax, an on-board fire progressed unknown to them. About 11 minutes after the pilots initially detected the smoke, the situation in the cockpit began to deteriorate rapidly. The flight crew declared an emergency and requested an immediate landing in Halifax. Shortly thereafter, radio contact and primary radar contact with the aircraft was lost, when the aeroplane was about 10,000 feet asl. Six minutes later, the aircraft crashed into the ocean about 5 nm southwest of Peggy's Cove, Nova Scotia. The aircraft was destroyed by impact forces and there were no survivors. test

Statistically, those events affect significantly the properties of the dataset. For instance, here is a plot of average fatal injuries count for each month:

matplotlib.rcParams['figure.figsize'] = (10, 5)
data = occurence_df.groupby('OccMonth').TotalFatalCount.mean()
fig = sns.barplot(x = data.index, y = data.values);
fig.set(xlabel = '', ylabel = 'Average fatal injury count', Title='Monthly average fatal injury count breakdown');

We can see peaks in July, September, and December, where the three worst accidents happened.

We perform the same analysis to the serious injuries column. The results are shown in the following plot:

matplotlib.rcParams['figure.figsize'] = (10, 5)
column = 'OccYear'
data = occurence_df[occurence_df[column]>1965].dropna(subset=[column])
data2= data.groupby(column)['TotalSeriousCount'].sum()
plt.fill_between(data2.index, data2.values);
plt.title('Number of serious injuries reported each year');

New technologies and new regulations are improving the safety of air travel, in addition to more sophisticated search and rescue services. Interestingly, the worst 3 accidents we have shown earlier, didn't affect the number of serious injuries. Sadly, these catastrophic events have a very low count of survivors.

We observe a curious relationship between the average number of serious injuries and the weekdays:

matplotlib.rcParams['figure.figsize'] = (10, 7)
data = occurence_df.groupby('OccWeekday').TotalSeriousCount.mean()
fig = sns.barplot(x = data.index, y = data.values);
fig.set(xlabel = '', ylabel = '', Title='Average serious injury breakdown by weekdays');

It appears that more serious injuries happen on weekends. More investigation is required to determine the real reasons for such a trend.

Impact of environmental factors on aviation safety

Multiple environment variables could affect flight progress. The first parameter we consider is light condition.

matplotlib.rcParams['figure.figsize'] = (10, 5)
column = 'LightCondID_DisplayEng'
graph =  occurence_df.groupby(column)['OccNo'].count().sort_values(ascending=False)
fig = sns.barplot(graph.values, graph.index);
fig.set(xlabel = 'Number of occurrences', ylabel = '', Title='Light condition impact on aviation occurrences');

The data is missing in most of the rows. Based on what is available, most events occurred during daylight.
Let's consider sky conditions impact:

matplotlib.rcParams['figure.figsize'] = (10, 5)
column = 'SkyCondID_DisplayEng'
graph =  occurence_df.groupby(column)['OccNo'].count().sort_values(ascending=False)
fig = sns.barplot(graph.values, graph.index);
fig.set(xlabel = 'Number of occurrences', ylabel = '', Title='Breakdown of aviation occurrences by sky conditions');

Based on this data, it appears that occurrences happen more often in a clear sky.

Another weather-related parameter is the visual (VMC)/instrument (IMC) weather condition. According to Canadian regulations, this refers to visibility being above or less than 3 nautical miles. The definition can vary in other countries.

matplotlib.rcParams['figure.figsize'] = (10, 1.5)
column = 'GeneralWeatherID_DisplayEng'
fig= sns.countplot(y=column, data = occurence_df[occurence_df[column]!='UNKNOWN'].dropna(subset=[column]));
fig.set(xlabel = 'Events count', ylabel = '', Title='Breakdown of aviation occurrences by weather condition');

Events happen more often in IMC conditions, where visibility is low. In aviation, visibility is described using two parameters: visibility ceiling (vertical visibility) and visibility (horizontal visibility).

matplotlib.rcParams['figure.figsize'] = (10, 10)
data = occurence_df[occurence_df['VisibilyCeiling']<8000]
data = data[data['Visibility']<6]
data = data.groupby(['VisibilyCeiling','Visibility'])['OccNo'].count()
fig = sns.scatterplot(y ='VisibilyCeiling',x='Visibility', hue = data.values, data =data,s=100);
fig.set(ylabel = 'Vertical visibility (ft)', xlabel = 'Horizontal visibility (miles)', Title='Impact of visibility on aviation safety');
fig.legend(title='Number of reported events');

Aviation accidents and incidents are more likely to happen in low visibility conditions. These conditions are usually referred to as marginal and can change quickly. They affect mainly VFR (visual flight rules) aircraft that are not certified and equipped to fly using IFR (instrument flight rules).

Many other weather phenomena can affect safety as shown in the following plot.

matplotlib.rcParams['figure.figsize'] = (10, 5)
column = 'WeatherPhenomenaTypeID_DisplayEng'
graph =  occurence_df_full.groupby(column)['OccNo'].count().sort_values(ascending=False).head(7)
fig = sns.barplot(graph.values, graph.index);
fig.set(xlabel = 'Number of occurrences [log scale]', ylabel = '', Title='Impact of weather phenomena on flight safety');

The top 3 weather conditions affecting aviation are icing, obscuration, and precipitation. Turbulence is a very serious concern, especially when it is categorized as moderate or severe. Among the less recurrent but most dangerous phenomena are wind shear, Microburst, and lightning. These three conditions are difficult to forecast and can cause substantial damages in a short time.

Which flight phases are more dangerous?

The flight phases have a key influence on the outcome of events. Let's take a look at the breakdown of incidents by phase.

matplotlib.rcParams['figure.figsize'] = (10, 5)
column = 'PhaseID_DisplayEng'
graph =  events_phases_df.groupby(column)['OccNo'].count().sort_values(ascending=False).head(10)
fig = sns.barplot(graph.values, graph.index);
fig.set(xlabel = 'Events count', ylabel = '', Title='Breakdown of aviation events by phase of the flight');

It turns out that more events happened during the cruise phase. To have a better intuition about flight phase impact, we need to determine which events are more dangerous? (by dangerous, we mean more fatal injuries)

#we join the occurrence and events_phases datasets using the occurrence numbers as index.
new_dataset3 = occurence_df.set_index('OccNo').join(events_phases_df.set_index('OccNo'))
select_phases = graph.index
data = new_dataset3.loc[(new_dataset3['PhaseID_DisplayEng']==select_phases[0])|
                (new_dataset3['PhaseID_DisplayEng']==select_phases[9])].groupby('PhaseID_DisplayEng')                          ['TotalFatalCount'].mean()
matplotlib.rcParams['figure.figsize'] = (10, 5)
fig = sns.barplot( data.values,select_phases);
fig.set(xlabel = 'Average fatal injuries count', ylabel = '', Title='Impact of the phase of the flight on fatal injuries count');

The accidents causing more fatal injuries happen in general during the beginning of the flight (Initial climb) and the last minutes of the flight (Final approach, Landing roll).

When we consider takeoff and landing operations, it is clear that the runway plays a key role. Runway surface can be described using two parameters : surface condition and contamination.

matplotlib.rcParams['figure.figsize'] = (10, 2)
column = 'RunwaySurfaceID_DisplayEng'
graph =  occurence_df.groupby(column)['OccNo'].count().sort_values(ascending=False).head(5)
fig = sns.barplot(graph.values, graph.index);
fig.set(xlabel = 'Events count', ylabel = '', Title='Impact of runway condition on occurrences count');
matplotlib.rcParams['figure.figsize'] = (10, 5)
column = 'SurfaceContaminationID_DisplayEng'
graph =  occurence_df_full.groupby(column)['OccNo'].count().sort_values(ascending=False).head(9)
fig = sns.barplot(graph.values, graph.index);
fig.set(xlabel = 'Events count', ylabel = '', Title='Impact of runway contamination on aviation events');

Even if the contamination affects the runway operation, we can observe that a high number of events occur in bare and dry runways. In this case, human or equipment factors must be considered.

Which aircraft are more likely to be involved in events?

Let's see if a specific aircraft manufacturer is more represented in the incident/accident database:

matplotlib.rcParams['figure.figsize'] = (10, 7)
column = 'AircraftMakeID_DisplayEng'
graph =  aircraft_df.groupby(column)['OccNo'].count().sort_values(ascending=False).head(15)
fig = sns.barplot(graph.values, graph.index);
fig.set(xlabel = 'Events count', ylabel = '', Title='Breakdown of aviation events by aircraft manufacturer');

Cessna and Piper show among the top 5 manufacturers. Most of the aircraft produced by these two makes are light aircraft used for personal/tourism and training flights. Pilots flying these aircraft are less experienced on average. There are also many occurrences involving aircraft made by Boeing, Airbus, and De Havilland. These aircraft are the most commonly exploited by Canadian commercial operators.

Aviation events are not reported only for airplanes, but for all types of aircraft. Here is a breakdown of events by aircraft types:

matplotlib.rcParams['figure.figsize'] = (10, 3)
column = 'AircraftTypeID_DisplayEng'
graph =  aircraft_df.groupby(column)['OccNo'].count().sort_values(ascending=False).head(7)
fig = sns.barplot(graph.values, graph.index);
fig.set(xlabel = 'Events count [log scale]', ylabel = '', Title='Breakdown of aviation events by aircraft type');

Events related to airplanes are the top category. There are still a non-neglectable number of events that happened with helicopters, gliders, and even ballons!

Events can be classified as well, according to their operator type:

matplotlib.rcParams['figure.figsize'] = (10, 3)
column = 'OperatorTypeID_DisplayEng'
graph =  aircraft_df.groupby(column)['OccNo'].count().sort_values(ascending=False).head(5)
fig = sns.barplot(graph.values, graph.index);
fig.set(xlabel = 'Events count', ylabel = '', Title='Breakdown of aviation events by operator type');

Commercial flights report the highest number of events, due to more intense activity. Private flights include tourism and training activities.
Using the YearOfManuf column, we will compute the aircraft age at the time of the event. First, we have to prepare/clean the data.

def clean_YoM(txt):
    txt = str(txt)
    if len(txt)==4 and txt[0]!=' ':
        return int(txt)
    elif len(txt)==6:
        return int(txt[:-2])

column = 'YearOfManuf'
aircraft_df[column] = aircraft_df[column].apply(clean_YoM)
#we join the occurrence and aircraft datasets using the occurrence numbers.
new_dataset = aircraft_df.set_index('OccNo').join(occurence_df.set_index('OccNo'))
#compute the age
new_dataset['age'] = new_dataset['OccYear']-new_dataset['YearOfManuf']
matplotlib.rcParams['figure.figsize'] = (10, 5)
column = 'age'
fig = sns.histplot(x = column, data = new_dataset, bins=12,kde = True);
fig.set(xlabel = 'Aircraft age (years)', ylabel = 'Events count', Title='Breakdown of aviation events by aircraft age');

This histogram shows that -in general- incidents/accidents are not more frequent with old aircraft. This can be associated with two factors:

  • There are fewer old aircraft, so fewer events involving this category.
  • Airworthiness and strict maintenance programs allow keeping aircraft in a good and safe condition.

Not all occurrences lead to aircraft damage. The next plot shows a breakdown by damage level.

matplotlib.rcParams['figure.figsize'] = (10, 3)
column = 'DamageLevelID_DisplayEng'
graph =  aircraft_df.groupby(column)['OccNo'].count().sort_values(ascending=False)
fig = sns.barplot(graph.values, graph.index);
fig.set(xlabel = 'Events count', ylabel = '', Title='Breakdown of aviation events by aircraft damage level');

Most of the events caused no damage. It turns out that, if an aircraft is damaged during an event, it is more likely that the damage is substantial.

Flight plans, search and rescue operations, and aviation safety

Flight plans are documents filed by a pilot or flight dispatcher with the local Air Navigation Service Provider before departure which indicates the plane's planned route or flight path. Flight plans are obligatory in some cases (IFR, VFR cross border, etc). They contain important data about the flight originating airport, route, destination airport, altitude, and aircraft type. They include contact information and search and rescue time.

matplotlib.rcParams['figure.figsize'] = (10, 5)
column = 'FlightPlanTypeID_DisplayEng'
graph =  aircraft_df.groupby(column)['OccNo'].count().sort_values(ascending=False).head(10)
fig = sns.barplot(graph.values, graph.index);
fig.set(xlabel = 'Events count', ylabel = '', Title='Breakdown of aviation events by flight plan type');

The top-two categories of flights involved in events are :

  • flights without a flight plan,
  • flight notes, where the pilot is responsible for search and rescue

Furthermore, we can prove that the average fatal injuries count is lower when flight plans are available.

mean_fatal_no_fp = new_dataset.loc[(new_dataset['FlightPlanTypeID_DisplayEng']=='NONE')|(new_dataset['FlightPlanTypeID_DisplayEng']=='FLIGHT NOTE')]['TotalFatalCount'].mean()

mean_fatal_with_fp = new_dataset.loc[(new_dataset['FlightPlanTypeID_DisplayEng']!='NONE')&(new_dataset['FlightPlanTypeID_DisplayEng']!='FLIGHT NOTE')]['TotalFatalCount'].mean()

matplotlib.rcParams['figure.figsize'] = (10, 3)
fig = sns.barplot(['No FP / Flight note','Documented FP'],[mean_fatal_no_fp,mean_fatal_with_fp]);
fig.set(xlabel = '', ylabel = '', Title='Average fatal injuries count by Flight plan type (FP)');

During search and rescue operations, evacuation time plays a significant role. In the following plot, we show a breakdown of reported evacuation times.

matplotlib.rcParams['figure.figsize'] = (15, 5)
column = 'AircraftEvacTime'
data = survivability_df[survivability_df[column]<90].groupby(column).OccNo.count()
fig  = sns.barplot(x = data.index, y = data.values);
fig.set(xlabel = 'Evacuation time (minutes)', ylabel = 'Events count', Title='Breakdown of aviation events by evacuation time');
fig.set_xticklabels([int(x) for x in data.index]);

Let's look at the impact of evacuation time on fatal injuries count.

new_dataset2 = occurence_df.set_index('OccNo').join(survivability_df.set_index('OccNo'))
new_dataset2['AircraftEvacTimeCat'] = pd.cut(new_dataset2['AircraftEvacTime'], bins = [0,5,15,30,np.inf])

matplotlib.rcParams['figure.figsize'] = (10, 3)
column = 'AircraftEvacTimeCat'
data = new_dataset2.groupby(column)['TotalFatalCount'].sum()
fig = sns.barplot(x = data.index, y = data.values);
fig.set(xlabel = 'Evacuation time', ylabel = 'Total fatal injuries', Title='Impact of evacuation time on fatal injuries');
fig.set_xticklabels(['less than 5min','5 to 15 min','15 to 30 min', 'more than 30 min']);

As we might expect, more fatal injuries happen in situations where it takes longer to proceed with evacuation.
Evacuation can be delayed due to multiple reasons, in addition to search time.

matplotlib.rcParams['figure.figsize'] = (10, 5)
column = 'EvacHamperedID_DisplayEng'
graph =  survivability_df.groupby(column)['OccNo'].count().sort_values(ascending=False).head(10)
fig = sns.barplot(graph.values/sum(graph)*100, graph.index);
fig.set(xlabel = '%', ylabel = '', Title='Why was evacuation hampered?');

The availability of survival equipment has a great impact on the incident outcome. Here is a breakdown of reported survival equipment in the dataset.

matplotlib.rcParams['figure.figsize'] = (10, 5)
column = 'SurvEquiID_DisplayEng'
graph =  survivability_df_full.groupby(column)['OccNo'].count().sort_values(ascending=False).head(12)
fig = sns.barplot(graph.values, graph.index);
fig.set(xlabel = 'Events count [log scale]', ylabel = '', Title='Breakdown of aviation events by survival equipment');

Now let's see if the survival equipment was used properly during the events.

matplotlib.rcParams['figure.figsize'] = (15, 7)
matplotlib.rcParams['font.size'] = 30
column = 'EquipEffReasonID_DisplayEng'
graph =  survivability_df[survivability_df[column]!='UNKNOWN'].groupby(column)['OccNo'].count().sort_values(ascending=False)
fig = sns.barplot(graph.values/sum(graph)*100, graph.index);
fig.set(xlabel = '%', ylabel = '', Title='How effective was survival equipment?');

This information is missing in a large number of reports. Nonetheless, it appears that -in a majority of cases- the survival equipment was used properly. The next question would be: how effective is the equipment?

matplotlib.rcParams['figure.figsize'] = (10, 2)
matplotlib.rcParams['font.size'] = 18
column = 'EquipInfluenceEnum_DisplayEng'
survivability_df[column] = survivability_df.fillna({column:'No'})[column]
fig = sns.countplot(y=column,data = survivability_df);
fig.set(xlabel = 'Number of occurrences', ylabel = '', Title='Has Survival equipment influenced survivability ?');

At first glance, we could observe that survival equipment didn't affect survivability in more than 14000 events. The important statistic here is that equipment was effective in more than 6000 events, so it saved at least 6000 lives!


We presented through this analysis some of the key aspects to consider when it comes to aviation safety. In general, PETE factors are largely used to investigate safety concerns (Person, Equipment, Task, Environment). Some trends can be observed, such as the decrease in the number of fatal injuries number. On the other hand, the number of incidents is not decreasing. We have shown that some periods are more prone to aviation incidents so more precautions should be taken. In the document, the impact of the environmental variables was shown. We established the link between the flight phase and the likelihood of incidents/accidents.
The last part of the analysis was devoted to showing the importance of flight plans in search and rescue operations. Time and information are the most important factors during these operations.

Future work

This study was done using less than 40 columns out of 500. We still have to go in-depth to get the most of the available data. The next step will be more specialized EDAs, with more emphasis on what we can do to improve safety. Each factor of the PETE model can be considered and an extensive search is required to develop efficient solutions.


I would like to thank the Jovian team, especially Aakash N S for all the extensive assistance through my Data Science training.

Web references

Columns descriptions:

EquipInfluenceEnum_DisplayEng : Indicates whether the equipment influenced the survivability of the occurrence, if known, in English.
PhaseID_DisplayEng : The phase of the flight, in English. Note: Multiple phases can be assigned to each occurrence (OccID) or specific aircraft (AcID) and each will have an associated EventID.
OccTypeID_DisplayEng : A description of the occurrence type (accident/incident reportable), in English.
OrganizationID_DisplayEng : The name of the organization (if the operator is an organization), in English.
SurvivableEnum_DisplayEng : Indicates whether the occurrence was survivable (for occurrences involving an impact), in English. Indicator: Yes/No/Unknown.
SurvEquiID_DisplayEng : The survival equipment available on the occurrence aircraft, if relevant, in English. Equipment grid - Multiple survival equipments can be assigned to an occurrence (OccID). Each equipment will result in a separate entry.
TimeZoneID_DisplayEng : The time zone used for reporting the time of occurrence, in English.
LightCondID_DisplayEng : A description of the light conditions, in English.
YearOfManuf : The year in which the aircraft was manufactured.
AircraftModelID_DisplayEng : The aircraft model, in English.
TotalSeriousCount : The total number of serious injuries (includes any ground injuries).
InjuriesEnum_DisplayEng : Indicates whether there were any injuries related to the occurrence, including ground injuries, in English.
ICAO_DisplayEng : The International Civil Aviation Organization (ICAO) occurrence category, in English. For one occurrence, multiple ICAO categories may be assigned, that will generate multiple entries/rows.
AircraftMakeID_DisplayEng : The aircraft make, in English.
EvacEgressIssueEnum_DisplayEng : Indicates whether there were evacuation egress issues, if known, in English.
SkyCondID_DisplayEng : The sky conditions at the time of the occurrence, in English.
GeneralWeatherID_DisplayEng : Indicates whether the known weather conditions were conducive to visual or instrument flight rules, in English.
EquipEffReasonID_DisplayEng : The reason for survival equipment effectiveness, in English.
AircraftEvacTime : The duration of the aircraft evacuation, in minutes.
SurfaceContaminationID_DisplayEng : The type of surface contamination, if relevant, in English. Note: Each description will result in a separate entry.
FlightPlanTypeID_DisplayEng : The type of flight plan, in English.
TotalFatalCount : The total number of fatalities (includes any ground fatalities).
OccTime : The time the occurrence happened. Time format is hh:mm (24-hour clock).
VisibilyCeiling : The visibility ceiling, in feet.
OccNo : The unique occurrence number for general reference.
OccRegionID_DisplayEng : The region of the occurrence, as defined by the geographical area each regional office has been assigned, in English.
OccDate : The occurrence date. Date format is YYYY-MM-DD.
Visibility : The visibility, in statute miles.
AircraftTypeID_DisplayEng : The aircraft type as defined in the Canadian Aviation Regulations, Part 1, Subpart 1.
RunwaySurfaceID_DisplayEng : The texture of the surface of the runway involved in the occurrence, in English.
Summary : The summary of the occurrence.
EvacHamperedID_DisplayEng : The reason(s) why the evacuation was hampered, if applicable, in English.
EquipEffectiveEnum_DisplayEng : Indicates whether equipment was effective and corresponds to the survival equipment, if known, in English.
DamageLevelID_DisplayEng : The aircraft level of damage as defined by ICAO, in English.
WeatherPhenomenaTypeID_DisplayEng : The type of weather phenomena at the time of the occurrence, in English. Weather phenomena grid - Multiple weather phenomenas can be specified for each occurrence (OccID), each type and its associated description will appear as a separate entry.
CountryID_DisplayEng : The country of the occurrence, in English.
OperatorTypeID_DisplayEng : The type of operator (private, commercial, state) involved in the occurrence, in English.

!pip install jovian --upgrade -q
import jovian
jovian.commit(project="kara-mounir/canadian-aviation-data",filename="canadian_aviation.ipynb", environment=None)
[jovian] Attempting to save notebook.. [jovian] Updating notebook "kara-mounir/canadian-aviation-data" on https://jovian.ai/ [jovian] Uploading notebook.. [jovian] Committed successfully! https://jovian.ai/kara-mounir/canadian-aviation-data
Mounir Kara Zaitri6 months ago
Sign In