User Recognition from Cell Phone Accelerometer Data Using ML

Open in Google Colab and click the "Run" button to execute the code.

Figure 1. a. (left) Coordinate system (relative to a device) that's used by the Android Sensor API. b. (right) 670 sample data points collected from an accelerometer stream during approximately 2 minutes, while the device was motionless in portrait, landscape and face-up positions. The plotted x,y,z coordinates of each accelerometer event appear as points on the surface of a sphere with radius ~9.81 m/s^2.
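
The sphere in Figure 1b follows from the fact that a motionless device measures only gravity, so the magnitude of each (x, y, z) reading stays close to 9.81 m/s^2 regardless of orientation. The following is a minimal sketch of that check in Python. The readings are idealized, hypothetical values, not samples from the dataset, and the function name `magnitude` is chosen here for illustration.

```python
import math

G = 9.81  # gravitational acceleration in m/s^2, as quoted in the figure caption


def magnitude(x, y, z):
    """Euclidean length of one accelerometer reading, in m/s^2."""
    return math.sqrt(x * x + y * y + z * z)


# Idealized readings for a motionless device (hypothetical values).
readings = {
    "portrait": (0.0, G, 0.0),   # gravity along the device y axis
    "landscape": (G, 0.0, 0.0),  # gravity along the device x axis
    "face_up": (0.0, 0.0, G),    # gravity along the device z axis
    "tilted": (G / math.sqrt(3),) * 3,  # gravity shared equally by all axes
}

for name, (x, y, z) in readings.items():
    print(f"{name:10s} |a| = {magnitude(x, y, z):.2f} m/s^2")
```

Real readings scatter slightly around this radius because of sensor noise and calibration offsets.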

A. INTRODUCTION

The data used in this project comes from a Kaggle competition, which attempts to answer the question:
Can cell phone accelerometer data be used as a biometric for identifying users of mobile devices?

User authentication is an important component of giving cell phone owners secure access to their applications and data. Authentication allows a computerized system to know who the user is and to determine that the user is not impersonating an authorized user. Typically, passwords have been used to authenticate users to computerized systems, though other methods, including biometric identification, one-time tokens and digital signatures, are also used. A common drawback of such methods is that they are obtrusive: each time users wish to access the system, they must remember and enter a sequence of characters, interact with a biometric device, or carry a device that generates one-time tokens. It would be desirable to have an unobtrusive method for authenticating a user to a mobile device, one that could also be used to authenticate the user to other computerized systems, to perform secure payments or to unlock physical barriers.

This Kaggle competition was sponsored by a company named "Seal-Id". The company filed a US provisional patent application (61/430,549) in January 2011, titled "Method and System for Unobtrusive Mobile Device User Recognition". The abstract of the patent defines the goal:

"Parameters of user interaction with a hand-held mobile device are continually and unobtrusively sampled, discretized and monitored during normal operation of the device. Software on the device learns the usage repertoire of an authorized user from the samples, and is able to distinguish between the authorized user of the device and an unauthorized user. Recognition of an authorized user by a mobile device can be exploited to trigger defensive actions or to facilitate provision of secure automatic authentication of users."

B. DESCRIPTION OF THE DATA

All Data:

To collect the data, Seal-Id published an app on Google's Android Play Store that samples accelerometer data in the background and posts it to a central database for analysis.

Data was collected from 387 users over a period of several months during normal device usage
Data recorded while the device was motionless was discarded
The columns are the raw X, Y, Z accelerometer readings in m/s^2 (g = 9.81 m/s^2) plus a Unix timestamp
60 million unique samples of accelerometer data in total (train and test)
The sampling rate varies by device but averages ~1 sample per 207 ms (~5 Hz)
The first 50% of each device's data, sorted in time, became the training set
The last 50% of each device's data, sorted in time, became the test set
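
The sampling-rate figure and the per-device chronological split described above can be sketched as follows. This is a minimal illustration, not the organizers' code: the row layout `(device_id, timestamp_ms, x, y, z)` and the function names are assumptions, and how an odd number of samples was divided is not stated in the source (here the extra sample goes to the test half).

```python
from collections import defaultdict


def mean_rate_hz(timestamps_ms):
    """Average sampling rate in Hz from a list of millisecond timestamps."""
    ts = sorted(timestamps_ms)
    mean_interval_ms = (ts[-1] - ts[0]) / (len(ts) - 1)
    return 1000.0 / mean_interval_ms


def split_by_time(rows):
    """Split rows into (train, test): first half in time per device, then the rest.

    Each row is assumed to be a tuple (device_id, timestamp_ms, x, y, z).
    """
    by_device = defaultdict(list)
    for row in rows:
        by_device[row[0]].append(row)

    train, test = [], []
    for device_rows in by_device.values():
        device_rows.sort(key=lambda r: r[1])  # chronological order
        half = len(device_rows) // 2
        train.extend(device_rows[:half])
        test.extend(device_rows[half:])
    return train, test


# One sample every 207 ms corresponds to roughly 4.8 Hz.
print(round(mean_rate_hz([0, 207, 414, 621]), 2))
```

Splitting by time rather than at random means the test set always lies in the future relative to the training set for the same device.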

Training Data:

30 million samples in the training set
Each sample is labeled with the unique device from which it was collected

Test Data:

30 million samples in the test set
Divided into 90k sequences, each made up of consecutive samples from a single device
300 samples per sequence ID
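
Cutting one device's time-ordered test samples into fixed-length sequences can be sketched as below. This is an assumption-laden illustration: the source does not say how the organizers handled a trailing run shorter than 300 samples, so this sketch simply drops it, and the function name `to_sequences` is hypothetical.

```python
SEQUENCE_LENGTH = 300  # samples per sequence ID, as stated above


def to_sequences(samples, length=SEQUENCE_LENGTH):
    """Cut a time-ordered list of samples from one device into full sequences.

    A trailing remainder shorter than `length` is dropped (an assumption).
    """
    full = len(samples) // length
    return [samples[i * length:(i + 1) * length] for i in range(full)]


# 650 consecutive samples yield two full sequences; the last 50 are dropped.
sequences = to_sequences(list(range(650)))
print(len(sequences), [len(s) for s in sequences])
```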

C. SCORING

A file of test questions is provided. Each question asks whether the accelerometer data in a test sequence came from a proposed device. We are told that 50% of the proposed devices are intentionally incorrect. The submission file contains, for each question, the probability that the proposed device corresponds to the data.
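
A submission of this kind could be written as shown below. The column names `QuestionId` and `IsTrue` are assumptions made for illustration and should be checked against the competition's sample submission file. The helper name `write_submission` is also hypothetical.

```python
import csv
import io


def write_submission(predictions, out):
    """Write (question_id, probability) pairs as CSV to a file-like object."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["QuestionId", "IsTrue"])  # assumed column names
    for question_id, prob in predictions:
        if not 0.0 <= prob <= 1.0:
            raise ValueError(f"probability out of range for question {question_id}")
        writer.writerow([question_id, f"{prob:.6f}"])


# Two hypothetical answers: the first question is probably a true match,
# the second is probably not.
buffer = io.StringIO()
write_submission([(1, 0.92), (2, 0.08)], buffer)
print(buffer.getvalue())
```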

The model is scored using the area under the ROC curve (ROC AUC). This metric fits the problem because the goal is to distinguish the owner of a cell phone from all other people who may try to access it. In multiclass classification this setup is referred to as One-vs-Rest, or 'OVR'. The ROC AUC can be interpreted as the probability that a randomly chosen sequence from the actual user receives a higher score than a randomly chosen sequence from any other user.
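
To make the metric concrete, the sketch below computes ROC AUC directly from its pairwise definition: the fraction of (positive, negative) pairs in which the positive example receives the higher score, with ties counted as half. It is an illustrative pure-Python version with made-up labels and scores. A library routine would normally be used in practice, and this quadratic loop would be too slow for the full test set.

```python
def roc_auc(labels, scores):
    """ROC AUC from the pairwise definition; labels are 1 (true user) or 0."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # a tie counts as half a win
    return wins / (len(positives) * len(negatives))


# Hypothetical example: three of the four positive/negative pairs are
# ranked correctly, so the AUC is 0.75.
labels = [1, 0, 1, 0]
scores = [0.9, 0.3, 0.4, 0.6]
print(roc_auc(labels, scores))
```

A score of 0.5 corresponds to random guessing and 1.0 to a perfect ranking, independent of any probability threshold.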