!git clone https://github.com/woldemarg/impulse_classifier
Cloning into 'impulse_classifier'...
remote: Enumerating objects: 28, done.
remote: Counting objects: 100% (28/28), done.
remote: Compressing objects: 100% (22/22), done.
remote: Total 28 (delta 8), reused 21 (delta 4), pack-reused 0
Unpacking objects: 100% (28/28), 178.24 KiB | 986.00 KiB/s, done.
cd impulse_classifier
/content/impulse_classifier/impulse_classifier
!pip install --upgrade lightgbm
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Requirement already satisfied: lightgbm in /usr/local/lib/python3.9/dist-packages (3.3.5)
Requirement already satisfied: wheel in /usr/local/lib/python3.9/dist-packages (from lightgbm) (0.40.0)
Requirement already satisfied: scikit-learn!=0.22.0 in /usr/local/lib/python3.9/dist-packages (from lightgbm) (1.2.2)
Requirement already satisfied: scipy in /usr/local/lib/python3.9/dist-packages (from lightgbm) (1.10.1)
Requirement already satisfied: numpy in /usr/local/lib/python3.9/dist-packages (from lightgbm) (1.22.4)
Requirement already satisfied: joblib>=1.1.1 in /usr/local/lib/python3.9/dist-packages (from scikit-learn!=0.22.0->lightgbm) (1.1.1)
Requirement already satisfied: threadpoolctl>=2.0.0 in /usr/local/lib/python3.9/dist-packages (from scikit-learn!=0.22.0->lightgbm) (3.1.0)
%matplotlib inline
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# define random state for reproducibility
RND = 1234
# set sample size
SMP = 5000
# share of positives to hide, to emulate a PU dataset
n_hidden_share = 0.5
# set test share
EVL = 0.25

X, y = make_classification(
    n_samples=SMP,
    weights=[0.6],
    shuffle=True,
    random_state=RND)

y_pu = y.copy()

pos = np.nonzero(y)[0]
np.random.RandomState(RND).shuffle(pos)

n_hidden = int(y.sum() * n_hidden_share)
y_pu[pos[:n_hidden]] = 0

X_trn, X_tst, y_trn_pu, y_tst_pu, y_trn, y_tst = train_test_split(
    X, y_pu, y, test_size=EVL, random_state=RND, stratify=y_pu)

print(f'Positives in original target: {y.sum()} ({y.mean():.1%})')
print(f'Positives in modified target: {y_pu.sum()} ({y_pu.mean():.1%})')
Positives in original target: 2013 (40.3%)
Positives in modified target: 1007 (20.1%)
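Since the hidden positives keep their true label in `y`, we can sanity-check the PU construction: every mismatch between `y` and `y_pu` should be a true positive that was relabeled as unlabeled, and no negative should ever flip to positive. A minimal self-contained sketch (rebuilding the same arrays as above):

```python
import numpy as np
from sklearn.datasets import make_classification

RND = 1234
SMP = 5000
n_hidden_share = 0.5

X, y = make_classification(
    n_samples=SMP, weights=[0.6], shuffle=True, random_state=RND)

# hide half of the positives, exactly as in the cell above
y_pu = y.copy()
pos = np.nonzero(y)[0]
np.random.RandomState(RND).shuffle(pos)
n_hidden = int(y.sum() * n_hidden_share)
y_pu[pos[:n_hidden]] = 0

# indices where the PU target disagrees with the true target
flipped = np.flatnonzero(y != y_pu)

# every flip is a true positive turned unlabeled, never the reverse
assert np.all(y[flipped] == 1)
assert len(flipped) == n_hidden
assert y_pu.sum() == y.sum() - n_hidden
```

If any assertion failed, the "unlabeled" class would be contaminated in the wrong direction and the PU setup would be invalid.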