Home Credit Default Risk (HCDR)

  • Project Title : Home Credit Default Risk Prediction

  • Group Name : Group1

  • Group Number : 1

  • Group Members (From Left to right and top to bottom)

    • Archana Krishnamurthy (akrishn@iu.edu)
    • Anitha Ganapathy (aganapa@iu.edu)
    • Bathurunnisha Abdul Jabbar (babdulj@iu.edu)
    • Rajesh Thanji (rthanji@iu.edu)
  • Team Montage image.png

Overview : The course project is based on the Home Credit Default Risk (HCDR) Kaggle Competition. The goal of this project is to predict whether or not a client will repay a loan. In order to make sure that people who struggle to get loans due to insufficient or non-existent credit histories have a positive loan experience, Home Credit makes use of a variety of alternative data--including telco and transactional information--to predict their clients' repayment abilities.

Abstract

The objective of this project is to use machine learning methodologies on historical loan application data to predict whether or not an applicant will be able to repay a loan. As an extension to EDA and hyper-tuned model, this phase provided valuable insights when feature engineering was modified to handle data leakage employing a better data processing flow. Multiple experiments were conducted applying feature selection techniques including RFE, SelecKbest, Variance threshold to Logistic regression, Gradient Boosting, XGBoost, LightGBM & SVM models , further handling class imbalance using SMOTE for XGBoost , monitoring error generalization with early stopping and building high performance Neural Networks . Our results in this phase show that the best performing algorithm was Logistic Regerssion with varaince threshold selection with test ROC_AOC as 75.22%. The lowest performing was SVM model with test AUC(Area under ROC) as 67.21%. Our best score in Kaggle submission was for Logistic Regression with SelectKBest with score of 0.72158 for private and 0.72592 for public.

Table of Contents

Project Description

Home Credit is an international non-bank financial institution, which primarily focuses on lending people money regardless of their credit history. Home credit groups aim to provide positive borrowing experience to customers, who do not bank on traditional sources. Hence, Home Credit Group published a dataset on Kaggle website with the objective of identifying and solving unfair loan rejection.

The goal of this project is to build a machine learning model to predict the customer behavior on repayment of loan. Our task would be to create a pipeline to build a baseline machine learning model using logistic regression classification algorithms. The final model will be evaluated with various performance metrics to build a better model. Businesses will be able to use the output of the model to identify if loan is at risk to default. The new model built will ensure that clients capable of repayment are not rejected and that loans are given with a principal, maturity, and repayment calendar that will empower their clients to be successful.

The results of our machine learning pipelines will be measured using the follwing metrics;

  • Confusion Matrix
  • Accuracy Score
  • Precision
  • Recall
  • F1 score
  • AUC (Area Under ROC Curve)
  • CXE Loss
  • Hinge Loss (Deep Learning)

The pipeline results will be logged, compared and ranked using the appropriate measurements. The most efficient pipeline will be submitted to the HCDR Kaggle Competition.

Workflow

For this project, we are following the proposed workflow as mentioned in Phase-0 of this project.

image.png

Kaggle API setup

Kaggle is a Data Science Competition Platform which shares a lot of datasets. In the past, it was troublesome to submit your result as your have to go through the console in your browser and drag your files there. Now you can interact with Kaggle via the command line. E.g.,

! kaggle competitions files home-credit-default-risk

If running on Google Drive Run below else not

In [ ]:
import gc
In [ ]:
gc.set_threshold(1000, 15, 15)
In [ ]:
def collecttrash():
  print('before collection : ',gc.get_count())
  gc.collect()
  print('after collection : ',gc.get_count())
In [ ]:
collecttrash()
before collection :  (347, 0, 10)
after collection :  (0, 0, 0)
In [ ]:
%config Completer.use_jedi = False
from time import time, ctime

nb_start = time()

print("Note Book Start time:   ", ctime(nb_start))
Note Book Start time:    Mon May  3 16:09:59 2021
/usr/local/lib/python3.7/dist-packages/ipykernel_launcher.py:1: UserWarning: Config option `use_jedi` not recognized by `IPCompleter`.
  """Entry point for launching an IPython kernel.
In [ ]:
# !pwd
In [ ]:
# !mkdir ~/.kaggle
# !cp /root/shared/Downloads/kaggle.json ~/.kaggle
# !chmod 600 ~/.kaggle/kaggle.json

Mount drive for file load setup

Dataset Description

Background on the dataset

Home Credit is a non-banking financial institution, founded in 1997 in the Czech Republic.

The company operates in 14 countries (including United States, Russia, Kazahstan, Belarus, China, India) and focuses on lending primarily to people with little or no credit history which will either not obtain loans or became victims of untrustworthly lenders.

Home Credit group has over 29 million customers, total assests of 21 billions Euro, over 160 millions loans, with the majority in Asia and and almost half of them in China (as of 19-05-2018).

While Home Credit is currently using various statistical and machine learning methods to make these predictions, they're challenging Kagglers to help them unlock the full potential of their data. Doing so will ensure that clients capable of repayment are not rejected and that loans are given with a principal, maturity, and repayment calendar that will empower their clients to be successful.

Data files Overview

There are 7 different sources of data:

  • application_train/application_test: the main training and testing data with information about each loan application at Home Credit. Every loan has its own row and is identified by the feature SK_ID_CURR. The training application data comes with the TARGET indicating 0: the loan was repaid or 1: the loan was not repaid. The target variable defines if the client had payment difficulties meaning he/she had late payment more than X days on at least one of the first Y installments of the loan. Such case is marked as 1 while other all other cases as 0.
  • bureau: data concerning client's previous credits from other financial institutions. Each previous credit has its own row in bureau, but one loan in the application data can have multiple previous credits.
  • bureau_balance: monthly data about the previous credits in bureau. Each row is one month of a previous credit, and a single previous credit can have multiple rows, one for each month of the credit length.
  • previous_application: previous applications for loans at Home Credit of clients who have loans in the application data. Each current loan in the application data can have multiple previous loans. Each previous application has one row and is identified by the feature SK_ID_PREV.
  • POS_CASH_BALANCE: monthly data about previous point of sale or cash loans clients have had with Home Credit. Each row is one month of a previous point of sale or cash loan, and a single previous loan can have many rows.
  • credit_card_balance: monthly data about previous credit cards clients have had with Home Credit. Each row is one month of a credit card balance, and a single credit card can have many rows.
  • installments_payment: payment history for previous loans at Home Credit. There is one row for every made payment and one row for every missed payment.

Imports

In [569]:
# Uncomment if using running it on jupyter lab 
# For code autocompletion
# %config Completer.use_jedi = False

# Preprocessing imports
import copy
import numpy as np
import pandas as pd 
import os
import gc
import zipfile
import matplotlib.pyplot as plt
# from matplotlib import pyplot
import seaborn as sns
from pandas.plotting import scatter_matrix

# Pipelines
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline, Pipeline, FeatureUnion
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import PolynomialFeatures
from sklearn.decomposition import PCA

# Model imports
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import SGDClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import ShuffleSplit
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import cross_validate
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

from sklearn.utils import resample
import sklearn.metrics as metrics
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score, log_loss
from sklearn.metrics import classification_report, roc_auc_score, make_scorer
from sklearn.metrics import roc_auc_score, make_scorer, roc_curve, ConfusionMatrixDisplay, precision_recall_curve
from sklearn.metrics import explained_variance_score
from sklearn.metrics import plot_roc_curve, plot_confusion_matrix, plot_precision_recall_curve


from scipy import stats
from time import time, ctime

# Display help
from IPython.display import display, HTML
import re
import json
import pprint
import warnings
warnings.filterwarnings('ignore')

pprint = pprint.PrettyPrinter().pprint
In [ ]:
# !pip install lightgbm
In [570]:
from sklearn.model_selection import ShuffleSplit
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_validate
from sklearn.utils import resample

from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.linear_model import SGDClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from sklearn.decomposition import PCA
from sklearn.feature_selection import RFE
from sklearn.ensemble import VotingClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.feature_selection import VarianceThreshold
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import mutual_info_classif

from sklearn.metrics import accuracy_score, confusion_matrix, f1_score, log_loss, classification_report, roc_auc_score, make_scorer
from scipy import stats
import json
from matplotlib import pyplot
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, make_scorer, roc_curve, ConfusionMatrixDisplay, precision_recall_curve
from sklearn.metrics import explained_variance_score
from sklearn.metrics import plot_roc_curve, plot_confusion_matrix, plot_precision_recall_curve

Download the files via Kaggle API

In [ ]:
DATA_DIR = "/root/shared/I526_AML_Student/Assignments/Unit-Project-Home-Credit-Default-Risk/HCDR_Phase_1_baseline_submission/data"   #same level as course repo in the data directory
#DATA_DIR = os.path.join('./ddddd/')
!mkdir $DATA_DIR
In [ ]:
!ls -l $DATA_DIR
In [ ]:
! kaggle competitions download home-credit-default-risk -p $DATA_DIR
In [ ]:
from google.colab import drive 
drive.mount('/content/drive',force_remount=True)

import os 

os.chdir("/content/drive/My Drive")
Mounted at /content/drive
In [ ]:
unzippingReq = False
if unzippingReq: #please modify this code 
    zip_file = DATA_DIR + '/' + 'home-credit-default-risk.zip'
    zip_ref = zipfile.ZipFile(zip_file, 'r')
    zip_ref.extractall(path=DATA_DIR)
    zip_ref.close()

Data Download

Helper Function

In [ ]:
def load_data(in_path, name):
    df = pd.read_csv(in_path)
    print(f"{name}: shape is {df.shape}")
    print(df.info())
    display(df.head(5))
    return df

datasets={}  # lets store the datasets in a dictionary so we can keep track of them easily

The full dataset consists of 7 tables. There is 1 primary table and 6 secondary tables.

Primary Tables

  1. application_train
    This Primary table includes the application information for each loan application at Home Credit in one row. This row includes the target variable of whether or not the loan was repaid. We use this field as the basis to determine the feature importance. The target variable is binary in nature based since this is a classification problem.
  • '1' - client with payment difficulties: he/she had late payment more than N days on at least one of the first M installments of the loan in our sample
  • '0' - all other cases
    The number of variables are 122. The number of data entries are 307,511
  1. application_test
    This table includes the application information for each loan application at Home Credit in one row. The features are the same as the train data but exclude the target variable
    The number of variables are 121. The number of data entries are 48,744.

image.png

In [ ]:
ds_name = 'application_test'
datasets[ds_name] = load_data(os.path.join(DATA_DIR, f'{ds_name}.csv'), ds_name)
application_test: shape is (48744, 121)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48744 entries, 0 to 48743
Columns: 121 entries, SK_ID_CURR to AMT_REQ_CREDIT_BUREAU_YEAR
dtypes: float64(65), int64(40), object(16)
memory usage: 45.0+ MB
None
SK_ID_CURR NAME_CONTRACT_TYPE CODE_GENDER FLAG_OWN_CAR FLAG_OWN_REALTY CNT_CHILDREN AMT_INCOME_TOTAL AMT_CREDIT AMT_ANNUITY AMT_GOODS_PRICE NAME_TYPE_SUITE NAME_INCOME_TYPE NAME_EDUCATION_TYPE NAME_FAMILY_STATUS NAME_HOUSING_TYPE REGION_POPULATION_RELATIVE DAYS_BIRTH DAYS_EMPLOYED DAYS_REGISTRATION DAYS_ID_PUBLISH OWN_CAR_AGE FLAG_MOBIL FLAG_EMP_PHONE FLAG_WORK_PHONE FLAG_CONT_MOBILE FLAG_PHONE FLAG_EMAIL OCCUPATION_TYPE CNT_FAM_MEMBERS REGION_RATING_CLIENT REGION_RATING_CLIENT_W_CITY WEEKDAY_APPR_PROCESS_START HOUR_APPR_PROCESS_START REG_REGION_NOT_LIVE_REGION REG_REGION_NOT_WORK_REGION LIVE_REGION_NOT_WORK_REGION REG_CITY_NOT_LIVE_CITY REG_CITY_NOT_WORK_CITY LIVE_CITY_NOT_WORK_CITY ORGANIZATION_TYPE ... LIVINGAPARTMENTS_MEDI LIVINGAREA_MEDI NONLIVINGAPARTMENTS_MEDI NONLIVINGAREA_MEDI FONDKAPREMONT_MODE HOUSETYPE_MODE TOTALAREA_MODE WALLSMATERIAL_MODE EMERGENCYSTATE_MODE OBS_30_CNT_SOCIAL_CIRCLE DEF_30_CNT_SOCIAL_CIRCLE OBS_60_CNT_SOCIAL_CIRCLE DEF_60_CNT_SOCIAL_CIRCLE DAYS_LAST_PHONE_CHANGE FLAG_DOCUMENT_2 FLAG_DOCUMENT_3 FLAG_DOCUMENT_4 FLAG_DOCUMENT_5 FLAG_DOCUMENT_6 FLAG_DOCUMENT_7 FLAG_DOCUMENT_8 FLAG_DOCUMENT_9 FLAG_DOCUMENT_10 FLAG_DOCUMENT_11 FLAG_DOCUMENT_12 FLAG_DOCUMENT_13 FLAG_DOCUMENT_14 FLAG_DOCUMENT_15 FLAG_DOCUMENT_16 FLAG_DOCUMENT_17 FLAG_DOCUMENT_18 FLAG_DOCUMENT_19 FLAG_DOCUMENT_20 FLAG_DOCUMENT_21 AMT_REQ_CREDIT_BUREAU_HOUR AMT_REQ_CREDIT_BUREAU_DAY AMT_REQ_CREDIT_BUREAU_WEEK AMT_REQ_CREDIT_BUREAU_MON AMT_REQ_CREDIT_BUREAU_QRT AMT_REQ_CREDIT_BUREAU_YEAR
0 100001 Cash loans F N Y 0 135000.0 568800.0 20560.5 450000.0 Unaccompanied Working Higher education Married House / apartment 0.018850 -19241 -2329 -5170.0 -812 NaN 1 1 0 1 0 1 NaN 2.0 2 2 TUESDAY 18 0 0 0 0 0 0 Kindergarten ... NaN 0.0514 NaN NaN NaN block of flats 0.0392 Stone, brick No 0.0 0.0 0.0 0.0 -1740.0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0 0.0 0.0 0.0 0.0 0.0
1 100005 Cash loans M N Y 0 99000.0 222768.0 17370.0 180000.0 Unaccompanied Working Secondary / secondary special Married House / apartment 0.035792 -18064 -4469 -9118.0 -1623 NaN 1 1 0 1 0 0 Low-skill Laborers 2.0 2 2 FRIDAY 9 0 0 0 0 0 0 Self-employed ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.0 0.0 0.0 0.0 0.0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0 0.0 0.0 0.0 0.0 3.0
2 100013 Cash loans M Y Y 0 202500.0 663264.0 69777.0 630000.0 NaN Working Higher education Married House / apartment 0.019101 -20038 -4458 -2175.0 -3503 5.0 1 1 0 1 0 0 Drivers 2.0 2 2 MONDAY 14 0 0 0 0 0 0 Transport: type 3 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.0 0.0 0.0 0.0 -856.0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0 0.0 0.0 0.0 1.0 4.0
3 100028 Cash loans F N Y 2 315000.0 1575000.0 49018.5 1575000.0 Unaccompanied Working Secondary / secondary special Married House / apartment 0.026392 -13976 -1866 -2000.0 -4208 NaN 1 1 0 1 1 0 Sales staff 4.0 2 2 WEDNESDAY 11 0 0 0 0 0 0 Business Entity Type 3 ... 0.2446 0.3739 0.0388 0.0817 reg oper account block of flats 0.3700 Panel No 0.0 0.0 0.0 0.0 -1805.0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0 0.0 0.0 0.0 0.0 3.0
4 100038 Cash loans M Y N 1 180000.0 625500.0 32067.0 625500.0 Unaccompanied Working Secondary / secondary special Married House / apartment 0.010032 -13040 -2191 -4000.0 -4262 16.0 1 1 1 1 0 0 NaN 3.0 2 2 FRIDAY 5 0 0 0 0 1 1 Business Entity Type 3 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.0 0.0 0.0 0.0 -821.0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 NaN NaN NaN NaN NaN NaN

5 rows × 121 columns

Secondary Tables

  1. Bureau
    This table includes all previous credits received by a customer from other financial institutions prior to their loan application. There is one row for each previous credit, meaning a many-to-one relationship with the primary table. We could join it with primary table by using current application ID, SK_ID_CURR.
    The number of variables are 17.The number of data entries are 1,716,428.

  2. Bureau Balance
    This table includes the monthly balance for a previous credit at other financial institutions. There is one row for each monthly balance, meaning a many-to-one relationship with the Bureau table. We could join it with bureau table by using bureau's ID, SK_ID_BUREAU.
    The number of variables are 3. The number of data entries are 27,299,925

  3. Previous Application
    This table includes previous applications for loans made by the customer at Home Credit. There is one row for each previous application, meaning a many-to-one relationship with the primary table. We could join it with primary table by using current application ID, SK_ID_CURR.
    There are four types of contracts:
    a. Consumer loan(POS – Credit limit given to buy consumer goods)
    b. Cash loan(Client is given cash)
    c. Revolving loan(Credit)
    d. XNA (Contract type without values)
    The number of variables are 37. The number of data entries are 1,670,214

  4. POS CASH Balance
    This table includes a monthly balance snapshot of a previous point of sale or cash loan that the customer has at Home Credit. There is one row for each monthly balance, meaning a many-to-one relationship with the Previous Application table. We would join it with Previous Application table by using previous application ID, SK_ID_PREV, then join it with primary table by using current application ID, SK_ID_CURR.
    The number of variables are 8. The number of data entries are 10,001,358

  5. Credit Card Balance
    This table includes a monthly balance snapshot of previous credit cards the customer has with Home Credit. There is one row for each previous monthly balance, meaning a many-to-one relationship with the Previous Application table.We could join it with Previous Application table by using previous application ID, SK_ID_PREV, then join it with primary table by using current application ID, SK_ID_CURR.
    The number of variables are 23. The number of data entries are 3,840,312

  6. Installments Payments
    This table includes previous repayments made or not made by the customer on credits issued by Home Credit. There is one row for each payment or missed payment, meaning a many-to-one relationship with the Previous Application table. We would join it with Previous Application table by using previous application ID, SK_ID_PREV, then join it with primary table by using current application ID, SK_ID_CURR.
    The number of variables are 8 . The number of data entries are 13,605,401

The application dataset has the most information about the client: Gender, income, family status, education ...

Download all the files

In [ ]:
%%time
ds_names = ("application_train", "application_test", "bureau","bureau_balance","credit_card_balance","installments_payments",
            "previous_application","POS_CASH_balance")

for ds_name in ds_names:
    datasets[ds_name] = load_data(os.path.join(DATA_DIR, f'{ds_name}.csv'), ds_name)
application_train: shape is (307511, 122)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 307511 entries, 0 to 307510
Columns: 122 entries, SK_ID_CURR to AMT_REQ_CREDIT_BUREAU_YEAR
dtypes: float64(65), int64(41), object(16)
memory usage: 286.2+ MB
None
SK_ID_CURR TARGET NAME_CONTRACT_TYPE CODE_GENDER FLAG_OWN_CAR FLAG_OWN_REALTY CNT_CHILDREN AMT_INCOME_TOTAL AMT_CREDIT AMT_ANNUITY AMT_GOODS_PRICE NAME_TYPE_SUITE NAME_INCOME_TYPE NAME_EDUCATION_TYPE NAME_FAMILY_STATUS NAME_HOUSING_TYPE REGION_POPULATION_RELATIVE DAYS_BIRTH DAYS_EMPLOYED DAYS_REGISTRATION DAYS_ID_PUBLISH OWN_CAR_AGE FLAG_MOBIL FLAG_EMP_PHONE FLAG_WORK_PHONE FLAG_CONT_MOBILE FLAG_PHONE FLAG_EMAIL OCCUPATION_TYPE CNT_FAM_MEMBERS REGION_RATING_CLIENT REGION_RATING_CLIENT_W_CITY WEEKDAY_APPR_PROCESS_START HOUR_APPR_PROCESS_START REG_REGION_NOT_LIVE_REGION REG_REGION_NOT_WORK_REGION LIVE_REGION_NOT_WORK_REGION REG_CITY_NOT_LIVE_CITY REG_CITY_NOT_WORK_CITY LIVE_CITY_NOT_WORK_CITY ... LIVINGAPARTMENTS_MEDI LIVINGAREA_MEDI NONLIVINGAPARTMENTS_MEDI NONLIVINGAREA_MEDI FONDKAPREMONT_MODE HOUSETYPE_MODE TOTALAREA_MODE WALLSMATERIAL_MODE EMERGENCYSTATE_MODE OBS_30_CNT_SOCIAL_CIRCLE DEF_30_CNT_SOCIAL_CIRCLE OBS_60_CNT_SOCIAL_CIRCLE DEF_60_CNT_SOCIAL_CIRCLE DAYS_LAST_PHONE_CHANGE FLAG_DOCUMENT_2 FLAG_DOCUMENT_3 FLAG_DOCUMENT_4 FLAG_DOCUMENT_5 FLAG_DOCUMENT_6 FLAG_DOCUMENT_7 FLAG_DOCUMENT_8 FLAG_DOCUMENT_9 FLAG_DOCUMENT_10 FLAG_DOCUMENT_11 FLAG_DOCUMENT_12 FLAG_DOCUMENT_13 FLAG_DOCUMENT_14 FLAG_DOCUMENT_15 FLAG_DOCUMENT_16 FLAG_DOCUMENT_17 FLAG_DOCUMENT_18 FLAG_DOCUMENT_19 FLAG_DOCUMENT_20 FLAG_DOCUMENT_21 AMT_REQ_CREDIT_BUREAU_HOUR AMT_REQ_CREDIT_BUREAU_DAY AMT_REQ_CREDIT_BUREAU_WEEK AMT_REQ_CREDIT_BUREAU_MON AMT_REQ_CREDIT_BUREAU_QRT AMT_REQ_CREDIT_BUREAU_YEAR
0 100002 1 Cash loans M N Y 0 202500.0 406597.5 24700.5 351000.0 Unaccompanied Working Secondary / secondary special Single / not married House / apartment 0.018801 -9461 -637 -3648.0 -2120 NaN 1 1 0 1 1 0 Laborers 1.0 2 2 WEDNESDAY 10 0 0 0 0 0 0 ... 0.0205 0.0193 0.0000 0.00 reg oper account block of flats 0.0149 Stone, brick No 2.0 2.0 2.0 2.0 -1134.0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0 0.0 0.0 0.0 0.0 1.0
1 100003 0 Cash loans F N N 0 270000.0 1293502.5 35698.5 1129500.0 Family State servant Higher education Married House / apartment 0.003541 -16765 -1188 -1186.0 -291 NaN 1 1 0 1 1 0 Core staff 2.0 1 1 MONDAY 11 0 0 0 0 0 0 ... 0.0787 0.0558 0.0039 0.01 reg oper account block of flats 0.0714 Block No 1.0 0.0 1.0 0.0 -828.0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0 0.0 0.0 0.0 0.0 0.0
2 100004 0 Revolving loans M Y Y 0 67500.0 135000.0 6750.0 135000.0 Unaccompanied Working Secondary / secondary special Single / not married House / apartment 0.010032 -19046 -225 -4260.0 -2531 26.0 1 1 1 1 1 0 Laborers 1.0 2 2 MONDAY 9 0 0 0 0 0 0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.0 0.0 0.0 0.0 -815.0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0 0.0 0.0 0.0 0.0 0.0
3 100006 0 Cash loans F N Y 0 135000.0 312682.5 29686.5 297000.0 Unaccompanied Working Secondary / secondary special Civil marriage House / apartment 0.008019 -19005 -3039 -9833.0 -2437 NaN 1 1 0 1 0 0 Laborers 2.0 2 2 WEDNESDAY 17 0 0 0 0 0 0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 2.0 0.0 2.0 0.0 -617.0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 NaN NaN NaN NaN NaN NaN
4 100007 0 Cash loans M N Y 0 121500.0 513000.0 21865.5 513000.0 Unaccompanied Working Secondary / secondary special Single / not married House / apartment 0.028663 -19932 -3038 -4311.0 -3458 NaN 1 1 0 1 0 0 Core staff 1.0 2 2 THURSDAY 11 0 0 0 0 1 1 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.0 0.0 0.0 0.0 -1106.0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0 0.0 0.0 0.0 0.0 0.0

5 rows × 122 columns

application_test: shape is (48744, 121)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48744 entries, 0 to 48743
Columns: 121 entries, SK_ID_CURR to AMT_REQ_CREDIT_BUREAU_YEAR
dtypes: float64(65), int64(40), object(16)
memory usage: 45.0+ MB
None
SK_ID_CURR NAME_CONTRACT_TYPE CODE_GENDER FLAG_OWN_CAR FLAG_OWN_REALTY CNT_CHILDREN AMT_INCOME_TOTAL AMT_CREDIT AMT_ANNUITY AMT_GOODS_PRICE NAME_TYPE_SUITE NAME_INCOME_TYPE NAME_EDUCATION_TYPE NAME_FAMILY_STATUS NAME_HOUSING_TYPE REGION_POPULATION_RELATIVE DAYS_BIRTH DAYS_EMPLOYED DAYS_REGISTRATION DAYS_ID_PUBLISH OWN_CAR_AGE FLAG_MOBIL FLAG_EMP_PHONE FLAG_WORK_PHONE FLAG_CONT_MOBILE FLAG_PHONE FLAG_EMAIL OCCUPATION_TYPE CNT_FAM_MEMBERS REGION_RATING_CLIENT REGION_RATING_CLIENT_W_CITY WEEKDAY_APPR_PROCESS_START HOUR_APPR_PROCESS_START REG_REGION_NOT_LIVE_REGION REG_REGION_NOT_WORK_REGION LIVE_REGION_NOT_WORK_REGION REG_CITY_NOT_LIVE_CITY REG_CITY_NOT_WORK_CITY LIVE_CITY_NOT_WORK_CITY ORGANIZATION_TYPE ... LIVINGAPARTMENTS_MEDI LIVINGAREA_MEDI NONLIVINGAPARTMENTS_MEDI NONLIVINGAREA_MEDI FONDKAPREMONT_MODE HOUSETYPE_MODE TOTALAREA_MODE WALLSMATERIAL_MODE EMERGENCYSTATE_MODE OBS_30_CNT_SOCIAL_CIRCLE DEF_30_CNT_SOCIAL_CIRCLE OBS_60_CNT_SOCIAL_CIRCLE DEF_60_CNT_SOCIAL_CIRCLE DAYS_LAST_PHONE_CHANGE FLAG_DOCUMENT_2 FLAG_DOCUMENT_3 FLAG_DOCUMENT_4 FLAG_DOCUMENT_5 FLAG_DOCUMENT_6 FLAG_DOCUMENT_7 FLAG_DOCUMENT_8 FLAG_DOCUMENT_9 FLAG_DOCUMENT_10 FLAG_DOCUMENT_11 FLAG_DOCUMENT_12 FLAG_DOCUMENT_13 FLAG_DOCUMENT_14 FLAG_DOCUMENT_15 FLAG_DOCUMENT_16 FLAG_DOCUMENT_17 FLAG_DOCUMENT_18 FLAG_DOCUMENT_19 FLAG_DOCUMENT_20 FLAG_DOCUMENT_21 AMT_REQ_CREDIT_BUREAU_HOUR AMT_REQ_CREDIT_BUREAU_DAY AMT_REQ_CREDIT_BUREAU_WEEK AMT_REQ_CREDIT_BUREAU_MON AMT_REQ_CREDIT_BUREAU_QRT AMT_REQ_CREDIT_BUREAU_YEAR
0 100001 Cash loans F N Y 0 135000.0 568800.0 20560.5 450000.0 Unaccompanied Working Higher education Married House / apartment 0.018850 -19241 -2329 -5170.0 -812 NaN 1 1 0 1 0 1 NaN 2.0 2 2 TUESDAY 18 0 0 0 0 0 0 Kindergarten ... NaN 0.0514 NaN NaN NaN block of flats 0.0392 Stone, brick No 0.0 0.0 0.0 0.0 -1740.0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0 0.0 0.0 0.0 0.0 0.0
1 100005 Cash loans M N Y 0 99000.0 222768.0 17370.0 180000.0 Unaccompanied Working Secondary / secondary special Married House / apartment 0.035792 -18064 -4469 -9118.0 -1623 NaN 1 1 0 1 0 0 Low-skill Laborers 2.0 2 2 FRIDAY 9 0 0 0 0 0 0 Self-employed ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.0 0.0 0.0 0.0 0.0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0 0.0 0.0 0.0 0.0 3.0
2 100013 Cash loans M Y Y 0 202500.0 663264.0 69777.0 630000.0 NaN Working Higher education Married House / apartment 0.019101 -20038 -4458 -2175.0 -3503 5.0 1 1 0 1 0 0 Drivers 2.0 2 2 MONDAY 14 0 0 0 0 0 0 Transport: type 3 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.0 0.0 0.0 0.0 -856.0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0 0.0 0.0 0.0 1.0 4.0
3 100028 Cash loans F N Y 2 315000.0 1575000.0 49018.5 1575000.0 Unaccompanied Working Secondary / secondary special Married House / apartment 0.026392 -13976 -1866 -2000.0 -4208 NaN 1 1 0 1 1 0 Sales staff 4.0 2 2 WEDNESDAY 11 0 0 0 0 0 0 Business Entity Type 3 ... 0.2446 0.3739 0.0388 0.0817 reg oper account block of flats 0.3700 Panel No 0.0 0.0 0.0 0.0 -1805.0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0 0.0 0.0 0.0 0.0 3.0
4 100038 Cash loans M Y N 1 180000.0 625500.0 32067.0 625500.0 Unaccompanied Working Secondary / secondary special Married House / apartment 0.010032 -13040 -2191 -4000.0 -4262 16.0 1 1 1 1 0 0 NaN 3.0 2 2 FRIDAY 5 0 0 0 0 1 1 Business Entity Type 3 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.0 0.0 0.0 0.0 -821.0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 NaN NaN NaN NaN NaN NaN

5 rows × 121 columns

bureau: shape is (1716428, 17)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1716428 entries, 0 to 1716427
Data columns (total 17 columns):
 #   Column                  Dtype  
---  ------                  -----  
 0   SK_ID_CURR              int64  
 1   SK_ID_BUREAU            int64  
 2   CREDIT_ACTIVE           object 
 3   CREDIT_CURRENCY         object 
 4   DAYS_CREDIT             int64  
 5   CREDIT_DAY_OVERDUE      int64  
 6   DAYS_CREDIT_ENDDATE     float64
 7   DAYS_ENDDATE_FACT       float64
 8   AMT_CREDIT_MAX_OVERDUE  float64
 9   CNT_CREDIT_PROLONG      int64  
 10  AMT_CREDIT_SUM          float64
 11  AMT_CREDIT_SUM_DEBT     float64
 12  AMT_CREDIT_SUM_LIMIT    float64
 13  AMT_CREDIT_SUM_OVERDUE  float64
 14  CREDIT_TYPE             object 
 15  DAYS_CREDIT_UPDATE      int64  
 16  AMT_ANNUITY             float64
dtypes: float64(8), int64(6), object(3)
memory usage: 222.6+ MB
None
SK_ID_CURR SK_ID_BUREAU CREDIT_ACTIVE CREDIT_CURRENCY DAYS_CREDIT CREDIT_DAY_OVERDUE DAYS_CREDIT_ENDDATE DAYS_ENDDATE_FACT AMT_CREDIT_MAX_OVERDUE CNT_CREDIT_PROLONG AMT_CREDIT_SUM AMT_CREDIT_SUM_DEBT AMT_CREDIT_SUM_LIMIT AMT_CREDIT_SUM_OVERDUE CREDIT_TYPE DAYS_CREDIT_UPDATE AMT_ANNUITY
0 215354 5714462 Closed currency 1 -497 0 -153.0 -153.0 NaN 0 91323.0 0.0 NaN 0.0 Consumer credit -131 NaN
1 215354 5714463 Active currency 1 -208 0 1075.0 NaN NaN 0 225000.0 171342.0 NaN 0.0 Credit card -20 NaN
2 215354 5714464 Active currency 1 -203 0 528.0 NaN NaN 0 464323.5 NaN NaN 0.0 Consumer credit -16 NaN
3 215354 5714465 Active currency 1 -203 0 NaN NaN NaN 0 90000.0 NaN NaN 0.0 Credit card -16 NaN
4 215354 5714466 Active currency 1 -629 0 1197.0 NaN 77674.5 0 2700000.0 NaN NaN 0.0 Consumer credit -21 NaN
bureau_balance: shape is (27299925, 3)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 27299925 entries, 0 to 27299924
Data columns (total 3 columns):
 #   Column          Dtype 
---  ------          ----- 
 0   SK_ID_BUREAU    int64 
 1   MONTHS_BALANCE  int64 
 2   STATUS          object
dtypes: int64(2), object(1)
memory usage: 624.8+ MB
None
SK_ID_BUREAU MONTHS_BALANCE STATUS
0 5715448 0 C
1 5715448 -1 C
2 5715448 -2 C
3 5715448 -3 C
4 5715448 -4 C
credit_card_balance: shape is (3840312, 23)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3840312 entries, 0 to 3840311
Data columns (total 23 columns):
 #   Column                      Dtype  
---  ------                      -----  
 0   SK_ID_PREV                  int64  
 1   SK_ID_CURR                  int64  
 2   MONTHS_BALANCE              int64  
 3   AMT_BALANCE                 float64
 4   AMT_CREDIT_LIMIT_ACTUAL     int64  
 5   AMT_DRAWINGS_ATM_CURRENT    float64
 6   AMT_DRAWINGS_CURRENT        float64
 7   AMT_DRAWINGS_OTHER_CURRENT  float64
 8   AMT_DRAWINGS_POS_CURRENT    float64
 9   AMT_INST_MIN_REGULARITY     float64
 10  AMT_PAYMENT_CURRENT         float64
 11  AMT_PAYMENT_TOTAL_CURRENT   float64
 12  AMT_RECEIVABLE_PRINCIPAL    float64
 13  AMT_RECIVABLE               float64
 14  AMT_TOTAL_RECEIVABLE        float64
 15  CNT_DRAWINGS_ATM_CURRENT    float64
 16  CNT_DRAWINGS_CURRENT        int64  
 17  CNT_DRAWINGS_OTHER_CURRENT  float64
 18  CNT_DRAWINGS_POS_CURRENT    float64
 19  CNT_INSTALMENT_MATURE_CUM   float64
 20  NAME_CONTRACT_STATUS        object 
 21  SK_DPD                      int64  
 22  SK_DPD_DEF                  int64  
dtypes: float64(15), int64(7), object(1)
memory usage: 673.9+ MB
None
SK_ID_PREV SK_ID_CURR MONTHS_BALANCE AMT_BALANCE AMT_CREDIT_LIMIT_ACTUAL AMT_DRAWINGS_ATM_CURRENT AMT_DRAWINGS_CURRENT AMT_DRAWINGS_OTHER_CURRENT AMT_DRAWINGS_POS_CURRENT AMT_INST_MIN_REGULARITY AMT_PAYMENT_CURRENT AMT_PAYMENT_TOTAL_CURRENT AMT_RECEIVABLE_PRINCIPAL AMT_RECIVABLE AMT_TOTAL_RECEIVABLE CNT_DRAWINGS_ATM_CURRENT CNT_DRAWINGS_CURRENT CNT_DRAWINGS_OTHER_CURRENT CNT_DRAWINGS_POS_CURRENT CNT_INSTALMENT_MATURE_CUM NAME_CONTRACT_STATUS SK_DPD SK_DPD_DEF
0 2562384 378907 -6 56.970 135000 0.0 877.5 0.0 877.5 1700.325 1800.0 1800.0 0.000 0.000 0.000 0.0 1 0.0 1.0 35.0 Active 0 0
1 2582071 363914 -1 63975.555 45000 2250.0 2250.0 0.0 0.0 2250.000 2250.0 2250.0 60175.080 64875.555 64875.555 1.0 1 0.0 0.0 69.0 Active 0 0
2 1740877 371185 -7 31815.225 450000 0.0 0.0 0.0 0.0 2250.000 2250.0 2250.0 26926.425 31460.085 31460.085 0.0 0 0.0 0.0 30.0 Active 0 0
3 1389973 337855 -4 236572.110 225000 2250.0 2250.0 0.0 0.0 11795.760 11925.0 11925.0 224949.285 233048.970 233048.970 1.0 1 0.0 0.0 10.0 Active 0 0
4 1891521 126868 -1 453919.455 450000 0.0 11547.0 0.0 11547.0 22924.890 27000.0 27000.0 443044.395 453919.455 453919.455 0.0 1 0.0 1.0 101.0 Active 0 0
installments_payments: shape is (7217242, 8)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7217242 entries, 0 to 7217241
Data columns (total 8 columns):
 #   Column                  Dtype  
---  ------                  -----  
 0   SK_ID_PREV              int64  
 1   SK_ID_CURR              int64  
 2   NUM_INSTALMENT_VERSION  float64
 3   NUM_INSTALMENT_NUMBER   int64  
 4   DAYS_INSTALMENT         float64
 5   DAYS_ENTRY_PAYMENT      float64
 6   AMT_INSTALMENT          float64
 7   AMT_PAYMENT             float64
dtypes: float64(5), int64(3)
memory usage: 440.5 MB
None
SK_ID_PREV SK_ID_CURR NUM_INSTALMENT_VERSION NUM_INSTALMENT_NUMBER DAYS_INSTALMENT DAYS_ENTRY_PAYMENT AMT_INSTALMENT AMT_PAYMENT
0 1054186 161674 1.0 6 -1180.0 -1187.0 6948.360 6948.360
1 1330831 151639 0.0 34 -2156.0 -2156.0 1716.525 1716.525
2 2085231 193053 2.0 1 -63.0 -63.0 25425.000 25425.000
3 2452527 199697 1.0 3 -2418.0 -2426.0 24350.130 24350.130
4 2714724 167756 1.0 2 -1383.0 -1366.0 2165.040 2160.585
previous_application: shape is (1670214, 37)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1670214 entries, 0 to 1670213
Data columns (total 37 columns):
 #   Column                       Non-Null Count    Dtype  
---  ------                       --------------    -----  
 0   SK_ID_PREV                   1670214 non-null  int64  
 1   SK_ID_CURR                   1670214 non-null  int64  
 2   NAME_CONTRACT_TYPE           1670214 non-null  object 
 3   AMT_ANNUITY                  1297979 non-null  float64
 4   AMT_APPLICATION              1670214 non-null  float64
 5   AMT_CREDIT                   1670213 non-null  float64
 6   AMT_DOWN_PAYMENT             774370 non-null   float64
 7   AMT_GOODS_PRICE              1284699 non-null  float64
 8   WEEKDAY_APPR_PROCESS_START   1670214 non-null  object 
 9   HOUR_APPR_PROCESS_START      1670214 non-null  int64  
 10  FLAG_LAST_APPL_PER_CONTRACT  1670214 non-null  object 
 11  NFLAG_LAST_APPL_IN_DAY       1670214 non-null  int64  
 12  RATE_DOWN_PAYMENT            774370 non-null   float64
 13  RATE_INTEREST_PRIMARY        5951 non-null     float64
 14  RATE_INTEREST_PRIVILEGED     5951 non-null     float64
 15  NAME_CASH_LOAN_PURPOSE       1670214 non-null  object 
 16  NAME_CONTRACT_STATUS         1670214 non-null  object 
 17  DAYS_DECISION                1670214 non-null  int64  
 18  NAME_PAYMENT_TYPE            1670214 non-null  object 
 19  CODE_REJECT_REASON           1670214 non-null  object 
 20  NAME_TYPE_SUITE              849809 non-null   object 
 21  NAME_CLIENT_TYPE             1670214 non-null  object 
 22  NAME_GOODS_CATEGORY          1670214 non-null  object 
 23  NAME_PORTFOLIO               1670214 non-null  object 
 24  NAME_PRODUCT_TYPE            1670214 non-null  object 
 25  CHANNEL_TYPE                 1670214 non-null  object 
 26  SELLERPLACE_AREA             1670214 non-null  int64  
 27  NAME_SELLER_INDUSTRY         1670214 non-null  object 
 28  CNT_PAYMENT                  1297984 non-null  float64
 29  NAME_YIELD_GROUP             1670214 non-null  object 
 30  PRODUCT_COMBINATION          1669868 non-null  object 
 31  DAYS_FIRST_DRAWING           997149 non-null   float64
 32  DAYS_FIRST_DUE               997149 non-null   float64
 33  DAYS_LAST_DUE_1ST_VERSION    997149 non-null   float64
 34  DAYS_LAST_DUE                997149 non-null   float64
 35  DAYS_TERMINATION             997149 non-null   float64
 36  NFLAG_INSURED_ON_APPROVAL    997149 non-null   float64
dtypes: float64(15), int64(6), object(16)
memory usage: 471.5+ MB
None
SK_ID_PREV SK_ID_CURR NAME_CONTRACT_TYPE AMT_ANNUITY AMT_APPLICATION AMT_CREDIT AMT_DOWN_PAYMENT AMT_GOODS_PRICE WEEKDAY_APPR_PROCESS_START HOUR_APPR_PROCESS_START FLAG_LAST_APPL_PER_CONTRACT NFLAG_LAST_APPL_IN_DAY RATE_DOWN_PAYMENT RATE_INTEREST_PRIMARY RATE_INTEREST_PRIVILEGED NAME_CASH_LOAN_PURPOSE NAME_CONTRACT_STATUS DAYS_DECISION NAME_PAYMENT_TYPE CODE_REJECT_REASON NAME_TYPE_SUITE NAME_CLIENT_TYPE NAME_GOODS_CATEGORY NAME_PORTFOLIO NAME_PRODUCT_TYPE CHANNEL_TYPE SELLERPLACE_AREA NAME_SELLER_INDUSTRY CNT_PAYMENT NAME_YIELD_GROUP PRODUCT_COMBINATION DAYS_FIRST_DRAWING DAYS_FIRST_DUE DAYS_LAST_DUE_1ST_VERSION DAYS_LAST_DUE DAYS_TERMINATION NFLAG_INSURED_ON_APPROVAL
0 2030495 271877 Consumer loans 1730.430 17145.0 17145.0 0.0 17145.0 SATURDAY 15 Y 1 0.0 0.182832 0.867336 XAP Approved -73 Cash through the bank XAP NaN Repeater Mobile POS XNA Country-wide 35 Connectivity 12.0 middle POS mobile with interest 365243.0 -42.0 300.0 -42.0 -37.0 0.0
1 2802425 108129 Cash loans 25188.615 607500.0 679671.0 NaN 607500.0 THURSDAY 11 Y 1 NaN NaN NaN XNA Approved -164 XNA XAP Unaccompanied Repeater XNA Cash x-sell Contact center -1 XNA 36.0 low_action Cash X-Sell: low 365243.0 -134.0 916.0 365243.0 365243.0 1.0
2 2523466 122040 Cash loans 15060.735 112500.0 136444.5 NaN 112500.0 TUESDAY 11 Y 1 NaN NaN NaN XNA Approved -301 Cash through the bank XAP Spouse, partner Repeater XNA Cash x-sell Credit and cash offices -1 XNA 12.0 high Cash X-Sell: high 365243.0 -271.0 59.0 365243.0 365243.0 1.0
3 2819243 176158 Cash loans 47041.335 450000.0 470790.0 NaN 450000.0 MONDAY 7 Y 1 NaN NaN NaN XNA Approved -512 Cash through the bank XAP NaN Repeater XNA Cash x-sell Credit and cash offices -1 XNA 12.0 middle Cash X-Sell: middle 365243.0 -482.0 -152.0 -182.0 -177.0 1.0
4 1784265 202054 Cash loans 31924.395 337500.0 404055.0 NaN 337500.0 THURSDAY 9 Y 1 NaN NaN NaN Repairs Refused -781 Cash through the bank HC NaN Repeater XNA Cash walk-in Credit and cash offices -1 XNA 24.0 high Cash Street: high NaN NaN NaN NaN NaN NaN
POS_CASH_balance: shape is (10001358, 8)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10001358 entries, 0 to 10001357
Data columns (total 8 columns):
 #   Column                 Dtype  
---  ------                 -----  
 0   SK_ID_PREV             int64  
 1   SK_ID_CURR             int64  
 2   MONTHS_BALANCE         int64  
 3   CNT_INSTALMENT         float64
 4   CNT_INSTALMENT_FUTURE  float64
 5   NAME_CONTRACT_STATUS   object 
 6   SK_DPD                 int64  
 7   SK_DPD_DEF             int64  
dtypes: float64(2), int64(5), object(1)
memory usage: 610.4+ MB
None
SK_ID_PREV SK_ID_CURR MONTHS_BALANCE CNT_INSTALMENT CNT_INSTALMENT_FUTURE NAME_CONTRACT_STATUS SK_DPD SK_DPD_DEF
0 1803195 182943 -31 48.0 45.0 Active 0 0
1 1715348 367990 -33 36.0 35.0 Active 0 0
2 1784872 397406 -32 12.0 9.0 Active 0 0
3 1903291 269225 -35 48.0 42.0 Active 0 0
4 2341044 334279 -35 36.0 35.0 Active 0 0
CPU times: user 29.5 s, sys: 2.54 s, total: 32.1 s
Wall time: 38.7 s
In [ ]:
for ds_name in datasets.keys():
    print(f'dataset {ds_name:24}: [ {datasets[ds_name].shape[0]:10,}, {datasets[ds_name].shape[1]}]')
dataset application_test        : [     48,744, 121]
dataset application_train       : [    307,511, 122]
dataset bureau                  : [  1,716,428, 17]
dataset bureau_balance          : [ 27,299,925, 3]
dataset credit_card_balance     : [  3,840,312, 23]
dataset installments_payments   : [  7,217,242, 8]
dataset previous_application    : [  1,670,214, 37]
dataset POS_CASH_balance        : [ 10,001,358, 8]

Exploratory Data Analysis

Exploratory Data Analysis is valuable to this project since it allows us to get closer to the certainty that the future results will be valid, accurately interpreted, and applicable to the proposed solution.

In phase 1 for this project this step involves looking at the summary statistics for each individual table in the model and focusing on the missing data , distribution and its central tendencies such as mean, median, count, min, max and the interquartile ranges.

Categorical and numerical features were looked at to identify anamolies in the data. Specific features were chosen to be visualized based on the correlation and distribution. The highly correlated features were used to plot the density to evaluate the distributions in comparison to the target.

Data file Statistics

Statistics Helper functions

In [ ]:
pd.set_option("display.max_rows", None, "display.max_columns", None)

# Full stats

def stats_summary1(df, df_name):
    print(datasets[df_name].info(verbose=True, null_counts=True ))
    print("-----"*15)
    print(f"Shape of the df {df_name} is {df.shape} \n")
    print("-----"*15)
    print(f"Statistical summary of {df_name} is :")
    print("-----"*15)
    print(f"Description of the df {df_name}:\n")
    print(display(HTML(np.round(datasets['application_train'].describe(),2).to_html()))) 
    #print(f"Description of the df {df_name}:\n",np.round(datasets['application_train'].describe(),2))

def stats_summary2(df, df_name):   
    print(f"Description of the df continued for {df_name}:\n")
    print("-----"*15)
    print("Data type value counts: \n",df.dtypes.value_counts())
    print("\nReturn number of unique elements in the object. \n")
    print(df.select_dtypes('object').apply(pd.Series.nunique, axis = 0))
    

# List the categorical and Numerical features of a DF
def feature_datatypes_groups(df, df_name):
    df_dtypes = df.columns.to_series().groupby(df.dtypes).groups
    print("-----"*15)
    print(f"Categorical and Numerical(int + float) features  of {df_name}.")
    print("-----"*15)
    print()
    for k, v in df_dtypes.items():
        print({k.name: v})
        print("---"*10)
    print("\n \n")    
        
# Null data list and plot.        
def null_data_plot(df, df_name):
    percent = (df.isnull().sum()/df.isnull().count()*100).sort_values(ascending = False).round(2)
    sum_missing = df.isna().sum().sort_values(ascending = False)
    missing_data  = pd.concat([percent, sum_missing], axis=1, keys=['Percent', "Train Missing Count"])
    missing_data=missing_data[missing_data['Percent'] > 0] 
    print("-----"*15)
    print("-----"*15)
    print('\n The Missing Data: \n')
#     display(missing_data)  # display few
    if len(missing_data)==0:
      print("No missing Data")
    else:
      display(HTML(missing_data.to_html()))  # display all the rows
      print("-----"*15)
      if len(df.columns)> 35:
        f,ax =plt.subplots(figsize=(8,15))
      else: 
        f,ax =plt.subplots()
      #plt.xticks(rotation='90')
      #fig=sns.barplot(missing_data.index, missing_data["Percent"],alpha=0.8)
      #plt.xlabel('Features', fontsize=15)
      #plt.ylabel('Percent of missing values', fontsize=15)
      plt.title(f'Percent missing data for {df_name}.', fontsize=10)
      fig=sns.barplot(missing_data["Percent"],missing_data.index ,alpha=0.8)
      plt.xlabel('Percent of missing values', fontsize=10)
      plt.ylabel('Features', fontsize=10)
      return missing_data


# Full consolidation of all the stats function.
def display_stats(df, df_name):
    print("--"*40)
    print(" "*20 + '\033[1m'+ df_name +  '\033[0m' +" "*20)
    print("--"*40)
    stats_summary1(df, df_name)

def display_feature_info(df, df_name):
    stats_summary2(df, df_name)
    feature_datatypes_groups(df, df_name)
    null_data_plot(df, df_name)

Summary of application_train

In [ ]:
(datasets['application_train'].dtypes).unique()
Out[ ]:
array([dtype('int64'), dtype('O'), dtype('float64')], dtype=object)
In [ ]:
display_stats(datasets['application_train'], 'application_train')
--------------------------------------------------------------------------------
                    application_train                    
--------------------------------------------------------------------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 307511 entries, 0 to 307510
Data columns (total 122 columns):
 #   Column                        Non-Null Count   Dtype  
---  ------                        --------------   -----  
 0   SK_ID_CURR                    307511 non-null  int64  
 1   TARGET                        307511 non-null  int64  
 2   NAME_CONTRACT_TYPE            307511 non-null  object 
 3   CODE_GENDER                   307511 non-null  object 
 4   FLAG_OWN_CAR                  307511 non-null  object 
 5   FLAG_OWN_REALTY               307511 non-null  object 
 6   CNT_CHILDREN                  307511 non-null  int64  
 7   AMT_INCOME_TOTAL              307511 non-null  float64
 8   AMT_CREDIT                    307511 non-null  float64
 9   AMT_ANNUITY                   307499 non-null  float64
 10  AMT_GOODS_PRICE               307233 non-null  float64
 11  NAME_TYPE_SUITE               306219 non-null  object 
 12  NAME_INCOME_TYPE              307511 non-null  object 
 13  NAME_EDUCATION_TYPE           307511 non-null  object 
 14  NAME_FAMILY_STATUS            307511 non-null  object 
 15  NAME_HOUSING_TYPE             307511 non-null  object 
 16  REGION_POPULATION_RELATIVE    307511 non-null  float64
 17  DAYS_BIRTH                    307511 non-null  int64  
 18  DAYS_EMPLOYED                 307511 non-null  int64  
 19  DAYS_REGISTRATION             307511 non-null  float64
 20  DAYS_ID_PUBLISH               307511 non-null  int64  
 21  OWN_CAR_AGE                   104582 non-null  float64
 22  FLAG_MOBIL                    307511 non-null  int64  
 23  FLAG_EMP_PHONE                307511 non-null  int64  
 24  FLAG_WORK_PHONE               307511 non-null  int64  
 25  FLAG_CONT_MOBILE              307511 non-null  int64  
 26  FLAG_PHONE                    307511 non-null  int64  
 27  FLAG_EMAIL                    307511 non-null  int64  
 28  OCCUPATION_TYPE               211120 non-null  object 
 29  CNT_FAM_MEMBERS               307509 non-null  float64
 30  REGION_RATING_CLIENT          307511 non-null  int64  
 31  REGION_RATING_CLIENT_W_CITY   307511 non-null  int64  
 32  WEEKDAY_APPR_PROCESS_START    307511 non-null  object 
 33  HOUR_APPR_PROCESS_START       307511 non-null  int64  
 34  REG_REGION_NOT_LIVE_REGION    307511 non-null  int64  
 35  REG_REGION_NOT_WORK_REGION    307511 non-null  int64  
 36  LIVE_REGION_NOT_WORK_REGION   307511 non-null  int64  
 37  REG_CITY_NOT_LIVE_CITY        307511 non-null  int64  
 38  REG_CITY_NOT_WORK_CITY        307511 non-null  int64  
 39  LIVE_CITY_NOT_WORK_CITY       307511 non-null  int64  
 40  ORGANIZATION_TYPE             307511 non-null  object 
 41  EXT_SOURCE_1                  134133 non-null  float64
 42  EXT_SOURCE_2                  306851 non-null  float64
 43  EXT_SOURCE_3                  246546 non-null  float64
 44  APARTMENTS_AVG                151450 non-null  float64
 45  BASEMENTAREA_AVG              127568 non-null  float64
 46  YEARS_BEGINEXPLUATATION_AVG   157504 non-null  float64
 47  YEARS_BUILD_AVG               103023 non-null  float64
 48  COMMONAREA_AVG                92646 non-null   float64
 49  ELEVATORS_AVG                 143620 non-null  float64
 50  ENTRANCES_AVG                 152683 non-null  float64
 51  FLOORSMAX_AVG                 154491 non-null  float64
 52  FLOORSMIN_AVG                 98869 non-null   float64
 53  LANDAREA_AVG                  124921 non-null  float64
 54  LIVINGAPARTMENTS_AVG          97312 non-null   float64
 55  LIVINGAREA_AVG                153161 non-null  float64
 56  NONLIVINGAPARTMENTS_AVG       93997 non-null   float64
 57  NONLIVINGAREA_AVG             137829 non-null  float64
 58  APARTMENTS_MODE               151450 non-null  float64
 59  BASEMENTAREA_MODE             127568 non-null  float64
 60  YEARS_BEGINEXPLUATATION_MODE  157504 non-null  float64
 61  YEARS_BUILD_MODE              103023 non-null  float64
 62  COMMONAREA_MODE               92646 non-null   float64
 63  ELEVATORS_MODE                143620 non-null  float64
 64  ENTRANCES_MODE                152683 non-null  float64
 65  FLOORSMAX_MODE                154491 non-null  float64
 66  FLOORSMIN_MODE                98869 non-null   float64
 67  LANDAREA_MODE                 124921 non-null  float64
 68  LIVINGAPARTMENTS_MODE         97312 non-null   float64
 69  LIVINGAREA_MODE               153161 non-null  float64
 70  NONLIVINGAPARTMENTS_MODE      93997 non-null   float64
 71  NONLIVINGAREA_MODE            137829 non-null  float64
 72  APARTMENTS_MEDI               151450 non-null  float64
 73  BASEMENTAREA_MEDI             127568 non-null  float64
 74  YEARS_BEGINEXPLUATATION_MEDI  157504 non-null  float64
 75  YEARS_BUILD_MEDI              103023 non-null  float64
 76  COMMONAREA_MEDI               92646 non-null   float64
 77  ELEVATORS_MEDI                143620 non-null  float64
 78  ENTRANCES_MEDI                152683 non-null  float64
 79  FLOORSMAX_MEDI                154491 non-null  float64
 80  FLOORSMIN_MEDI                98869 non-null   float64
 81  LANDAREA_MEDI                 124921 non-null  float64
 82  LIVINGAPARTMENTS_MEDI         97312 non-null   float64
 83  LIVINGAREA_MEDI               153161 non-null  float64
 84  NONLIVINGAPARTMENTS_MEDI      93997 non-null   float64
 85  NONLIVINGAREA_MEDI            137829 non-null  float64
 86  FONDKAPREMONT_MODE            97216 non-null   object 
 87  HOUSETYPE_MODE                153214 non-null  object 
 88  TOTALAREA_MODE                159080 non-null  float64
 89  WALLSMATERIAL_MODE            151170 non-null  object 
 90  EMERGENCYSTATE_MODE           161756 non-null  object 
 91  OBS_30_CNT_SOCIAL_CIRCLE      306490 non-null  float64
 92  DEF_30_CNT_SOCIAL_CIRCLE      306490 non-null  float64
 93  OBS_60_CNT_SOCIAL_CIRCLE      306490 non-null  float64
 94  DEF_60_CNT_SOCIAL_CIRCLE      306490 non-null  float64
 95  DAYS_LAST_PHONE_CHANGE        307510 non-null  float64
 96  FLAG_DOCUMENT_2               307511 non-null  int64  
 97  FLAG_DOCUMENT_3               307511 non-null  int64  
 98  FLAG_DOCUMENT_4               307511 non-null  int64  
 99  FLAG_DOCUMENT_5               307511 non-null  int64  
 100 FLAG_DOCUMENT_6               307511 non-null  int64  
 101 FLAG_DOCUMENT_7               307511 non-null  int64  
 102 FLAG_DOCUMENT_8               307511 non-null  int64  
 103 FLAG_DOCUMENT_9               307511 non-null  int64  
 104 FLAG_DOCUMENT_10              307511 non-null  int64  
 105 FLAG_DOCUMENT_11              307511 non-null  int64  
 106 FLAG_DOCUMENT_12              307511 non-null  int64  
 107 FLAG_DOCUMENT_13              307511 non-null  int64  
 108 FLAG_DOCUMENT_14              307511 non-null  int64  
 109 FLAG_DOCUMENT_15              307511 non-null  int64  
 110 FLAG_DOCUMENT_16              307511 non-null  int64  
 111 FLAG_DOCUMENT_17              307511 non-null  int64  
 112 FLAG_DOCUMENT_18              307511 non-null  int64  
 113 FLAG_DOCUMENT_19              307511 non-null  int64  
 114 FLAG_DOCUMENT_20              307511 non-null  int64  
 115 FLAG_DOCUMENT_21              307511 non-null  int64  
 116 AMT_REQ_CREDIT_BUREAU_HOUR    265992 non-null  float64
 117 AMT_REQ_CREDIT_BUREAU_DAY     265992 non-null  float64
 118 AMT_REQ_CREDIT_BUREAU_WEEK    265992 non-null  float64
 119 AMT_REQ_CREDIT_BUREAU_MON     265992 non-null  float64
 120 AMT_REQ_CREDIT_BUREAU_QRT     265992 non-null  float64
 121 AMT_REQ_CREDIT_BUREAU_YEAR    265992 non-null  float64
dtypes: float64(65), int64(41), object(16)
memory usage: 286.2+ MB
None
---------------------------------------------------------------------------
Shape of the df application_train is (307511, 122) 

---------------------------------------------------------------------------
Statistical summary of application_train is :
---------------------------------------------------------------------------
Description of the df application_train:

SK_ID_CURR TARGET CNT_CHILDREN AMT_INCOME_TOTAL AMT_CREDIT AMT_ANNUITY AMT_GOODS_PRICE REGION_POPULATION_RELATIVE DAYS_BIRTH DAYS_EMPLOYED DAYS_REGISTRATION DAYS_ID_PUBLISH OWN_CAR_AGE FLAG_MOBIL FLAG_EMP_PHONE FLAG_WORK_PHONE FLAG_CONT_MOBILE FLAG_PHONE FLAG_EMAIL CNT_FAM_MEMBERS REGION_RATING_CLIENT REGION_RATING_CLIENT_W_CITY HOUR_APPR_PROCESS_START REG_REGION_NOT_LIVE_REGION REG_REGION_NOT_WORK_REGION LIVE_REGION_NOT_WORK_REGION REG_CITY_NOT_LIVE_CITY REG_CITY_NOT_WORK_CITY LIVE_CITY_NOT_WORK_CITY EXT_SOURCE_1 EXT_SOURCE_2 EXT_SOURCE_3 APARTMENTS_AVG BASEMENTAREA_AVG YEARS_BEGINEXPLUATATION_AVG YEARS_BUILD_AVG COMMONAREA_AVG ELEVATORS_AVG ENTRANCES_AVG FLOORSMAX_AVG FLOORSMIN_AVG LANDAREA_AVG LIVINGAPARTMENTS_AVG LIVINGAREA_AVG NONLIVINGAPARTMENTS_AVG NONLIVINGAREA_AVG APARTMENTS_MODE BASEMENTAREA_MODE YEARS_BEGINEXPLUATATION_MODE YEARS_BUILD_MODE COMMONAREA_MODE ELEVATORS_MODE ENTRANCES_MODE FLOORSMAX_MODE FLOORSMIN_MODE LANDAREA_MODE LIVINGAPARTMENTS_MODE LIVINGAREA_MODE NONLIVINGAPARTMENTS_MODE NONLIVINGAREA_MODE APARTMENTS_MEDI BASEMENTAREA_MEDI YEARS_BEGINEXPLUATATION_MEDI YEARS_BUILD_MEDI COMMONAREA_MEDI ELEVATORS_MEDI ENTRANCES_MEDI FLOORSMAX_MEDI FLOORSMIN_MEDI LANDAREA_MEDI LIVINGAPARTMENTS_MEDI LIVINGAREA_MEDI NONLIVINGAPARTMENTS_MEDI NONLIVINGAREA_MEDI TOTALAREA_MODE OBS_30_CNT_SOCIAL_CIRCLE DEF_30_CNT_SOCIAL_CIRCLE OBS_60_CNT_SOCIAL_CIRCLE DEF_60_CNT_SOCIAL_CIRCLE DAYS_LAST_PHONE_CHANGE FLAG_DOCUMENT_2 FLAG_DOCUMENT_3 FLAG_DOCUMENT_4 FLAG_DOCUMENT_5 FLAG_DOCUMENT_6 FLAG_DOCUMENT_7 FLAG_DOCUMENT_8 FLAG_DOCUMENT_9 FLAG_DOCUMENT_10 FLAG_DOCUMENT_11 FLAG_DOCUMENT_12 FLAG_DOCUMENT_13 FLAG_DOCUMENT_14 FLAG_DOCUMENT_15 FLAG_DOCUMENT_16 FLAG_DOCUMENT_17 FLAG_DOCUMENT_18 FLAG_DOCUMENT_19 FLAG_DOCUMENT_20 FLAG_DOCUMENT_21 AMT_REQ_CREDIT_BUREAU_HOUR AMT_REQ_CREDIT_BUREAU_DAY AMT_REQ_CREDIT_BUREAU_WEEK AMT_REQ_CREDIT_BUREAU_MON AMT_REQ_CREDIT_BUREAU_QRT AMT_REQ_CREDIT_BUREAU_YEAR
count 307511.00 307511.00 307511.00 3.075110e+05 307511.00 307499.00 307233.00 307511.00 307511.00 307511.00 307511.00 307511.00 104582.00 307511.0 307511.00 307511.0 307511.00 307511.00 307511.00 307509.00 307511.00 307511.00 307511.00 307511.00 307511.00 307511.00 307511.00 307511.00 307511.00 134133.00 306851.00 246546.00 151450.00 127568.00 157504.00 103023.00 92646.00 143620.00 152683.00 154491.00 98869.00 124921.00 97312.00 153161.00 93997.00 137829.00 151450.00 127568.00 157504.00 103023.00 92646.00 143620.00 152683.00 154491.00 98869.00 124921.00 97312.00 153161.00 93997.00 137829.00 151450.00 127568.00 157504.00 103023.00 92646.00 143620.00 152683.00 154491.00 98869.00 124921.00 97312.00 153161.00 93997.00 137829.00 159080.00 306490.00 306490.00 306490.00 306490.00 307510.00 307511.00 307511.00 307511.00 307511.00 307511.00 307511.00 307511.00 307511.00 307511.0 307511.00 307511.0 307511.00 307511.00 307511.00 307511.00 307511.00 307511.00 307511.00 307511.00 307511.00 265992.00 265992.00 265992.00 265992.00 265992.00 265992.00
mean 278180.52 0.08 0.42 1.687979e+05 599026.00 27108.57 538396.21 0.02 -16037.00 63815.05 -4986.12 -2994.20 12.06 1.0 0.82 0.2 1.00 0.28 0.06 2.15 2.05 2.03 12.06 0.02 0.05 0.04 0.08 0.23 0.18 0.50 0.51 0.51 0.12 0.09 0.98 0.75 0.04 0.08 0.15 0.23 0.23 0.07 0.10 0.11 0.01 0.03 0.11 0.09 0.98 0.76 0.04 0.07 0.15 0.22 0.23 0.06 0.11 0.11 0.01 0.03 0.12 0.09 0.98 0.76 0.04 0.08 0.15 0.23 0.23 0.07 0.10 0.11 0.01 0.03 0.10 1.42 0.14 1.41 0.10 -962.86 0.00 0.71 0.00 0.02 0.09 0.00 0.08 0.00 0.0 0.00 0.0 0.00 0.00 0.00 0.01 0.00 0.01 0.00 0.00 0.00 0.01 0.01 0.03 0.27 0.27 1.90
std 102790.18 0.27 0.72 2.371231e+05 402490.78 14493.74 369446.46 0.01 4363.99 141275.77 3522.89 1509.45 11.94 0.0 0.38 0.4 0.04 0.45 0.23 0.91 0.51 0.50 3.27 0.12 0.22 0.20 0.27 0.42 0.38 0.21 0.19 0.19 0.11 0.08 0.06 0.11 0.08 0.13 0.10 0.14 0.16 0.08 0.09 0.11 0.05 0.07 0.11 0.08 0.06 0.11 0.07 0.13 0.10 0.14 0.16 0.08 0.10 0.11 0.05 0.07 0.11 0.08 0.06 0.11 0.08 0.13 0.10 0.15 0.16 0.08 0.09 0.11 0.05 0.07 0.11 2.40 0.45 2.38 0.36 826.81 0.01 0.45 0.01 0.12 0.28 0.01 0.27 0.06 0.0 0.06 0.0 0.06 0.05 0.03 0.10 0.02 0.09 0.02 0.02 0.02 0.08 0.11 0.20 0.92 0.79 1.87
min 100002.00 0.00 0.00 2.565000e+04 45000.00 1615.50 40500.00 0.00 -25229.00 -17912.00 -24672.00 -7197.00 0.00 0.0 0.00 0.0 0.00 0.00 0.00 1.00 1.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 -4292.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.0 0.00 0.0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
25% 189145.50 0.00 0.00 1.125000e+05 270000.00 16524.00 238500.00 0.01 -19682.00 -2760.00 -7479.50 -4299.00 5.00 1.0 1.00 0.0 1.00 0.00 0.00 2.00 2.00 2.00 10.00 0.00 0.00 0.00 0.00 0.00 0.00 0.33 0.39 0.37 0.06 0.04 0.98 0.69 0.01 0.00 0.07 0.17 0.08 0.02 0.05 0.05 0.00 0.00 0.05 0.04 0.98 0.70 0.01 0.00 0.07 0.17 0.08 0.02 0.05 0.04 0.00 0.00 0.06 0.04 0.98 0.69 0.01 0.00 0.07 0.17 0.08 0.02 0.05 0.05 0.00 0.00 0.04 0.00 0.00 0.00 0.00 -1570.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.0 0.00 0.0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
50% 278202.00 0.00 0.00 1.471500e+05 513531.00 24903.00 450000.00 0.02 -15750.00 -1213.00 -4504.00 -3254.00 9.00 1.0 1.00 0.0 1.00 0.00 0.00 2.00 2.00 2.00 12.00 0.00 0.00 0.00 0.00 0.00 0.00 0.51 0.57 0.54 0.09 0.08 0.98 0.76 0.02 0.00 0.14 0.17 0.21 0.05 0.08 0.07 0.00 0.00 0.08 0.07 0.98 0.76 0.02 0.00 0.14 0.17 0.21 0.05 0.08 0.07 0.00 0.00 0.09 0.08 0.98 0.76 0.02 0.00 0.14 0.17 0.21 0.05 0.08 0.07 0.00 0.00 0.07 0.00 0.00 0.00 0.00 -757.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.0 0.00 0.0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00
75% 367142.50 0.00 1.00 2.025000e+05 808650.00 34596.00 679500.00 0.03 -12413.00 -289.00 -2010.00 -1720.00 15.00 1.0 1.00 0.0 1.00 1.00 0.00 3.00 2.00 2.00 14.00 0.00 0.00 0.00 0.00 0.00 0.00 0.68 0.66 0.67 0.15 0.11 0.99 0.82 0.05 0.12 0.21 0.33 0.38 0.09 0.12 0.13 0.00 0.03 0.14 0.11 0.99 0.82 0.05 0.12 0.21 0.33 0.38 0.08 0.13 0.13 0.00 0.02 0.15 0.11 0.99 0.83 0.05 0.12 0.21 0.33 0.38 0.09 0.12 0.13 0.00 0.03 0.13 2.00 0.00 2.00 0.00 -274.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.0 0.00 0.0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 3.00
max 456255.00 1.00 19.00 1.170000e+08 4050000.00 258025.50 4050000.00 0.07 -7489.00 365243.00 0.00 0.00 91.00 1.0 1.00 1.0 1.00 1.00 1.00 20.00 3.00 3.00 23.00 1.00 1.00 1.00 1.00 1.00 1.00 0.96 0.85 0.90 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 348.00 34.00 344.00 24.00 0.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.0 1.00 1.0 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 4.00 9.00 8.00 27.00 261.00 25.00
None
In [ ]:
display_feature_info(datasets['application_train'], 'application_train')
Description of the df continued for application_train:

---------------------------------------------------------------------------
Data type value counts: 
 float64    65
int64      41
object     16
dtype: int64

Return number of unique elements in the object. 

NAME_CONTRACT_TYPE             2
CODE_GENDER                    3
FLAG_OWN_CAR                   2
FLAG_OWN_REALTY                2
NAME_TYPE_SUITE                7
NAME_INCOME_TYPE               8
NAME_EDUCATION_TYPE            5
NAME_FAMILY_STATUS             6
NAME_HOUSING_TYPE              6
OCCUPATION_TYPE               18
WEEKDAY_APPR_PROCESS_START     7
ORGANIZATION_TYPE             58
FONDKAPREMONT_MODE             4
HOUSETYPE_MODE                 3
WALLSMATERIAL_MODE             7
EMERGENCYSTATE_MODE            2
dtype: int64
---------------------------------------------------------------------------
Categorical and Numerical(int + float) features  of application_train.
---------------------------------------------------------------------------

{'int64': Index(['SK_ID_CURR', 'TARGET', 'CNT_CHILDREN', 'DAYS_BIRTH', 'DAYS_EMPLOYED',
       'DAYS_ID_PUBLISH', 'FLAG_MOBIL', 'FLAG_EMP_PHONE', 'FLAG_WORK_PHONE',
       'FLAG_CONT_MOBILE', 'FLAG_PHONE', 'FLAG_EMAIL', 'REGION_RATING_CLIENT',
       'REGION_RATING_CLIENT_W_CITY', 'HOUR_APPR_PROCESS_START',
       'REG_REGION_NOT_LIVE_REGION', 'REG_REGION_NOT_WORK_REGION',
       'LIVE_REGION_NOT_WORK_REGION', 'REG_CITY_NOT_LIVE_CITY',
       'REG_CITY_NOT_WORK_CITY', 'LIVE_CITY_NOT_WORK_CITY', 'FLAG_DOCUMENT_2',
       'FLAG_DOCUMENT_3', 'FLAG_DOCUMENT_4', 'FLAG_DOCUMENT_5',
       'FLAG_DOCUMENT_6', 'FLAG_DOCUMENT_7', 'FLAG_DOCUMENT_8',
       'FLAG_DOCUMENT_9', 'FLAG_DOCUMENT_10', 'FLAG_DOCUMENT_11',
       'FLAG_DOCUMENT_12', 'FLAG_DOCUMENT_13', 'FLAG_DOCUMENT_14',
       'FLAG_DOCUMENT_15', 'FLAG_DOCUMENT_16', 'FLAG_DOCUMENT_17',
       'FLAG_DOCUMENT_18', 'FLAG_DOCUMENT_19', 'FLAG_DOCUMENT_20',
       'FLAG_DOCUMENT_21'],
      dtype='object')}
------------------------------
{'float64': Index(['AMT_INCOME_TOTAL', 'AMT_CREDIT', 'AMT_ANNUITY', 'AMT_GOODS_PRICE',
       'REGION_POPULATION_RELATIVE', 'DAYS_REGISTRATION', 'OWN_CAR_AGE',
       'CNT_FAM_MEMBERS', 'EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3',
       'APARTMENTS_AVG', 'BASEMENTAREA_AVG', 'YEARS_BEGINEXPLUATATION_AVG',
       'YEARS_BUILD_AVG', 'COMMONAREA_AVG', 'ELEVATORS_AVG', 'ENTRANCES_AVG',
       'FLOORSMAX_AVG', 'FLOORSMIN_AVG', 'LANDAREA_AVG',
       'LIVINGAPARTMENTS_AVG', 'LIVINGAREA_AVG', 'NONLIVINGAPARTMENTS_AVG',
       'NONLIVINGAREA_AVG', 'APARTMENTS_MODE', 'BASEMENTAREA_MODE',
       'YEARS_BEGINEXPLUATATION_MODE', 'YEARS_BUILD_MODE', 'COMMONAREA_MODE',
       'ELEVATORS_MODE', 'ENTRANCES_MODE', 'FLOORSMAX_MODE', 'FLOORSMIN_MODE',
       'LANDAREA_MODE', 'LIVINGAPARTMENTS_MODE', 'LIVINGAREA_MODE',
       'NONLIVINGAPARTMENTS_MODE', 'NONLIVINGAREA_MODE', 'APARTMENTS_MEDI',
       'BASEMENTAREA_MEDI', 'YEARS_BEGINEXPLUATATION_MEDI', 'YEARS_BUILD_MEDI',
       'COMMONAREA_MEDI', 'ELEVATORS_MEDI', 'ENTRANCES_MEDI', 'FLOORSMAX_MEDI',
       'FLOORSMIN_MEDI', 'LANDAREA_MEDI', 'LIVINGAPARTMENTS_MEDI',
       'LIVINGAREA_MEDI', 'NONLIVINGAPARTMENTS_MEDI', 'NONLIVINGAREA_MEDI',
       'TOTALAREA_MODE', 'OBS_30_CNT_SOCIAL_CIRCLE',
       'DEF_30_CNT_SOCIAL_CIRCLE', 'OBS_60_CNT_SOCIAL_CIRCLE',
       'DEF_60_CNT_SOCIAL_CIRCLE', 'DAYS_LAST_PHONE_CHANGE',
       'AMT_REQ_CREDIT_BUREAU_HOUR', 'AMT_REQ_CREDIT_BUREAU_DAY',
       'AMT_REQ_CREDIT_BUREAU_WEEK', 'AMT_REQ_CREDIT_BUREAU_MON',
       'AMT_REQ_CREDIT_BUREAU_QRT', 'AMT_REQ_CREDIT_BUREAU_YEAR'],
      dtype='object')}
------------------------------
{'object': Index(['NAME_CONTRACT_TYPE', 'CODE_GENDER', 'FLAG_OWN_CAR', 'FLAG_OWN_REALTY',
       'NAME_TYPE_SUITE', 'NAME_INCOME_TYPE', 'NAME_EDUCATION_TYPE',
       'NAME_FAMILY_STATUS', 'NAME_HOUSING_TYPE', 'OCCUPATION_TYPE',
       'WEEKDAY_APPR_PROCESS_START', 'ORGANIZATION_TYPE', 'FONDKAPREMONT_MODE',
       'HOUSETYPE_MODE', 'WALLSMATERIAL_MODE', 'EMERGENCYSTATE_MODE'],
      dtype='object')}
------------------------------

 

---------------------------------------------------------------------------
---------------------------------------------------------------------------

 The Missing Data: 

Percent Train Missing Count
COMMONAREA_MEDI 69.87 214865
COMMONAREA_AVG 69.87 214865
COMMONAREA_MODE 69.87 214865
NONLIVINGAPARTMENTS_MODE 69.43 213514
NONLIVINGAPARTMENTS_MEDI 69.43 213514
NONLIVINGAPARTMENTS_AVG 69.43 213514
FONDKAPREMONT_MODE 68.39 210295
LIVINGAPARTMENTS_MEDI 68.35 210199
LIVINGAPARTMENTS_MODE 68.35 210199
LIVINGAPARTMENTS_AVG 68.35 210199
FLOORSMIN_MEDI 67.85 208642
FLOORSMIN_MODE 67.85 208642
FLOORSMIN_AVG 67.85 208642
YEARS_BUILD_MEDI 66.50 204488
YEARS_BUILD_AVG 66.50 204488
YEARS_BUILD_MODE 66.50 204488
OWN_CAR_AGE 65.99 202929
LANDAREA_MODE 59.38 182590
LANDAREA_AVG 59.38 182590
LANDAREA_MEDI 59.38 182590
BASEMENTAREA_MEDI 58.52 179943
BASEMENTAREA_AVG 58.52 179943
BASEMENTAREA_MODE 58.52 179943
EXT_SOURCE_1 56.38 173378
NONLIVINGAREA_MEDI 55.18 169682
NONLIVINGAREA_AVG 55.18 169682
NONLIVINGAREA_MODE 55.18 169682
ELEVATORS_MODE 53.30 163891
ELEVATORS_AVG 53.30 163891
ELEVATORS_MEDI 53.30 163891
WALLSMATERIAL_MODE 50.84 156341
APARTMENTS_MODE 50.75 156061
APARTMENTS_AVG 50.75 156061
APARTMENTS_MEDI 50.75 156061
ENTRANCES_MEDI 50.35 154828
ENTRANCES_MODE 50.35 154828
ENTRANCES_AVG 50.35 154828
LIVINGAREA_MEDI 50.19 154350
LIVINGAREA_MODE 50.19 154350
LIVINGAREA_AVG 50.19 154350
HOUSETYPE_MODE 50.18 154297
FLOORSMAX_MODE 49.76 153020
FLOORSMAX_MEDI 49.76 153020
FLOORSMAX_AVG 49.76 153020
YEARS_BEGINEXPLUATATION_MEDI 48.78 150007
YEARS_BEGINEXPLUATATION_AVG 48.78 150007
YEARS_BEGINEXPLUATATION_MODE 48.78 150007
TOTALAREA_MODE 48.27 148431
EMERGENCYSTATE_MODE 47.40 145755
OCCUPATION_TYPE 31.35 96391
EXT_SOURCE_3 19.83 60965
AMT_REQ_CREDIT_BUREAU_QRT 13.50 41519
AMT_REQ_CREDIT_BUREAU_YEAR 13.50 41519
AMT_REQ_CREDIT_BUREAU_WEEK 13.50 41519
AMT_REQ_CREDIT_BUREAU_MON 13.50 41519
AMT_REQ_CREDIT_BUREAU_DAY 13.50 41519
AMT_REQ_CREDIT_BUREAU_HOUR 13.50 41519
NAME_TYPE_SUITE 0.42 1292
OBS_30_CNT_SOCIAL_CIRCLE 0.33 1021
OBS_60_CNT_SOCIAL_CIRCLE 0.33 1021
DEF_60_CNT_SOCIAL_CIRCLE 0.33 1021
DEF_30_CNT_SOCIAL_CIRCLE 0.33 1021
EXT_SOURCE_2 0.21 660
AMT_GOODS_PRICE 0.09 278
---------------------------------------------------------------------------

Observation 1

  • We can see anamolies in the data from the descriptive statistics for Days Birth, Days employed, Days registration, Days Id publish which is a negative value and is not expected.
  • Own car age has a max of 91.
  • There are redundant features related to living space and realty which can be helpful to weed out during the feature eduction process and avoid issues with multicollinearity.
Days Employed
In [ ]:
datasets["application_train"]['DAYS_EMPLOYED'].describe() 
Out[ ]:
count    307511.000000
mean      63815.045904
std      141275.766519
min      -17912.000000
25%       -2760.000000
50%       -1213.000000
75%        -289.000000
max      365243.000000
Name: DAYS_EMPLOYED, dtype: float64
In [ ]:
anom_days_employed = datasets["application_train"][datasets["application_train"]['DAYS_EMPLOYED']==365243]
norm_days_employed = datasets["application_train"][datasets["application_train"]['DAYS_EMPLOYED']!=365243]
print(anom_days_employed.shape)

dr_anom = anom_days_employed['TARGET'].mean()*100
dr_norm = norm_days_employed['TARGET'].mean()*100

print('Default rate (Anomaly): {:.2f}'.format(dr_anom))
print('Default rate (Normal): {:.2f}'.format(dr_norm))

pct_anom_days_employed = (anom_days_employed.shape[0]/datasets["application_train"].shape[0])*100
print(pct_anom_days_employed) 
(55374, 122)
Default rate (Anomaly): 5.40
Default rate (Normal): 8.66
18.00716071945394
In [ ]:
df_app_train=datasets["application_train"].copy()
df_app_train['DAYS_EMPLOYED_ANOM'] = df_app_train['DAYS_EMPLOYED'] == 365243
df_app_train['DAYS_EMPLOYED'].replace({365243:np.nan}, inplace=True)
plt.hist(df_app_train['DAYS_EMPLOYED'],edgecolor = 'k', bins = 25)
plt.title('DAYS_EMPLOYED'); plt.xlabel('No Of Days as per Dataset'); plt.ylabel('Count'); 
In [ ]:
gc.collect()
Out[ ]:
18405

The bins above histogram shows that the data is not logical and this feature needs to be further investigated for imbalances. Number of days employed would show a steady source of income and could be a useful feature for predicting risk

Own Car Age
In [ ]:
plt.hist(datasets["application_train"]['OWN_CAR_AGE'],edgecolor = 'k', bins = 25)
plt.title('OWN CAR AGE'); plt.xlabel('No Of Days as per Dataset'); plt.ylabel('Count'); 

We see that those who have cars over 60 years old have a number of applications (i.e., 3339). This could a good area to investigate risk

In [ ]:
display_feature_info(datasets['application_train'], 'application_train')
Description of the df continued for application_train:

---------------------------------------------------------------------------
Data type value counts: 
 float64    65
int64      41
object     16
dtype: int64

Return number of unique elements in the object. 

NAME_CONTRACT_TYPE             2
CODE_GENDER                    3
FLAG_OWN_CAR                   2
FLAG_OWN_REALTY                2
NAME_TYPE_SUITE                7
NAME_INCOME_TYPE               8
NAME_EDUCATION_TYPE            5
NAME_FAMILY_STATUS             6
NAME_HOUSING_TYPE              6
OCCUPATION_TYPE               18
WEEKDAY_APPR_PROCESS_START     7
ORGANIZATION_TYPE             58
FONDKAPREMONT_MODE             4
HOUSETYPE_MODE                 3
WALLSMATERIAL_MODE             7
EMERGENCYSTATE_MODE            2
dtype: int64
---------------------------------------------------------------------------
Categorical and Numerical(int + float) features  of application_train.
---------------------------------------------------------------------------

{'int64': Index(['SK_ID_CURR', 'TARGET', 'CNT_CHILDREN', 'DAYS_BIRTH', 'DAYS_EMPLOYED',
       'DAYS_ID_PUBLISH', 'FLAG_MOBIL', 'FLAG_EMP_PHONE', 'FLAG_WORK_PHONE',
       'FLAG_CONT_MOBILE', 'FLAG_PHONE', 'FLAG_EMAIL', 'REGION_RATING_CLIENT',
       'REGION_RATING_CLIENT_W_CITY', 'HOUR_APPR_PROCESS_START',
       'REG_REGION_NOT_LIVE_REGION', 'REG_REGION_NOT_WORK_REGION',
       'LIVE_REGION_NOT_WORK_REGION', 'REG_CITY_NOT_LIVE_CITY',
       'REG_CITY_NOT_WORK_CITY', 'LIVE_CITY_NOT_WORK_CITY', 'FLAG_DOCUMENT_2',
       'FLAG_DOCUMENT_3', 'FLAG_DOCUMENT_4', 'FLAG_DOCUMENT_5',
       'FLAG_DOCUMENT_6', 'FLAG_DOCUMENT_7', 'FLAG_DOCUMENT_8',
       'FLAG_DOCUMENT_9', 'FLAG_DOCUMENT_10', 'FLAG_DOCUMENT_11',
       'FLAG_DOCUMENT_12', 'FLAG_DOCUMENT_13', 'FLAG_DOCUMENT_14',
       'FLAG_DOCUMENT_15', 'FLAG_DOCUMENT_16', 'FLAG_DOCUMENT_17',
       'FLAG_DOCUMENT_18', 'FLAG_DOCUMENT_19', 'FLAG_DOCUMENT_20',
       'FLAG_DOCUMENT_21'],
      dtype='object')}
------------------------------
{'float64': Index(['AMT_INCOME_TOTAL', 'AMT_CREDIT', 'AMT_ANNUITY', 'AMT_GOODS_PRICE',
       'REGION_POPULATION_RELATIVE', 'DAYS_REGISTRATION', 'OWN_CAR_AGE',
       'CNT_FAM_MEMBERS', 'EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3',
       'APARTMENTS_AVG', 'BASEMENTAREA_AVG', 'YEARS_BEGINEXPLUATATION_AVG',
       'YEARS_BUILD_AVG', 'COMMONAREA_AVG', 'ELEVATORS_AVG', 'ENTRANCES_AVG',
       'FLOORSMAX_AVG', 'FLOORSMIN_AVG', 'LANDAREA_AVG',
       'LIVINGAPARTMENTS_AVG', 'LIVINGAREA_AVG', 'NONLIVINGAPARTMENTS_AVG',
       'NONLIVINGAREA_AVG', 'APARTMENTS_MODE', 'BASEMENTAREA_MODE',
       'YEARS_BEGINEXPLUATATION_MODE', 'YEARS_BUILD_MODE', 'COMMONAREA_MODE',
       'ELEVATORS_MODE', 'ENTRANCES_MODE', 'FLOORSMAX_MODE', 'FLOORSMIN_MODE',
       'LANDAREA_MODE', 'LIVINGAPARTMENTS_MODE', 'LIVINGAREA_MODE',
       'NONLIVINGAPARTMENTS_MODE', 'NONLIVINGAREA_MODE', 'APARTMENTS_MEDI',
       'BASEMENTAREA_MEDI', 'YEARS_BEGINEXPLUATATION_MEDI', 'YEARS_BUILD_MEDI',
       'COMMONAREA_MEDI', 'ELEVATORS_MEDI', 'ENTRANCES_MEDI', 'FLOORSMAX_MEDI',
       'FLOORSMIN_MEDI', 'LANDAREA_MEDI', 'LIVINGAPARTMENTS_MEDI',
       'LIVINGAREA_MEDI', 'NONLIVINGAPARTMENTS_MEDI', 'NONLIVINGAREA_MEDI',
       'TOTALAREA_MODE', 'OBS_30_CNT_SOCIAL_CIRCLE',
       'DEF_30_CNT_SOCIAL_CIRCLE', 'OBS_60_CNT_SOCIAL_CIRCLE',
       'DEF_60_CNT_SOCIAL_CIRCLE', 'DAYS_LAST_PHONE_CHANGE',
       'AMT_REQ_CREDIT_BUREAU_HOUR', 'AMT_REQ_CREDIT_BUREAU_DAY',
       'AMT_REQ_CREDIT_BUREAU_WEEK', 'AMT_REQ_CREDIT_BUREAU_MON',
       'AMT_REQ_CREDIT_BUREAU_QRT', 'AMT_REQ_CREDIT_BUREAU_YEAR'],
      dtype='object')}
------------------------------
{'object': Index(['NAME_CONTRACT_TYPE', 'CODE_GENDER', 'FLAG_OWN_CAR', 'FLAG_OWN_REALTY',
       'NAME_TYPE_SUITE', 'NAME_INCOME_TYPE', 'NAME_EDUCATION_TYPE',
       'NAME_FAMILY_STATUS', 'NAME_HOUSING_TYPE', 'OCCUPATION_TYPE',
       'WEEKDAY_APPR_PROCESS_START', 'ORGANIZATION_TYPE', 'FONDKAPREMONT_MODE',
       'HOUSETYPE_MODE', 'WALLSMATERIAL_MODE', 'EMERGENCYSTATE_MODE'],
      dtype='object')}
------------------------------

 

---------------------------------------------------------------------------
---------------------------------------------------------------------------

 The Missing Data: 

Percent Train Missing Count
COMMONAREA_MEDI 69.87 214865
COMMONAREA_AVG 69.87 214865
COMMONAREA_MODE 69.87 214865
NONLIVINGAPARTMENTS_MODE 69.43 213514
NONLIVINGAPARTMENTS_MEDI 69.43 213514
NONLIVINGAPARTMENTS_AVG 69.43 213514
FONDKAPREMONT_MODE 68.39 210295
LIVINGAPARTMENTS_MEDI 68.35 210199
LIVINGAPARTMENTS_MODE 68.35 210199
LIVINGAPARTMENTS_AVG 68.35 210199
FLOORSMIN_MEDI 67.85 208642
FLOORSMIN_MODE 67.85 208642
FLOORSMIN_AVG 67.85 208642
YEARS_BUILD_MEDI 66.50 204488
YEARS_BUILD_AVG 66.50 204488
YEARS_BUILD_MODE 66.50 204488
OWN_CAR_AGE 65.99 202929
LANDAREA_MODE 59.38 182590
LANDAREA_AVG 59.38 182590
LANDAREA_MEDI 59.38 182590
BASEMENTAREA_MEDI 58.52 179943
BASEMENTAREA_AVG 58.52 179943
BASEMENTAREA_MODE 58.52 179943
EXT_SOURCE_1 56.38 173378
NONLIVINGAREA_MEDI 55.18 169682
NONLIVINGAREA_AVG 55.18 169682
NONLIVINGAREA_MODE 55.18 169682
ELEVATORS_MODE 53.30 163891
ELEVATORS_AVG 53.30 163891
ELEVATORS_MEDI 53.30 163891
WALLSMATERIAL_MODE 50.84 156341
APARTMENTS_MODE 50.75 156061
APARTMENTS_AVG 50.75 156061
APARTMENTS_MEDI 50.75 156061
ENTRANCES_MEDI 50.35 154828
ENTRANCES_MODE 50.35 154828
ENTRANCES_AVG 50.35 154828
LIVINGAREA_MEDI 50.19 154350
LIVINGAREA_MODE 50.19 154350
LIVINGAREA_AVG 50.19 154350
HOUSETYPE_MODE 50.18 154297
FLOORSMAX_MODE 49.76 153020
FLOORSMAX_MEDI 49.76 153020
FLOORSMAX_AVG 49.76 153020
YEARS_BEGINEXPLUATATION_MEDI 48.78 150007
YEARS_BEGINEXPLUATATION_AVG 48.78 150007
YEARS_BEGINEXPLUATATION_MODE 48.78 150007
TOTALAREA_MODE 48.27 148431
EMERGENCYSTATE_MODE 47.40 145755
OCCUPATION_TYPE 31.35 96391
EXT_SOURCE_3 19.83 60965
AMT_REQ_CREDIT_BUREAU_QRT 13.50 41519
AMT_REQ_CREDIT_BUREAU_YEAR 13.50 41519
AMT_REQ_CREDIT_BUREAU_WEEK 13.50 41519
AMT_REQ_CREDIT_BUREAU_MON 13.50 41519
AMT_REQ_CREDIT_BUREAU_DAY 13.50 41519
AMT_REQ_CREDIT_BUREAU_HOUR 13.50 41519
NAME_TYPE_SUITE 0.42 1292
OBS_30_CNT_SOCIAL_CIRCLE 0.33 1021
OBS_60_CNT_SOCIAL_CIRCLE 0.33 1021
DEF_60_CNT_SOCIAL_CIRCLE 0.33 1021
DEF_30_CNT_SOCIAL_CIRCLE 0.33 1021
EXT_SOURCE_2 0.21 660
AMT_GOODS_PRICE 0.09 278
---------------------------------------------------------------------------

Observation 2

  • Application Train is the transactional dataset and has the most details regarding the loan requests submitted.
  • Missing values seem to be of concern in this dataset. Occupation Type and Organization Type are categorical values that have 58 and 18 categories respectively and can be useful in feature engineering.
Applicants Age
In [ ]:
plt.hist(datasets["application_train"]['DAYS_BIRTH']/-365, edgecolor = 'k', bins = 25)
plt.title('Age of Client'); plt.xlabel('Age (years)'); plt.ylabel('Count'); 

To visualize the effect of the age on the target, we will next make a kernel density estimation plot (KDE) colored by the value of the target.

In [ ]:
plt.figure(figsize = (10, 8))

# KDE plot of loans that were repaid on time
sns.kdeplot(datasets['application_train'].loc[datasets['application_train']['TARGET'] == 0, 'DAYS_BIRTH'] / 365, label = 'target == 0')

# KDE plot of loans which were not repaid on time
sns.kdeplot(datasets['application_train'].loc[datasets['application_train']['TARGET'] == 1, 'DAYS_BIRTH'] / 365, label = 'target == 1')
plt.legend(loc='upper left')
# set_style("whitegrid")
plt.grid()
# Labeling of plot
plt.xlabel('Age (years)', fontsize=18); 
plt.ylabel('Density', fontsize=16); 
plt.suptitle('Distribution of Ages',fontsize=25); 

The target == 1 curve skews towards the younger end of the range. Let's look at this relationship in another way: average failure to repay loans by age bracket. To make this graph, first we cut the age category into bins of 5 years each. Then, for each bin, we calculate the average value of the target, which tells us the ratio of loans that were not repaid in each age category.

In [ ]:
# Age information into a separate dataframe
age_data = datasets['application_train'][['TARGET', 'DAYS_BIRTH']]
age_data['YEARS_BIRTH'] = age_data['DAYS_BIRTH'] / -365

# Bin the age data
age_data['YEARS_BINNED'] = pd.cut(age_data['YEARS_BIRTH'], bins = np.linspace(20, 70, num = 11))
age_data.head(10) 
Out[ ]:
TARGET DAYS_BIRTH YEARS_BIRTH YEARS_BINNED
0 1 -9461 25.920548 (25.0, 30.0]
1 0 -16765 45.931507 (45.0, 50.0]
2 0 -19046 52.180822 (50.0, 55.0]
3 0 -19005 52.068493 (50.0, 55.0]
4 0 -19932 54.608219 (50.0, 55.0]
5 0 -16941 46.413699 (45.0, 50.0]
6 0 -13778 37.747945 (35.0, 40.0]
7 0 -18850 51.643836 (50.0, 55.0]
8 0 -20099 55.065753 (55.0, 60.0]
9 0 -14469 39.641096 (35.0, 40.0]
In [ ]:
# Group by the bin and calculate averages
age_groups  = age_data.groupby('YEARS_BINNED').mean()
age_groups 
Out[ ]:
TARGET DAYS_BIRTH YEARS_BIRTH
YEARS_BINNED
(20.0, 25.0] 0.123036 -8532.795625 23.377522
(25.0, 30.0] 0.111436 -10155.219250 27.822518
(30.0, 35.0] 0.102814 -11854.848377 32.479037
(35.0, 40.0] 0.089414 -13707.908253 37.555913
(40.0, 45.0] 0.078491 -15497.661233 42.459346
(45.0, 50.0] 0.074171 -17323.900441 47.462741
(50.0, 55.0] 0.066968 -19196.494791 52.593136
(55.0, 60.0] 0.055314 -20984.262742 57.491131
(60.0, 65.0] 0.052737 -22780.547460 62.412459
(65.0, 70.0] 0.037270 -24292.614340 66.555108
In [ ]:
plt.figure(figsize = (8, 8))

# Graph the age bins and the average of the target as a bar plot
plt.bar(age_groups.index.astype(str), 100 * age_groups['TARGET'])

# Plot labeling
plt.xticks(rotation = 75); plt.xlabel('Age Group (years)'); plt.ylabel('Failure to Repay (%)')
plt.title('Failure to Repay by Age Group'); 
Applicants occupations
In [ ]:
sns.countplot(x='OCCUPATION_TYPE', data=datasets["application_train"], order = datasets["application_train"]['OCCUPATION_TYPE'].value_counts().index);
plt.title('Applicants Occupation');
plt.xticks(rotation=90); 
Contract Type with Amount Credit and Code Gender
In [ ]:
import pandas as pd
import numpy as np
import seaborn as sns                       #visualisation
import matplotlib.pyplot as plt             #visualisation
%matplotlib inline     
sns.set(color_codes=True) 
In [ ]:
def generic_xy_boxplot(xaxisfeature,yaxisfeature,legendcategory,data,log_scale):
  sns.boxplot(xaxisfeature, yaxisfeature, hue = legendcategory, data = data)
  plt.title('Boxplot for '+ xaxisfeature +' with ' + yaxisfeature+' and '+legendcategory,fontsize=10)
  if log_scale:
                plt.yscale('log')
                plt.ylabel(f'{yaxisfeature} (log Scale)')
                plt.tight_layout() 
In [ ]:
def box_plot(plots):
  number_of_subplots = len(plots)
  plt.figure(figsize = (20,8))
  sns.set_style('whitegrid')
  for i, ele in enumerate(plots):
        plt.subplot(1, number_of_subplots, i + 1)
        plt.subplots_adjust(wspace=0.25)
        xaxisfeature=ele[0]
        yaxisfeature=ele[1]
        legendcategory=ele[2]
        data=ele[3]
        log_scale=ele[4]
        generic_xy_boxplot(xaxisfeature,yaxisfeature,legendcategory,data,log_scale) 
In [ ]:
plots=[['NAME_CONTRACT_TYPE','AMT_CREDIT','CODE_GENDER',datasets['application_train'],False]] 
In [ ]:
box_plot(plots) 

Gender does not indicate a major impact . But credit amount for cash loans are significantly high compared to revolving loans.

Summary of previous_application

In [ ]:
display_stats(datasets['previous_application'], 'previous_application') 
--------------------------------------------------------------------------------
                    previous_application                    
--------------------------------------------------------------------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1670214 entries, 0 to 1670213
Data columns (total 37 columns):
 #   Column                       Non-Null Count    Dtype  
---  ------                       --------------    -----  
 0   SK_ID_PREV                   1670214 non-null  int64  
 1   SK_ID_CURR                   1670214 non-null  int64  
 2   NAME_CONTRACT_TYPE           1670214 non-null  object 
 3   AMT_ANNUITY                  1297979 non-null  float64
 4   AMT_APPLICATION              1670214 non-null  float64
 5   AMT_CREDIT                   1670213 non-null  float64
 6   AMT_DOWN_PAYMENT             774370 non-null   float64
 7   AMT_GOODS_PRICE              1284699 non-null  float64
 8   WEEKDAY_APPR_PROCESS_START   1670214 non-null  object 
 9   HOUR_APPR_PROCESS_START      1670214 non-null  int64  
 10  FLAG_LAST_APPL_PER_CONTRACT  1670214 non-null  object 
 11  NFLAG_LAST_APPL_IN_DAY       1670214 non-null  int64  
 12  RATE_DOWN_PAYMENT            774370 non-null   float64
 13  RATE_INTEREST_PRIMARY        5951 non-null     float64
 14  RATE_INTEREST_PRIVILEGED     5951 non-null     float64
 15  NAME_CASH_LOAN_PURPOSE       1670214 non-null  object 
 16  NAME_CONTRACT_STATUS         1670214 non-null  object 
 17  DAYS_DECISION                1670214 non-null  int64  
 18  NAME_PAYMENT_TYPE            1670214 non-null  object 
 19  CODE_REJECT_REASON           1670214 non-null  object 
 20  NAME_TYPE_SUITE              849809 non-null   object 
 21  NAME_CLIENT_TYPE             1670214 non-null  object 
 22  NAME_GOODS_CATEGORY          1670214 non-null  object 
 23  NAME_PORTFOLIO               1670214 non-null  object 
 24  NAME_PRODUCT_TYPE            1670214 non-null  object 
 25  CHANNEL_TYPE                 1670214 non-null  object 
 26  SELLERPLACE_AREA             1670214 non-null  int64  
 27  NAME_SELLER_INDUSTRY         1670214 non-null  object 
 28  CNT_PAYMENT                  1297984 non-null  float64
 29  NAME_YIELD_GROUP             1670214 non-null  object 
 30  PRODUCT_COMBINATION          1669868 non-null  object 
 31  DAYS_FIRST_DRAWING           997149 non-null   float64
 32  DAYS_FIRST_DUE               997149 non-null   float64
 33  DAYS_LAST_DUE_1ST_VERSION    997149 non-null   float64
 34  DAYS_LAST_DUE                997149 non-null   float64
 35  DAYS_TERMINATION             997149 non-null   float64
 36  NFLAG_INSURED_ON_APPROVAL    997149 non-null   float64
dtypes: float64(15), int64(6), object(16)
memory usage: 471.5+ MB
None
---------------------------------------------------------------------------
Shape of the df previous_application is (1670214, 37) 

---------------------------------------------------------------------------
Statistical summary of previous_application is :
---------------------------------------------------------------------------
Description of the df previous_application:

SK_ID_CURR TARGET CNT_CHILDREN AMT_INCOME_TOTAL AMT_CREDIT AMT_ANNUITY AMT_GOODS_PRICE REGION_POPULATION_RELATIVE DAYS_BIRTH DAYS_EMPLOYED DAYS_REGISTRATION DAYS_ID_PUBLISH OWN_CAR_AGE FLAG_MOBIL FLAG_EMP_PHONE FLAG_WORK_PHONE FLAG_CONT_MOBILE FLAG_PHONE FLAG_EMAIL CNT_FAM_MEMBERS REGION_RATING_CLIENT REGION_RATING_CLIENT_W_CITY HOUR_APPR_PROCESS_START REG_REGION_NOT_LIVE_REGION REG_REGION_NOT_WORK_REGION LIVE_REGION_NOT_WORK_REGION REG_CITY_NOT_LIVE_CITY REG_CITY_NOT_WORK_CITY LIVE_CITY_NOT_WORK_CITY EXT_SOURCE_1 EXT_SOURCE_2 EXT_SOURCE_3 APARTMENTS_AVG BASEMENTAREA_AVG YEARS_BEGINEXPLUATATION_AVG YEARS_BUILD_AVG COMMONAREA_AVG ELEVATORS_AVG ENTRANCES_AVG FLOORSMAX_AVG FLOORSMIN_AVG LANDAREA_AVG LIVINGAPARTMENTS_AVG LIVINGAREA_AVG NONLIVINGAPARTMENTS_AVG NONLIVINGAREA_AVG APARTMENTS_MODE BASEMENTAREA_MODE YEARS_BEGINEXPLUATATION_MODE YEARS_BUILD_MODE COMMONAREA_MODE ELEVATORS_MODE ENTRANCES_MODE FLOORSMAX_MODE FLOORSMIN_MODE LANDAREA_MODE LIVINGAPARTMENTS_MODE LIVINGAREA_MODE NONLIVINGAPARTMENTS_MODE NONLIVINGAREA_MODE APARTMENTS_MEDI BASEMENTAREA_MEDI YEARS_BEGINEXPLUATATION_MEDI YEARS_BUILD_MEDI COMMONAREA_MEDI ELEVATORS_MEDI ENTRANCES_MEDI FLOORSMAX_MEDI FLOORSMIN_MEDI LANDAREA_MEDI LIVINGAPARTMENTS_MEDI LIVINGAREA_MEDI NONLIVINGAPARTMENTS_MEDI NONLIVINGAREA_MEDI TOTALAREA_MODE OBS_30_CNT_SOCIAL_CIRCLE DEF_30_CNT_SOCIAL_CIRCLE OBS_60_CNT_SOCIAL_CIRCLE DEF_60_CNT_SOCIAL_CIRCLE DAYS_LAST_PHONE_CHANGE FLAG_DOCUMENT_2 FLAG_DOCUMENT_3 FLAG_DOCUMENT_4 FLAG_DOCUMENT_5 FLAG_DOCUMENT_6 FLAG_DOCUMENT_7 FLAG_DOCUMENT_8 FLAG_DOCUMENT_9 FLAG_DOCUMENT_10 FLAG_DOCUMENT_11 FLAG_DOCUMENT_12 FLAG_DOCUMENT_13 FLAG_DOCUMENT_14 FLAG_DOCUMENT_15 FLAG_DOCUMENT_16 FLAG_DOCUMENT_17 FLAG_DOCUMENT_18 FLAG_DOCUMENT_19 FLAG_DOCUMENT_20 FLAG_DOCUMENT_21 AMT_REQ_CREDIT_BUREAU_HOUR AMT_REQ_CREDIT_BUREAU_DAY AMT_REQ_CREDIT_BUREAU_WEEK AMT_REQ_CREDIT_BUREAU_MON AMT_REQ_CREDIT_BUREAU_QRT AMT_REQ_CREDIT_BUREAU_YEAR
count 307511.00 307511.00 307511.00 3.075110e+05 307511.00 307499.00 307233.00 307511.00 307511.00 307511.00 307511.00 307511.00 104582.00 307511.0 307511.00 307511.0 307511.00 307511.00 307511.00 307509.00 307511.00 307511.00 307511.00 307511.00 307511.00 307511.00 307511.00 307511.00 307511.00 134133.00 306851.00 246546.00 151450.00 127568.00 157504.00 103023.00 92646.00 143620.00 152683.00 154491.00 98869.00 124921.00 97312.00 153161.00 93997.00 137829.00 151450.00 127568.00 157504.00 103023.00 92646.00 143620.00 152683.00 154491.00 98869.00 124921.00 97312.00 153161.00 93997.00 137829.00 151450.00 127568.00 157504.00 103023.00 92646.00 143620.00 152683.00 154491.00 98869.00 124921.00 97312.00 153161.00 93997.00 137829.00 159080.00 306490.00 306490.00 306490.00 306490.00 307510.00 307511.00 307511.00 307511.00 307511.00 307511.00 307511.00 307511.00 307511.00 307511.0 307511.00 307511.0 307511.00 307511.00 307511.00 307511.00 307511.00 307511.00 307511.00 307511.00 307511.00 265992.00 265992.00 265992.00 265992.00 265992.00 265992.00
mean 278180.52 0.08 0.42 1.687979e+05 599026.00 27108.57 538396.21 0.02 -16037.00 63815.05 -4986.12 -2994.20 12.06 1.0 0.82 0.2 1.00 0.28 0.06 2.15 2.05 2.03 12.06 0.02 0.05 0.04 0.08 0.23 0.18 0.50 0.51 0.51 0.12 0.09 0.98 0.75 0.04 0.08 0.15 0.23 0.23 0.07 0.10 0.11 0.01 0.03 0.11 0.09 0.98 0.76 0.04 0.07 0.15 0.22 0.23 0.06 0.11 0.11 0.01 0.03 0.12 0.09 0.98 0.76 0.04 0.08 0.15 0.23 0.23 0.07 0.10 0.11 0.01 0.03 0.10 1.42 0.14 1.41 0.10 -962.86 0.00 0.71 0.00 0.02 0.09 0.00 0.08 0.00 0.0 0.00 0.0 0.00 0.00 0.00 0.01 0.00 0.01 0.00 0.00 0.00 0.01 0.01 0.03 0.27 0.27 1.90
std 102790.18 0.27 0.72 2.371231e+05 402490.78 14493.74 369446.46 0.01 4363.99 141275.77 3522.89 1509.45 11.94 0.0 0.38 0.4 0.04 0.45 0.23 0.91 0.51 0.50 3.27 0.12 0.22 0.20 0.27 0.42 0.38 0.21 0.19 0.19 0.11 0.08 0.06 0.11 0.08 0.13 0.10 0.14 0.16 0.08 0.09 0.11 0.05 0.07 0.11 0.08 0.06 0.11 0.07 0.13 0.10 0.14 0.16 0.08 0.10 0.11 0.05 0.07 0.11 0.08 0.06 0.11 0.08 0.13 0.10 0.15 0.16 0.08 0.09 0.11 0.05 0.07 0.11 2.40 0.45 2.38 0.36 826.81 0.01 0.45 0.01 0.12 0.28 0.01 0.27 0.06 0.0 0.06 0.0 0.06 0.05 0.03 0.10 0.02 0.09 0.02 0.02 0.02 0.08 0.11 0.20 0.92 0.79 1.87
min 100002.00 0.00 0.00 2.565000e+04 45000.00 1615.50 40500.00 0.00 -25229.00 -17912.00 -24672.00 -7197.00 0.00 0.0 0.00 0.0 0.00 0.00 0.00 1.00 1.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 -4292.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.0 0.00 0.0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
25% 189145.50 0.00 0.00 1.125000e+05 270000.00 16524.00 238500.00 0.01 -19682.00 -2760.00 -7479.50 -4299.00 5.00 1.0 1.00 0.0 1.00 0.00 0.00 2.00 2.00 2.00 10.00 0.00 0.00 0.00 0.00 0.00 0.00 0.33 0.39 0.37 0.06 0.04 0.98 0.69 0.01 0.00 0.07 0.17 0.08 0.02 0.05 0.05 0.00 0.00 0.05 0.04 0.98 0.70 0.01 0.00 0.07 0.17 0.08 0.02 0.05 0.04 0.00 0.00 0.06 0.04 0.98 0.69 0.01 0.00 0.07 0.17 0.08 0.02 0.05 0.05 0.00 0.00 0.04 0.00 0.00 0.00 0.00 -1570.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.0 0.00 0.0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
50% 278202.00 0.00 0.00 1.471500e+05 513531.00 24903.00 450000.00 0.02 -15750.00 -1213.00 -4504.00 -3254.00 9.00 1.0 1.00 0.0 1.00 0.00 0.00 2.00 2.00 2.00 12.00 0.00 0.00 0.00 0.00 0.00 0.00 0.51 0.57 0.54 0.09 0.08 0.98 0.76 0.02 0.00 0.14 0.17 0.21 0.05 0.08 0.07 0.00 0.00 0.08 0.07 0.98 0.76 0.02 0.00 0.14 0.17 0.21 0.05 0.08 0.07 0.00 0.00 0.09 0.08 0.98 0.76 0.02 0.00 0.14 0.17 0.21 0.05 0.08 0.07 0.00 0.00 0.07 0.00 0.00 0.00 0.00 -757.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.0 0.00 0.0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00
75% 367142.50 0.00 1.00 2.025000e+05 808650.00 34596.00 679500.00 0.03 -12413.00 -289.00 -2010.00 -1720.00 15.00 1.0 1.00 0.0 1.00 1.00 0.00 3.00 2.00 2.00 14.00 0.00 0.00 0.00 0.00 0.00 0.00 0.68 0.66 0.67 0.15 0.11 0.99 0.82 0.05 0.12 0.21 0.33 0.38 0.09 0.12 0.13 0.00 0.03 0.14 0.11 0.99 0.82 0.05 0.12 0.21 0.33 0.38 0.08 0.13 0.13 0.00 0.02 0.15 0.11 0.99 0.83 0.05 0.12 0.21 0.33 0.38 0.09 0.12 0.13 0.00 0.03 0.13 2.00 0.00 2.00 0.00 -274.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.0 0.00 0.0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 3.00
max 456255.00 1.00 19.00 1.170000e+08 4050000.00 258025.50 4050000.00 0.07 -7489.00 365243.00 0.00 0.00 91.00 1.0 1.00 1.0 1.00 1.00 1.00 20.00 3.00 3.00 23.00 1.00 1.00 1.00 1.00 1.00 1.00 0.96 0.85 0.90 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 348.00 34.00 344.00 24.00 0.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.0 1.00 1.0 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 4.00 9.00 8.00 27.00 261.00 25.00
None

Observation 3

  • The count of children go upto 19, this could be an outlier and a risk worth investigating.
  • All no of day fields are negative values showing anamolies in the data. However there are fields indicating average years. A calculation comparing the average years and days could prove valuable
In [ ]:
display_feature_info(datasets['previous_application'], 'previous_application') 
Description of the df continued for previous_application:

---------------------------------------------------------------------------
Data type value counts: 
 object     16
float64    15
int64       6
dtype: int64

Return number of unique elements in the object. 

NAME_CONTRACT_TYPE              4
WEEKDAY_APPR_PROCESS_START      7
FLAG_LAST_APPL_PER_CONTRACT     2
NAME_CASH_LOAN_PURPOSE         25
NAME_CONTRACT_STATUS            4
NAME_PAYMENT_TYPE               4
CODE_REJECT_REASON              9
NAME_TYPE_SUITE                 7
NAME_CLIENT_TYPE                4
NAME_GOODS_CATEGORY            28
NAME_PORTFOLIO                  5
NAME_PRODUCT_TYPE               3
CHANNEL_TYPE                    8
NAME_SELLER_INDUSTRY           11
NAME_YIELD_GROUP                5
PRODUCT_COMBINATION            17
dtype: int64
---------------------------------------------------------------------------
Categorical and Numerical(int + float) features  of previous_application.
---------------------------------------------------------------------------

{'int64': Index(['SK_ID_PREV', 'SK_ID_CURR', 'HOUR_APPR_PROCESS_START',
       'NFLAG_LAST_APPL_IN_DAY', 'DAYS_DECISION', 'SELLERPLACE_AREA'],
      dtype='object')}
------------------------------
{'float64': Index(['AMT_ANNUITY', 'AMT_APPLICATION', 'AMT_CREDIT', 'AMT_DOWN_PAYMENT',
       'AMT_GOODS_PRICE', 'RATE_DOWN_PAYMENT', 'RATE_INTEREST_PRIMARY',
       'RATE_INTEREST_PRIVILEGED', 'CNT_PAYMENT', 'DAYS_FIRST_DRAWING',
       'DAYS_FIRST_DUE', 'DAYS_LAST_DUE_1ST_VERSION', 'DAYS_LAST_DUE',
       'DAYS_TERMINATION', 'NFLAG_INSURED_ON_APPROVAL'],
      dtype='object')}
------------------------------
{'object': Index(['NAME_CONTRACT_TYPE', 'WEEKDAY_APPR_PROCESS_START',
       'FLAG_LAST_APPL_PER_CONTRACT', 'NAME_CASH_LOAN_PURPOSE',
       'NAME_CONTRACT_STATUS', 'NAME_PAYMENT_TYPE', 'CODE_REJECT_REASON',
       'NAME_TYPE_SUITE', 'NAME_CLIENT_TYPE', 'NAME_GOODS_CATEGORY',
       'NAME_PORTFOLIO', 'NAME_PRODUCT_TYPE', 'CHANNEL_TYPE',
       'NAME_SELLER_INDUSTRY', 'NAME_YIELD_GROUP', 'PRODUCT_COMBINATION'],
      dtype='object')}
------------------------------

 

---------------------------------------------------------------------------
---------------------------------------------------------------------------

 The Missing Data: 

Percent Train Missing Count
RATE_INTEREST_PRIVILEGED 99.64 1664263
RATE_INTEREST_PRIMARY 99.64 1664263
RATE_DOWN_PAYMENT 53.64 895844
AMT_DOWN_PAYMENT 53.64 895844
NAME_TYPE_SUITE 49.12 820405
DAYS_TERMINATION 40.30 673065
NFLAG_INSURED_ON_APPROVAL 40.30 673065
DAYS_FIRST_DRAWING 40.30 673065
DAYS_FIRST_DUE 40.30 673065
DAYS_LAST_DUE_1ST_VERSION 40.30 673065
DAYS_LAST_DUE 40.30 673065
AMT_GOODS_PRICE 23.08 385515
AMT_ANNUITY 22.29 372235
CNT_PAYMENT 22.29 372230
PRODUCT_COMBINATION 0.02 346
---------------------------------------------------------------------------

Summary of bureau

In [ ]:
display_stats(datasets['bureau'], 'bureau') 
--------------------------------------------------------------------------------
                    bureau                    
--------------------------------------------------------------------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1716428 entries, 0 to 1716427
Data columns (total 17 columns):
 #   Column                  Non-Null Count    Dtype  
---  ------                  --------------    -----  
 0   SK_ID_CURR              1716428 non-null  int64  
 1   SK_ID_BUREAU            1716428 non-null  int64  
 2   CREDIT_ACTIVE           1716428 non-null  object 
 3   CREDIT_CURRENCY         1716428 non-null  object 
 4   DAYS_CREDIT             1716428 non-null  int64  
 5   CREDIT_DAY_OVERDUE      1716428 non-null  int64  
 6   DAYS_CREDIT_ENDDATE     1610875 non-null  float64
 7   DAYS_ENDDATE_FACT       1082775 non-null  float64
 8   AMT_CREDIT_MAX_OVERDUE  591940 non-null   float64
 9   CNT_CREDIT_PROLONG      1716428 non-null  int64  
 10  AMT_CREDIT_SUM          1716415 non-null  float64
 11  AMT_CREDIT_SUM_DEBT     1458759 non-null  float64
 12  AMT_CREDIT_SUM_LIMIT    1124648 non-null  float64
 13  AMT_CREDIT_SUM_OVERDUE  1716428 non-null  float64
 14  CREDIT_TYPE             1716428 non-null  object 
 15  DAYS_CREDIT_UPDATE      1716428 non-null  int64  
 16  AMT_ANNUITY             489637 non-null   float64
dtypes: float64(8), int64(6), object(3)
memory usage: 222.6+ MB
None
---------------------------------------------------------------------------
Shape of the df bureau is (1716428, 17) 

---------------------------------------------------------------------------
Statistical summary of bureau is :
---------------------------------------------------------------------------
Description of the df bureau:

SK_ID_CURR TARGET CNT_CHILDREN AMT_INCOME_TOTAL AMT_CREDIT AMT_ANNUITY AMT_GOODS_PRICE REGION_POPULATION_RELATIVE DAYS_BIRTH DAYS_EMPLOYED DAYS_REGISTRATION DAYS_ID_PUBLISH OWN_CAR_AGE FLAG_MOBIL FLAG_EMP_PHONE FLAG_WORK_PHONE FLAG_CONT_MOBILE FLAG_PHONE FLAG_EMAIL CNT_FAM_MEMBERS REGION_RATING_CLIENT REGION_RATING_CLIENT_W_CITY HOUR_APPR_PROCESS_START REG_REGION_NOT_LIVE_REGION REG_REGION_NOT_WORK_REGION LIVE_REGION_NOT_WORK_REGION REG_CITY_NOT_LIVE_CITY REG_CITY_NOT_WORK_CITY LIVE_CITY_NOT_WORK_CITY EXT_SOURCE_1 EXT_SOURCE_2 EXT_SOURCE_3 APARTMENTS_AVG BASEMENTAREA_AVG YEARS_BEGINEXPLUATATION_AVG YEARS_BUILD_AVG COMMONAREA_AVG ELEVATORS_AVG ENTRANCES_AVG FLOORSMAX_AVG FLOORSMIN_AVG LANDAREA_AVG LIVINGAPARTMENTS_AVG LIVINGAREA_AVG NONLIVINGAPARTMENTS_AVG NONLIVINGAREA_AVG APARTMENTS_MODE BASEMENTAREA_MODE YEARS_BEGINEXPLUATATION_MODE YEARS_BUILD_MODE COMMONAREA_MODE ELEVATORS_MODE ENTRANCES_MODE FLOORSMAX_MODE FLOORSMIN_MODE LANDAREA_MODE LIVINGAPARTMENTS_MODE LIVINGAREA_MODE NONLIVINGAPARTMENTS_MODE NONLIVINGAREA_MODE APARTMENTS_MEDI BASEMENTAREA_MEDI YEARS_BEGINEXPLUATATION_MEDI YEARS_BUILD_MEDI COMMONAREA_MEDI ELEVATORS_MEDI ENTRANCES_MEDI FLOORSMAX_MEDI FLOORSMIN_MEDI LANDAREA_MEDI LIVINGAPARTMENTS_MEDI LIVINGAREA_MEDI NONLIVINGAPARTMENTS_MEDI NONLIVINGAREA_MEDI TOTALAREA_MODE OBS_30_CNT_SOCIAL_CIRCLE DEF_30_CNT_SOCIAL_CIRCLE OBS_60_CNT_SOCIAL_CIRCLE DEF_60_CNT_SOCIAL_CIRCLE DAYS_LAST_PHONE_CHANGE FLAG_DOCUMENT_2 FLAG_DOCUMENT_3 FLAG_DOCUMENT_4 FLAG_DOCUMENT_5 FLAG_DOCUMENT_6 FLAG_DOCUMENT_7 FLAG_DOCUMENT_8 FLAG_DOCUMENT_9 FLAG_DOCUMENT_10 FLAG_DOCUMENT_11 FLAG_DOCUMENT_12 FLAG_DOCUMENT_13 FLAG_DOCUMENT_14 FLAG_DOCUMENT_15 FLAG_DOCUMENT_16 FLAG_DOCUMENT_17 FLAG_DOCUMENT_18 FLAG_DOCUMENT_19 FLAG_DOCUMENT_20 FLAG_DOCUMENT_21 AMT_REQ_CREDIT_BUREAU_HOUR AMT_REQ_CREDIT_BUREAU_DAY AMT_REQ_CREDIT_BUREAU_WEEK AMT_REQ_CREDIT_BUREAU_MON AMT_REQ_CREDIT_BUREAU_QRT AMT_REQ_CREDIT_BUREAU_YEAR
count 307511.00 307511.00 307511.00 3.075110e+05 307511.00 307499.00 307233.00 307511.00 307511.00 307511.00 307511.00 307511.00 104582.00 307511.0 307511.00 307511.0 307511.00 307511.00 307511.00 307509.00 307511.00 307511.00 307511.00 307511.00 307511.00 307511.00 307511.00 307511.00 307511.00 134133.00 306851.00 246546.00 151450.00 127568.00 157504.00 103023.00 92646.00 143620.00 152683.00 154491.00 98869.00 124921.00 97312.00 153161.00 93997.00 137829.00 151450.00 127568.00 157504.00 103023.00 92646.00 143620.00 152683.00 154491.00 98869.00 124921.00 97312.00 153161.00 93997.00 137829.00 151450.00 127568.00 157504.00 103023.00 92646.00 143620.00 152683.00 154491.00 98869.00 124921.00 97312.00 153161.00 93997.00 137829.00 159080.00 306490.00 306490.00 306490.00 306490.00 307510.00 307511.00 307511.00 307511.00 307511.00 307511.00 307511.00 307511.00 307511.00 307511.0 307511.00 307511.0 307511.00 307511.00 307511.00 307511.00 307511.00 307511.00 307511.00 307511.00 307511.00 265992.00 265992.00 265992.00 265992.00 265992.00 265992.00
mean 278180.52 0.08 0.42 1.687979e+05 599026.00 27108.57 538396.21 0.02 -16037.00 63815.05 -4986.12 -2994.20 12.06 1.0 0.82 0.2 1.00 0.28 0.06 2.15 2.05 2.03 12.06 0.02 0.05 0.04 0.08 0.23 0.18 0.50 0.51 0.51 0.12 0.09 0.98 0.75 0.04 0.08 0.15 0.23 0.23 0.07 0.10 0.11 0.01 0.03 0.11 0.09 0.98 0.76 0.04 0.07 0.15 0.22 0.23 0.06 0.11 0.11 0.01 0.03 0.12 0.09 0.98 0.76 0.04 0.08 0.15 0.23 0.23 0.07 0.10 0.11 0.01 0.03 0.10 1.42 0.14 1.41 0.10 -962.86 0.00 0.71 0.00 0.02 0.09 0.00 0.08 0.00 0.0 0.00 0.0 0.00 0.00 0.00 0.01 0.00 0.01 0.00 0.00 0.00 0.01 0.01 0.03 0.27 0.27 1.90
std 102790.18 0.27 0.72 2.371231e+05 402490.78 14493.74 369446.46 0.01 4363.99 141275.77 3522.89 1509.45 11.94 0.0 0.38 0.4 0.04 0.45 0.23 0.91 0.51 0.50 3.27 0.12 0.22 0.20 0.27 0.42 0.38 0.21 0.19 0.19 0.11 0.08 0.06 0.11 0.08 0.13 0.10 0.14 0.16 0.08 0.09 0.11 0.05 0.07 0.11 0.08 0.06 0.11 0.07 0.13 0.10 0.14 0.16 0.08 0.10 0.11 0.05 0.07 0.11 0.08 0.06 0.11 0.08 0.13 0.10 0.15 0.16 0.08 0.09 0.11 0.05 0.07 0.11 2.40 0.45 2.38 0.36 826.81 0.01 0.45 0.01 0.12 0.28 0.01 0.27 0.06 0.0 0.06 0.0 0.06 0.05 0.03 0.10 0.02 0.09 0.02 0.02 0.02 0.08 0.11 0.20 0.92 0.79 1.87
min 100002.00 0.00 0.00 2.565000e+04 45000.00 1615.50 40500.00 0.00 -25229.00 -17912.00 -24672.00 -7197.00 0.00 0.0 0.00 0.0 0.00 0.00 0.00 1.00 1.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 -4292.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.0 0.00 0.0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
25% 189145.50 0.00 0.00 1.125000e+05 270000.00 16524.00 238500.00 0.01 -19682.00 -2760.00 -7479.50 -4299.00 5.00 1.0 1.00 0.0 1.00 0.00 0.00 2.00 2.00 2.00 10.00 0.00 0.00 0.00 0.00 0.00 0.00 0.33 0.39 0.37 0.06 0.04 0.98 0.69 0.01 0.00 0.07 0.17 0.08 0.02 0.05 0.05 0.00 0.00 0.05 0.04 0.98 0.70 0.01 0.00 0.07 0.17 0.08 0.02 0.05 0.04 0.00 0.00 0.06 0.04 0.98 0.69 0.01 0.00 0.07 0.17 0.08 0.02 0.05 0.05 0.00 0.00 0.04 0.00 0.00 0.00 0.00 -1570.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.0 0.00 0.0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
50% 278202.00 0.00 0.00 1.471500e+05 513531.00 24903.00 450000.00 0.02 -15750.00 -1213.00 -4504.00 -3254.00 9.00 1.0 1.00 0.0 1.00 0.00 0.00 2.00 2.00 2.00 12.00 0.00 0.00 0.00 0.00 0.00 0.00 0.51 0.57 0.54 0.09 0.08 0.98 0.76 0.02 0.00 0.14 0.17 0.21 0.05 0.08 0.07 0.00 0.00 0.08 0.07 0.98 0.76 0.02 0.00 0.14 0.17 0.21 0.05 0.08 0.07 0.00 0.00 0.09 0.08 0.98 0.76 0.02 0.00 0.14 0.17 0.21 0.05 0.08 0.07 0.00 0.00 0.07 0.00 0.00 0.00 0.00 -757.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.0 0.00 0.0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00
75% 367142.50 0.00 1.00 2.025000e+05 808650.00 34596.00 679500.00 0.03 -12413.00 -289.00 -2010.00 -1720.00 15.00 1.0 1.00 0.0 1.00 1.00 0.00 3.00 2.00 2.00 14.00 0.00 0.00 0.00 0.00 0.00 0.00 0.68 0.66 0.67 0.15 0.11 0.99 0.82 0.05 0.12 0.21 0.33 0.38 0.09 0.12 0.13 0.00 0.03 0.14 0.11 0.99 0.82 0.05 0.12 0.21 0.33 0.38 0.08 0.13 0.13 0.00 0.02 0.15 0.11 0.99 0.83 0.05 0.12 0.21 0.33 0.38 0.09 0.12 0.13 0.00 0.03 0.13 2.00 0.00 2.00 0.00 -274.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.0 0.00 0.0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 3.00
max 456255.00 1.00 19.00 1.170000e+08 4050000.00 258025.50 4050000.00 0.07 -7489.00 365243.00 0.00 0.00 91.00 1.0 1.00 1.0 1.00 1.00 1.00 20.00 3.00 3.00 23.00 1.00 1.00 1.00 1.00 1.00 1.00 0.96 0.85 0.90 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 348.00 34.00 344.00 24.00 0.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.0 1.00 1.0 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 4.00 9.00 8.00 27.00 261.00 25.00
None
In [ ]:
display_feature_info(datasets['bureau'], 'bureau') 
Description of the df continued for bureau:

---------------------------------------------------------------------------
Data type value counts: 
 float64    8
int64      6
object     3
dtype: int64

Return number of unique elements in the object. 

CREDIT_ACTIVE       4
CREDIT_CURRENCY     4
CREDIT_TYPE        15
dtype: int64
---------------------------------------------------------------------------
Categorical and Numerical(int + float) features  of bureau.
---------------------------------------------------------------------------

{'int64': Index(['SK_ID_CURR', 'SK_ID_BUREAU', 'DAYS_CREDIT', 'CREDIT_DAY_OVERDUE',
       'CNT_CREDIT_PROLONG', 'DAYS_CREDIT_UPDATE'],
      dtype='object')}
------------------------------
{'float64': Index(['DAYS_CREDIT_ENDDATE', 'DAYS_ENDDATE_FACT', 'AMT_CREDIT_MAX_OVERDUE',
       'AMT_CREDIT_SUM', 'AMT_CREDIT_SUM_DEBT', 'AMT_CREDIT_SUM_LIMIT',
       'AMT_CREDIT_SUM_OVERDUE', 'AMT_ANNUITY'],
      dtype='object')}
------------------------------
{'object': Index(['CREDIT_ACTIVE', 'CREDIT_CURRENCY', 'CREDIT_TYPE'], dtype='object')}
------------------------------

 

---------------------------------------------------------------------------
---------------------------------------------------------------------------

 The Missing Data: 

Percent Train Missing Count
AMT_ANNUITY 71.47 1226791
AMT_CREDIT_MAX_OVERDUE 65.51 1124488
DAYS_ENDDATE_FACT 36.92 633653
AMT_CREDIT_SUM_LIMIT 34.48 591780
AMT_CREDIT_SUM_DEBT 15.01 257669
DAYS_CREDIT_ENDDATE 6.15 105553
---------------------------------------------------------------------------

Summary of bureau_balance

In [ ]:
display_stats(datasets['bureau_balance'], 'bureau_balance')
--------------------------------------------------------------------------------
                    bureau_balance                    
--------------------------------------------------------------------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 27299925 entries, 0 to 27299924
Data columns (total 3 columns):
 #   Column          Non-Null Count     Dtype 
---  ------          --------------     ----- 
 0   SK_ID_BUREAU    27299925 non-null  int64 
 1   MONTHS_BALANCE  27299925 non-null  int64 
 2   STATUS          27299925 non-null  object
dtypes: int64(2), object(1)
memory usage: 624.8+ MB
None
---------------------------------------------------------------------------
Shape of the df bureau_balance is (27299925, 3) 

---------------------------------------------------------------------------
Statistical summary of bureau_balance is :
---------------------------------------------------------------------------
Description of the df bureau_balance:

SK_ID_CURR TARGET CNT_CHILDREN AMT_INCOME_TOTAL AMT_CREDIT AMT_ANNUITY AMT_GOODS_PRICE REGION_POPULATION_RELATIVE DAYS_BIRTH DAYS_EMPLOYED DAYS_REGISTRATION DAYS_ID_PUBLISH OWN_CAR_AGE FLAG_MOBIL FLAG_EMP_PHONE FLAG_WORK_PHONE FLAG_CONT_MOBILE FLAG_PHONE FLAG_EMAIL CNT_FAM_MEMBERS REGION_RATING_CLIENT REGION_RATING_CLIENT_W_CITY HOUR_APPR_PROCESS_START REG_REGION_NOT_LIVE_REGION REG_REGION_NOT_WORK_REGION LIVE_REGION_NOT_WORK_REGION REG_CITY_NOT_LIVE_CITY REG_CITY_NOT_WORK_CITY LIVE_CITY_NOT_WORK_CITY EXT_SOURCE_1 EXT_SOURCE_2 EXT_SOURCE_3 APARTMENTS_AVG BASEMENTAREA_AVG YEARS_BEGINEXPLUATATION_AVG YEARS_BUILD_AVG COMMONAREA_AVG ELEVATORS_AVG ENTRANCES_AVG FLOORSMAX_AVG FLOORSMIN_AVG LANDAREA_AVG LIVINGAPARTMENTS_AVG LIVINGAREA_AVG NONLIVINGAPARTMENTS_AVG NONLIVINGAREA_AVG APARTMENTS_MODE BASEMENTAREA_MODE YEARS_BEGINEXPLUATATION_MODE YEARS_BUILD_MODE COMMONAREA_MODE ELEVATORS_MODE ENTRANCES_MODE FLOORSMAX_MODE FLOORSMIN_MODE LANDAREA_MODE LIVINGAPARTMENTS_MODE LIVINGAREA_MODE NONLIVINGAPARTMENTS_MODE NONLIVINGAREA_MODE APARTMENTS_MEDI BASEMENTAREA_MEDI YEARS_BEGINEXPLUATATION_MEDI YEARS_BUILD_MEDI COMMONAREA_MEDI ELEVATORS_MEDI ENTRANCES_MEDI FLOORSMAX_MEDI FLOORSMIN_MEDI LANDAREA_MEDI LIVINGAPARTMENTS_MEDI LIVINGAREA_MEDI NONLIVINGAPARTMENTS_MEDI NONLIVINGAREA_MEDI TOTALAREA_MODE OBS_30_CNT_SOCIAL_CIRCLE DEF_30_CNT_SOCIAL_CIRCLE OBS_60_CNT_SOCIAL_CIRCLE DEF_60_CNT_SOCIAL_CIRCLE DAYS_LAST_PHONE_CHANGE FLAG_DOCUMENT_2 FLAG_DOCUMENT_3 FLAG_DOCUMENT_4 FLAG_DOCUMENT_5 FLAG_DOCUMENT_6 FLAG_DOCUMENT_7 FLAG_DOCUMENT_8 FLAG_DOCUMENT_9 FLAG_DOCUMENT_10 FLAG_DOCUMENT_11 FLAG_DOCUMENT_12 FLAG_DOCUMENT_13 FLAG_DOCUMENT_14 FLAG_DOCUMENT_15 FLAG_DOCUMENT_16 FLAG_DOCUMENT_17 FLAG_DOCUMENT_18 FLAG_DOCUMENT_19 FLAG_DOCUMENT_20 FLAG_DOCUMENT_21 AMT_REQ_CREDIT_BUREAU_HOUR AMT_REQ_CREDIT_BUREAU_DAY AMT_REQ_CREDIT_BUREAU_WEEK AMT_REQ_CREDIT_BUREAU_MON AMT_REQ_CREDIT_BUREAU_QRT AMT_REQ_CREDIT_BUREAU_YEAR
count 307511.00 307511.00 307511.00 3.075110e+05 307511.00 307499.00 307233.00 307511.00 307511.00 307511.00 307511.00 307511.00 104582.00 307511.0 307511.00 307511.0 307511.00 307511.00 307511.00 307509.00 307511.00 307511.00 307511.00 307511.00 307511.00 307511.00 307511.00 307511.00 307511.00 134133.00 306851.00 246546.00 151450.00 127568.00 157504.00 103023.00 92646.00 143620.00 152683.00 154491.00 98869.00 124921.00 97312.00 153161.00 93997.00 137829.00 151450.00 127568.00 157504.00 103023.00 92646.00 143620.00 152683.00 154491.00 98869.00 124921.00 97312.00 153161.00 93997.00 137829.00 151450.00 127568.00 157504.00 103023.00 92646.00 143620.00 152683.00 154491.00 98869.00 124921.00 97312.00 153161.00 93997.00 137829.00 159080.00 306490.00 306490.00 306490.00 306490.00 307510.00 307511.00 307511.00 307511.00 307511.00 307511.00 307511.00 307511.00 307511.00 307511.0 307511.00 307511.0 307511.00 307511.00 307511.00 307511.00 307511.00 307511.00 307511.00 307511.00 307511.00 265992.00 265992.00 265992.00 265992.00 265992.00 265992.00
mean 278180.52 0.08 0.42 1.687979e+05 599026.00 27108.57 538396.21 0.02 -16037.00 63815.05 -4986.12 -2994.20 12.06 1.0 0.82 0.2 1.00 0.28 0.06 2.15 2.05 2.03 12.06 0.02 0.05 0.04 0.08 0.23 0.18 0.50 0.51 0.51 0.12 0.09 0.98 0.75 0.04 0.08 0.15 0.23 0.23 0.07 0.10 0.11 0.01 0.03 0.11 0.09 0.98 0.76 0.04 0.07 0.15 0.22 0.23 0.06 0.11 0.11 0.01 0.03 0.12 0.09 0.98 0.76 0.04 0.08 0.15 0.23 0.23 0.07 0.10 0.11 0.01 0.03 0.10 1.42 0.14 1.41 0.10 -962.86 0.00 0.71 0.00 0.02 0.09 0.00 0.08 0.00 0.0 0.00 0.0 0.00 0.00 0.00 0.01 0.00 0.01 0.00 0.00 0.00 0.01 0.01 0.03 0.27 0.27 1.90
std 102790.18 0.27 0.72 2.371231e+05 402490.78 14493.74 369446.46 0.01 4363.99 141275.77 3522.89 1509.45 11.94 0.0 0.38 0.4 0.04 0.45 0.23 0.91 0.51 0.50 3.27 0.12 0.22 0.20 0.27 0.42 0.38 0.21 0.19 0.19 0.11 0.08 0.06 0.11 0.08 0.13 0.10 0.14 0.16 0.08 0.09 0.11 0.05 0.07 0.11 0.08 0.06 0.11 0.07 0.13 0.10 0.14 0.16 0.08 0.10 0.11 0.05 0.07 0.11 0.08 0.06 0.11 0.08 0.13 0.10 0.15 0.16 0.08 0.09 0.11 0.05 0.07 0.11 2.40 0.45 2.38 0.36 826.81 0.01 0.45 0.01 0.12 0.28 0.01 0.27 0.06 0.0 0.06 0.0 0.06 0.05 0.03 0.10 0.02 0.09 0.02 0.02 0.02 0.08 0.11 0.20 0.92 0.79 1.87
min 100002.00 0.00 0.00 2.565000e+04 45000.00 1615.50 40500.00 0.00 -25229.00 -17912.00 -24672.00 -7197.00 0.00 0.0 0.00 0.0 0.00 0.00 0.00 1.00 1.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 -4292.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.0 0.00 0.0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
25% 189145.50 0.00 0.00 1.125000e+05 270000.00 16524.00 238500.00 0.01 -19682.00 -2760.00 -7479.50 -4299.00 5.00 1.0 1.00 0.0 1.00 0.00 0.00 2.00 2.00 2.00 10.00 0.00 0.00 0.00 0.00 0.00 0.00 0.33 0.39 0.37 0.06 0.04 0.98 0.69 0.01 0.00 0.07 0.17 0.08 0.02 0.05 0.05 0.00 0.00 0.05 0.04 0.98 0.70 0.01 0.00 0.07 0.17 0.08 0.02 0.05 0.04 0.00 0.00 0.06 0.04 0.98 0.69 0.01 0.00 0.07 0.17 0.08 0.02 0.05 0.05 0.00 0.00 0.04 0.00 0.00 0.00 0.00 -1570.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.0 0.00 0.0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
50% 278202.00 0.00 0.00 1.471500e+05 513531.00 24903.00 450000.00 0.02 -15750.00 -1213.00 -4504.00 -3254.00 9.00 1.0 1.00 0.0 1.00 0.00 0.00 2.00 2.00 2.00 12.00 0.00 0.00 0.00 0.00 0.00 0.00 0.51 0.57 0.54 0.09 0.08 0.98 0.76 0.02 0.00 0.14 0.17 0.21 0.05 0.08 0.07 0.00 0.00 0.08 0.07 0.98 0.76 0.02 0.00 0.14 0.17 0.21 0.05 0.08 0.07 0.00 0.00 0.09 0.08 0.98 0.76 0.02 0.00 0.14 0.17 0.21 0.05 0.08 0.07 0.00 0.00 0.07 0.00 0.00 0.00 0.00 -757.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.0 0.00 0.0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00
75% 367142.50 0.00 1.00 2.025000e+05 808650.00 34596.00 679500.00 0.03 -12413.00 -289.00 -2010.00 -1720.00 15.00 1.0 1.00 0.0 1.00 1.00 0.00 3.00 2.00 2.00 14.00 0.00 0.00 0.00 0.00 0.00 0.00 0.68 0.66 0.67 0.15 0.11 0.99 0.82 0.05 0.12 0.21 0.33 0.38 0.09 0.12 0.13 0.00 0.03 0.14 0.11 0.99 0.82 0.05 0.12 0.21 0.33 0.38 0.08 0.13 0.13 0.00 0.02 0.15 0.11 0.99 0.83 0.05 0.12 0.21 0.33 0.38 0.09 0.12 0.13 0.00 0.03 0.13 2.00 0.00 2.00 0.00 -274.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.0 0.00 0.0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 3.00
max 456255.00 1.00 19.00 1.170000e+08 4050000.00 258025.50 4050000.00 0.07 -7489.00 365243.00 0.00 0.00 91.00 1.0 1.00 1.0 1.00 1.00 1.00 20.00 3.00 3.00 23.00 1.00 1.00 1.00 1.00 1.00 1.00 0.96 0.85 0.90 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 348.00 34.00 344.00 24.00 0.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.0 1.00 1.0 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 4.00 9.00 8.00 27.00 261.00 25.00
None
In [ ]:
display_feature_info(datasets['bureau_balance'], 'bureau_balance')
Description of the df continued for bureau_balance:

---------------------------------------------------------------------------
Data type value counts: 
 int64     2
object    1
dtype: int64

Return number of unique elements in the object. 

STATUS    8
dtype: int64
---------------------------------------------------------------------------
Categorical and Numerical(int + float) features  of bureau_balance.
---------------------------------------------------------------------------

{'int64': Index(['SK_ID_BUREAU', 'MONTHS_BALANCE'], dtype='object')}
------------------------------
{'object': Index(['STATUS'], dtype='object')}
------------------------------

 

---------------------------------------------------------------------------
---------------------------------------------------------------------------

 The Missing Data: 

No missing Data

Observation 4

  • Bureau balance & bureau have no missing data. These datasets can provide accurate aggreagte features.

Summary of credit_card_balance

In [ ]:
display_stats(datasets['credit_card_balance'], 'credit_card_balance') 
--------------------------------------------------------------------------------
                    credit_card_balance                    
--------------------------------------------------------------------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3840312 entries, 0 to 3840311
Data columns (total 23 columns):
 #   Column                      Non-Null Count    Dtype  
---  ------                      --------------    -----  
 0   SK_ID_PREV                  3840312 non-null  int64  
 1   SK_ID_CURR                  3840312 non-null  int64  
 2   MONTHS_BALANCE              3840312 non-null  int64  
 3   AMT_BALANCE                 3840312 non-null  float64
 4   AMT_CREDIT_LIMIT_ACTUAL     3840312 non-null  int64  
 5   AMT_DRAWINGS_ATM_CURRENT    3090496 non-null  float64
 6   AMT_DRAWINGS_CURRENT        3840312 non-null  float64
 7   AMT_DRAWINGS_OTHER_CURRENT  3090496 non-null  float64
 8   AMT_DRAWINGS_POS_CURRENT    3090496 non-null  float64
 9   AMT_INST_MIN_REGULARITY     3535076 non-null  float64
 10  AMT_PAYMENT_CURRENT         3072324 non-null  float64
 11  AMT_PAYMENT_TOTAL_CURRENT   3840312 non-null  float64
 12  AMT_RECEIVABLE_PRINCIPAL    3840312 non-null  float64
 13  AMT_RECIVABLE               3840312 non-null  float64
 14  AMT_TOTAL_RECEIVABLE        3840312 non-null  float64
 15  CNT_DRAWINGS_ATM_CURRENT    3090496 non-null  float64
 16  CNT_DRAWINGS_CURRENT        3840312 non-null  int64  
 17  CNT_DRAWINGS_OTHER_CURRENT  3090496 non-null  float64
 18  CNT_DRAWINGS_POS_CURRENT    3090496 non-null  float64
 19  CNT_INSTALMENT_MATURE_CUM   3535076 non-null  float64
 20  NAME_CONTRACT_STATUS        3840312 non-null  object 
 21  SK_DPD                      3840312 non-null  int64  
 22  SK_DPD_DEF                  3840312 non-null  int64  
dtypes: float64(15), int64(7), object(1)
memory usage: 673.9+ MB
None
---------------------------------------------------------------------------
Shape of the df credit_card_balance is (3840312, 23) 

---------------------------------------------------------------------------
Statistical summary of credit_card_balance is :
---------------------------------------------------------------------------
Description of the df credit_card_balance:

SK_ID_CURR TARGET CNT_CHILDREN AMT_INCOME_TOTAL AMT_CREDIT AMT_ANNUITY AMT_GOODS_PRICE REGION_POPULATION_RELATIVE DAYS_BIRTH DAYS_EMPLOYED DAYS_REGISTRATION DAYS_ID_PUBLISH OWN_CAR_AGE FLAG_MOBIL FLAG_EMP_PHONE FLAG_WORK_PHONE FLAG_CONT_MOBILE FLAG_PHONE FLAG_EMAIL CNT_FAM_MEMBERS REGION_RATING_CLIENT REGION_RATING_CLIENT_W_CITY HOUR_APPR_PROCESS_START REG_REGION_NOT_LIVE_REGION REG_REGION_NOT_WORK_REGION LIVE_REGION_NOT_WORK_REGION REG_CITY_NOT_LIVE_CITY REG_CITY_NOT_WORK_CITY LIVE_CITY_NOT_WORK_CITY EXT_SOURCE_1 EXT_SOURCE_2 EXT_SOURCE_3 APARTMENTS_AVG BASEMENTAREA_AVG YEARS_BEGINEXPLUATATION_AVG YEARS_BUILD_AVG COMMONAREA_AVG ELEVATORS_AVG ENTRANCES_AVG FLOORSMAX_AVG FLOORSMIN_AVG LANDAREA_AVG LIVINGAPARTMENTS_AVG LIVINGAREA_AVG NONLIVINGAPARTMENTS_AVG NONLIVINGAREA_AVG APARTMENTS_MODE BASEMENTAREA_MODE YEARS_BEGINEXPLUATATION_MODE YEARS_BUILD_MODE COMMONAREA_MODE ELEVATORS_MODE ENTRANCES_MODE FLOORSMAX_MODE FLOORSMIN_MODE LANDAREA_MODE LIVINGAPARTMENTS_MODE LIVINGAREA_MODE NONLIVINGAPARTMENTS_MODE NONLIVINGAREA_MODE APARTMENTS_MEDI BASEMENTAREA_MEDI YEARS_BEGINEXPLUATATION_MEDI YEARS_BUILD_MEDI COMMONAREA_MEDI ELEVATORS_MEDI ENTRANCES_MEDI FLOORSMAX_MEDI FLOORSMIN_MEDI LANDAREA_MEDI LIVINGAPARTMENTS_MEDI LIVINGAREA_MEDI NONLIVINGAPARTMENTS_MEDI NONLIVINGAREA_MEDI TOTALAREA_MODE OBS_30_CNT_SOCIAL_CIRCLE DEF_30_CNT_SOCIAL_CIRCLE OBS_60_CNT_SOCIAL_CIRCLE DEF_60_CNT_SOCIAL_CIRCLE DAYS_LAST_PHONE_CHANGE FLAG_DOCUMENT_2 FLAG_DOCUMENT_3 FLAG_DOCUMENT_4 FLAG_DOCUMENT_5 FLAG_DOCUMENT_6 FLAG_DOCUMENT_7 FLAG_DOCUMENT_8 FLAG_DOCUMENT_9 FLAG_DOCUMENT_10 FLAG_DOCUMENT_11 FLAG_DOCUMENT_12 FLAG_DOCUMENT_13 FLAG_DOCUMENT_14 FLAG_DOCUMENT_15 FLAG_DOCUMENT_16 FLAG_DOCUMENT_17 FLAG_DOCUMENT_18 FLAG_DOCUMENT_19 FLAG_DOCUMENT_20 FLAG_DOCUMENT_21 AMT_REQ_CREDIT_BUREAU_HOUR AMT_REQ_CREDIT_BUREAU_DAY AMT_REQ_CREDIT_BUREAU_WEEK AMT_REQ_CREDIT_BUREAU_MON AMT_REQ_CREDIT_BUREAU_QRT AMT_REQ_CREDIT_BUREAU_YEAR
count 307511.00 307511.00 307511.00 3.075110e+05 307511.00 307499.00 307233.00 307511.00 307511.00 307511.00 307511.00 307511.00 104582.00 307511.0 307511.00 307511.0 307511.00 307511.00 307511.00 307509.00 307511.00 307511.00 307511.00 307511.00 307511.00 307511.00 307511.00 307511.00 307511.00 134133.00 306851.00 246546.00 151450.00 127568.00 157504.00 103023.00 92646.00 143620.00 152683.00 154491.00 98869.00 124921.00 97312.00 153161.00 93997.00 137829.00 151450.00 127568.00 157504.00 103023.00 92646.00 143620.00 152683.00 154491.00 98869.00 124921.00 97312.00 153161.00 93997.00 137829.00 151450.00 127568.00 157504.00 103023.00 92646.00 143620.00 152683.00 154491.00 98869.00 124921.00 97312.00 153161.00 93997.00 137829.00 159080.00 306490.00 306490.00 306490.00 306490.00 307510.00 307511.00 307511.00 307511.00 307511.00 307511.00 307511.00 307511.00 307511.00 307511.0 307511.00 307511.0 307511.00 307511.00 307511.00 307511.00 307511.00 307511.00 307511.00 307511.00 307511.00 265992.00 265992.00 265992.00 265992.00 265992.00 265992.00
mean 278180.52 0.08 0.42 1.687979e+05 599026.00 27108.57 538396.21 0.02 -16037.00 63815.05 -4986.12 -2994.20 12.06 1.0 0.82 0.2 1.00 0.28 0.06 2.15 2.05 2.03 12.06 0.02 0.05 0.04 0.08 0.23 0.18 0.50 0.51 0.51 0.12 0.09 0.98 0.75 0.04 0.08 0.15 0.23 0.23 0.07 0.10 0.11 0.01 0.03 0.11 0.09 0.98 0.76 0.04 0.07 0.15 0.22 0.23 0.06 0.11 0.11 0.01 0.03 0.12 0.09 0.98 0.76 0.04 0.08 0.15 0.23 0.23 0.07 0.10 0.11 0.01 0.03 0.10 1.42 0.14 1.41 0.10 -962.86 0.00 0.71 0.00 0.02 0.09 0.00 0.08 0.00 0.0 0.00 0.0 0.00 0.00 0.00 0.01 0.00 0.01 0.00 0.00 0.00 0.01 0.01 0.03 0.27 0.27 1.90
std 102790.18 0.27 0.72 2.371231e+05 402490.78 14493.74 369446.46 0.01 4363.99 141275.77 3522.89 1509.45 11.94 0.0 0.38 0.4 0.04 0.45 0.23 0.91 0.51 0.50 3.27 0.12 0.22 0.20 0.27 0.42 0.38 0.21 0.19 0.19 0.11 0.08 0.06 0.11 0.08 0.13 0.10 0.14 0.16 0.08 0.09 0.11 0.05 0.07 0.11 0.08 0.06 0.11 0.07 0.13 0.10 0.14 0.16 0.08 0.10 0.11 0.05 0.07 0.11 0.08 0.06 0.11 0.08 0.13 0.10 0.15 0.16 0.08 0.09 0.11 0.05 0.07 0.11 2.40 0.45 2.38 0.36 826.81 0.01 0.45 0.01 0.12 0.28 0.01 0.27 0.06 0.0 0.06 0.0 0.06 0.05 0.03 0.10 0.02 0.09 0.02 0.02 0.02 0.08 0.11 0.20 0.92 0.79 1.87
min 100002.00 0.00 0.00 2.565000e+04 45000.00 1615.50 40500.00 0.00 -25229.00 -17912.00 -24672.00 -7197.00 0.00 0.0 0.00 0.0 0.00 0.00 0.00 1.00 1.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 -4292.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.0 0.00 0.0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
25% 189145.50 0.00 0.00 1.125000e+05 270000.00 16524.00 238500.00 0.01 -19682.00 -2760.00 -7479.50 -4299.00 5.00 1.0 1.00 0.0 1.00 0.00 0.00 2.00 2.00 2.00 10.00 0.00 0.00 0.00 0.00 0.00 0.00 0.33 0.39 0.37 0.06 0.04 0.98 0.69 0.01 0.00 0.07 0.17 0.08 0.02 0.05 0.05 0.00 0.00 0.05 0.04 0.98 0.70 0.01 0.00 0.07 0.17 0.08 0.02 0.05 0.04 0.00 0.00 0.06 0.04 0.98 0.69 0.01 0.00 0.07 0.17 0.08 0.02 0.05 0.05 0.00 0.00 0.04 0.00 0.00 0.00 0.00 -1570.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.0 0.00 0.0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
50% 278202.00 0.00 0.00 1.471500e+05 513531.00 24903.00 450000.00 0.02 -15750.00 -1213.00 -4504.00 -3254.00 9.00 1.0 1.00 0.0 1.00 0.00 0.00 2.00 2.00 2.00 12.00 0.00 0.00 0.00 0.00 0.00 0.00 0.51 0.57 0.54 0.09 0.08 0.98 0.76 0.02 0.00 0.14 0.17 0.21 0.05 0.08 0.07 0.00 0.00 0.08 0.07 0.98 0.76 0.02 0.00 0.14 0.17 0.21 0.05 0.08 0.07 0.00 0.00 0.09 0.08 0.98 0.76 0.02 0.00 0.14 0.17 0.21 0.05 0.08 0.07 0.00 0.00 0.07 0.00 0.00 0.00 0.00 -757.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.0 0.00 0.0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00
75% 367142.50 0.00 1.00 2.025000e+05 808650.00 34596.00 679500.00 0.03 -12413.00 -289.00 -2010.00 -1720.00 15.00 1.0 1.00 0.0 1.00 1.00 0.00 3.00 2.00 2.00 14.00 0.00 0.00 0.00 0.00 0.00 0.00 0.68 0.66 0.67 0.15 0.11 0.99 0.82 0.05 0.12 0.21 0.33 0.38 0.09 0.12 0.13 0.00 0.03 0.14 0.11 0.99 0.82 0.05 0.12 0.21 0.33 0.38 0.08 0.13 0.13 0.00 0.02 0.15 0.11 0.99 0.83 0.05 0.12 0.21 0.33 0.38 0.09 0.12 0.13 0.00 0.03 0.13 2.00 0.00 2.00 0.00 -274.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.0 0.00 0.0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 3.00
max 456255.00 1.00 19.00 1.170000e+08 4050000.00 258025.50 4050000.00 0.07 -7489.00 365243.00 0.00 0.00 91.00 1.0 1.00 1.0 1.00 1.00 1.00 20.00 3.00 3.00 23.00 1.00 1.00 1.00 1.00 1.00 1.00 0.96 0.85 0.90 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 348.00 34.00 344.00 24.00 0.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.0 1.00 1.0 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 4.00 9.00 8.00 27.00 261.00 25.00
None
In [ ]:
display_feature_info(datasets['credit_card_balance'], 'credit_card_balance') 
Description of the df continued for credit_card_balance:

---------------------------------------------------------------------------
Data type value counts: 
 float64    15
int64       7
object      1
dtype: int64

Return number of unique elements in the object. 

NAME_CONTRACT_STATUS    7
dtype: int64
---------------------------------------------------------------------------
Categorical and Numerical(int + float) features  of credit_card_balance.
---------------------------------------------------------------------------

{'int64': Index(['SK_ID_PREV', 'SK_ID_CURR', 'MONTHS_BALANCE', 'AMT_CREDIT_LIMIT_ACTUAL',
       'CNT_DRAWINGS_CURRENT', 'SK_DPD', 'SK_DPD_DEF'],
      dtype='object')}
------------------------------
{'float64': Index(['AMT_BALANCE', 'AMT_DRAWINGS_ATM_CURRENT', 'AMT_DRAWINGS_CURRENT',
       'AMT_DRAWINGS_OTHER_CURRENT', 'AMT_DRAWINGS_POS_CURRENT',
       'AMT_INST_MIN_REGULARITY', 'AMT_PAYMENT_CURRENT',
       'AMT_PAYMENT_TOTAL_CURRENT', 'AMT_RECEIVABLE_PRINCIPAL',
       'AMT_RECIVABLE', 'AMT_TOTAL_RECEIVABLE', 'CNT_DRAWINGS_ATM_CURRENT',
       'CNT_DRAWINGS_OTHER_CURRENT', 'CNT_DRAWINGS_POS_CURRENT',
       'CNT_INSTALMENT_MATURE_CUM'],
      dtype='object')}
------------------------------
{'object': Index(['NAME_CONTRACT_STATUS'], dtype='object')}
------------------------------

 

---------------------------------------------------------------------------
---------------------------------------------------------------------------

 The Missing Data: 

Percent Train Missing Count
AMT_PAYMENT_CURRENT 20.00 767988
AMT_DRAWINGS_OTHER_CURRENT 19.52 749816
CNT_DRAWINGS_POS_CURRENT 19.52 749816
CNT_DRAWINGS_OTHER_CURRENT 19.52 749816
CNT_DRAWINGS_ATM_CURRENT 19.52 749816
AMT_DRAWINGS_ATM_CURRENT 19.52 749816
AMT_DRAWINGS_POS_CURRENT 19.52 749816
CNT_INSTALMENT_MATURE_CUM 7.95 305236
AMT_INST_MIN_REGULARITY 7.95 305236
---------------------------------------------------------------------------

Summary of installments_payments

In [ ]:
display_stats(datasets['installments_payments'], 'installments_payments') 
--------------------------------------------------------------------------------
                    installments_payments                    
--------------------------------------------------------------------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7217242 entries, 0 to 7217241
Data columns (total 8 columns):
 #   Column                  Non-Null Count    Dtype  
---  ------                  --------------    -----  
 0   SK_ID_PREV              7217242 non-null  int64  
 1   SK_ID_CURR              7217242 non-null  int64  
 2   NUM_INSTALMENT_VERSION  7217242 non-null  float64
 3   NUM_INSTALMENT_NUMBER   7217242 non-null  int64  
 4   DAYS_INSTALMENT         7217242 non-null  float64
 5   DAYS_ENTRY_PAYMENT      7216482 non-null  float64
 6   AMT_INSTALMENT          7217242 non-null  float64
 7   AMT_PAYMENT             7216482 non-null  float64
dtypes: float64(5), int64(3)
memory usage: 440.5 MB
None
---------------------------------------------------------------------------
Shape of the df installments_payments is (7217242, 8) 

---------------------------------------------------------------------------
Statistical summary of installments_payments is :
---------------------------------------------------------------------------
Description of the df installments_payments:

SK_ID_CURR TARGET CNT_CHILDREN AMT_INCOME_TOTAL AMT_CREDIT AMT_ANNUITY AMT_GOODS_PRICE REGION_POPULATION_RELATIVE DAYS_BIRTH DAYS_EMPLOYED DAYS_REGISTRATION DAYS_ID_PUBLISH OWN_CAR_AGE FLAG_MOBIL FLAG_EMP_PHONE FLAG_WORK_PHONE FLAG_CONT_MOBILE FLAG_PHONE FLAG_EMAIL CNT_FAM_MEMBERS REGION_RATING_CLIENT REGION_RATING_CLIENT_W_CITY HOUR_APPR_PROCESS_START REG_REGION_NOT_LIVE_REGION REG_REGION_NOT_WORK_REGION LIVE_REGION_NOT_WORK_REGION REG_CITY_NOT_LIVE_CITY REG_CITY_NOT_WORK_CITY LIVE_CITY_NOT_WORK_CITY EXT_SOURCE_1 EXT_SOURCE_2 EXT_SOURCE_3 APARTMENTS_AVG BASEMENTAREA_AVG YEARS_BEGINEXPLUATATION_AVG YEARS_BUILD_AVG COMMONAREA_AVG ELEVATORS_AVG ENTRANCES_AVG FLOORSMAX_AVG FLOORSMIN_AVG LANDAREA_AVG LIVINGAPARTMENTS_AVG LIVINGAREA_AVG NONLIVINGAPARTMENTS_AVG NONLIVINGAREA_AVG APARTMENTS_MODE BASEMENTAREA_MODE YEARS_BEGINEXPLUATATION_MODE YEARS_BUILD_MODE COMMONAREA_MODE ELEVATORS_MODE ENTRANCES_MODE FLOORSMAX_MODE FLOORSMIN_MODE LANDAREA_MODE LIVINGAPARTMENTS_MODE LIVINGAREA_MODE NONLIVINGAPARTMENTS_MODE NONLIVINGAREA_MODE APARTMENTS_MEDI BASEMENTAREA_MEDI YEARS_BEGINEXPLUATATION_MEDI YEARS_BUILD_MEDI COMMONAREA_MEDI ELEVATORS_MEDI ENTRANCES_MEDI FLOORSMAX_MEDI FLOORSMIN_MEDI LANDAREA_MEDI LIVINGAPARTMENTS_MEDI LIVINGAREA_MEDI NONLIVINGAPARTMENTS_MEDI NONLIVINGAREA_MEDI TOTALAREA_MODE OBS_30_CNT_SOCIAL_CIRCLE DEF_30_CNT_SOCIAL_CIRCLE OBS_60_CNT_SOCIAL_CIRCLE DEF_60_CNT_SOCIAL_CIRCLE DAYS_LAST_PHONE_CHANGE FLAG_DOCUMENT_2 FLAG_DOCUMENT_3 FLAG_DOCUMENT_4 FLAG_DOCUMENT_5 FLAG_DOCUMENT_6 FLAG_DOCUMENT_7 FLAG_DOCUMENT_8 FLAG_DOCUMENT_9 FLAG_DOCUMENT_10 FLAG_DOCUMENT_11 FLAG_DOCUMENT_12 FLAG_DOCUMENT_13 FLAG_DOCUMENT_14 FLAG_DOCUMENT_15 FLAG_DOCUMENT_16 FLAG_DOCUMENT_17 FLAG_DOCUMENT_18 FLAG_DOCUMENT_19 FLAG_DOCUMENT_20 FLAG_DOCUMENT_21 AMT_REQ_CREDIT_BUREAU_HOUR AMT_REQ_CREDIT_BUREAU_DAY AMT_REQ_CREDIT_BUREAU_WEEK AMT_REQ_CREDIT_BUREAU_MON AMT_REQ_CREDIT_BUREAU_QRT AMT_REQ_CREDIT_BUREAU_YEAR
count 307511.00 307511.00 307511.00 3.075110e+05 307511.00 307499.00 307233.00 307511.00 307511.00 307511.00 307511.00 307511.00 104582.00 307511.0 307511.00 307511.0 307511.00 307511.00 307511.00 307509.00 307511.00 307511.00 307511.00 307511.00 307511.00 307511.00 307511.00 307511.00 307511.00 134133.00 306851.00 246546.00 151450.00 127568.00 157504.00 103023.00 92646.00 143620.00 152683.00 154491.00 98869.00 124921.00 97312.00 153161.00 93997.00 137829.00 151450.00 127568.00 157504.00 103023.00 92646.00 143620.00 152683.00 154491.00 98869.00 124921.00 97312.00 153161.00 93997.00 137829.00 151450.00 127568.00 157504.00 103023.00 92646.00 143620.00 152683.00 154491.00 98869.00 124921.00 97312.00 153161.00 93997.00 137829.00 159080.00 306490.00 306490.00 306490.00 306490.00 307510.00 307511.00 307511.00 307511.00 307511.00 307511.00 307511.00 307511.00 307511.00 307511.0 307511.00 307511.0 307511.00 307511.00 307511.00 307511.00 307511.00 307511.00 307511.00 307511.00 307511.00 265992.00 265992.00 265992.00 265992.00 265992.00 265992.00
mean 278180.52 0.08 0.42 1.687979e+05 599026.00 27108.57 538396.21 0.02 -16037.00 63815.05 -4986.12 -2994.20 12.06 1.0 0.82 0.2 1.00 0.28 0.06 2.15 2.05 2.03 12.06 0.02 0.05 0.04 0.08 0.23 0.18 0.50 0.51 0.51 0.12 0.09 0.98 0.75 0.04 0.08 0.15 0.23 0.23 0.07 0.10 0.11 0.01 0.03 0.11 0.09 0.98 0.76 0.04 0.07 0.15 0.22 0.23 0.06 0.11 0.11 0.01 0.03 0.12 0.09 0.98 0.76 0.04 0.08 0.15 0.23 0.23 0.07 0.10 0.11 0.01 0.03 0.10 1.42 0.14 1.41 0.10 -962.86 0.00 0.71 0.00 0.02 0.09 0.00 0.08 0.00 0.0 0.00 0.0 0.00 0.00 0.00 0.01 0.00 0.01 0.00 0.00 0.00 0.01 0.01 0.03 0.27 0.27 1.90
std 102790.18 0.27 0.72 2.371231e+05 402490.78 14493.74 369446.46 0.01 4363.99 141275.77 3522.89 1509.45 11.94 0.0 0.38 0.4 0.04 0.45 0.23 0.91 0.51 0.50 3.27 0.12 0.22 0.20 0.27 0.42 0.38 0.21 0.19 0.19 0.11 0.08 0.06 0.11 0.08 0.13 0.10 0.14 0.16 0.08 0.09 0.11 0.05 0.07 0.11 0.08 0.06 0.11 0.07 0.13 0.10 0.14 0.16 0.08 0.10 0.11 0.05 0.07 0.11 0.08 0.06 0.11 0.08 0.13 0.10 0.15 0.16 0.08 0.09 0.11 0.05 0.07 0.11 2.40 0.45 2.38 0.36 826.81 0.01 0.45 0.01 0.12 0.28 0.01 0.27 0.06 0.0 0.06 0.0 0.06 0.05 0.03 0.10 0.02 0.09 0.02 0.02 0.02 0.08 0.11 0.20 0.92 0.79 1.87
min 100002.00 0.00 0.00 2.565000e+04 45000.00 1615.50 40500.00 0.00 -25229.00 -17912.00 -24672.00 -7197.00 0.00 0.0 0.00 0.0 0.00 0.00 0.00 1.00 1.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 -4292.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.0 0.00 0.0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
25% 189145.50 0.00 0.00 1.125000e+05 270000.00 16524.00 238500.00 0.01 -19682.00 -2760.00 -7479.50 -4299.00 5.00 1.0 1.00 0.0 1.00 0.00 0.00 2.00 2.00 2.00 10.00 0.00 0.00 0.00 0.00 0.00 0.00 0.33 0.39 0.37 0.06 0.04 0.98 0.69 0.01 0.00 0.07 0.17 0.08 0.02 0.05 0.05 0.00 0.00 0.05 0.04 0.98 0.70 0.01 0.00 0.07 0.17 0.08 0.02 0.05 0.04 0.00 0.00 0.06 0.04 0.98 0.69 0.01 0.00 0.07 0.17 0.08 0.02 0.05 0.05 0.00 0.00 0.04 0.00 0.00 0.00 0.00 -1570.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.0 0.00 0.0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
50% 278202.00 0.00 0.00 1.471500e+05 513531.00 24903.00 450000.00 0.02 -15750.00 -1213.00 -4504.00 -3254.00 9.00 1.0 1.00 0.0 1.00 0.00 0.00 2.00 2.00 2.00 12.00 0.00 0.00 0.00 0.00 0.00 0.00 0.51 0.57 0.54 0.09 0.08 0.98 0.76 0.02 0.00 0.14 0.17 0.21 0.05 0.08 0.07 0.00 0.00 0.08 0.07 0.98 0.76 0.02 0.00 0.14 0.17 0.21 0.05 0.08 0.07 0.00 0.00 0.09 0.08 0.98 0.76 0.02 0.00 0.14 0.17 0.21 0.05 0.08 0.07 0.00 0.00 0.07 0.00 0.00 0.00 0.00 -757.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.0 0.00 0.0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00
75% 367142.50 0.00 1.00 2.025000e+05 808650.00 34596.00 679500.00 0.03 -12413.00 -289.00 -2010.00 -1720.00 15.00 1.0 1.00 0.0 1.00 1.00 0.00 3.00 2.00 2.00 14.00 0.00 0.00 0.00 0.00 0.00 0.00 0.68 0.66 0.67 0.15 0.11 0.99 0.82 0.05 0.12 0.21 0.33 0.38 0.09 0.12 0.13 0.00 0.03 0.14 0.11 0.99 0.82 0.05 0.12 0.21 0.33 0.38 0.08 0.13 0.13 0.00 0.02 0.15 0.11 0.99 0.83 0.05 0.12 0.21 0.33 0.38 0.09 0.12 0.13 0.00 0.03 0.13 2.00 0.00 2.00 0.00 -274.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.0 0.00 0.0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 3.00
max 456255.00 1.00 19.00 1.170000e+08 4050000.00 258025.50 4050000.00 0.07 -7489.00 365243.00 0.00 0.00 91.00 1.0 1.00 1.0 1.00 1.00 1.00 20.00 3.00 3.00 23.00 1.00 1.00 1.00 1.00 1.00 1.00 0.96 0.85 0.90 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 348.00 34.00 344.00 24.00 0.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.0 1.00 1.0 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 4.00 9.00 8.00 27.00 261.00 25.00
None
In [ ]:
display_feature_info(datasets['installments_payments'], 'installments_payments') 
Description of the df continued for installments_payments:

---------------------------------------------------------------------------
Data type value counts: 
 float64    5
int64      3
dtype: int64

Return number of unique elements in the object. 

Series([], dtype: float64)
---------------------------------------------------------------------------
Categorical and Numerical(int + float) features  of installments_payments.
---------------------------------------------------------------------------

{'int64': Index(['SK_ID_PREV', 'SK_ID_CURR', 'NUM_INSTALMENT_NUMBER'], dtype='object')}
------------------------------
{'float64': Index(['NUM_INSTALMENT_VERSION', 'DAYS_INSTALMENT', 'DAYS_ENTRY_PAYMENT',
       'AMT_INSTALMENT', 'AMT_PAYMENT'],
      dtype='object')}
------------------------------

 

---------------------------------------------------------------------------
---------------------------------------------------------------------------

 The Missing Data: 

Percent Train Missing Count
AMT_PAYMENT 0.01 760
DAYS_ENTRY_PAYMENT 0.01 760
---------------------------------------------------------------------------

Summary of POS_CASH_balance

In [ ]:
display_stats(datasets['POS_CASH_balance'], 'POS_CASH_balance') 
--------------------------------------------------------------------------------
                    POS_CASH_balance                    
--------------------------------------------------------------------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10001358 entries, 0 to 10001357
Data columns (total 8 columns):
 #   Column                 Non-Null Count     Dtype  
---  ------                 --------------     -----  
 0   SK_ID_PREV             10001358 non-null  int64  
 1   SK_ID_CURR             10001358 non-null  int64  
 2   MONTHS_BALANCE         10001358 non-null  int64  
 3   CNT_INSTALMENT         9975287 non-null   float64
 4   CNT_INSTALMENT_FUTURE  9975271 non-null   float64
 5   NAME_CONTRACT_STATUS   10001358 non-null  object 
 6   SK_DPD                 10001358 non-null  int64  
 7   SK_DPD_DEF             10001358 non-null  int64  
dtypes: float64(2), int64(5), object(1)
memory usage: 610.4+ MB
None
---------------------------------------------------------------------------
Shape of the df POS_CASH_balance is (10001358, 8) 

---------------------------------------------------------------------------
Statistical summary of POS_CASH_balance is :
---------------------------------------------------------------------------
Description of the df POS_CASH_balance:

SK_ID_CURR TARGET CNT_CHILDREN AMT_INCOME_TOTAL AMT_CREDIT AMT_ANNUITY AMT_GOODS_PRICE REGION_POPULATION_RELATIVE DAYS_BIRTH DAYS_EMPLOYED DAYS_REGISTRATION DAYS_ID_PUBLISH OWN_CAR_AGE FLAG_MOBIL FLAG_EMP_PHONE FLAG_WORK_PHONE FLAG_CONT_MOBILE FLAG_PHONE FLAG_EMAIL CNT_FAM_MEMBERS REGION_RATING_CLIENT REGION_RATING_CLIENT_W_CITY HOUR_APPR_PROCESS_START REG_REGION_NOT_LIVE_REGION REG_REGION_NOT_WORK_REGION LIVE_REGION_NOT_WORK_REGION REG_CITY_NOT_LIVE_CITY REG_CITY_NOT_WORK_CITY LIVE_CITY_NOT_WORK_CITY EXT_SOURCE_1 EXT_SOURCE_2 EXT_SOURCE_3 APARTMENTS_AVG BASEMENTAREA_AVG YEARS_BEGINEXPLUATATION_AVG YEARS_BUILD_AVG COMMONAREA_AVG ELEVATORS_AVG ENTRANCES_AVG FLOORSMAX_AVG FLOORSMIN_AVG LANDAREA_AVG LIVINGAPARTMENTS_AVG LIVINGAREA_AVG NONLIVINGAPARTMENTS_AVG NONLIVINGAREA_AVG APARTMENTS_MODE BASEMENTAREA_MODE YEARS_BEGINEXPLUATATION_MODE YEARS_BUILD_MODE COMMONAREA_MODE ELEVATORS_MODE ENTRANCES_MODE FLOORSMAX_MODE FLOORSMIN_MODE LANDAREA_MODE LIVINGAPARTMENTS_MODE LIVINGAREA_MODE NONLIVINGAPARTMENTS_MODE NONLIVINGAREA_MODE APARTMENTS_MEDI BASEMENTAREA_MEDI YEARS_BEGINEXPLUATATION_MEDI YEARS_BUILD_MEDI COMMONAREA_MEDI ELEVATORS_MEDI ENTRANCES_MEDI FLOORSMAX_MEDI FLOORSMIN_MEDI LANDAREA_MEDI LIVINGAPARTMENTS_MEDI LIVINGAREA_MEDI NONLIVINGAPARTMENTS_MEDI NONLIVINGAREA_MEDI TOTALAREA_MODE OBS_30_CNT_SOCIAL_CIRCLE DEF_30_CNT_SOCIAL_CIRCLE OBS_60_CNT_SOCIAL_CIRCLE DEF_60_CNT_SOCIAL_CIRCLE DAYS_LAST_PHONE_CHANGE FLAG_DOCUMENT_2 FLAG_DOCUMENT_3 FLAG_DOCUMENT_4 FLAG_DOCUMENT_5 FLAG_DOCUMENT_6 FLAG_DOCUMENT_7 FLAG_DOCUMENT_8 FLAG_DOCUMENT_9 FLAG_DOCUMENT_10 FLAG_DOCUMENT_11 FLAG_DOCUMENT_12 FLAG_DOCUMENT_13 FLAG_DOCUMENT_14 FLAG_DOCUMENT_15 FLAG_DOCUMENT_16 FLAG_DOCUMENT_17 FLAG_DOCUMENT_18 FLAG_DOCUMENT_19 FLAG_DOCUMENT_20 FLAG_DOCUMENT_21 AMT_REQ_CREDIT_BUREAU_HOUR AMT_REQ_CREDIT_BUREAU_DAY AMT_REQ_CREDIT_BUREAU_WEEK AMT_REQ_CREDIT_BUREAU_MON AMT_REQ_CREDIT_BUREAU_QRT AMT_REQ_CREDIT_BUREAU_YEAR
count 307511.00 307511.00 307511.00 3.075110e+05 307511.00 307499.00 307233.00 307511.00 307511.00 307511.00 307511.00 307511.00 104582.00 307511.0 307511.00 307511.0 307511.00 307511.00 307511.00 307509.00 307511.00 307511.00 307511.00 307511.00 307511.00 307511.00 307511.00 307511.00 307511.00 134133.00 306851.00 246546.00 151450.00 127568.00 157504.00 103023.00 92646.00 143620.00 152683.00 154491.00 98869.00 124921.00 97312.00 153161.00 93997.00 137829.00 151450.00 127568.00 157504.00 103023.00 92646.00 143620.00 152683.00 154491.00 98869.00 124921.00 97312.00 153161.00 93997.00 137829.00 151450.00 127568.00 157504.00 103023.00 92646.00 143620.00 152683.00 154491.00 98869.00 124921.00 97312.00 153161.00 93997.00 137829.00 159080.00 306490.00 306490.00 306490.00 306490.00 307510.00 307511.00 307511.00 307511.00 307511.00 307511.00 307511.00 307511.00 307511.00 307511.0 307511.00 307511.0 307511.00 307511.00 307511.00 307511.00 307511.00 307511.00 307511.00 307511.00 307511.00 265992.00 265992.00 265992.00 265992.00 265992.00 265992.00
mean 278180.52 0.08 0.42 1.687979e+05 599026.00 27108.57 538396.21 0.02 -16037.00 63815.05 -4986.12 -2994.20 12.06 1.0 0.82 0.2 1.00 0.28 0.06 2.15 2.05 2.03 12.06 0.02 0.05 0.04 0.08 0.23 0.18 0.50 0.51 0.51 0.12 0.09 0.98 0.75 0.04 0.08 0.15 0.23 0.23 0.07 0.10 0.11 0.01 0.03 0.11 0.09 0.98 0.76 0.04 0.07 0.15 0.22 0.23 0.06 0.11 0.11 0.01 0.03 0.12 0.09 0.98 0.76 0.04 0.08 0.15 0.23 0.23 0.07 0.10 0.11 0.01 0.03 0.10 1.42 0.14 1.41 0.10 -962.86 0.00 0.71 0.00 0.02 0.09 0.00 0.08 0.00 0.0 0.00 0.0 0.00 0.00 0.00 0.01 0.00 0.01 0.00 0.00 0.00 0.01 0.01 0.03 0.27 0.27 1.90
std 102790.18 0.27 0.72 2.371231e+05 402490.78 14493.74 369446.46 0.01 4363.99 141275.77 3522.89 1509.45 11.94 0.0 0.38 0.4 0.04 0.45 0.23 0.91 0.51 0.50 3.27 0.12 0.22 0.20 0.27 0.42 0.38 0.21 0.19 0.19 0.11 0.08 0.06 0.11 0.08 0.13 0.10 0.14 0.16 0.08 0.09 0.11 0.05 0.07 0.11 0.08 0.06 0.11 0.07 0.13 0.10 0.14 0.16 0.08 0.10 0.11 0.05 0.07 0.11 0.08 0.06 0.11 0.08 0.13 0.10 0.15 0.16 0.08 0.09 0.11 0.05 0.07 0.11 2.40 0.45 2.38 0.36 826.81 0.01 0.45 0.01 0.12 0.28 0.01 0.27 0.06 0.0 0.06 0.0 0.06 0.05 0.03 0.10 0.02 0.09 0.02 0.02 0.02 0.08 0.11 0.20 0.92 0.79 1.87
min 100002.00 0.00 0.00 2.565000e+04 45000.00 1615.50 40500.00 0.00 -25229.00 -17912.00 -24672.00 -7197.00 0.00 0.0 0.00 0.0 0.00 0.00 0.00 1.00 1.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 -4292.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.0 0.00 0.0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
25% 189145.50 0.00 0.00 1.125000e+05 270000.00 16524.00 238500.00 0.01 -19682.00 -2760.00 -7479.50 -4299.00 5.00 1.0 1.00 0.0 1.00 0.00 0.00 2.00 2.00 2.00 10.00 0.00 0.00 0.00 0.00 0.00 0.00 0.33 0.39 0.37 0.06 0.04 0.98 0.69 0.01 0.00 0.07 0.17 0.08 0.02 0.05 0.05 0.00 0.00 0.05 0.04 0.98 0.70 0.01 0.00 0.07 0.17 0.08 0.02 0.05 0.04 0.00 0.00 0.06 0.04 0.98 0.69 0.01 0.00 0.07 0.17 0.08 0.02 0.05 0.05 0.00 0.00 0.04 0.00 0.00 0.00 0.00 -1570.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.0 0.00 0.0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
50% 278202.00 0.00 0.00 1.471500e+05 513531.00 24903.00 450000.00 0.02 -15750.00 -1213.00 -4504.00 -3254.00 9.00 1.0 1.00 0.0 1.00 0.00 0.00 2.00 2.00 2.00 12.00 0.00 0.00 0.00 0.00 0.00 0.00 0.51 0.57 0.54 0.09 0.08 0.98 0.76 0.02 0.00 0.14 0.17 0.21 0.05 0.08 0.07 0.00 0.00 0.08 0.07 0.98 0.76 0.02 0.00 0.14 0.17 0.21 0.05 0.08 0.07 0.00 0.00 0.09 0.08 0.98 0.76 0.02 0.00 0.14 0.17 0.21 0.05 0.08 0.07 0.00 0.00 0.07 0.00 0.00 0.00 0.00 -757.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.0 0.00 0.0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00
75% 367142.50 0.00 1.00 2.025000e+05 808650.00 34596.00 679500.00 0.03 -12413.00 -289.00 -2010.00 -1720.00 15.00 1.0 1.00 0.0 1.00 1.00 0.00 3.00 2.00 2.00 14.00 0.00 0.00 0.00 0.00 0.00 0.00 0.68 0.66 0.67 0.15 0.11 0.99 0.82 0.05 0.12 0.21 0.33 0.38 0.09 0.12 0.13 0.00 0.03 0.14 0.11 0.99 0.82 0.05 0.12 0.21 0.33 0.38 0.08 0.13 0.13 0.00 0.02 0.15 0.11 0.99 0.83 0.05 0.12 0.21 0.33 0.38 0.09 0.12 0.13 0.00 0.03 0.13 2.00 0.00 2.00 0.00 -274.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.0 0.00 0.0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 3.00
max 456255.00 1.00 19.00 1.170000e+08 4050000.00 258025.50 4050000.00 0.07 -7489.00 365243.00 0.00 0.00 91.00 1.0 1.00 1.0 1.00 1.00 1.00 20.00 3.00 3.00 23.00 1.00 1.00 1.00 1.00 1.00 1.00 0.96 0.85 0.90 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 348.00 34.00 344.00 24.00 0.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.0 1.00 1.0 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 4.00 9.00 8.00 27.00 261.00 25.00
None
In [ ]:
display_feature_info(datasets['POS_CASH_balance'], 'POS_CASH_balance') 
Description of the df continued for POS_CASH_balance:

---------------------------------------------------------------------------
Data type value counts: 
 int64      5
float64    2
object     1
dtype: int64

Return number of unique elements in the object. 

NAME_CONTRACT_STATUS    9
dtype: int64
---------------------------------------------------------------------------
Categorical and Numerical(int + float) features  of POS_CASH_balance.
---------------------------------------------------------------------------

{'int64': Index(['SK_ID_PREV', 'SK_ID_CURR', 'MONTHS_BALANCE', 'SK_DPD', 'SK_DPD_DEF'], dtype='object')}
------------------------------
{'float64': Index(['CNT_INSTALMENT', 'CNT_INSTALMENT_FUTURE'], dtype='object')}
------------------------------
{'object': Index(['NAME_CONTRACT_STATUS'], dtype='object')}
------------------------------

 

---------------------------------------------------------------------------
---------------------------------------------------------------------------

 The Missing Data: 

Percent Train Missing Count
CNT_INSTALMENT_FUTURE 0.26 26087
CNT_INSTALMENT 0.26 26071
---------------------------------------------------------------------------

Summary of application_test

In [ ]:
display_stats(datasets['application_test'], 'application_test') 
--------------------------------------------------------------------------------
                    application_test                    
--------------------------------------------------------------------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48744 entries, 0 to 48743
Data columns (total 121 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   SK_ID_CURR                    48744 non-null  int64  
 1   NAME_CONTRACT_TYPE            48744 non-null  object 
 2   CODE_GENDER                   48744 non-null  object 
 3   FLAG_OWN_CAR                  48744 non-null  object 
 4   FLAG_OWN_REALTY               48744 non-null  object 
 5   CNT_CHILDREN                  48744 non-null  int64  
 6   AMT_INCOME_TOTAL              48744 non-null  float64
 7   AMT_CREDIT                    48744 non-null  float64
 8   AMT_ANNUITY                   48720 non-null  float64
 9   AMT_GOODS_PRICE               48744 non-null  float64
 10  NAME_TYPE_SUITE               47833 non-null  object 
 11  NAME_INCOME_TYPE              48744 non-null  object 
 12  NAME_EDUCATION_TYPE           48744 non-null  object 
 13  NAME_FAMILY_STATUS            48744 non-null  object 
 14  NAME_HOUSING_TYPE             48744 non-null  object 
 15  REGION_POPULATION_RELATIVE    48744 non-null  float64
 16  DAYS_BIRTH                    48744 non-null  int64  
 17  DAYS_EMPLOYED                 48744 non-null  int64  
 18  DAYS_REGISTRATION             48744 non-null  float64
 19  DAYS_ID_PUBLISH               48744 non-null  int64  
 20  OWN_CAR_AGE                   16432 non-null  float64
 21  FLAG_MOBIL                    48744 non-null  int64  
 22  FLAG_EMP_PHONE                48744 non-null  int64  
 23  FLAG_WORK_PHONE               48744 non-null  int64  
 24  FLAG_CONT_MOBILE              48744 non-null  int64  
 25  FLAG_PHONE                    48744 non-null  int64  
 26  FLAG_EMAIL                    48744 non-null  int64  
 27  OCCUPATION_TYPE               33139 non-null  object 
 28  CNT_FAM_MEMBERS               48744 non-null  float64
 29  REGION_RATING_CLIENT          48744 non-null  int64  
 30  REGION_RATING_CLIENT_W_CITY   48744 non-null  int64  
 31  WEEKDAY_APPR_PROCESS_START    48744 non-null  object 
 32  HOUR_APPR_PROCESS_START       48744 non-null  int64  
 33  REG_REGION_NOT_LIVE_REGION    48744 non-null  int64  
 34  REG_REGION_NOT_WORK_REGION    48744 non-null  int64  
 35  LIVE_REGION_NOT_WORK_REGION   48744 non-null  int64  
 36  REG_CITY_NOT_LIVE_CITY        48744 non-null  int64  
 37  REG_CITY_NOT_WORK_CITY        48744 non-null  int64  
 38  LIVE_CITY_NOT_WORK_CITY       48744 non-null  int64  
 39  ORGANIZATION_TYPE             48744 non-null  object 
 40  EXT_SOURCE_1                  28212 non-null  float64
 41  EXT_SOURCE_2                  48736 non-null  float64
 42  EXT_SOURCE_3                  40076 non-null  float64
 43  APARTMENTS_AVG                24857 non-null  float64
 44  BASEMENTAREA_AVG              21103 non-null  float64
 45  YEARS_BEGINEXPLUATATION_AVG   25888 non-null  float64
 46  YEARS_BUILD_AVG               16926 non-null  float64
 47  COMMONAREA_AVG                15249 non-null  float64
 48  ELEVATORS_AVG                 23555 non-null  float64
 49  ENTRANCES_AVG                 25165 non-null  float64
 50  FLOORSMAX_AVG                 25423 non-null  float64
 51  FLOORSMIN_AVG                 16278 non-null  float64
 52  LANDAREA_AVG                  20490 non-null  float64
 53  LIVINGAPARTMENTS_AVG          15964 non-null  float64
 54  LIVINGAREA_AVG                25192 non-null  float64
 55  NONLIVINGAPARTMENTS_AVG       15397 non-null  float64
 56  NONLIVINGAREA_AVG             22660 non-null  float64
 57  APARTMENTS_MODE               24857 non-null  float64
 58  BASEMENTAREA_MODE             21103 non-null  float64
 59  YEARS_BEGINEXPLUATATION_MODE  25888 non-null  float64
 60  YEARS_BUILD_MODE              16926 non-null  float64
 61  COMMONAREA_MODE               15249 non-null  float64
 62  ELEVATORS_MODE                23555 non-null  float64
 63  ENTRANCES_MODE                25165 non-null  float64
 64  FLOORSMAX_MODE                25423 non-null  float64
 65  FLOORSMIN_MODE                16278 non-null  float64
 66  LANDAREA_MODE                 20490 non-null  float64
 67  LIVINGAPARTMENTS_MODE         15964 non-null  float64
 68  LIVINGAREA_MODE               25192 non-null  float64
 69  NONLIVINGAPARTMENTS_MODE      15397 non-null  float64
 70  NONLIVINGAREA_MODE            22660 non-null  float64
 71  APARTMENTS_MEDI               24857 non-null  float64
 72  BASEMENTAREA_MEDI             21103 non-null  float64
 73  YEARS_BEGINEXPLUATATION_MEDI  25888 non-null  float64
 74  YEARS_BUILD_MEDI              16926 non-null  float64
 75  COMMONAREA_MEDI               15249 non-null  float64
 76  ELEVATORS_MEDI                23555 non-null  float64
 77  ENTRANCES_MEDI                25165 non-null  float64
 78  FLOORSMAX_MEDI                25423 non-null  float64
 79  FLOORSMIN_MEDI                16278 non-null  float64
 80  LANDAREA_MEDI                 20490 non-null  float64
 81  LIVINGAPARTMENTS_MEDI         15964 non-null  float64
 82  LIVINGAREA_MEDI               25192 non-null  float64
 83  NONLIVINGAPARTMENTS_MEDI      15397 non-null  float64
 84  NONLIVINGAREA_MEDI            22660 non-null  float64
 85  FONDKAPREMONT_MODE            15947 non-null  object 
 86  HOUSETYPE_MODE                25125 non-null  object 
 87  TOTALAREA_MODE                26120 non-null  float64
 88  WALLSMATERIAL_MODE            24851 non-null  object 
 89  EMERGENCYSTATE_MODE           26535 non-null  object 
 90  OBS_30_CNT_SOCIAL_CIRCLE      48715 non-null  float64
 91  DEF_30_CNT_SOCIAL_CIRCLE      48715 non-null  float64
 92  OBS_60_CNT_SOCIAL_CIRCLE      48715 non-null  float64
 93  DEF_60_CNT_SOCIAL_CIRCLE      48715 non-null  float64
 94  DAYS_LAST_PHONE_CHANGE        48744 non-null  float64
 95  FLAG_DOCUMENT_2               48744 non-null  int64  
 96  FLAG_DOCUMENT_3               48744 non-null  int64  
 97  FLAG_DOCUMENT_4               48744 non-null  int64  
 98  FLAG_DOCUMENT_5               48744 non-null  int64  
 99  FLAG_DOCUMENT_6               48744 non-null  int64  
 100 FLAG_DOCUMENT_7               48744 non-null  int64  
 101 FLAG_DOCUMENT_8               48744 non-null  int64  
 102 FLAG_DOCUMENT_9               48744 non-null  int64  
 103 FLAG_DOCUMENT_10              48744 non-null  int64  
 104 FLAG_DOCUMENT_11              48744 non-null  int64  
 105 FLAG_DOCUMENT_12              48744 non-null  int64  
 106 FLAG_DOCUMENT_13              48744 non-null  int64  
 107 FLAG_DOCUMENT_14              48744 non-null  int64  
 108 FLAG_DOCUMENT_15              48744 non-null  int64  
 109 FLAG_DOCUMENT_16              48744 non-null  int64  
 110 FLAG_DOCUMENT_17              48744 non-null  int64  
 111 FLAG_DOCUMENT_18              48744 non-null  int64  
 112 FLAG_DOCUMENT_19              48744 non-null  int64  
 113 FLAG_DOCUMENT_20              48744 non-null  int64  
 114 FLAG_DOCUMENT_21              48744 non-null  int64  
 115 AMT_REQ_CREDIT_BUREAU_HOUR    42695 non-null  float64
 116 AMT_REQ_CREDIT_BUREAU_DAY     42695 non-null  float64
 117 AMT_REQ_CREDIT_BUREAU_WEEK    42695 non-null  float64
 118 AMT_REQ_CREDIT_BUREAU_MON     42695 non-null  float64
 119 AMT_REQ_CREDIT_BUREAU_QRT     42695 non-null  float64
 120 AMT_REQ_CREDIT_BUREAU_YEAR    42695 non-null  float64
dtypes: float64(65), int64(40), object(16)
memory usage: 45.0+ MB
None
---------------------------------------------------------------------------
Shape of the df application_test is (48744, 121) 

---------------------------------------------------------------------------
Statistical summary of application_test is :
---------------------------------------------------------------------------
Description of the df application_test:

SK_ID_CURR TARGET CNT_CHILDREN AMT_INCOME_TOTAL AMT_CREDIT AMT_ANNUITY AMT_GOODS_PRICE REGION_POPULATION_RELATIVE DAYS_BIRTH DAYS_EMPLOYED DAYS_REGISTRATION DAYS_ID_PUBLISH OWN_CAR_AGE FLAG_MOBIL FLAG_EMP_PHONE FLAG_WORK_PHONE FLAG_CONT_MOBILE FLAG_PHONE FLAG_EMAIL CNT_FAM_MEMBERS REGION_RATING_CLIENT REGION_RATING_CLIENT_W_CITY HOUR_APPR_PROCESS_START REG_REGION_NOT_LIVE_REGION REG_REGION_NOT_WORK_REGION LIVE_REGION_NOT_WORK_REGION REG_CITY_NOT_LIVE_CITY REG_CITY_NOT_WORK_CITY LIVE_CITY_NOT_WORK_CITY EXT_SOURCE_1 EXT_SOURCE_2 EXT_SOURCE_3 APARTMENTS_AVG BASEMENTAREA_AVG YEARS_BEGINEXPLUATATION_AVG YEARS_BUILD_AVG COMMONAREA_AVG ELEVATORS_AVG ENTRANCES_AVG FLOORSMAX_AVG FLOORSMIN_AVG LANDAREA_AVG LIVINGAPARTMENTS_AVG LIVINGAREA_AVG NONLIVINGAPARTMENTS_AVG NONLIVINGAREA_AVG APARTMENTS_MODE BASEMENTAREA_MODE YEARS_BEGINEXPLUATATION_MODE YEARS_BUILD_MODE COMMONAREA_MODE ELEVATORS_MODE ENTRANCES_MODE FLOORSMAX_MODE FLOORSMIN_MODE LANDAREA_MODE LIVINGAPARTMENTS_MODE LIVINGAREA_MODE NONLIVINGAPARTMENTS_MODE NONLIVINGAREA_MODE APARTMENTS_MEDI BASEMENTAREA_MEDI YEARS_BEGINEXPLUATATION_MEDI YEARS_BUILD_MEDI COMMONAREA_MEDI ELEVATORS_MEDI ENTRANCES_MEDI FLOORSMAX_MEDI FLOORSMIN_MEDI LANDAREA_MEDI LIVINGAPARTMENTS_MEDI LIVINGAREA_MEDI NONLIVINGAPARTMENTS_MEDI NONLIVINGAREA_MEDI TOTALAREA_MODE OBS_30_CNT_SOCIAL_CIRCLE DEF_30_CNT_SOCIAL_CIRCLE OBS_60_CNT_SOCIAL_CIRCLE DEF_60_CNT_SOCIAL_CIRCLE DAYS_LAST_PHONE_CHANGE FLAG_DOCUMENT_2 FLAG_DOCUMENT_3 FLAG_DOCUMENT_4 FLAG_DOCUMENT_5 FLAG_DOCUMENT_6 FLAG_DOCUMENT_7 FLAG_DOCUMENT_8 FLAG_DOCUMENT_9 FLAG_DOCUMENT_10 FLAG_DOCUMENT_11 FLAG_DOCUMENT_12 FLAG_DOCUMENT_13 FLAG_DOCUMENT_14 FLAG_DOCUMENT_15 FLAG_DOCUMENT_16 FLAG_DOCUMENT_17 FLAG_DOCUMENT_18 FLAG_DOCUMENT_19 FLAG_DOCUMENT_20 FLAG_DOCUMENT_21 AMT_REQ_CREDIT_BUREAU_HOUR AMT_REQ_CREDIT_BUREAU_DAY AMT_REQ_CREDIT_BUREAU_WEEK AMT_REQ_CREDIT_BUREAU_MON AMT_REQ_CREDIT_BUREAU_QRT AMT_REQ_CREDIT_BUREAU_YEAR
count 307511.00 307511.00 307511.00 3.075110e+05 307511.00 307499.00 307233.00 307511.00 307511.00 307511.00 307511.00 307511.00 104582.00 307511.0 307511.00 307511.0 307511.00 307511.00 307511.00 307509.00 307511.00 307511.00 307511.00 307511.00 307511.00 307511.00 307511.00 307511.00 307511.00 134133.00 306851.00 246546.00 151450.00 127568.00 157504.00 103023.00 92646.00 143620.00 152683.00 154491.00 98869.00 124921.00 97312.00 153161.00 93997.00 137829.00 151450.00 127568.00 157504.00 103023.00 92646.00 143620.00 152683.00 154491.00 98869.00 124921.00 97312.00 153161.00 93997.00 137829.00 151450.00 127568.00 157504.00 103023.00 92646.00 143620.00 152683.00 154491.00 98869.00 124921.00 97312.00 153161.00 93997.00 137829.00 159080.00 306490.00 306490.00 306490.00 306490.00 307510.00 307511.00 307511.00 307511.00 307511.00 307511.00 307511.00 307511.00 307511.00 307511.0 307511.00 307511.0 307511.00 307511.00 307511.00 307511.00 307511.00 307511.00 307511.00 307511.00 307511.00 265992.00 265992.00 265992.00 265992.00 265992.00 265992.00
mean 278180.52 0.08 0.42 1.687979e+05 599026.00 27108.57 538396.21 0.02 -16037.00 63815.05 -4986.12 -2994.20 12.06 1.0 0.82 0.2 1.00 0.28 0.06 2.15 2.05 2.03 12.06 0.02 0.05 0.04 0.08 0.23 0.18 0.50 0.51 0.51 0.12 0.09 0.98 0.75 0.04 0.08 0.15 0.23 0.23 0.07 0.10 0.11 0.01 0.03 0.11 0.09 0.98 0.76 0.04 0.07 0.15 0.22 0.23 0.06 0.11 0.11 0.01 0.03 0.12 0.09 0.98 0.76 0.04 0.08 0.15 0.23 0.23 0.07 0.10 0.11 0.01 0.03 0.10 1.42 0.14 1.41 0.10 -962.86 0.00 0.71 0.00 0.02 0.09 0.00 0.08 0.00 0.0 0.00 0.0 0.00 0.00 0.00 0.01 0.00 0.01 0.00 0.00 0.00 0.01 0.01 0.03 0.27 0.27 1.90
std 102790.18 0.27 0.72 2.371231e+05 402490.78 14493.74 369446.46 0.01 4363.99 141275.77 3522.89 1509.45 11.94 0.0 0.38 0.4 0.04 0.45 0.23 0.91 0.51 0.50 3.27 0.12 0.22 0.20 0.27 0.42 0.38 0.21 0.19 0.19 0.11 0.08 0.06 0.11 0.08 0.13 0.10 0.14 0.16 0.08 0.09 0.11 0.05 0.07 0.11 0.08 0.06 0.11 0.07 0.13 0.10 0.14 0.16 0.08 0.10 0.11 0.05 0.07 0.11 0.08 0.06 0.11 0.08 0.13 0.10 0.15 0.16 0.08 0.09 0.11 0.05 0.07 0.11 2.40 0.45 2.38 0.36 826.81 0.01 0.45 0.01 0.12 0.28 0.01 0.27 0.06 0.0 0.06 0.0 0.06 0.05 0.03 0.10 0.02 0.09 0.02 0.02 0.02 0.08 0.11 0.20 0.92 0.79 1.87
min 100002.00 0.00 0.00 2.565000e+04 45000.00 1615.50 40500.00 0.00 -25229.00 -17912.00 -24672.00 -7197.00 0.00 0.0 0.00 0.0 0.00 0.00 0.00 1.00 1.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 -4292.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.0 0.00 0.0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
25% 189145.50 0.00 0.00 1.125000e+05 270000.00 16524.00 238500.00 0.01 -19682.00 -2760.00 -7479.50 -4299.00 5.00 1.0 1.00 0.0 1.00 0.00 0.00 2.00 2.00 2.00 10.00 0.00 0.00 0.00 0.00 0.00 0.00 0.33 0.39 0.37 0.06 0.04 0.98 0.69 0.01 0.00 0.07 0.17 0.08 0.02 0.05 0.05 0.00 0.00 0.05 0.04 0.98 0.70 0.01 0.00 0.07 0.17 0.08 0.02 0.05 0.04 0.00 0.00 0.06 0.04 0.98 0.69 0.01 0.00 0.07 0.17 0.08 0.02 0.05 0.05 0.00 0.00 0.04 0.00 0.00 0.00 0.00 -1570.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.0 0.00 0.0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
50% 278202.00 0.00 0.00 1.471500e+05 513531.00 24903.00 450000.00 0.02 -15750.00 -1213.00 -4504.00 -3254.00 9.00 1.0 1.00 0.0 1.00 0.00 0.00 2.00 2.00 2.00 12.00 0.00 0.00 0.00 0.00 0.00 0.00 0.51 0.57 0.54 0.09 0.08 0.98 0.76 0.02 0.00 0.14 0.17 0.21 0.05 0.08 0.07 0.00 0.00 0.08 0.07 0.98 0.76 0.02 0.00 0.14 0.17 0.21 0.05 0.08 0.07 0.00 0.00 0.09 0.08 0.98 0.76 0.02 0.00 0.14 0.17 0.21 0.05 0.08 0.07 0.00 0.00 0.07 0.00 0.00 0.00 0.00 -757.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.0 0.00 0.0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00
75% 367142.50 0.00 1.00 2.025000e+05 808650.00 34596.00 679500.00 0.03 -12413.00 -289.00 -2010.00 -1720.00 15.00 1.0 1.00 0.0 1.00 1.00 0.00 3.00 2.00 2.00 14.00 0.00 0.00 0.00 0.00 0.00 0.00 0.68 0.66 0.67 0.15 0.11 0.99 0.82 0.05 0.12 0.21 0.33 0.38 0.09 0.12 0.13 0.00 0.03 0.14 0.11 0.99 0.82 0.05 0.12 0.21 0.33 0.38 0.08 0.13 0.13 0.00 0.02 0.15 0.11 0.99 0.83 0.05 0.12 0.21 0.33 0.38 0.09 0.12 0.13 0.00 0.03 0.13 2.00 0.00 2.00 0.00 -274.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.0 0.00 0.0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 3.00
max 456255.00 1.00 19.00 1.170000e+08 4050000.00 258025.50 4050000.00 0.07 -7489.00 365243.00 0.00 0.00 91.00 1.0 1.00 1.0 1.00 1.00 1.00 20.00 3.00 3.00 23.00 1.00 1.00 1.00 1.00 1.00 1.00 0.96 0.85 0.90 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 348.00 34.00 344.00 24.00 0.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.0 1.00 1.0 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 4.00 9.00 8.00 27.00 261.00 25.00
None
In [ ]:
display_feature_info(datasets['application_test'], 'application_test') 
Description of the df continued for application_test:

---------------------------------------------------------------------------
Data type value counts: 
 float64    65
int64      40
object     16
dtype: int64

Return number of unique elements in the object. 

NAME_CONTRACT_TYPE             2
CODE_GENDER                    2
FLAG_OWN_CAR                   2
FLAG_OWN_REALTY                2
NAME_TYPE_SUITE                7
NAME_INCOME_TYPE               7
NAME_EDUCATION_TYPE            5
NAME_FAMILY_STATUS             5
NAME_HOUSING_TYPE              6
OCCUPATION_TYPE               18
WEEKDAY_APPR_PROCESS_START     7
ORGANIZATION_TYPE             58
FONDKAPREMONT_MODE             4
HOUSETYPE_MODE                 3
WALLSMATERIAL_MODE             7
EMERGENCYSTATE_MODE            2
dtype: int64
---------------------------------------------------------------------------
Categorical and Numerical(int + float) features  of application_test.
---------------------------------------------------------------------------

{'int64': Index(['SK_ID_CURR', 'CNT_CHILDREN', 'DAYS_BIRTH', 'DAYS_EMPLOYED',
       'DAYS_ID_PUBLISH', 'FLAG_MOBIL', 'FLAG_EMP_PHONE', 'FLAG_WORK_PHONE',
       'FLAG_CONT_MOBILE', 'FLAG_PHONE', 'FLAG_EMAIL', 'REGION_RATING_CLIENT',
       'REGION_RATING_CLIENT_W_CITY', 'HOUR_APPR_PROCESS_START',
       'REG_REGION_NOT_LIVE_REGION', 'REG_REGION_NOT_WORK_REGION',
       'LIVE_REGION_NOT_WORK_REGION', 'REG_CITY_NOT_LIVE_CITY',
       'REG_CITY_NOT_WORK_CITY', 'LIVE_CITY_NOT_WORK_CITY', 'FLAG_DOCUMENT_2',
       'FLAG_DOCUMENT_3', 'FLAG_DOCUMENT_4', 'FLAG_DOCUMENT_5',
       'FLAG_DOCUMENT_6', 'FLAG_DOCUMENT_7', 'FLAG_DOCUMENT_8',
       'FLAG_DOCUMENT_9', 'FLAG_DOCUMENT_10', 'FLAG_DOCUMENT_11',
       'FLAG_DOCUMENT_12', 'FLAG_DOCUMENT_13', 'FLAG_DOCUMENT_14',
       'FLAG_DOCUMENT_15', 'FLAG_DOCUMENT_16', 'FLAG_DOCUMENT_17',
       'FLAG_DOCUMENT_18', 'FLAG_DOCUMENT_19', 'FLAG_DOCUMENT_20',
       'FLAG_DOCUMENT_21'],
      dtype='object')}
------------------------------
{'float64': Index(['AMT_INCOME_TOTAL', 'AMT_CREDIT', 'AMT_ANNUITY', 'AMT_GOODS_PRICE',
       'REGION_POPULATION_RELATIVE', 'DAYS_REGISTRATION', 'OWN_CAR_AGE',
       'CNT_FAM_MEMBERS', 'EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3',
       'APARTMENTS_AVG', 'BASEMENTAREA_AVG', 'YEARS_BEGINEXPLUATATION_AVG',
       'YEARS_BUILD_AVG', 'COMMONAREA_AVG', 'ELEVATORS_AVG', 'ENTRANCES_AVG',
       'FLOORSMAX_AVG', 'FLOORSMIN_AVG', 'LANDAREA_AVG',
       'LIVINGAPARTMENTS_AVG', 'LIVINGAREA_AVG', 'NONLIVINGAPARTMENTS_AVG',
       'NONLIVINGAREA_AVG', 'APARTMENTS_MODE', 'BASEMENTAREA_MODE',
       'YEARS_BEGINEXPLUATATION_MODE', 'YEARS_BUILD_MODE', 'COMMONAREA_MODE',
       'ELEVATORS_MODE', 'ENTRANCES_MODE', 'FLOORSMAX_MODE', 'FLOORSMIN_MODE',
       'LANDAREA_MODE', 'LIVINGAPARTMENTS_MODE', 'LIVINGAREA_MODE',
       'NONLIVINGAPARTMENTS_MODE', 'NONLIVINGAREA_MODE', 'APARTMENTS_MEDI',
       'BASEMENTAREA_MEDI', 'YEARS_BEGINEXPLUATATION_MEDI', 'YEARS_BUILD_MEDI',
       'COMMONAREA_MEDI', 'ELEVATORS_MEDI', 'ENTRANCES_MEDI', 'FLOORSMAX_MEDI',
       'FLOORSMIN_MEDI', 'LANDAREA_MEDI', 'LIVINGAPARTMENTS_MEDI',
       'LIVINGAREA_MEDI', 'NONLIVINGAPARTMENTS_MEDI', 'NONLIVINGAREA_MEDI',
       'TOTALAREA_MODE', 'OBS_30_CNT_SOCIAL_CIRCLE',
       'DEF_30_CNT_SOCIAL_CIRCLE', 'OBS_60_CNT_SOCIAL_CIRCLE',
       'DEF_60_CNT_SOCIAL_CIRCLE', 'DAYS_LAST_PHONE_CHANGE',
       'AMT_REQ_CREDIT_BUREAU_HOUR', 'AMT_REQ_CREDIT_BUREAU_DAY',
       'AMT_REQ_CREDIT_BUREAU_WEEK', 'AMT_REQ_CREDIT_BUREAU_MON',
       'AMT_REQ_CREDIT_BUREAU_QRT', 'AMT_REQ_CREDIT_BUREAU_YEAR'],
      dtype='object')}
------------------------------
{'object': Index(['NAME_CONTRACT_TYPE', 'CODE_GENDER', 'FLAG_OWN_CAR', 'FLAG_OWN_REALTY',
       'NAME_TYPE_SUITE', 'NAME_INCOME_TYPE', 'NAME_EDUCATION_TYPE',
       'NAME_FAMILY_STATUS', 'NAME_HOUSING_TYPE', 'OCCUPATION_TYPE',
       'WEEKDAY_APPR_PROCESS_START', 'ORGANIZATION_TYPE', 'FONDKAPREMONT_MODE',
       'HOUSETYPE_MODE', 'WALLSMATERIAL_MODE', 'EMERGENCYSTATE_MODE'],
      dtype='object')}
------------------------------

 

---------------------------------------------------------------------------
---------------------------------------------------------------------------

 The Missing Data: 

Percent Train Missing Count
COMMONAREA_MEDI 68.72 33495
COMMONAREA_AVG 68.72 33495
COMMONAREA_MODE 68.72 33495
NONLIVINGAPARTMENTS_MODE 68.41 33347
NONLIVINGAPARTMENTS_MEDI 68.41 33347
NONLIVINGAPARTMENTS_AVG 68.41 33347
FONDKAPREMONT_MODE 67.28 32797
LIVINGAPARTMENTS_AVG 67.25 32780
LIVINGAPARTMENTS_MEDI 67.25 32780
LIVINGAPARTMENTS_MODE 67.25 32780
FLOORSMIN_MEDI 66.61 32466
FLOORSMIN_MODE 66.61 32466
FLOORSMIN_AVG 66.61 32466
OWN_CAR_AGE 66.29 32312
YEARS_BUILD_MEDI 65.28 31818
YEARS_BUILD_MODE 65.28 31818
YEARS_BUILD_AVG 65.28 31818
LANDAREA_AVG 57.96 28254
LANDAREA_MODE 57.96 28254
LANDAREA_MEDI 57.96 28254
BASEMENTAREA_AVG 56.71 27641
BASEMENTAREA_MODE 56.71 27641
BASEMENTAREA_MEDI 56.71 27641
NONLIVINGAREA_MODE 53.51 26084
NONLIVINGAREA_AVG 53.51 26084
NONLIVINGAREA_MEDI 53.51 26084
ELEVATORS_AVG 51.68 25189
ELEVATORS_MEDI 51.68 25189
ELEVATORS_MODE 51.68 25189
WALLSMATERIAL_MODE 49.02 23893
APARTMENTS_AVG 49.01 23887
APARTMENTS_MEDI 49.01 23887
APARTMENTS_MODE 49.01 23887
HOUSETYPE_MODE 48.46 23619
ENTRANCES_MEDI 48.37 23579
ENTRANCES_AVG 48.37 23579
ENTRANCES_MODE 48.37 23579
LIVINGAREA_MEDI 48.32 23552
LIVINGAREA_MODE 48.32 23552
LIVINGAREA_AVG 48.32 23552
FLOORSMAX_MODE 47.84 23321
FLOORSMAX_MEDI 47.84 23321
FLOORSMAX_AVG 47.84 23321
YEARS_BEGINEXPLUATATION_AVG 46.89 22856
YEARS_BEGINEXPLUATATION_MEDI 46.89 22856
YEARS_BEGINEXPLUATATION_MODE 46.89 22856
TOTALAREA_MODE 46.41 22624
EMERGENCYSTATE_MODE 45.56 22209
EXT_SOURCE_1 42.12 20532
OCCUPATION_TYPE 32.01 15605
EXT_SOURCE_3 17.78 8668
AMT_REQ_CREDIT_BUREAU_QRT 12.41 6049
AMT_REQ_CREDIT_BUREAU_YEAR 12.41 6049
AMT_REQ_CREDIT_BUREAU_MON 12.41 6049
AMT_REQ_CREDIT_BUREAU_WEEK 12.41 6049
AMT_REQ_CREDIT_BUREAU_HOUR 12.41 6049
AMT_REQ_CREDIT_BUREAU_DAY 12.41 6049
NAME_TYPE_SUITE 1.87 911
OBS_30_CNT_SOCIAL_CIRCLE 0.06 29
OBS_60_CNT_SOCIAL_CIRCLE 0.06 29
DEF_60_CNT_SOCIAL_CIRCLE 0.06 29
DEF_30_CNT_SOCIAL_CIRCLE 0.06 29
AMT_ANNUITY 0.05 24
EXT_SOURCE_2 0.02 8
---------------------------------------------------------------------------

Correlation Analysis

The top 20 correlated features (positive and negative) for application train datset are listed below.

In [ ]:
correlations = datasets["application_train"].corr()['TARGET'].sort_values()
print('Most Positive Correlations:\n', correlations.tail(10))
print('\nMost Negative Correlations:\n', correlations.head(10)) 
Most Positive Correlations:
 FLAG_DOCUMENT_3                0.044346
REG_CITY_NOT_LIVE_CITY         0.044395
FLAG_EMP_PHONE                 0.045982
REG_CITY_NOT_WORK_CITY         0.050994
DAYS_ID_PUBLISH                0.051457
DAYS_LAST_PHONE_CHANGE         0.055218
REGION_RATING_CLIENT           0.058899
REGION_RATING_CLIENT_W_CITY    0.060893
DAYS_BIRTH                     0.078239
TARGET                         1.000000
Name: TARGET, dtype: float64

Most Negative Correlations:
 EXT_SOURCE_3                 -0.178919
EXT_SOURCE_2                 -0.160472
EXT_SOURCE_1                 -0.155317
DAYS_EMPLOYED                -0.044932
FLOORSMAX_AVG                -0.044003
FLOORSMAX_MEDI               -0.043768
FLOORSMAX_MODE               -0.043226
AMT_GOODS_PRICE              -0.039645
REGION_POPULATION_RELATIVE   -0.037227
ELEVATORS_AVG                -0.034199
Name: TARGET, dtype: float64
In [ ]:
num_attribs = ['TARGET', 'AMT_INCOME_TOTAL',  'AMT_CREDIT', 'DAYS_EMPLOYED',
               'DAYS_BIRTH', 'EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3', 'AMT_GOODS_PRICE']
df = datasets["application_train"].copy()
df2 = df[num_attribs]
corr = df2.corr()
corr.style.background_gradient(cmap='PuBu').set_precision(2) 
Out[ ]:
TARGET AMT_INCOME_TOTAL AMT_CREDIT DAYS_EMPLOYED DAYS_BIRTH EXT_SOURCE_1 EXT_SOURCE_2 EXT_SOURCE_3 AMT_GOODS_PRICE
TARGET 1.00 -0.00 -0.03 -0.04 0.08 -0.16 -0.16 -0.18 -0.04
AMT_INCOME_TOTAL -0.00 1.00 0.16 -0.06 0.03 0.03 0.06 -0.03 0.16
AMT_CREDIT -0.03 0.16 1.00 -0.07 -0.06 0.17 0.13 0.04 0.99
DAYS_EMPLOYED -0.04 -0.06 -0.07 1.00 -0.62 0.29 -0.02 0.11 -0.06
DAYS_BIRTH 0.08 0.03 -0.06 -0.62 1.00 -0.60 -0.09 -0.21 -0.05
EXT_SOURCE_1 -0.16 0.03 0.17 0.29 -0.60 1.00 0.21 0.19 0.18
EXT_SOURCE_2 -0.16 0.06 0.13 -0.02 -0.09 0.21 1.00 0.11 0.14
EXT_SOURCE_3 -0.18 -0.03 0.04 0.11 -0.21 0.19 0.11 1.00 0.05
AMT_GOODS_PRICE -0.04 0.16 0.99 -0.06 -0.05 0.18 0.14 0.05 1.00
In [ ]:
gc.collect()
Out[ ]:
67268

Correlation Plots

Observing Highly correlated features from all input datasets

The distribution of the top correlated features are plotted below

In [ ]:
var_neg_corr = correlations.head(10).index.values
numVar = var_neg_corr.shape[0]

plt.figure(figsize=(15,20))
for i,var in enumerate(var_neg_corr):    
    dflt_var = datasets["application_train"].loc[datasets["application_train"]['TARGET']==1,var]
    dflt_non_var = datasets["application_train"].loc[datasets["application_train"]['TARGET']==0,var]
    
    plt.subplot(numVar,4,i+1)
    datasets["application_train"][var].hist()
    plt.title(var, fontsize = 10)
    plt.tight_layout()
plt.show() 
In [ ]:
var_pos_corr = correlations.tail(10).index.values
numVar = var_pos_corr.shape[0]

plt.figure(figsize=(15,20))
for i,var in enumerate(var_pos_corr):    
    dflt_var = datasets["application_train"].loc[datasets["application_train"]['TARGET']==1,var]
    dflt_non_var = datasets["application_train"].loc[datasets["application_train"]['TARGET']==0,var]
    
    plt.subplot(numVar,4,i+1)
    datasets["application_train"][var].hist()
    plt.title(var, fontsize = 10)
    plt.tight_layout()
plt.show() 
In [ ]:
def correlation_files_target(df_name):
  A = datasets["application_train"].copy()
  B = datasets[df_name].copy()
  correlation_matrix =  pd.concat([A.TARGET, B], axis=1).corr().filter(B.columns).filter(A.columns, axis=0)
  del A
  del B
  return correlation_matrix
In [ ]:
df_name = "previous_application"
correlation_matrix = correlation_files_target(df_name)
print(f"Correlation of the {df_name} against the Target is :")
correlation_matrix.T.TARGET.sort_values(ascending= False)
Correlation of the previous_application against the Target is :
Out[ ]:
AMT_DOWN_PAYMENT             0.002496
CNT_PAYMENT                  0.002341
DAYS_LAST_DUE_1ST_VERSION    0.001908
AMT_CREDIT                   0.001833
AMT_APPLICATION              0.001689
AMT_GOODS_PRICE              0.001676
SK_ID_CURR                   0.001107
NFLAG_INSURED_ON_APPROVAL    0.000879
RATE_DOWN_PAYMENT            0.000850
RATE_INTEREST_PRIMARY        0.000542
SK_ID_PREV                   0.000362
DAYS_DECISION               -0.000482
AMT_ANNUITY                 -0.000492
DAYS_FIRST_DUE              -0.000943
SELLERPLACE_AREA            -0.000954
DAYS_TERMINATION            -0.001072
NFLAG_LAST_APPL_IN_DAY      -0.001256
DAYS_FIRST_DRAWING          -0.001293
DAYS_LAST_DUE               -0.001940
HOUR_APPR_PROCESS_START     -0.002285
RATE_INTEREST_PRIVILEGED    -0.026427
Name: TARGET, dtype: float64
In [ ]:
df_name = "bureau"
correlation_matrix = correlation_files_target(df_name)
print(f"Correlation of the {df_name} against the Target is :")
correlation_matrix.T.TARGET.sort_values(ascending= False)
Correlation of the bureau against the Target is :
Out[ ]:
DAYS_CREDIT_UPDATE        0.002159
DAYS_CREDIT_ENDDATE       0.002048
SK_ID_BUREAU              0.001550
DAYS_CREDIT               0.001443
AMT_CREDIT_SUM            0.000218
DAYS_ENDDATE_FACT         0.000203
AMT_ANNUITY               0.000189
AMT_CREDIT_MAX_OVERDUE   -0.000389
CNT_CREDIT_PROLONG       -0.000495
AMT_CREDIT_SUM_LIMIT     -0.000558
AMT_CREDIT_SUM_DEBT      -0.000946
SK_ID_CURR               -0.001070
AMT_CREDIT_SUM_OVERDUE   -0.001464
CREDIT_DAY_OVERDUE       -0.001815
Name: TARGET, dtype: float64
In [ ]:
df_name = "bureau_balance"
correlation_matrix = correlation_files_target(df_name)
print(f"Correlation of the {df_name} against the Target is :")
correlation_matrix.T.TARGET.sort_values(ascending= False)
Correlation of the bureau_balance against the Target is :
Out[ ]:
SK_ID_BUREAU      0.001223
MONTHS_BALANCE   -0.005262
Name: TARGET, dtype: float64
In [ ]:
df_name = "credit_card_balance"
correlation_matrix = correlation_files_target(df_name)
print(f"Correlation of the {df_name} against the Target is :")
correlation_matrix.T.TARGET.sort_values(ascending= False)
Correlation of the credit_card_balance against the Target is :
Out[ ]:
CNT_DRAWINGS_ATM_CURRENT      0.001908
AMT_DRAWINGS_ATM_CURRENT      0.001520
AMT_INST_MIN_REGULARITY       0.001435
SK_ID_CURR                    0.001086
AMT_CREDIT_LIMIT_ACTUAL       0.000515
AMT_BALANCE                   0.000448
SK_ID_PREV                    0.000446
AMT_RECIVABLE                 0.000412
AMT_TOTAL_RECEIVABLE          0.000407
AMT_RECEIVABLE_PRINCIPAL      0.000383
SK_DPD                        0.000092
SK_DPD_DEF                   -0.000201
CNT_INSTALMENT_MATURE_CUM    -0.000342
MONTHS_BALANCE               -0.000768
AMT_PAYMENT_CURRENT          -0.001129
AMT_PAYMENT_TOTAL_CURRENT    -0.001395
AMT_DRAWINGS_CURRENT         -0.001419
CNT_DRAWINGS_CURRENT         -0.001764
CNT_DRAWINGS_OTHER_CURRENT   -0.001833
CNT_DRAWINGS_POS_CURRENT     -0.002387
AMT_DRAWINGS_OTHER_CURRENT   -0.002672
AMT_DRAWINGS_POS_CURRENT     -0.003518
Name: TARGET, dtype: float64
In [ ]:
df_name = "installments_payments"
correlation_matrix = correlation_files_target(df_name)
print(f"Correlation of the {df_name} against the Target is :")
correlation_matrix.T.TARGET.sort_values(ascending= False)
Correlation of the installments_payments against the Target is :
Out[ ]:
SK_ID_PREV                0.002891
NUM_INSTALMENT_VERSION    0.002511
NUM_INSTALMENT_NUMBER     0.000626
SK_ID_CURR               -0.000781
AMT_PAYMENT              -0.003512
DAYS_INSTALMENT          -0.003955
AMT_INSTALMENT           -0.003972
DAYS_ENTRY_PAYMENT       -0.004046
Name: TARGET, dtype: float64
In [ ]:
df_name = "POS_CASH_balance"
correlation_matrix = correlation_files_target(df_name)
print(f"Correlation of the {df_name} against the Target is :")

correlation_matrix.T.TARGET.sort_values(ascending= False)
Correlation of the POS_CASH_balance against the Target is :
Out[ ]:
CNT_INSTALMENT_FUTURE    0.002811
MONTHS_BALANCE           0.002775
SK_ID_PREV               0.002164
CNT_INSTALMENT           0.001434
SK_DPD                   0.000050
SK_ID_CURR              -0.000136
SK_DPD_DEF              -0.001362
Name: TARGET, dtype: float64
In [ ]:
gc.collect()
Out[ ]:
52460

Distribution of the Datasets

In [ ]:
datasets['application_train']['TARGET'].value_counts()
Out[ ]:
0    282686
1     24825
Name: TARGET, dtype: int64
In [ ]:
fig, ax = plt.subplots(figsize =(10, 8))
plt.pie(datasets['application_train']['TARGET'].value_counts(), labels=['No Default', 'Default'],
        autopct='%1.2f%%',
        shadow=True, startangle=90)
my_circle=plt.Circle( (0,0), 0.7, color='white')
p=plt.gcf()
p.gca().add_artist(my_circle)
plt.axis('equal')
plt.suptitle("Distribution of Target",fontsize=30, ha='center')
ax.legend(bbox_to_anchor =(1, 0))
plt.legend(loc="upper left")
Out[ ]:
<matplotlib.legend.Legend at 0x7f673cd38710>

OHE Comparison

Categorical variables to OHE

In [ ]:
print(datasets['application_train'].select_dtypes('object').columns)
Index(['NAME_CONTRACT_TYPE', 'CODE_GENDER', 'FLAG_OWN_CAR', 'FLAG_OWN_REALTY',
       'NAME_TYPE_SUITE', 'NAME_INCOME_TYPE', 'NAME_EDUCATION_TYPE',
       'NAME_FAMILY_STATUS', 'NAME_HOUSING_TYPE', 'OCCUPATION_TYPE',
       'WEEKDAY_APPR_PROCESS_START', 'ORGANIZATION_TYPE', 'FONDKAPREMONT_MODE',
       'HOUSETYPE_MODE', 'WALLSMATERIAL_MODE', 'EMERGENCYSTATE_MODE'],
      dtype='object')
In [ ]:
# Number of unique classes in each object column
datasets['application_train'].select_dtypes('object').apply(pd.Series.nunique, axis = 0)
Out[ ]:
NAME_CONTRACT_TYPE             2
CODE_GENDER                    3
FLAG_OWN_CAR                   2
FLAG_OWN_REALTY                2
NAME_TYPE_SUITE                7
NAME_INCOME_TYPE               8
NAME_EDUCATION_TYPE            5
NAME_FAMILY_STATUS             6
NAME_HOUSING_TYPE              6
OCCUPATION_TYPE               18
WEEKDAY_APPR_PROCESS_START     7
ORGANIZATION_TYPE             58
FONDKAPREMONT_MODE             4
HOUSETYPE_MODE                 3
WALLSMATERIAL_MODE             7
EMERGENCYSTATE_MODE            2
dtype: int64
In [ ]:
# one-hot encoding of categorical variables
app_cat_train= copy.deepcopy(datasets['application_train'])
app_cat_train = pd.get_dummies(app_cat_train)
print(app_cat_train.shape)
print(app_cat_train.dtypes.value_counts())
display(app_cat_train.select_dtypes(include=['uint8']).head())
(307511, 246)
uint8      140
float64     65
int64       41
dtype: int64
NAME_CONTRACT_TYPE_Cash loans NAME_CONTRACT_TYPE_Revolving loans CODE_GENDER_F CODE_GENDER_M CODE_GENDER_XNA FLAG_OWN_CAR_N FLAG_OWN_CAR_Y FLAG_OWN_REALTY_N FLAG_OWN_REALTY_Y NAME_TYPE_SUITE_Children NAME_TYPE_SUITE_Family NAME_TYPE_SUITE_Group of people NAME_TYPE_SUITE_Other_A NAME_TYPE_SUITE_Other_B NAME_TYPE_SUITE_Spouse, partner NAME_TYPE_SUITE_Unaccompanied NAME_INCOME_TYPE_Businessman NAME_INCOME_TYPE_Commercial associate NAME_INCOME_TYPE_Maternity leave NAME_INCOME_TYPE_Pensioner NAME_INCOME_TYPE_State servant NAME_INCOME_TYPE_Student NAME_INCOME_TYPE_Unemployed NAME_INCOME_TYPE_Working NAME_EDUCATION_TYPE_Academic degree NAME_EDUCATION_TYPE_Higher education NAME_EDUCATION_TYPE_Incomplete higher NAME_EDUCATION_TYPE_Lower secondary NAME_EDUCATION_TYPE_Secondary / secondary special NAME_FAMILY_STATUS_Civil marriage NAME_FAMILY_STATUS_Married NAME_FAMILY_STATUS_Separated NAME_FAMILY_STATUS_Single / not married NAME_FAMILY_STATUS_Unknown NAME_FAMILY_STATUS_Widow NAME_HOUSING_TYPE_Co-op apartment NAME_HOUSING_TYPE_House / apartment NAME_HOUSING_TYPE_Municipal apartment NAME_HOUSING_TYPE_Office apartment NAME_HOUSING_TYPE_Rented apartment NAME_HOUSING_TYPE_With parents OCCUPATION_TYPE_Accountants OCCUPATION_TYPE_Cleaning staff OCCUPATION_TYPE_Cooking staff OCCUPATION_TYPE_Core staff OCCUPATION_TYPE_Drivers OCCUPATION_TYPE_HR staff OCCUPATION_TYPE_High skill tech staff OCCUPATION_TYPE_IT staff OCCUPATION_TYPE_Laborers OCCUPATION_TYPE_Low-skill Laborers OCCUPATION_TYPE_Managers OCCUPATION_TYPE_Medicine staff OCCUPATION_TYPE_Private service staff OCCUPATION_TYPE_Realty agents OCCUPATION_TYPE_Sales staff OCCUPATION_TYPE_Secretaries OCCUPATION_TYPE_Security staff OCCUPATION_TYPE_Waiters/barmen staff WEEKDAY_APPR_PROCESS_START_FRIDAY WEEKDAY_APPR_PROCESS_START_MONDAY WEEKDAY_APPR_PROCESS_START_SATURDAY WEEKDAY_APPR_PROCESS_START_SUNDAY WEEKDAY_APPR_PROCESS_START_THURSDAY WEEKDAY_APPR_PROCESS_START_TUESDAY WEEKDAY_APPR_PROCESS_START_WEDNESDAY ORGANIZATION_TYPE_Advertising ORGANIZATION_TYPE_Agriculture ORGANIZATION_TYPE_Bank ORGANIZATION_TYPE_Business Entity Type 1 ORGANIZATION_TYPE_Business Entity Type 2 ORGANIZATION_TYPE_Business Entity Type 3 ORGANIZATION_TYPE_Cleaning ORGANIZATION_TYPE_Construction ORGANIZATION_TYPE_Culture ORGANIZATION_TYPE_Electricity ORGANIZATION_TYPE_Emergency ORGANIZATION_TYPE_Government ORGANIZATION_TYPE_Hotel ORGANIZATION_TYPE_Housing ORGANIZATION_TYPE_Industry: type 1 ORGANIZATION_TYPE_Industry: type 10 ORGANIZATION_TYPE_Industry: type 11 ORGANIZATION_TYPE_Industry: type 12 ORGANIZATION_TYPE_Industry: type 13 ORGANIZATION_TYPE_Industry: type 2 ORGANIZATION_TYPE_Industry: type 3 ORGANIZATION_TYPE_Industry: type 4 ORGANIZATION_TYPE_Industry: type 5 ORGANIZATION_TYPE_Industry: type 6 ORGANIZATION_TYPE_Industry: type 7 ORGANIZATION_TYPE_Industry: type 8 ORGANIZATION_TYPE_Industry: type 9 ORGANIZATION_TYPE_Insurance ORGANIZATION_TYPE_Kindergarten ORGANIZATION_TYPE_Legal Services ORGANIZATION_TYPE_Medicine ORGANIZATION_TYPE_Military ORGANIZATION_TYPE_Mobile ORGANIZATION_TYPE_Other ORGANIZATION_TYPE_Police ORGANIZATION_TYPE_Postal ORGANIZATION_TYPE_Realtor ORGANIZATION_TYPE_Religion ORGANIZATION_TYPE_Restaurant ORGANIZATION_TYPE_School ORGANIZATION_TYPE_Security ORGANIZATION_TYPE_Security Ministries ORGANIZATION_TYPE_Self-employed ORGANIZATION_TYPE_Services ORGANIZATION_TYPE_Telecom ORGANIZATION_TYPE_Trade: type 1 ORGANIZATION_TYPE_Trade: type 2 ORGANIZATION_TYPE_Trade: type 3 ORGANIZATION_TYPE_Trade: type 4 ORGANIZATION_TYPE_Trade: type 5 ORGANIZATION_TYPE_Trade: type 6 ORGANIZATION_TYPE_Trade: type 7 ORGANIZATION_TYPE_Transport: type 1 ORGANIZATION_TYPE_Transport: type 2 ORGANIZATION_TYPE_Transport: type 3 ORGANIZATION_TYPE_Transport: type 4 ORGANIZATION_TYPE_University ORGANIZATION_TYPE_XNA FONDKAPREMONT_MODE_not specified FONDKAPREMONT_MODE_org spec account FONDKAPREMONT_MODE_reg oper account FONDKAPREMONT_MODE_reg oper spec account HOUSETYPE_MODE_block of flats HOUSETYPE_MODE_specific housing HOUSETYPE_MODE_terraced house WALLSMATERIAL_MODE_Block WALLSMATERIAL_MODE_Mixed WALLSMATERIAL_MODE_Monolithic WALLSMATERIAL_MODE_Others WALLSMATERIAL_MODE_Panel WALLSMATERIAL_MODE_Stone, brick WALLSMATERIAL_MODE_Wooden EMERGENCYSTATE_MODE_No EMERGENCYSTATE_MODE_Yes
0 1 0 0 1 0 1 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 1 0 1 0
1 1 0 1 0 0 1 0 1 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 1 0 0 0 0 0 0 1 0
2 0 1 0 1 0 0 1 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
3 1 0 1 0 0 1 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 1 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
4 1 0 0 1 0 1 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

Plotting the feature type distribution

In [ ]:
fig, ax = plt.subplots(figsize =(10, 8))
data = app_cat_train.dtypes.value_counts()
plt.pie(data, autopct='%1.2f%%',labels=['float64','int64','object'],
        shadow=True, startangle=90)
my_circle=plt.Circle( (0,0), 0.7, color='white')
p=plt.gcf()
p.gca().add_artist(my_circle)
plt.axis('equal')
# plt.setup(size = 8, weight ="bold")
# ax.set_title("Customizing pie chart",fontsize=30, ha='center')
plt.suptitle("Distribution of Features dtypes.",fontsize=30, ha='center')
ax.legend(bbox_to_anchor =(1, 0))
plt.legend(loc="upper left")
Out[ ]:
<matplotlib.legend.Legend at 0x7f672adb3b10>

Observation 5

  • Histograms used to plot the distribution of the highly correlated variables
In [ ]:
def cat_features_plot(datasets, df_name):
    df = copy.deepcopy(datasets[df_name])
    df['TARGET'].replace(0, "No Default", inplace=True)
    df['TARGET'].replace(1, "Default", inplace=True)

#     df.select_dtypes('object')
    categorical_col = []
    
    for col in df:
        if df[col].dtype == 'object':
            categorical_col.append(col)

    # print("The numerical olumns are: \n \n ",numerical_col)
    #print("The categorical columns are: \n \n ",categorical_col)

    # categorical_col = categorical_col[0:8]
    #print(int(len(categorical_col)))
    plot_x = int(len(categorical_col)/2)
    fig, ax = plt.subplots(plot_x, 2, figsize=(20, 50))
    #plt.subplots_adjust(left=None, bottom=None, right=None,
                        #top=None, wspace=None, hspace=0.45)

    num = 0
    for i in range(0, 8):
        for j in range(0,2):
            tst = sns.countplot(x=categorical_col[num],
                               data=df, hue='TARGET', ax=ax[i][j])
            tst.set_title(f"Distribution of the {categorical_col[num]}  Variable.")
            tst.set_xticklabels(tst.get_xticklabels(), rotation=90)
            plt.subplots_adjust(left=None, bottom=None, right=None,
                        top=None, wspace=None, hspace=0.45)
            num = num + 1
            plt.tight_layout() 
In [ ]:
 cat_features_plot(datasets, "application_train") 

Observation 6

  • Defaulters among the highly categorical features are seen in most, highlighting Organization Type, Family Type , Occupation Type & Education.
In [ ]:
df = copy.deepcopy(datasets["application_train"])
df['DAYS_BIRTH'] = round(abs(df['DAYS_BIRTH'])/365).astype(int)
df['DAYS_EMPLOYED'] = round(abs(df['DAYS_EMPLOYED'])/365).astype(int)
In [ ]:
num_attribs = ['TARGET', 'AMT_INCOME_TOTAL',  'AMT_CREDIT', 'DAYS_EMPLOYED','DAYS_BIRTH', 'EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3', 'AMT_GOODS_PRICE']
df2 = df[num_attribs]
df2.fillna(0, inplace=True)
df2.head(5)
Out[ ]:
TARGET AMT_INCOME_TOTAL AMT_CREDIT DAYS_EMPLOYED DAYS_BIRTH EXT_SOURCE_1 EXT_SOURCE_2 EXT_SOURCE_3 AMT_GOODS_PRICE
0 1 202500.0 406597.5 2 26 0.083037 0.262949 0.139376 351000.0
1 0 270000.0 1293502.5 3 46 0.311267 0.622246 0.000000 1129500.0
2 0 67500.0 135000.0 1 52 0.000000 0.555912 0.729567 135000.0
3 0 135000.0 312682.5 8 52 0.000000 0.650442 0.000000 297000.0
4 0 121500.0 513000.0 8 55 0.000000 0.322738 0.000000 513000.0
In [ ]:
# Scatter-plot
df2.fillna(0, inplace=True)
print('Numerical variables - Scatter-Matrix')
grr = pd.plotting.scatter_matrix(df2.loc[:, df2.columns != 'TARGET'], 
                                     c =df['TARGET'],
                                     figsize=(15, 15), marker='.',
                                     hist_kwds={'bins': 10}, s=60, alpha=.2)
Numerical variables - Scatter-Matrix
In [ ]:
 # Pair-plot
df2['TARGET'].replace(0, "No Default", inplace=True)
df2['TARGET'].replace(1, "Default", inplace=True)
print('Numerical variables - Pair-Plot')    
num_sns = sns.pairplot(df2, hue="TARGET", markers=["s", "o"])

    #    num_sns.title("Numerical variables - Pair-Plot")
Numerical variables - Pair-Plot
In [ ]:
collecttrash()
before collection :  (62, 0, 19)
after collection :  (6, 0, 0)

Observation 7

Correlation Map of Numerical Variables

  • Strong correlation between amount credit and amount goods price
  • Strong correlation between days birth and days employed
  • strong correlation between ext source 1 and days birth
  • These might be good candidates to do some feature engineering in.

Density Plots

Distribution of Target Value

In [ ]:
var_neg_corr = correlations.head(10).index.values
numVar = var_neg_corr.shape[0]

plt.figure(figsize=(10,40))
for i,var in enumerate(var_neg_corr):    
    dflt_var = datasets["application_train"].loc[datasets["application_train"]['TARGET']==1,var]
    dflt_non_var = datasets["application_train"].loc[datasets["application_train"]['TARGET']==0,var]
    
    plt.subplot(numVar,3,i+1)
    plt.subplots_adjust(wspace=2)
    sns.kdeplot(dflt_var,label='Default')
    sns.kdeplot(dflt_non_var,label='No Default')
    #plt.xlabel(var)
    plt.ylabel('Density')
    plt.title(var, fontsize = 10)
    plt.tight_layout()
plt.show() 
In [ ]:
var_pos_corr = correlations.tail(10).index.values
numVar = var_pos_corr.shape[0]

plt.figure(figsize=(10,40))
for i,var in enumerate(var_pos_corr):    
    dflt_var = datasets["application_train"].loc[datasets["application_train"]['TARGET']==1,var]
    dflt_non_var = datasets["application_train"].loc[datasets["application_train"]['TARGET']==0,var]
    if var=='TARGET':
      pass
    else:
      plt.subplot(numVar,3,i+1)
      plt.subplots_adjust(wspace=2)
      sns.kdeplot(dflt_var,label='Default')
      sns.kdeplot(dflt_non_var,label='No Default')
      #plt.xlabel(var)
      plt.ylabel('Density')
      plt.title(var, fontsize = 10)
      plt.tight_layout()
plt.show() 

Observation 8

  • We plot the KDEs of the most positively (negatively) correlated features with the TARGET. This is to evaluate whether there are any strange distributions between the default and do not default items.

  • If the distributions for each feature are very different for default and do not default, this is good and we should look out for this. So we can see that EXT_SOURCE_3 has the most different distributions between default and no default.

Observation 9

Overall View of Categorical values in Train & Test

For any categorical variable (dtype == object) with 2 unique categories, we will use label encoding, and for any categorical variable with more than 2 unique categories, we will use one-hot encoding.

In [ ]:
datasets['application_train'].select_dtypes('object').apply(pd.Series.nunique, axis = 0) 
Out[ ]:
NAME_CONTRACT_TYPE             2
CODE_GENDER                    3
FLAG_OWN_CAR                   2
FLAG_OWN_REALTY                2
NAME_TYPE_SUITE                7
NAME_INCOME_TYPE               8
NAME_EDUCATION_TYPE            5
NAME_FAMILY_STATUS             6
NAME_HOUSING_TYPE              6
OCCUPATION_TYPE               18
WEEKDAY_APPR_PROCESS_START     7
ORGANIZATION_TYPE             58
FONDKAPREMONT_MODE             4
HOUSETYPE_MODE                 3
WALLSMATERIAL_MODE             7
EMERGENCYSTATE_MODE            2
dtype: int64
In [ ]:
datasets['application_test'].select_dtypes('object').apply(pd.Series.nunique, axis = 0) 
Out[ ]:
NAME_CONTRACT_TYPE             2
CODE_GENDER                    2
FLAG_OWN_CAR                   2
FLAG_OWN_REALTY                2
NAME_TYPE_SUITE                7
NAME_INCOME_TYPE               7
NAME_EDUCATION_TYPE            5
NAME_FAMILY_STATUS             5
NAME_HOUSING_TYPE              6
OCCUPATION_TYPE               18
WEEKDAY_APPR_PROCESS_START     7
ORGANIZATION_TYPE             58
FONDKAPREMONT_MODE             4
HOUSETYPE_MODE                 3
WALLSMATERIAL_MODE             7
EMERGENCYSTATE_MODE            2
dtype: int64
In [ ]:
import time
import numpy as np
import pandas as pd
pd.set_option('display.max_columns', 100)
import matplotlib.pylab as plt
%matplotlib inline
from matplotlib.pylab import rcParams
import warnings
import seaborn as sns
color = sns.color_palette()
import pickle
from scipy.cluster.hierarchy import dendrogram, linkage

warnings.filterwarnings("ignore")
rcParams['figure.figsize'] = 12, 8
np.random.seed(23)

nb_levels_sr = datasets['application_train'].nunique()
binary_features_lst = nb_levels_sr.loc[nb_levels_sr == 2].index.tolist()
categorical_features_lst = list(set(datasets['application_train'].select_dtypes(["object"]).columns.tolist()) - set(binary_features_lst))

for feature in categorical_features_lst:
    fig, ax = plt.subplots(1, 2, sharex = False, sharey = False, figsize = (20, 10))
    # Plot levels distribution
    if datasets['application_train'][feature].nunique() < 10:
        sns.countplot(x = datasets['application_train'][feature], ax = ax[0], order = datasets['application_train'][feature].value_counts().index.tolist())
    else:
        sns.countplot(y = datasets['application_train'][feature], ax = ax[0], order = datasets['application_train'][feature].value_counts().index.tolist())
    ax[0].set_title("Count plot of each level of the feature: " + feature)

    # Plot target distribution among levels
    table_df = pd.crosstab(datasets['application_train']["TARGET"], datasets['application_train'][feature], normalize = True)
    table_df = table_df.div(table_df.sum(axis = 0), axis = 1)
    table_df = pd.crosstab(datasets['application_train']["TARGET"], datasets['application_train'][feature], normalize = True)
    table_df = table_df.div(table_df.sum(axis = 0), axis = 1)
    table_df = table_df.transpose().reset_index()
    order_lst = table_df.sort_values(by = 1)[feature].tolist()
    table_df = table_df.melt(id_vars = [feature])
    if datasets['application_train'][feature].nunique() < 10:
        ax2 = sns.barplot(x = table_df[feature], y = table_df["value"] * 100, hue = table_df["TARGET"], ax = ax[1], order = order_lst)
        for p in ax2.patches:
            height = p.get_height()
            ax2.text(p.get_x() + p.get_width() / 2., height + 1, "{:1.2f}".format(height), ha = "center")
    else:
        ax2 = sns.barplot(x = table_df["value"] * 100, y = table_df[feature], hue = table_df["TARGET"], ax = ax[1], order = order_lst)
        for p in ax2.patches:
            width = p.get_width()
            ax2.text(width + 3.1, p.get_y() + p.get_height() / 2. + 0.35, "{:1.2f}".format(width), ha = "center")

    ax[1].set_title("Target distribution among " +  feature + " levels")
    ax[1].set_ylabel("Percentage") 
In [ ]:
numerical_features_lst = list(set(datasets['application_train'].columns.tolist()) - set(categorical_features_lst) - set(binary_features_lst))
binary_features_lst = list(set(binary_features_lst) - {"TARGET"})

	# generate the linkage matrix
numerical_features_df = datasets['application_train'][numerical_features_lst + ["TARGET"]]
numerical_features_df.fillna(-1, inplace = True) # We need to impute missing values before creating the dendrogram
numerical_features_df = numerical_features_df.transpose()
Z = linkage(numerical_features_df, "ward")
plt.figure(figsize = (20, 15))
plt.title("Hierarchical Clustering Dendrogram")
plt.xlabel("feature")
plt.ylabel("distance")
dend = dendrogram(
    Z,
    leaf_rotation = 90.,  # rotates the x axis labels
    leaf_font_size = 8.,  # font size for the x axis labels
    labels = numerical_features_df.index.tolist()
) 

Observation 10

Imbalanced data

In [ ]:
train_labels = datasets['application_train']['TARGET']

# Align the training and testing data, keep only columns present in both dataframes
app_train, app_test = datasets['application_train'].align(datasets['application_test'], join = 'inner', axis = 1)

# Add the target back in
app_train['TARGET'] = train_labels

print('Training Features shape: ', datasets['application_train'].shape)
print('Testing Features shape: ', datasets['application_test'].shape) 
Training Features shape:  (307511, 122)
Testing Features shape:  (48744, 121)

Feature Engineering & Aggregation

Data preparation accounts for about 80% of the work of data scientists

image-2.png

The feature engineering we performed can be classified into - sub-parts which include

  • Including Custom domain knowledge based features
  • Creating engineered aggregated features
  • Experimental modelling of the data
  • Validating Manual OHE
  • Creating Poly Features to degree 4 for selected features
  • Merging all datasets
  • Drop Columns with Missing Values

image.png

An essential part of any feature engineering process is the domain knowledge based features which will help improve the accuracy of a model. The first step was to identify these for each dataset. Some of the new custom features included were credit card amount balance after payment based on due amount, application amount average , the credit average, Available credit as a percentage of income , Annuity as a percentage of income , Annuity as a percentage of available credit

The next step involved was to identify the numerical features and aggregate them to mean, min & max values. An attempt was made to apply label encoding for unique values more than 5 at the engineering phase. However, a design decision was made to apply OHE at the pipeline level for specific highly correlated fields on the final merged dataset to optimize the amount of code to handle the same functionality.

Extensive feature engineering was conducted by attempting multiple modelling approaches with primary, secondary and tertiary tables prior to finalizing an optimized approach with the least amount of memory usage. Attempt one involved creating engineered and aggregated features for Tier 3 tables: bureau_balance, credit_card_installment, installment_payments and point_of_sale_systems_cash_balance. This was then merged with Tier 2 tables i.e prev_application_balance with credit_card_installment, installment_payments and point_of_sale_systems_cash_balance & bureau with bureau_balance, along with aggregated features. A flat view combining all of the above tables were merged along with the primary dataset application train. This resulted in a high number of redundant features occupying large memory.

Attempt 2 involved creating custom and aggregated features for tier 3 tables and merging with tier 2 tables based on the primary key provided, which was later “extended” to the tier1 tables based on the additional aggregated columns. This approach created less duplicates, was optimized and occupied less memory by using a garbage collector after each merge.

In Attempt 3, the merged dataframe in the previous attempt were merged with the polynomial features with a degree of 4.

A final merge of the Tier3, Tier2 and Tier1 datasets were used to create a train dataframe. Special care was taken to ensure that there are no columns which have more than 50% of the data missing.
Engineering the features and including them in the model with small splits helped test the model but provided low accuracy. However, using these merged features along with reasonable splits during the training face did provide a better accuracy and less possibility of overfitting especially for Random forest and XGBoost.
Future work and experiments include Label encoding for the unique categorical values in all categorical fields and not select few. Attempting PCA or custom function to handle multicollinearity in the pipeline and eliminate features of low importance and verify its impacts on accuracy.

The steps involved in Feature engineering were: Separate the files into Tiers.

image.png

The pipeline for tier 3 files,

image-2.png

We include the manually engineered features for each file if any (InstallmentPaymentFeaturesAdder). Replace the missing values with the most frequent values for the categorical feature columns and perform One Hot Encoding for all the categorical columns (getDummies). Create aggregated features using the(FeaturesAggregator). Removed the features which have greater than 60% of null values(MissingFeatureRemover). Removed features with multicollinearity, dropped features which have a correlation value of greater than the threshold of 0.9(CollinearFeatureRemover).

The aggregated features with OHE are merged with the tier 2 files, those are the previous_application and the bureau datasets. The tier 3 pipelines are repeated on the tier 2 files as well. The tier 2 aggregated features with OHE are merged with application_train /test datasets. The final pipeline on the application_train is as mentioned below. image-3.png

The final features in the application_train / test dataset is: image-4.png

Manual Feature Engineering for below mentioned files.

  1. installments_payments
  2. credit_card_balance
  3. previous_application
  4. application_train
In [ ]:
# Create installment features
class InstallmentPaymentFeaturesAdder(BaseEstimator, TransformerMixin):
    def __init__(self):
        self.l = []
    
    def fit(self, X, y=None):
        return self
    
    def transform(self, X, y=None):
        
        # new feature creation from the Installment Payment file
        X['PAY_IS_LATE'] = X['DAYS_INSTALMENT'] - X['DAYS_ENTRY_PAYMENT']
        X['AMT_MISSED'] = X['AMT_INSTALMENT'] - X['AMT_PAYMENT']
        
        return X
In [ ]:
# Create installment features
class CCBalFeaturesAdder(BaseEstimator, TransformerMixin):
    def __init__(self):
        self.l = []
        
    def fit(self, X, y=None):
        return self
    
    def transform(self, X, y=None):
        
        # new feature creation from the Credit Card file
        X['DPD_MISSED'] = X['SK_DPD'] - X['SK_DPD_DEF']
        X['CREDIT_UTILIZED'] = X['AMT_CREDIT_LIMIT_ACTUAL'] - X['AMT_DRAWINGS_CURRENT']
        X['MIN_CREDIT_AMTMISS'] = X['AMT_INST_MIN_REGULARITY'] - X['AMT_PAYMENT_CURRENT']

        # Difference between the monthly amount paid - the expected monthly amount
        X['PAYMENT_DIFF_CURR_PAY'] = X['AMT_PAYMENT_TOTAL_CURRENT'] - X['AMT_PAYMENT_CURRENT']
        X['PAYMENT_DIFF_MIN_PAY'] = X['AMT_PAYMENT_TOTAL_CURRENT'] - X['AMT_INST_MIN_REGULARITY']
        # Difference between the monthly amount paid - the minimum monthly amount
        return X
In [ ]:
# Create previous application features
class PrevAppFeaturesAdder(BaseEstimator, TransformerMixin):
    def __init__(self):
        self.l = []
        
    def fit(self, X, y=None):
        return self
    
    def transform(self, X, y=None):
        # new feature in the Previous Application file
        X['INTEREST'] = X['CNT_PAYMENT'] * X['AMT_ANNUITY'] - X['AMT_CREDIT']
        X['INTEREST_PER_CREDIT'] = X['INTEREST'] / X['AMT_CREDIT']
        X['CREDIT_SUCCESS'] = X['AMT_APPLICATION'] - X['AMT_CREDIT']
        X['INTEREST_RT'] = 2 * 12 * X['INTEREST'] / (X['AMT_CREDIT'] * (X['CNT_PAYMENT'] + 1))
        return X
In [ ]:
# Create application features
class ApplicationTrainTestFeaturesAdder(BaseEstimator, TransformerMixin):
    def __init__(self):
        self.l = []
        
    def fit(self, X, y=None):
        return self
    
    def transform(self, X, y=None):
        
        # credit to income ratio
        X['CREDIT_INCOME_RATIO'] = X['AMT_CREDIT'] / X['AMT_INCOME_TOTAL']
        
        # annuity to income ratio
        X['ANNUITY_INCOME_RATIO'] = X['AMT_ANNUITY'] / X['AMT_INCOME_TOTAL']
        
        # length of the credit term
        X['CREDIT_LENGTH'] = X['AMT_ANNUITY'] / X['AMT_CREDIT']
        
        # what is income to age ratio
        X['INCOME_AGE_RATIO'] = X['AMT_INCOME_TOTAL'] / X['DAYS_BIRTH']
        
        # what is credit to age ratio
        X['CREDIT_AGE_RATIO'] = X['AMT_CREDIT'] / X['DAYS_BIRTH']
        
        # what percent of applicants life have they been working at recent company
        X['DAYS_EMPLOYED_PERCENT'] = X['DAYS_EMPLOYED'] / X['DAYS_BIRTH']
        
        # add liability feature code
        conditions_temp = [
            (X['FLAG_OWN_CAR'] == 'Y') & (X['FLAG_OWN_REALTY'] == 'Y'),
            (X['FLAG_OWN_CAR'] == 'N') & (X['FLAG_OWN_REALTY'] == 'Y'),
            (X['FLAG_OWN_CAR'] == 'Y') & (X['FLAG_OWN_REALTY'] == 'N'),
            (X['FLAG_OWN_CAR'] == 'N') & (X['FLAG_OWN_REALTY'] == 'N')]
        
        values_temp = ['0', '1', '2', '3']
        
        X['HAS_LIBAILITY'] = np.select(conditions_temp, values_temp)
        X['DAYS_EMPLOYED_PCT'] = X['DAYS_EMPLOYED'] / X['DAYS_BIRTH']
        X['CREDIT_INCOME_PCT'] = X['AMT_CREDIT'] / X['AMT_INCOME_TOTAL']
        X['ANNUITY_INCOME_PCT'] = X['AMT_ANNUITY'] / X['AMT_INCOME_TOTAL']
        X['CREDIT_TERM'] = X['AMT_ANNUITY'] / X['AMT_CREDIT']

        
        return X

One Hot Encoder

In [ ]:
# Create aggregate features (via pipeline)
class getDummies(BaseEstimator, TransformerMixin):
    def __init__(self, columns=None): # no *args or **kargs
        self.columns = columns

    def fit(self, X, y=None):
        return self
    
    def transform(self, X, y=None):      
        X.fillna(X.select_dtypes(include = 'object').mode().iloc[0], inplace = True)
        result = pd.get_dummies(X, columns = self.columns)

        #('imputer', SimpleImputer(strategy='most_frequent')),
        return result

Feature Aggregator Helper Function

Functions required to perform feature aggregations are listed below

In [ ]:
# function to get the numerical features
def get_numattribs(dataDF):
  num_attribs=(dataDF.select_dtypes(include=['int64', 'float64']).columns.tolist())
  print()
  print('Numerical attributes for',ds_name,' : ',num_attribs)
  print()
  return num_attribs
In [ ]:
class FeaturesAggregator(BaseEstimator, TransformerMixin):
    def __init__(self, file_name=None, features=None, primary_id = None): 
        self.prefix = file_name
        self.features = features
        self.numeric_stats = ["min", "max", "mean", "count", "sum"]
        self.categorical_stats = ["mean", "count", "sum"]
        self.primary_id = primary_id
        self.agg_op_features = {}
        self.agg_features_names = [self.primary_id]
    
    def fit(self, X, y=None):
        return self
    
    def transform(self, X, y=None):
        numeric_cols = list(X.columns[X.columns.isin(self.features)])
        numeric_cols = [num for num in numeric_cols if num not in ['SK_ID_CURR','SK_ID_PREV','SK_ID_BUREAU']]
        categorical_cols = list(X.columns[~X.columns.isin(self.features)])
       
        for f in numeric_cols:
            self.agg_op_features[f] = self.numeric_stats
            self.agg_features_names = self.agg_features_names + [self.prefix + "_" + f + "_" + s for s in self.numeric_stats]
            
        for f in categorical_cols:
            self.agg_op_features[f] = self.categorical_stats
            self.agg_features_names = self.agg_features_names + [self.prefix + "_" + f + "_" + s for s in self.categorical_stats]       
       
        result = X.groupby(self.primary_id).agg(self.agg_op_features)
        result.columns = result.columns.droplevel()
        result = result.reset_index(level=[self.primary_id])
        result.columns = self.agg_features_names
        return result
In [ ]:
class engineer_features(BaseEstimator, TransformerMixin):
    def __init__(self, features=None):
        self
    
    def fit(self, X, y=None):
        return self
    
    def transform(self, X, y=None):

# FROM APPLICATION
        # ADD INCOME CREDIT PERCENTAGE
        X['ef_INCOME_CREDIT_PERCENT'] = (
            X.AMT_INCOME_TOTAL / X.AMT_CREDIT).replace(np.inf, 0)
    
        # ADD INCOME PER FAMILY MEMBER
        X['ef_FAM_MEMBER_INCOME'] = (
            X.AMT_INCOME_TOTAL / X.CNT_FAM_MEMBERS).replace(np.inf, 0)
    
        # ADD ANNUITY AS PERCENTAGE OF ANNUAL INCOME
        X['ef_ANN_INCOME_PERCENT'] = (
            X.AMT_ANNUITY / X.AMT_INCOME_TOTAL).replace(np.inf, 0)

        return X
In [ ]:
# Creates the following date features
# But could do so much more with these features
#    E.g., 
#      extract the domain address of the homepage and OneHotEncode it
# 
# ['release_month','release_day','release_year', 'release_dayofweek','release_quarter']
class prep_OCCUPATION_TYPE(BaseEstimator, TransformerMixin):
    def __init__(self, features="OCCUPATION_TYPE"): # no *args or **kargs
        self.features = features
    def fit(self, X, y=None):
        return self  # nothing else to do
    def transform(self, X):
        df = pd.DataFrame(X, columns=self.features)
        #from IPython.core.debugger import Pdb as pdb;    pdb().set_trace() #breakpoint; dont forget to quit         
        df['OCCUPATION_TYPE'] = df['OCCUPATION_TYPE'].apply(lambda x: 1. if x in ['Core Staff', 'Accountants', 'Managers', 'Sales Staff', 'Medicine Staff', 'High Skill Tech Staff', 'Realty Agents', 'IT Staff', 'HR Staff'] else 0.)   
        #df.drop(self.features, axis=1, inplace=True)
        return np.array(df.values) 

Missing Data Removal

In [ ]:
# Remove missing columns
class MissingFeatureRemover(BaseEstimator, TransformerMixin):
    def __init__(self, threshold = .6):
        self.threshold = threshold
        
    def fit(self, X, y=None):
        
        # get the percent of missingness in features
        percent = (X.isnull().sum()/X.isnull().count()).sort_values(ascending = False)
        
        # turn into a data frame
        missing_application_train_data  = pd.DataFrame(percent, columns=['Percent'])
        
        # get the columns with missingness exceeding the threshold
        self.columns_to_drop = list(missing_application_train_data.index[missing_application_train_data['Percent'] > self.threshold])
        
        return self
    
    def transform(self, X, y=None):
        
        # drop the columns with missingness over the threshold
        X = X.drop(columns = self.columns_to_drop, axis=1)
        
        return X

Collinear Feature Removal

In [ ]:
# Remove features with high colli
class CollinearFeatureRemover(BaseEstimator, TransformerMixin):
    def __init__(self, threshold = .9):
        self.threshold = threshold
        
    def fit(self, X, y=None):
        
        # get the correlation matrix for the entire dataset after one hot encoding features
#        correlation_matrix = X.head(1000).corr().abs()
        correlation_matrix = X.sample(10000).corr().abs()

        # get only the lower portion of collinearity matrix
        lower = correlation_matrix.where(np.tril(np.ones(correlation_matrix.shape), k=-1).astype(np.bool))
        
        # get the fields with correlation above threshold
        self.columns_to_drop = [index for index in lower.index if any(lower.loc[index] > self.threshold)]
        
        return self
    
    def transform(self, X, y=None):
        
        # drop the columns with collinearity over the threshold
        X = X.drop(columns = self.columns_to_drop, axis=1)
        
        return X

Removal of Zero Variance

In [ ]:
# Remove features with near zero variance
class NearZeroVarianceFeatureRemover(BaseEstimator, TransformerMixin):
    def __init__(self, threshold = 0):
        self.threshold = threshold
        
    def fit(self, X, y=None):
        
        # get the fields with correlation above threshold
        self.columns_to_drop = [col for col in X.select_dtypes([np.number]).columns if np.nanvar(X[col]) <= self.threshold]
        
        return self
    
    def transform(self, X, y=None):
        
        # drop the columns with collinearity over the threshold
        X = X.drop(columns = self.columns_to_drop, axis=1)
        
        return X
In [ ]:
gc.collect()
Out[ ]:
302

Creating a base copy of the data

In [ ]:
appsTrainDF = datasets['application_train'].copy()
X_kaggle_test = datasets['application_test'].copy()
prevAppsDF = datasets["previous_application"].copy() 
bureauDF = datasets["bureau"].copy()

bureaubalDF = datasets['bureau_balance'].copy()
ccbalDF = datasets["credit_card_balance"].copy()
installmentspaymentsDF = datasets["installments_payments"].copy()
pos_cash_bal_DF = datasets["POS_CASH_balance"].copy() 

Tertiary Datasets

The tertiary datasets or tables refer to bureau_balance, POS_CASH_balance, instalments_payments, credit_card_balance

In [ ]:
tertiaty_datasets=['bureau_balance','credit_card_balance','installments_payments','POS_CASH_balance']

Third Tier datasets Numerical feature aggregation

Feature aggregation for the tertiary datasets

In [ ]:
primary_id1 = "SK_ID_PREV"
primary_id2 = "SK_ID_BUREAU"


posBal_features = pos_cash_bal_DF.columns.to_list()
instalPay_features = installmentspaymentsDF.columns.to_list()
instalPay_features.extend(['PAY_IS_LATE', 'AMT_MISSED'])

ccBal_features = ccbalDF.columns.to_list()
ccBal_features.extend(['DPD_MISSED', 'CREDIT_UTILIZED', 'MIN_CREDIT_AMTMISS', 
                       'PAYMENT_DIFF_CURR_PAY','PAYMENT_DIFF_MIN_PAY']) 
  
burBal_features = bureaubalDF.columns.to_list()

fn_POS_CASH ='POS_CASH_balance'
fn_ins_pay = 'installments_payments'
fn_ccbal = 'credit_card_balance'
fn_bbal ='bureau_balance'

Define Pipeline to create aggregator and OHE features.

In [ ]:
# set pos cash pipeline
pos_cash_pipe = Pipeline([
    #('imputer', SimpleImputer(strategy='most_frequent')),
    ('ohe', getDummies()),
    ('aggregator', FeaturesAggregator(fn_POS_CASH,posBal_features, primary_id1)),
    ('missing data remover', MissingFeatureRemover()),
    ('collinearity remover', CollinearFeatureRemover())
])


# set installments_payments pipeline
install_pay_pipe = Pipeline([
    ('install_pay_new_features', InstallmentPaymentFeaturesAdder()),  
    #('imputer', SimpleImputer(strategy='most_frequent')),                       
    ('ohe', getDummies()),
    ('aggregator', FeaturesAggregator(fn_ins_pay,instalPay_features, primary_id1)),
    ('missing data remover', MissingFeatureRemover()),
    ('collinearity remover', CollinearFeatureRemover())
])

# set credit_card_balance pipeline
cc_bal_pipe = Pipeline([
    ('install_pay_new_features', CCBalFeaturesAdder()), 
    #('imputer', SimpleImputer(strategy='most_frequent')),                        
    ('ohe', getDummies()),
    ('aggregator', FeaturesAggregator(fn_ccbal,ccBal_features, primary_id1)),
    ('missing data remover', MissingFeatureRemover()),
    ('collinearity remover', CollinearFeatureRemover())
])

# set bureau_balance pipeline
bureau_bal_pipe = Pipeline([
    #('imputer', SimpleImputer(strategy='most_frequent')),
    ('ohe', getDummies()),
    ('aggregator', FeaturesAggregator(fn_bbal, burBal_features, primary_id2)),
    ('missing data remover', MissingFeatureRemover()),
    ('collinearity remover', CollinearFeatureRemover())
])

Run the pipelines for tier 3

In [ ]:
pos_cash_bal_aggregated = pos_cash_pipe.fit_transform(pos_cash_bal_DF)
del pos_cash_bal_DF

installments_pmnts_aggregated = install_pay_pipe.fit_transform(installmentspaymentsDF)
del installmentspaymentsDF

ccblance_aggregated = cc_bal_pipe.fit_transform(ccbalDF)
del ccbalDF  

bureaubal_aggregated = bureau_bal_pipe.fit_transform(bureaubalDF)
del bureaubalDF

gc.collect()
Out[ ]:
0

Step 1: Merge Tier 3 with Tier 2

In [ ]:
print(pos_cash_bal_aggregated.shape)
print(ccblance_aggregated.shape)
print(installments_pmnts_aggregated.shape)
print(bureaubal_aggregated.shape)
(936325, 27)
(104307, 65)
(558183, 21)
(817395, 19)

Merging the aggregated features for pos_cash_bal , installments_pmnts , credit card balance with Previous application

In [ ]:
prevApps_ThirdTierMerge = True

posBal_join_feature = 'SK_ID_PREV'
instalPay_join_feature = 'SK_ID_PREV'
ccBal_join_feature = 'SK_ID_PREV'
burBal_join_feature = 'SK_ID_BUREAU'
prevApps_join_feature = 'SK_ID_CURR'
bureau_join_feature = 'SK_ID_CURR'

if prevApps_ThirdTierMerge:
  # Merge Datasets
  prevAppsDF = prevAppsDF.merge(pos_cash_bal_aggregated, how='left', on=posBal_join_feature)
  prevAppsDF = prevAppsDF.merge(installments_pmnts_aggregated, how='left', on=instalPay_join_feature)
  prevAppsDF = prevAppsDF.merge(ccblance_aggregated, how='left', on=ccBal_join_feature)

Merging the aggregated features the dataset Bureau Balance with Bureau as per the data model.

In [ ]:
bureau_ThirdTierMerge = True

if bureau_ThirdTierMerge:
  bureauDF = bureauDF.merge(bureaubal_aggregated, how='left', on=burBal_join_feature)
In [ ]:
print(prevAppsDF.shape)
print(bureauDF.shape)
(1670214, 147)
(1716428, 35)
In [ ]:
gc.collect()
Out[ ]:
152

Secondary Datasets

Second Tier datasets feature aggregation and OHE pipeline

In [ ]:
primary_id1 = "SK_ID_CURR"

fn_bureau = 'bureau'
fn_prevapps = 'previous_application'
fn_appsTrain = 'application_train'
fn_appsTest = 'application_test'

# dataframe names for reference
# appsTrainDF 
# appsTestDF
# prevAppsDF
# bureauDF 

Define the second tier pipeline

In [ ]:
# get column names
prevApps_features = prevAppsDF.columns.to_list()
prevApps_features.extend(['INTEREST', 'INTEREST_PER_CREDIT', 'CREDIT_SUCCESS', 'INTEREST_RT'])

bureau_features = bureauDF.columns.to_list()

# set previous_application pipeline
prev_app_pipe = Pipeline([
    ('prev_app_feature_adder', PrevAppFeaturesAdder()), 
    #('imputer', SimpleImputer(strategy='most_frequent')),                        
    ('ohe', getDummies()),
    ('aggregator', FeaturesAggregator(fn_prevapps,prevApps_features, primary_id1)),
    ('missing data remover', MissingFeatureRemover()),
    ('collinearity remover', CollinearFeatureRemover())
])

# set bureau pipeline
bureau_pipe = Pipeline([  
    #('imputer', SimpleImputer(strategy='most_frequent')),                      
    ('ohe', getDummies()),
    ('aggregator', FeaturesAggregator(fn_bureau,bureau_features, primary_id1)),
    ('missing data remover', MissingFeatureRemover()),
    ('collinearity remover', CollinearFeatureRemover())
])
In [ ]:
prevApps_aggregated = prev_app_pipe.fit_transform(prevAppsDF)
bureau_aggregated = bureau_pipe.fit_transform(bureauDF)

del bureauDF
del prevAppsDF

gc.collect()
Out[ ]:
0
In [ ]:
print(prevApps_aggregated.shape)
print(bureau_aggregated.shape)
(338857, 312)
(305811, 79)

Primary Datasets

Merge Aggregated Dataset With Tier 1 Tables - Train and Test

Prior to merging with the Primary data, we will be dropping columns with more than 50% missing values because they are not reliable parameters.

In [ ]:
prevApps_join_feature = 'SK_ID_CURR'
bureau_join_feature = 'SK_ID_CURR'

merge_all_data = True

if merge_all_data:
# 1. Join/Merge in prevApps Data
    # Merge all the features with Application_train
    appsTrainDF = appsTrainDF.merge(prevApps_aggregated, how = 'left', on = prevApps_join_feature)
    appsTrainDF = appsTrainDF.merge(bureau_aggregated, how = 'left', on = bureau_join_feature)

    # Merge all the features with Application_train
    X_kaggle_test = X_kaggle_test.merge(prevApps_aggregated, how = 'left', on = prevApps_join_feature)
    X_kaggle_test = X_kaggle_test.merge(bureau_aggregated, how = 'left', on = bureau_join_feature)
In [ ]:
print(appsTrainDF.shape)
print(X_kaggle_test.shape)
(307511, 511)
(48744, 510)
In [ ]:
gc.collect()
Out[ ]:
100

Categorical feature mapping

In [ ]:
def get_cat_attribs():
  cat_cols = []
  cat_cols=list(appsTrainDF.select_dtypes(include=['object']).columns)
  return cat_cols

cat_attribs = get_cat_attribs()

over_5_unique = ([])

for att in cat_attribs:
  if (len(appsTrainDF[att].unique()) > 5):
    over_5_unique.append(att)

print(f'{len(over_5_unique)} cat attributes with more than 5 unique values')
8 cat attributes with more than 5 unique values
In [ ]:
for att in over_5_unique:

  print(f'{att}:')

  column_total = appsTrainDF[att].shape[0]

  for v in appsTrainDF[att].unique():

    print(f"Rows for {v}: {sum(appsTrainDF[att] == v)} - {round(100 * (sum(appsTrainDF[att] == v) / column_total))}%")
NAME_TYPE_SUITE:
Rows for Unaccompanied: 248526 - 81%
Rows for Family: 40149 - 13%
Rows for Spouse, partner: 11370 - 4%
Rows for Children: 3267 - 1%
Rows for Other_A: 866 - 0%
Rows for nan: 0 - 0%
Rows for Other_B: 1770 - 1%
Rows for Group of people: 271 - 0%
NAME_INCOME_TYPE:
Rows for Working: 158774 - 52%
Rows for State servant: 21703 - 7%
Rows for Commercial associate: 71617 - 23%
Rows for Pensioner: 55362 - 18%
Rows for Unemployed: 22 - 0%
Rows for Student: 18 - 0%
Rows for Businessman: 10 - 0%
Rows for Maternity leave: 5 - 0%
NAME_FAMILY_STATUS:
Rows for Single / not married: 45444 - 15%
Rows for Married: 196432 - 64%
Rows for Civil marriage: 29775 - 10%
Rows for Widow: 16088 - 5%
Rows for Separated: 19770 - 6%
Rows for Unknown: 2 - 0%
NAME_HOUSING_TYPE:
Rows for House / apartment: 272868 - 89%
Rows for Rented apartment: 4881 - 2%
Rows for With parents: 14840 - 5%
Rows for Municipal apartment: 11183 - 4%
Rows for Office apartment: 2617 - 1%
Rows for Co-op apartment: 1122 - 0%
OCCUPATION_TYPE:
Rows for Laborers: 55186 - 18%
Rows for Core staff: 27570 - 9%
Rows for Accountants: 9813 - 3%
Rows for Managers: 21371 - 7%
Rows for nan: 0 - 0%
Rows for Drivers: 18603 - 6%
Rows for Sales staff: 32102 - 10%
Rows for Cleaning staff: 4653 - 2%
Rows for Cooking staff: 5946 - 2%
Rows for Private service staff: 2652 - 1%
Rows for Medicine staff: 8537 - 3%
Rows for Security staff: 6721 - 2%
Rows for High skill tech staff: 11380 - 4%
Rows for Waiters/barmen staff: 1348 - 0%
Rows for Low-skill Laborers: 2093 - 1%
Rows for Realty agents: 751 - 0%
Rows for Secretaries: 1305 - 0%
Rows for IT staff: 526 - 0%
Rows for HR staff: 563 - 0%
WEEKDAY_APPR_PROCESS_START:
Rows for WEDNESDAY: 51934 - 17%
Rows for MONDAY: 50714 - 16%
Rows for THURSDAY: 50591 - 16%
Rows for SUNDAY: 16181 - 5%
Rows for SATURDAY: 33852 - 11%
Rows for FRIDAY: 50338 - 16%
Rows for TUESDAY: 53901 - 18%
ORGANIZATION_TYPE:
Rows for Business Entity Type 3: 67992 - 22%
Rows for School: 8893 - 3%
Rows for Government: 10404 - 3%
Rows for Religion: 85 - 0%
Rows for Other: 16683 - 5%
Rows for XNA: 55374 - 18%
Rows for Electricity: 950 - 0%
Rows for Medicine: 11193 - 4%
Rows for Business Entity Type 2: 10553 - 3%
Rows for Self-employed: 38412 - 12%
Rows for Transport: type 2: 2204 - 1%
Rows for Construction: 6721 - 2%
Rows for Housing: 2958 - 1%
Rows for Kindergarten: 6880 - 2%
Rows for Trade: type 7: 7831 - 3%
Rows for Industry: type 11: 2704 - 1%
Rows for Military: 2634 - 1%
Rows for Services: 1575 - 1%
Rows for Security Ministries: 1974 - 1%
Rows for Transport: type 4: 5398 - 2%
Rows for Industry: type 1: 1039 - 0%
Rows for Emergency: 560 - 0%
Rows for Security: 3247 - 1%
Rows for Trade: type 2: 1900 - 1%
Rows for University: 1327 - 0%
Rows for Transport: type 3: 1187 - 0%
Rows for Police: 2341 - 1%
Rows for Business Entity Type 1: 5984 - 2%
Rows for Postal: 2157 - 1%
Rows for Industry: type 4: 877 - 0%
Rows for Agriculture: 2454 - 1%
Rows for Restaurant: 1811 - 1%
Rows for Culture: 379 - 0%
Rows for Hotel: 966 - 0%
Rows for Industry: type 7: 1307 - 0%
Rows for Trade: type 3: 3492 - 1%
Rows for Industry: type 3: 3278 - 1%
Rows for Bank: 2507 - 1%
Rows for Industry: type 9: 3368 - 1%
Rows for Insurance: 597 - 0%
Rows for Trade: type 6: 631 - 0%
Rows for Industry: type 2: 458 - 0%
Rows for Transport: type 1: 201 - 0%
Rows for Industry: type 12: 369 - 0%
Rows for Mobile: 317 - 0%
Rows for Trade: type 1: 348 - 0%
Rows for Industry: type 5: 599 - 0%
Rows for Industry: type 10: 109 - 0%
Rows for Legal Services: 305 - 0%
Rows for Advertising: 429 - 0%
Rows for Trade: type 5: 49 - 0%
Rows for Cleaning: 260 - 0%
Rows for Industry: type 13: 67 - 0%
Rows for Trade: type 4: 64 - 0%
Rows for Telecom: 577 - 0%
Rows for Industry: type 8: 24 - 0%
Rows for Realtor: 396 - 0%
Rows for Industry: type 6: 112 - 0%
WALLSMATERIAL_MODE:
Rows for Stone, brick: 64815 - 21%
Rows for Block: 9253 - 3%
Rows for nan: 0 - 0%
Rows for Panel: 66040 - 21%
Rows for Mixed: 2296 - 1%
Rows for Wooden: 5362 - 2%
Rows for Others: 1625 - 1%
Rows for Monolithic: 1779 - 1%
In [ ]:
 appsTrainDF['NAME_TYPE_SUITE'] = appsTrainDF['NAME_TYPE_SUITE'].replace({
                       'Family' : 'other',
                       'Spouse, partner' : 'other',
                       'Children' : 'other',
                       'Other_A' : 'other',
                       'Other_B' : 'other',
                       'Group of people' : 'other',})

 appsTrainDF['NAME_INCOME_TYPE'] = appsTrainDF['NAME_INCOME_TYPE'].replace({
                       'Unemployed' : 'other',
                       'Student' : 'other',
                       'Businessman' : 'other',
                       'Maternity leave' : 'other',})

 appsTrainDF['NAME_FAMILY_STATUS'] = appsTrainDF['NAME_FAMILY_STATUS'].replace({
                       'Single / not married' : 'Not Married',
                       'Married' : 'Married',
                       'Civil marriage' : 'Married',
                       'Widow' : 'Not Married',
                       'Separated' : 'Not Married',
                       'Unknown' : 'Not Married',})

 appsTrainDF['NAME_HOUSING_TYPE'] = appsTrainDF['NAME_HOUSING_TYPE'].replace({
                       'House / apartment' : 'House / apartment',
                       'Rented apartment' : 'other',
                       'With parents' : 'other',
                       'Municipal apartment' : 'other',
                       'Office apartment' : 'other',
                       'Co-op apartment' : 'other',})

 appsTrainDF['OCCUPATION_TYPE'] = appsTrainDF['OCCUPATION_TYPE'].replace({
                       'Laborers' : 'Service Industry',
                       'Drivers' : 'Service Industry',
                       'Cleaning staff' : 'Service Industry',
                       'Cooking staff' : 'Service Industry',
                       'Private service staff' : 'Service Industry',
                       'Security staff' : 'Service Industry',
                       'Waiters/barmen staff' : 'Service Industry',
                       'Low-skill Laborers' : 'Service Industry',
                       'Core staff' : 'Office',
                       'Accountants' : 'Office',
                       'Managers' : 'Office',
                       'Sales staff' : 'Office',
                       'Medicine staff' : 'Office',
                       'High skill tech staff' : 'Office',
                       'Realty agents' : 'Office',
                       'Secretaries' : 'Office',
                       'IT staff' : 'Office',
                       'HR staff' : 'Office',})


 appsTrainDF['WEEKDAY_APPR_PROCESS_START'] = appsTrainDF['WEEKDAY_APPR_PROCESS_START'].replace({
                       'SUNDAY' : 'Weekend',
                       'MONDAY' : 'Weekday',
                       'TUESDAY' : 'Weekday',
                       'WEDNESDAY' : 'Weekday',
                       'THURSDAY' : 'Weekday',
                       'FRIDAY' : 'Weekday',
                       'SATURDAY' : 'Weekend',})

 appsTrainDF['ORGANIZATION_TYPE'] = appsTrainDF['ORGANIZATION_TYPE'].replace({
                         'Advertising' : 'Business',
                         'Agriculture' : 'Industrial',
                         'Bank' : 'Business',
                         'Business Entity Type 1' : 'Business',
                         'Business Entity Type 2' : 'Business',
                         'Business Entity Type 3' : 'Business',
                         'Cleaning' : 'Service',
                         'Construction' : 'Industrial',
                         'Culture' : 'Other',
                         'Electricity' : 'Industrial',
                         'Emergency' : 'Government',
                         'Government' : 'Government',
                         'Hotel' : 'Service',
                         'Housing' : 'Other',
                         'Industry: type 1' : 'Industrial',
                         'Industry: type 10' : 'Industrial',
                         'Industry: type 11' : 'Industrial',
                         'Industry: type 12' : 'Industrial',
                         'Industry: type 13' : 'Industrial',
                         'Industry: type 2' : 'Industrial',
                         'Industry: type 3' : 'Industrial',
                         'Industry: type 4' : 'Industrial',
                         'Industry: type 5' : 'Industrial',
                         'Industry: type 6' : 'Industrial',
                         'Industry: type 7' : 'Industrial',
                         'Industry: type 8' : 'Industrial',
                         'Industry: type 9' : 'Industrial',
                         'Insurance' : 'Business',
                         'Kindergarten' : 'Government',
                         'Legal Services' : 'Business',
                         'Medicine' : 'Government',
                         'Military' : 'Government',
                         'Mobile' : 'Other',
                         'Other' : 'Other',
                         'Police' : 'Government',
                         'Postal' : 'Government',
                         'Realtor' : 'Business',
                         'Religion' : 'Other',
                         'Restaurant' : 'Government',
                         'School' : 'Government',
                         'Security' : 'Other',
                         'Security Ministries' : 'Other',
                         'Self-employed' : 'Other',
                         'Services' : 'Service',
                         'Telecom' : 'Business',
                         'Trade: type 1' : 'Trade',
                         'Trade: type 2' : 'Trade',
                         'Trade: type 3' : 'Trade',
                         'Trade: type 4' : 'Trade',
                         'Trade: type 5' : 'Trade',
                         'Trade: type 6' : 'Trade',
                         'Trade: type 7' : 'Trade',
                         'Transport: type 1' : 'Service',
                         'Transport: type 2' : 'Service',
                         'Transport: type 3' : 'Service',
                         'Transport: type 4' : 'Service',
                         'University' : 'Government',
                         'XNA' : 'XNA'})
In [ ]:
 X_kaggle_test['NAME_TYPE_SUITE'] = X_kaggle_test['NAME_TYPE_SUITE'].replace({
                       'Family' : 'other',
                       'Spouse, partner' : 'other',
                       'Children' : 'other',
                       'Other_A' : 'other',
                       'Other_B' : 'other',
                       'Group of people' : 'other',})

 X_kaggle_test['NAME_INCOME_TYPE'] = X_kaggle_test['NAME_INCOME_TYPE'].replace({
                       'Unemployed' : 'other',
                       'Student' : 'other',
                       'Businessman' : 'other',
                       'Maternity leave' : 'other',})

 X_kaggle_test['NAME_FAMILY_STATUS'] = X_kaggle_test['NAME_FAMILY_STATUS'].replace({
                       'Single / not married' : 'Not Married',
                       'Married' : 'Married',
                       'Civil marriage' : 'Married',
                       'Widow' : 'Not Married',
                       'Separated' : 'Not Married',
                       'Unknown' : 'Not Married',})

 X_kaggle_test['NAME_HOUSING_TYPE'] = X_kaggle_test['NAME_HOUSING_TYPE'].replace({
                       'House / apartment' : 'House / apartment',
                       'Rented apartment' : 'other',
                       'With parents' : 'other',
                       'Municipal apartment' : 'other',
                       'Office apartment' : 'other',
                       'Co-op apartment' : 'other',})

 X_kaggle_test['OCCUPATION_TYPE'] = X_kaggle_test['OCCUPATION_TYPE'].replace({
                       'Laborers' : 'Service Industry',
                       'Drivers' : 'Service Industry',
                       'Cleaning staff' : 'Service Industry',
                       'Cooking staff' : 'Service Industry',
                       'Private service staff' : 'Service Industry',
                       'Security staff' : 'Service Industry',
                       'Waiters/barmen staff' : 'Service Industry',
                       'Low-skill Laborers' : 'Service Industry',
                       'Core staff' : 'Office',
                       'Accountants' : 'Office',
                       'Managers' : 'Office',
                       'Sales staff' : 'Office',
                       'Medicine staff' : 'Office',
                       'High skill tech staff' : 'Office',
                       'Realty agents' : 'Office',
                       'Secretaries' : 'Office',
                       'IT staff' : 'Office',
                       'HR staff' : 'Office',})


 X_kaggle_test['WEEKDAY_APPR_PROCESS_START'] = X_kaggle_test['WEEKDAY_APPR_PROCESS_START'].replace({
                       'SUNDAY' : 'Weekend',
                       'MONDAY' : 'Weekday',
                       'TUESDAY' : 'Weekday',
                       'WEDNESDAY' : 'Weekday',
                       'THURSDAY' : 'Weekday',
                       'FRIDAY' : 'Weekday',
                       'SATURDAY' : 'Weekend',})

 X_kaggle_test['ORGANIZATION_TYPE'] = X_kaggle_test['ORGANIZATION_TYPE'].replace({
                         'Advertising' : 'Business',
                         'Agriculture' : 'Industrial',
                         'Bank' : 'Business',
                         'Business Entity Type 1' : 'Business',
                         'Business Entity Type 2' : 'Business',
                         'Business Entity Type 3' : 'Business',
                         'Cleaning' : 'Service',
                         'Construction' : 'Industrial',
                         'Culture' : 'Other',
                         'Electricity' : 'Industrial',
                         'Emergency' : 'Government',
                         'Government' : 'Government',
                         'Hotel' : 'Service',
                         'Housing' : 'Other',
                         'Industry: type 1' : 'Industrial',
                         'Industry: type 10' : 'Industrial',
                         'Industry: type 11' : 'Industrial',
                         'Industry: type 12' : 'Industrial',
                         'Industry: type 13' : 'Industrial',
                         'Industry: type 2' : 'Industrial',
                         'Industry: type 3' : 'Industrial',
                         'Industry: type 4' : 'Industrial',
                         'Industry: type 5' : 'Industrial',
                         'Industry: type 6' : 'Industrial',
                         'Industry: type 7' : 'Industrial',
                         'Industry: type 8' : 'Industrial',
                         'Industry: type 9' : 'Industrial',
                         'Insurance' : 'Business',
                         'Kindergarten' : 'Government',
                         'Legal Services' : 'Business',
                         'Medicine' : 'Government',
                         'Military' : 'Government',
                         'Mobile' : 'Other',
                         'Other' : 'Other',
                         'Police' : 'Government',
                         'Postal' : 'Government',
                         'Realtor' : 'Business',
                         'Religion' : 'Other',
                         'Restaurant' : 'Government',
                         'School' : 'Government',
                         'Security' : 'Other',
                         'Security Ministries' : 'Other',
                         'Self-employed' : 'Other',
                         'Services' : 'Service',
                         'Telecom' : 'Business',
                         'Trade: type 1' : 'Trade',
                         'Trade: type 2' : 'Trade',
                         'Trade: type 3' : 'Trade',
                         'Trade: type 4' : 'Trade',
                         'Trade: type 5' : 'Trade',
                         'Trade: type 6' : 'Trade',
                         'Trade: type 7' : 'Trade',
                         'Transport: type 1' : 'Service',
                         'Transport: type 2' : 'Service',
                         'Transport: type 3' : 'Service',
                         'Transport: type 4' : 'Service',
                         'University' : 'Government',
                         'XNA' : 'XNA'})
In [ ]:
for att in over_5_unique:

  print(f'{att}:')

  column_total = appsTrainDF[att].shape[0]

  for v in appsTrainDF[att].unique():

    print(f"Rows for {v}: {sum(appsTrainDF[att] == v)} - {round(100 * (sum(appsTrainDF[att] == v) / column_total))}%")
NAME_TYPE_SUITE:
Rows for Unaccompanied: 248526 - 81%
Rows for other: 57693 - 19%
Rows for nan: 0 - 0%
NAME_INCOME_TYPE:
Rows for Working: 158774 - 52%
Rows for State servant: 21703 - 7%
Rows for Commercial associate: 71617 - 23%
Rows for Pensioner: 55362 - 18%
Rows for other: 55 - 0%
NAME_FAMILY_STATUS:
Rows for Not Married: 81304 - 26%
Rows for Married: 226207 - 74%
NAME_HOUSING_TYPE:
Rows for House / apartment: 272868 - 89%
Rows for other: 34643 - 11%
OCCUPATION_TYPE:
Rows for Service Industry: 97202 - 32%
Rows for Office: 113918 - 37%
Rows for nan: 0 - 0%
WEEKDAY_APPR_PROCESS_START:
Rows for Weekday: 257478 - 84%
Rows for Weekend: 50033 - 16%
ORGANIZATION_TYPE:
Rows for Business: 89340 - 29%
Rows for Government: 48200 - 16%
Rows for Other: 64055 - 21%
Rows for XNA: 55374 - 18%
Rows for Industrial: 24436 - 8%
Rows for Service: 11791 - 4%
Rows for Trade: 14315 - 5%
WALLSMATERIAL_MODE:
Rows for Stone, brick: 64815 - 21%
Rows for Block: 9253 - 3%
Rows for nan: 0 - 0%
Rows for Panel: 66040 - 21%
Rows for Mixed: 2296 - 1%
Rows for Wooden: 5362 - 2%
Rows for Others: 1625 - 1%
Rows for Monolithic: 1779 - 1%
In [ ]:
# # Convert categorical features to numerical approximations (via pipeline)
# class ClaimAttributesAdder(BaseEstimator, TransformerMixin):
#     def fit(self, X, y=None):
#         return self
#     def transform(self, X, y=None): 
#         charlson_idx_dt = {'0': 0, '1-2': 2, '3-4': 4, '5+': 6}
#         los_dt = {'1 day': 1, '2 days': 2, '3 days': 3, '4 days': 4, '5 days': 5, '6 days': 6,
#           '1- 2 weeks': 11, '2- 4 weeks': 21, '4- 8 weeks': 42, '26+ weeks': 180}
#         X['PayDelay'] = X['PayDelay'].apply(lambda x: int(x) if x != '162+' else int(162))
#         X['DSFS'] = X['DSFS'].apply(lambda x: None if pd.isnull(x) else int(x[0]) + 1)
#         X['CharlsonIndex'] = X['CharlsonIndex'].apply(lambda x: charlson_idx_dt[x])
#         X['LengthOfStay'] = X['LengthOfStay'].apply(lambda x: None if pd.isnull(x) else los_dt[x])
#         return X
    
In [ ]:
print(appsTrainDF.shape)
print(X_kaggle_test.shape)
(307511, 511)
(48744, 510)

Final Application_train pipeline

In [ ]:
# set application pipeline
application_pipe = Pipeline([
    ('app_train_features', ApplicationTrainTestFeaturesAdder()),
    ('ohe', getDummies()),
    ('missing data remover', MissingFeatureRemover()),
    ('collinearity remover', CollinearFeatureRemover()),
    ('near zero variance remover', NearZeroVarianceFeatureRemover())
])
In [ ]:
appsTrainDF = application_pipe.fit_transform(appsTrainDF)
X_kaggle_test = application_pipe.transform(X_kaggle_test)
In [ ]:
print(appsTrainDF.shape)
print(X_kaggle_test.shape)
(307511, 497)
(48744, 495)
In [ ]:
drop_uncommon=(list(set(appsTrainDF.columns.tolist()) - set(X_kaggle_test.columns.tolist())))
drop_uncommon.remove('TARGET')
drop_uncommon
Out[ ]:
['CODE_GENDER_XNA']
In [ ]:
appsTrainDF=appsTrainDF.drop(columns=drop_uncommon)
In [ ]:
print(appsTrainDF.shape)
print(X_kaggle_test.shape)
(307511, 496)
(48744, 495)

Output Dataframes to files

In [ ]:
appsTrainDF.to_csv("/content/drive/My Drive/AML Project/Data/appsTrainDF.csv",index=False)
X_kaggle_test.to_csv("/content/drive/My Drive/AML Project/Data/X_kaggle_test.csv",index=False)

Data Preparation Ends here with all numeric aggregated features and polynomial features all accumulation to :

Application_train -- (307511, 496)
Application_test -- (48744, 495)

Data Summary

In [ ]:
print(appsTrainDF.shape)
print(X_kaggle_test.shape) 
(307511, 496)
(48744, 495)

Total numeric features in the application train df.

In [ ]:
appsTrainDF.select_dtypes(exclude=['object']).columns
Out[ ]:
Index(['SK_ID_CURR', 'TARGET', 'CNT_CHILDREN', 'AMT_INCOME_TOTAL',
       'AMT_CREDIT', 'AMT_ANNUITY', 'REGION_POPULATION_RELATIVE', 'DAYS_BIRTH',
       'DAYS_EMPLOYED', 'DAYS_REGISTRATION',
       ...
       'WALLSMATERIAL_MODE_Monolithic', 'WALLSMATERIAL_MODE_Others',
       'WALLSMATERIAL_MODE_Panel', 'WALLSMATERIAL_MODE_Stone, brick',
       'WALLSMATERIAL_MODE_Wooden', 'EMERGENCYSTATE_MODE_No',
       'HAS_LIBAILITY_0', 'HAS_LIBAILITY_1', 'HAS_LIBAILITY_2',
       'HAS_LIBAILITY_3'],
      dtype='object', length=496)

Total Categorical features in the application train df.

In [ ]:
appsTrainDF.select_dtypes(include=['object']).columns
Out[ ]:
Index([], dtype='object')

Deductions from the list of dtypes of the appsTrainDF

  • There 705 numerical features.
  • There are 44 categorical features which have been encoded.
In [ ]:
appsTrainDF.dtypes.value_counts()
Out[ ]:
float64    414
uint8       43
int64       39
dtype: int64
In [ ]:
start = time()
correlation_with_all_features = appsTrainDF.corr()
end = time()
In [ ]:
print("Time taken for correlation ", ctime(end - start))
print()
correlation_with_all_features['TARGET'].sort_values()
Time taken for correlation  Thu Jan  1 00:03:21 1970

Out[ ]:
EXT_SOURCE_3                                              -0.178919
EXT_SOURCE_2                                              -0.160472
EXT_SOURCE_1                                              -0.155317
OCCUPATION_TYPE_Office                                    -0.066085
previous_application_NAME_CONTRACT_STATUS_Approved_mean   -0.063521
                                                             ...   
bureau_CREDIT_ACTIVE_Active_mean                           0.077356
previous_application_NAME_CONTRACT_STATUS_Refused_mean     0.077671
DAYS_BIRTH                                                 0.078239
bureau_DAYS_CREDIT_mean                                    0.089729
TARGET                                                     1.000000
Name: TARGET, Length: 496, dtype: float64
In [ ]:
# correlation_with_all_features.reset_index(inplace= True)
len(correlation_with_all_features.index)
Out[ ]:
496
In [ ]:
# set this value to choose the number of positive and negative correlated features
n_val = 50


print("---"*50)
print("---"*50)

print("    Total correlation of all the features.    " )

print("---"*50)
print("---"*50)

print(f"Top {n_val} negative correlated features")
print()
print(correlation_with_all_features.TARGET.sort_values(ascending = True).head(n_val))
print()
print()
print(f"Top {n_val} positive correlated features")
print()
print(correlation_with_all_features.TARGET.sort_values(ascending = True).tail(n_val))
------------------------------------------------------------------------------------------------------------------------------------------------------
------------------------------------------------------------------------------------------------------------------------------------------------------
    Total correlation of all the features.    
------------------------------------------------------------------------------------------------------------------------------------------------------
------------------------------------------------------------------------------------------------------------------------------------------------------
Top 50 negative correlated features

EXT_SOURCE_3                                                               -0.178919
EXT_SOURCE_2                                                               -0.160472
EXT_SOURCE_1                                                               -0.155317
OCCUPATION_TYPE_Office                                                     -0.066085
previous_application_NAME_CONTRACT_STATUS_Approved_mean                    -0.063521
NAME_EDUCATION_TYPE_Higher education                                       -0.056593
CODE_GENDER_F                                                              -0.054704
previous_application_DAYS_FIRST_DRAWING_mean                               -0.048803
DAYS_EMPLOYED                                                              -0.044932
previous_application_DAYS_FIRST_DRAWING_min                                -0.044643
FLOORSMAX_AVG                                                              -0.044003
previous_application_RATE_DOWN_PAYMENT_sum                                 -0.041693
previous_application_NAME_YIELD_GROUP_low_normal_mean                      -0.041134
previous_application_RATE_DOWN_PAYMENT_max                                 -0.040096
previous_application_INTEREST_RT_sum                                       -0.039533
previous_application_PRODUCT_COMBINATION_Cash X-Sell: low_mean             -0.037494
REGION_POPULATION_RELATIVE                                                 -0.037227
previous_application_INTEREST_RT_mean                                      -0.036350
previous_application_HOUR_APPR_PROCESS_START_mean                          -0.035927
previous_application_AMT_ANNUITY_mean                                      -0.034871
previous_application_NAME_PAYMENT_TYPE_Cash through the bank_mean          -0.034669
ELEVATORS_AVG                                                              -0.034199
previous_application_PRODUCT_COMBINATION_POS industry with interest_mean   -0.033942
previous_application_RATE_DOWN_PAYMENT_mean                                -0.033601
previous_application_NAME_CONTRACT_TYPE_Consumer loans_mean                -0.032624
previous_application_AMT_ANNUITY_min                                       -0.032249
previous_application_DAYS_FIRST_DRAWING_count                              -0.031833
previous_application_HOUR_APPR_PROCESS_START_min                           -0.031427
previous_application_HOUR_APPR_PROCESS_START_max                           -0.030847
previous_application_PRODUCT_COMBINATION_POS industry with interest_sum    -0.030827
AMT_CREDIT                                                                 -0.030369
previous_application_NAME_GOODS_CATEGORY_Furniture_mean                    -0.030174
APARTMENTS_AVG                                                             -0.029498
previous_application_NAME_YIELD_GROUP_low_action_mean                      -0.029340
previous_application_AMT_ANNUITY_max                                       -0.028966
previous_application_NAME_GOODS_CATEGORY_Furniture_sum                     -0.028924
FLAG_DOCUMENT_6                                                            -0.028602
NAME_HOUSING_TYPE_House / apartment                                        -0.028555
previous_application_NAME_YIELD_GROUP_low_normal_sum                       -0.028017
previous_application_CREDIT_SUCCESS_sum                                    -0.027266
previous_application_NAME_CLIENT_TYPE_Refreshed_mean                       -0.026820
bureau_CREDIT_TYPE_Consumer credit_mean                                    -0.026258
previous_application_AMT_DOWN_PAYMENT_max                                  -0.025290
previous_application_NAME_YIELD_GROUP_low_action_sum                       -0.025025
HOUR_APPR_PROCESS_START                                                    -0.024166
FLAG_PHONE                                                                 -0.023806
previous_application_AMT_DOWN_PAYMENT_count                                -0.023725
NAME_INCOME_TYPE_State servant                                             -0.023447
previous_application_PRODUCT_COMBINATION_Cash X-Sell: low_sum              -0.023395
previous_application_INTEREST_PER_CREDIT_min                               -0.023315
Name: TARGET, dtype: float64


Top 50 positive correlated features

previous_application_CHANNEL_TYPE_AP+ (Cash loan)_sum             0.034715
bureau_CREDIT_TYPE_Credit card_sum                                0.034818
previous_application_CHANNEL_TYPE_AP+ (Cash loan)_mean            0.034828
previous_application_PRODUCT_COMBINATION_Cash X-Sell: high_sum    0.036291
bureau_DAYS_CREDIT_ENDDATE_max                                    0.036590
previous_application_NAME_YIELD_GROUP_high_sum                    0.036931
previous_application_NAME_YIELD_GROUP_high_mean                   0.037568
previous_application_NAME_PAYMENT_TYPE_XNA_sum                    0.039469
previous_application_CODE_REJECT_REASON_LIMIT_mean                0.039842
previous_application_PRODUCT_COMBINATION_Card Street_mean         0.040242
previous_application_CODE_REJECT_REASON_LIMIT_sum                 0.040503
DAYS_REGISTRATION                                                 0.041975
bureau_DAYS_CREDIT_sum                                            0.042000
previous_application_NAME_YIELD_GROUP_XNA_mean                    0.042848
bureau_DAYS_CREDIT_UPDATE_min                                     0.042864
FLAG_DOCUMENT_3                                                   0.044346
REG_CITY_NOT_LIVE_CITY                                            0.044395
bureau_CREDIT_TYPE_Microloan_mean                                 0.044439
previous_application_NAME_CONTRACT_TYPE_Revolving loans_sum       0.045602
previous_application_NAME_CLIENT_TYPE_New_sum                     0.046048
previous_application_DAYS_DECISION_mean                           0.046864
bureau_DAYS_CREDIT_ENDDATE_mean                                   0.046983
previous_application_CODE_REJECT_REASON_HC_sum                    0.047067
previous_application_PRODUCT_COMBINATION_Card Street_sum          0.047953
bureau_DAYS_CREDIT_max                                            0.049782
NAME_EDUCATION_TYPE_Secondary / secondary special                 0.049824
REG_CITY_NOT_WORK_CITY                                            0.050994
DAYS_ID_PUBLISH                                                   0.051457
bureau_DAYS_ENDDATE_FACT_mean                                     0.053200
previous_application_DAYS_DECISION_min                            0.053434
bureau_DAYS_CREDIT_ENDDATE_sum                                    0.053735
previous_application_CODE_REJECT_REASON_HC_mean                   0.054531
DAYS_LAST_PHONE_CHANGE                                            0.055218
previous_application_CODE_REJECT_REASON_SCOFR_mean                0.055865
bureau_DAYS_ENDDATE_FACT_min                                      0.055887
previous_application_CODE_REJECT_REASON_SCOFR_sum                 0.056284
previous_application_NAME_PRODUCT_TYPE_walk-in_mean               0.057412
NAME_INCOME_TYPE_Working                                          0.057481
REGION_RATING_CLIENT                                              0.058899
previous_application_NAME_PRODUCT_TYPE_walk-in_sum                0.062628
previous_application_NAME_CONTRACT_STATUS_Refused_sum             0.064469
bureau_CREDIT_ACTIVE_Active_sum                                   0.067128
bureau_DAYS_CREDIT_UPDATE_mean                                    0.068927
previous_application_INTEREST_PER_CREDIT_max                      0.069125
bureau_DAYS_CREDIT_min                                            0.075248
bureau_CREDIT_ACTIVE_Active_mean                                  0.077356
previous_application_NAME_CONTRACT_STATUS_Refused_mean            0.077671
DAYS_BIRTH                                                        0.078239
bureau_DAYS_CREDIT_mean                                           0.089729
TARGET                                                            1.000000
Name: TARGET, dtype: float64
In [ ]:
correlation_with_all_features.TARGET.sort_values(ascending = True)[-n_val:]
Out[ ]:
previous_application_CHANNEL_TYPE_AP+ (Cash loan)_sum             0.034715
bureau_CREDIT_TYPE_Credit card_sum                                0.034818
previous_application_CHANNEL_TYPE_AP+ (Cash loan)_mean            0.034828
previous_application_PRODUCT_COMBINATION_Cash X-Sell: high_sum    0.036291
bureau_DAYS_CREDIT_ENDDATE_max                                    0.036590
previous_application_NAME_YIELD_GROUP_high_sum                    0.036931
previous_application_NAME_YIELD_GROUP_high_mean                   0.037568
previous_application_NAME_PAYMENT_TYPE_XNA_sum                    0.039469
previous_application_CODE_REJECT_REASON_LIMIT_mean                0.039842
previous_application_PRODUCT_COMBINATION_Card Street_mean         0.040242
previous_application_CODE_REJECT_REASON_LIMIT_sum                 0.040503
DAYS_REGISTRATION                                                 0.041975
bureau_DAYS_CREDIT_sum                                            0.042000
previous_application_NAME_YIELD_GROUP_XNA_mean                    0.042848
bureau_DAYS_CREDIT_UPDATE_min                                     0.042864
FLAG_DOCUMENT_3                                                   0.044346
REG_CITY_NOT_LIVE_CITY                                            0.044395
bureau_CREDIT_TYPE_Microloan_mean                                 0.044439
previous_application_NAME_CONTRACT_TYPE_Revolving loans_sum       0.045602
previous_application_NAME_CLIENT_TYPE_New_sum                     0.046048
previous_application_DAYS_DECISION_mean                           0.046864
bureau_DAYS_CREDIT_ENDDATE_mean                                   0.046983
previous_application_CODE_REJECT_REASON_HC_sum                    0.047067
previous_application_PRODUCT_COMBINATION_Card Street_sum          0.047953
bureau_DAYS_CREDIT_max                                            0.049782
NAME_EDUCATION_TYPE_Secondary / secondary special                 0.049824
REG_CITY_NOT_WORK_CITY                                            0.050994
DAYS_ID_PUBLISH                                                   0.051457
bureau_DAYS_ENDDATE_FACT_mean                                     0.053200
previous_application_DAYS_DECISION_min                            0.053434
bureau_DAYS_CREDIT_ENDDATE_sum                                    0.053735
previous_application_CODE_REJECT_REASON_HC_mean                   0.054531
DAYS_LAST_PHONE_CHANGE                                            0.055218
previous_application_CODE_REJECT_REASON_SCOFR_mean                0.055865
bureau_DAYS_ENDDATE_FACT_min                                      0.055887
previous_application_CODE_REJECT_REASON_SCOFR_sum                 0.056284
previous_application_NAME_PRODUCT_TYPE_walk-in_mean               0.057412
NAME_INCOME_TYPE_Working                                          0.057481
REGION_RATING_CLIENT                                              0.058899
previous_application_NAME_PRODUCT_TYPE_walk-in_sum                0.062628
previous_application_NAME_CONTRACT_STATUS_Refused_sum             0.064469
bureau_CREDIT_ACTIVE_Active_sum                                   0.067128
bureau_DAYS_CREDIT_UPDATE_mean                                    0.068927
previous_application_INTEREST_PER_CREDIT_max                      0.069125
bureau_DAYS_CREDIT_min                                            0.075248
bureau_CREDIT_ACTIVE_Active_mean                                  0.077356
previous_application_NAME_CONTRACT_STATUS_Refused_mean            0.077671
DAYS_BIRTH                                                        0.078239
bureau_DAYS_CREDIT_mean                                           0.089729
TARGET                                                            1.000000
Name: TARGET, dtype: float64
In [ ]:
gc.collect()
Out[ ]:
217
In [ ]:
corrn=correlation_with_all_features.TARGET.sort_values(ascending = True).head(n_val).index.tolist()
corrp=correlation_with_all_features.TARGET.sort_values(ascending = True).tail(n_val).index.tolist()
corr=corrn + corrp
corr.remove('TARGET')
print(len(corr))
corr
99
Out[ ]:
['EXT_SOURCE_3',
 'EXT_SOURCE_2',
 'EXT_SOURCE_1',
 'OCCUPATION_TYPE_Office',
 'previous_application_NAME_CONTRACT_STATUS_Approved_mean',
 'NAME_EDUCATION_TYPE_Higher education',
 'CODE_GENDER_F',
 'previous_application_DAYS_FIRST_DRAWING_mean',
 'DAYS_EMPLOYED',
 'previous_application_DAYS_FIRST_DRAWING_min',
 'FLOORSMAX_AVG',
 'previous_application_RATE_DOWN_PAYMENT_sum',
 'previous_application_NAME_YIELD_GROUP_low_normal_mean',
 'previous_application_RATE_DOWN_PAYMENT_max',
 'previous_application_INTEREST_RT_sum',
 'previous_application_PRODUCT_COMBINATION_Cash X-Sell: low_mean',
 'REGION_POPULATION_RELATIVE',
 'previous_application_INTEREST_RT_mean',
 'previous_application_HOUR_APPR_PROCESS_START_mean',
 'previous_application_AMT_ANNUITY_mean',
 'previous_application_NAME_PAYMENT_TYPE_Cash through the bank_mean',
 'ELEVATORS_AVG',
 'previous_application_PRODUCT_COMBINATION_POS industry with interest_mean',
 'previous_application_RATE_DOWN_PAYMENT_mean',
 'previous_application_NAME_CONTRACT_TYPE_Consumer loans_mean',
 'previous_application_AMT_ANNUITY_min',
 'previous_application_DAYS_FIRST_DRAWING_count',
 'previous_application_HOUR_APPR_PROCESS_START_min',
 'previous_application_HOUR_APPR_PROCESS_START_max',
 'previous_application_PRODUCT_COMBINATION_POS industry with interest_sum',
 'AMT_CREDIT',
 'previous_application_NAME_GOODS_CATEGORY_Furniture_mean',
 'APARTMENTS_AVG',
 'previous_application_NAME_YIELD_GROUP_low_action_mean',
 'previous_application_AMT_ANNUITY_max',
 'previous_application_NAME_GOODS_CATEGORY_Furniture_sum',
 'FLAG_DOCUMENT_6',
 'NAME_HOUSING_TYPE_House / apartment',
 'previous_application_NAME_YIELD_GROUP_low_normal_sum',
 'previous_application_CREDIT_SUCCESS_sum',
 'previous_application_NAME_CLIENT_TYPE_Refreshed_mean',
 'bureau_CREDIT_TYPE_Consumer credit_mean',
 'previous_application_AMT_DOWN_PAYMENT_max',
 'previous_application_NAME_YIELD_GROUP_low_action_sum',
 'HOUR_APPR_PROCESS_START',
 'FLAG_PHONE',
 'previous_application_AMT_DOWN_PAYMENT_count',
 'NAME_INCOME_TYPE_State servant',
 'previous_application_PRODUCT_COMBINATION_Cash X-Sell: low_sum',
 'previous_application_INTEREST_PER_CREDIT_min',
 'previous_application_CHANNEL_TYPE_AP+ (Cash loan)_sum',
 'bureau_CREDIT_TYPE_Credit card_sum',
 'previous_application_CHANNEL_TYPE_AP+ (Cash loan)_mean',
 'previous_application_PRODUCT_COMBINATION_Cash X-Sell: high_sum',
 'bureau_DAYS_CREDIT_ENDDATE_max',
 'previous_application_NAME_YIELD_GROUP_high_sum',
 'previous_application_NAME_YIELD_GROUP_high_mean',
 'previous_application_NAME_PAYMENT_TYPE_XNA_sum',
 'previous_application_CODE_REJECT_REASON_LIMIT_mean',
 'previous_application_PRODUCT_COMBINATION_Card Street_mean',
 'previous_application_CODE_REJECT_REASON_LIMIT_sum',
 'DAYS_REGISTRATION',
 'bureau_DAYS_CREDIT_sum',
 'previous_application_NAME_YIELD_GROUP_XNA_mean',
 'bureau_DAYS_CREDIT_UPDATE_min',
 'FLAG_DOCUMENT_3',
 'REG_CITY_NOT_LIVE_CITY',
 'bureau_CREDIT_TYPE_Microloan_mean',
 'previous_application_NAME_CONTRACT_TYPE_Revolving loans_sum',
 'previous_application_NAME_CLIENT_TYPE_New_sum',
 'previous_application_DAYS_DECISION_mean',
 'bureau_DAYS_CREDIT_ENDDATE_mean',
 'previous_application_CODE_REJECT_REASON_HC_sum',
 'previous_application_PRODUCT_COMBINATION_Card Street_sum',
 'bureau_DAYS_CREDIT_max',
 'NAME_EDUCATION_TYPE_Secondary / secondary special',
 'REG_CITY_NOT_WORK_CITY',
 'DAYS_ID_PUBLISH',
 'bureau_DAYS_ENDDATE_FACT_mean',
 'previous_application_DAYS_DECISION_min',
 'bureau_DAYS_CREDIT_ENDDATE_sum',
 'previous_application_CODE_REJECT_REASON_HC_mean',
 'DAYS_LAST_PHONE_CHANGE',
 'previous_application_CODE_REJECT_REASON_SCOFR_mean',
 'bureau_DAYS_ENDDATE_FACT_min',
 'previous_application_CODE_REJECT_REASON_SCOFR_sum',
 'previous_application_NAME_PRODUCT_TYPE_walk-in_mean',
 'NAME_INCOME_TYPE_Working',
 'REGION_RATING_CLIENT',
 'previous_application_NAME_PRODUCT_TYPE_walk-in_sum',
 'previous_application_NAME_CONTRACT_STATUS_Refused_sum',
 'bureau_CREDIT_ACTIVE_Active_sum',
 'bureau_DAYS_CREDIT_UPDATE_mean',
 'previous_application_INTEREST_PER_CREDIT_max',
 'bureau_DAYS_CREDIT_min',
 'bureau_CREDIT_ACTIVE_Active_mean',
 'previous_application_NAME_CONTRACT_STATUS_Refused_mean',
 'DAYS_BIRTH',
 'bureau_DAYS_CREDIT_mean']

Processing pipeline

Load Merged Files

In [201]:
DATA_DIR='/content/drive/My Drive/AML Project/Data/Phase3'
In [216]:
%%time
ds_names = ('appsTrainDF', 'X_kaggle_test')

for ds_name in ds_names:
    datasets[ds_name]= load_data(os.path.join(DATA_DIR, f'{ds_name}.csv'), ds_name)
appsTrainDF: shape is (307511, 496)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 307511 entries, 0 to 307510
Columns: 496 entries, SK_ID_CURR to HAS_LIBAILITY_3
dtypes: float64(414), int64(82)
memory usage: 1.1 GB
None
SK_ID_CURR TARGET CNT_CHILDREN AMT_INCOME_TOTAL AMT_CREDIT AMT_ANNUITY REGION_POPULATION_RELATIVE DAYS_BIRTH DAYS_EMPLOYED DAYS_REGISTRATION DAYS_ID_PUBLISH FLAG_MOBIL FLAG_WORK_PHONE FLAG_CONT_MOBILE FLAG_PHONE FLAG_EMAIL CNT_FAM_MEMBERS REGION_RATING_CLIENT HOUR_APPR_PROCESS_START REG_REGION_NOT_LIVE_REGION REG_REGION_NOT_WORK_REGION LIVE_REGION_NOT_WORK_REGION REG_CITY_NOT_LIVE_CITY REG_CITY_NOT_WORK_CITY LIVE_CITY_NOT_WORK_CITY EXT_SOURCE_1 EXT_SOURCE_2 EXT_SOURCE_3 APARTMENTS_AVG BASEMENTAREA_AVG YEARS_BEGINEXPLUATATION_AVG ELEVATORS_AVG ENTRANCES_AVG FLOORSMAX_AVG LANDAREA_AVG NONLIVINGAREA_AVG OBS_30_CNT_SOCIAL_CIRCLE DEF_30_CNT_SOCIAL_CIRCLE DEF_60_CNT_SOCIAL_CIRCLE DAYS_LAST_PHONE_CHANGE ... FLAG_OWN_REALTY_N NAME_TYPE_SUITE_Unaccompanied NAME_INCOME_TYPE_Commercial associate NAME_INCOME_TYPE_State servant NAME_INCOME_TYPE_Working NAME_INCOME_TYPE_other NAME_EDUCATION_TYPE_Academic degree NAME_EDUCATION_TYPE_Higher education NAME_EDUCATION_TYPE_Incomplete higher NAME_EDUCATION_TYPE_Lower secondary NAME_EDUCATION_TYPE_Secondary / secondary special NAME_FAMILY_STATUS_Married NAME_HOUSING_TYPE_House / apartment OCCUPATION_TYPE_Office WEEKDAY_APPR_PROCESS_START_Weekday ORGANIZATION_TYPE_Business ORGANIZATION_TYPE_Government ORGANIZATION_TYPE_Industrial ORGANIZATION_TYPE_Other ORGANIZATION_TYPE_Service ORGANIZATION_TYPE_Trade FONDKAPREMONT_MODE_not specified FONDKAPREMONT_MODE_org spec account FONDKAPREMONT_MODE_reg oper account FONDKAPREMONT_MODE_reg oper spec account HOUSETYPE_MODE_block of flats HOUSETYPE_MODE_specific housing HOUSETYPE_MODE_terraced house WALLSMATERIAL_MODE_Block WALLSMATERIAL_MODE_Mixed WALLSMATERIAL_MODE_Monolithic WALLSMATERIAL_MODE_Others WALLSMATERIAL_MODE_Panel WALLSMATERIAL_MODE_Stone, brick WALLSMATERIAL_MODE_Wooden EMERGENCYSTATE_MODE_No HAS_LIBAILITY_0 HAS_LIBAILITY_1 HAS_LIBAILITY_2 HAS_LIBAILITY_3
0 100002 1 0 202500.0 406597.5 24700.5 0.018801 -9461 -637 -3648.0 -2120 1 0 1 1 0 1.0 2 10 0 0 0 0 0 0 0.083037 0.262949 0.139376 0.0247 0.0369 0.9722 0.00 0.0690 0.0833 0.0369 0.0000 2.0 2.0 2.0 -1134.0 ... 0 1 0 0 1 0 0 0 0 0 1 0 1 0 1 1 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 1 0 1 0 1 0 0
1 100003 0 0 270000.0 1293502.5 35698.5 0.003541 -16765 -1188 -1186.0 -291 1 0 1 1 0 2.0 1 11 0 0 0 0 0 0 0.311267 0.622246 NaN 0.0959 0.0529 0.9851 0.08 0.0345 0.2917 0.0130 0.0098 1.0 0.0 0.0 -828.0 ... 1 0 0 1 0 0 0 1 0 0 0 1 1 1 1 0 1 0 0 0 0 0 0 1 0 1 0 0 1 0 0 0 0 0 0 1 0 0 0 1
2 100004 0 0 67500.0 135000.0 6750.0 0.010032 -19046 -225 -4260.0 -2531 1 1 1 1 0 1.0 2 9 0 0 0 0 0 0 NaN 0.555912 0.729567 NaN NaN NaN NaN NaN NaN NaN NaN 0.0 0.0 0.0 -815.0 ... 0 1 0 0 1 0 0 0 0 0 1 0 1 0 1 0 1 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 1 0 0 1 1 0 0 0
3 100006 0 0 135000.0 312682.5 29686.5 0.008019 -19005 -3039 -9833.0 -2437 1 0 1 0 0 2.0 2 17 0 0 0 0 0 0 NaN 0.650442 NaN NaN NaN NaN NaN NaN NaN NaN NaN 2.0 0.0 0.0 -617.0 ... 0 1 0 0 1 0 0 0 0 0 1 1 1 0 1 1 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 1 0 0 1 0 1 0 0
4 100007 0 0 121500.0 513000.0 21865.5 0.028663 -19932 -3038 -4311.0 -3458 1 0 1 0 0 1.0 2 11 0 0 0 0 1 1 NaN 0.322738 NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.0 0.0 0.0 -1106.0 ... 0 1 0 0 1 0 0 0 0 0 1 0 1 1 1 0 0 0 1 0 0 0 0 1 0 1 0 0 0 0 0 0 1 0 0 1 0 1 0 0

5 rows × 496 columns

X_kaggle_test1: shape is (48744, 495)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48744 entries, 0 to 48743
Columns: 495 entries, SK_ID_CURR to HAS_LIBAILITY_3
dtypes: float64(414), int64(81)
memory usage: 184.1 MB
None
SK_ID_CURR CNT_CHILDREN AMT_INCOME_TOTAL AMT_CREDIT AMT_ANNUITY REGION_POPULATION_RELATIVE DAYS_BIRTH DAYS_EMPLOYED DAYS_REGISTRATION DAYS_ID_PUBLISH FLAG_MOBIL FLAG_WORK_PHONE FLAG_CONT_MOBILE FLAG_PHONE FLAG_EMAIL CNT_FAM_MEMBERS REGION_RATING_CLIENT HOUR_APPR_PROCESS_START REG_REGION_NOT_LIVE_REGION REG_REGION_NOT_WORK_REGION LIVE_REGION_NOT_WORK_REGION REG_CITY_NOT_LIVE_CITY REG_CITY_NOT_WORK_CITY LIVE_CITY_NOT_WORK_CITY EXT_SOURCE_1 EXT_SOURCE_2 EXT_SOURCE_3 APARTMENTS_AVG BASEMENTAREA_AVG YEARS_BEGINEXPLUATATION_AVG ELEVATORS_AVG ENTRANCES_AVG FLOORSMAX_AVG LANDAREA_AVG NONLIVINGAREA_AVG OBS_30_CNT_SOCIAL_CIRCLE DEF_30_CNT_SOCIAL_CIRCLE DEF_60_CNT_SOCIAL_CIRCLE DAYS_LAST_PHONE_CHANGE FLAG_DOCUMENT_2 ... FLAG_OWN_REALTY_N NAME_TYPE_SUITE_Unaccompanied NAME_INCOME_TYPE_Commercial associate NAME_INCOME_TYPE_State servant NAME_INCOME_TYPE_Working NAME_INCOME_TYPE_other NAME_EDUCATION_TYPE_Academic degree NAME_EDUCATION_TYPE_Higher education NAME_EDUCATION_TYPE_Incomplete higher NAME_EDUCATION_TYPE_Lower secondary NAME_EDUCATION_TYPE_Secondary / secondary special NAME_FAMILY_STATUS_Married NAME_HOUSING_TYPE_House / apartment OCCUPATION_TYPE_Office WEEKDAY_APPR_PROCESS_START_Weekday ORGANIZATION_TYPE_Business ORGANIZATION_TYPE_Government ORGANIZATION_TYPE_Industrial ORGANIZATION_TYPE_Other ORGANIZATION_TYPE_Service ORGANIZATION_TYPE_Trade FONDKAPREMONT_MODE_not specified FONDKAPREMONT_MODE_org spec account FONDKAPREMONT_MODE_reg oper account FONDKAPREMONT_MODE_reg oper spec account HOUSETYPE_MODE_block of flats HOUSETYPE_MODE_specific housing HOUSETYPE_MODE_terraced house WALLSMATERIAL_MODE_Block WALLSMATERIAL_MODE_Mixed WALLSMATERIAL_MODE_Monolithic WALLSMATERIAL_MODE_Others WALLSMATERIAL_MODE_Panel WALLSMATERIAL_MODE_Stone, brick WALLSMATERIAL_MODE_Wooden EMERGENCYSTATE_MODE_No HAS_LIBAILITY_0 HAS_LIBAILITY_1 HAS_LIBAILITY_2 HAS_LIBAILITY_3
0 100001 0 135000.0 568800.0 20560.5 0.018850 -19241 -2329 -5170.0 -812 1 0 1 0 1 2.0 2 18 0 0 0 0 0 0 0.752614 0.789654 0.159520 0.0660 0.0590 0.9732 NaN 0.1379 0.125 NaN NaN 0.0 0.0 0.0 -1740.0 0 ... 0 1 0 0 1 0 0 1 0 0 0 1 1 1 1 0 1 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 1 0 1 0 1 0 0
1 100005 0 99000.0 222768.0 17370.0 0.035792 -18064 -4469 -9118.0 -1623 1 0 1 0 0 2.0 2 9 0 0 0 0 0 0 0.564990 0.291656 0.432962 NaN NaN NaN NaN NaN NaN NaN NaN 0.0 0.0 0.0 0.0 0 ... 0 1 0 0 1 0 0 0 0 0 1 1 1 0 1 0 0 0 1 0 0 0 0 1 0 1 0 0 0 0 0 0 1 0 0 1 0 1 0 0
2 100013 0 202500.0 663264.0 69777.0 0.019101 -20038 -4458 -2175.0 -3503 1 0 1 0 0 2.0 2 14 0 0 0 0 0 0 NaN 0.699787 0.610991 NaN NaN NaN NaN NaN NaN NaN NaN 0.0 0.0 0.0 -856.0 0 ... 0 1 0 0 1 0 0 1 0 0 0 1 1 0 1 0 0 0 0 1 0 0 0 1 0 1 0 0 0 0 0 0 1 0 0 1 1 0 0 0
3 100028 2 315000.0 1575000.0 49018.5 0.026392 -13976 -1866 -2000.0 -4208 1 0 1 1 0 4.0 2 11 0 0 0 0 0 0 0.525734 0.509677 0.612704 0.3052 0.1974 0.9970 0.32 0.2759 0.375 0.2042 0.08 0.0 0.0 0.0 -1805.0 0 ... 0 1 0 0 1 0 0 0 0 0 1 1 1 1 1 1 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 1 0 0 1 0 1 0 0
4 100038 1 180000.0 625500.0 32067.0 0.010032 -13040 -2191 -4000.0 -4262 1 1 1 0 0 3.0 2 5 0 0 0 0 1 1 0.202145 0.425687 NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.0 0.0 0.0 -821.0 0 ... 1 1 0 0 1 0 0 0 0 0 1 1 1 1 1 1 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 1 0 0 1 0 0 1 0

5 rows × 495 columns

CPU times: user 20.7 s, sys: 820 ms, total: 21.5 s
Wall time: 22 s
In [218]:
print(datasets['appsTrainDF'].shape)
(307511, 496)
In [219]:
print(datasets['X_kaggle_test'].shape)
(48744, 495)
In [220]:
X_kaggle_test=datasets['X_kaggle_test']
appsTrainDF=datasets['appsTrainDF']
In [ ]:
train_dataset=appsTrainDF
class_labels = ["No Default","Default"]

HCDR Data Pipeline

Column Selector

In [223]:
# Create a class to select numerical or categorical columns since Scikit-Learn doesn't handle DataFrames yet
class DataFrameSelector(BaseEstimator, TransformerMixin):
    def __init__(self, attribute_names):
        self.attribute_names = attribute_names
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X[self.attribute_names].values

Numerical Attributes

Identify the numeric features we wish to consider.

In [224]:
num_attribs=['EXT_SOURCE_3',
 'EXT_SOURCE_2',
 'EXT_SOURCE_1',
 'OCCUPATION_TYPE_Office',
 'previous_application_NAME_CONTRACT_STATUS_Approved_mean',
 'NAME_EDUCATION_TYPE_Higher education',
 'CODE_GENDER_F',
 'previous_application_DAYS_FIRST_DRAWING_mean',
 'DAYS_EMPLOYED',
 'previous_application_DAYS_FIRST_DRAWING_min',
 'FLOORSMAX_AVG',
 'previous_application_RATE_DOWN_PAYMENT_sum',
 'previous_application_NAME_YIELD_GROUP_low_normal_mean',
 'previous_application_RATE_DOWN_PAYMENT_max',
 'previous_application_INTEREST_RT_sum',
 'previous_application_PRODUCT_COMBINATION_Cash X-Sell: low_mean',
 'REGION_POPULATION_RELATIVE',
 'previous_application_INTEREST_RT_mean',
 'previous_application_HOUR_APPR_PROCESS_START_mean',
 'previous_application_AMT_ANNUITY_mean',
 'previous_application_NAME_PAYMENT_TYPE_Cash through the bank_mean',
 'ELEVATORS_AVG',
 'previous_application_PRODUCT_COMBINATION_POS industry with interest_mean',
 'previous_application_RATE_DOWN_PAYMENT_mean',
 'previous_application_NAME_CONTRACT_TYPE_Consumer loans_mean',
 'previous_application_AMT_ANNUITY_min',
 'previous_application_DAYS_FIRST_DRAWING_count',
 'previous_application_HOUR_APPR_PROCESS_START_min',
 'previous_application_HOUR_APPR_PROCESS_START_max',
 'previous_application_PRODUCT_COMBINATION_POS industry with interest_sum',
 'AMT_CREDIT',
 'previous_application_NAME_GOODS_CATEGORY_Furniture_mean',
 'APARTMENTS_AVG',
 'previous_application_NAME_YIELD_GROUP_low_action_mean',
 'previous_application_AMT_ANNUITY_max',
 'previous_application_NAME_GOODS_CATEGORY_Furniture_sum',
 'FLAG_DOCUMENT_6',
 'NAME_HOUSING_TYPE_House / apartment',
 'previous_application_NAME_YIELD_GROUP_low_normal_sum',
 'previous_application_CREDIT_SUCCESS_sum',
 'previous_application_NAME_CLIENT_TYPE_Refreshed_mean',
 'bureau_CREDIT_TYPE_Consumer credit_mean',
 'previous_application_AMT_DOWN_PAYMENT_max',
 'previous_application_NAME_YIELD_GROUP_low_action_sum',
 'HOUR_APPR_PROCESS_START',
 'FLAG_PHONE',
 'previous_application_AMT_DOWN_PAYMENT_count',
 'NAME_INCOME_TYPE_State servant',
 'previous_application_PRODUCT_COMBINATION_Cash X-Sell: low_sum',
 'previous_application_INTEREST_PER_CREDIT_min',
 'previous_application_CHANNEL_TYPE_AP+ (Cash loan)_sum',
 'bureau_CREDIT_TYPE_Credit card_sum',
 'previous_application_CHANNEL_TYPE_AP+ (Cash loan)_mean',
 'previous_application_PRODUCT_COMBINATION_Cash X-Sell: high_sum',
 'bureau_DAYS_CREDIT_ENDDATE_max',
 'previous_application_NAME_YIELD_GROUP_high_sum',
 'previous_application_NAME_YIELD_GROUP_high_mean',
 'previous_application_NAME_PAYMENT_TYPE_XNA_sum',
 'previous_application_CODE_REJECT_REASON_LIMIT_mean',
 'previous_application_PRODUCT_COMBINATION_Card Street_mean',
 'previous_application_CODE_REJECT_REASON_LIMIT_sum',
 'DAYS_REGISTRATION',
 'bureau_DAYS_CREDIT_sum',
 'previous_application_NAME_YIELD_GROUP_XNA_mean',
 'bureau_DAYS_CREDIT_UPDATE_min',
 'FLAG_DOCUMENT_3',
 'REG_CITY_NOT_LIVE_CITY',
 'bureau_CREDIT_TYPE_Microloan_mean',
 'previous_application_NAME_CONTRACT_TYPE_Revolving loans_sum',
 'previous_application_NAME_CLIENT_TYPE_New_sum',
 'previous_application_DAYS_DECISION_mean',
 'bureau_DAYS_CREDIT_ENDDATE_mean',
 'previous_application_CODE_REJECT_REASON_HC_sum',
 'previous_application_PRODUCT_COMBINATION_Card Street_sum',
 'bureau_DAYS_CREDIT_max',
 'NAME_EDUCATION_TYPE_Secondary / secondary special',
 'REG_CITY_NOT_WORK_CITY',
 'DAYS_ID_PUBLISH',
 'bureau_DAYS_ENDDATE_FACT_mean',
 'previous_application_DAYS_DECISION_min',
 'bureau_DAYS_CREDIT_ENDDATE_sum',
 'previous_application_CODE_REJECT_REASON_HC_mean',
 'DAYS_LAST_PHONE_CHANGE',
 'previous_application_CODE_REJECT_REASON_SCOFR_mean',
 'bureau_DAYS_ENDDATE_FACT_min',
 'previous_application_CODE_REJECT_REASON_SCOFR_sum',
 'previous_application_NAME_PRODUCT_TYPE_walk-in_mean',
 'NAME_INCOME_TYPE_Working',
 'REGION_RATING_CLIENT',
 'previous_application_NAME_PRODUCT_TYPE_walk-in_sum',
 'previous_application_NAME_CONTRACT_STATUS_Refused_sum',
 'bureau_CREDIT_ACTIVE_Active_sum',
 'bureau_DAYS_CREDIT_UPDATE_mean',
 'previous_application_INTEREST_PER_CREDIT_max',
 'bureau_DAYS_CREDIT_min',
 'bureau_CREDIT_ACTIVE_Active_mean',
 'previous_application_NAME_CONTRACT_STATUS_Refused_mean',
 'DAYS_BIRTH',
 'bureau_DAYS_CREDIT_mean',
 'previous_application_INTEREST_PER_CREDIT_mean',
'previous_application_CREDIT_SUCCESS_mean',
'previous_application_INTEREST_RT_mean',
'HAS_LIBAILITY_0',
'HAS_LIBAILITY_1',
'HAS_LIBAILITY_2',
'HAS_LIBAILITY_3',
 'FLAG_DOCUMENT_2',
 'FLAG_DOCUMENT_3',
 'FLAG_DOCUMENT_4',
 'FLAG_DOCUMENT_5',
 'FLAG_DOCUMENT_6',
 'FLAG_DOCUMENT_7',
 'FLAG_DOCUMENT_8',
 'FLAG_DOCUMENT_9',
 'FLAG_DOCUMENT_10',
 'FLAG_DOCUMENT_11',
 'FLAG_DOCUMENT_12',
 'FLAG_DOCUMENT_13',
 'FLAG_DOCUMENT_14',
 'FLAG_DOCUMENT_15',
 'FLAG_DOCUMENT_16',
 'FLAG_DOCUMENT_17',
 'FLAG_DOCUMENT_18',
 'FLAG_DOCUMENT_19',
 'FLAG_DOCUMENT_20',
 'FLAG_DOCUMENT_21',
  'AMT_REQ_CREDIT_BUREAU_HOUR',
 'AMT_REQ_CREDIT_BUREAU_DAY',
 'AMT_REQ_CREDIT_BUREAU_WEEK',
 'AMT_REQ_CREDIT_BUREAU_MON',
 'AMT_REQ_CREDIT_BUREAU_QRT',
 'AMT_REQ_CREDIT_BUREAU_YEAR'
]

Numerical Pipeline definition

In [225]:
num_pipeline = Pipeline([
        ('selector', DataFrameSelector(num_attribs)),
        ('imputer', SimpleImputer(strategy='mean')),
        ('std_scaler', StandardScaler()),
    ])

Categorical Attributes

OHE when previously unseen unique values in the test/validation set

Train, validation and Test sets (and the leakage problem we have mentioned previously):

Let's look at a small usecase to tell us how to deal with this:

  • The OneHotEncoder is fitted to the training set, which means that for each unique value present in the training set, for each feature, a new column is created. Let's say we have 39 columns after the encoding up from 30 (before preprocessing).
  • The output is a numpy array (when the option sparse=False is used), which has the disadvantage of losing all the information about the original column names and values.
  • When we try to transform the test set, after having fitted the encoder to the training set, we obtain a ValueError. This is because the there are new, previously unseen unique values in the test set and the encoder doesn’t know how to handle these values. In order to use both the transformed training and test sets in machine learning algorithms, we need them to have the same number of columns.

This last problem can be solved by using the option handle_unknown='ignore'of the OneHotEncoder, which, as the name suggests, will ignore previously unseen values when transforming the test set.

Here is a example that in action:

# Identify the categorical features we wish to consider.
cat_attribs = ['CODE_GENDER', 'FLAG_OWN_REALTY','FLAG_OWN_CAR','NAME_CONTRACT_TYPE', 
               'NAME_EDUCATION_TYPE','OCCUPATION_TYPE','NAME_INCOME_TYPE']

# Notice handle_unknown="ignore" in OHE which ignore values from the validation/test that
# do NOT occur in the training set
cat_pipeline = Pipeline([
        ('selector', DataFrameSelector(cat_attribs)),
        ('imputer', SimpleImputer(strategy='most_frequent')),
        ('ohe', OneHotEncoder(sparse=False, handle_unknown="ignore"))
    ])
In [226]:
# Identify the categorical features we wish to consider.
# cat_attribs = ['CODE_GENDER', 'FLAG_OWN_REALTY','FLAG_OWN_CAR','NAME_CONTRACT_TYPE', 
#                'NAME_EDUCATION_TYPE','OCCUPATION_TYPE','NAME_INCOME_TYPE']
#cat_attribs = train_dataset.select_dtypes(include=['object', ]).columns.tolist()
In [227]:
cat_attribs =[]
In [228]:
gc.collect()
Out[228]:
44

Categorical Pipeline definition

In [229]:
# Notice handle_unknown="ignore" in OHE which ignore values from the validation/test that
# do NOT occur in the training set
cat_pipeline = Pipeline([
        ('selector', DataFrameSelector(cat_attribs)),
        ('imputer', SimpleImputer(strategy='most_frequent')),
        ('ohe', OneHotEncoder(sparse=False, handle_unknown="ignore"))
        #('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
        #('ohe', OneHotEncoder(sparse=False, handle_unknown="ignore"))
    ])

Create Data Preparation Pipeline

With Feature union, combine numerical and categorical Pipeline together to prepare for Data pipeline

In [230]:
data_prep_pipeline = FeatureUnion(transformer_list=[
        ("num_pipeline", num_pipeline),
    #    ("cat_pipeline", cat_pipeline),
    ])              

Selected Features

In [231]:
selected_features = num_attribs 
tot_features = f"{len(selected_features)}:   Num:{len(num_attribs)},    Cat:{len(cat_attribs)}"
#Total Feature selected for processing
tot_features
Out[231]:
'132:   Num:132,    Cat:0'
In [232]:
gc.collect()
Out[232]:
138

Evaluation metrics

Since HCDR is a Classification task, we are going to use the following metrics to measure the Model performance

In [233]:
def pct(x):
    return round(100*x,3)

Define dataframe with all metrics included

In [ ]:
#del expLog
In [304]:
try:
    expLog
except NameError:
    expLog = pd.DataFrame(columns=["exp_name", 
                                   "Train Acc", 
                                   "Valid Acc",
                                   "Test  Acc",
                                   "Train AUC", 
                                   "Valid AUC",
                                   "Test  AUC",
                                   "Train F1 Score",
                                   "Valid F1 Score",
                                   "Test F1 Score",                                   
                                   "Train Log Loss",
                                   "Valid Log Loss",
                                   "Test Log Loss",
                                   "P Score",
                                   "Train Time",
                                   "Valid Time",
                                   "Test Time",
                                   "Description"
                                  ])
In [305]:
# roc curve, precision recall curve for each model
fprs, tprs, precisions, recalls, names, scores, cvscores, pvalues, accuracy, cnfmatrix = list(), list(), list(), list(), list(), list(), list(), list(), list(), list()
features_list, final_best_clf,results = {}, {},[]

Accuracy Score

This metric describes the fraction of correctly classified samples. In SKLearn, it can be modified to return solely the number of correct samples.Accuracy is the default scoring method for both logistic regression and k-Nearest Neighbors in scikit-learn. image.png

Precision

The precision is the ratio of true positives over the total number of predicted positives. image-2.png

Recall

The recall is the ratio of true positives over the true positives and false negatives. Recall is assessing the ability of the classifier to find all the positive samples. The best value is 1 and the worst value is 0 image-3.png

In [306]:
def precision_recall_cust(model,X_train,y_train,X_test, y_test,X_valid, y_valid,precisions,recalls,name):
    # plot precision_recall Test
    precision, recall, threshold = precision_recall_curve(y_test,model.predict_proba(X_test)[:, 1])
    precisions.append(precision)
    recalls.append(recall)
    
    # plot combined Precision Recall curve for train, valid, test
    show_train_precision = plot_precision_recall_curve(model, X_train, y_train, name="TrainPresRecal")
    show_test_precision = plot_precision_recall_curve(model, X_test, y_test, name="TestPresRecal", ax=show_train_precision.ax_)
    show_valid_precision = plot_precision_recall_curve(model, X_valid, y_valid, name="ValidPresRecal", ax=show_test_precision.ax_)
    show_valid_precision.ax_.set_title ("Precision Recall Curve Comparison - " + name)
    plt.legend(bbox_to_anchor=(1.04,1), loc="upper left", borderaxespad=0)
    plt.show()
    return precisions,recalls

F1 score

The F1 score is a metric that has a value of 0 - 1, with 1 being the best value. The F1 score is a weighted average of the precision and recall, with the contributions of precision and recall are the same image.png

Confusion Matrix

The confusion matrix, in this case for a binary classification, is a 2x2 matrix that contains the count of the true positives, false positives, true negatives, and false negatives.
image-2.png

In [307]:
def confusion_matrix_def(model,X_train,y_train,X_test, y_test, X_valid, y_valid,cnfmatrix):
  #Prediction
  preds_test = model.predict(X_test)
  preds_train = model.predict(X_train)
  preds_valid = model.predict(X_valid)
    
  cm_train = confusion_matrix(y_train, preds_train).astype(np.float32)
  #print(cm_train)
  cm_train /= cm_train.sum(axis=1)[:, np.newaxis]

  cm_test = confusion_matrix(y_test, preds_test).astype(np.float32)
  #print(cm_test)
  cm_test /= cm_test.sum(axis=1)[:, np.newaxis]

  cm_valid = confusion_matrix(y_valid, preds_valid).astype(np.float32)
  cm_valid /= cm_valid.sum(axis=1)[:, np.newaxis]

  plt.figure(figsize=(16, 4))
  #plt.subplots(1,3,figsize=(12,4))

  plt.subplot(131)
  g = sns.heatmap(cm_train, vmin=0, vmax=1, annot=True, cmap="Reds")
  plt.xlabel("Predicted", fontsize=14)
  plt.ylabel("True", fontsize=14)
  g.set(xticklabels=class_labels, yticklabels=class_labels)
  plt.title("Train", fontsize=14)

  plt.subplot(132)
  g = sns.heatmap(cm_valid, vmin=0, vmax=1, annot=True, cmap="Reds")
  plt.xlabel("Predicted", fontsize=14)
  plt.ylabel("True", fontsize=14)
  g.set(xticklabels=class_labels, yticklabels=class_labels)
  plt.title("Validation set", fontsize=14);

  plt.subplot(133)
  g = sns.heatmap(cm_test, vmin=0, vmax=1, annot=True, cmap="Reds")
  plt.xlabel("Predicted", fontsize=14)
  plt.ylabel("True", fontsize=14)
  g.set(xticklabels=class_labels, yticklabels=class_labels)
  plt.title("Test", fontsize=14);
  cnfmatrix.append(cm_test)

  return cnfmatrix

AUC (Area under ROC curve)

An ROC curve (receiver operating characteristic curve) is a graph showing the performance of a classification model at all classification thresholds. This curve plots two parameters: ▪ True Positive Rate ▪ False Positive Rate image.png

AUC stands for "Area under the ROC Curve." That is, AUC measures the entire two-dimensional area underneath the entire ROC curve from (0,0) to (1,1). image-2.png

AUC is desirable for the following two reasons:

  1. AUC is scale-invariant. It measures how well predictions are ranked, rather than their absolute values.
  2. AUC is classification-threshold-invariant. It measures the quality of the model's predictions irrespective of what classification threshold is chosen.
In [308]:
def roc_curve_cust(model,X_train,y_train,X_test, y_test,X_valid, y_valid,fprs,tprs,name):
    fpr, tpr, threshold = roc_curve(y_test, model.predict_proba(X_test)[:, 1])
    fprs.append(fpr)
    tprs.append(tpr)
    # plot combined ROC curve for train, valid, test
    show_train_roc = plot_roc_curve(model, X_train, y_train, name="TrainRocAuc")
    show_test_roc = plot_roc_curve(model, X_test, y_test, name="TestRocAuc", ax=show_train_roc.ax_)
    show_valid_roc = plot_roc_curve(model, X_valid, y_valid, name="ValidRocAuc", ax=show_test_roc.ax_)
    show_valid_roc.ax_.set_title ("ROC Curve Comparison - " + name)
    plt.legend(bbox_to_anchor=(1.04,1), loc="upper left", borderaxespad=0)
    plt.show()
    return fprs,tprs

Binary cross-entropy loss (CXE)

CXE measures the performance of a classification model whose output is a probability value between 0 and 1. CXE increases as the predicted probability diverges from the actual label. Therefore, we choose a parameter, which would minimize the binary CXE loss function.

The log loss formula for the binary case is as follows :

$$ -\frac{1}{m}\sum^m_{i=1}\left(y_i\cdot\:\log\:\left(p_i\right)\:+\:\left(1-y_i\right)\cdot\log\left(1-p_i\right)\right) $$
  • $y_i$: the label for $i_{th}$ observation
  • $m$: sample size
  • $p_i$: predicted probability of the point being in the label($y=1$) for $i_{th}$ observation

p-value

p-value is the probability of obtaining test results at least as extreme as the results actually observed, under the assumption that the null hypothesis is correct. A very small p-value means that such an extreme observed outcome would be very unlikely under the null hypothesis.

We will compare the classifiers with the baseline untuned model by conducting two-tailed hypothesis test.

Null Hypothesis, H0: There is no significant difference between the two machine learning pipelines. Alternate Hypothesis, HA: The two machine learning pipelines are different. A p-value less than or equal to the significance level is considered statistically significant.

In [575]:
metrics = {'accuracy': make_scorer(accuracy_score),
            'roc_auc': 'roc_auc',
            'f1': make_scorer(f1_score),
            'log_loss': make_scorer(log_loss)
          }

Baseline model with Imbalanced Dataset

Data Leakage

Phase 3 was significant in terms of feature engineering since we had a better understanding of the data and were able to fine tune the datasets to minimize data leakage. The “TARGET” feature was not used in the train test and the test set was managed separately from the merged train dataset. The preprocessing was applied to the merged train set only. We also ensured to remove multicollinearity as a separate preprocessing step and experimented with feature selection methods only within the modelling pipeline to avoid any issues with data leakage. The feature selection methods used were RFE, SelectKbest with mutual classification and Variance threshold.

Create Train and Test Datasets

In [310]:
for col in selected_features:
  if col not in  train_dataset.columns:
    selected_features.remove(col)
In [562]:
# Split Sample to feed the pipeline and it will result in a new dataset that is (1 / splits) the size 
splits = 75

# Train Test split percentage
subsample_rate = 0.3

finaldf = np.array_split(train_dataset, splits)
X_train = finaldf[0][selected_features]
y_train = finaldf[0]['TARGET']
X_kaggle_test= X_kaggle_test[selected_features]

## split part of data
X_train, X_test, y_train, y_test = train_test_split(X_train, y_train, stratify=y_train,
                                                    test_size=subsample_rate, random_state=42)

X_train, X_valid, y_train, y_valid = train_test_split(X_train, y_train,stratify=y_train,test_size=0.15, random_state=42)

print(f"X train           shape: {X_train.shape}")
print(f"X validation      shape: {X_valid.shape}")
print(f"X test            shape: {X_test.shape}")
print(f"X kaggle_test     shape: {X_kaggle_test.shape}")
X train           shape: (2439, 132)
X validation      shape: (431, 132)
X test            shape: (1231, 132)
X kaggle_test     shape: (48744, 132)
In [379]:
X_kaggle_test=datasets['X_kaggle_test1']
kaggle_test = X_kaggle_test[selected_features]
X_kaggle_test.shape,kaggle_test.shape
Out[379]:
((48744, 495), (48744, 132))

Define pipeline

Logistic regression model is used as a baseline Model, since it's easy to implement yet provides great efficiency. Training a logistic regression model doesn't require high computation power. We also tuned the regularization, tolerance, and C hyper parameters for the Logistic regression model and compared the results with the baseline model. We used 15 fold cross fold validation with hyperparameters to tune the model and apply GridSearchCV function in Sklearn.

In [312]:
%%time 
np.random.seed(42)
full_pipeline_with_predictor = Pipeline([
        ("preparation", data_prep_pipeline),
        ("linear", LogisticRegression())
    ])
CPU times: user 199 µs, sys: 12 µs, total: 211 µs
Wall time: 220 µs

Perform cross-fold validation and Train the model

Split the training data to 15 fold to perform Crossfold validation

In [313]:
cvSplits = ShuffleSplit(n_splits=5, test_size=0.3, random_state=0)
In [314]:
X_train.head(5)
gc.collect()
Out[314]:
325
In [315]:
start = time()
model = full_pipeline_with_predictor.fit(X_train, y_train)
np.random.seed(42)

# Set up cross validation scores 
logit_scores = cross_validate(model, X_train, y_train,cv=cvSplits,scoring=metrics, return_train_score=True, n_jobs=-1)  
train_time = np.round(time() - start, 4)

# Time and score valid predictions
start = time()
logit_score_valid  = full_pipeline_with_predictor.score(X_valid, y_valid)
valid_time = np.round(time() - start, 4)

# Time and score test predictions
start = time()
logit_score_test  = full_pipeline_with_predictor.score(X_test, y_test)
test_time = np.round(time() - start, 4)

Calculate metrics

In [316]:
exp_name = f"Baseline_{len(selected_features)}_features"
expLog.loc[len(expLog)] = [f"{exp_name}"] + list(np.round(
               [logit_scores['train_accuracy'].mean(), 
                logit_scores['test_accuracy'].mean(),
                accuracy_score(y_test, model.predict(X_test)),
                logit_scores['train_roc_auc'].mean(),
                logit_scores['test_roc_auc'].mean(),
                roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]),
                logit_scores['train_f1'].mean(),
                logit_scores['test_f1'].mean(),
                f1_score(y_test, model.predict(X_test)),
                logit_scores['train_log_loss'].mean(),
                logit_scores['test_log_loss'].mean(),
                log_loss(y_test, model.predict(X_test)),0 ],4)) \
                + [train_time, logit_scores['score_time'].mean(), test_time] + [f"Imbalanced Logistic reg features {tot_features} with 20% training data"]
expLog
Out[316]:
exp_name Train Acc Valid Acc Test Acc Train AUC Valid AUC Test AUC Train F1 Score Valid F1 Score Test F1 Score Train Log Loss Valid Log Loss Test Log Loss P Score Train Time Valid Time Test Time Description
0 Baseline_132_features 0.9289 0.912 0.9188 0.8547 0.7024 0.75 0.2301 0.1298 0.1071 2.4564 3.0387 2.8058 0.0 0.5625 0.018573 0.0077 Imbalanced Logistic reg features 132: Num:13...

Confusion matrix

In [317]:
# Create confusion matrix for baseline model
_=confusion_matrix_def(model,X_train,y_train,X_test,y_test,X_valid, y_valid,cnfmatrix)

AUC (Area under ROC curve)

In [318]:
_,_=roc_curve_cust(model,X_train,y_train,X_test, y_test,X_valid, y_valid,fprs,tprs,"Baseline Logistic Regression Model")

Precision Recall Curve

In [319]:
_,_=precision_recall_cust(model,X_train,y_train,X_test, y_test,X_valid, y_valid,precisions,recalls,"Baseline Logistic Regression Model")
In [320]:
gc.collect()
Out[320]:
17104

Tune Basline model with grid search & RFE

Various Classification algorithms were used to compare with the best model. Following metrics were used to find the best model

  • Cross fold Train Accuracy
  • Test Accuracy
  • p-value
  • Train ROC_AUC_Score
  • Test ROC_AUC_Score
  • Train F1_Score
  • Test F1_Score
  • Train LogLoss
  • Test LogLoss
  • Train Time
  • Test Time
  • Confusion matrix

We implemented the Logistic regression model as the baseline model, which didn’t require high computation power and was easy to implement, in addition we implemented KNN and tuned logistic models with balanced dataset to improve our model predictiveness. Our objective in current phase s to explore various classification models which would improve our prediction. Our primary focus is on boosting algorithms which are said to be highly efficient and moderately quicker. As shown in the diagram below is the modelling pipeline for current phase. We primarily experimented with Gradient Boosting, XGBoost, Light BGM, RandomForest and SVM.

image.png

Recursive Feature Elimination RFE is a wrapper-type feature selection algorithm. A different machine learning algorithm is given and used in the core of the method, is wrapped by RFE, and used to help select features. We have chosen this model in contrast to filter-based feature selections that score each feature and select those features with the largest (or smallest) score.

Below is the reason for choosing the mentioned models.

  1. Gradient Boosting provides a better predictive model by forming an ensemble of weak predictors.
  2. XGBoost is one of the quickest implementations of gradient boosted trees. XGBoost is designed to handle missing values internally. This is helpful because there are many, many hyperparameters to tune.
  3. LightGBM in many cases provides results which are more effective and faster than XGBoost with lesser memory usage. It splits the tree leaf wise with the best fit whereas other boosting algorithms split the tree depth wise or level wise rather than leaf-wise.
  4. Random Forest is a tree-based machine learning algorithm that combines the output of multiple decision trees for making decisions. For each tree only a random subsample of the available features is selected for building the tree. Random Forest uses decision trees, which are more prone to overfitting.
  5. SVM performs similar to logistic regression when linear separation and performs well with non-linear boundaries depending on the kernel used. SVM is susceptible to overfitting/training issues depending on the kernel. A more complex kernel would overfit the model.

Gradient Boosting implementations have no regularisation like XGBoost, therefore it helps to reduce overfitting. But boosting algorithms can overfit if the number of trees is very large. We did two submission in Kaggle, one using Voting Classifier and the other one with best classifier i.e. XGBoost. A Voting Classifier is a machine learning model that trains on an ensemble of various models and predicts an output based on their highest probability of chosen class as the output. We have chosen soft voting instead of hard voting since the soft voting predicts based on the average of all models.

For discussion, Results and conclusions refer to the bottom of this page.

Classifiers

In [321]:
classifiers = [
        [('Logistic Regression', LogisticRegression(solver='saga',random_state=42),"RFE")],
        [('Support Vector', SVC(random_state=42,probability=True),"SVM")],
        [('Gradient Boosting', GradientBoostingClassifier(warm_start=True, random_state=42),"RFE")],
        [('XGBoost', XGBClassifier(random_state=42),"RFE")],
        [('Light GBM', LGBMClassifier(boosting_type='gbdt', random_state=42),"RFE")],
        [('RandomForest', RandomForestClassifier(random_state=42),"RFE")]
    ]

Hyper-parameters for different models

In [322]:
# Arrange grid search parameters for each classifier
params_grid = {
        'Logistic Regression': {
            'penalty': ('l1', 'l2','elasticnet'),
            'tol': (0.0001, 0.00001), 
            'C': (10, 1, 0.1, 0.01),
        }
    ,
        'Support Vector' : {
            'kernel': ('rbf','poly'),     
            'degree': (4, 5),
            'C': ( 0.001, 0.01),   #Low C - allow for misclassification
            'gamma':(0.01,0.1,1)  #Low gamma - high variance and low bias
        }
    ,
    'Gradient Boosting':  {
            'max_depth': [5,10], # Lower helps with overfitting
            'max_features': [10,15],
            'validation_fraction': [0.2],
            'n_iter_no_change': [10],
            'tol': [0.01,0.0001],
            'n_estimators':[1000],
            'subsample' : [0.8],             #fraction of observations to be randomly samples for each tree.
    #        'min_samples_split' : [5], # Must have 'x' number of samples to split (Default = 2)
            'min_samples_leaf' : [3,5],        # (Default = 1) minimum number of samples in a leaf
        },
        'XGBoost':  {
            'max_depth': [3,5], # Lower helps with overfitting
            'n_estimators':[300,500],
            'learning_rate': [0.01,0.1],
#            'objective': ['binary:logistic'],
#            'eval_metric': ['auc'],
            'eta' : [0.01,0.1],
            'colsample_bytree' : [0.2,0.5], 
        },
        'Light GBM':  {
            'max_depth': [2,5],  # Lower helps with overfitting
            'num_leaves': [5,10], # Equivalent to max depth
            'n_estimators':[1000,5000],
            'learning_rate': [0.01,0.1],
 #           'reg_alpha': [0.1,0.01,1],
 #           'reg_lambda': [0.1,0.01,1],
            'boosting_type':['goss','dart'],
 #           'metric': ['auc'],
 #           'objective':['binary'],
            'max_bin' : [100,200],  #Setting it to high values has similar effect as caused by increasing value of num_leaves 
        },                          #small numbers reduces accuracy but runs faster 

        'RandomForest':  {
            'max_depth': [5,10],
            'max_features': [15,20],
            'min_samples_split': [5, 10],
            'min_samples_leaf': [3, 5],
            'bootstrap': [True],
            'n_estimators':[1000]},
    }
In [323]:
# Set feature selection settings
# Features removed each step
#feature_selection_steps=50
feature_selection_steps=0.5
# Number of features used
features_used=len(selected_features)
#features_used=100
In [324]:
results.append(logit_scores['train_accuracy'])
names = ['Baseline LR']
def ConductGridSearch(in_classifiers,cnfmatrix,fprs,tprs,precisions,recalls):
    for (name, classifier,feature_sel) in in_classifiers:
            # Print classifier and parameters
            print('****** START', name,'*****')
            parameters = params_grid[name]
            print("Parameters:")
            for p in sorted(parameters.keys()):
                print("\t"+str(p)+": "+ str(parameters[p]))

            # generate the pipeline based on the feature selection method
            if feature_sel == "SVM":
                full_pipeline_with_predictor = Pipeline([
                ("preparation", data_prep_pipeline),
            #    ("PCA",PCA(0.95)),
            #    ('RFE', RFE(estimator=classifier, n_features_to_select=features_used, step=feature_selection_steps)),
                ("predictor", classifier)
                ])
            else:
                full_pipeline_with_predictor = Pipeline([
                ("preparation", data_prep_pipeline),
                ('RFE', RFE(estimator=classifier, n_features_to_select=features_used, step=feature_selection_steps)),
                ("predictor", classifier)
                ])

            # Execute the grid search
            params = {}
            for p in parameters.keys():
                pipe_key = 'predictor__'+str(p)
                params[pipe_key] = parameters[p] 
            grid_search = GridSearchCV(full_pipeline_with_predictor, params, cv=cvSplits, scoring='roc_auc',
                                       n_jobs=-1,verbose=1)
            grid_search.fit(X_train, y_train)

            # Best estimator score
            best_train = pct(grid_search.best_score_)

            # Best train scores
            print("Cross validation with best estimator")
            best_train_scores = cross_validate(grid_search.best_estimator_, X_train, y_train,cv=cvSplits,scoring=metrics, 
                                               return_train_score=True, n_jobs=-1)  

            #get all scores
            best_train_accuracy = np.round(best_train_scores['train_accuracy'].mean(),4)
            best_train_f1 = np.round(best_train_scores['train_f1'].mean(),4)
            best_train_logloss = np.round(best_train_scores['train_log_loss'].mean(),4)
            best_train_roc_auc = np.round(best_train_scores['train_roc_auc'].mean(),4)

            valid_time = np.round(best_train_scores['score_time'].mean(),4)
            best_valid_accuracy = np.round(best_train_scores['test_accuracy'].mean(),4)
            best_valid_f1 = np.round(best_train_scores['test_f1'].mean(),4)
            best_valid_logloss = np.round(best_train_scores['test_log_loss'].mean(),4)
            best_valid_roc_auc = np.round(best_train_scores['test_roc_auc'].mean(),4)

            #append all results
            results.append(best_train_scores['train_accuracy'])
            names.append(name)
            
            # Conduct t-test with baseline logit (control) and best estimator (experiment)
            (t_stat, p_value) = stats.ttest_rel(logit_scores['train_roc_auc'], best_train_scores['train_roc_auc'])

            #test and Prediction with whole data
            # Best estimator fitting time
            print("Fit and Prediction with best estimator")
            start = time()
            model = grid_search.best_estimator_.fit(X_train, y_train)
            train_time = round(time() - start, 4)

            # Best estimator prediction time
            start = time()
            y_test_pred = model.predict(X_test)
            test_time = round(time() - start, 4)
            scores.append(roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))
            accuracy.append(accuracy_score(y_test, y_test_pred))

            # Create confusion matrix for the best model
            cnfmatrix = confusion_matrix_def(model,X_train,y_train,X_test,y_test,X_valid, y_valid,cnfmatrix)

            # Create AUC ROC curve
            fprs,tprs = roc_curve_cust(model,X_train,y_train,X_test, y_test,X_valid, y_valid,fprs,tprs,name)

            #Create Precision recall curve
            precisions,recalls = precision_recall_cust(model,X_train,y_train,X_test, y_test,X_valid, y_valid,precisions,recalls,name)

            #Best Model
            final_best_clf[name]=pd.DataFrame([{'label': grid_search.best_estimator_.named_steps['predictor'].__class__.__name__,
                                           'predictor': grid_search.best_estimator_.named_steps['predictor']}])
            #Feature importance 
            feature_name = num_attribs #+ cat_attribs
            feature_list = feature_name
            if feature_sel == "RFE":
            #    features_list[name]=pd.DataFrame({'feature_name': feature_list,
            #                                         'feature_importance': grid_search.best_estimator_.named_steps['PCA'].explained_variance_ratio_})
            #                             'feature_importance': grid_search.best_estimator_.named_steps['RFE'].ranking_})
               # print(grid_search.best_estimator_.named_steps['preparation'].get_feature_names())
               # print(len(grid_search.best_estimator_.named_steps['preparation'].get_feature_names()))
               # print(len(feature_list),feature_list)
               # print(len(grid_search.best_estimator_.named_steps['RFE'].ranking_))
            #          grid_search.best_estimator_.named_steps['RFE'].ranking_)
                features_list[name]=pd.DataFrame({'feature_name': feature_list,
                                         'feature_importance': grid_search.best_estimator_.named_steps['RFE'].ranking_[:132]})
            # Collect the best parameters found by the grid search
            print("Best Parameters:")
            best_parameters = grid_search.best_estimator_.get_params()
            param_dump = []
            for param_name in sorted(params.keys()):
                param_dump.append((param_name, best_parameters[param_name]))
                print("\t"+str(param_name)+": " + str(best_parameters[param_name]))
            print("****** FINISH",name," *****")
            print("")

            # Record the results
            exp_name = name
            expLog.loc[len(expLog)] = [f"{exp_name}"] + list(np.round(
                    [best_train_accuracy, 
                    #pct(accuracy_score(y_valid, model.predict(X_valid))),
                    best_valid_accuracy,
                    accuracy_score(y_test, y_test_pred),
                    best_train_roc_auc,
                    best_valid_roc_auc,
                    #roc_auc_score(y_valid, model.predict_proba(X_valid)[:, 1]),
                    roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]),
                    best_train_f1,
                    best_valid_f1,
                    f1_score(y_test, y_test_pred),
                    best_train_logloss,
                    best_valid_logloss, 
                    log_loss(y_test, y_test_pred),
                    p_value
                    ],4)) + [train_time,valid_time,test_time] \
                    + [json.dumps(param_dump)]

Logistic Regression Model

In [325]:
ConductGridSearch(classifiers[0],cnfmatrix,fprs,tprs,precisions,recalls)
****** START Logistic Regression *****
Parameters:
	C: (10, 1, 0.1, 0.01)
	penalty: ('l1', 'l2', 'elasticnet')
	tol: (0.0001, 1e-05)
Fitting 5 folds for each of 24 candidates, totalling 120 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:   13.7s
[Parallel(n_jobs=-1)]: Done 120 out of 120 | elapsed:   35.0s finished
Cross validation with best estimator
Fit and Prediction with best estimator
Best Parameters:
	predictor__C: 0.1
	predictor__penalty: l1
	predictor__tol: 0.0001
****** FINISH Logistic Regression  *****

In [326]:
gc.collect()
Out[326]:
17473
In [327]:
expLog
Out[327]:
exp_name Train Acc Valid Acc Test Acc Train AUC Valid AUC Test AUC Train F1 Score Valid F1 Score Test F1 Score Train Log Loss Valid Log Loss Test Log Loss P Score Train Time Valid Time Test Time Description
0 Baseline_132_features 0.9289 0.912 0.9188 0.8547 0.7024 0.7500 0.2301 0.1298 0.1071 2.4564 3.0387 2.8058 0.0000 0.5625 0.018573 0.0077 Imbalanced Logistic reg features 132: Num:13...
1 Logistic Regression 0.9256 0.923 0.9220 0.8162 0.7577 0.7522 0.0602 0.0396 0.0204 2.5697 2.6612 2.6935 0.0002 1.0242 0.017700 0.0066 [["predictor__C", 0.1], ["predictor__penalty",...

Gradient Boosting

In [328]:
ConductGridSearch(classifiers[2],cnfmatrix,fprs,tprs,precisions,recalls)
****** START Gradient Boosting *****
Parameters:
	max_depth: [5, 10]
	max_features: [10, 15]
	min_samples_leaf: [3, 5]
	n_estimators: [1000]
	n_iter_no_change: [10]
	subsample: [0.8]
	tol: [0.01, 0.0001]
	validation_fraction: [0.2]
Fitting 5 folds for each of 16 candidates, totalling 80 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:  1.2min
[Parallel(n_jobs=-1)]: Done  80 out of  80 | elapsed:  2.2min finished
Cross validation with best estimator
Fit and Prediction with best estimator
Best Parameters:
	predictor__max_depth: 5
	predictor__max_features: 10
	predictor__min_samples_leaf: 3
	predictor__n_estimators: 1000
	predictor__n_iter_no_change: 10
	predictor__subsample: 0.8
	predictor__tol: 0.0001
	predictor__validation_fraction: 0.2
****** FINISH Gradient Boosting  *****

In [ ]:
gc.collect()
In [330]:
expLog
Out[330]:
exp_name Train Acc Valid Acc Test Acc Train AUC Valid AUC Test AUC Train F1 Score Valid F1 Score Test F1 Score Train Log Loss Valid Log Loss Test Log Loss P Score Train Time Valid Time Test Time Description
0 Baseline_132_features 0.9289 0.9120 0.9188 0.8547 0.7024 0.7500 0.2301 0.1298 0.1071 2.4564 3.0387 2.8058 0.0000 0.5625 0.018573 0.0077 Imbalanced Logistic reg features 132: Num:13...
1 Logistic Regression 0.9256 0.9230 0.9220 0.8162 0.7577 0.7522 0.0602 0.0396 0.0204 2.5697 2.6612 2.6935 0.0002 1.0242 0.017700 0.0066 [["predictor__C", 0.1], ["predictor__penalty",...
2 Gradient Boosting 0.9475 0.9224 0.9196 0.9412 0.7240 0.7212 0.4651 0.0212 0.0000 1.8129 2.6801 2.7777 0.0005 6.4526 0.022900 0.0114 [["predictor__max_depth", 5], ["predictor__max...

XGBoost

In [331]:
ConductGridSearch(classifiers[3],cnfmatrix,fprs,tprs,precisions,recalls)
****** START XGBoost *****
Parameters:
	colsample_bytree: [0.2, 0.5]
	eta: [0.01, 0.1]
	learning_rate: [0.01, 0.1]
	max_depth: [3, 5]
	n_estimators: [300, 500]
Fitting 5 folds for each of 32 candidates, totalling 160 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:   45.0s
[Parallel(n_jobs=-1)]: Done 160 out of 160 | elapsed:  3.3min finished
Cross validation with best estimator
Fit and Prediction with best estimator
Best Parameters:
	predictor__colsample_bytree: 0.2
	predictor__eta: 0.01
	predictor__learning_rate: 0.01
	predictor__max_depth: 3
	predictor__n_estimators: 500
****** FINISH XGBoost  *****

In [332]:
expLog
Out[332]:
exp_name Train Acc Valid Acc Test Acc Train AUC Valid AUC Test AUC Train F1 Score Valid F1 Score Test F1 Score Train Log Loss Valid Log Loss Test Log Loss P Score Train Time Valid Time Test Time Description
0 Baseline_132_features 0.9289 0.9120 0.9188 0.8547 0.7024 0.7500 0.2301 0.1298 0.1071 2.4564 3.0387 2.8058 0.0000 0.5625 0.018573 0.0077 Imbalanced Logistic reg features 132: Num:13...
1 Logistic Regression 0.9256 0.9230 0.9220 0.8162 0.7577 0.7522 0.0602 0.0396 0.0204 2.5697 2.6612 2.6935 0.0002 1.0242 0.017700 0.0066 [["predictor__C", 0.1], ["predictor__penalty",...
2 Gradient Boosting 0.9475 0.9224 0.9196 0.9412 0.7240 0.7212 0.4651 0.0212 0.0000 1.8129 2.6801 2.7777 0.0005 6.4526 0.022900 0.0114 [["predictor__max_depth", 5], ["predictor__max...
3 XGBoost 0.9310 0.9230 0.9236 0.9599 0.7325 0.7367 0.1543 0.0000 0.0000 2.3835 2.6612 2.6374 0.0000 3.3547 0.057700 0.0305 [["predictor__colsample_bytree", 0.2], ["predi...

Light GBM

In [333]:
ConductGridSearch(classifiers[4],cnfmatrix,fprs,tprs,precisions,recalls)
****** START Light GBM *****
Parameters:
	boosting_type: ['goss', 'dart']
	learning_rate: [0.01, 0.1]
	max_bin: [100, 200]
	max_depth: [2, 5]
	n_estimators: [1000, 5000]
	num_leaves: [5, 10]
Fitting 5 folds for each of 64 candidates, totalling 320 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:  1.4min
[Parallel(n_jobs=-1)]: Done 192 tasks      | elapsed:  7.8min
[Parallel(n_jobs=-1)]: Done 320 out of 320 | elapsed: 19.2min finished
Cross validation with best estimator
Fit and Prediction with best estimator
Best Parameters:
	predictor__boosting_type: dart
	predictor__learning_rate: 0.01
	predictor__max_bin: 100
	predictor__max_depth: 5
	predictor__n_estimators: 1000
	predictor__num_leaves: 5
****** FINISH Light GBM  *****

In [334]:
expLog
Out[334]:
exp_name Train Acc Valid Acc Test Acc Train AUC Valid AUC Test AUC Train F1 Score Valid F1 Score Test F1 Score Train Log Loss Valid Log Loss Test Log Loss P Score Train Time Valid Time Test Time Description
0 Baseline_132_features 0.9289 0.9120 0.9188 0.8547 0.7024 0.7500 0.2301 0.1298 0.1071 2.4564 3.0387 2.8058 0.0000 0.5625 0.018573 0.0077 Imbalanced Logistic reg features 132: Num:13...
1 Logistic Regression 0.9256 0.9230 0.9220 0.8162 0.7577 0.7522 0.0602 0.0396 0.0204 2.5697 2.6612 2.6935 0.0002 1.0242 0.017700 0.0066 [["predictor__C", 0.1], ["predictor__penalty",...
2 Gradient Boosting 0.9475 0.9224 0.9196 0.9412 0.7240 0.7212 0.4651 0.0212 0.0000 1.8129 2.6801 2.7777 0.0005 6.4526 0.022900 0.0114 [["predictor__max_depth", 5], ["predictor__max...
3 XGBoost 0.9310 0.9230 0.9236 0.9599 0.7325 0.7367 0.1543 0.0000 0.0000 2.3835 2.6612 2.6374 0.0000 3.3547 0.057700 0.0305 [["predictor__colsample_bytree", 0.2], ["predi...
4 Light GBM 0.9281 0.9224 0.9228 0.8927 0.7277 0.7220 0.0864 0.0000 0.0000 2.4847 2.6801 2.6655 0.0012 6.2475 0.116800 0.0649 [["predictor__boosting_type", "dart"], ["predi...

RandomForest

In [335]:
ConductGridSearch(classifiers[5],cnfmatrix,fprs,tprs,precisions,recalls)
****** START RandomForest *****
Parameters:
	bootstrap: [True]
	max_depth: [5, 10]
	max_features: [15, 20]
	min_samples_leaf: [3, 5]
	min_samples_split: [5, 10]
	n_estimators: [1000]
Fitting 5 folds for each of 16 candidates, totalling 80 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:  1.8min
[Parallel(n_jobs=-1)]: Done  80 out of  80 | elapsed:  3.8min finished
Cross validation with best estimator
Fit and Prediction with best estimator
Best Parameters:
	predictor__bootstrap: True
	predictor__max_depth: 5
	predictor__max_features: 15
	predictor__min_samples_leaf: 3
	predictor__min_samples_split: 10
	predictor__n_estimators: 1000
****** FINISH RandomForest  *****

In [ ]:
gc.collect()
In [336]:
expLog
Out[336]:
exp_name Train Acc Valid Acc Test Acc Train AUC Valid AUC Test AUC Train F1 Score Valid F1 Score Test F1 Score Train Log Loss Valid Log Loss Test Log Loss P Score Train Time Valid Time Test Time Description
0 Baseline_132_features 0.9289 0.9120 0.9188 0.8547 0.7024 0.7500 0.2301 0.1298 0.1071 2.4564 3.0387 2.8058 0.0000 0.5625 0.018573 0.0077 Imbalanced Logistic reg features 132: Num:13...
1 Logistic Regression 0.9256 0.9230 0.9220 0.8162 0.7577 0.7522 0.0602 0.0396 0.0204 2.5697 2.6612 2.6935 0.0002 1.0242 0.017700 0.0066 [["predictor__C", 0.1], ["predictor__penalty",...
2 Gradient Boosting 0.9475 0.9224 0.9196 0.9412 0.7240 0.7212 0.4651 0.0212 0.0000 1.8129 2.6801 2.7777 0.0005 6.4526 0.022900 0.0114 [["predictor__max_depth", 5], ["predictor__max...
3 XGBoost 0.9310 0.9230 0.9236 0.9599 0.7325 0.7367 0.1543 0.0000 0.0000 2.3835 2.6612 2.6374 0.0000 3.3547 0.057700 0.0305 [["predictor__colsample_bytree", 0.2], ["predi...
4 Light GBM 0.9281 0.9224 0.9228 0.8927 0.7277 0.7220 0.0864 0.0000 0.0000 2.4847 2.6801 2.6655 0.0012 6.2475 0.116800 0.0649 [["predictor__boosting_type", "dart"], ["predi...
5 RandomForest 0.9248 0.9232 0.9236 0.9510 0.7384 0.7207 0.0059 0.0000 0.0000 2.5980 2.6517 2.6374 0.0001 7.1055 0.504400 0.2098 [["predictor__bootstrap", true], ["predictor__...

Support Vector

In [337]:
ConductGridSearch(classifiers[1],cnfmatrix,fprs,tprs,precisions,recalls)
****** START Support Vector *****
Parameters:
	C: (0.001, 0.01)
	degree: (4, 5)
	gamma: (0.01, 0.1, 1)
	kernel: ('rbf', 'poly')
Fitting 5 folds for each of 24 candidates, totalling 120 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:   17.4s
[Parallel(n_jobs=-1)]: Done 120 out of 120 | elapsed:  1.3min finished
Cross validation with best estimator
Fit and Prediction with best estimator
Best Parameters:
	predictor__C: 0.01
	predictor__degree: 4
	predictor__gamma: 0.01
	predictor__kernel: rbf
****** FINISH Support Vector  *****

In [338]:
gc.collect()
Out[338]:
34552
In [339]:
expLog
Out[339]:
exp_name Train Acc Valid Acc Test Acc Train AUC Valid AUC Test AUC Train F1 Score Valid F1 Score Test F1 Score Train Log Loss Valid Log Loss Test Log Loss P Score Train Time Valid Time Test Time Description
0 Baseline_132_features 0.9289 0.9120 0.9188 0.8547 0.7024 0.7500 0.2301 0.1298 0.1071 2.4564 3.0387 2.8058 0.0000 0.5625 0.018573 0.0077 Imbalanced Logistic reg features 132: Num:13...
1 Logistic Regression 0.9256 0.9230 0.9220 0.8162 0.7577 0.7522 0.0602 0.0396 0.0204 2.5697 2.6612 2.6935 0.0002 1.0242 0.017700 0.0066 [["predictor__C", 0.1], ["predictor__penalty",...
2 Gradient Boosting 0.9475 0.9224 0.9196 0.9412 0.7240 0.7212 0.4651 0.0212 0.0000 1.8129 2.6801 2.7777 0.0005 6.4526 0.022900 0.0114 [["predictor__max_depth", 5], ["predictor__max...
3 XGBoost 0.9310 0.9230 0.9236 0.9599 0.7325 0.7367 0.1543 0.0000 0.0000 2.3835 2.6612 2.6374 0.0000 3.3547 0.057700 0.0305 [["predictor__colsample_bytree", 0.2], ["predi...
4 Light GBM 0.9281 0.9224 0.9228 0.8927 0.7277 0.7220 0.0864 0.0000 0.0000 2.4847 2.6801 2.6655 0.0012 6.2475 0.116800 0.0649 [["predictor__boosting_type", "dart"], ["predi...
5 RandomForest 0.9248 0.9232 0.9236 0.9510 0.7384 0.7207 0.0059 0.0000 0.0000 2.5980 2.6517 2.6374 0.0001 7.1055 0.504400 0.2098 [["predictor__bootstrap", true], ["predictor__...
6 Support Vector 0.9245 0.9232 0.9236 1.0000 0.6624 0.6721 0.0000 0.0000 0.0000 2.6061 2.6517 2.6374 0.0000 2.3175 0.251700 0.1993 [["predictor__C", 0.01], ["predictor__degree",...

Model Validation

Feature Importance based on all Models

In [340]:
# plot feature importance by their ranking for each model
for name in names[1:-1]:
    plt.figure(figsize=(10,10), dpi= 80)
    features_df = features_df = features_list[name].sort_values(['feature_importance','feature_name'], ascending=[False, False])
    sortedNames = np.array(features_df)[0:25, 0]
    sortedImportances = np.array(features_df)[0:25, 1]
    plt.title('Feature Importance  - ' + name)
    plt.barh(range(len(sortedNames)), sortedImportances, color='g', align='center')
    plt.yticks(range(len(sortedNames)), sortedNames)  
    plt.xlabel('Low Importance                                                           High Importance')
    plt.grid()
    plt.show()

Boxplot with all CV results

In [341]:
# boxplot algorithm comparison
fig = pyplot.figure()
fig.suptitle('Classification Algorithm Comparison')
ax = fig.add_subplot(111)
pyplot.boxplot(results)
ax.set_xticklabels(names,rotation=90)
pyplot.grid()
pyplot.show()

AUC (Area Under the ROC Curve)

In [342]:
# roc curve fpr, tpr  for all classifiers 
plt.plot([0,1],[0,1], 'k--')
for i in range(len(names)-1):
    plt.plot(fprs[i],tprs[i],label = names[i] + '  ' + str(scores[i]))
plt.legend(bbox_to_anchor=(1.04,1), loc="upper left", borderaxespad=0)
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title('Receiver Operating Characteristic')
plt.show()

Precision Recall Curve

In [343]:
# precision recall curve  for all classifiers 
for i in range(len(names)-1):
    plt.plot(recalls[i],precisions[i],label = names[i])
plt.legend(bbox_to_anchor=(1.04,1), loc="upper left", borderaxespad=0)
plt.xlabel("Recall")
plt.ylabel("Precision")
plt.title('Precision-Recall Curve')
plt.show()

Confusion Matrix

In [344]:
# plot confusion matrix for all classifiers 
f, axes = plt.subplots(1, len(names), figsize=(30, 8), sharey='row')
for i in range(len(names)):
    disp = ConfusionMatrixDisplay(cnfmatrix[i], display_labels=['0', '1'])
    disp.plot(ax=axes[i], xticks_rotation=0)
    disp.ax_.set_title("Confusion Matrix - " + names[i])
    disp.im_.colorbar.remove()
    disp.ax_.set_xlabel('')
    if i!=0:
        disp.ax_.set_ylabel('')

f.text(0.4, 0.1, 'Predicted label', ha='left')
plt.subplots_adjust(wspace=0.10, hspace=0.1)

f.colorbar(disp.im_, ax=axes)
plt.show()

Final results

In [352]:
pd.set_option('display.max_colwidth', None)
expLog.to_csv("/content/drive/My Drive/AML Project/Data/Phase3/expLog_RFE.csv",index=False)
expLog
Out[352]:
exp_name Train Acc Valid Acc Test Acc Train AUC Valid AUC Test AUC Train F1 Score Valid F1 Score Test F1 Score Train Log Loss Valid Log Loss Test Log Loss P Score Train Time Valid Time Test Time Description
0 Baseline_132_features 0.9289 0.9120 0.9188 0.8547 0.7024 0.7500 0.2301 0.1298 0.1071 2.4564 3.0387 2.8058 0.0000 0.5625 0.018573 0.0077 Imbalanced Logistic reg features 132: Num:132, Cat:0 with 20% training data
1 Logistic Regression 0.9256 0.9230 0.9220 0.8162 0.7577 0.7522 0.0602 0.0396 0.0204 2.5697 2.6612 2.6935 0.0002 1.0242 0.017700 0.0066 [["predictor__C", 0.1], ["predictor__penalty", "l1"], ["predictor__tol", 0.0001]]
2 Gradient Boosting 0.9475 0.9224 0.9196 0.9412 0.7240 0.7212 0.4651 0.0212 0.0000 1.8129 2.6801 2.7777 0.0005 6.4526 0.022900 0.0114 [["predictor__max_depth", 5], ["predictor__max_features", 10], ["predictor__min_samples_leaf", 3], ["predictor__n_estimators", 1000], ["predictor__n_iter_no_change", 10], ["predictor__subsample", 0.8], ["predictor__tol", 0.0001], ["predictor__validation_fraction", 0.2]]
3 XGBoost 0.9310 0.9230 0.9236 0.9599 0.7325 0.7367 0.1543 0.0000 0.0000 2.3835 2.6612 2.6374 0.0000 3.3547 0.057700 0.0305 [["predictor__colsample_bytree", 0.2], ["predictor__eta", 0.01], ["predictor__learning_rate", 0.01], ["predictor__max_depth", 3], ["predictor__n_estimators", 500]]
4 Light GBM 0.9281 0.9224 0.9228 0.8927 0.7277 0.7220 0.0864 0.0000 0.0000 2.4847 2.6801 2.6655 0.0012 6.2475 0.116800 0.0649 [["predictor__boosting_type", "dart"], ["predictor__learning_rate", 0.01], ["predictor__max_bin", 100], ["predictor__max_depth", 5], ["predictor__n_estimators", 1000], ["predictor__num_leaves", 5]]
5 RandomForest 0.9248 0.9232 0.9236 0.9510 0.7384 0.7207 0.0059 0.0000 0.0000 2.5980 2.6517 2.6374 0.0001 7.1055 0.504400 0.2098 [["predictor__bootstrap", true], ["predictor__max_depth", 5], ["predictor__max_features", 15], ["predictor__min_samples_leaf", 3], ["predictor__min_samples_split", 10], ["predictor__n_estimators", 1000]]
6 Support Vector 0.9245 0.9232 0.9236 1.0000 0.6624 0.6721 0.0000 0.0000 0.0000 2.6061 2.6517 2.6374 0.0000 2.3175 0.251700 0.1993 [["predictor__C", 0.01], ["predictor__degree", 4], ["predictor__gamma", 0.01], ["predictor__kernel", "rbf"]]
In [356]:
final_best_clf
final_best_cl_rfe_fdf = pd.DataFrame(list(final_best_clf.items()))#,columns = ['column1','column2']) 
with open('/content/drive/My Drive/AML Project/Data/Phase3/final_best_clf_RFE.txt', 'w') as file:
     file.write(str(final_best_clf))

Kaggle submission

Build Pipeline using best models and create an ensemble model for Kaggle submission

Voting Classifier to predict best results based on best Classifier Probability

In [384]:
def voting_classifier_submission(model_selection,final_best_clf,fs_type,fs_params):
  %%time 
  np.random.seed(42)
  print("Classifier with parameters")
  final_estimators = []
  for i,clf in enumerate(model_selection):
      model = final_best_clf[clf]['predictor'][0]
      print(i+1, " :",model)
      final_estimators.append((clf,make_pipeline(data_prep_pipeline,
                         RFE(estimator=model,n_features_to_select=features_used, step=feature_selection_steps),
                          model)))
  voting_classifier = Pipeline([("clf", VotingClassifier(estimators=final_estimators, voting='soft'))])
  final_X_train = finaldf[0][selected_features]
  final_y_train = finaldf[0]['TARGET']
  final_X_kaggle_test = kaggle_test
  print(final_X_train.shape,final_y_train.shape,final_X_kaggle_test.shape)
  voting_classifier.fit(final_X_train, final_y_train)
  start = time()
  train_time = round(time() - start, 4)
  print("Voting Score:{0}".format(voting_classifier.score(final_X_train, final_y_train)))
  test_class_scores = voting_classifier.predict_proba(final_X_kaggle_test)[:, 1]
  print(test_class_scores[0:10])
  
  #For each SK_ID_CURR in the test set, you must predict a probability for the TARGET variable. 
  #
  submit_df = datasets["application_test"][['SK_ID_CURR']]
  submit_df['TARGET'] = test_class_scores
  print(submit_df.head(2))
  submit_df.to_csv(f'/content/drive/My Drive/AML Project/Data/Phase3/submission_{fs_type}.csv',index=False)
In [386]:
final_best_clf
model_selection = ['Logistic Regression','Gradient Boosting','XGBoost','Light GBM','RandomForest']
fs_type='RFE'
voting_classifier_submission(model_selection,final_best_clf,fs_type,fs_params)
CPU times: user 3 µs, sys: 0 ns, total: 3 µs
Wall time: 6.68 µs
Classifier with parameters
1  : LogisticRegression(C=0.1, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l1',
                   random_state=42, solver='saga', tol=0.0001, verbose=0,
                   warm_start=False)
2  : GradientBoostingClassifier(ccp_alpha=0.0, criterion='friedman_mse', init=None,
                           learning_rate=0.1, loss='deviance', max_depth=5,
                           max_features=10, max_leaf_nodes=None,
                           min_impurity_decrease=0.0, min_impurity_split=None,
                           min_samples_leaf=3, min_samples_split=2,
                           min_weight_fraction_leaf=0.0, n_estimators=1000,
                           n_iter_no_change=10, presort='deprecated',
                           random_state=42, subsample=0.8, tol=0.0001,
                           validation_fraction=0.2, verbose=0, warm_start=True)
3  : XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=0.2, eta=0.01, gamma=0,
              learning_rate=0.01, max_delta_step=0, max_depth=3,
              min_child_weight=1, missing=None, n_estimators=500, n_jobs=1,
              nthread=None, objective='binary:logistic', random_state=42,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
              silent=None, subsample=1, verbosity=1)
4  : LGBMClassifier(boosting_type='dart', class_weight=None, colsample_bytree=1.0,
               importance_type='split', learning_rate=0.01, max_bin=100,
               max_depth=5, min_child_samples=20, min_child_weight=0.001,
               min_split_gain=0.0, n_estimators=1000, n_jobs=-1, num_leaves=5,
               objective=None, random_state=42, reg_alpha=0.0, reg_lambda=0.0,
               silent=True, subsample=1.0, subsample_for_bin=200000,
               subsample_freq=0)
5  : RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=5, max_features=15,
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=3, min_samples_split=10,
                       min_weight_fraction_leaf=0.0, n_estimators=1000,
                       n_jobs=None, oob_score=False, random_state=42, verbose=0,
                       warm_start=False)
(4101, 132) (4101,) (48744, 132)
Voting Score:0.9246525237746891
[0.09292899 0.16168017 0.03304763 0.05404492 0.12422177 0.06133604
 0.03262728 0.04783795 0.05152891 0.17806004]
   SK_ID_CURR    TARGET
0      100001  0.092929
1      100005  0.161680

PCA & Handling MultiCollinearity

Multicollinearity highly affects the variance associated with the problem, and can also affect the interpretation of the model, as it undermines the statistical significance of independent variables.For a dataset, if some of the independent variables are highly independent of each other, it results in multicollinearity. A small change in any of the features can affect the model performance to a great extent. In other words, The coefficients of the model become very sensitive to small changes in the independent variables.The basic idea is to run a PCA on all predictors. Their ratio, the Condition Index, will be high if multicollinearity is present.
Reference : https://www.whitman.edu/Documents/Academics/Mathematics/2017/Perez

Logistic regression with PCA

In [387]:
for (name, classifier,feature_sel) in classifiers[0]:
        # Print classifier and parameters
        print('****** START', name,'*****')
        parameters = params_grid[name]
        print("Parameters:")
        for p in sorted(parameters.keys()):
            print("\t"+str(p)+": "+ str(parameters[p]))

        # generate the pipeline based on the feature selection method
        full_pipeline_with_predictor = Pipeline([
            ("preparation", data_prep_pipeline),
            ("PCA",PCA(0.95)),
            ("predictor", classifier)
            ])
 
        # Execute the grid search
        params = {}
        for p in parameters.keys():
            pipe_key = 'predictor__'+str(p)
            params[pipe_key] = parameters[p] 
        grid_search = GridSearchCV(full_pipeline_with_predictor, params, cv=cvSplits, scoring='roc_auc',
                                   n_jobs=-1,verbose=1)
        grid_search.fit(X_train, y_train)

        # Best estimator score
        best_train = pct(grid_search.best_score_)

        # Best train scores
        print("Cross validation with best estimator")
        best_train_scores = cross_validate(grid_search.best_estimator_, X_train, y_train,cv=cvSplits,scoring=metrics, 
                                           return_train_score=True, n_jobs=-1)  

        #get all scores
        best_train_accuracy = np.round(best_train_scores['train_accuracy'].mean(),4)
        best_train_f1 = np.round(best_train_scores['train_f1'].mean(),4)
        best_train_logloss = np.round(best_train_scores['train_log_loss'].mean(),4)
        best_train_roc_auc = np.round(best_train_scores['train_roc_auc'].mean(),4)

        valid_time = np.round(best_train_scores['score_time'].mean(),4)
        best_valid_accuracy = np.round(best_train_scores['test_accuracy'].mean(),4)
        best_valid_f1 = np.round(best_train_scores['test_f1'].mean(),4)
        best_valid_logloss = np.round(best_train_scores['test_log_loss'].mean(),4)
        best_valid_roc_auc = np.round(best_train_scores['test_roc_auc'].mean(),4)

        (t_stat, p_value) = stats.ttest_rel(logit_scores['train_roc_auc'], best_train_scores['train_roc_auc'])

        #test and Prediction with whole data
        # Best estimator fitting time
        print("Fit and Prediction with best estimator")
        start = time()
        model = grid_search.best_estimator_.fit(X_train, y_train)
        train_time = round(time() - start, 4)

        # Best estimator prediction time
        start = time()
        y_test_pred = model.predict(X_test)
        test_time = round(time() - start, 4)

        # Collect the best parameters found by the grid search
        print("Best Parameters:")
        best_parameters = grid_search.best_estimator_.get_params()
        param_dump = []
        for param_name in sorted(params.keys()):
            param_dump.append((param_name, best_parameters[param_name]))
            print("\t"+str(param_name)+": " + str(best_parameters[param_name]))
        print("****** FINISH",name," *****")
        print("")

        # Record the results
#        exp_name = "Logistic Regression with PCA"
        expLog.loc[len(expLog)] = [f"{exp_name}"] + list(np.round(
                [best_train_accuracy, 
                #pct(accuracy_score(y_valid, model.predict(X_valid))),
                best_valid_accuracy,
                accuracy_score(y_test, y_test_pred),
                best_train_roc_auc,
                best_valid_roc_auc,
                #roc_auc_score(y_valid, model.predict_proba(X_valid)[:, 1]),
                roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]),
                best_train_f1,
                best_valid_f1,
                f1_score(y_test, y_test_pred),
                best_train_logloss,
                best_valid_logloss, 
                log_loss(y_test, y_test_pred),
                p_value
                ],4)) + [train_time,valid_time,test_time] \
                + [json.dumps(param_dump)]
****** START Logistic Regression *****
Parameters:
	C: (10, 1, 0.1, 0.01)
	penalty: ('l1', 'l2', 'elasticnet')
	tol: (0.0001, 1e-05)
Fitting 5 folds for each of 24 candidates, totalling 120 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:    4.9s
[Parallel(n_jobs=-1)]: Done 120 out of 120 | elapsed:    9.8s finished
Cross validation with best estimator
Fit and Prediction with best estimator
Best Parameters:
	predictor__C: 0.01
	predictor__penalty: l2
	predictor__tol: 1e-05
****** FINISH Logistic Regression  *****

In [388]:
expLog
Out[388]:
exp_name Train Acc Valid Acc Test Acc Train AUC Valid AUC Test AUC Train F1 Score Valid F1 Score Test F1 Score Train Log Loss Valid Log Loss Test Log Loss P Score Train Time Valid Time Test Time Description
0 Baseline_132_features 0.9289 0.9120 0.9188 0.8547 0.7024 0.7500 0.2301 0.1298 0.1071 2.4564 3.0387 2.8058 0.0000 0.5625 0.018573 0.0077 Imbalanced Logistic reg features 132: Num:132, Cat:0 with 20% training data
1 Logistic Regression 0.9256 0.9230 0.9220 0.8162 0.7577 0.7522 0.0602 0.0396 0.0204 2.5697 2.6612 2.6935 0.0002 1.0242 0.017700 0.0066 [["predictor__C", 0.1], ["predictor__penalty", "l1"], ["predictor__tol", 0.0001]]
2 Gradient Boosting 0.9475 0.9224 0.9196 0.9412 0.7240 0.7212 0.4651 0.0212 0.0000 1.8129 2.6801 2.7777 0.0005 6.4526 0.022900 0.0114 [["predictor__max_depth", 5], ["predictor__max_features", 10], ["predictor__min_samples_leaf", 3], ["predictor__n_estimators", 1000], ["predictor__n_iter_no_change", 10], ["predictor__subsample", 0.8], ["predictor__tol", 0.0001], ["predictor__validation_fraction", 0.2]]
3 XGBoost 0.9310 0.9230 0.9236 0.9599 0.7325 0.7367 0.1543 0.0000 0.0000 2.3835 2.6612 2.6374 0.0000 3.3547 0.057700 0.0305 [["predictor__colsample_bytree", 0.2], ["predictor__eta", 0.01], ["predictor__learning_rate", 0.01], ["predictor__max_depth", 3], ["predictor__n_estimators", 500]]
4 Light GBM 0.9281 0.9224 0.9228 0.8927 0.7277 0.7220 0.0864 0.0000 0.0000 2.4847 2.6801 2.6655 0.0012 6.2475 0.116800 0.0649 [["predictor__boosting_type", "dart"], ["predictor__learning_rate", 0.01], ["predictor__max_bin", 100], ["predictor__max_depth", 5], ["predictor__n_estimators", 1000], ["predictor__num_leaves", 5]]
5 RandomForest 0.9248 0.9232 0.9236 0.9510 0.7384 0.7207 0.0059 0.0000 0.0000 2.5980 2.6517 2.6374 0.0001 7.1055 0.504400 0.2098 [["predictor__bootstrap", true], ["predictor__max_depth", 5], ["predictor__max_features", 15], ["predictor__min_samples_leaf", 3], ["predictor__min_samples_split", 10], ["predictor__n_estimators", 1000]]
6 Support Vector 0.9245 0.9232 0.9236 1.0000 0.6624 0.6721 0.0000 0.0000 0.0000 2.6061 2.6517 2.6374 0.0000 2.3175 0.251700 0.1993 [["predictor__C", 0.01], ["predictor__degree", 4], ["predictor__gamma", 0.01], ["predictor__kernel", "rbf"]]
7 Baseline_132_features 0.9256 0.9219 0.9228 0.8240 0.7427 0.7549 0.0503 0.0237 0.0206 2.5697 2.6989 2.6655 0.0006 0.3085 0.021900 0.0081 [["predictor__C", 0.01], ["predictor__penalty", "l2"], ["predictor__tol", 1e-05]]
In [389]:
  fs_type='Logistic PCA'
  final_X_train = finaldf[0][selected_features]
  final_y_train = finaldf[0]['TARGET']
  final_X_kaggle_test = kaggle_test
  grid_search.best_estimator_.fit(final_X_train, final_y_train)
  print(final_X_train.shape,final_y_train.shape,final_X_kaggle_test.shape)
  start = time()
  train_time = round(time() - start, 4)
  print("Logistic PCA Score:{0}".format(grid_search.best_estimator_.score(final_X_train, final_y_train)))
  test_class_scores = grid_search.best_estimator_.predict_proba(final_X_kaggle_test)[:, 1]
  print(test_class_scores[0:10])
  
  #For each SK_ID_CURR in the test set, you must predict a probability for the TARGET variable. 
  #
  submit_df = datasets["application_test"][['SK_ID_CURR']]
  submit_df['TARGET'] = test_class_scores
  print(submit_df.head(2))
  submit_df.to_csv(f'/content/drive/My Drive/AML Project/Data/Phase3/submission_{fs_type}.csv',index=False)
(4101, 132) (4101,) (48744, 132)
Logistic PCA Score:0.9234333089490369
[0.05766601 0.22704034 0.01519151 0.08732243 0.17832727 0.06538762
 0.01624627 0.04306863 0.0169574  0.10466651]
   SK_ID_CURR    TARGET
0      100001  0.057666
1      100005  0.227040

Tune Basline model (Grid search & SelectKbest Feature Selection)

Various Classification algorithms were used to compare with the best model. Following metrics were used to find the best model

  • Cross fold Train Accuracy
  • Test Accuracy
  • p-value
  • Train ROC_AUC_Score
  • Test ROC_AUC_Score
  • Train F1_Score
  • Test F1_Score
  • Train LogLoss
  • Test LogLoss
  • Train Time
  • Test Time
  • Confusion matrix
In [390]:
# Clean up the  arrays

del fprs[1:]
del tprs[1:] 
del precisions[1:] 
del recalls[1:] 
del names[1:] 
del scores[1:] 
del cvscores[1:] 
del pvalues[1:] 
del accuracy[1:]
del cnfmatrix[1:] 
del results[1:]
final_best_clf,results = {}, {}
In [398]:
print(names)
['Baseline LR']

Classifiers

In [391]:
classifiers = [
        [('Logistic Regression', LogisticRegression(solver='saga',random_state=42),"SelectKbest")],
        [('XGBoost', XGBClassifier(random_state=42),"SelectKbest")],
        [('Light GBM', LGBMClassifier(boosting_type='gbdt', random_state=42),"SelectKbest")],
        [('RandomForest', RandomForestClassifier(random_state=42),"SelectKbest")]
    ]

Hyper-parameters for different models

In [393]:
# Arrange grid search parameters for each classifier
params_grid = {
        'Logistic Regression': {
            'penalty': ('l1', 'l2','elasticnet'),
            'tol': (0.0001, 0.00001), 
            'C': (10, 1, 0.1, 0.01),
        }
    ,
        'XGBoost':  {
            'max_depth': [3,5], # Lower helps with overfitting
            'n_estimators':[300,500],
            'learning_rate': [0.01,0.1],
#            'objective': ['binary:logistic'],
#            'eval_metric': ['auc'],
            'eta' : [0.01,0.1],
            'colsample_bytree' : [0.2,0.5], 
        },
        'Light GBM':  {
            'max_depth': [2,5],  # Lower helps with overfitting
            'num_leaves': [5,10], # Equivalent to max depth
            'n_estimators':[1000,5000],
            'learning_rate': [0.01,0.1],
 #           'reg_alpha': [0.1,0.01,1],
 #           'reg_lambda': [0.1,0.01,1],
            'boosting_type':['goss','dart'],
 #           'metric': ['auc'],
 #           'objective':['binary'],
            'max_bin' : [100,200],  #Setting it to high values has similar effect as caused by increasing value of num_leaves 
        },                          #small numbers reduces accuracy but runs faster 

        'RandomForest':  {
            'max_depth': [5,10],
            'max_features': [15,20],
            'min_samples_split': [5, 10],
            'min_samples_leaf': [3, 5],
            'bootstrap': [True],
            'n_estimators':[1000]},
    }
In [403]:
results = []
results.append(logit_scores['train_accuracy'])
def ConductGridSearch(in_classifiers,cnfmatrix,fprs,tprs,precisions,recalls):
    for (name, classifier,feature_sel) in in_classifiers:
            # Print classifier and parameters
            print('****** START', name,'*****')
            parameters = params_grid[name]
            print("Parameters:")
            for p in sorted(parameters.keys()):
                print("\t"+str(p)+": "+ str(parameters[p]))

            # generate the pipeline based on the feature selection method
            full_pipeline_with_predictor = Pipeline([
                ("preparation", data_prep_pipeline),
                ('SelectKbest',SelectKBest(score_func=mutual_info_classif, k=features_used)),
                ("predictor", classifier)
                ])

            # Execute the grid search
            params = {}
            for p in parameters.keys():
                pipe_key = 'predictor__'+str(p)
                params[pipe_key] = parameters[p] 
            grid_search = GridSearchCV(full_pipeline_with_predictor, params, cv=cvSplits, scoring='roc_auc',
                                       n_jobs=-1,verbose=1)
            grid_search.fit(X_train, y_train)

            # Best estimator score
            best_train = pct(grid_search.best_score_)

            # Best train scores
            print("Cross validation with best estimator")
            best_train_scores = cross_validate(grid_search.best_estimator_, X_train, y_train,cv=cvSplits,scoring=metrics, 
                                               return_train_score=True, n_jobs=-1)  

            #get all scores
            best_train_accuracy = np.round(best_train_scores['train_accuracy'].mean(),4)
            best_train_f1 = np.round(best_train_scores['train_f1'].mean(),4)
            best_train_logloss = np.round(best_train_scores['train_log_loss'].mean(),4)
            best_train_roc_auc = np.round(best_train_scores['train_roc_auc'].mean(),4)

            valid_time = np.round(best_train_scores['score_time'].mean(),4)
            best_valid_accuracy = np.round(best_train_scores['test_accuracy'].mean(),4)
            best_valid_f1 = np.round(best_train_scores['test_f1'].mean(),4)
            best_valid_logloss = np.round(best_train_scores['test_log_loss'].mean(),4)
            best_valid_roc_auc = np.round(best_train_scores['test_roc_auc'].mean(),4)

            #append all results
            results.append(best_train_scores['train_accuracy'])
            names.append(name)
            
            # Conduct t-test with baseline logit (control) and best estimator (experiment)
            (t_stat, p_value) = stats.ttest_rel(logit_scores['train_roc_auc'], best_train_scores['train_roc_auc'])

            #test and Prediction with whole data
            # Best estimator fitting time
            print("Fit and Prediction with best estimator")
            start = time()
            model = grid_search.best_estimator_.fit(X_train, y_train)
            train_time = round(time() - start, 4)

            # Best estimator prediction time
            start = time()
            y_test_pred = model.predict(X_test)
            test_time = round(time() - start, 4)
            scores.append(roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))
            accuracy.append(accuracy_score(y_test, y_test_pred))

            # Create confusion matrix for the best model
            cnfmatrix = confusion_matrix_def(model,X_train,y_train,X_test,y_test,X_valid, y_valid,cnfmatrix)

            # Create AUC ROC curve
            fprs,tprs = roc_curve_cust(model,X_train,y_train,X_test, y_test,X_valid, y_valid,fprs,tprs,name)

            #Create Precision recall curve
            precisions,recalls = precision_recall_cust(model,X_train,y_train,X_test, y_test,X_valid, y_valid,precisions,recalls,name)

            #Best Model
            final_best_clf[name]=pd.DataFrame([{'label': grid_search.best_estimator_.named_steps['predictor'].__class__.__name__,
                                           'predictor': grid_search.best_estimator_.named_steps['predictor']}])
            
            # Collect the best parameters found by the grid search
            print("Best Parameters:")
            best_parameters = grid_search.best_estimator_.get_params()
            param_dump = []
            for param_name in sorted(params.keys()):
                param_dump.append((param_name, best_parameters[param_name]))
                print("\t"+str(param_name)+": " + str(best_parameters[param_name]))
            print("****** FINISH",name," *****")
            print("")

            # Record the results
            exp_name = name+str('SelectKbest')
            expLog.loc[len(expLog)] = [f"{exp_name}"] + list(np.round(
                    [best_train_accuracy, 
                    #pct(accuracy_score(y_valid, model.predict(X_valid))),
                    best_valid_accuracy,
                    accuracy_score(y_test, y_test_pred),
                    best_train_roc_auc,
                    best_valid_roc_auc,
                    #roc_auc_score(y_valid, model.predict_proba(X_valid)[:, 1]),
                    roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]),
                    best_train_f1,
                    best_valid_f1,
                    f1_score(y_test, y_test_pred),
                    best_train_logloss,
                    best_valid_logloss, 
                    log_loss(y_test, y_test_pred),
                    p_value
                    ],4)) + [train_time,valid_time,test_time] \
                    + [json.dumps(param_dump)]

Logistic Regression

In [404]:
ConductGridSearch(classifiers[0],cnfmatrix,fprs,tprs,precisions,recalls)
****** START Logistic Regression *****
Parameters:
	C: (10, 1, 0.1, 0.01)
	penalty: ('l1', 'l2', 'elasticnet')
	tol: (0.0001, 1e-05)
Fitting 5 folds for each of 24 candidates, totalling 120 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:   35.8s
[Parallel(n_jobs=-1)]: Done 120 out of 120 | elapsed:  1.6min finished
Cross validation with best estimator
Fit and Prediction with best estimator
Best Parameters:
	predictor__C: 0.1
	predictor__penalty: l1
	predictor__tol: 1e-05
****** FINISH Logistic Regression  *****

In [405]:
expLog
Out[405]:
exp_name Train Acc Valid Acc Test Acc Train AUC Valid AUC Test AUC Train F1 Score Valid F1 Score Test F1 Score Train Log Loss Valid Log Loss Test Log Loss P Score Train Time Valid Time Test Time Description
0 Baseline_132_features 0.9289 0.9120 0.9188 0.8547 0.7024 0.7500 0.2301 0.1298 0.1071 2.4564 3.0387 2.8058 0.0000 0.5625 0.018573 0.0077 Imbalanced Logistic reg features 132: Num:132, Cat:0 with 20% training data
1 Logistic Regression 0.9256 0.9230 0.9220 0.8162 0.7577 0.7522 0.0602 0.0396 0.0204 2.5697 2.6612 2.6935 0.0002 1.0242 0.017700 0.0066 [["predictor__C", 0.1], ["predictor__penalty", "l1"], ["predictor__tol", 0.0001]]
2 Gradient Boosting 0.9475 0.9224 0.9196 0.9412 0.7240 0.7212 0.4651 0.0212 0.0000 1.8129 2.6801 2.7777 0.0005 6.4526 0.022900 0.0114 [["predictor__max_depth", 5], ["predictor__max_features", 10], ["predictor__min_samples_leaf", 3], ["predictor__n_estimators", 1000], ["predictor__n_iter_no_change", 10], ["predictor__subsample", 0.8], ["predictor__tol", 0.0001], ["predictor__validation_fraction", 0.2]]
3 XGBoost 0.9310 0.9230 0.9236 0.9599 0.7325 0.7367 0.1543 0.0000 0.0000 2.3835 2.6612 2.6374 0.0000 3.3547 0.057700 0.0305 [["predictor__colsample_bytree", 0.2], ["predictor__eta", 0.01], ["predictor__learning_rate", 0.01], ["predictor__max_depth", 3], ["predictor__n_estimators", 500]]
4 Light GBM 0.9281 0.9224 0.9228 0.8927 0.7277 0.7220 0.0864 0.0000 0.0000 2.4847 2.6801 2.6655 0.0012 6.2475 0.116800 0.0649 [["predictor__boosting_type", "dart"], ["predictor__learning_rate", 0.01], ["predictor__max_bin", 100], ["predictor__max_depth", 5], ["predictor__n_estimators", 1000], ["predictor__num_leaves", 5]]
5 RandomForest 0.9248 0.9232 0.9236 0.9510 0.7384 0.7207 0.0059 0.0000 0.0000 2.5980 2.6517 2.6374 0.0001 7.1055 0.504400 0.2098 [["predictor__bootstrap", true], ["predictor__max_depth", 5], ["predictor__max_features", 15], ["predictor__min_samples_leaf", 3], ["predictor__min_samples_split", 10], ["predictor__n_estimators", 1000]]
6 Support Vector 0.9245 0.9232 0.9236 1.0000 0.6624 0.6721 0.0000 0.0000 0.0000 2.6061 2.6517 2.6374 0.0000 2.3175 0.251700 0.1993 [["predictor__C", 0.01], ["predictor__degree", 4], ["predictor__gamma", 0.01], ["predictor__kernel", "rbf"]]
7 Baseline_132_features 0.9256 0.9219 0.9228 0.8240 0.7427 0.7549 0.0503 0.0237 0.0206 2.5697 2.6989 2.6655 0.0006 0.3085 0.021900 0.0081 [["predictor__C", 0.01], ["predictor__penalty", "l2"], ["predictor__tol", 1e-05]]
8 Logistic Regression 0.9255 0.9230 0.9220 0.8156 0.7558 0.7517 0.0602 0.0396 0.0204 2.5737 2.6612 2.6935 0.0002 2.7305 0.018800 0.0074 [["predictor__C", 0.1], ["predictor__penalty", "l1"], ["predictor__tol", 1e-05]]
9 Logistic RegressionSelectKbest 0.9257 0.9227 0.9228 0.8148 0.7578 0.7486 0.0629 0.0396 0.0206 2.5656 2.6706 2.6655 0.0003 2.6489 0.016300 0.0069 [["predictor__C", 0.1], ["predictor__penalty", "l1"], ["predictor__tol", 1e-05]]

Random Forest

In [406]:
ConductGridSearch(classifiers[3],cnfmatrix,fprs,tprs,precisions,recalls)
****** START RandomForest *****
Parameters:
	bootstrap: [True]
	max_depth: [5, 10]
	max_features: [15, 20]
	min_samples_leaf: [3, 5]
	min_samples_split: [5, 10]
	n_estimators: [1000]
Fitting 5 folds for each of 16 candidates, totalling 80 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:  1.9min
[Parallel(n_jobs=-1)]: Done  80 out of  80 | elapsed:  4.0min finished
Cross validation with best estimator
Fit and Prediction with best estimator
Best Parameters:
	predictor__bootstrap: True
	predictor__max_depth: 5
	predictor__max_features: 20
	predictor__min_samples_leaf: 3
	predictor__min_samples_split: 10
	predictor__n_estimators: 1000
****** FINISH RandomForest  *****

In [407]:
expLog
Out[407]:
exp_name Train Acc Valid Acc Test Acc Train AUC Valid AUC Test AUC Train F1 Score Valid F1 Score Test F1 Score Train Log Loss Valid Log Loss Test Log Loss P Score Train Time Valid Time Test Time Description
0 Baseline_132_features 0.9289 0.9120 0.9188 0.8547 0.7024 0.7500 0.2301 0.1298 0.1071 2.4564 3.0387 2.8058 0.0000 0.5625 0.018573 0.0077 Imbalanced Logistic reg features 132: Num:132, Cat:0 with 20% training data
1 Logistic Regression 0.9256 0.9230 0.9220 0.8162 0.7577 0.7522 0.0602 0.0396 0.0204 2.5697 2.6612 2.6935 0.0002 1.0242 0.017700 0.0066 [["predictor__C", 0.1], ["predictor__penalty", "l1"], ["predictor__tol", 0.0001]]
2 Gradient Boosting 0.9475 0.9224 0.9196 0.9412 0.7240 0.7212 0.4651 0.0212 0.0000 1.8129 2.6801 2.7777 0.0005 6.4526 0.022900 0.0114 [["predictor__max_depth", 5], ["predictor__max_features", 10], ["predictor__min_samples_leaf", 3], ["predictor__n_estimators", 1000], ["predictor__n_iter_no_change", 10], ["predictor__subsample", 0.8], ["predictor__tol", 0.0001], ["predictor__validation_fraction", 0.2]]
3 XGBoost 0.9310 0.9230 0.9236 0.9599 0.7325 0.7367 0.1543 0.0000 0.0000 2.3835 2.6612 2.6374 0.0000 3.3547 0.057700 0.0305 [["predictor__colsample_bytree", 0.2], ["predictor__eta", 0.01], ["predictor__learning_rate", 0.01], ["predictor__max_depth", 3], ["predictor__n_estimators", 500]]
4 Light GBM 0.9281 0.9224 0.9228 0.8927 0.7277 0.7220 0.0864 0.0000 0.0000 2.4847 2.6801 2.6655 0.0012 6.2475 0.116800 0.0649 [["predictor__boosting_type", "dart"], ["predictor__learning_rate", 0.01], ["predictor__max_bin", 100], ["predictor__max_depth", 5], ["predictor__n_estimators", 1000], ["predictor__num_leaves", 5]]
5 RandomForest 0.9248 0.9232 0.9236 0.9510 0.7384 0.7207 0.0059 0.0000 0.0000 2.5980 2.6517 2.6374 0.0001 7.1055 0.504400 0.2098 [["predictor__bootstrap", true], ["predictor__max_depth", 5], ["predictor__max_features", 15], ["predictor__min_samples_leaf", 3], ["predictor__min_samples_split", 10], ["predictor__n_estimators", 1000]]
6 Support Vector 0.9245 0.9232 0.9236 1.0000 0.6624 0.6721 0.0000 0.0000 0.0000 2.6061 2.6517 2.6374 0.0000 2.3175 0.251700 0.1993 [["predictor__C", 0.01], ["predictor__degree", 4], ["predictor__gamma", 0.01], ["predictor__kernel", "rbf"]]
7 Baseline_132_features 0.9256 0.9219 0.9228 0.8240 0.7427 0.7549 0.0503 0.0237 0.0206 2.5697 2.6989 2.6655 0.0006 0.3085 0.021900 0.0081 [["predictor__C", 0.01], ["predictor__penalty", "l2"], ["predictor__tol", 1e-05]]
8 Logistic Regression 0.9255 0.9230 0.9220 0.8156 0.7558 0.7517 0.0602 0.0396 0.0204 2.5737 2.6612 2.6935 0.0002 2.7305 0.018800 0.0074 [["predictor__C", 0.1], ["predictor__penalty", "l1"], ["predictor__tol", 1e-05]]
9 Logistic RegressionSelectKbest 0.9257 0.9227 0.9228 0.8148 0.7578 0.7486 0.0629 0.0396 0.0206 2.5656 2.6706 2.6655 0.0003 2.6489 0.016300 0.0069 [["predictor__C", 0.1], ["predictor__penalty", "l1"], ["predictor__tol", 1e-05]]
10 RandomForestSelectKbest 0.9252 0.9232 0.9236 0.9498 0.7396 0.7177 0.0172 0.0000 0.0000 2.5818 2.6517 2.6374 0.0001 8.7915 0.504400 0.1934 [["predictor__bootstrap", true], ["predictor__max_depth", 5], ["predictor__max_features", 20], ["predictor__min_samples_leaf", 3], ["predictor__min_samples_split", 10], ["predictor__n_estimators", 1000]]

XGBoost

In [409]:
ConductGridSearch(classifiers[1],cnfmatrix,fprs,tprs,precisions,recalls)
****** START XGBoost *****
Parameters:
	colsample_bytree: [0.2, 0.5]
	eta: [0.01, 0.1]
	learning_rate: [0.01, 0.1]
	max_depth: [3, 5]
	n_estimators: [300, 500]
Fitting 5 folds for each of 32 candidates, totalling 160 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:   50.4s
[Parallel(n_jobs=-1)]: Done 160 out of 160 | elapsed:  3.6min finished
Cross validation with best estimator
Fit and Prediction with best estimator
Best Parameters:
	predictor__colsample_bytree: 0.2
	predictor__eta: 0.1
	predictor__learning_rate: 0.01
	predictor__max_depth: 3
	predictor__n_estimators: 500
****** FINISH XGBoost  *****

In [410]:
expLog
Out[410]:
exp_name Train Acc Valid Acc Test Acc Train AUC Valid AUC Test AUC Train F1 Score Valid F1 Score Test F1 Score Train Log Loss Valid Log Loss Test Log Loss P Score Train Time Valid Time Test Time Description
0 Baseline_132_features 0.9289 0.9120 0.9188 0.8547 0.7024 0.7500 0.2301 0.1298 0.1071 2.4564 3.0387 2.8058 0.0000 0.5625 0.018573 0.0077 Imbalanced Logistic reg features 132: Num:132, Cat:0 with 20% training data
1 Logistic Regression 0.9256 0.9230 0.9220 0.8162 0.7577 0.7522 0.0602 0.0396 0.0204 2.5697 2.6612 2.6935 0.0002 1.0242 0.017700 0.0066 [["predictor__C", 0.1], ["predictor__penalty", "l1"], ["predictor__tol", 0.0001]]
2 Gradient Boosting 0.9475 0.9224 0.9196 0.9412 0.7240 0.7212 0.4651 0.0212 0.0000 1.8129 2.6801 2.7777 0.0005 6.4526 0.022900 0.0114 [["predictor__max_depth", 5], ["predictor__max_features", 10], ["predictor__min_samples_leaf", 3], ["predictor__n_estimators", 1000], ["predictor__n_iter_no_change", 10], ["predictor__subsample", 0.8], ["predictor__tol", 0.0001], ["predictor__validation_fraction", 0.2]]
3 XGBoost 0.9310 0.9230 0.9236 0.9599 0.7325 0.7367 0.1543 0.0000 0.0000 2.3835 2.6612 2.6374 0.0000 3.3547 0.057700 0.0305 [["predictor__colsample_bytree", 0.2], ["predictor__eta", 0.01], ["predictor__learning_rate", 0.01], ["predictor__max_depth", 3], ["predictor__n_estimators", 500]]
4 Light GBM 0.9281 0.9224 0.9228 0.8927 0.7277 0.7220 0.0864 0.0000 0.0000 2.4847 2.6801 2.6655 0.0012 6.2475 0.116800 0.0649 [["predictor__boosting_type", "dart"], ["predictor__learning_rate", 0.01], ["predictor__max_bin", 100], ["predictor__max_depth", 5], ["predictor__n_estimators", 1000], ["predictor__num_leaves", 5]]
5 RandomForest 0.9248 0.9232 0.9236 0.9510 0.7384 0.7207 0.0059 0.0000 0.0000 2.5980 2.6517 2.6374 0.0001 7.1055 0.504400 0.2098 [["predictor__bootstrap", true], ["predictor__max_depth", 5], ["predictor__max_features", 15], ["predictor__min_samples_leaf", 3], ["predictor__min_samples_split", 10], ["predictor__n_estimators", 1000]]
6 Support Vector 0.9245 0.9232 0.9236 1.0000 0.6624 0.6721 0.0000 0.0000 0.0000 2.6061 2.6517 2.6374 0.0000 2.3175 0.251700 0.1993 [["predictor__C", 0.01], ["predictor__degree", 4], ["predictor__gamma", 0.01], ["predictor__kernel", "rbf"]]
7 Baseline_132_features 0.9256 0.9219 0.9228 0.8240 0.7427 0.7549 0.0503 0.0237 0.0206 2.5697 2.6989 2.6655 0.0006 0.3085 0.021900 0.0081 [["predictor__C", 0.01], ["predictor__penalty", "l2"], ["predictor__tol", 1e-05]]
8 Logistic Regression 0.9255 0.9230 0.9220 0.8156 0.7558 0.7517 0.0602 0.0396 0.0204 2.5737 2.6612 2.6935 0.0002 2.7305 0.018800 0.0074 [["predictor__C", 0.1], ["predictor__penalty", "l1"], ["predictor__tol", 1e-05]]
9 Logistic RegressionSelectKbest 0.9257 0.9227 0.9228 0.8148 0.7578 0.7486 0.0629 0.0396 0.0206 2.5656 2.6706 2.6655 0.0003 2.6489 0.016300 0.0069 [["predictor__C", 0.1], ["predictor__penalty", "l1"], ["predictor__tol", 1e-05]]
10 RandomForestSelectKbest 0.9252 0.9232 0.9236 0.9498 0.7396 0.7177 0.0172 0.0000 0.0000 2.5818 2.6517 2.6374 0.0001 8.7915 0.504400 0.1934 [["predictor__bootstrap", true], ["predictor__max_depth", 5], ["predictor__max_features", 20], ["predictor__min_samples_leaf", 3], ["predictor__min_samples_split", 10], ["predictor__n_estimators", 1000]]
11 XGBoostSelectKbest 0.9310 0.9230 0.9236 0.9591 0.7279 0.7384 0.1551 0.0000 0.0000 2.3835 2.6612 2.6374 0.0000 3.7312 0.059800 0.0314 [["predictor__colsample_bytree", 0.2], ["predictor__eta", 0.1], ["predictor__learning_rate", 0.01], ["predictor__max_depth", 3], ["predictor__n_estimators", 500]]

LightGBM

In [411]:
ConductGridSearch(classifiers[2],cnfmatrix,fprs,tprs,precisions,recalls)
****** START Light GBM *****
Parameters:
	boosting_type: ['goss', 'dart']
	learning_rate: [0.01, 0.1]
	max_bin: [100, 200]
	max_depth: [2, 5]
	n_estimators: [1000, 5000]
	num_leaves: [5, 10]
Fitting 5 folds for each of 64 candidates, totalling 320 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:  1.4min
[Parallel(n_jobs=-1)]: Done 192 tasks      | elapsed:  7.5min
[Parallel(n_jobs=-1)]: Done 320 out of 320 | elapsed: 18.6min finished
Cross validation with best estimator
Fit and Prediction with best estimator
Best Parameters:
	predictor__boosting_type: dart
	predictor__learning_rate: 0.01
	predictor__max_bin: 100
	predictor__max_depth: 5
	predictor__n_estimators: 1000
	predictor__num_leaves: 5
****** FINISH Light GBM  *****

In [412]:
expLog
Out[412]:
exp_name Train Acc Valid Acc Test Acc Train AUC Valid AUC Test AUC Train F1 Score Valid F1 Score Test F1 Score Train Log Loss Valid Log Loss Test Log Loss P Score Train Time Valid Time Test Time Description
0 Baseline_132_features 0.9289 0.9120 0.9188 0.8547 0.7024 0.7500 0.2301 0.1298 0.1071 2.4564 3.0387 2.8058 0.0000 0.5625 0.018573 0.0077 Imbalanced Logistic reg features 132: Num:132, Cat:0 with 20% training data
1 Logistic Regression 0.9256 0.9230 0.9220 0.8162 0.7577 0.7522 0.0602 0.0396 0.0204 2.5697 2.6612 2.6935 0.0002 1.0242 0.017700 0.0066 [["predictor__C", 0.1], ["predictor__penalty", "l1"], ["predictor__tol", 0.0001]]
2 Gradient Boosting 0.9475 0.9224 0.9196 0.9412 0.7240 0.7212 0.4651 0.0212 0.0000 1.8129 2.6801 2.7777 0.0005 6.4526 0.022900 0.0114 [["predictor__max_depth", 5], ["predictor__max_features", 10], ["predictor__min_samples_leaf", 3], ["predictor__n_estimators", 1000], ["predictor__n_iter_no_change", 10], ["predictor__subsample", 0.8], ["predictor__tol", 0.0001], ["predictor__validation_fraction", 0.2]]
3 XGBoost 0.9310 0.9230 0.9236 0.9599 0.7325 0.7367 0.1543 0.0000 0.0000 2.3835 2.6612 2.6374 0.0000 3.3547 0.057700 0.0305 [["predictor__colsample_bytree", 0.2], ["predictor__eta", 0.01], ["predictor__learning_rate", 0.01], ["predictor__max_depth", 3], ["predictor__n_estimators", 500]]
4 Light GBM 0.9281 0.9224 0.9228 0.8927 0.7277 0.7220 0.0864 0.0000 0.0000 2.4847 2.6801 2.6655 0.0012 6.2475 0.116800 0.0649 [["predictor__boosting_type", "dart"], ["predictor__learning_rate", 0.01], ["predictor__max_bin", 100], ["predictor__max_depth", 5], ["predictor__n_estimators", 1000], ["predictor__num_leaves", 5]]
5 RandomForest 0.9248 0.9232 0.9236 0.9510 0.7384 0.7207 0.0059 0.0000 0.0000 2.5980 2.6517 2.6374 0.0001 7.1055 0.504400 0.2098 [["predictor__bootstrap", true], ["predictor__max_depth", 5], ["predictor__max_features", 15], ["predictor__min_samples_leaf", 3], ["predictor__min_samples_split", 10], ["predictor__n_estimators", 1000]]
6 Support Vector 0.9245 0.9232 0.9236 1.0000 0.6624 0.6721 0.0000 0.0000 0.0000 2.6061 2.6517 2.6374 0.0000 2.3175 0.251700 0.1993 [["predictor__C", 0.01], ["predictor__degree", 4], ["predictor__gamma", 0.01], ["predictor__kernel", "rbf"]]
7 Baseline_132_features 0.9256 0.9219 0.9228 0.8240 0.7427 0.7549 0.0503 0.0237 0.0206 2.5697 2.6989 2.6655 0.0006 0.3085 0.021900 0.0081 [["predictor__C", 0.01], ["predictor__penalty", "l2"], ["predictor__tol", 1e-05]]
8 Logistic Regression 0.9255 0.9230 0.9220 0.8156 0.7558 0.7517 0.0602 0.0396 0.0204 2.5737 2.6612 2.6935 0.0002 2.7305 0.018800 0.0074 [["predictor__C", 0.1], ["predictor__penalty", "l1"], ["predictor__tol", 1e-05]]
9 Logistic RegressionSelectKbest 0.9257 0.9227 0.9228 0.8148 0.7578 0.7486 0.0629 0.0396 0.0206 2.5656 2.6706 2.6655 0.0003 2.6489 0.016300 0.0069 [["predictor__C", 0.1], ["predictor__penalty", "l1"], ["predictor__tol", 1e-05]]
10 RandomForestSelectKbest 0.9252 0.9232 0.9236 0.9498 0.7396 0.7177 0.0172 0.0000 0.0000 2.5818 2.6517 2.6374 0.0001 8.7915 0.504400 0.1934 [["predictor__bootstrap", true], ["predictor__max_depth", 5], ["predictor__max_features", 20], ["predictor__min_samples_leaf", 3], ["predictor__min_samples_split", 10], ["predictor__n_estimators", 1000]]
11 XGBoostSelectKbest 0.9310 0.9230 0.9236 0.9591 0.7279 0.7384 0.1551 0.0000 0.0000 2.3835 2.6612 2.6374 0.0000 3.7312 0.059800 0.0314 [["predictor__colsample_bytree", 0.2], ["predictor__eta", 0.1], ["predictor__learning_rate", 0.01], ["predictor__max_depth", 3], ["predictor__n_estimators", 500]]
12 Light GBMSelectKbest 0.9279 0.9227 0.9220 0.8901 0.7219 0.7208 0.0839 0.0083 0.0000 2.4887 2.6706 2.6935 0.0020 6.1818 0.125500 0.0553 [["predictor__boosting_type", "dart"], ["predictor__learning_rate", 0.01], ["predictor__max_bin", 100], ["predictor__max_depth", 5], ["predictor__n_estimators", 1000], ["predictor__num_leaves", 5]]

Model Validation

Boxplot with all CV results

In [413]:
# boxplot algorithm comparison
fig = pyplot.figure()
fig.suptitle('Classification Algorithm Comparison')
ax = fig.add_subplot(111)
pyplot.boxplot(results)
ax.set_xticklabels(names,rotation=90)
pyplot.grid()
pyplot.show()

AUC (Area Under the ROC Curve)

In [414]:
# roc curve fpr, tpr  for all classifiers 
plt.plot([0,1],[0,1], 'k--')
for i in range(len(names)):
    plt.plot(fprs[i],tprs[i],label = names[i] + '  ' + str(scores[i]))
plt.legend(bbox_to_anchor=(1.04,1), loc="upper left", borderaxespad=0)
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title('Receiver Operating Characteristic')
plt.show()

Precision Recall Curve

In [415]:
# precision recall curve  for all classifiers 
for i in range(len(names)):
    plt.plot(recalls[i],precisions[i],label = names[i])
plt.legend(bbox_to_anchor=(1.04,1), loc="upper left", borderaxespad=0)
plt.xlabel("Recall")
plt.ylabel("Precision")
plt.title('Precision-Recall Curve')
plt.show()

Confusion Matrix

In [416]:
# plot confusion matrix for all classifiers 
f, axes = plt.subplots(1, len(names), figsize=(30, 8), sharey='row')
for i in range(len(names)):
    disp = ConfusionMatrixDisplay(cnfmatrix[i], display_labels=['0', '1'])
    disp.plot(ax=axes[i], xticks_rotation=0)
    disp.ax_.set_title("Confusion Matrix - " + names[i])
    disp.im_.colorbar.remove()
    disp.ax_.set_xlabel('')
    if i!=0:
        disp.ax_.set_ylabel('')

f.text(0.4, 0.1, 'Predicted label', ha='left')
plt.subplots_adjust(wspace=0.10, hspace=0.1)

f.colorbar(disp.im_, ax=axes)
plt.show()

Final results

In [417]:
pd.set_option('display.max_colwidth', None)
expLog.to_csv("/content/drive/My Drive/AML Project/Data/Phase3/expLog_SelectKbest.csv",index=False)
expLog
Out[417]:
exp_name Train Acc Valid Acc Test Acc Train AUC Valid AUC Test AUC Train F1 Score Valid F1 Score Test F1 Score Train Log Loss Valid Log Loss Test Log Loss P Score Train Time Valid Time Test Time Description
0 Baseline_132_features 0.9289 0.9120 0.9188 0.8547 0.7024 0.7500 0.2301 0.1298 0.1071 2.4564 3.0387 2.8058 0.0000 0.5625 0.018573 0.0077 Imbalanced Logistic reg features 132: Num:132, Cat:0 with 20% training data
1 Logistic Regression 0.9256 0.9230 0.9220 0.8162 0.7577 0.7522 0.0602 0.0396 0.0204 2.5697 2.6612 2.6935 0.0002 1.0242 0.017700 0.0066 [["predictor__C", 0.1], ["predictor__penalty", "l1"], ["predictor__tol", 0.0001]]
2 Gradient Boosting 0.9475 0.9224 0.9196 0.9412 0.7240 0.7212 0.4651 0.0212 0.0000 1.8129 2.6801 2.7777 0.0005 6.4526 0.022900 0.0114 [["predictor__max_depth", 5], ["predictor__max_features", 10], ["predictor__min_samples_leaf", 3], ["predictor__n_estimators", 1000], ["predictor__n_iter_no_change", 10], ["predictor__subsample", 0.8], ["predictor__tol", 0.0001], ["predictor__validation_fraction", 0.2]]
3 XGBoost 0.9310 0.9230 0.9236 0.9599 0.7325 0.7367 0.1543 0.0000 0.0000 2.3835 2.6612 2.6374 0.0000 3.3547 0.057700 0.0305 [["predictor__colsample_bytree", 0.2], ["predictor__eta", 0.01], ["predictor__learning_rate", 0.01], ["predictor__max_depth", 3], ["predictor__n_estimators", 500]]
4 Light GBM 0.9281 0.9224 0.9228 0.8927 0.7277 0.7220 0.0864 0.0000 0.0000 2.4847 2.6801 2.6655 0.0012 6.2475 0.116800 0.0649 [["predictor__boosting_type", "dart"], ["predictor__learning_rate", 0.01], ["predictor__max_bin", 100], ["predictor__max_depth", 5], ["predictor__n_estimators", 1000], ["predictor__num_leaves", 5]]
5 RandomForest 0.9248 0.9232 0.9236 0.9510 0.7384 0.7207 0.0059 0.0000 0.0000 2.5980 2.6517 2.6374 0.0001 7.1055 0.504400 0.2098 [["predictor__bootstrap", true], ["predictor__max_depth", 5], ["predictor__max_features", 15], ["predictor__min_samples_leaf", 3], ["predictor__min_samples_split", 10], ["predictor__n_estimators", 1000]]
6 Support Vector 0.9245 0.9232 0.9236 1.0000 0.6624 0.6721 0.0000 0.0000 0.0000 2.6061 2.6517 2.6374 0.0000 2.3175 0.251700 0.1993 [["predictor__C", 0.01], ["predictor__degree", 4], ["predictor__gamma", 0.01], ["predictor__kernel", "rbf"]]
7 Baseline_132_features 0.9256 0.9219 0.9228 0.8240 0.7427 0.7549 0.0503 0.0237 0.0206 2.5697 2.6989 2.6655 0.0006 0.3085 0.021900 0.0081 [["predictor__C", 0.01], ["predictor__penalty", "l2"], ["predictor__tol", 1e-05]]
8 Logistic Regression 0.9255 0.9230 0.9220 0.8156 0.7558 0.7517 0.0602 0.0396 0.0204 2.5737 2.6612 2.6935 0.0002 2.7305 0.018800 0.0074 [["predictor__C", 0.1], ["predictor__penalty", "l1"], ["predictor__tol", 1e-05]]
9 Logistic RegressionSelectKbest 0.9257 0.9227 0.9228 0.8148 0.7578 0.7486 0.0629 0.0396 0.0206 2.5656 2.6706 2.6655 0.0003 2.6489 0.016300 0.0069 [["predictor__C", 0.1], ["predictor__penalty", "l1"], ["predictor__tol", 1e-05]]
10 RandomForestSelectKbest 0.9252 0.9232 0.9236 0.9498 0.7396 0.7177 0.0172 0.0000 0.0000 2.5818 2.6517 2.6374 0.0001 8.7915 0.504400 0.1934 [["predictor__bootstrap", true], ["predictor__max_depth", 5], ["predictor__max_features", 20], ["predictor__min_samples_leaf", 3], ["predictor__min_samples_split", 10], ["predictor__n_estimators", 1000]]
11 XGBoostSelectKbest 0.9310 0.9230 0.9236 0.9591 0.7279 0.7384 0.1551 0.0000 0.0000 2.3835 2.6612 2.6374 0.0000 3.7312 0.059800 0.0314 [["predictor__colsample_bytree", 0.2], ["predictor__eta", 0.1], ["predictor__learning_rate", 0.01], ["predictor__max_depth", 3], ["predictor__n_estimators", 500]]
12 Light GBMSelectKbest 0.9279 0.9227 0.9220 0.8901 0.7219 0.7208 0.0839 0.0083 0.0000 2.4887 2.6706 2.6935 0.0020 6.1818 0.125500 0.0553 [["predictor__boosting_type", "dart"], ["predictor__learning_rate", 0.01], ["predictor__max_bin", 100], ["predictor__max_depth", 5], ["predictor__n_estimators", 1000], ["predictor__num_leaves", 5]]

Kaggle submission

Voting Classifier to predict best results based on best Classifier Probability

In [420]:
def voting_classifier_submission(model_selection,final_best_clf,fs_type,fs_params):
  %%time 
  np.random.seed(42)
  print("Classifier with parameters")
  final_estimators = []
  for i,clf in enumerate(model_selection):
      model = final_best_clf[clf]['predictor'][0]
      print(i+1, " :",model)
      final_estimators.append((clf,make_pipeline(data_prep_pipeline,
                         SelectKBest(score_func=mutual_info_classif, k=features_used),
                          model)))
  voting_classifier = Pipeline([("clf", VotingClassifier(estimators=final_estimators, voting='soft'))])
  final_X_train = finaldf[0][selected_features]
  final_y_train = finaldf[0]['TARGET']
  final_X_kaggle_test = kaggle_test
  print(final_X_train.shape,final_y_train.shape,final_X_kaggle_test.shape)
  voting_classifier.fit(final_X_train, final_y_train)
  start = time()
  train_time = round(time() - start, 4)
  print("Voting Score:{0}".format(voting_classifier.score(final_X_train, final_y_train)))
  test_class_scores = voting_classifier.predict_proba(final_X_kaggle_test)[:, 1]
  print(test_class_scores[0:10])
  
  #For each SK_ID_CURR in the test set, you must predict a probability for the TARGET variable. 
  #
  submit_df = datasets["application_test"][['SK_ID_CURR']]
  submit_df['TARGET'] = test_class_scores
  print(submit_df.head(2))
  submit_df.to_csv(f'/content/drive/My Drive/AML Project/Data/Phase3/submission_{fs_type}.csv',index=False)

Submission File Prep

In [421]:
final_best_clf
model_selection = ['Logistic Regression','XGBoost','Light GBM','RandomForest']
fs_type='SelectKbest'
voting_classifier_submission(model_selection,final_best_clf,fs_type,fs_params)
CPU times: user 3 µs, sys: 0 ns, total: 3 µs
Wall time: 7.15 µs
Classifier with parameters
1  : LogisticRegression(C=0.1, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l1',
                   random_state=42, solver='saga', tol=1e-05, verbose=0,
                   warm_start=False)
2  : XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=0.2, eta=0.1, gamma=0,
              learning_rate=0.01, max_delta_step=0, max_depth=3,
              min_child_weight=1, missing=None, n_estimators=500, n_jobs=1,
              nthread=None, objective='binary:logistic', random_state=42,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
              silent=None, subsample=1, verbosity=1)
3  : LGBMClassifier(boosting_type='dart', class_weight=None, colsample_bytree=1.0,
               importance_type='split', learning_rate=0.01, max_bin=100,
               max_depth=5, min_child_samples=20, min_child_weight=0.001,
               min_split_gain=0.0, n_estimators=1000, n_jobs=-1, num_leaves=5,
               objective=None, random_state=42, reg_alpha=0.0, reg_lambda=0.0,
               silent=True, subsample=1.0, subsample_for_bin=200000,
               subsample_freq=0)
4  : RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=5, max_features=20,
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=3, min_samples_split=10,
                       min_weight_fraction_leaf=0.0, n_estimators=1000,
                       n_jobs=None, oob_score=False, random_state=42, verbose=0,
                       warm_start=False)
(4101, 132) (4101,) (48744, 132)
Voting Score:0.9244086808095586
[0.09728293 0.16706231 0.03504586 0.06002889 0.12528096 0.0729676
 0.03322035 0.05148104 0.04966292 0.1316742 ]
   SK_ID_CURR    TARGET
0      100001  0.097283
1      100005  0.167062

Models (Grid search & Variance Threshold Feature selection)

Various Classification algorithms were used to compare with the best model using Variance Thershold feature selection technique.

Variance Threshold The variance threshold is a simple baseline approach to feature selection. It removes all features which variance doesn’t meet some threshold. By default, it removes all zero-variance features, i.e., features that have the same value in all samples. We assume that features with a higher variance may contain more useful information, but note that we are not taking the relationship between feature variables or feature and target variables into account, which is one of the drawbacks of filter methods.

Following metrics were used to find the best model

  • Cross fold Train Accuracy
  • Test Accuracy
  • p-value
  • Train ROC_AUC_Score
  • Test ROC_AUC_Score
  • Train F1_Score
  • Test F1_Score
  • Train LogLoss
  • Test LogLoss
  • Train Time
  • Test Time
  • Confusion matrix
In [431]:
# Clean up the  arrays

del fprs[1:]
del tprs[1:] 
del precisions[1:] 
del recalls[1:] 
del names[1:] 
del scores[1:] 
del cvscores[1:] 
del pvalues[1:] 
del accuracy[1:]
del cnfmatrix[1:] 
del results[1:]
final_best_clf = {}

Classifiers

In [432]:
classifiers = [
        [('Logistic Regression', LogisticRegression(solver='saga',random_state=42),"VarianceThreshold")],
        [('XGBoost', XGBClassifier(random_state=42),"VarianceThreshold")],
        [('Light GBM', LGBMClassifier(boosting_type='gbdt', random_state=42),"VarianceThreshold")],
        [('RandomForest', RandomForestClassifier(random_state=42),"VarianceThreshold")]
    ]

Hyper-parameters for different models

In [433]:
# Arrange grid search parameters for each classifier
params_grid = {
        'Logistic Regression': {
            'penalty': ('l1', 'l2','elasticnet'),
            'tol': (0.0001, 0.00001), 
            'C': (10, 1, 0.1, 0.01),
        }
    ,
        'XGBoost':  {
            'max_depth': [3,5], # Lower helps with overfitting
            'n_estimators':[300,500],
            'learning_rate': [0.01,0.1],
#            'objective': ['binary:logistic'],
#            'eval_metric': ['auc'],
            'eta' : [0.01,0.1],
            'colsample_bytree' : [0.2,0.5], 
        },
        'Light GBM':  {
            'max_depth': [2,5],  # Lower helps with overfitting
            'num_leaves': [5,10], # Equivalent to max depth
            'n_estimators':[1000,5000],
            'learning_rate': [0.01,0.1],
 #           'reg_alpha': [0.1,0.01,1],
 #           'reg_lambda': [0.1,0.01,1],
            'boosting_type':['goss','dart'],
 #           'metric': ['auc'],
 #           'objective':['binary'],
            'max_bin' : [100,200],  #Setting it to high values has similar effect as caused by increasing value of num_leaves 
        },                          #small numbers reduces accuracy but runs faster 

        'RandomForest':  {
            'max_depth': [5,10],
            'max_features': [15,20],
            'min_samples_split': [5, 10],
            'min_samples_leaf': [3, 5],
            'bootstrap': [True],
            'n_estimators':[1000]},
    }
In [439]:
def ConductGridSearch(in_classifiers,cnfmatrix,fprs,tprs,precisions,recalls):
    for (name, classifier,feature_sel) in in_classifiers:
            # Print classifier and parameters
            print('****** START', name,'*****')
            parameters = params_grid[name]
            print("Parameters:")
            for p in sorted(parameters.keys()):
                print("\t"+str(p)+": "+ str(parameters[p]))

            # generate the pipeline based on the feature selection method
            full_pipeline_with_predictor = Pipeline([
                ("preparation", data_prep_pipeline),
                ("VarianceThreshold", VarianceThreshold(threshold=0.9)),    
                ("predictor", classifier)
                ])

            # Execute the grid search
            params = {}
            for p in parameters.keys():
                pipe_key = 'predictor__'+str(p)
                params[pipe_key] = parameters[p] 
            grid_search = GridSearchCV(full_pipeline_with_predictor, params, cv=cvSplits, scoring='roc_auc',
                                       n_jobs=-1,verbose=1)
            grid_search.fit(X_train, y_train)

            # Best estimator score
            best_train = pct(grid_search.best_score_)

            # Best train scores
            print("Cross validation with best estimator")
            best_train_scores = cross_validate(grid_search.best_estimator_, X_train, y_train,cv=cvSplits,scoring=metrics, 
                                               return_train_score=True, n_jobs=-1)  

            #get all scores
            best_train_accuracy = np.round(best_train_scores['train_accuracy'].mean(),4)
            best_train_f1 = np.round(best_train_scores['train_f1'].mean(),4)
            best_train_logloss = np.round(best_train_scores['train_log_loss'].mean(),4)
            best_train_roc_auc = np.round(best_train_scores['train_roc_auc'].mean(),4)

            valid_time = np.round(best_train_scores['score_time'].mean(),4)
            best_valid_accuracy = np.round(best_train_scores['test_accuracy'].mean(),4)
            best_valid_f1 = np.round(best_train_scores['test_f1'].mean(),4)
            best_valid_logloss = np.round(best_train_scores['test_log_loss'].mean(),4)
            best_valid_roc_auc = np.round(best_train_scores['test_roc_auc'].mean(),4)

            #append all results
            results.append(best_train_scores['train_accuracy'])
            names.append(name)
            
            # Conduct t-test with baseline logit (control) and best estimator (experiment)
            (t_stat, p_value) = stats.ttest_rel(logit_scores['train_roc_auc'], best_train_scores['train_roc_auc'])

            #test and Prediction with whole data
            # Best estimator fitting time
            print("Fit and Prediction with best estimator")
            start = time()
            model = grid_search.best_estimator_.fit(X_train, y_train)
            train_time = round(time() - start, 4)

            # Best estimator prediction time
            start = time()
            y_test_pred = model.predict(X_test)
            test_time = round(time() - start, 4)
            scores.append(roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))
            accuracy.append(accuracy_score(y_test, y_test_pred))

            # Create confusion matrix for the best model
            cnfmatrix = confusion_matrix_def(model,X_train,y_train,X_test,y_test,X_valid, y_valid,cnfmatrix)

            # Create AUC ROC curve
            fprs,tprs = roc_curve_cust(model,X_train,y_train,X_test, y_test,X_valid, y_valid,fprs,tprs,name)

            #Create Precision recall curve
            precisions,recalls = precision_recall_cust(model,X_train,y_train,X_test, y_test,X_valid, y_valid,precisions,recalls,name)

            #Best Model
            final_best_clf[name]=pd.DataFrame([{'label': grid_search.best_estimator_.named_steps['predictor'].__class__.__name__,
                                           'predictor': grid_search.best_estimator_.named_steps['predictor']}])
            
            # Collect the best parameters found by the grid search
            print("Best Parameters:")
            best_parameters = grid_search.best_estimator_.get_params()
            param_dump = []
            for param_name in sorted(params.keys()):
                param_dump.append((param_name, best_parameters[param_name]))
                print("\t"+str(param_name)+": " + str(best_parameters[param_name]))
            print("****** FINISH",name," *****")
            print("")

            # Record the results
            exp_name = name+str('Variance')
            expLog.loc[len(expLog)] = [f"{exp_name}"] + list(np.round(
                    [best_train_accuracy, 
                    #pct(accuracy_score(y_valid, model.predict(X_valid))),
                    best_valid_accuracy,
                    accuracy_score(y_test, y_test_pred),
                    best_train_roc_auc,
                    best_valid_roc_auc,
                    #roc_auc_score(y_valid, model.predict_proba(X_valid)[:, 1]),
                    roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]),
                    best_train_f1,
                    best_valid_f1,
                    f1_score(y_test, y_test_pred),
                    best_train_logloss,
                    best_valid_logloss, 
                    log_loss(y_test, y_test_pred),
                    p_value
                    ],4)) + [train_time,valid_time,test_time] \
                    + [json.dumps(param_dump)]

Logistic Regression

In [440]:
ConductGridSearch(classifiers[0],cnfmatrix,fprs,tprs,precisions,recalls)
****** START Logistic Regression *****
Parameters:
	C: (10, 1, 0.1, 0.01)
	penalty: ('l1', 'l2', 'elasticnet')
	tol: (0.0001, 1e-05)
Fitting 5 folds for each of 24 candidates, totalling 120 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:    4.9s
[Parallel(n_jobs=-1)]: Done 120 out of 120 | elapsed:   10.2s finished
Cross validation with best estimator
Fit and Prediction with best estimator
Best Parameters:
	predictor__C: 0.1
	predictor__penalty: l1
	predictor__tol: 0.0001
****** FINISH Logistic Regression  *****

In [441]:
expLog
Out[441]:
exp_name Train Acc Valid Acc Test Acc Train AUC Valid AUC Test AUC Train F1 Score Valid F1 Score Test F1 Score Train Log Loss Valid Log Loss Test Log Loss P Score Train Time Valid Time Test Time Description
0 Baseline_132_features 0.9289 0.9120 0.9188 0.8547 0.7024 0.7500 0.2301 0.1298 0.1071 2.4564 3.0387 2.8058 0.0000 0.5625 0.018573 0.0077 Imbalanced Logistic reg features 132: Num:132, Cat:0 with 20% training data
1 Logistic Regression 0.9256 0.9230 0.9220 0.8162 0.7577 0.7522 0.0602 0.0396 0.0204 2.5697 2.6612 2.6935 0.0002 1.0242 0.017700 0.0066 [["predictor__C", 0.1], ["predictor__penalty", "l1"], ["predictor__tol", 0.0001]]
2 Gradient Boosting 0.9475 0.9224 0.9196 0.9412 0.7240 0.7212 0.4651 0.0212 0.0000 1.8129 2.6801 2.7777 0.0005 6.4526 0.022900 0.0114 [["predictor__max_depth", 5], ["predictor__max_features", 10], ["predictor__min_samples_leaf", 3], ["predictor__n_estimators", 1000], ["predictor__n_iter_no_change", 10], ["predictor__subsample", 0.8], ["predictor__tol", 0.0001], ["predictor__validation_fraction", 0.2]]
3 XGBoost 0.9310 0.9230 0.9236 0.9599 0.7325 0.7367 0.1543 0.0000 0.0000 2.3835 2.6612 2.6374 0.0000 3.3547 0.057700 0.0305 [["predictor__colsample_bytree", 0.2], ["predictor__eta", 0.01], ["predictor__learning_rate", 0.01], ["predictor__max_depth", 3], ["predictor__n_estimators", 500]]
4 Light GBM 0.9281 0.9224 0.9228 0.8927 0.7277 0.7220 0.0864 0.0000 0.0000 2.4847 2.6801 2.6655 0.0012 6.2475 0.116800 0.0649 [["predictor__boosting_type", "dart"], ["predictor__learning_rate", 0.01], ["predictor__max_bin", 100], ["predictor__max_depth", 5], ["predictor__n_estimators", 1000], ["predictor__num_leaves", 5]]
5 RandomForest 0.9248 0.9232 0.9236 0.9510 0.7384 0.7207 0.0059 0.0000 0.0000 2.5980 2.6517 2.6374 0.0001 7.1055 0.504400 0.2098 [["predictor__bootstrap", true], ["predictor__max_depth", 5], ["predictor__max_features", 15], ["predictor__min_samples_leaf", 3], ["predictor__min_samples_split", 10], ["predictor__n_estimators", 1000]]
6 Support Vector 0.9245 0.9232 0.9236 1.0000 0.6624 0.6721 0.0000 0.0000 0.0000 2.6061 2.6517 2.6374 0.0000 2.3175 0.251700 0.1993 [["predictor__C", 0.01], ["predictor__degree", 4], ["predictor__gamma", 0.01], ["predictor__kernel", "rbf"]]
7 Baseline_132_features 0.9256 0.9219 0.9228 0.8240 0.7427 0.7549 0.0503 0.0237 0.0206 2.5697 2.6989 2.6655 0.0006 0.3085 0.021900 0.0081 [["predictor__C", 0.01], ["predictor__penalty", "l2"], ["predictor__tol", 1e-05]]
8 Logistic Regression 0.9255 0.9230 0.9220 0.8156 0.7558 0.7517 0.0602 0.0396 0.0204 2.5737 2.6612 2.6935 0.0002 2.7305 0.018800 0.0074 [["predictor__C", 0.1], ["predictor__penalty", "l1"], ["predictor__tol", 1e-05]]
9 Logistic RegressionSelectKbest 0.9257 0.9227 0.9228 0.8148 0.7578 0.7486 0.0629 0.0396 0.0206 2.5656 2.6706 2.6655 0.0003 2.6489 0.016300 0.0069 [["predictor__C", 0.1], ["predictor__penalty", "l1"], ["predictor__tol", 1e-05]]
10 RandomForestSelectKbest 0.9252 0.9232 0.9236 0.9498 0.7396 0.7177 0.0172 0.0000 0.0000 2.5818 2.6517 2.6374 0.0001 8.7915 0.504400 0.1934 [["predictor__bootstrap", true], ["predictor__max_depth", 5], ["predictor__max_features", 20], ["predictor__min_samples_leaf", 3], ["predictor__min_samples_split", 10], ["predictor__n_estimators", 1000]]
11 XGBoostSelectKbest 0.9310 0.9230 0.9236 0.9591 0.7279 0.7384 0.1551 0.0000 0.0000 2.3835 2.6612 2.6374 0.0000 3.7312 0.059800 0.0314 [["predictor__colsample_bytree", 0.2], ["predictor__eta", 0.1], ["predictor__learning_rate", 0.01], ["predictor__max_depth", 3], ["predictor__n_estimators", 500]]
12 Light GBMSelectKbest 0.9279 0.9227 0.9220 0.8901 0.7219 0.7208 0.0839 0.0083 0.0000 2.4887 2.6706 2.6935 0.0020 6.1818 0.125500 0.0553 [["predictor__boosting_type", "dart"], ["predictor__learning_rate", 0.01], ["predictor__max_bin", 100], ["predictor__max_depth", 5], ["predictor__n_estimators", 1000], ["predictor__num_leaves", 5]]
13 Logistic RegressionVariance 0.9256 0.9230 0.9220 0.8162 0.7577 0.7522 0.0602 0.0396 0.0204 2.5697 2.6612 2.6935 0.0002 0.4332 0.016000 0.0068 [["predictor__C", 0.1], ["predictor__penalty", "l1"], ["predictor__tol", 0.0001]]

XGBoost

In [442]:
ConductGridSearch(classifiers[1],cnfmatrix,fprs,tprs,precisions,recalls)
****** START XGBoost *****
Parameters:
	colsample_bytree: [0.2, 0.5]
	eta: [0.01, 0.1]
	learning_rate: [0.01, 0.1]
	max_depth: [3, 5]
	n_estimators: [300, 500]
Fitting 5 folds for each of 32 candidates, totalling 160 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:   19.7s
[Parallel(n_jobs=-1)]: Done 160 out of 160 | elapsed:  1.7min finished
Cross validation with best estimator
Fit and Prediction with best estimator
Best Parameters:
	predictor__colsample_bytree: 0.2
	predictor__eta: 0.01
	predictor__learning_rate: 0.01
	predictor__max_depth: 3
	predictor__n_estimators: 500
****** FINISH XGBoost  *****

In [443]:
expLog
Out[443]:
exp_name Train Acc Valid Acc Test Acc Train AUC Valid AUC Test AUC Train F1 Score Valid F1 Score Test F1 Score Train Log Loss Valid Log Loss Test Log Loss P Score Train Time Valid Time Test Time Description
0 Baseline_132_features 0.9289 0.9120 0.9188 0.8547 0.7024 0.7500 0.2301 0.1298 0.1071 2.4564 3.0387 2.8058 0.0000 0.5625 0.018573 0.0077 Imbalanced Logistic reg features 132: Num:132, Cat:0 with 20% training data
1 Logistic Regression 0.9256 0.9230 0.9220 0.8162 0.7577 0.7522 0.0602 0.0396 0.0204 2.5697 2.6612 2.6935 0.0002 1.0242 0.017700 0.0066 [["predictor__C", 0.1], ["predictor__penalty", "l1"], ["predictor__tol", 0.0001]]
2 Gradient Boosting 0.9475 0.9224 0.9196 0.9412 0.7240 0.7212 0.4651 0.0212 0.0000 1.8129 2.6801 2.7777 0.0005 6.4526 0.022900 0.0114 [["predictor__max_depth", 5], ["predictor__max_features", 10], ["predictor__min_samples_leaf", 3], ["predictor__n_estimators", 1000], ["predictor__n_iter_no_change", 10], ["predictor__subsample", 0.8], ["predictor__tol", 0.0001], ["predictor__validation_fraction", 0.2]]
3 XGBoost 0.9310 0.9230 0.9236 0.9599 0.7325 0.7367 0.1543 0.0000 0.0000 2.3835 2.6612 2.6374 0.0000 3.3547 0.057700 0.0305 [["predictor__colsample_bytree", 0.2], ["predictor__eta", 0.01], ["predictor__learning_rate", 0.01], ["predictor__max_depth", 3], ["predictor__n_estimators", 500]]
4 Light GBM 0.9281 0.9224 0.9228 0.8927 0.7277 0.7220 0.0864 0.0000 0.0000 2.4847 2.6801 2.6655 0.0012 6.2475 0.116800 0.0649 [["predictor__boosting_type", "dart"], ["predictor__learning_rate", 0.01], ["predictor__max_bin", 100], ["predictor__max_depth", 5], ["predictor__n_estimators", 1000], ["predictor__num_leaves", 5]]
5 RandomForest 0.9248 0.9232 0.9236 0.9510 0.7384 0.7207 0.0059 0.0000 0.0000 2.5980 2.6517 2.6374 0.0001 7.1055 0.504400 0.2098 [["predictor__bootstrap", true], ["predictor__max_depth", 5], ["predictor__max_features", 15], ["predictor__min_samples_leaf", 3], ["predictor__min_samples_split", 10], ["predictor__n_estimators", 1000]]
6 Support Vector 0.9245 0.9232 0.9236 1.0000 0.6624 0.6721 0.0000 0.0000 0.0000 2.6061 2.6517 2.6374 0.0000 2.3175 0.251700 0.1993 [["predictor__C", 0.01], ["predictor__degree", 4], ["predictor__gamma", 0.01], ["predictor__kernel", "rbf"]]
7 Baseline_132_features 0.9256 0.9219 0.9228 0.8240 0.7427 0.7549 0.0503 0.0237 0.0206 2.5697 2.6989 2.6655 0.0006 0.3085 0.021900 0.0081 [["predictor__C", 0.01], ["predictor__penalty", "l2"], ["predictor__tol", 1e-05]]
8 Logistic Regression 0.9255 0.9230 0.9220 0.8156 0.7558 0.7517 0.0602 0.0396 0.0204 2.5737 2.6612 2.6935 0.0002 2.7305 0.018800 0.0074 [["predictor__C", 0.1], ["predictor__penalty", "l1"], ["predictor__tol", 1e-05]]
9 Logistic RegressionSelectKbest 0.9257 0.9227 0.9228 0.8148 0.7578 0.7486 0.0629 0.0396 0.0206 2.5656 2.6706 2.6655 0.0003 2.6489 0.016300 0.0069 [["predictor__C", 0.1], ["predictor__penalty", "l1"], ["predictor__tol", 1e-05]]
10 RandomForestSelectKbest 0.9252 0.9232 0.9236 0.9498 0.7396 0.7177 0.0172 0.0000 0.0000 2.5818 2.6517 2.6374 0.0001 8.7915 0.504400 0.1934 [["predictor__bootstrap", true], ["predictor__max_depth", 5], ["predictor__max_features", 20], ["predictor__min_samples_leaf", 3], ["predictor__min_samples_split", 10], ["predictor__n_estimators", 1000]]
11 XGBoostSelectKbest 0.9310 0.9230 0.9236 0.9591 0.7279 0.7384 0.1551 0.0000 0.0000 2.3835 2.6612 2.6374 0.0000 3.7312 0.059800 0.0314 [["predictor__colsample_bytree", 0.2], ["predictor__eta", 0.1], ["predictor__learning_rate", 0.01], ["predictor__max_depth", 3], ["predictor__n_estimators", 500]]
12 Light GBMSelectKbest 0.9279 0.9227 0.9220 0.8901 0.7219 0.7208 0.0839 0.0083 0.0000 2.4887 2.6706 2.6935 0.0020 6.1818 0.125500 0.0553 [["predictor__boosting_type", "dart"], ["predictor__learning_rate", 0.01], ["predictor__max_bin", 100], ["predictor__max_depth", 5], ["predictor__n_estimators", 1000], ["predictor__num_leaves", 5]]
13 Logistic RegressionVariance 0.9256 0.9230 0.9220 0.8162 0.7577 0.7522 0.0602 0.0396 0.0204 2.5697 2.6612 2.6935 0.0002 0.4332 0.016000 0.0068 [["predictor__C", 0.1], ["predictor__penalty", "l1"], ["predictor__tol", 0.0001]]
14 XGBoostVariance 0.9312 0.9230 0.9236 0.9604 0.7318 0.7388 0.1607 0.0000 0.0000 2.3754 2.6612 2.6374 0.0000 1.4006 0.058700 0.0269 [["predictor__colsample_bytree", 0.2], ["predictor__eta", 0.01], ["predictor__learning_rate", 0.01], ["predictor__max_depth", 3], ["predictor__n_estimators", 500]]

Light GBM

In [444]:
ConductGridSearch(classifiers[2],cnfmatrix,fprs,tprs,precisions,recalls)
****** START Light GBM *****
Parameters:
	boosting_type: ['goss', 'dart']
	learning_rate: [0.01, 0.1]
	max_bin: [100, 200]
	max_depth: [2, 5]
	n_estimators: [1000, 5000]
	num_leaves: [5, 10]
Fitting 5 folds for each of 64 candidates, totalling 320 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:   51.6s
[Parallel(n_jobs=-1)]: Done 192 tasks      | elapsed:  5.3min
[Parallel(n_jobs=-1)]: Done 320 out of 320 | elapsed: 15.1min finished
Cross validation with best estimator
Fit and Prediction with best estimator
Best Parameters:
	predictor__boosting_type: dart
	predictor__learning_rate: 0.01
	predictor__max_bin: 100
	predictor__max_depth: 5
	predictor__n_estimators: 1000
	predictor__num_leaves: 5
****** FINISH Light GBM  *****

In [445]:
expLog
Out[445]:
exp_name Train Acc Valid Acc Test Acc Train AUC Valid AUC Test AUC Train F1 Score Valid F1 Score Test F1 Score Train Log Loss Valid Log Loss Test Log Loss P Score Train Time Valid Time Test Time Description
0 Baseline_132_features 0.9289 0.9120 0.9188 0.8547 0.7024 0.7500 0.2301 0.1298 0.1071 2.4564 3.0387 2.8058 0.0000 0.5625 0.018573 0.0077 Imbalanced Logistic reg features 132: Num:132, Cat:0 with 20% training data
1 Logistic Regression 0.9256 0.9230 0.9220 0.8162 0.7577 0.7522 0.0602 0.0396 0.0204 2.5697 2.6612 2.6935 0.0002 1.0242 0.017700 0.0066 [["predictor__C", 0.1], ["predictor__penalty", "l1"], ["predictor__tol", 0.0001]]
2 Gradient Boosting 0.9475 0.9224 0.9196 0.9412 0.7240 0.7212 0.4651 0.0212 0.0000 1.8129 2.6801 2.7777 0.0005 6.4526 0.022900 0.0114 [["predictor__max_depth", 5], ["predictor__max_features", 10], ["predictor__min_samples_leaf", 3], ["predictor__n_estimators", 1000], ["predictor__n_iter_no_change", 10], ["predictor__subsample", 0.8], ["predictor__tol", 0.0001], ["predictor__validation_fraction", 0.2]]
3 XGBoost 0.9310 0.9230 0.9236 0.9599 0.7325 0.7367 0.1543 0.0000 0.0000 2.3835 2.6612 2.6374 0.0000 3.3547 0.057700 0.0305 [["predictor__colsample_bytree", 0.2], ["predictor__eta", 0.01], ["predictor__learning_rate", 0.01], ["predictor__max_depth", 3], ["predictor__n_estimators", 500]]
4 Light GBM 0.9281 0.9224 0.9228 0.8927 0.7277 0.7220 0.0864 0.0000 0.0000 2.4847 2.6801 2.6655 0.0012 6.2475 0.116800 0.0649 [["predictor__boosting_type", "dart"], ["predictor__learning_rate", 0.01], ["predictor__max_bin", 100], ["predictor__max_depth", 5], ["predictor__n_estimators", 1000], ["predictor__num_leaves", 5]]
5 RandomForest 0.9248 0.9232 0.9236 0.9510 0.7384 0.7207 0.0059 0.0000 0.0000 2.5980 2.6517 2.6374 0.0001 7.1055 0.504400 0.2098 [["predictor__bootstrap", true], ["predictor__max_depth", 5], ["predictor__max_features", 15], ["predictor__min_samples_leaf", 3], ["predictor__min_samples_split", 10], ["predictor__n_estimators", 1000]]
6 Support Vector 0.9245 0.9232 0.9236 1.0000 0.6624 0.6721 0.0000 0.0000 0.0000 2.6061 2.6517 2.6374 0.0000 2.3175 0.251700 0.1993 [["predictor__C", 0.01], ["predictor__degree", 4], ["predictor__gamma", 0.01], ["predictor__kernel", "rbf"]]
7 Baseline_132_features 0.9256 0.9219 0.9228 0.8240 0.7427 0.7549 0.0503 0.0237 0.0206 2.5697 2.6989 2.6655 0.0006 0.3085 0.021900 0.0081 [["predictor__C", 0.01], ["predictor__penalty", "l2"], ["predictor__tol", 1e-05]]
8 Logistic Regression 0.9255 0.9230 0.9220 0.8156 0.7558 0.7517 0.0602 0.0396 0.0204 2.5737 2.6612 2.6935 0.0002 2.7305 0.018800 0.0074 [["predictor__C", 0.1], ["predictor__penalty", "l1"], ["predictor__tol", 1e-05]]
9 Logistic RegressionSelectKbest 0.9257 0.9227 0.9228 0.8148 0.7578 0.7486 0.0629 0.0396 0.0206 2.5656 2.6706 2.6655 0.0003 2.6489 0.016300 0.0069 [["predictor__C", 0.1], ["predictor__penalty", "l1"], ["predictor__tol", 1e-05]]
10 RandomForestSelectKbest 0.9252 0.9232 0.9236 0.9498 0.7396 0.7177 0.0172 0.0000 0.0000 2.5818 2.6517 2.6374 0.0001 8.7915 0.504400 0.1934 [["predictor__bootstrap", true], ["predictor__max_depth", 5], ["predictor__max_features", 20], ["predictor__min_samples_leaf", 3], ["predictor__min_samples_split", 10], ["predictor__n_estimators", 1000]]
11 XGBoostSelectKbest 0.9310 0.9230 0.9236 0.9591 0.7279 0.7384 0.1551 0.0000 0.0000 2.3835 2.6612 2.6374 0.0000 3.7312 0.059800 0.0314 [["predictor__colsample_bytree", 0.2], ["predictor__eta", 0.1], ["predictor__learning_rate", 0.01], ["predictor__max_depth", 3], ["predictor__n_estimators", 500]]
12 Light GBMSelectKbest 0.9279 0.9227 0.9220 0.8901 0.7219 0.7208 0.0839 0.0083 0.0000 2.4887 2.6706 2.6935 0.0020 6.1818 0.125500 0.0553 [["predictor__boosting_type", "dart"], ["predictor__learning_rate", 0.01], ["predictor__max_bin", 100], ["predictor__max_depth", 5], ["predictor__n_estimators", 1000], ["predictor__num_leaves", 5]]
13 Logistic RegressionVariance 0.9256 0.9230 0.9220 0.8162 0.7577 0.7522 0.0602 0.0396 0.0204 2.5697 2.6612 2.6935 0.0002 0.4332 0.016000 0.0068 [["predictor__C", 0.1], ["predictor__penalty", "l1"], ["predictor__tol", 0.0001]]
14 XGBoostVariance 0.9312 0.9230 0.9236 0.9604 0.7318 0.7388 0.1607 0.0000 0.0000 2.3754 2.6612 2.6374 0.0000 1.4006 0.058700 0.0269 [["predictor__colsample_bytree", 0.2], ["predictor__eta", 0.01], ["predictor__learning_rate", 0.01], ["predictor__max_depth", 3], ["predictor__n_estimators", 500]]
15 Light GBMVariance 0.9281 0.9224 0.9228 0.8927 0.7277 0.7220 0.0864 0.0000 0.0000 2.4847 2.6801 2.6655 0.0012 3.9899 0.112900 0.0687 [["predictor__boosting_type", "dart"], ["predictor__learning_rate", 0.01], ["predictor__max_bin", 100], ["predictor__max_depth", 5], ["predictor__n_estimators", 1000], ["predictor__num_leaves", 5]]

RandomForest

In [446]:
ConductGridSearch(classifiers[3],cnfmatrix,fprs,tprs,precisions,recalls)
****** START RandomForest *****
Parameters:
	bootstrap: [True]
	max_depth: [5, 10]
	max_features: [15, 20]
	min_samples_leaf: [3, 5]
	min_samples_split: [5, 10]
	n_estimators: [1000]
Fitting 5 folds for each of 16 candidates, totalling 80 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:  1.5min
[Parallel(n_jobs=-1)]: Done  80 out of  80 | elapsed:  3.2min finished
Cross validation with best estimator
Fit and Prediction with best estimator
Best Parameters:
	predictor__bootstrap: True
	predictor__max_depth: 5
	predictor__max_features: 20
	predictor__min_samples_leaf: 5
	predictor__min_samples_split: 5
	predictor__n_estimators: 1000
****** FINISH RandomForest  *****

In [447]:
expLog
Out[447]:
exp_name Train Acc Valid Acc Test Acc Train AUC Valid AUC Test AUC Train F1 Score Valid F1 Score Test F1 Score Train Log Loss Valid Log Loss Test Log Loss P Score Train Time Valid Time Test Time Description
0 Baseline_132_features 0.9289 0.9120 0.9188 0.8547 0.7024 0.7500 0.2301 0.1298 0.1071 2.4564 3.0387 2.8058 0.0000 0.5625 0.018573 0.0077 Imbalanced Logistic reg features 132: Num:132, Cat:0 with 20% training data
1 Logistic Regression 0.9256 0.9230 0.9220 0.8162 0.7577 0.7522 0.0602 0.0396 0.0204 2.5697 2.6612 2.6935 0.0002 1.0242 0.017700 0.0066 [["predictor__C", 0.1], ["predictor__penalty", "l1"], ["predictor__tol", 0.0001]]
2 Gradient Boosting 0.9475 0.9224 0.9196 0.9412 0.7240 0.7212 0.4651 0.0212 0.0000 1.8129 2.6801 2.7777 0.0005 6.4526 0.022900 0.0114 [["predictor__max_depth", 5], ["predictor__max_features", 10], ["predictor__min_samples_leaf", 3], ["predictor__n_estimators", 1000], ["predictor__n_iter_no_change", 10], ["predictor__subsample", 0.8], ["predictor__tol", 0.0001], ["predictor__validation_fraction", 0.2]]
3 XGBoost 0.9310 0.9230 0.9236 0.9599 0.7325 0.7367 0.1543 0.0000 0.0000 2.3835 2.6612 2.6374 0.0000 3.3547 0.057700 0.0305 [["predictor__colsample_bytree", 0.2], ["predictor__eta", 0.01], ["predictor__learning_rate", 0.01], ["predictor__max_depth", 3], ["predictor__n_estimators", 500]]
4 Light GBM 0.9281 0.9224 0.9228 0.8927 0.7277 0.7220 0.0864 0.0000 0.0000 2.4847 2.6801 2.6655 0.0012 6.2475 0.116800 0.0649 [["predictor__boosting_type", "dart"], ["predictor__learning_rate", 0.01], ["predictor__max_bin", 100], ["predictor__max_depth", 5], ["predictor__n_estimators", 1000], ["predictor__num_leaves", 5]]
5 RandomForest 0.9248 0.9232 0.9236 0.9510 0.7384 0.7207 0.0059 0.0000 0.0000 2.5980 2.6517 2.6374 0.0001 7.1055 0.504400 0.2098 [["predictor__bootstrap", true], ["predictor__max_depth", 5], ["predictor__max_features", 15], ["predictor__min_samples_leaf", 3], ["predictor__min_samples_split", 10], ["predictor__n_estimators", 1000]]
6 Support Vector 0.9245 0.9232 0.9236 1.0000 0.6624 0.6721 0.0000 0.0000 0.0000 2.6061 2.6517 2.6374 0.0000 2.3175 0.251700 0.1993 [["predictor__C", 0.01], ["predictor__degree", 4], ["predictor__gamma", 0.01], ["predictor__kernel", "rbf"]]
7 Baseline_132_features 0.9256 0.9219 0.9228 0.8240 0.7427 0.7549 0.0503 0.0237 0.0206 2.5697 2.6989 2.6655 0.0006 0.3085 0.021900 0.0081 [["predictor__C", 0.01], ["predictor__penalty", "l2"], ["predictor__tol", 1e-05]]
8 Logistic Regression 0.9255 0.9230 0.9220 0.8156 0.7558 0.7517 0.0602 0.0396 0.0204 2.5737 2.6612 2.6935 0.0002 2.7305 0.018800 0.0074 [["predictor__C", 0.1], ["predictor__penalty", "l1"], ["predictor__tol", 1e-05]]
9 Logistic RegressionSelectKbest 0.9257 0.9227 0.9228 0.8148 0.7578 0.7486 0.0629 0.0396 0.0206 2.5656 2.6706 2.6655 0.0003 2.6489 0.016300 0.0069 [["predictor__C", 0.1], ["predictor__penalty", "l1"], ["predictor__tol", 1e-05]]
10 RandomForestSelectKbest 0.9252 0.9232 0.9236 0.9498 0.7396 0.7177 0.0172 0.0000 0.0000 2.5818 2.6517 2.6374 0.0001 8.7915 0.504400 0.1934 [["predictor__bootstrap", true], ["predictor__max_depth", 5], ["predictor__max_features", 20], ["predictor__min_samples_leaf", 3], ["predictor__min_samples_split", 10], ["predictor__n_estimators", 1000]]
11 XGBoostSelectKbest 0.9310 0.9230 0.9236 0.9591 0.7279 0.7384 0.1551 0.0000 0.0000 2.3835 2.6612 2.6374 0.0000 3.7312 0.059800 0.0314 [["predictor__colsample_bytree", 0.2], ["predictor__eta", 0.1], ["predictor__learning_rate", 0.01], ["predictor__max_depth", 3], ["predictor__n_estimators", 500]]
12 Light GBMSelectKbest 0.9279 0.9227 0.9220 0.8901 0.7219 0.7208 0.0839 0.0083 0.0000 2.4887 2.6706 2.6935 0.0020 6.1818 0.125500 0.0553 [["predictor__boosting_type", "dart"], ["predictor__learning_rate", 0.01], ["predictor__max_bin", 100], ["predictor__max_depth", 5], ["predictor__n_estimators", 1000], ["predictor__num_leaves", 5]]
13 Logistic RegressionVariance 0.9256 0.9230 0.9220 0.8162 0.7577 0.7522 0.0602 0.0396 0.0204 2.5697 2.6612 2.6935 0.0002 0.4332 0.016000 0.0068 [["predictor__C", 0.1], ["predictor__penalty", "l1"], ["predictor__tol", 0.0001]]
14 XGBoostVariance 0.9312 0.9230 0.9236 0.9604 0.7318 0.7388 0.1607 0.0000 0.0000 2.3754 2.6612 2.6374 0.0000 1.4006 0.058700 0.0269 [["predictor__colsample_bytree", 0.2], ["predictor__eta", 0.01], ["predictor__learning_rate", 0.01], ["predictor__max_depth", 3], ["predictor__n_estimators", 500]]
15 Light GBMVariance 0.9281 0.9224 0.9228 0.8927 0.7277 0.7220 0.0864 0.0000 0.0000 2.4847 2.6801 2.6655 0.0012 3.9899 0.112900 0.0687 [["predictor__boosting_type", "dart"], ["predictor__learning_rate", 0.01], ["predictor__max_bin", 100], ["predictor__max_depth", 5], ["predictor__n_estimators", 1000], ["predictor__num_leaves", 5]]
16 RandomForestVariance 0.9251 0.9232 0.9236 0.9511 0.7379 0.7243 0.0148 0.0000 0.0000 2.5859 2.6517 2.6374 0.0001 6.7289 0.510600 0.2204 [["predictor__bootstrap", true], ["predictor__max_depth", 5], ["predictor__max_features", 20], ["predictor__min_samples_leaf", 5], ["predictor__min_samples_split", 5], ["predictor__n_estimators", 1000]]

Model Validation

Boxplot with all CV results

In [448]:
# boxplot algorithm comparison
fig = pyplot.figure()
fig.suptitle('Classification Algorithm Comparison')
ax = fig.add_subplot(111)
pyplot.boxplot(results)
ax.set_xticklabels(names,rotation=90)
pyplot.grid()
pyplot.show()

AUC (Area Under the ROC Curve)

In [449]:
# roc curve fpr, tpr  for all classifiers 
plt.plot([0,1],[0,1], 'k--')
for i in range(len(names)):
    plt.plot(fprs[i],tprs[i],label = names[i] + '  ' + str(scores[i]))
plt.legend(bbox_to_anchor=(1.04,1), loc="upper left", borderaxespad=0)
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title('Receiver Operating Characteristic')
plt.show()

Precision Recall Curve

In [450]:
# precision recall curve  for all classifiers 
for i in range(len(names)):
    plt.plot(recalls[i],precisions[i],label = names[i])
plt.legend(bbox_to_anchor=(1.04,1), loc="upper left", borderaxespad=0)
plt.xlabel("Recall")
plt.ylabel("Precision")
plt.title('Precision-Recall Curve')
plt.show()

Confusion Matrix

In [451]:
# plot confusion matrix for all classifiers 
f, axes = plt.subplots(1, len(names), figsize=(30, 8), sharey='row')
for i in range(len(names)):
    disp = ConfusionMatrixDisplay(cnfmatrix[i], display_labels=['0', '1'])
    disp.plot(ax=axes[i], xticks_rotation=0)
    disp.ax_.set_title("Confusion Matrix - " + names[i])
    disp.im_.colorbar.remove()
    disp.ax_.set_xlabel('')
    if i!=0:
        disp.ax_.set_ylabel('')

f.text(0.4, 0.1, 'Predicted label', ha='left')
plt.subplots_adjust(wspace=0.10, hspace=0.1)

f.colorbar(disp.im_, ax=axes)
plt.show()

Final results

In [452]:
pd.set_option('display.max_colwidth', None)
expLog.to_csv("/content/drive/My Drive/AML Project/Data/Phase3/expLog_Variance.csv",index=False)
expLog
Out[452]:
exp_name Train Acc Valid Acc Test Acc Train AUC Valid AUC Test AUC Train F1 Score Valid F1 Score Test F1 Score Train Log Loss Valid Log Loss Test Log Loss P Score Train Time Valid Time Test Time Description
0 Baseline_132_features 0.9289 0.9120 0.9188 0.8547 0.7024 0.7500 0.2301 0.1298 0.1071 2.4564 3.0387 2.8058 0.0000 0.5625 0.018573 0.0077 Imbalanced Logistic reg features 132: Num:132, Cat:0 with 20% training data
1 Logistic Regression 0.9256 0.9230 0.9220 0.8162 0.7577 0.7522 0.0602 0.0396 0.0204 2.5697 2.6612 2.6935 0.0002 1.0242 0.017700 0.0066 [["predictor__C", 0.1], ["predictor__penalty", "l1"], ["predictor__tol", 0.0001]]
2 Gradient Boosting 0.9475 0.9224 0.9196 0.9412 0.7240 0.7212 0.4651 0.0212 0.0000 1.8129 2.6801 2.7777 0.0005 6.4526 0.022900 0.0114 [["predictor__max_depth", 5], ["predictor__max_features", 10], ["predictor__min_samples_leaf", 3], ["predictor__n_estimators", 1000], ["predictor__n_iter_no_change", 10], ["predictor__subsample", 0.8], ["predictor__tol", 0.0001], ["predictor__validation_fraction", 0.2]]
3 XGBoost 0.9310 0.9230 0.9236 0.9599 0.7325 0.7367 0.1543 0.0000 0.0000 2.3835 2.6612 2.6374 0.0000 3.3547 0.057700 0.0305 [["predictor__colsample_bytree", 0.2], ["predictor__eta", 0.01], ["predictor__learning_rate", 0.01], ["predictor__max_depth", 3], ["predictor__n_estimators", 500]]
4 Light GBM 0.9281 0.9224 0.9228 0.8927 0.7277 0.7220 0.0864 0.0000 0.0000 2.4847 2.6801 2.6655 0.0012 6.2475 0.116800 0.0649 [["predictor__boosting_type", "dart"], ["predictor__learning_rate", 0.01], ["predictor__max_bin", 100], ["predictor__max_depth", 5], ["predictor__n_estimators", 1000], ["predictor__num_leaves", 5]]
5 RandomForest 0.9248 0.9232 0.9236 0.9510 0.7384 0.7207 0.0059 0.0000 0.0000 2.5980 2.6517 2.6374 0.0001 7.1055 0.504400 0.2098 [["predictor__bootstrap", true], ["predictor__max_depth", 5], ["predictor__max_features", 15], ["predictor__min_samples_leaf", 3], ["predictor__min_samples_split", 10], ["predictor__n_estimators", 1000]]
6 Support Vector 0.9245 0.9232 0.9236 1.0000 0.6624 0.6721 0.0000 0.0000 0.0000 2.6061 2.6517 2.6374 0.0000 2.3175 0.251700 0.1993 [["predictor__C", 0.01], ["predictor__degree", 4], ["predictor__gamma", 0.01], ["predictor__kernel", "rbf"]]
7 Baseline_132_features 0.9256 0.9219 0.9228 0.8240 0.7427 0.7549 0.0503 0.0237 0.0206 2.5697 2.6989 2.6655 0.0006 0.3085 0.021900 0.0081 [["predictor__C", 0.01], ["predictor__penalty", "l2"], ["predictor__tol", 1e-05]]
8 Logistic Regression 0.9255 0.9230 0.9220 0.8156 0.7558 0.7517 0.0602 0.0396 0.0204 2.5737 2.6612 2.6935 0.0002 2.7305 0.018800 0.0074 [["predictor__C", 0.1], ["predictor__penalty", "l1"], ["predictor__tol", 1e-05]]
9 Logistic RegressionSelectKbest 0.9257 0.9227 0.9228 0.8148 0.7578 0.7486 0.0629 0.0396 0.0206 2.5656 2.6706 2.6655 0.0003 2.6489 0.016300 0.0069 [["predictor__C", 0.1], ["predictor__penalty", "l1"], ["predictor__tol", 1e-05]]
10 RandomForestSelectKbest 0.9252 0.9232 0.9236 0.9498 0.7396 0.7177 0.0172 0.0000 0.0000 2.5818 2.6517 2.6374 0.0001 8.7915 0.504400 0.1934 [["predictor__bootstrap", true], ["predictor__max_depth", 5], ["predictor__max_features", 20], ["predictor__min_samples_leaf", 3], ["predictor__min_samples_split", 10], ["predictor__n_estimators", 1000]]
11 XGBoostSelectKbest 0.9310 0.9230 0.9236 0.9591 0.7279 0.7384 0.1551 0.0000 0.0000 2.3835 2.6612 2.6374 0.0000 3.7312 0.059800 0.0314 [["predictor__colsample_bytree", 0.2], ["predictor__eta", 0.1], ["predictor__learning_rate", 0.01], ["predictor__max_depth", 3], ["predictor__n_estimators", 500]]
12 Light GBMSelectKbest 0.9279 0.9227 0.9220 0.8901 0.7219 0.7208 0.0839 0.0083 0.0000 2.4887 2.6706 2.6935 0.0020 6.1818 0.125500 0.0553 [["predictor__boosting_type", "dart"], ["predictor__learning_rate", 0.01], ["predictor__max_bin", 100], ["predictor__max_depth", 5], ["predictor__n_estimators", 1000], ["predictor__num_leaves", 5]]
13 Logistic RegressionVariance 0.9256 0.9230 0.9220 0.8162 0.7577 0.7522 0.0602 0.0396 0.0204 2.5697 2.6612 2.6935 0.0002 0.4332 0.016000 0.0068 [["predictor__C", 0.1], ["predictor__penalty", "l1"], ["predictor__tol", 0.0001]]
14 XGBoostVariance 0.9312 0.9230 0.9236 0.9604 0.7318 0.7388 0.1607 0.0000 0.0000 2.3754 2.6612 2.6374 0.0000 1.4006 0.058700 0.0269 [["predictor__colsample_bytree", 0.2], ["predictor__eta", 0.01], ["predictor__learning_rate", 0.01], ["predictor__max_depth", 3], ["predictor__n_estimators", 500]]
15 Light GBMVariance 0.9281 0.9224 0.9228 0.8927 0.7277 0.7220 0.0864 0.0000 0.0000 2.4847 2.6801 2.6655 0.0012 3.9899 0.112900 0.0687 [["predictor__boosting_type", "dart"], ["predictor__learning_rate", 0.01], ["predictor__max_bin", 100], ["predictor__max_depth", 5], ["predictor__n_estimators", 1000], ["predictor__num_leaves", 5]]
16 RandomForestVariance 0.9251 0.9232 0.9236 0.9511 0.7379 0.7243 0.0148 0.0000 0.0000 2.5859 2.6517 2.6374 0.0001 6.7289 0.510600 0.2204 [["predictor__bootstrap", true], ["predictor__max_depth", 5], ["predictor__max_features", 20], ["predictor__min_samples_leaf", 5], ["predictor__min_samples_split", 5], ["predictor__n_estimators", 1000]]

Kaggle submission

Voting Classifier to predict best results based on best Classifier Probability

In [459]:
def voting_classifier_submission(model_selection,final_best_clf,fs_type,fs_params):
  %%time 
  np.random.seed(42)
  print("Classifier with parameters")
  final_estimators = []
  for i,clf in enumerate(model_selection):
      model = final_best_clf[clf]['predictor'][0]
      print(i+1, " :",model)
      final_estimators.append((clf,make_pipeline(data_prep_pipeline,
                         (VarianceThreshold(threshold=0.9)),
                          model)))
  voting_classifier = Pipeline([("clf", VotingClassifier(estimators=final_estimators, voting='soft'))])
  final_X_train = finaldf[0][selected_features]
  final_y_train = finaldf[0]['TARGET']
  final_X_kaggle_test = kaggle_test
  print(final_X_train.shape,final_y_train.shape,final_X_kaggle_test.shape)
  voting_classifier.fit(final_X_train, final_y_train)
  start = time()
  train_time = round(time() - start, 4)
  print("Voting Score:{0}".format(voting_classifier.score(final_X_train, final_y_train)))
  test_class_scores = voting_classifier.predict_proba(final_X_kaggle_test)[:, 1]
  print(test_class_scores[0:10])
  
  #For each SK_ID_CURR in the test set, you must predict a probability for the TARGET variable. 
  #
  submit_df = datasets["application_test"][['SK_ID_CURR']]
  submit_df['TARGET'] = test_class_scores
  print(submit_df.head(2))
  submit_df.to_csv(f'/content/drive/My Drive/AML Project/Data/Phase3/submission_{fs_type}.csv',index=False)

Submission File Prep

In [460]:
final_best_clf
model_selection = ['Logistic Regression','XGBoost','Light GBM','RandomForest']
fs_type='Variance'
voting_classifier_submission(model_selection,final_best_clf,fs_type,fs_params)
CPU times: user 4 µs, sys: 0 ns, total: 4 µs
Wall time: 10.3 µs
Classifier with parameters
1  : LogisticRegression(C=0.1, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l1',
                   random_state=42, solver='saga', tol=0.0001, verbose=0,
                   warm_start=False)
2  : XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=0.2, eta=0.01, gamma=0,
              learning_rate=0.01, max_delta_step=0, max_depth=3,
              min_child_weight=1, missing=None, n_estimators=500, n_jobs=1,
              nthread=None, objective='binary:logistic', random_state=42,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
              silent=None, subsample=1, verbosity=1)
3  : LGBMClassifier(boosting_type='dart', class_weight=None, colsample_bytree=1.0,
               importance_type='split', learning_rate=0.01, max_bin=100,
               max_depth=5, min_child_samples=20, min_child_weight=0.001,
               min_split_gain=0.0, n_estimators=1000, n_jobs=-1, num_leaves=5,
               objective=None, random_state=42, reg_alpha=0.0, reg_lambda=0.0,
               silent=True, subsample=1.0, subsample_for_bin=200000,
               subsample_freq=0)
4  : RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=5, max_features=20,
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=5, min_samples_split=5,
                       min_weight_fraction_leaf=0.0, n_estimators=1000,
                       n_jobs=None, oob_score=False, random_state=42, verbose=0,
                       warm_start=False)
(4101, 132) (4101,) (48744, 132)
Voting Score:0.9244086808095586
[0.09795598 0.16244282 0.03414981 0.05942214 0.124789   0.06864761
 0.03231656 0.05208861 0.05032919 0.12928741]
   SK_ID_CURR    TARGET
0      100001  0.097956
1      100005  0.162443

XGBoost (SMOTE with Early Stopping)

In this section we are using the classifier XGBoost with SMOTE for imbalanced dataset with early stopping.

SMOTE: Synthetic Minority Oversampling Technique SMOTE is an oversampling technique where the synthetic samples are generated for the minority class. In a classic oversampling technique, the minority data is duplicated from the minority data population. While it increases the number of data, it does not give any new information or variation to the machine learning model. SMOTE works by taking advantage of thek-nearest neighbor algorithm to create synthetic data. This algorithm overcomes the overfitting issue raised by random oversampling. This also emphasizes on the feature space to generate new instances with the help of interpolation between the positive instances that lie together.

Early stopping is an approach to training complex machine learning models to avoid overfitting. It works by monitoring the performance of the model that is being trained on a separate test dataset and stopping the training procedure once the performance on the test dataset has not improved after a fixed number of training iterations. It avoids overfitting by attempting to automatically select the inflection point where performance on the test dataset starts to decrease while performance on the training dataset continues to improve as the model starts to overfit.The performance measure may be the loss function that is being optimized to train the model (such as logarithmic loss)

Classifiers

In [571]:
classifiers = [
        [('XGBoost SMOTE', XGBClassifier(random_state=42),"SMOTE")],]

Hyper-parameters for different models

In [572]:
params_grid = {
        'XGBoost SMOTE':  {
        'max_depth': [5], # Lower helps with overfitting
        'n_estimators':[1000],
        'learning_rate': [0.01],
        'objective': ['binary:logistic'],
        'eval_metric': ['auc'],
        'min_child_weight' : [15],
        'eta' : [0.01,],
        'subsample' : [0.5],
        'early_stopping_rounds':[5,10]
    },
    }
In [573]:
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
results=[]

def ConductGridSearch(in_classifiers,cnfmatrix,fprs,tprs,precisions,recalls):
    for (name, classifier,feature_sel) in in_classifiers:
            # Print classifier and parameters
            print('****** START', name,'*****')
            parameters = params_grid[name]
            print("Parameters:")
            for p in sorted(parameters.keys()):
                print("\t"+str(p)+": "+ str(parameters[p]))

            # generate the pipeline based on the feature selection method
            full_pipeline_with_predictor = Pipeline([
                ("preparation", data_prep_pipeline),
                ('SMOTE', SMOTE(random_state=42, sampling_strategy=0.25, k_neighbors=3)),
                ("predictor", classifier)
                ])

            # Execute the grid search
            params = {}
            for p in parameters.keys():
                pipe_key = 'predictor__'+str(p)
                params[pipe_key] = parameters[p] 
            grid_search = GridSearchCV(full_pipeline_with_predictor, params, cv=cvSplits, scoring='roc_auc',
                                       n_jobs=-1,verbose=1)
            grid_search.fit(X_train, y_train)

            # Best estimator score
            best_train = pct(grid_search.best_score_)

            # Best train scores
            print("Cross validation with best estimator")
            best_train_scores = cross_validate(grid_search.best_estimator_, X_train, y_train,cv=cvSplits,scoring=metrics, 
                                               return_train_score=True, n_jobs=-1)  

            #get all scores
            best_train_accuracy = np.round(best_train_scores['train_accuracy'].mean(),4)
            best_train_f1 = np.round(best_train_scores['train_f1'].mean(),4)
            best_train_logloss = np.round(best_train_scores['train_log_loss'].mean(),4)
            best_train_roc_auc = np.round(best_train_scores['train_roc_auc'].mean(),4)

            valid_time = np.round(best_train_scores['score_time'].mean(),4)
            best_valid_accuracy = np.round(best_train_scores['test_accuracy'].mean(),4)
            best_valid_f1 = np.round(best_train_scores['test_f1'].mean(),4)
            best_valid_logloss = np.round(best_train_scores['test_log_loss'].mean(),4)
            best_valid_roc_auc = np.round(best_train_scores['test_roc_auc'].mean(),4)

            #append all results
            results.append(best_train_scores['train_accuracy'])
            names.append(name)
            
            # Conduct t-test with baseline logit (control) and best estimator (experiment)
            (t_stat, p_value) = stats.ttest_rel(logit_scores['train_roc_auc'], best_train_scores['train_roc_auc'])

            #test and Prediction with whole data
            # Best estimator fitting time
            print("Fit and Prediction with best estimator")
            start = time()
            model = grid_search.best_estimator_.fit(X_train, y_train)
            train_time = round(time() - start, 4)

            # Best estimator prediction time
            start = time()
            y_test_pred = model.predict(X_test)
            test_time = round(time() - start, 4)
            scores.append(roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))
            accuracy.append(accuracy_score(y_test, y_test_pred))

            # Create confusion matrix for the best model
            cnfmatrix = confusion_matrix_def(model,X_train,y_train,X_test,y_test,X_valid, y_valid,cnfmatrix)

            # Create AUC ROC curve
            fprs,tprs = roc_curve_cust(model,X_train,y_train,X_test, y_test,X_valid, y_valid,fprs,tprs,name)

            #Create Precision recall curve
            precisions,recalls = precision_recall_cust(model,X_train,y_train,X_test, y_test,X_valid, y_valid,precisions,recalls,name)

            #Best Model
            final_best_clf[name]=pd.DataFrame([{'label': grid_search.best_estimator_.named_steps['predictor'].__class__.__name__,
                                           'predictor': grid_search.best_estimator_.named_steps['predictor']}])
            # Collect the best parameters found by the grid search
            print("Best Parameters:")
            best_parameters = grid_search.best_estimator_.get_params()
            param_dump = []
            for param_name in sorted(params.keys()):
                param_dump.append((param_name, best_parameters[param_name]))
                print("\t"+str(param_name)+": " + str(best_parameters[param_name]))
            print("****** FINISH",name," *****")
            print("")

            # Record the results
            exp_name = name+str('SMOTE')
            expLog.loc[len(expLog)] = [f"{exp_name}"] + list(np.round(
                    [best_train_accuracy, 
                    #pct(accuracy_score(y_valid, model.predict(X_valid))),
                    best_valid_accuracy,
                    accuracy_score(y_test, y_test_pred),
                    best_train_roc_auc,
                    best_valid_roc_auc,
                    #roc_auc_score(y_valid, model.predict_proba(X_valid)[:, 1]),
                    roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]),
                    best_train_f1,
                    best_valid_f1,
                    f1_score(y_test, y_test_pred),
                    best_train_logloss,
                    best_valid_logloss, 
                    log_loss(y_test, y_test_pred),
                    p_value
                    ],4)) + [train_time,valid_time,test_time] \
                    + [json.dumps(param_dump)]

XGBoost

In [576]:
ConductGridSearch(classifiers[0],cnfmatrix,fprs,tprs,precisions,recalls)
****** START XGBoost SMOTE *****
Parameters:
	early_stopping_rounds: [5, 10]
	eta: [0.01]
	eval_metric: ['auc']
	learning_rate: [0.01]
	max_depth: [5]
	min_child_weight: [15]
	n_estimators: [1000]
	objective: ['binary:logistic']
	subsample: [0.5]
Fitting 5 folds for each of 2 candidates, totalling 10 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  10 out of  10 | elapsed:   41.6s finished
Cross validation with best estimator
Fit and Prediction with best estimator
Best Parameters:
	predictor__early_stopping_rounds: 5
	predictor__eta: 0.01
	predictor__eval_metric: auc
	predictor__learning_rate: 0.01
	predictor__max_depth: 5
	predictor__min_child_weight: 15
	predictor__n_estimators: 1000
	predictor__objective: binary:logistic
	predictor__subsample: 0.5
****** FINISH XGBoost SMOTE  *****

Final results

In [577]:
pd.set_option('display.max_colwidth', None)
expLog.to_csv("/content/drive/My Drive/AML Project/Data/Phase3/expLog_SMOTE.csv",index=False)
expLog
Out[577]:
exp_name Train Acc Valid Acc Test Acc Train AUC Valid AUC Test AUC Train F1 Score Valid F1 Score Test F1 Score Train Log Loss Valid Log Loss Test Log Loss P Score Train Time Valid Time Test Time Description
0 Baseline_132_features 0.9289 0.9120 0.9188 0.8547 0.7024 0.7500 0.2301 0.1298 0.1071 2.4564 3.0387 2.8058 0.0000 0.5625 0.018573 0.0077 Imbalanced Logistic reg features 132: Num:132, Cat:0 with 20% training data
1 Logistic Regression 0.9256 0.9230 0.9220 0.8162 0.7577 0.7522 0.0602 0.0396 0.0204 2.5697 2.6612 2.6935 0.0002 1.0242 0.017700 0.0066 [["predictor__C", 0.1], ["predictor__penalty", "l1"], ["predictor__tol", 0.0001]]
2 Gradient Boosting 0.9475 0.9224 0.9196 0.9412 0.7240 0.7212 0.4651 0.0212 0.0000 1.8129 2.6801 2.7777 0.0005 6.4526 0.022900 0.0114 [["predictor__max_depth", 5], ["predictor__max_features", 10], ["predictor__min_samples_leaf", 3], ["predictor__n_estimators", 1000], ["predictor__n_iter_no_change", 10], ["predictor__subsample", 0.8], ["predictor__tol", 0.0001], ["predictor__validation_fraction", 0.2]]
3 XGBoost 0.9310 0.9230 0.9236 0.9599 0.7325 0.7367 0.1543 0.0000 0.0000 2.3835 2.6612 2.6374 0.0000 3.3547 0.057700 0.0305 [["predictor__colsample_bytree", 0.2], ["predictor__eta", 0.01], ["predictor__learning_rate", 0.01], ["predictor__max_depth", 3], ["predictor__n_estimators", 500]]
4 Light GBM 0.9281 0.9224 0.9228 0.8927 0.7277 0.7220 0.0864 0.0000 0.0000 2.4847 2.6801 2.6655 0.0012 6.2475 0.116800 0.0649 [["predictor__boosting_type", "dart"], ["predictor__learning_rate", 0.01], ["predictor__max_bin", 100], ["predictor__max_depth", 5], ["predictor__n_estimators", 1000], ["predictor__num_leaves", 5]]
5 RandomForest 0.9248 0.9232 0.9236 0.9510 0.7384 0.7207 0.0059 0.0000 0.0000 2.5980 2.6517 2.6374 0.0001 7.1055 0.504400 0.2098 [["predictor__bootstrap", true], ["predictor__max_depth", 5], ["predictor__max_features", 15], ["predictor__min_samples_leaf", 3], ["predictor__min_samples_split", 10], ["predictor__n_estimators", 1000]]
6 Support Vector 0.9245 0.9232 0.9236 1.0000 0.6624 0.6721 0.0000 0.0000 0.0000 2.6061 2.6517 2.6374 0.0000 2.3175 0.251700 0.1993 [["predictor__C", 0.01], ["predictor__degree", 4], ["predictor__gamma", 0.01], ["predictor__kernel", "rbf"]]
7 Baseline_132_features 0.9256 0.9219 0.9228 0.8240 0.7427 0.7549 0.0503 0.0237 0.0206 2.5697 2.6989 2.6655 0.0006 0.3085 0.021900 0.0081 [["predictor__C", 0.01], ["predictor__penalty", "l2"], ["predictor__tol", 1e-05]]
8 Logistic Regression 0.9255 0.9230 0.9220 0.8156 0.7558 0.7517 0.0602 0.0396 0.0204 2.5737 2.6612 2.6935 0.0002 2.7305 0.018800 0.0074 [["predictor__C", 0.1], ["predictor__penalty", "l1"], ["predictor__tol", 1e-05]]
9 Logistic RegressionSelectKbest 0.9257 0.9227 0.9228 0.8148 0.7578 0.7486 0.0629 0.0396 0.0206 2.5656 2.6706 2.6655 0.0003 2.6489 0.016300 0.0069 [["predictor__C", 0.1], ["predictor__penalty", "l1"], ["predictor__tol", 1e-05]]
10 RandomForestSelectKbest 0.9252 0.9232 0.9236 0.9498 0.7396 0.7177 0.0172 0.0000 0.0000 2.5818 2.6517 2.6374 0.0001 8.7915 0.504400 0.1934 [["predictor__bootstrap", true], ["predictor__max_depth", 5], ["predictor__max_features", 20], ["predictor__min_samples_leaf", 3], ["predictor__min_samples_split", 10], ["predictor__n_estimators", 1000]]
11 XGBoostSelectKbest 0.9310 0.9230 0.9236 0.9591 0.7279 0.7384 0.1551 0.0000 0.0000 2.3835 2.6612 2.6374 0.0000 3.7312 0.059800 0.0314 [["predictor__colsample_bytree", 0.2], ["predictor__eta", 0.1], ["predictor__learning_rate", 0.01], ["predictor__max_depth", 3], ["predictor__n_estimators", 500]]
12 Light GBMSelectKbest 0.9279 0.9227 0.9220 0.8901 0.7219 0.7208 0.0839 0.0083 0.0000 2.4887 2.6706 2.6935 0.0020 6.1818 0.125500 0.0553 [["predictor__boosting_type", "dart"], ["predictor__learning_rate", 0.01], ["predictor__max_bin", 100], ["predictor__max_depth", 5], ["predictor__n_estimators", 1000], ["predictor__num_leaves", 5]]
13 Logistic RegressionVariance 0.9256 0.9230 0.9220 0.8162 0.7577 0.7522 0.0602 0.0396 0.0204 2.5697 2.6612 2.6935 0.0002 0.4332 0.016000 0.0068 [["predictor__C", 0.1], ["predictor__penalty", "l1"], ["predictor__tol", 0.0001]]
14 XGBoostVariance 0.9312 0.9230 0.9236 0.9604 0.7318 0.7388 0.1607 0.0000 0.0000 2.3754 2.6612 2.6374 0.0000 1.4006 0.058700 0.0269 [["predictor__colsample_bytree", 0.2], ["predictor__eta", 0.01], ["predictor__learning_rate", 0.01], ["predictor__max_depth", 3], ["predictor__n_estimators", 500]]
15 Light GBMVariance 0.9281 0.9224 0.9228 0.8927 0.7277 0.7220 0.0864 0.0000 0.0000 2.4847 2.6801 2.6655 0.0012 3.9899 0.112900 0.0687 [["predictor__boosting_type", "dart"], ["predictor__learning_rate", 0.01], ["predictor__max_bin", 100], ["predictor__max_depth", 5], ["predictor__n_estimators", 1000], ["predictor__num_leaves", 5]]
16 RandomForestVariance 0.9251 0.9232 0.9236 0.9511 0.7379 0.7243 0.0148 0.0000 0.0000 2.5859 2.6517 2.6374 0.0001 6.7289 0.510600 0.2204 [["predictor__bootstrap", true], ["predictor__max_depth", 5], ["predictor__max_features", 20], ["predictor__min_samples_leaf", 5], ["predictor__min_samples_split", 5], ["predictor__n_estimators", 1000]]
17 XGBoost SMOTESMOTE 0.9434 0.9180 0.9220 0.9452 0.7119 0.7423 0.4424 0.0942 0.0943 1.9546 2.8311 2.6935 0.0001 15.2817 0.101200 0.0707 [["predictor__eta", 0.01], ["predictor__eval_metric", "auc"], ["predictor__learning_rate", 0.01], ["predictor__max_depth", 5], ["predictor__min_child_weight", 15], ["predictor__n_estimators", 1000], ["predictor__objective", "binary:logistic"], ["predictor__subsample", 0.5]]
18 XGBoost SMOTESMOTE 0.9434 0.9180 0.9220 0.9452 0.7119 0.7423 0.4424 0.0942 0.0943 1.9546 2.8311 2.6935 0.0001 15.4611 0.107800 0.0690 [["predictor__early_stopping_rounds", 5], ["predictor__eta", 0.01], ["predictor__eval_metric", "auc"], ["predictor__learning_rate", 0.01], ["predictor__max_depth", 5], ["predictor__min_child_weight", 15], ["predictor__n_estimators", 1000], ["predictor__objective", "binary:logistic"], ["predictor__subsample", 0.5]]

Kaggle submission

In [578]:
  fs_type='XGBoost SMOTE'
  final_X_train = finaldf[0][selected_features]
  final_y_train = finaldf[0]['TARGET']
  final_X_kaggle_test = kaggle_test
  grid_search.best_estimator_.fit(final_X_train, final_y_train)
  print(final_X_train.shape,final_y_train.shape,final_X_kaggle_test.shape)
  start = time()
  train_time = round(time() - start, 4)
  print("XGBoost SMOTE Score:{0}".format(grid_search.best_estimator_.score(final_X_train, final_y_train)))
  test_class_scores = grid_search.best_estimator_.predict_proba(final_X_kaggle_test)[:, 1]
  print(test_class_scores[0:10])
  
  #For each SK_ID_CURR in the test set, you must predict a probability for the TARGET variable. 
  #
  submit_df = datasets["application_test"][['SK_ID_CURR']]
  submit_df['TARGET'] = test_class_scores
  print(submit_df.head(2))
  submit_df.to_csv(f'/content/drive/My Drive/AML Project/Data/Phase3/submission_{fs_type}.csv',index=False)
(4101, 132) (4101,) (48744, 132)
XGBoost SMOTE Score:0.9234333089490369
[0.05766601 0.22704034 0.01519151 0.08732243 0.17832727 0.06538762
 0.01624627 0.04306863 0.0169574  0.10466651]
   SK_ID_CURR    TARGET
0      100001  0.057666
1      100005  0.227040

Kaggle submission via the command line API

In [ ]:
#! kaggle competitions submit -c home-credit-default-risk -f submission.csv -m "Phase 2-Voting submission"

Deep Learning

Deep Learning Model Pipeline & Workflow

Deep learning is a sub field of machine learning. Deep learning is about learning from past data using artificial neural networks with multiple hidden layers (2 or more hidden layers). Deep neural networks uncrumple complex representation of data step-by-step, layer-by-layer (hence multiple hidden layers) into a clear representation of the data. Artificial neural networks having one hidden layer apart from input and output layer is called as multi-layer perceptron (MLP) network.

image.png

Deep Learning Pipeline Model workflow

image.png

Imports

In [499]:
import torch
import tensorflow as tf
import torch.nn as nn
import torch.nn.functional as func
import torch.optim as optim
from torch.utils.data import DataLoader

# TensorFlow and tf.keras
import tensorflow as tf
from tensorflow import keras
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Dropout
from keras.layers.normalization import BatchNormalization

import copy
from datetime import datetime
import pickle
import time
import torch
from torch.utils.data import Dataset
from torch.utils.data import DataLoader
import torch.nn as nn
import torch.nn.functional as func
import torch.optim as optim
from torch.optim import lr_scheduler

# Metrics
from sklearn.metrics import auc

Single layer Neural Network

Data Preparation

Transform data using data pipeline and converted into Tensor for neural network pipeline.

In [475]:
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
print(device)
cuda:0
In [477]:
full_X_train = data_prep_pipeline.fit_transform(X_train)
full_X_test = data_prep_pipeline.fit_transform(X_test)

full_X_train_gpu = torch.FloatTensor(full_X_train).cuda()
full_X_test_gpu = torch.FloatTensor(full_X_test).cuda()

y_train_gpu =  torch.FloatTensor(y_train.to_numpy()).cuda()
y_test_gpu = torch.FloatTensor(y_test.to_numpy()).cuda()
In [479]:
full_X_test_gpu.shape,full_X_train_gpu.shape
Out[479]:
(torch.Size([1231, 138]), torch.Size([2439, 138]))
In [480]:
results = pd.DataFrame(columns=["ExpID", 
              "Train Acc", "Val Acc", "Test Acc", "p-value",
              "Train AUC", "Val AUC", "Test AUC",
              "Train f1", "Val f1", "Test f1",
              "Train logloss", "Val logloss", "Test logloss",
              "Train Time(s)", "Val Time(s)", "Test Time(s)", 
              "Experiment description",
              "Top 10 Features"])

One layer : Linear and Sigmoid Activate Function

Signmoid layer is used to create the probability of prediction.

In [481]:
D_in = full_X_train_gpu.shape[1]
D_hidden1 = 20
D_hidden2 = 10
D_out= 1
model1 = torch.nn.Sequential( 
    torch.nn.Linear(D_in, D_out),
    nn.Sigmoid())
In [482]:
learning_rate = 0.01
optimizer = torch.optim.Adam(model1.parameters(), lr=learning_rate)
model1 = model1.cuda()
In [483]:
def return_report(y, y_prob):
  _, y_pred = torch.max(y_prob, dim = 1)
  y_pred = y_pred.cpu().numpy()
  acc = accuracy_score(y, y_pred)
  roc_auc = roc_auc_score(y, y_prob.cpu().detach().numpy())

  return_list = ([round(acc,4), round(roc_auc, 4)])

  return return_list
In [484]:
def print_report(y, y_prob):
  _, y_pred = torch.max(y_prob, dim = 1)
  y_pred = y_pred.cpu().numpy()
  acc = accuracy_score(y, y_pred)
  roc_auc = roc_auc_score(y, y_prob.cpu().detach().numpy())

  print(f'Accuracy : {round(acc,4)} ; ROC_AUC : {round(roc_auc, 4)}')

Train Neural Network

In [485]:
epochs = 500
y_train_gpu = y_train_gpu.reshape(-1, 1)
print('Train data : ')
model1.train()
for i in range(epochs):
  

  y_train_pred_prob = model1(full_X_train_gpu)

  loss = func.binary_cross_entropy(y_train_pred_prob, y_train_gpu)
  optimizer.zero_grad()
  #loss = loss_func(y_train_pred_prob, y_train_gpu)
  loss.backward()
  optimizer.step()

  if i % 50 == 49:
    print(f"Epoch {i + 1}:")
    print_report(y_train, y_train_pred_prob)
Train data : 
Epoch 50:
Accuracy : 0.9241 ; ROC_AUC : 0.8273
Epoch 100:
Accuracy : 0.9241 ; ROC_AUC : 0.8308
Epoch 150:
Accuracy : 0.9241 ; ROC_AUC : 0.8325
Epoch 200:
Accuracy : 0.9241 ; ROC_AUC : 0.8336
Epoch 250:
Accuracy : 0.9241 ; ROC_AUC : 0.8347
Epoch 300:
Accuracy : 0.9241 ; ROC_AUC : 0.8357
Epoch 350:
Accuracy : 0.9241 ; ROC_AUC : 0.8365
Epoch 400:
Accuracy : 0.9241 ; ROC_AUC : 0.8371
Epoch 450:
Accuracy : 0.9241 ; ROC_AUC : 0.8376
Epoch 500:
Accuracy : 0.9241 ; ROC_AUC : 0.8381

Evaluation of Neural Network model

In [486]:
model1.eval()
y_test_gpu = y_test_gpu.reshape(-1, 1)
with torch.no_grad():
    y_test_pred_prob=model1(full_X_test_gpu)
    print('-' * 50)
    print('Test data : ')
    print_report(y_test, y_test_pred_prob)
    print('-' * 50)
--------------------------------------------------
Test data : 
Accuracy : 0.9236 ; ROC_AUC : 0.746
--------------------------------------------------

Kaggle Submission

In [502]:
final_X_kaggle_test = kaggle_test
final_X_kaggle_test = data_prep_pipeline.fit_transform(final_X_kaggle_test)
full_X_kaggle_gpu = torch.FloatTensor(final_X_kaggle_test).cuda()
full_X_kaggle_gpu.shape
Out[502]:
torch.Size([48744, 138])
In [490]:
  model1.eval()
  test_class_scores = model1(full_X_kaggle_gpu)
  print(test_class_scores[0:10])
tensor([[0.0984],
        [0.2546],
        [0.0498],
        [0.0925],
        [0.2056],
        [0.0671],
        [0.0261],
        [0.0897],
        [0.0225],
        [0.1072]], device='cuda:0', grad_fn=<SliceBackward>)
In [493]:
  #For each SK_ID_CURR in the test set, you must predict a probability for the TARGET variable. 
  fs_type = "simple_nn"
  submit_df = datasets["application_test"][['SK_ID_CURR']]
  submit_df['TARGET'] = test_class_scores.detach().cpu().numpy()
  print(submit_df.head(2))
  submit_df.to_csv(f'/content/drive/My Drive/AML Project/Data/Phase3/submission_{fs_type}.csv',index=False)
   SK_ID_CURR    TARGET
0      100001  0.098399
1      100005  0.254585

Multi Layer NN model with custom Hinge & CXE Loss

Model Definition

The model contain one linear layer, one hidden layer with Relu function and sigmoid function for probability prediction.

In [524]:
## Model using hidden layers
class SVMNNmodel(nn.Module):
  def __init__(self, input_features , hidden1 = 80, hidden2 = 80, output_features = 1):
    super(SVMNNmodel, self).__init__()
    # self.f_connected1 = nn.Linear(input_features, hidden1)
    # self.f_connected2 = nn.Linear(hidden1, hidden2)
    # self.out = nn.Linear(hidden2, output_features)
    # self.sigmoid = nn.Sigmoid()
    self.f_connected1 = nn.Linear(input_features, hidden1)
    self.out = nn.Linear(hidden1, output_features)

  def forward(self, x):
    #x = func.relu(self.f_connected1(x))
    #x= func.relu(self.f_connected2(x))
    #x = self.out(x)
    #return self.sigmoid(x)
    h_relu = torch.relu(self.f_connected1(x))
    y_target_pred = torch.sigmoid(self.out(h_relu))
    return y_target_pred

Custom Hinge loss Definition

The model contain one linear layer, one hidden layer with Relu function and sigmoid function for probability prediction.

To extend a hard SVM to cases in which the data are not linearly separable (a little noisy), we introduce the hinge loss function,

$${\displaystyle \max \left(0,1-y_{i}({\vec {w}}\cdot {\vec {x}}_{i}-b)\right).} $$

This function is zero if the following constraint for a training example $y_{i}({\vec {w}}\cdot {\vec {x}}_{i}-b)\geq 1,$ is satisfied, in other words, if ${\displaystyle {\vec {x}}_{i}} $ lies on the correct side of the margin (DMZ demilitarized zone). For data on the wrong side of the margin, the function's value is proportional to the distance from the margin.

This type of data loss leads a soft-margin SVM classifier. Computing a (soft-margin) SVM classifier amounts to minimizing an expression of the form:

$$ {LinSVM}(\mathbf{w}, b) = \underset{W,b}{\operatorname{argmin}} \, \left( \overbrace{\dfrac{\lambda}{2}}^A \underbrace{\mathbf{w}^T \cdot \mathbf{w}}_B \quad + \quad \overbrace{\dfrac{1}{m} {\displaystyle \sum\limits_{i=1}^{m}max\left(0, 1 - \underbrace{y^{(i)}(\mathbf{w}^T \cdot \mathbf{x}^{(i)} + b)}_D \right)} }^{C}\right) \qquad (3) $$

where the parameter ${\displaystyle \lambda } $ (corresponding to the A-zone in the above formulation) determines the tradeoff between increasing the margin-size and ensuring that the ${\displaystyle {\vec {x}}_{i}} $ lie on the correct side of the margin (corresponding to the C-zone in the above formulation).

Here choosing a sufficiently small value for $ \lambda$ yields the hard-margin classifier for linearly classifiable input data. Classically, this problem can be solved via quadratic programming (see slides/textbook/wikipedia for details). More recently approaches such as sub-gradient descent and coordinate descent have been proposed and lead to more scaleable implementations without compromising quality.

In [525]:
class SVMLoss(nn.Module):
  def __init__(self):
    super(SVMLoss,self).__init__()
  def forward(self,outputs,labels,model2):
     C = 0.10
    # data_loss = torch.mean(torch.clamp(1 - outputs.squeeze().t() == labels,min=0))
     data_loss = torch.mean(torch.clamp(1 - outputs.squeeze(),min=0))
     weight = model2.out.weight.squeeze()
     reg_loss = weight.t() @ weight
     reg_loss = reg_loss + ( model2.out.bias.squeeze()**2)
     hinge = data_loss +( C*reg_loss/2)
     return (hinge)
In [527]:
class Converttensor(Dataset):
    def __init__(self, feature, label, mode ='train', transforms=None):
        """
        Initialize data set as a list of IDs corresponding to each item of data set

        :param feature: x - numpy array
        :param label: y - numpy array
        """

        self.x = feature
        self.y = label

    
    def __len__(self):
        """
        Return the length of data set using list of IDs

        :return: number of samples in data set
        """
        return (self.x.shape[0])

    def __getitem__(self, index):
        """
        Generate one item of data set.

        :param index: index of item in IDs list

        :return: image and label and bouding box params
        """
        x = self.x[index,:]
        y_target = self.y[index]

        x = torch.FloatTensor(x)
        y_target_arr = np.array(y_target)
        return x, y_target_arr
In [545]:
fprs_net_train, tprs_net_train, fprs_net_valid, tprs_net_valid = [], [], [], []
roc_auc_net_train = 0.0
roc_auc_net_valid = 0.0
num_epochs=25
batch_size=256
CASE_NAME = "NN"

Data Preparation

In [546]:
splits = 1

# Train Test split percentage
subsample_rate = 0.3

finaldf = np.array_split(train_dataset, splits)
X_train = finaldf[0][selected_features]
y_train = finaldf[0]['TARGET']
final_X_kaggle_test = kaggle_test
## split part of data
X_train, X_test, y_train, y_test = train_test_split(X_train, y_train, stratify=y_train,
                                                    test_size=subsample_rate, random_state=42)

X_train, X_valid, y_train, y_valid = train_test_split(X_train, y_train,stratify=y_train,test_size=0.15, random_state=42)

nn_X_train = data_prep_pipeline.fit_transform(X_train)
nn_X_valid = data_prep_pipeline.fit_transform(X_valid)
nn_X_test = data_prep_pipeline.fit_transform(X_test)
nn_X_kaggle_test = data_prep_pipeline.fit_transform(final_X_kaggle_test)
full_X_kaggle_gpu = torch.FloatTensor(nn_X_kaggle_test).cuda()
nn_y_train = np.array(y_train)
nn_y_valid = np.array(y_valid)

in_feature_cnt = nn_X_train.shape[1]
out_feature_cnt = 1

print(f"X train           shape: {nn_X_train.shape}")
print(f"X validation      shape: {nn_X_valid.shape}")
print(f"X test            shape: {nn_X_test.shape}")
print(f"X kaggle_test     shape: {nn_X_kaggle_test.shape}")
print("Feature count           : ",in_feature_cnt)
X train           shape: (182968, 138)
X validation      shape: (32289, 138)
X test            shape: (92254, 138)
X kaggle_test     shape: (48744, 138)
Feature count           :  138
In [547]:
nn_dataset = {'train': nn_X_train, 'val': nn_X_valid}
dataset_sizes = {x_type : len(nn_dataset[x_type]) for x_type in ['train','val']}
In [548]:
dataset_sizes
Out[548]:
{'train': 182968, 'val': 32289}
In [549]:
## Transform dataset
nn_dataset['train'] = Converttensor(nn_dataset['train'] ,nn_y_train, mode='train')
In [550]:
## Transform validation dataset
nn_dataset['val'] = Converttensor(nn_dataset['val'] ,nn_y_valid, mode='validation')
In [551]:
nn_dataset
Out[551]:
{'train': <__main__.Converttensor at 0x7f83f166b0d0>,
 'val': <__main__.Converttensor at 0x7f83df5f3710>}
In [552]:
## Set dataloader
dataloaders = {x_type: torch.utils.data.DataLoader(nn_dataset[x_type], batch_size=batch_size,shuffle=True, num_workers=0)  
              for x_type in ['train', 'val']}  

Train Model

In [553]:
# Set model
nn_model = SVMNNmodel(input_features = in_feature_cnt, output_features= 1).cuda()
#nn_model = nn_model.float()
In [554]:
#del convergence
try:
       convergence
       epoch_offset=convergence.epoch.iloc[-1]+1
except NameError:
        convergence=pd.DataFrame(columns=['epoch','phase','roc_auc','accuracy','CXE','Hinge'])
        epoch_offset=0
In [555]:
# Code adapted from https://pytorch.org/tutorials/beginner/transfer_learning_tutorial.html
def train(optimizer_cxe,optimizer_hinge,criteron,scheduler_cxe,scheduler_hinge,num_epochs=25, w_cel=1.0):
    
    global roc_auc_train
    global roc_auc_valid

    fac_cel=torch.tensor(w_cel)

    start = time.time()

    best_model_wts = copy.deepcopy(nn_model.state_dict())
    best_acc = 0.0

    # Store results to easier collect stats
    nn_y_pred = {x: np.zeros((dataset_sizes[x],1)) for x in ['train', 'val']}

    for epoch in range(num_epochs):

        # Each epoch has a training and validation phase
        for phase in ['train', 'val']:
            t0=time.time()
            # Reset to zero to be save
           
            nn_y_pred[phase].fill(0)
            if phase == 'train':
                nn_model.train()  # Set model to training mode
            else:
                nn_model.eval()   # Set model to evaluate mode

            running_loss = 0.0
            running_corrects = 0
            running_hinge = 0.0
            running_cxe = 0.0

            # Iterate over data.
            ix=0
            for inputs, targets in dataloaders[phase]:
                n_batch = len(targets)
                
                #nn_y_pred[phase][ix:ix+n_batch,:] = targets.detach().numpy().reshape(-1,1)

                inputs = inputs.to(device)
                targets = targets.to(device).float()

                # zero the parameter gradients
                optimizer_hinge.zero_grad()
                optimizer_cxe.zero_grad()

                # forward
                # track history if only in train
                with torch.set_grad_enabled(phase == 'train'):
                    output_target = nn_model.forward(inputs)
                    preds = torch.where((output_target > .5), 1, 0)
                    #print(output_target.squeeze(),targets)
                    ix += n_batch
                    loss_cxe = func.binary_cross_entropy(output_target.squeeze(), targets)
                    loss_hinge = criteron.forward(output_target.squeeze(), targets,nn_model)

                    # backward + optimize only if in training phase
                    if phase == 'train':
                        loss_hinge.backward()
                        optimizer_hinge.step()
                        #loss_cxe.backward()
                        optimizer_cxe.step()

                # statistics
                running_hinge += loss_hinge.item() * inputs.size(0)
                running_corrects += 1*(preds == targets.data.int()).sum().item()
                running_cxe += loss_cxe.item() * inputs.size(0)

            if phase == 'train':
                scheduler_hinge.step()
                scheduler_cxe.step()

            epoch_cxe = running_cxe / dataset_sizes[phase]
            epoch_hinge = running_hinge / dataset_sizes[phase]
            epoch_acc = running_corrects / dataset_sizes[phase]                      

            epoch_roc_auc = 0.0 
            if (phase == 'train'):
                ## Calculate 'false_positive_rate' and 'True_positive_rate' of train
    
                nn_fprs_train, nn_tpr_train, nn_thresholds = roc_curve(targets.detach().cpu().numpy(), output_target.squeeze().detach().cpu().numpy())
                fprs_net_train.append(nn_fprs_train)
                tprs_net_train.append(nn_tpr_train)
                roc_auc_train = round(auc(nn_fprs_train, nn_tpr_train), 4)  
                epoch_roc_auc = roc_auc_train

            elif (phase == 'val'):
                ## Calculate 'false_positive_rate' and 'True_positive_rate' of valid
                nn_fpr_valid, nn_tpr_valid, thresholds = roc_curve(targets.detach().cpu().numpy(), output_target.squeeze().detach().cpu().numpy())
                fprs_net_valid.append(nn_fpr_valid)
                tprs_net_valid.append(nn_tpr_valid)
                roc_auc_valid = round(auc(nn_fpr_valid, nn_tpr_valid), 4)
                epoch_roc_auc = roc_auc_valid

            dt=time.time() - t0
            fmt='{:6s} ROC_AUC: {:.4f} Acc: {:.4f} CXE: {:.4f} Hinge: {:.4f}  DT={:.1f}'
            out_list=[phase, epoch_roc_auc, epoch_acc, epoch_cxe, epoch_hinge] + [dt]
            out_str=fmt.format(*out_list)
            if phase=='train':
                epoch_str='Epoch {}/{} '.format(epoch, num_epochs)
                out_str=epoch_str + out_str
            else:
                out_str = ' '*len(epoch_str) + out_str
            print(out_str)

            if (phase == 'val') and epoch == num_epochs-1:
                 plt.plot(nn_fprs_train, nn_tpr_train, color='blue') 
                 plt.plot(nn_fpr_valid, nn_tpr_valid, color='orange')
                 plt.xlim([0.0,1.0])
                 plt.ylim([0.0,1.0])
                 plt.xlabel('False Positive Rate')
                 plt.ylabel('True Positive Rate')
                 plt.title(f'ROC Curve Comparison')
                 plt.legend([f'TrainRocAuc (AUC = {roc_auc_train})', f'TestRocAuc (AUC = {roc_auc_valid})'])
                 plt.show()

            convergence.loc[len(convergence)] = [epoch+epoch_offset,phase,   
                        epoch_roc_auc, epoch_acc, epoch_cxe, epoch_hinge]
            
            # deep copy the model
            if phase == 'val' and epoch_acc > best_acc:
                best_acc = epoch_acc
                best_model_wts = copy.deepcopy(nn_model.state_dict())
 
    time_elapsed = time.time() - start
    print('Training complete in {:.0f}m {:.0f}s'.format(
        time_elapsed // 60, time_elapsed % 60))
    print('Best val Acc: {:4f}'.format(best_acc))

    # load best model weights
    nn_model.load_state_dict(best_model_wts)   
    

Execute Model

In [556]:
# Run all Code cells first

optimizer_cxe = optim.Adam(nn_model.parameters(), lr=0.0001)
optimizer_hinge = torch.optim.SGD(nn_model.parameters(), lr=learning_rate,momentum = 0.5,weight_decay = 0.1)
nn_model = nn_model.cuda()
scheduler_cxe = lr_scheduler.StepLR(optimizer_cxe, step_size=10, gamma=0.1)  
scheduler_hinge= lr_scheduler.StepLR(optimizer_hinge, step_size=10, gamma=0.1)
criteron = SVMLoss()
train(optimizer_cxe,optimizer_hinge,criteron,scheduler_cxe,scheduler_hinge,num_epochs=num_epochs, w_cel=0.000000001)

t0=time.time()
date_time = datetime.now().strftime("--%Y-%m-%d-%H-%M-%S-%f")
pickle.dump(nn_model,open(DATA_DIR + '/' + CASE_NAME + date_time + '.p','wb'))
print('Pickled in {:.2f} sec'.format(time.time()-t0))
Epoch 0/25 train  ROC_AUC: 0.5033 Acc: 21.6182 CXE: 2.3685 Hinge: 0.1433  DT=5.5
           val    ROC_AUC: 0.2812 Acc: 20.6624 CXE: 2.8362 Hinge: 0.0966  DT=0.7
Epoch 1/25 train  ROC_AUC: 0.5558 Acc: 20.6596 CXE: 2.9760 Hinge: 0.0884  DT=5.6
           val    ROC_AUC: 0.2903 Acc: 20.6555 CXE: 3.0710 Hinge: 0.0829  DT=0.7
Epoch 2/25 train  ROC_AUC: 0.5286 Acc: 20.6607 CXE: 3.1178 Hinge: 0.0809  DT=5.2
           val    ROC_AUC: 0.7778 Acc: 20.6486 CXE: 3.1476 Hinge: 0.0797  DT=0.8
Epoch 3/25 train  ROC_AUC: 0.6756 Acc: 20.6592 CXE: 3.1682 Hinge: 0.0790  DT=5.5
           val    ROC_AUC: 0.5889 Acc: 20.6486 CXE: 3.1924 Hinge: 0.0782  DT=0.7
Epoch 4/25 train  ROC_AUC: 0.5815 Acc: 20.6596 CXE: 3.2153 Hinge: 0.0774  DT=5.5
           val    ROC_AUC: 0.6214 Acc: 20.6348 CXE: 3.2403 Hinge: 0.0764  DT=0.7
Epoch 5/25 train  ROC_AUC: 0.5390 Acc: 20.6576 CXE: 3.2599 Hinge: 0.0757  DT=5.5
           val    ROC_AUC: nan Acc: 20.6693 CXE: 3.2784 Hinge: 0.0749  DT=0.7
Epoch 6/25 train  ROC_AUC: 0.5908 Acc: 20.6600 CXE: 3.2933 Hinge: 0.0745  DT=5.6
           val    ROC_AUC: 0.8444 Acc: 20.6486 CXE: 3.3055 Hinge: 0.0739  DT=0.7
Epoch 7/25 train  ROC_AUC: 0.6856 Acc: 20.6615 CXE: 3.3189 Hinge: 0.0735  DT=5.3
           val    ROC_AUC: 0.2258 Acc: 20.6555 CXE: 3.3283 Hinge: 0.0730  DT=0.7
Epoch 8/25 train  ROC_AUC: 0.5562 Acc: 20.6619 CXE: 3.3394 Hinge: 0.0727  DT=5.7
           val    ROC_AUC: 0.2188 Acc: 20.6624 CXE: 3.3500 Hinge: 0.0722  DT=0.7
Epoch 9/25 train  ROC_AUC: 0.4945 Acc: 20.6611 CXE: 3.3544 Hinge: 0.0720  DT=5.5
           val    ROC_AUC: 0.4397 Acc: 20.6417 CXE: 3.3639 Hinge: 0.0717  DT=0.7
Epoch 10/25 train  ROC_AUC: 0.6696 Acc: 20.6607 CXE: 3.3638 Hinge: 0.0717  DT=5.5
            val    ROC_AUC: 0.4630 Acc: 20.6279 CXE: 3.3631 Hinge: 0.0716  DT=0.7
Epoch 11/25 train  ROC_AUC: 0.5940 Acc: 20.6584 CXE: 3.3674 Hinge: 0.0716  DT=5.4
            val    ROC_AUC: 0.6452 Acc: 20.6555 CXE: 3.3675 Hinge: 0.0715  DT=0.7
Epoch 12/25 train  ROC_AUC: 0.5930 Acc: 20.6576 CXE: 3.3717 Hinge: 0.0715  DT=5.6
            val    ROC_AUC: 0.4194 Acc: 20.6555 CXE: 3.3752 Hinge: 0.0713  DT=0.7
Epoch 13/25 train  ROC_AUC: 0.4987 Acc: 20.6600 CXE: 3.3761 Hinge: 0.0713  DT=5.5
            val    ROC_AUC: 0.4569 Acc: 20.6417 CXE: 3.3790 Hinge: 0.0712  DT=0.7
Epoch 14/25 train  ROC_AUC: 0.4046 Acc: 20.6584 CXE: 3.3797 Hinge: 0.0712  DT=5.6
            val    ROC_AUC: 0.5345 Acc: 20.6417 CXE: 3.3820 Hinge: 0.0711  DT=0.7
Epoch 15/25 train  ROC_AUC: 0.3792 Acc: 20.6604 CXE: 3.3827 Hinge: 0.0710  DT=5.3
            val    ROC_AUC: 0.5948 Acc: 20.6417 CXE: 3.3843 Hinge: 0.0710  DT=0.7
Epoch 16/25 train  ROC_AUC: 0.4592 Acc: 20.6596 CXE: 3.3852 Hinge: 0.0709  DT=5.4
            val    ROC_AUC: 0.0806 Acc: 20.6555 CXE: 3.3870 Hinge: 0.0709  DT=0.7
Epoch 17/25 train  ROC_AUC: 0.6179 Acc: 20.6592 CXE: 3.3878 Hinge: 0.0708  DT=5.5
            val    ROC_AUC: 0.4556 Acc: 20.6486 CXE: 3.3892 Hinge: 0.0708  DT=0.7
Epoch 18/25 train  ROC_AUC: 0.5382 Acc: 20.6588 CXE: 3.3899 Hinge: 0.0707  DT=5.5
            val    ROC_AUC: 0.8438 Acc: 20.6624 CXE: 3.3906 Hinge: 0.0707  DT=0.7
Epoch 19/25 train  ROC_AUC: 0.4869 Acc: 20.6584 CXE: 3.3910 Hinge: 0.0707  DT=5.4
            val    ROC_AUC: 0.5556 Acc: 20.6486 CXE: 3.3930 Hinge: 0.0706  DT=0.8
Epoch 20/25 train  ROC_AUC: 0.5301 Acc: 20.6588 CXE: 3.3924 Hinge: 0.0706  DT=5.5
            val    ROC_AUC: 0.8125 Acc: 20.6624 CXE: 3.3932 Hinge: 0.0706  DT=0.7
Epoch 21/25 train  ROC_AUC: 0.5097 Acc: 20.6584 CXE: 3.3927 Hinge: 0.0706  DT=5.5
            val    ROC_AUC: 0.4071 Acc: 20.6348 CXE: 3.3935 Hinge: 0.0706  DT=0.7
Epoch 22/25 train  ROC_AUC: 0.4251 Acc: 20.6572 CXE: 3.3930 Hinge: 0.0706  DT=5.5
            val    ROC_AUC: 0.8065 Acc: 20.6555 CXE: 3.3937 Hinge: 0.0706  DT=0.7
Epoch 23/25 train  ROC_AUC: 0.4625 Acc: 20.6588 CXE: 3.3931 Hinge: 0.0706  DT=5.4
            val    ROC_AUC: 0.2258 Acc: 20.6555 CXE: 3.3939 Hinge: 0.0706  DT=0.8
Epoch 24/25 train  ROC_AUC: 0.4143 Acc: 20.6600 CXE: 3.3934 Hinge: 0.0706  DT=5.4
            val    ROC_AUC: 0.5938 Acc: 20.6624 CXE: 3.3941 Hinge: 0.0706  DT=0.7
Training complete in 2m 35s
Best val Acc: 20.669330
Pickled in 0.02 sec
In [557]:
convergence.head(5)
Out[557]:
epoch phase roc_auc accuracy CXE Hinge
0 0 train 0.5033 21.618163 2.368533 0.143321
1 0 val 0.2812 20.662424 2.836224 0.096613
2 1 train 0.5558 20.659569 2.975988 0.088383
3 1 val 0.2903 20.655517 3.070968 0.082892
4 2 train 0.5286 20.660749 3.117803 0.080936

Plot Convergence

In [558]:
from scipy.stats import iqr

def plot_convergence(figsize=(22,12)):
  conv = {phase : convergence[convergence.phase==phase] 
          for phase in ['train','val']}

  fig,axes = plt.subplots(2,2,figsize=figsize)
  cols = {'train' : 'tab:blue', 'val' : 'tab:orange'}

  # Loss
  ax=axes[0,0]
  for phase in ['train','val']:
    ax.plot(conv[phase].epoch,conv[phase].Hinge,label=phase,c=cols[phase])
  ax.set_xlabel('Epoch') 
  ax.set_ylabel('Hinge Loss')
  ax.legend()
  ax.grid()

  # CXE
  ax=axes[0,1]
  for phase in ['train','val']:
    ax.plot(conv[phase].epoch,conv[phase].CXE,label='CXE/'+phase,c=cols[phase])
  ax.set_xlabel('Epoch') 
  ax.set_ylabel('CXE')
  ax.legend()
  ax.grid()

  # Accuracy
  ax=axes[1,0]
  for phase in ['train','val']:
    ax.plot(conv[phase].epoch,conv[phase].accuracy,label='Acc/'+phase,c=cols[phase])
  ax.set_xlabel('Epoch') 
  ax.set_ylabel('Accuracy')
  ax.legend()
  ax.grid()

  # Plot ROC_AUC Curve of train and test
  ax=axes[1,1]
  for i in range(num_epochs):
      plt.plot(fprs_net_train[i],tprs_net_train[i], color='blue')
      plt.plot(fprs_net_valid[i],tprs_net_valid[i], color='orange')
  ax.set_xlim([0.0,1.0])
  ax.set_ylim([0.0,1.0])
  ax.set_xlabel('False Positive Rate')
  ax.set_ylabel('True Positive Rate')
  ax.set_title(f'ROC Curve Comparison')
  ax.legend([f'TrainRocAuc (AUC = {roc_auc_train})', f'TestRocAuc (AUC = {roc_auc_valid})'])
  ax.grid()
In [559]:
plot_convergence(figsize=(22,12))

Kaggle Submission

In [560]:
  nn_model.eval()
  test_class_scores = nn_model(full_X_kaggle_gpu)
  print(test_class_scores[0:10])
tensor([[0.9930],
        [0.9851],
        [0.9685],
        [0.9358],
        [0.9693],
        [0.9407],
        [0.9788],
        [0.9566],
        [0.9444],
        [0.9995]], device='cuda:0', grad_fn=<SliceBackward>)
In [561]:
  #For each SK_ID_CURR in the test set, you must predict a probability for the TARGET variable. 
  fs_type = "Multilayer_nn1"
  submit_df = datasets["application_test"][['SK_ID_CURR']]
  submit_df['TARGET'] = test_class_scores.detach().cpu().numpy()
  print(submit_df.head(2))
  submit_df.to_csv(f'/content/drive/My Drive/AML Project/Data/Phase3/submission_{fs_type}.csv',index=False)
   SK_ID_CURR    TARGET
0      100001  0.993022
1      100005  0.985118

Experimental Results

Traditional Models

Below is the resulting table for the various experiments we performed on the given dataset. Please refer section Final results for traditional models.

image.png

Deep Learning

Single Layer Neural Network

image.png

Multi Layer Neural Network

image.png

Discussion of Results

As you can see in the Experimental Results Final results we have performed various feature selection technique like (RFE, PCA, Variance Threshold, SelectKBest, SMOTE) on certain specific models with 132 highly correlated features. Below is brief description of results attained in these experiments.

Our best model turned out to be Logistic Regression with SelectKBest with 74.86% ROC score. Our hopes were higher on XGBoost classifier but it stood out to be second best in our models.

Our Deep Learning of simple network preformed model better than the multilayer network. The ROC score came as 74.6% for the simple network. For multilayer network our score came as 59.38%.

Compared to traditional machine learning model, deep learning model trained on completed dataset much faster.

Below has more details or the various classifiers executed in this project.

  1. Logistic Regression : This model was chosen as the baseline model trained with imbalanced dataset and later performed feature selection using RFE, SelectKBest, PCA & Variance Threshold technique on it. The baseline training accuracy for this model was encouraging which let us to perform the prior mentioned feature selection on these models. The best model for logistic regression we had is with Variance Threshold, with training accuracy as 92.56% and test accuracy as 92.2%. A 75.22% ROC score resulted with best parameters for this model. The same model was run with other feature selection performed very closer to the best model.

  2. Gradient Boosting : Boosting didn't help in achieving better results than the baseline model. The results were not good enough to continue in implementing & evaluating other feature selection technique. Training accuracy of 94.75% and test accuracy of 91.95% was achieved in this model. Test ROC under the curve for this model came out to 72.12%

  3. XGBoost : By far this model resulted in the second best model with RFE hence we continued to explore other feature selection techniques on this. The best performing model for XGBoost was with Variance Threshold. The accuracy of the training and test are 93.1% and test 92.36%. Test ROC under the curve is 73.88%. The other feature selection were very closer to the best XGBoost model. We also performed XGBoost with SMOTE as the dataset had oversampled records. The ROC score has promising result with 74.23%.

  4. Light BGM : Our expectation was this model would give us better and faster results than XGBoost, however it was slightly lower compared to XGBoost. Both RFE and variance threshold feature selection resulted in same ROC score of 72.2. The training accuracy came as 92.81% and test accuracy 92.28% was achieved.

  5. Random Forest : On our last decision tree models, the best Random Forest was with variance threshold which produced training accuracy of 92.51% and test accuracy of 92.36%. Test ROC score came out as 72.43%. Random forest performed better compared to LightBGM but lower than XGBoost.

  6. SVM : This was the lowest performing model in our experiment. Hence we didn't decide to continue on SVM with other feature selection techniques. The ROC score achieved for this model was way lower i.e. is 67.21%.

Conclusion

In the final phase, after proving our hypothesis that tuned machine learning techniques can outperform baseline models to aid Home Credit in their evaluation of loan applications, we believe expanding our framework will create a more robust environment with improved performance.

Logistic regression, XGBoost, Random Forest and LightGBM were selected to run with RFE, PCA, SelectKBest and Variance Threshold for feature selection, and SMOTE for data imbalance. The best performance for each algorithm was included in the classification ensemble using soft voting. The resulting Kaggle score was 0.72592 ROC_AUC.

Single and Multi-layer deep learning models, including linear, sigmoid, ReLu, and hidden layers were added with binary CXE, custom hinge loss using adam & sgd optimizer. The deep learning Kaggle score fell short of the ensemble model; additional experimentation will result in a better performing deep learning models. By combining and continuing to refine our extended loss function, we can further demonstrate our effectiveness.

Kaggle submissions

Click on this link

Phase - 1 : Kaggle Submission image-6.png

Phase - 2 : Kaggle Submission For phase-2, we did multiple submission in Kaggle with different feature setting. Below is the details. image-2.png

Phase - 3 : Kaggle Submission Below submission were done for Feature Selection with RFE, PCA, Variance Threshold & XGBOOST SMOTE with early stoping..

image-7.png

For Deep Learning, we submitted below Kaggle submission. image-8.png

Our Best Kaggle submission.

Our best Kaggle score was based on Voting Classifier with SelectKBest feature selection.

image.png

References

Some of the material in this notebook has been adopted from following

  1. https://www.kaggle.com/willkoehrsen/start-here-a-gentle-introduction/notebook
  2. https://towardsdatascience.com/a-machine-learning-approach-to-credit-risk-assessment-ba8eda1cd11f
  3. https://juhiramzai.medium.com/introduction-to-credit-risk-modeling-e589d6914f57
  4. https://scikit-learn.org/stable/modules/generated/sklearn.utils.resample.html
  5. https://stackoverflow.com/questions/28930465/what-is-the-difference-between-flatten-and-ravel-functions-in-numpy
  6. https://machinelearningmastery.com/rfe-feature-selection-in-python/#:~:text=RFE%20is%20a%20wrapper%2Dtype%20feature%20selection%20algorithm.&text=This%20is%20achieved%20by%20fitting,specified%20number%20of%20features%20remains.
  7. https://www.analyticsvidhya.com/blog/2020/10/7-feature-engineering-techniques-machine-learning/
  8. https://www.geeksforgeeks.org/append-extend-python/
  9. https://www.analyticsvidhya.com/blog/2020/03/google-colab-machine-learning-deep-learning/
  10. https://stackify.com/python-garbage-collection/
  11. https://www.analyticsvidhya.com/blog/2020/10/overcoming-class-imbalance-using-smote-techniques/
  12. https://towardsdatascience.com/5-smote-techniques-for-oversampling-your-imbalance-data-b8155bdbe2b5
  13. https://www.analyticsvidhya.com/blog/2020/10/feature-selection-techniques-in-machine-learning/
  14. https://pytorch.org/tutorials/beginner/transfer_learning_tutorial.html
  15. https://medium.com/mindorks/what-is-feature-engineering-for-machine-learning-d8ba3158d97a
  16. https://www.forbes.com/sites/gilpress/2016/03/23/data-preparation-most-time-consuming-least-enjoyable-data-science-task-survey-says/?sh=658a96346f63
  17. https://medium.com/analytics-vidhya/what-is-multicollinearity-and-how-to-remove-it-413c419de2f https://towardsdatascience.com/data-leakage-in-machine-learning-6161c167e8ba#:~:text=The%20most%20obvious%20cause%20of,test%20data%20with%20training%20data.
  18. https://stats.stackexchange.com/questions/412478/feature-selection-on-full-training-set-does-information-leak-if-using-filter-ba#:~:text=1%20Answer&text=You%20can%20reduce%20the%20features,if%20you%20cross%20validate%20afterwards.