Project Title : Home Credit Default Risk Prediction
Group Name : Group1
Group Number : 1
Group Members (From Left to right and top to bottom)
Team Montage
Overview : The course project is based on the Home Credit Default Risk (HCDR) Kaggle Competition. The goal of this project is to predict whether or not a client will repay a loan. In order to make sure that people who struggle to get loans due to insufficient or non-existent credit histories have a positive loan experience, Home Credit makes use of a variety of alternative data--including telco and transactional information--to predict their clients' repayment abilities.
The objective of this project is to use machine learning methodologies on historical loan application data to predict whether or not an applicant will be able to repay a loan. As an extension to EDA and hyper-tuned model, this phase provided valuable insights when feature engineering was modified to handle data leakage employing a better data processing flow. Multiple experiments were conducted applying feature selection techniques including RFE, SelecKbest, Variance threshold to Logistic regression, Gradient Boosting, XGBoost, LightGBM & SVM models , further handling class imbalance using SMOTE for XGBoost , monitoring error generalization with early stopping and building high performance Neural Networks . Our results in this phase show that the best performing algorithm was Logistic Regerssion with varaince threshold selection with test ROC_AOC as 75.22%. The lowest performing was SVM model with test AUC(Area under ROC) as 67.21%. Our best score in Kaggle submission was for Logistic Regression with SelectKBest with score of 0.72158 for private and 0.72592 for public.
Home Credit is an international non-bank financial institution, which primarily focuses on lending people money regardless of their credit history. Home credit groups aim to provide positive borrowing experience to customers, who do not bank on traditional sources. Hence, Home Credit Group published a dataset on Kaggle website with the objective of identifying and solving unfair loan rejection.
The goal of this project is to build a machine learning model to predict the customer behavior on repayment of loan. Our task would be to create a pipeline to build a baseline machine learning model using logistic regression classification algorithms. The final model will be evaluated with various performance metrics to build a better model. Businesses will be able to use the output of the model to identify if loan is at risk to default. The new model built will ensure that clients capable of repayment are not rejected and that loans are given with a principal, maturity, and repayment calendar that will empower their clients to be successful.
The results of our machine learning pipelines will be measured using the follwing metrics;
The pipeline results will be logged, compared and ranked using the appropriate measurements. The most efficient pipeline will be submitted to the HCDR Kaggle Competition.
Workflow
For this project, we are following the proposed workflow as mentioned in Phase-0 of this project.
Kaggle is a Data Science Competition Platform which shares a lot of datasets. In the past, it was troublesome to submit your result as your have to go through the console in your browser and drag your files there. Now you can interact with Kaggle via the command line. E.g.,
! kaggle competitions files home-credit-default-risk
If running on Google Drive Run below else not
import gc
gc.set_threshold(1000, 15, 15)
def collecttrash():
  print('before collection : ',gc.get_count())
  gc.collect()
  print('after collection : ',gc.get_count())
collecttrash()
%config Completer.use_jedi = False
from time import time, ctime
nb_start = time()
print("Note Book Start time:   ", ctime(nb_start))
# !pwd
# !mkdir ~/.kaggle
# !cp /root/shared/Downloads/kaggle.json ~/.kaggle
# !chmod 600 ~/.kaggle/kaggle.json
Mount drive for file load setup
Home Credit is a non-banking financial institution, founded in 1997 in the Czech Republic.
The company operates in 14 countries (including United States, Russia, Kazahstan, Belarus, China, India) and focuses on lending primarily to people with little or no credit history which will either not obtain loans or became victims of untrustworthly lenders.
Home Credit group has over 29 million customers, total assests of 21 billions Euro, over 160 millions loans, with the majority in Asia and and almost half of them in China (as of 19-05-2018).
While Home Credit is currently using various statistical and machine learning methods to make these predictions, they're challenging Kagglers to help them unlock the full potential of their data. Doing so will ensure that clients capable of repayment are not rejected and that loans are given with a principal, maturity, and repayment calendar that will empower their clients to be successful.
There are 7 different sources of data:
# Uncomment if using running it on jupyter lab 
# For code autocompletion
# %config Completer.use_jedi = False
# Preprocessing imports
import copy
import numpy as np
import pandas as pd 
import os
import gc
import zipfile
import matplotlib.pyplot as plt
# from matplotlib import pyplot
import seaborn as sns
from pandas.plotting import scatter_matrix
# Pipelines
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline, Pipeline, FeatureUnion
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import PolynomialFeatures
from sklearn.decomposition import PCA
# Model imports
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import SGDClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import ShuffleSplit
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import cross_validate
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.utils import resample
import sklearn.metrics as metrics
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score, log_loss
from sklearn.metrics import classification_report, roc_auc_score, make_scorer
from sklearn.metrics import roc_auc_score, make_scorer, roc_curve, ConfusionMatrixDisplay, precision_recall_curve
from sklearn.metrics import explained_variance_score
from sklearn.metrics import plot_roc_curve, plot_confusion_matrix, plot_precision_recall_curve
from scipy import stats
from time import time, ctime
# Display help
from IPython.display import display, HTML
import re
import json
import pprint
import warnings
warnings.filterwarnings('ignore')
pprint = pprint.PrettyPrinter().pprint
# !pip install lightgbm
from sklearn.model_selection import ShuffleSplit
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_validate
from sklearn.utils import resample
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.linear_model import SGDClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from sklearn.decomposition import PCA
from sklearn.feature_selection import RFE
from sklearn.ensemble import VotingClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.feature_selection import VarianceThreshold
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import mutual_info_classif
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score, log_loss, classification_report, roc_auc_score, make_scorer
from scipy import stats
import json
from matplotlib import pyplot
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, make_scorer, roc_curve, ConfusionMatrixDisplay, precision_recall_curve
from sklearn.metrics import explained_variance_score
from sklearn.metrics import plot_roc_curve, plot_confusion_matrix, plot_precision_recall_curve
DATA_DIR = "/root/shared/I526_AML_Student/Assignments/Unit-Project-Home-Credit-Default-Risk/HCDR_Phase_1_baseline_submission/data"   #same level as course repo in the data directory
#DATA_DIR = os.path.join('./ddddd/')
!mkdir $DATA_DIR
!ls -l $DATA_DIR
! kaggle competitions download home-credit-default-risk -p $DATA_DIR
from google.colab import drive 
drive.mount('/content/drive',force_remount=True)
import os 
os.chdir("/content/drive/My Drive")
unzippingReq = False
if unzippingReq: #please modify this code 
    zip_file = DATA_DIR + '/' + 'home-credit-default-risk.zip'
    zip_ref = zipfile.ZipFile(zip_file, 'r')
    zip_ref.extractall(path=DATA_DIR)
    zip_ref.close()
Helper Function
def load_data(in_path, name):
    df = pd.read_csv(in_path)
    print(f"{name}: shape is {df.shape}")
    print(df.info())
    display(df.head(5))
    return df
datasets={}  # lets store the datasets in a dictionary so we can keep track of them easily
The full dataset consists of 7 tables. There is 1 primary table and 6 secondary tables.
ds_name = 'application_test'
datasets[ds_name] = load_data(os.path.join(DATA_DIR, f'{ds_name}.csv'), ds_name)
Bureau 
This table includes all previous credits received by a customer from other financial institutions prior to their loan application. There is one row for each previous credit, meaning a many-to-one relationship with the primary table. We could join it with primary table by using current application ID, SK_ID_CURR. 
The number of variables are 17.The number of data entries are 1,716,428.
Bureau Balance 
This table includes the monthly balance for a previous credit at other financial institutions. There is one row for each monthly balance, meaning a many-to-one relationship with the Bureau table. We could join it with bureau table by using bureau's ID, SK_ID_BUREAU. 
The number of variables are 3. The number of data entries are 27,299,925
Previous Application 
This table includes previous applications for loans made by the customer at Home Credit. There is one row for each previous application, meaning a many-to-one relationship with the primary table. We could join it with primary table by using current application ID, SK_ID_CURR. 
There are four types of contracts: 
a. Consumer loan(POS – Credit limit given to buy consumer goods) 
b. Cash loan(Client is given cash) 
c. Revolving loan(Credit) 
d. XNA (Contract type without values) 
The number of variables are 37. The number of data entries are 1,670,214
POS CASH Balance 
This table includes a monthly balance snapshot of a previous point of sale or cash loan that the customer has at Home Credit. There is one row for each monthly balance, meaning a many-to-one relationship with the Previous Application table. We would  join it with Previous Application table by using previous application ID, SK_ID_PREV, then join it with primary table by using current application ID, SK_ID_CURR. 
The number of variables are 8. The number of data entries are 10,001,358
Credit Card Balance 
This table includes a monthly balance snapshot of previous credit cards the customer has with Home Credit. There is one row for each previous monthly balance, meaning a many-to-one relationship with the Previous Application table.We could join it with Previous Application table by using previous application ID, SK_ID_PREV, then join it with primary table by using current application ID, SK_ID_CURR. 
The number of variables are 23. The number of data entries are 3,840,312
Installments Payments 
This table includes previous repayments made or not made by the customer on credits issued by Home Credit. There is one row for each payment or missed payment, meaning a many-to-one relationship with the Previous Application table. We would join it with Previous Application table by using previous application ID, SK_ID_PREV, then join it with primary table by using current application ID, SK_ID_CURR. 
The number of variables are 8 . The number of data entries are 13,605,401
The application dataset has the most information about the client: Gender, income, family status, education ...
Download all the files
%%time
ds_names = ("application_train", "application_test", "bureau","bureau_balance","credit_card_balance","installments_payments",
            "previous_application","POS_CASH_balance")
for ds_name in ds_names:
    datasets[ds_name] = load_data(os.path.join(DATA_DIR, f'{ds_name}.csv'), ds_name)
for ds_name in datasets.keys():
    print(f'dataset {ds_name:24}: [ {datasets[ds_name].shape[0]:10,}, {datasets[ds_name].shape[1]}]')
Exploratory Data Analysis is valuable to this project since it allows us to get closer to the certainty that the future results will be valid, accurately interpreted, and applicable to the proposed solution.
In phase 1 for this project this step involves looking at the summary statistics for each individual table in the model and focusing on the missing data , distribution and its central tendencies such as mean, median, count, min, max and the interquartile ranges.
Categorical and numerical features were looked at to identify anamolies in the data. Specific features were chosen to be visualized based on the correlation and distribution. The highly correlated features were used to plot the density to evaluate the distributions in comparison to the target.
pd.set_option("display.max_rows", None, "display.max_columns", None)
# Full stats
def stats_summary1(df, df_name):
    print(datasets[df_name].info(verbose=True, null_counts=True ))
    print("-----"*15)
    print(f"Shape of the df {df_name} is {df.shape} \n")
    print("-----"*15)
    print(f"Statistical summary of {df_name} is :")
    print("-----"*15)
    print(f"Description of the df {df_name}:\n")
    print(display(HTML(np.round(datasets['application_train'].describe(),2).to_html()))) 
    #print(f"Description of the df {df_name}:\n",np.round(datasets['application_train'].describe(),2))
def stats_summary2(df, df_name):   
    print(f"Description of the df continued for {df_name}:\n")
    print("-----"*15)
    print("Data type value counts: \n",df.dtypes.value_counts())
    print("\nReturn number of unique elements in the object. \n")
    print(df.select_dtypes('object').apply(pd.Series.nunique, axis = 0))
    
# List the categorical and Numerical features of a DF
def feature_datatypes_groups(df, df_name):
    df_dtypes = df.columns.to_series().groupby(df.dtypes).groups
    print("-----"*15)
    print(f"Categorical and Numerical(int + float) features  of {df_name}.")
    print("-----"*15)
    print()
    for k, v in df_dtypes.items():
        print({k.name: v})
        print("---"*10)
    print("\n \n")    
        
# Null data list and plot.        
def null_data_plot(df, df_name):
    percent = (df.isnull().sum()/df.isnull().count()*100).sort_values(ascending = False).round(2)
    sum_missing = df.isna().sum().sort_values(ascending = False)
    missing_data  = pd.concat([percent, sum_missing], axis=1, keys=['Percent', "Train Missing Count"])
    missing_data=missing_data[missing_data['Percent'] > 0] 
    print("-----"*15)
    print("-----"*15)
    print('\n The Missing Data: \n')
#     display(missing_data)  # display few
    if len(missing_data)==0:
      print("No missing Data")
    else:
      display(HTML(missing_data.to_html()))  # display all the rows
      print("-----"*15)
      if len(df.columns)> 35:
        f,ax =plt.subplots(figsize=(8,15))
      else: 
        f,ax =plt.subplots()
      #plt.xticks(rotation='90')
      #fig=sns.barplot(missing_data.index, missing_data["Percent"],alpha=0.8)
      #plt.xlabel('Features', fontsize=15)
      #plt.ylabel('Percent of missing values', fontsize=15)
      plt.title(f'Percent missing data for {df_name}.', fontsize=10)
      fig=sns.barplot(missing_data["Percent"],missing_data.index ,alpha=0.8)
      plt.xlabel('Percent of missing values', fontsize=10)
      plt.ylabel('Features', fontsize=10)
      return missing_data
# Full consolidation of all the stats function.
def display_stats(df, df_name):
    print("--"*40)
    print(" "*20 + '\033[1m'+ df_name +  '\033[0m' +" "*20)
    print("--"*40)
    stats_summary1(df, df_name)
def display_feature_info(df, df_name):
    stats_summary2(df, df_name)
    feature_datatypes_groups(df, df_name)
    null_data_plot(df, df_name)
(datasets['application_train'].dtypes).unique()
display_stats(datasets['application_train'], 'application_train')
display_feature_info(datasets['application_train'], 'application_train')
datasets["application_train"]['DAYS_EMPLOYED'].describe() 
anom_days_employed = datasets["application_train"][datasets["application_train"]['DAYS_EMPLOYED']==365243]
norm_days_employed = datasets["application_train"][datasets["application_train"]['DAYS_EMPLOYED']!=365243]
print(anom_days_employed.shape)
dr_anom = anom_days_employed['TARGET'].mean()*100
dr_norm = norm_days_employed['TARGET'].mean()*100
print('Default rate (Anomaly): {:.2f}'.format(dr_anom))
print('Default rate (Normal): {:.2f}'.format(dr_norm))
pct_anom_days_employed = (anom_days_employed.shape[0]/datasets["application_train"].shape[0])*100
print(pct_anom_days_employed) 
df_app_train=datasets["application_train"].copy()
df_app_train['DAYS_EMPLOYED_ANOM'] = df_app_train['DAYS_EMPLOYED'] == 365243
df_app_train['DAYS_EMPLOYED'].replace({365243:np.nan}, inplace=True)
plt.hist(df_app_train['DAYS_EMPLOYED'],edgecolor = 'k', bins = 25)
plt.title('DAYS_EMPLOYED'); plt.xlabel('No Of Days as per Dataset'); plt.ylabel('Count'); 
gc.collect()
The bins above histogram shows that the data is not logical and this feature needs to be further investigated for imbalances. Number of days employed would show a steady source of income and could be a useful feature for predicting risk
plt.hist(datasets["application_train"]['OWN_CAR_AGE'],edgecolor = 'k', bins = 25)
plt.title('OWN CAR AGE'); plt.xlabel('No Of Days as per Dataset'); plt.ylabel('Count'); 
We see that those who have cars over 60 years old have a number of applications (i.e., 3339). This could a good area to investigate risk
display_feature_info(datasets['application_train'], 'application_train')
plt.hist(datasets["application_train"]['DAYS_BIRTH']/-365, edgecolor = 'k', bins = 25)
plt.title('Age of Client'); plt.xlabel('Age (years)'); plt.ylabel('Count'); 
To visualize the effect of the age on the target, we will next make a kernel density estimation plot (KDE) colored by the value of the target.
plt.figure(figsize = (10, 8))
# KDE plot of loans that were repaid on time
sns.kdeplot(datasets['application_train'].loc[datasets['application_train']['TARGET'] == 0, 'DAYS_BIRTH'] / 365, label = 'target == 0')
# KDE plot of loans which were not repaid on time
sns.kdeplot(datasets['application_train'].loc[datasets['application_train']['TARGET'] == 1, 'DAYS_BIRTH'] / 365, label = 'target == 1')
plt.legend(loc='upper left')
# set_style("whitegrid")
plt.grid()
# Labeling of plot
plt.xlabel('Age (years)', fontsize=18); 
plt.ylabel('Density', fontsize=16); 
plt.suptitle('Distribution of Ages',fontsize=25); 
The target == 1 curve skews towards the younger end of the range. Let's look at this relationship in another way: average failure to repay loans by age bracket. To make this graph, first we cut the age category into bins of 5 years each. Then, for each bin, we calculate the average value of the target, which tells us the ratio of loans that were not repaid in each age category.
# Age information into a separate dataframe
age_data = datasets['application_train'][['TARGET', 'DAYS_BIRTH']]
age_data['YEARS_BIRTH'] = age_data['DAYS_BIRTH'] / -365
# Bin the age data
age_data['YEARS_BINNED'] = pd.cut(age_data['YEARS_BIRTH'], bins = np.linspace(20, 70, num = 11))
age_data.head(10) 
# Group by the bin and calculate averages
age_groups  = age_data.groupby('YEARS_BINNED').mean()
age_groups 
plt.figure(figsize = (8, 8))
# Graph the age bins and the average of the target as a bar plot
plt.bar(age_groups.index.astype(str), 100 * age_groups['TARGET'])
# Plot labeling
plt.xticks(rotation = 75); plt.xlabel('Age Group (years)'); plt.ylabel('Failure to Repay (%)')
plt.title('Failure to Repay by Age Group'); 
sns.countplot(x='OCCUPATION_TYPE', data=datasets["application_train"], order = datasets["application_train"]['OCCUPATION_TYPE'].value_counts().index);
plt.title('Applicants Occupation');
plt.xticks(rotation=90); 
import pandas as pd
import numpy as np
import seaborn as sns                       #visualisation
import matplotlib.pyplot as plt             #visualisation
%matplotlib inline     
sns.set(color_codes=True) 
def generic_xy_boxplot(xaxisfeature,yaxisfeature,legendcategory,data,log_scale):
  sns.boxplot(xaxisfeature, yaxisfeature, hue = legendcategory, data = data)
  plt.title('Boxplot for '+ xaxisfeature +' with ' + yaxisfeature+' and '+legendcategory,fontsize=10)
  if log_scale:
                plt.yscale('log')
                plt.ylabel(f'{yaxisfeature} (log Scale)')
                plt.tight_layout() 
def box_plot(plots):
  number_of_subplots = len(plots)
  plt.figure(figsize = (20,8))
  sns.set_style('whitegrid')
  for i, ele in enumerate(plots):
        plt.subplot(1, number_of_subplots, i + 1)
        plt.subplots_adjust(wspace=0.25)
        xaxisfeature=ele[0]
        yaxisfeature=ele[1]
        legendcategory=ele[2]
        data=ele[3]
        log_scale=ele[4]
        generic_xy_boxplot(xaxisfeature,yaxisfeature,legendcategory,data,log_scale) 
plots=[['NAME_CONTRACT_TYPE','AMT_CREDIT','CODE_GENDER',datasets['application_train'],False]] 
box_plot(plots) 
Gender does not indicate a major impact . But credit amount for cash loans are significantly high compared to revolving loans.
display_stats(datasets['previous_application'], 'previous_application') 
display_feature_info(datasets['previous_application'], 'previous_application') 
display_stats(datasets['bureau'], 'bureau') 
display_feature_info(datasets['bureau'], 'bureau') 
display_stats(datasets['bureau_balance'], 'bureau_balance')
display_feature_info(datasets['bureau_balance'], 'bureau_balance')
display_stats(datasets['credit_card_balance'], 'credit_card_balance') 
display_feature_info(datasets['credit_card_balance'], 'credit_card_balance') 
display_stats(datasets['installments_payments'], 'installments_payments') 
display_feature_info(datasets['installments_payments'], 'installments_payments') 
display_stats(datasets['POS_CASH_balance'], 'POS_CASH_balance') 
display_feature_info(datasets['POS_CASH_balance'], 'POS_CASH_balance') 
display_stats(datasets['application_test'], 'application_test') 
display_feature_info(datasets['application_test'], 'application_test') 
The top 20 correlated features (positive and negative) for application train datset are listed below.
correlations = datasets["application_train"].corr()['TARGET'].sort_values()
print('Most Positive Correlations:\n', correlations.tail(10))
print('\nMost Negative Correlations:\n', correlations.head(10)) 
num_attribs = ['TARGET', 'AMT_INCOME_TOTAL',  'AMT_CREDIT', 'DAYS_EMPLOYED',
               'DAYS_BIRTH', 'EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3', 'AMT_GOODS_PRICE']
df = datasets["application_train"].copy()
df2 = df[num_attribs]
corr = df2.corr()
corr.style.background_gradient(cmap='PuBu').set_precision(2) 
gc.collect()
Observing Highly correlated features from all input datasets
The distribution of the top correlated features are plotted below
var_neg_corr = correlations.head(10).index.values
numVar = var_neg_corr.shape[0]
plt.figure(figsize=(15,20))
for i,var in enumerate(var_neg_corr):    
    dflt_var = datasets["application_train"].loc[datasets["application_train"]['TARGET']==1,var]
    dflt_non_var = datasets["application_train"].loc[datasets["application_train"]['TARGET']==0,var]
    
    plt.subplot(numVar,4,i+1)
    datasets["application_train"][var].hist()
    plt.title(var, fontsize = 10)
    plt.tight_layout()
plt.show() 
var_pos_corr = correlations.tail(10).index.values
numVar = var_pos_corr.shape[0]
plt.figure(figsize=(15,20))
for i,var in enumerate(var_pos_corr):    
    dflt_var = datasets["application_train"].loc[datasets["application_train"]['TARGET']==1,var]
    dflt_non_var = datasets["application_train"].loc[datasets["application_train"]['TARGET']==0,var]
    
    plt.subplot(numVar,4,i+1)
    datasets["application_train"][var].hist()
    plt.title(var, fontsize = 10)
    plt.tight_layout()
plt.show() 
def correlation_files_target(df_name):
  A = datasets["application_train"].copy()
  B = datasets[df_name].copy()
  correlation_matrix =  pd.concat([A.TARGET, B], axis=1).corr().filter(B.columns).filter(A.columns, axis=0)
  del A
  del B
  return correlation_matrix
df_name = "previous_application"
correlation_matrix = correlation_files_target(df_name)
print(f"Correlation of the {df_name} against the Target is :")
correlation_matrix.T.TARGET.sort_values(ascending= False)
df_name = "bureau"
correlation_matrix = correlation_files_target(df_name)
print(f"Correlation of the {df_name} against the Target is :")
correlation_matrix.T.TARGET.sort_values(ascending= False)
df_name = "bureau_balance"
correlation_matrix = correlation_files_target(df_name)
print(f"Correlation of the {df_name} against the Target is :")
correlation_matrix.T.TARGET.sort_values(ascending= False)
df_name = "credit_card_balance"
correlation_matrix = correlation_files_target(df_name)
print(f"Correlation of the {df_name} against the Target is :")
correlation_matrix.T.TARGET.sort_values(ascending= False)
df_name = "installments_payments"
correlation_matrix = correlation_files_target(df_name)
print(f"Correlation of the {df_name} against the Target is :")
correlation_matrix.T.TARGET.sort_values(ascending= False)
df_name = "POS_CASH_balance"
correlation_matrix = correlation_files_target(df_name)
print(f"Correlation of the {df_name} against the Target is :")
correlation_matrix.T.TARGET.sort_values(ascending= False)
gc.collect()
datasets['application_train']['TARGET'].value_counts()
fig, ax = plt.subplots(figsize =(10, 8))
plt.pie(datasets['application_train']['TARGET'].value_counts(), labels=['No Default', 'Default'],
        autopct='%1.2f%%',
        shadow=True, startangle=90)
my_circle=plt.Circle( (0,0), 0.7, color='white')
p=plt.gcf()
p.gca().add_artist(my_circle)
plt.axis('equal')
plt.suptitle("Distribution of Target",fontsize=30, ha='center')
ax.legend(bbox_to_anchor =(1, 0))
plt.legend(loc="upper left")
print(datasets['application_train'].select_dtypes('object').columns)
# Number of unique classes in each object column
datasets['application_train'].select_dtypes('object').apply(pd.Series.nunique, axis = 0)
# one-hot encoding of categorical variables
app_cat_train= copy.deepcopy(datasets['application_train'])
app_cat_train = pd.get_dummies(app_cat_train)
print(app_cat_train.shape)
print(app_cat_train.dtypes.value_counts())
display(app_cat_train.select_dtypes(include=['uint8']).head())
fig, ax = plt.subplots(figsize =(10, 8))
data = app_cat_train.dtypes.value_counts()
plt.pie(data, autopct='%1.2f%%',labels=['float64','int64','object'],
        shadow=True, startangle=90)
my_circle=plt.Circle( (0,0), 0.7, color='white')
p=plt.gcf()
p.gca().add_artist(my_circle)
plt.axis('equal')
# plt.setup(size = 8, weight ="bold")
# ax.set_title("Customizing pie chart",fontsize=30, ha='center')
plt.suptitle("Distribution of Features dtypes.",fontsize=30, ha='center')
ax.legend(bbox_to_anchor =(1, 0))
plt.legend(loc="upper left")
def cat_features_plot(datasets, df_name):
    df = copy.deepcopy(datasets[df_name])
    df['TARGET'].replace(0, "No Default", inplace=True)
    df['TARGET'].replace(1, "Default", inplace=True)
#     df.select_dtypes('object')
    categorical_col = []
    
    for col in df:
        if df[col].dtype == 'object':
            categorical_col.append(col)
    # print("The numerical olumns are: \n \n ",numerical_col)
    #print("The categorical columns are: \n \n ",categorical_col)
    # categorical_col = categorical_col[0:8]
    #print(int(len(categorical_col)))
    plot_x = int(len(categorical_col)/2)
    fig, ax = plt.subplots(plot_x, 2, figsize=(20, 50))
    #plt.subplots_adjust(left=None, bottom=None, right=None,
                        #top=None, wspace=None, hspace=0.45)
    num = 0
    for i in range(0, 8):
        for j in range(0,2):
            tst = sns.countplot(x=categorical_col[num],
                               data=df, hue='TARGET', ax=ax[i][j])
            tst.set_title(f"Distribution of the {categorical_col[num]}  Variable.")
            tst.set_xticklabels(tst.get_xticklabels(), rotation=90)
            plt.subplots_adjust(left=None, bottom=None, right=None,
                        top=None, wspace=None, hspace=0.45)
            num = num + 1
            plt.tight_layout() 
 cat_features_plot(datasets, "application_train") 
df = copy.deepcopy(datasets["application_train"])
df['DAYS_BIRTH'] = round(abs(df['DAYS_BIRTH'])/365).astype(int)
df['DAYS_EMPLOYED'] = round(abs(df['DAYS_EMPLOYED'])/365).astype(int)
num_attribs = ['TARGET', 'AMT_INCOME_TOTAL',  'AMT_CREDIT', 'DAYS_EMPLOYED','DAYS_BIRTH', 'EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3', 'AMT_GOODS_PRICE']
df2 = df[num_attribs]
df2.fillna(0, inplace=True)
df2.head(5)
# Scatter-plot
df2.fillna(0, inplace=True)
print('Numerical variables - Scatter-Matrix')
grr = pd.plotting.scatter_matrix(df2.loc[:, df2.columns != 'TARGET'], 
                                     c =df['TARGET'],
                                     figsize=(15, 15), marker='.',
                                     hist_kwds={'bins': 10}, s=60, alpha=.2)
 # Pair-plot
df2['TARGET'].replace(0, "No Default", inplace=True)
df2['TARGET'].replace(1, "Default", inplace=True)
print('Numerical variables - Pair-Plot')    
num_sns = sns.pairplot(df2, hue="TARGET", markers=["s", "o"])
    #    num_sns.title("Numerical variables - Pair-Plot")
collecttrash()
Correlation Map of Numerical Variables
var_neg_corr = correlations.head(10).index.values
numVar = var_neg_corr.shape[0]
plt.figure(figsize=(10,40))
for i,var in enumerate(var_neg_corr):    
    dflt_var = datasets["application_train"].loc[datasets["application_train"]['TARGET']==1,var]
    dflt_non_var = datasets["application_train"].loc[datasets["application_train"]['TARGET']==0,var]
    
    plt.subplot(numVar,3,i+1)
    plt.subplots_adjust(wspace=2)
    sns.kdeplot(dflt_var,label='Default')
    sns.kdeplot(dflt_non_var,label='No Default')
    #plt.xlabel(var)
    plt.ylabel('Density')
    plt.title(var, fontsize = 10)
    plt.tight_layout()
plt.show() 
var_pos_corr = correlations.tail(10).index.values
numVar = var_pos_corr.shape[0]
plt.figure(figsize=(10,40))
for i,var in enumerate(var_pos_corr):    
    dflt_var = datasets["application_train"].loc[datasets["application_train"]['TARGET']==1,var]
    dflt_non_var = datasets["application_train"].loc[datasets["application_train"]['TARGET']==0,var]
    if var=='TARGET':
      pass
    else:
      plt.subplot(numVar,3,i+1)
      plt.subplots_adjust(wspace=2)
      sns.kdeplot(dflt_var,label='Default')
      sns.kdeplot(dflt_non_var,label='No Default')
      #plt.xlabel(var)
      plt.ylabel('Density')
      plt.title(var, fontsize = 10)
      plt.tight_layout()
plt.show() 
We plot the KDEs of the most positively (negatively) correlated features with the TARGET. This is to evaluate whether there are any strange distributions between the default and do not default items.
If the distributions for each feature are very different for default and do not default, this is good and we should look out for this. So we can see that EXT_SOURCE_3 has the most different distributions between default and no default.
Overall View of Categorical values in Train & Test
For any categorical variable (dtype == object) with 2 unique categories, we will use label encoding, and for any categorical variable with more than 2 unique categories, we will use one-hot encoding.
datasets['application_train'].select_dtypes('object').apply(pd.Series.nunique, axis = 0) 
datasets['application_test'].select_dtypes('object').apply(pd.Series.nunique, axis = 0) 
import time
import numpy as np
import pandas as pd
pd.set_option('display.max_columns', 100)
import matplotlib.pylab as plt
%matplotlib inline
from matplotlib.pylab import rcParams
import warnings
import seaborn as sns
color = sns.color_palette()
import pickle
from scipy.cluster.hierarchy import dendrogram, linkage
warnings.filterwarnings("ignore")
rcParams['figure.figsize'] = 12, 8
np.random.seed(23)
nb_levels_sr = datasets['application_train'].nunique()
binary_features_lst = nb_levels_sr.loc[nb_levels_sr == 2].index.tolist()
categorical_features_lst = list(set(datasets['application_train'].select_dtypes(["object"]).columns.tolist()) - set(binary_features_lst))
for feature in categorical_features_lst:
    fig, ax = plt.subplots(1, 2, sharex = False, sharey = False, figsize = (20, 10))
    # Plot levels distribution
    if datasets['application_train'][feature].nunique() < 10:
        sns.countplot(x = datasets['application_train'][feature], ax = ax[0], order = datasets['application_train'][feature].value_counts().index.tolist())
    else:
        sns.countplot(y = datasets['application_train'][feature], ax = ax[0], order = datasets['application_train'][feature].value_counts().index.tolist())
    ax[0].set_title("Count plot of each level of the feature: " + feature)
    # Plot target distribution among levels
    table_df = pd.crosstab(datasets['application_train']["TARGET"], datasets['application_train'][feature], normalize = True)
    table_df = table_df.div(table_df.sum(axis = 0), axis = 1)
    table_df = pd.crosstab(datasets['application_train']["TARGET"], datasets['application_train'][feature], normalize = True)
    table_df = table_df.div(table_df.sum(axis = 0), axis = 1)
    table_df = table_df.transpose().reset_index()
    order_lst = table_df.sort_values(by = 1)[feature].tolist()
    table_df = table_df.melt(id_vars = [feature])
    if datasets['application_train'][feature].nunique() < 10:
        ax2 = sns.barplot(x = table_df[feature], y = table_df["value"] * 100, hue = table_df["TARGET"], ax = ax[1], order = order_lst)
        for p in ax2.patches:
            height = p.get_height()
            ax2.text(p.get_x() + p.get_width() / 2., height + 1, "{:1.2f}".format(height), ha = "center")
    else:
        ax2 = sns.barplot(x = table_df["value"] * 100, y = table_df[feature], hue = table_df["TARGET"], ax = ax[1], order = order_lst)
        for p in ax2.patches:
            width = p.get_width()
            ax2.text(width + 3.1, p.get_y() + p.get_height() / 2. + 0.35, "{:1.2f}".format(width), ha = "center")
    ax[1].set_title("Target distribution among " +  feature + " levels")
    ax[1].set_ylabel("Percentage") 
numerical_features_lst = list(set(datasets['application_train'].columns.tolist()) - set(categorical_features_lst) - set(binary_features_lst))
binary_features_lst = list(set(binary_features_lst) - {"TARGET"})
	# generate the linkage matrix
numerical_features_df = datasets['application_train'][numerical_features_lst + ["TARGET"]]
numerical_features_df.fillna(-1, inplace = True) # We need to impute missing values before creating the dendrogram
numerical_features_df = numerical_features_df.transpose()
Z = linkage(numerical_features_df, "ward")
plt.figure(figsize = (20, 15))
plt.title("Hierarchical Clustering Dendrogram")
plt.xlabel("feature")
plt.ylabel("distance")
dend = dendrogram(
    Z,
    leaf_rotation = 90.,  # rotates the x axis labels
    leaf_font_size = 8.,  # font size for the x axis labels
    labels = numerical_features_df.index.tolist()
) 
Imbalanced data
train_labels = datasets['application_train']['TARGET']
# Align the training and testing data, keep only columns present in both dataframes
app_train, app_test = datasets['application_train'].align(datasets['application_test'], join = 'inner', axis = 1)
# Add the target back in
app_train['TARGET'] = train_labels
print('Training Features shape: ', datasets['application_train'].shape)
print('Testing Features shape: ', datasets['application_test'].shape) 
Data preparation accounts for about 80% of the work of data scientists
The feature engineering we performed can be classified into - sub-parts which include
An essential part of any feature engineering process is the domain knowledge based features which will help improve the accuracy of a model. The first step was to identify these for each dataset. Some of the new custom features included were credit card amount balance after payment based on due amount,  application amount average , the credit average, Available credit as a percentage of income , Annuity as a percentage of income , Annuity as a percentage of available credit  
The next step involved  was to identify the numerical features and aggregate them to mean, min & max values. An attempt was made to apply label encoding for unique values more than 5 at the engineering phase. However,  a design decision was made to apply OHE at the pipeline level for specific highly correlated fields on the final merged dataset to optimize the amount of code to handle the same functionality. 
Extensive feature engineering was conducted by attempting multiple modelling approaches with primary, secondary and tertiary tables prior to finalizing an optimized approach with the least amount of memory usage. Attempt one involved creating engineered and aggregated features for Tier 3 tables: bureau_balance, credit_card_installment, installment_payments  and point_of_sale_systems_cash_balance. This was then merged with Tier 2 tables i.e  prev_application_balance with  credit_card_installment, installment_payments and point_of_sale_systems_cash_balance & bureau with bureau_balance, along with aggregated features. A flat view combining all of the above tables were merged along with the primary dataset application train. This resulted in a high number of redundant features occupying large memory.
Attempt 2 involved creating custom and aggregated features for tier 3 tables and merging with tier 2 tables based on the primary key provided, which was later “extended” to the tier1 tables based on the additional aggregated columns. This approach created less duplicates, was optimized and occupied less memory by using a garbage collector after each merge. 
In Attempt 3, the merged dataframe in the previous attempt were merged with the polynomial features with a degree of 4. 
A final merge of the Tier3, Tier2 and Tier1 datasets were  used to create a train dataframe. Special care was taken to ensure that there are no columns which have more than 50% of the data missing. 
Engineering the features and including them in the model with small splits helped test the model but provided low accuracy. However, using these merged features along with reasonable splits during the training face did provide a better accuracy and less possibility of overfitting especially for Random forest and XGBoost. 
Future work and experiments include Label encoding for the unique categorical values in all categorical fields and not select few. Attempting PCA or custom function to handle multicollinearity in the pipeline and eliminate features of low importance and verify its impacts on accuracy.
The steps involved in Feature engineering were: Separate the files into Tiers.
The pipeline for tier 3 files,
We include the manually engineered features for each file if any (InstallmentPaymentFeaturesAdder). Replace the missing values with the most frequent values for the categorical feature columns and perform One Hot Encoding for all the categorical columns (getDummies). Create aggregated features using the(FeaturesAggregator). Removed the features which have greater than 60% of null values(MissingFeatureRemover). Removed features with multicollinearity, dropped features which have a correlation value of greater than the threshold of 0.9(CollinearFeatureRemover).
The aggregated features with OHE are merged with the tier 2 files, those are the previous_application and the bureau datasets.
The tier 3 pipelines are repeated on the tier 2 files as well.
The tier 2 aggregated features with OHE are merged with application_train /test datasets.
The final pipeline on the application_train is as mentioned below. 
The final features in the application_train / test dataset is:
# Create installment features
class InstallmentPaymentFeaturesAdder(BaseEstimator, TransformerMixin):
    def __init__(self):
        self.l = []
    
    def fit(self, X, y=None):
        return self
    
    def transform(self, X, y=None):
        
        # new feature creation from the Installment Payment file
        X['PAY_IS_LATE'] = X['DAYS_INSTALMENT'] - X['DAYS_ENTRY_PAYMENT']
        X['AMT_MISSED'] = X['AMT_INSTALMENT'] - X['AMT_PAYMENT']
        
        return X
# Create installment features
class CCBalFeaturesAdder(BaseEstimator, TransformerMixin):
    def __init__(self):
        self.l = []
        
    def fit(self, X, y=None):
        return self
    
    def transform(self, X, y=None):
        
        # new feature creation from the Credit Card file
        X['DPD_MISSED'] = X['SK_DPD'] - X['SK_DPD_DEF']
        X['CREDIT_UTILIZED'] = X['AMT_CREDIT_LIMIT_ACTUAL'] - X['AMT_DRAWINGS_CURRENT']
        X['MIN_CREDIT_AMTMISS'] = X['AMT_INST_MIN_REGULARITY'] - X['AMT_PAYMENT_CURRENT']
        # Difference between the monthly amount paid - the expected monthly amount
        X['PAYMENT_DIFF_CURR_PAY'] = X['AMT_PAYMENT_TOTAL_CURRENT'] - X['AMT_PAYMENT_CURRENT']
        X['PAYMENT_DIFF_MIN_PAY'] = X['AMT_PAYMENT_TOTAL_CURRENT'] - X['AMT_INST_MIN_REGULARITY']
        # Difference between the monthly amount paid - the minimum monthly amount
        return X
# Create previous application features
class PrevAppFeaturesAdder(BaseEstimator, TransformerMixin):
    def __init__(self):
        self.l = []
        
    def fit(self, X, y=None):
        return self
    
    def transform(self, X, y=None):
        # new feature in the Previous Application file
        X['INTEREST'] = X['CNT_PAYMENT'] * X['AMT_ANNUITY'] - X['AMT_CREDIT']
        X['INTEREST_PER_CREDIT'] = X['INTEREST'] / X['AMT_CREDIT']
        X['CREDIT_SUCCESS'] = X['AMT_APPLICATION'] - X['AMT_CREDIT']
        X['INTEREST_RT'] = 2 * 12 * X['INTEREST'] / (X['AMT_CREDIT'] * (X['CNT_PAYMENT'] + 1))
        return X
# Create application features
class ApplicationTrainTestFeaturesAdder(BaseEstimator, TransformerMixin):
    def __init__(self):
        self.l = []
        
    def fit(self, X, y=None):
        return self
    
    def transform(self, X, y=None):
        
        # credit to income ratio
        X['CREDIT_INCOME_RATIO'] = X['AMT_CREDIT'] / X['AMT_INCOME_TOTAL']
        
        # annuity to income ratio
        X['ANNUITY_INCOME_RATIO'] = X['AMT_ANNUITY'] / X['AMT_INCOME_TOTAL']
        
        # length of the credit term
        X['CREDIT_LENGTH'] = X['AMT_ANNUITY'] / X['AMT_CREDIT']
        
        # what is income to age ratio
        X['INCOME_AGE_RATIO'] = X['AMT_INCOME_TOTAL'] / X['DAYS_BIRTH']
        
        # what is credit to age ratio
        X['CREDIT_AGE_RATIO'] = X['AMT_CREDIT'] / X['DAYS_BIRTH']
        
        # what percent of applicants life have they been working at recent company
        X['DAYS_EMPLOYED_PERCENT'] = X['DAYS_EMPLOYED'] / X['DAYS_BIRTH']
        
        # add liability feature code
        conditions_temp = [
            (X['FLAG_OWN_CAR'] == 'Y') & (X['FLAG_OWN_REALTY'] == 'Y'),
            (X['FLAG_OWN_CAR'] == 'N') & (X['FLAG_OWN_REALTY'] == 'Y'),
            (X['FLAG_OWN_CAR'] == 'Y') & (X['FLAG_OWN_REALTY'] == 'N'),
            (X['FLAG_OWN_CAR'] == 'N') & (X['FLAG_OWN_REALTY'] == 'N')]
        
        values_temp = ['0', '1', '2', '3']
        
        X['HAS_LIBAILITY'] = np.select(conditions_temp, values_temp)
        X['DAYS_EMPLOYED_PCT'] = X['DAYS_EMPLOYED'] / X['DAYS_BIRTH']
        X['CREDIT_INCOME_PCT'] = X['AMT_CREDIT'] / X['AMT_INCOME_TOTAL']
        X['ANNUITY_INCOME_PCT'] = X['AMT_ANNUITY'] / X['AMT_INCOME_TOTAL']
        X['CREDIT_TERM'] = X['AMT_ANNUITY'] / X['AMT_CREDIT']
        
        return X
# Create aggregate features (via pipeline)
class getDummies(BaseEstimator, TransformerMixin):
    def __init__(self, columns=None): # no *args or **kargs
        self.columns = columns
    def fit(self, X, y=None):
        return self
    
    def transform(self, X, y=None):      
        X.fillna(X.select_dtypes(include = 'object').mode().iloc[0], inplace = True)
        result = pd.get_dummies(X, columns = self.columns)
        #('imputer', SimpleImputer(strategy='most_frequent')),
        return result
Functions required to perform feature aggregations are listed below
# function to get the numerical features
def get_numattribs(dataDF):
  num_attribs=(dataDF.select_dtypes(include=['int64', 'float64']).columns.tolist())
  print()
  print('Numerical attributes for',ds_name,' : ',num_attribs)
  print()
  return num_attribs
class FeaturesAggregator(BaseEstimator, TransformerMixin):
    def __init__(self, file_name=None, features=None, primary_id = None): 
        self.prefix = file_name
        self.features = features
        self.numeric_stats = ["min", "max", "mean", "count", "sum"]
        self.categorical_stats = ["mean", "count", "sum"]
        self.primary_id = primary_id
        self.agg_op_features = {}
        self.agg_features_names = [self.primary_id]
    
    def fit(self, X, y=None):
        return self
    
    def transform(self, X, y=None):
        numeric_cols = list(X.columns[X.columns.isin(self.features)])
        numeric_cols = [num for num in numeric_cols if num not in ['SK_ID_CURR','SK_ID_PREV','SK_ID_BUREAU']]
        categorical_cols = list(X.columns[~X.columns.isin(self.features)])
       
        for f in numeric_cols:
            self.agg_op_features[f] = self.numeric_stats
            self.agg_features_names = self.agg_features_names + [self.prefix + "_" + f + "_" + s for s in self.numeric_stats]
            
        for f in categorical_cols:
            self.agg_op_features[f] = self.categorical_stats
            self.agg_features_names = self.agg_features_names + [self.prefix + "_" + f + "_" + s for s in self.categorical_stats]       
       
        result = X.groupby(self.primary_id).agg(self.agg_op_features)
        result.columns = result.columns.droplevel()
        result = result.reset_index(level=[self.primary_id])
        result.columns = self.agg_features_names
        return result
class engineer_features(BaseEstimator, TransformerMixin):
    def __init__(self, features=None):
        self
    
    def fit(self, X, y=None):
        return self
    
    def transform(self, X, y=None):
# FROM APPLICATION
        # ADD INCOME CREDIT PERCENTAGE
        X['ef_INCOME_CREDIT_PERCENT'] = (
            X.AMT_INCOME_TOTAL / X.AMT_CREDIT).replace(np.inf, 0)
    
        # ADD INCOME PER FAMILY MEMBER
        X['ef_FAM_MEMBER_INCOME'] = (
            X.AMT_INCOME_TOTAL / X.CNT_FAM_MEMBERS).replace(np.inf, 0)
    
        # ADD ANNUITY AS PERCENTAGE OF ANNUAL INCOME
        X['ef_ANN_INCOME_PERCENT'] = (
            X.AMT_ANNUITY / X.AMT_INCOME_TOTAL).replace(np.inf, 0)
        return X
# Creates the following date features
# But could do so much more with these features
#    E.g., 
#      extract the domain address of the homepage and OneHotEncode it
# 
# ['release_month','release_day','release_year', 'release_dayofweek','release_quarter']
class prep_OCCUPATION_TYPE(BaseEstimator, TransformerMixin):
    def __init__(self, features="OCCUPATION_TYPE"): # no *args or **kargs
        self.features = features
    def fit(self, X, y=None):
        return self  # nothing else to do
    def transform(self, X):
        df = pd.DataFrame(X, columns=self.features)
        #from IPython.core.debugger import Pdb as pdb;    pdb().set_trace() #breakpoint; dont forget to quit         
        df['OCCUPATION_TYPE'] = df['OCCUPATION_TYPE'].apply(lambda x: 1. if x in ['Core Staff', 'Accountants', 'Managers', 'Sales Staff', 'Medicine Staff', 'High Skill Tech Staff', 'Realty Agents', 'IT Staff', 'HR Staff'] else 0.)   
        #df.drop(self.features, axis=1, inplace=True)
        return np.array(df.values) 
# Remove missing columns
class MissingFeatureRemover(BaseEstimator, TransformerMixin):
    def __init__(self, threshold = .6):
        self.threshold = threshold
        
    def fit(self, X, y=None):
        
        # get the percent of missingness in features
        percent = (X.isnull().sum()/X.isnull().count()).sort_values(ascending = False)
        
        # turn into a data frame
        missing_application_train_data  = pd.DataFrame(percent, columns=['Percent'])
        
        # get the columns with missingness exceeding the threshold
        self.columns_to_drop = list(missing_application_train_data.index[missing_application_train_data['Percent'] > self.threshold])
        
        return self
    
    def transform(self, X, y=None):
        
        # drop the columns with missingness over the threshold
        X = X.drop(columns = self.columns_to_drop, axis=1)
        
        return X
# Remove features with high colli
class CollinearFeatureRemover(BaseEstimator, TransformerMixin):
    def __init__(self, threshold = .9):
        self.threshold = threshold
        
    def fit(self, X, y=None):
        
        # get the correlation matrix for the entire dataset after one hot encoding features
#        correlation_matrix = X.head(1000).corr().abs()
        correlation_matrix = X.sample(10000).corr().abs()
        # get only the lower portion of collinearity matrix
        lower = correlation_matrix.where(np.tril(np.ones(correlation_matrix.shape), k=-1).astype(np.bool))
        
        # get the fields with correlation above threshold
        self.columns_to_drop = [index for index in lower.index if any(lower.loc[index] > self.threshold)]
        
        return self
    
    def transform(self, X, y=None):
        
        # drop the columns with collinearity over the threshold
        X = X.drop(columns = self.columns_to_drop, axis=1)
        
        return X
# Remove features with near zero variance
class NearZeroVarianceFeatureRemover(BaseEstimator, TransformerMixin):
    def __init__(self, threshold = 0):
        self.threshold = threshold
        
    def fit(self, X, y=None):
        
        # get the fields with correlation above threshold
        self.columns_to_drop = [col for col in X.select_dtypes([np.number]).columns if np.nanvar(X[col]) <= self.threshold]
        
        return self
    
    def transform(self, X, y=None):
        
        # drop the columns with collinearity over the threshold
        X = X.drop(columns = self.columns_to_drop, axis=1)
        
        return X
gc.collect()
appsTrainDF = datasets['application_train'].copy()
X_kaggle_test = datasets['application_test'].copy()
prevAppsDF = datasets["previous_application"].copy() 
bureauDF = datasets["bureau"].copy()
bureaubalDF = datasets['bureau_balance'].copy()
ccbalDF = datasets["credit_card_balance"].copy()
installmentspaymentsDF = datasets["installments_payments"].copy()
pos_cash_bal_DF = datasets["POS_CASH_balance"].copy() 
The tertiary datasets or tables refer to bureau_balance, POS_CASH_balance, instalments_payments, credit_card_balance
tertiaty_datasets=['bureau_balance','credit_card_balance','installments_payments','POS_CASH_balance']
Feature aggregation for the tertiary datasets
primary_id1 = "SK_ID_PREV"
primary_id2 = "SK_ID_BUREAU"
posBal_features = pos_cash_bal_DF.columns.to_list()
instalPay_features = installmentspaymentsDF.columns.to_list()
instalPay_features.extend(['PAY_IS_LATE', 'AMT_MISSED'])
ccBal_features = ccbalDF.columns.to_list()
ccBal_features.extend(['DPD_MISSED', 'CREDIT_UTILIZED', 'MIN_CREDIT_AMTMISS', 
                       'PAYMENT_DIFF_CURR_PAY','PAYMENT_DIFF_MIN_PAY']) 
  
burBal_features = bureaubalDF.columns.to_list()
fn_POS_CASH ='POS_CASH_balance'
fn_ins_pay = 'installments_payments'
fn_ccbal = 'credit_card_balance'
fn_bbal ='bureau_balance'
Define Pipeline to create aggregator and OHE features.
# set pos cash pipeline
pos_cash_pipe = Pipeline([
    #('imputer', SimpleImputer(strategy='most_frequent')),
    ('ohe', getDummies()),
    ('aggregator', FeaturesAggregator(fn_POS_CASH,posBal_features, primary_id1)),
    ('missing data remover', MissingFeatureRemover()),
    ('collinearity remover', CollinearFeatureRemover())
])
# set installments_payments pipeline
install_pay_pipe = Pipeline([
    ('install_pay_new_features', InstallmentPaymentFeaturesAdder()),  
    #('imputer', SimpleImputer(strategy='most_frequent')),                       
    ('ohe', getDummies()),
    ('aggregator', FeaturesAggregator(fn_ins_pay,instalPay_features, primary_id1)),
    ('missing data remover', MissingFeatureRemover()),
    ('collinearity remover', CollinearFeatureRemover())
])
# set credit_card_balance pipeline
cc_bal_pipe = Pipeline([
    ('install_pay_new_features', CCBalFeaturesAdder()), 
    #('imputer', SimpleImputer(strategy='most_frequent')),                        
    ('ohe', getDummies()),
    ('aggregator', FeaturesAggregator(fn_ccbal,ccBal_features, primary_id1)),
    ('missing data remover', MissingFeatureRemover()),
    ('collinearity remover', CollinearFeatureRemover())
])
# set bureau_balance pipeline
bureau_bal_pipe = Pipeline([
    #('imputer', SimpleImputer(strategy='most_frequent')),
    ('ohe', getDummies()),
    ('aggregator', FeaturesAggregator(fn_bbal, burBal_features, primary_id2)),
    ('missing data remover', MissingFeatureRemover()),
    ('collinearity remover', CollinearFeatureRemover())
])
pos_cash_bal_aggregated = pos_cash_pipe.fit_transform(pos_cash_bal_DF)
del pos_cash_bal_DF
installments_pmnts_aggregated = install_pay_pipe.fit_transform(installmentspaymentsDF)
del installmentspaymentsDF
ccblance_aggregated = cc_bal_pipe.fit_transform(ccbalDF)
del ccbalDF  
bureaubal_aggregated = bureau_bal_pipe.fit_transform(bureaubalDF)
del bureaubalDF
gc.collect()
print(pos_cash_bal_aggregated.shape)
print(ccblance_aggregated.shape)
print(installments_pmnts_aggregated.shape)
print(bureaubal_aggregated.shape)
Merging the aggregated features for pos_cash_bal , installments_pmnts , credit card balance with Previous application
prevApps_ThirdTierMerge = True
posBal_join_feature = 'SK_ID_PREV'
instalPay_join_feature = 'SK_ID_PREV'
ccBal_join_feature = 'SK_ID_PREV'
burBal_join_feature = 'SK_ID_BUREAU'
prevApps_join_feature = 'SK_ID_CURR'
bureau_join_feature = 'SK_ID_CURR'
if prevApps_ThirdTierMerge:
  # Merge Datasets
  prevAppsDF = prevAppsDF.merge(pos_cash_bal_aggregated, how='left', on=posBal_join_feature)
  prevAppsDF = prevAppsDF.merge(installments_pmnts_aggregated, how='left', on=instalPay_join_feature)
  prevAppsDF = prevAppsDF.merge(ccblance_aggregated, how='left', on=ccBal_join_feature)
Merging the aggregated features the dataset Bureau Balance with Bureau as per the data model.
bureau_ThirdTierMerge = True
if bureau_ThirdTierMerge:
  bureauDF = bureauDF.merge(bureaubal_aggregated, how='left', on=burBal_join_feature)
print(prevAppsDF.shape)
print(bureauDF.shape)
gc.collect()
primary_id1 = "SK_ID_CURR"
fn_bureau = 'bureau'
fn_prevapps = 'previous_application'
fn_appsTrain = 'application_train'
fn_appsTest = 'application_test'
# dataframe names for reference
# appsTrainDF 
# appsTestDF
# prevAppsDF
# bureauDF 
Define the second tier pipeline
# get column names
prevApps_features = prevAppsDF.columns.to_list()
prevApps_features.extend(['INTEREST', 'INTEREST_PER_CREDIT', 'CREDIT_SUCCESS', 'INTEREST_RT'])
bureau_features = bureauDF.columns.to_list()
# set previous_application pipeline
prev_app_pipe = Pipeline([
    ('prev_app_feature_adder', PrevAppFeaturesAdder()), 
    #('imputer', SimpleImputer(strategy='most_frequent')),                        
    ('ohe', getDummies()),
    ('aggregator', FeaturesAggregator(fn_prevapps,prevApps_features, primary_id1)),
    ('missing data remover', MissingFeatureRemover()),
    ('collinearity remover', CollinearFeatureRemover())
])
# set bureau pipeline
bureau_pipe = Pipeline([  
    #('imputer', SimpleImputer(strategy='most_frequent')),                      
    ('ohe', getDummies()),
    ('aggregator', FeaturesAggregator(fn_bureau,bureau_features, primary_id1)),
    ('missing data remover', MissingFeatureRemover()),
    ('collinearity remover', CollinearFeatureRemover())
])
prevApps_aggregated = prev_app_pipe.fit_transform(prevAppsDF)
bureau_aggregated = bureau_pipe.fit_transform(bureauDF)
del bureauDF
del prevAppsDF
gc.collect()
print(prevApps_aggregated.shape)
print(bureau_aggregated.shape)
Prior to merging with the Primary data, we will be dropping columns with more than 50% missing values because they are not reliable parameters.
prevApps_join_feature = 'SK_ID_CURR'
bureau_join_feature = 'SK_ID_CURR'
merge_all_data = True
if merge_all_data:
# 1. Join/Merge in prevApps Data
    # Merge all the features with Application_train
    appsTrainDF = appsTrainDF.merge(prevApps_aggregated, how = 'left', on = prevApps_join_feature)
    appsTrainDF = appsTrainDF.merge(bureau_aggregated, how = 'left', on = bureau_join_feature)
    # Merge all the features with Application_train
    X_kaggle_test = X_kaggle_test.merge(prevApps_aggregated, how = 'left', on = prevApps_join_feature)
    X_kaggle_test = X_kaggle_test.merge(bureau_aggregated, how = 'left', on = bureau_join_feature)
print(appsTrainDF.shape)
print(X_kaggle_test.shape)
gc.collect()
def get_cat_attribs():
  cat_cols = []
  cat_cols=list(appsTrainDF.select_dtypes(include=['object']).columns)
  return cat_cols
cat_attribs = get_cat_attribs()
over_5_unique = ([])
for att in cat_attribs:
  if (len(appsTrainDF[att].unique()) > 5):
    over_5_unique.append(att)
print(f'{len(over_5_unique)} cat attributes with more than 5 unique values')
for att in over_5_unique:
  print(f'{att}:')
  column_total = appsTrainDF[att].shape[0]
  for v in appsTrainDF[att].unique():
    print(f"Rows for {v}: {sum(appsTrainDF[att] == v)} - {round(100 * (sum(appsTrainDF[att] == v) / column_total))}%")
 appsTrainDF['NAME_TYPE_SUITE'] = appsTrainDF['NAME_TYPE_SUITE'].replace({
                       'Family' : 'other',
                       'Spouse, partner' : 'other',
                       'Children' : 'other',
                       'Other_A' : 'other',
                       'Other_B' : 'other',
                       'Group of people' : 'other',})
 appsTrainDF['NAME_INCOME_TYPE'] = appsTrainDF['NAME_INCOME_TYPE'].replace({
                       'Unemployed' : 'other',
                       'Student' : 'other',
                       'Businessman' : 'other',
                       'Maternity leave' : 'other',})
 appsTrainDF['NAME_FAMILY_STATUS'] = appsTrainDF['NAME_FAMILY_STATUS'].replace({
                       'Single / not married' : 'Not Married',
                       'Married' : 'Married',
                       'Civil marriage' : 'Married',
                       'Widow' : 'Not Married',
                       'Separated' : 'Not Married',
                       'Unknown' : 'Not Married',})
 appsTrainDF['NAME_HOUSING_TYPE'] = appsTrainDF['NAME_HOUSING_TYPE'].replace({
                       'House / apartment' : 'House / apartment',
                       'Rented apartment' : 'other',
                       'With parents' : 'other',
                       'Municipal apartment' : 'other',
                       'Office apartment' : 'other',
                       'Co-op apartment' : 'other',})
 appsTrainDF['OCCUPATION_TYPE'] = appsTrainDF['OCCUPATION_TYPE'].replace({
                       'Laborers' : 'Service Industry',
                       'Drivers' : 'Service Industry',
                       'Cleaning staff' : 'Service Industry',
                       'Cooking staff' : 'Service Industry',
                       'Private service staff' : 'Service Industry',
                       'Security staff' : 'Service Industry',
                       'Waiters/barmen staff' : 'Service Industry',
                       'Low-skill Laborers' : 'Service Industry',
                       'Core staff' : 'Office',
                       'Accountants' : 'Office',
                       'Managers' : 'Office',
                       'Sales staff' : 'Office',
                       'Medicine staff' : 'Office',
                       'High skill tech staff' : 'Office',
                       'Realty agents' : 'Office',
                       'Secretaries' : 'Office',
                       'IT staff' : 'Office',
                       'HR staff' : 'Office',})
 appsTrainDF['WEEKDAY_APPR_PROCESS_START'] = appsTrainDF['WEEKDAY_APPR_PROCESS_START'].replace({
                       'SUNDAY' : 'Weekend',
                       'MONDAY' : 'Weekday',
                       'TUESDAY' : 'Weekday',
                       'WEDNESDAY' : 'Weekday',
                       'THURSDAY' : 'Weekday',
                       'FRIDAY' : 'Weekday',
                       'SATURDAY' : 'Weekend',})
 appsTrainDF['ORGANIZATION_TYPE'] = appsTrainDF['ORGANIZATION_TYPE'].replace({
                         'Advertising' : 'Business',
                         'Agriculture' : 'Industrial',
                         'Bank' : 'Business',
                         'Business Entity Type 1' : 'Business',
                         'Business Entity Type 2' : 'Business',
                         'Business Entity Type 3' : 'Business',
                         'Cleaning' : 'Service',
                         'Construction' : 'Industrial',
                         'Culture' : 'Other',
                         'Electricity' : 'Industrial',
                         'Emergency' : 'Government',
                         'Government' : 'Government',
                         'Hotel' : 'Service',
                         'Housing' : 'Other',
                         'Industry: type 1' : 'Industrial',
                         'Industry: type 10' : 'Industrial',
                         'Industry: type 11' : 'Industrial',
                         'Industry: type 12' : 'Industrial',
                         'Industry: type 13' : 'Industrial',
                         'Industry: type 2' : 'Industrial',
                         'Industry: type 3' : 'Industrial',
                         'Industry: type 4' : 'Industrial',
                         'Industry: type 5' : 'Industrial',
                         'Industry: type 6' : 'Industrial',
                         'Industry: type 7' : 'Industrial',
                         'Industry: type 8' : 'Industrial',
                         'Industry: type 9' : 'Industrial',
                         'Insurance' : 'Business',
                         'Kindergarten' : 'Government',
                         'Legal Services' : 'Business',
                         'Medicine' : 'Government',
                         'Military' : 'Government',
                         'Mobile' : 'Other',
                         'Other' : 'Other',
                         'Police' : 'Government',
                         'Postal' : 'Government',
                         'Realtor' : 'Business',
                         'Religion' : 'Other',
                         'Restaurant' : 'Government',
                         'School' : 'Government',
                         'Security' : 'Other',
                         'Security Ministries' : 'Other',
                         'Self-employed' : 'Other',
                         'Services' : 'Service',
                         'Telecom' : 'Business',
                         'Trade: type 1' : 'Trade',
                         'Trade: type 2' : 'Trade',
                         'Trade: type 3' : 'Trade',
                         'Trade: type 4' : 'Trade',
                         'Trade: type 5' : 'Trade',
                         'Trade: type 6' : 'Trade',
                         'Trade: type 7' : 'Trade',
                         'Transport: type 1' : 'Service',
                         'Transport: type 2' : 'Service',
                         'Transport: type 3' : 'Service',
                         'Transport: type 4' : 'Service',
                         'University' : 'Government',
                         'XNA' : 'XNA'})
 X_kaggle_test['NAME_TYPE_SUITE'] = X_kaggle_test['NAME_TYPE_SUITE'].replace({
                       'Family' : 'other',
                       'Spouse, partner' : 'other',
                       'Children' : 'other',
                       'Other_A' : 'other',
                       'Other_B' : 'other',
                       'Group of people' : 'other',})
 X_kaggle_test['NAME_INCOME_TYPE'] = X_kaggle_test['NAME_INCOME_TYPE'].replace({
                       'Unemployed' : 'other',
                       'Student' : 'other',
                       'Businessman' : 'other',
                       'Maternity leave' : 'other',})
 X_kaggle_test['NAME_FAMILY_STATUS'] = X_kaggle_test['NAME_FAMILY_STATUS'].replace({
                       'Single / not married' : 'Not Married',
                       'Married' : 'Married',
                       'Civil marriage' : 'Married',
                       'Widow' : 'Not Married',
                       'Separated' : 'Not Married',
                       'Unknown' : 'Not Married',})
 X_kaggle_test['NAME_HOUSING_TYPE'] = X_kaggle_test['NAME_HOUSING_TYPE'].replace({
                       'House / apartment' : 'House / apartment',
                       'Rented apartment' : 'other',
                       'With parents' : 'other',
                       'Municipal apartment' : 'other',
                       'Office apartment' : 'other',
                       'Co-op apartment' : 'other',})
 X_kaggle_test['OCCUPATION_TYPE'] = X_kaggle_test['OCCUPATION_TYPE'].replace({
                       'Laborers' : 'Service Industry',
                       'Drivers' : 'Service Industry',
                       'Cleaning staff' : 'Service Industry',
                       'Cooking staff' : 'Service Industry',
                       'Private service staff' : 'Service Industry',
                       'Security staff' : 'Service Industry',
                       'Waiters/barmen staff' : 'Service Industry',
                       'Low-skill Laborers' : 'Service Industry',
                       'Core staff' : 'Office',
                       'Accountants' : 'Office',
                       'Managers' : 'Office',
                       'Sales staff' : 'Office',
                       'Medicine staff' : 'Office',
                       'High skill tech staff' : 'Office',
                       'Realty agents' : 'Office',
                       'Secretaries' : 'Office',
                       'IT staff' : 'Office',
                       'HR staff' : 'Office',})
 X_kaggle_test['WEEKDAY_APPR_PROCESS_START'] = X_kaggle_test['WEEKDAY_APPR_PROCESS_START'].replace({
                       'SUNDAY' : 'Weekend',
                       'MONDAY' : 'Weekday',
                       'TUESDAY' : 'Weekday',
                       'WEDNESDAY' : 'Weekday',
                       'THURSDAY' : 'Weekday',
                       'FRIDAY' : 'Weekday',
                       'SATURDAY' : 'Weekend',})
 X_kaggle_test['ORGANIZATION_TYPE'] = X_kaggle_test['ORGANIZATION_TYPE'].replace({
                         'Advertising' : 'Business',
                         'Agriculture' : 'Industrial',
                         'Bank' : 'Business',
                         'Business Entity Type 1' : 'Business',
                         'Business Entity Type 2' : 'Business',
                         'Business Entity Type 3' : 'Business',
                         'Cleaning' : 'Service',
                         'Construction' : 'Industrial',
                         'Culture' : 'Other',
                         'Electricity' : 'Industrial',
                         'Emergency' : 'Government',
                         'Government' : 'Government',
                         'Hotel' : 'Service',
                         'Housing' : 'Other',
                         'Industry: type 1' : 'Industrial',
                         'Industry: type 10' : 'Industrial',
                         'Industry: type 11' : 'Industrial',
                         'Industry: type 12' : 'Industrial',
                         'Industry: type 13' : 'Industrial',
                         'Industry: type 2' : 'Industrial',
                         'Industry: type 3' : 'Industrial',
                         'Industry: type 4' : 'Industrial',
                         'Industry: type 5' : 'Industrial',
                         'Industry: type 6' : 'Industrial',
                         'Industry: type 7' : 'Industrial',
                         'Industry: type 8' : 'Industrial',
                         'Industry: type 9' : 'Industrial',
                         'Insurance' : 'Business',
                         'Kindergarten' : 'Government',
                         'Legal Services' : 'Business',
                         'Medicine' : 'Government',
                         'Military' : 'Government',
                         'Mobile' : 'Other',
                         'Other' : 'Other',
                         'Police' : 'Government',
                         'Postal' : 'Government',
                         'Realtor' : 'Business',
                         'Religion' : 'Other',
                         'Restaurant' : 'Government',
                         'School' : 'Government',
                         'Security' : 'Other',
                         'Security Ministries' : 'Other',
                         'Self-employed' : 'Other',
                         'Services' : 'Service',
                         'Telecom' : 'Business',
                         'Trade: type 1' : 'Trade',
                         'Trade: type 2' : 'Trade',
                         'Trade: type 3' : 'Trade',
                         'Trade: type 4' : 'Trade',
                         'Trade: type 5' : 'Trade',
                         'Trade: type 6' : 'Trade',
                         'Trade: type 7' : 'Trade',
                         'Transport: type 1' : 'Service',
                         'Transport: type 2' : 'Service',
                         'Transport: type 3' : 'Service',
                         'Transport: type 4' : 'Service',
                         'University' : 'Government',
                         'XNA' : 'XNA'})
for att in over_5_unique:
  print(f'{att}:')
  column_total = appsTrainDF[att].shape[0]
  for v in appsTrainDF[att].unique():
    print(f"Rows for {v}: {sum(appsTrainDF[att] == v)} - {round(100 * (sum(appsTrainDF[att] == v) / column_total))}%")
# # Convert categorical features to numerical approximations (via pipeline)
# class ClaimAttributesAdder(BaseEstimator, TransformerMixin):
#     def fit(self, X, y=None):
#         return self
#     def transform(self, X, y=None): 
#         charlson_idx_dt = {'0': 0, '1-2': 2, '3-4': 4, '5+': 6}
#         los_dt = {'1 day': 1, '2 days': 2, '3 days': 3, '4 days': 4, '5 days': 5, '6 days': 6,
#           '1- 2 weeks': 11, '2- 4 weeks': 21, '4- 8 weeks': 42, '26+ weeks': 180}
#         X['PayDelay'] = X['PayDelay'].apply(lambda x: int(x) if x != '162+' else int(162))
#         X['DSFS'] = X['DSFS'].apply(lambda x: None if pd.isnull(x) else int(x[0]) + 1)
#         X['CharlsonIndex'] = X['CharlsonIndex'].apply(lambda x: charlson_idx_dt[x])
#         X['LengthOfStay'] = X['LengthOfStay'].apply(lambda x: None if pd.isnull(x) else los_dt[x])
#         return X
    
print(appsTrainDF.shape)
print(X_kaggle_test.shape)
# set application pipeline
application_pipe = Pipeline([
    ('app_train_features', ApplicationTrainTestFeaturesAdder()),
    ('ohe', getDummies()),
    ('missing data remover', MissingFeatureRemover()),
    ('collinearity remover', CollinearFeatureRemover()),
    ('near zero variance remover', NearZeroVarianceFeatureRemover())
])
appsTrainDF = application_pipe.fit_transform(appsTrainDF)
X_kaggle_test = application_pipe.transform(X_kaggle_test)
print(appsTrainDF.shape)
print(X_kaggle_test.shape)
drop_uncommon=(list(set(appsTrainDF.columns.tolist()) - set(X_kaggle_test.columns.tolist())))
drop_uncommon.remove('TARGET')
drop_uncommon
appsTrainDF=appsTrainDF.drop(columns=drop_uncommon)
print(appsTrainDF.shape)
print(X_kaggle_test.shape)
appsTrainDF.to_csv("/content/drive/My Drive/AML Project/Data/appsTrainDF.csv",index=False)
X_kaggle_test.to_csv("/content/drive/My Drive/AML Project/Data/X_kaggle_test.csv",index=False)
Data Preparation Ends here with all numeric aggregated features and polynomial features all accumulation to :
Application_train -- (307511, 496)
Application_test -- (48744, 495)
print(appsTrainDF.shape)
print(X_kaggle_test.shape) 
Total numeric features in the application train df.
appsTrainDF.select_dtypes(exclude=['object']).columns
Total Categorical features in the application train df.
appsTrainDF.select_dtypes(include=['object']).columns
Deductions from the list of dtypes of the appsTrainDF
appsTrainDF.dtypes.value_counts()
start = time()
correlation_with_all_features = appsTrainDF.corr()
end = time()
print("Time taken for correlation ", ctime(end - start))
print()
correlation_with_all_features['TARGET'].sort_values()
# correlation_with_all_features.reset_index(inplace= True)
len(correlation_with_all_features.index)
# set this value to choose the number of positive and negative correlated features
n_val = 50
print("---"*50)
print("---"*50)
print("    Total correlation of all the features.    " )
print("---"*50)
print("---"*50)
print(f"Top {n_val} negative correlated features")
print()
print(correlation_with_all_features.TARGET.sort_values(ascending = True).head(n_val))
print()
print()
print(f"Top {n_val} positive correlated features")
print()
print(correlation_with_all_features.TARGET.sort_values(ascending = True).tail(n_val))
correlation_with_all_features.TARGET.sort_values(ascending = True)[-n_val:]
gc.collect()
corrn=correlation_with_all_features.TARGET.sort_values(ascending = True).head(n_val).index.tolist()
corrp=correlation_with_all_features.TARGET.sort_values(ascending = True).tail(n_val).index.tolist()
corr=corrn + corrp
corr.remove('TARGET')
print(len(corr))
corr
DATA_DIR='/content/drive/My Drive/AML Project/Data/Phase3'
%%time
ds_names = ('appsTrainDF', 'X_kaggle_test')
for ds_name in ds_names:
    datasets[ds_name]= load_data(os.path.join(DATA_DIR, f'{ds_name}.csv'), ds_name)
print(datasets['appsTrainDF'].shape)
print(datasets['X_kaggle_test'].shape)
X_kaggle_test=datasets['X_kaggle_test']
appsTrainDF=datasets['appsTrainDF']
train_dataset=appsTrainDF
class_labels = ["No Default","Default"]
# Create a class to select numerical or categorical columns since Scikit-Learn doesn't handle DataFrames yet
class DataFrameSelector(BaseEstimator, TransformerMixin):
    def __init__(self, attribute_names):
        self.attribute_names = attribute_names
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X[self.attribute_names].values
Identify the numeric features we wish to consider.
num_attribs=['EXT_SOURCE_3',
 'EXT_SOURCE_2',
 'EXT_SOURCE_1',
 'OCCUPATION_TYPE_Office',
 'previous_application_NAME_CONTRACT_STATUS_Approved_mean',
 'NAME_EDUCATION_TYPE_Higher education',
 'CODE_GENDER_F',
 'previous_application_DAYS_FIRST_DRAWING_mean',
 'DAYS_EMPLOYED',
 'previous_application_DAYS_FIRST_DRAWING_min',
 'FLOORSMAX_AVG',
 'previous_application_RATE_DOWN_PAYMENT_sum',
 'previous_application_NAME_YIELD_GROUP_low_normal_mean',
 'previous_application_RATE_DOWN_PAYMENT_max',
 'previous_application_INTEREST_RT_sum',
 'previous_application_PRODUCT_COMBINATION_Cash X-Sell: low_mean',
 'REGION_POPULATION_RELATIVE',
 'previous_application_INTEREST_RT_mean',
 'previous_application_HOUR_APPR_PROCESS_START_mean',
 'previous_application_AMT_ANNUITY_mean',
 'previous_application_NAME_PAYMENT_TYPE_Cash through the bank_mean',
 'ELEVATORS_AVG',
 'previous_application_PRODUCT_COMBINATION_POS industry with interest_mean',
 'previous_application_RATE_DOWN_PAYMENT_mean',
 'previous_application_NAME_CONTRACT_TYPE_Consumer loans_mean',
 'previous_application_AMT_ANNUITY_min',
 'previous_application_DAYS_FIRST_DRAWING_count',
 'previous_application_HOUR_APPR_PROCESS_START_min',
 'previous_application_HOUR_APPR_PROCESS_START_max',
 'previous_application_PRODUCT_COMBINATION_POS industry with interest_sum',
 'AMT_CREDIT',
 'previous_application_NAME_GOODS_CATEGORY_Furniture_mean',
 'APARTMENTS_AVG',
 'previous_application_NAME_YIELD_GROUP_low_action_mean',
 'previous_application_AMT_ANNUITY_max',
 'previous_application_NAME_GOODS_CATEGORY_Furniture_sum',
 'FLAG_DOCUMENT_6',
 'NAME_HOUSING_TYPE_House / apartment',
 'previous_application_NAME_YIELD_GROUP_low_normal_sum',
 'previous_application_CREDIT_SUCCESS_sum',
 'previous_application_NAME_CLIENT_TYPE_Refreshed_mean',
 'bureau_CREDIT_TYPE_Consumer credit_mean',
 'previous_application_AMT_DOWN_PAYMENT_max',
 'previous_application_NAME_YIELD_GROUP_low_action_sum',
 'HOUR_APPR_PROCESS_START',
 'FLAG_PHONE',
 'previous_application_AMT_DOWN_PAYMENT_count',
 'NAME_INCOME_TYPE_State servant',
 'previous_application_PRODUCT_COMBINATION_Cash X-Sell: low_sum',
 'previous_application_INTEREST_PER_CREDIT_min',
 'previous_application_CHANNEL_TYPE_AP+ (Cash loan)_sum',
 'bureau_CREDIT_TYPE_Credit card_sum',
 'previous_application_CHANNEL_TYPE_AP+ (Cash loan)_mean',
 'previous_application_PRODUCT_COMBINATION_Cash X-Sell: high_sum',
 'bureau_DAYS_CREDIT_ENDDATE_max',
 'previous_application_NAME_YIELD_GROUP_high_sum',
 'previous_application_NAME_YIELD_GROUP_high_mean',
 'previous_application_NAME_PAYMENT_TYPE_XNA_sum',
 'previous_application_CODE_REJECT_REASON_LIMIT_mean',
 'previous_application_PRODUCT_COMBINATION_Card Street_mean',
 'previous_application_CODE_REJECT_REASON_LIMIT_sum',
 'DAYS_REGISTRATION',
 'bureau_DAYS_CREDIT_sum',
 'previous_application_NAME_YIELD_GROUP_XNA_mean',
 'bureau_DAYS_CREDIT_UPDATE_min',
 'FLAG_DOCUMENT_3',
 'REG_CITY_NOT_LIVE_CITY',
 'bureau_CREDIT_TYPE_Microloan_mean',
 'previous_application_NAME_CONTRACT_TYPE_Revolving loans_sum',
 'previous_application_NAME_CLIENT_TYPE_New_sum',
 'previous_application_DAYS_DECISION_mean',
 'bureau_DAYS_CREDIT_ENDDATE_mean',
 'previous_application_CODE_REJECT_REASON_HC_sum',
 'previous_application_PRODUCT_COMBINATION_Card Street_sum',
 'bureau_DAYS_CREDIT_max',
 'NAME_EDUCATION_TYPE_Secondary / secondary special',
 'REG_CITY_NOT_WORK_CITY',
 'DAYS_ID_PUBLISH',
 'bureau_DAYS_ENDDATE_FACT_mean',
 'previous_application_DAYS_DECISION_min',
 'bureau_DAYS_CREDIT_ENDDATE_sum',
 'previous_application_CODE_REJECT_REASON_HC_mean',
 'DAYS_LAST_PHONE_CHANGE',
 'previous_application_CODE_REJECT_REASON_SCOFR_mean',
 'bureau_DAYS_ENDDATE_FACT_min',
 'previous_application_CODE_REJECT_REASON_SCOFR_sum',
 'previous_application_NAME_PRODUCT_TYPE_walk-in_mean',
 'NAME_INCOME_TYPE_Working',
 'REGION_RATING_CLIENT',
 'previous_application_NAME_PRODUCT_TYPE_walk-in_sum',
 'previous_application_NAME_CONTRACT_STATUS_Refused_sum',
 'bureau_CREDIT_ACTIVE_Active_sum',
 'bureau_DAYS_CREDIT_UPDATE_mean',
 'previous_application_INTEREST_PER_CREDIT_max',
 'bureau_DAYS_CREDIT_min',
 'bureau_CREDIT_ACTIVE_Active_mean',
 'previous_application_NAME_CONTRACT_STATUS_Refused_mean',
 'DAYS_BIRTH',
 'bureau_DAYS_CREDIT_mean',
 'previous_application_INTEREST_PER_CREDIT_mean',
'previous_application_CREDIT_SUCCESS_mean',
'previous_application_INTEREST_RT_mean',
'HAS_LIBAILITY_0',
'HAS_LIBAILITY_1',
'HAS_LIBAILITY_2',
'HAS_LIBAILITY_3',
 'FLAG_DOCUMENT_2',
 'FLAG_DOCUMENT_3',
 'FLAG_DOCUMENT_4',
 'FLAG_DOCUMENT_5',
 'FLAG_DOCUMENT_6',
 'FLAG_DOCUMENT_7',
 'FLAG_DOCUMENT_8',
 'FLAG_DOCUMENT_9',
 'FLAG_DOCUMENT_10',
 'FLAG_DOCUMENT_11',
 'FLAG_DOCUMENT_12',
 'FLAG_DOCUMENT_13',
 'FLAG_DOCUMENT_14',
 'FLAG_DOCUMENT_15',
 'FLAG_DOCUMENT_16',
 'FLAG_DOCUMENT_17',
 'FLAG_DOCUMENT_18',
 'FLAG_DOCUMENT_19',
 'FLAG_DOCUMENT_20',
 'FLAG_DOCUMENT_21',
  'AMT_REQ_CREDIT_BUREAU_HOUR',
 'AMT_REQ_CREDIT_BUREAU_DAY',
 'AMT_REQ_CREDIT_BUREAU_WEEK',
 'AMT_REQ_CREDIT_BUREAU_MON',
 'AMT_REQ_CREDIT_BUREAU_QRT',
 'AMT_REQ_CREDIT_BUREAU_YEAR'
]
num_pipeline = Pipeline([
        ('selector', DataFrameSelector(num_attribs)),
        ('imputer', SimpleImputer(strategy='mean')),
        ('std_scaler', StandardScaler()),
    ])
OHE when previously unseen unique values in the test/validation set
Train, validation and Test sets (and the leakage problem we have mentioned previously):
Let's look at a small usecase to tell us how to deal with this:
ValueError. This is because the there are new, previously unseen unique values in the test set and the encoder doesn’t know how to handle these values. In order to use both the transformed training and test sets in machine learning algorithms, we need them to have the same number of columns.This last problem can be solved by using the option handle_unknown='ignore'of the OneHotEncoder, which, as the name suggests, will ignore previously unseen values when transforming the test set.
Here is a example that in action:
# Identify the categorical features we wish to consider.
cat_attribs = ['CODE_GENDER', 'FLAG_OWN_REALTY','FLAG_OWN_CAR','NAME_CONTRACT_TYPE', 
               'NAME_EDUCATION_TYPE','OCCUPATION_TYPE','NAME_INCOME_TYPE']
# Notice handle_unknown="ignore" in OHE which ignore values from the validation/test that
# do NOT occur in the training set
cat_pipeline = Pipeline([
        ('selector', DataFrameSelector(cat_attribs)),
        ('imputer', SimpleImputer(strategy='most_frequent')),
        ('ohe', OneHotEncoder(sparse=False, handle_unknown="ignore"))
    ])
# Identify the categorical features we wish to consider.
# cat_attribs = ['CODE_GENDER', 'FLAG_OWN_REALTY','FLAG_OWN_CAR','NAME_CONTRACT_TYPE', 
#                'NAME_EDUCATION_TYPE','OCCUPATION_TYPE','NAME_INCOME_TYPE']
#cat_attribs = train_dataset.select_dtypes(include=['object', ]).columns.tolist()
cat_attribs =[]
gc.collect()
# Notice handle_unknown="ignore" in OHE which ignore values from the validation/test that
# do NOT occur in the training set
cat_pipeline = Pipeline([
        ('selector', DataFrameSelector(cat_attribs)),
        ('imputer', SimpleImputer(strategy='most_frequent')),
        ('ohe', OneHotEncoder(sparse=False, handle_unknown="ignore"))
        #('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
        #('ohe', OneHotEncoder(sparse=False, handle_unknown="ignore"))
    ])
With Feature union, combine numerical and categorical Pipeline together to prepare for Data pipeline
data_prep_pipeline = FeatureUnion(transformer_list=[
        ("num_pipeline", num_pipeline),
    #    ("cat_pipeline", cat_pipeline),
    ])              
selected_features = num_attribs 
tot_features = f"{len(selected_features)}:   Num:{len(num_attribs)},    Cat:{len(cat_attribs)}"
#Total Feature selected for processing
tot_features
gc.collect()
Since HCDR is a Classification task, we are going to use the following metrics to measure the Model performance
def pct(x):
    return round(100*x,3)
Define dataframe with all metrics included
#del expLog
try:
    expLog
except NameError:
    expLog = pd.DataFrame(columns=["exp_name", 
                                   "Train Acc", 
                                   "Valid Acc",
                                   "Test  Acc",
                                   "Train AUC", 
                                   "Valid AUC",
                                   "Test  AUC",
                                   "Train F1 Score",
                                   "Valid F1 Score",
                                   "Test F1 Score",                                   
                                   "Train Log Loss",
                                   "Valid Log Loss",
                                   "Test Log Loss",
                                   "P Score",
                                   "Train Time",
                                   "Valid Time",
                                   "Test Time",
                                   "Description"
                                  ])
# roc curve, precision recall curve for each model
fprs, tprs, precisions, recalls, names, scores, cvscores, pvalues, accuracy, cnfmatrix = list(), list(), list(), list(), list(), list(), list(), list(), list(), list()
features_list, final_best_clf,results = {}, {},[]
This metric describes the fraction of correctly classified samples. In SKLearn, it can be modified to return solely the number of correct samples.Accuracy is the default scoring method for both logistic regression and k-Nearest Neighbors in scikit-learn.
The precision is the ratio of true positives over the total number of predicted positives.
The recall is the ratio of true positives over the true positives and false negatives. Recall is assessing the ability of the classifier to find all the positive samples. The best value is 1 and the worst value is 0
def precision_recall_cust(model,X_train,y_train,X_test, y_test,X_valid, y_valid,precisions,recalls,name):
    # plot precision_recall Test
    precision, recall, threshold = precision_recall_curve(y_test,model.predict_proba(X_test)[:, 1])
    precisions.append(precision)
    recalls.append(recall)
    
    # plot combined Precision Recall curve for train, valid, test
    show_train_precision = plot_precision_recall_curve(model, X_train, y_train, name="TrainPresRecal")
    show_test_precision = plot_precision_recall_curve(model, X_test, y_test, name="TestPresRecal", ax=show_train_precision.ax_)
    show_valid_precision = plot_precision_recall_curve(model, X_valid, y_valid, name="ValidPresRecal", ax=show_test_precision.ax_)
    show_valid_precision.ax_.set_title ("Precision Recall Curve Comparison - " + name)
    plt.legend(bbox_to_anchor=(1.04,1), loc="upper left", borderaxespad=0)
    plt.show()
    return precisions,recalls
The F1 score is a metric that has a value of 0 - 1, with 1 being the best value. The F1 score is a weighted average of the precision and recall, with the contributions of precision and recall are the same
The confusion matrix, in this case for a binary classification, is a 2x2 matrix that contains the count of the true positives, false positives, true negatives, and false negatives.
def confusion_matrix_def(model,X_train,y_train,X_test, y_test, X_valid, y_valid,cnfmatrix):
  #Prediction
  preds_test = model.predict(X_test)
  preds_train = model.predict(X_train)
  preds_valid = model.predict(X_valid)
    
  cm_train = confusion_matrix(y_train, preds_train).astype(np.float32)
  #print(cm_train)
  cm_train /= cm_train.sum(axis=1)[:, np.newaxis]
  cm_test = confusion_matrix(y_test, preds_test).astype(np.float32)
  #print(cm_test)
  cm_test /= cm_test.sum(axis=1)[:, np.newaxis]
  cm_valid = confusion_matrix(y_valid, preds_valid).astype(np.float32)
  cm_valid /= cm_valid.sum(axis=1)[:, np.newaxis]
  plt.figure(figsize=(16, 4))
  #plt.subplots(1,3,figsize=(12,4))
  plt.subplot(131)
  g = sns.heatmap(cm_train, vmin=0, vmax=1, annot=True, cmap="Reds")
  plt.xlabel("Predicted", fontsize=14)
  plt.ylabel("True", fontsize=14)
  g.set(xticklabels=class_labels, yticklabels=class_labels)
  plt.title("Train", fontsize=14)
  plt.subplot(132)
  g = sns.heatmap(cm_valid, vmin=0, vmax=1, annot=True, cmap="Reds")
  plt.xlabel("Predicted", fontsize=14)
  plt.ylabel("True", fontsize=14)
  g.set(xticklabels=class_labels, yticklabels=class_labels)
  plt.title("Validation set", fontsize=14);
  plt.subplot(133)
  g = sns.heatmap(cm_test, vmin=0, vmax=1, annot=True, cmap="Reds")
  plt.xlabel("Predicted", fontsize=14)
  plt.ylabel("True", fontsize=14)
  g.set(xticklabels=class_labels, yticklabels=class_labels)
  plt.title("Test", fontsize=14);
  cnfmatrix.append(cm_test)
  return cnfmatrix
An ROC curve (receiver operating characteristic curve) is a graph showing the performance of a classification model at all classification thresholds. This curve plots two parameters:
▪   True Positive Rate
▪   False Positive Rate
AUC stands for "Area under the ROC Curve." That is, AUC measures the entire two-dimensional area underneath the entire ROC curve from (0,0) to (1,1).
AUC is desirable for the following two reasons:
def roc_curve_cust(model,X_train,y_train,X_test, y_test,X_valid, y_valid,fprs,tprs,name):
    fpr, tpr, threshold = roc_curve(y_test, model.predict_proba(X_test)[:, 1])
    fprs.append(fpr)
    tprs.append(tpr)
    # plot combined ROC curve for train, valid, test
    show_train_roc = plot_roc_curve(model, X_train, y_train, name="TrainRocAuc")
    show_test_roc = plot_roc_curve(model, X_test, y_test, name="TestRocAuc", ax=show_train_roc.ax_)
    show_valid_roc = plot_roc_curve(model, X_valid, y_valid, name="ValidRocAuc", ax=show_test_roc.ax_)
    show_valid_roc.ax_.set_title ("ROC Curve Comparison - " + name)
    plt.legend(bbox_to_anchor=(1.04,1), loc="upper left", borderaxespad=0)
    plt.show()
    return fprs,tprs
CXE measures the performance of a classification model whose output is a probability value between 0 and 1. CXE increases as the predicted probability diverges from the actual label. Therefore, we choose a parameter, which would minimize the binary CXE loss function.
The log loss formula for the binary case is as follows :
$$ -\frac{1}{m}\sum^m_{i=1}\left(y_i\cdot\:\log\:\left(p_i\right)\:+\:\left(1-y_i\right)\cdot\log\left(1-p_i\right)\right) $$p-value is the probability of obtaining test results at least as extreme as the results actually observed, under the assumption that the null hypothesis is correct. A very small p-value means that such an extreme observed outcome would be very unlikely under the null hypothesis.
We will compare the classifiers with the baseline untuned model by conducting two-tailed hypothesis test.
Null Hypothesis, H0: There is no significant difference between the two machine learning pipelines. Alternate Hypothesis, HA: The two machine learning pipelines are different. A p-value less than or equal to the significance level is considered statistically significant.
metrics = {'accuracy': make_scorer(accuracy_score),
            'roc_auc': 'roc_auc',
            'f1': make_scorer(f1_score),
            'log_loss': make_scorer(log_loss)
          }
Phase 3 was significant in terms of feature engineering since we had a better understanding of the data and were able to fine tune the datasets to minimize data leakage. The “TARGET” feature was not used in the train test and the test set was managed separately from the merged train dataset. The preprocessing was applied to the merged train set only. We also ensured to remove multicollinearity as a separate preprocessing step and experimented with feature selection methods only within the modelling pipeline to avoid any issues with data leakage. The feature selection methods used were RFE, SelectKbest with mutual classification and Variance threshold.
for col in selected_features:
  if col not in  train_dataset.columns:
    selected_features.remove(col)
# Split Sample to feed the pipeline and it will result in a new dataset that is (1 / splits) the size 
splits = 75
# Train Test split percentage
subsample_rate = 0.3
finaldf = np.array_split(train_dataset, splits)
X_train = finaldf[0][selected_features]
y_train = finaldf[0]['TARGET']
X_kaggle_test= X_kaggle_test[selected_features]
## split part of data
X_train, X_test, y_train, y_test = train_test_split(X_train, y_train, stratify=y_train,
                                                    test_size=subsample_rate, random_state=42)
X_train, X_valid, y_train, y_valid = train_test_split(X_train, y_train,stratify=y_train,test_size=0.15, random_state=42)
print(f"X train           shape: {X_train.shape}")
print(f"X validation      shape: {X_valid.shape}")
print(f"X test            shape: {X_test.shape}")
print(f"X kaggle_test     shape: {X_kaggle_test.shape}")
X_kaggle_test=datasets['X_kaggle_test1']
kaggle_test = X_kaggle_test[selected_features]
X_kaggle_test.shape,kaggle_test.shape
Logistic regression model is used as a baseline Model, since it's easy to implement yet provides great efficiency. Training a logistic regression model doesn't require high computation power. We also tuned the regularization, tolerance, and C hyper parameters for the Logistic regression model and compared the results with the baseline model. We used 15 fold cross fold validation with hyperparameters to tune the model and apply GridSearchCV function in Sklearn.
%%time 
np.random.seed(42)
full_pipeline_with_predictor = Pipeline([
        ("preparation", data_prep_pipeline),
        ("linear", LogisticRegression())
    ])
Split the training data to 15 fold to perform Crossfold validation
cvSplits = ShuffleSplit(n_splits=5, test_size=0.3, random_state=0)
X_train.head(5)
gc.collect()
start = time()
model = full_pipeline_with_predictor.fit(X_train, y_train)
np.random.seed(42)
# Set up cross validation scores 
logit_scores = cross_validate(model, X_train, y_train,cv=cvSplits,scoring=metrics, return_train_score=True, n_jobs=-1)  
train_time = np.round(time() - start, 4)
# Time and score valid predictions
start = time()
logit_score_valid  = full_pipeline_with_predictor.score(X_valid, y_valid)
valid_time = np.round(time() - start, 4)
# Time and score test predictions
start = time()
logit_score_test  = full_pipeline_with_predictor.score(X_test, y_test)
test_time = np.round(time() - start, 4)
exp_name = f"Baseline_{len(selected_features)}_features"
expLog.loc[len(expLog)] = [f"{exp_name}"] + list(np.round(
               [logit_scores['train_accuracy'].mean(), 
                logit_scores['test_accuracy'].mean(),
                accuracy_score(y_test, model.predict(X_test)),
                logit_scores['train_roc_auc'].mean(),
                logit_scores['test_roc_auc'].mean(),
                roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]),
                logit_scores['train_f1'].mean(),
                logit_scores['test_f1'].mean(),
                f1_score(y_test, model.predict(X_test)),
                logit_scores['train_log_loss'].mean(),
                logit_scores['test_log_loss'].mean(),
                log_loss(y_test, model.predict(X_test)),0 ],4)) \
                + [train_time, logit_scores['score_time'].mean(), test_time] + [f"Imbalanced Logistic reg features {tot_features} with 20% training data"]
expLog
# Create confusion matrix for baseline model
_=confusion_matrix_def(model,X_train,y_train,X_test,y_test,X_valid, y_valid,cnfmatrix)
_,_=roc_curve_cust(model,X_train,y_train,X_test, y_test,X_valid, y_valid,fprs,tprs,"Baseline Logistic Regression Model")
_,_=precision_recall_cust(model,X_train,y_train,X_test, y_test,X_valid, y_valid,precisions,recalls,"Baseline Logistic Regression Model")
gc.collect()
Various Classification algorithms were used to compare with the best model. Following metrics were used to find the best model
We implemented the Logistic regression model as the baseline model, which didn’t require high computation power and was easy to implement, in addition we implemented KNN and tuned logistic models with balanced dataset to improve our model predictiveness. Our objective in current phase s to explore various classification models which would improve our prediction. Our primary focus is on boosting algorithms which are said to be highly efficient and moderately quicker. As shown in the diagram below is the modelling pipeline for current phase. We primarily experimented with Gradient Boosting, XGBoost, Light BGM, RandomForest and SVM.
Recursive Feature Elimination RFE is a wrapper-type feature selection algorithm. A different machine learning algorithm is given and used in the core of the method, is wrapped by RFE, and used to help select features. We have chosen this model in contrast to filter-based feature selections that score each feature and select those features with the largest (or smallest) score.
Below is the reason for choosing the mentioned models.
Gradient Boosting implementations have no regularisation like XGBoost, therefore it helps to reduce overfitting. But boosting algorithms can overfit if the number of trees is very large. We did two submission in Kaggle, one using Voting Classifier and the other one with best classifier i.e. XGBoost. A Voting Classifier is a machine learning model that trains on an ensemble of various models and predicts an output based on their highest probability of chosen class as the output. We have chosen soft voting instead of hard voting since the soft voting predicts based on the average of all models.
For discussion, Results and conclusions refer to the bottom of this page.
classifiers = [
        [('Logistic Regression', LogisticRegression(solver='saga',random_state=42),"RFE")],
        [('Support Vector', SVC(random_state=42,probability=True),"SVM")],
        [('Gradient Boosting', GradientBoostingClassifier(warm_start=True, random_state=42),"RFE")],
        [('XGBoost', XGBClassifier(random_state=42),"RFE")],
        [('Light GBM', LGBMClassifier(boosting_type='gbdt', random_state=42),"RFE")],
        [('RandomForest', RandomForestClassifier(random_state=42),"RFE")]
    ]
# Arrange grid search parameters for each classifier
params_grid = {
        'Logistic Regression': {
            'penalty': ('l1', 'l2','elasticnet'),
            'tol': (0.0001, 0.00001), 
            'C': (10, 1, 0.1, 0.01),
        }
    ,
        'Support Vector' : {
            'kernel': ('rbf','poly'),     
            'degree': (4, 5),
            'C': ( 0.001, 0.01),   #Low C - allow for misclassification
            'gamma':(0.01,0.1,1)  #Low gamma - high variance and low bias
        }
    ,
    'Gradient Boosting':  {
            'max_depth': [5,10], # Lower helps with overfitting
            'max_features': [10,15],
            'validation_fraction': [0.2],
            'n_iter_no_change': [10],
            'tol': [0.01,0.0001],
            'n_estimators':[1000],
            'subsample' : [0.8],             #fraction of observations to be randomly samples for each tree.
    #        'min_samples_split' : [5], # Must have 'x' number of samples to split (Default = 2)
            'min_samples_leaf' : [3,5],        # (Default = 1) minimum number of samples in a leaf
        },
        'XGBoost':  {
            'max_depth': [3,5], # Lower helps with overfitting
            'n_estimators':[300,500],
            'learning_rate': [0.01,0.1],
#            'objective': ['binary:logistic'],
#            'eval_metric': ['auc'],
            'eta' : [0.01,0.1],
            'colsample_bytree' : [0.2,0.5], 
        },
        'Light GBM':  {
            'max_depth': [2,5],  # Lower helps with overfitting
            'num_leaves': [5,10], # Equivalent to max depth
            'n_estimators':[1000,5000],
            'learning_rate': [0.01,0.1],
 #           'reg_alpha': [0.1,0.01,1],
 #           'reg_lambda': [0.1,0.01,1],
            'boosting_type':['goss','dart'],
 #           'metric': ['auc'],
 #           'objective':['binary'],
            'max_bin' : [100,200],  #Setting it to high values has similar effect as caused by increasing value of num_leaves 
        },                          #small numbers reduces accuracy but runs faster 
        'RandomForest':  {
            'max_depth': [5,10],
            'max_features': [15,20],
            'min_samples_split': [5, 10],
            'min_samples_leaf': [3, 5],
            'bootstrap': [True],
            'n_estimators':[1000]},
    }
# Set feature selection settings
# Features removed each step
#feature_selection_steps=50
feature_selection_steps=0.5
# Number of features used
features_used=len(selected_features)
#features_used=100
results.append(logit_scores['train_accuracy'])
names = ['Baseline LR']
def ConductGridSearch(in_classifiers,cnfmatrix,fprs,tprs,precisions,recalls):
    for (name, classifier,feature_sel) in in_classifiers:
            # Print classifier and parameters
            print('****** START', name,'*****')
            parameters = params_grid[name]
            print("Parameters:")
            for p in sorted(parameters.keys()):
                print("\t"+str(p)+": "+ str(parameters[p]))
            # generate the pipeline based on the feature selection method
            if feature_sel == "SVM":
                full_pipeline_with_predictor = Pipeline([
                ("preparation", data_prep_pipeline),
            #    ("PCA",PCA(0.95)),
            #    ('RFE', RFE(estimator=classifier, n_features_to_select=features_used, step=feature_selection_steps)),
                ("predictor", classifier)
                ])
            else:
                full_pipeline_with_predictor = Pipeline([
                ("preparation", data_prep_pipeline),
                ('RFE', RFE(estimator=classifier, n_features_to_select=features_used, step=feature_selection_steps)),
                ("predictor", classifier)
                ])
            # Execute the grid search
            params = {}
            for p in parameters.keys():
                pipe_key = 'predictor__'+str(p)
                params[pipe_key] = parameters[p] 
            grid_search = GridSearchCV(full_pipeline_with_predictor, params, cv=cvSplits, scoring='roc_auc',
                                       n_jobs=-1,verbose=1)
            grid_search.fit(X_train, y_train)
            # Best estimator score
            best_train = pct(grid_search.best_score_)
            # Best train scores
            print("Cross validation with best estimator")
            best_train_scores = cross_validate(grid_search.best_estimator_, X_train, y_train,cv=cvSplits,scoring=metrics, 
                                               return_train_score=True, n_jobs=-1)  
            #get all scores
            best_train_accuracy = np.round(best_train_scores['train_accuracy'].mean(),4)
            best_train_f1 = np.round(best_train_scores['train_f1'].mean(),4)
            best_train_logloss = np.round(best_train_scores['train_log_loss'].mean(),4)
            best_train_roc_auc = np.round(best_train_scores['train_roc_auc'].mean(),4)
            valid_time = np.round(best_train_scores['score_time'].mean(),4)
            best_valid_accuracy = np.round(best_train_scores['test_accuracy'].mean(),4)
            best_valid_f1 = np.round(best_train_scores['test_f1'].mean(),4)
            best_valid_logloss = np.round(best_train_scores['test_log_loss'].mean(),4)
            best_valid_roc_auc = np.round(best_train_scores['test_roc_auc'].mean(),4)
            #append all results
            results.append(best_train_scores['train_accuracy'])
            names.append(name)
            
            # Conduct t-test with baseline logit (control) and best estimator (experiment)
            (t_stat, p_value) = stats.ttest_rel(logit_scores['train_roc_auc'], best_train_scores['train_roc_auc'])
            #test and Prediction with whole data
            # Best estimator fitting time
            print("Fit and Prediction with best estimator")
            start = time()
            model = grid_search.best_estimator_.fit(X_train, y_train)
            train_time = round(time() - start, 4)
            # Best estimator prediction time
            start = time()
            y_test_pred = model.predict(X_test)
            test_time = round(time() - start, 4)
            scores.append(roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))
            accuracy.append(accuracy_score(y_test, y_test_pred))
            # Create confusion matrix for the best model
            cnfmatrix = confusion_matrix_def(model,X_train,y_train,X_test,y_test,X_valid, y_valid,cnfmatrix)
            # Create AUC ROC curve
            fprs,tprs = roc_curve_cust(model,X_train,y_train,X_test, y_test,X_valid, y_valid,fprs,tprs,name)
            #Create Precision recall curve
            precisions,recalls = precision_recall_cust(model,X_train,y_train,X_test, y_test,X_valid, y_valid,precisions,recalls,name)
            #Best Model
            final_best_clf[name]=pd.DataFrame([{'label': grid_search.best_estimator_.named_steps['predictor'].__class__.__name__,
                                           'predictor': grid_search.best_estimator_.named_steps['predictor']}])
            #Feature importance 
            feature_name = num_attribs #+ cat_attribs
            feature_list = feature_name
            if feature_sel == "RFE":
            #    features_list[name]=pd.DataFrame({'feature_name': feature_list,
            #                                         'feature_importance': grid_search.best_estimator_.named_steps['PCA'].explained_variance_ratio_})
            #                             'feature_importance': grid_search.best_estimator_.named_steps['RFE'].ranking_})
               # print(grid_search.best_estimator_.named_steps['preparation'].get_feature_names())
               # print(len(grid_search.best_estimator_.named_steps['preparation'].get_feature_names()))
               # print(len(feature_list),feature_list)
               # print(len(grid_search.best_estimator_.named_steps['RFE'].ranking_))
            #          grid_search.best_estimator_.named_steps['RFE'].ranking_)
                features_list[name]=pd.DataFrame({'feature_name': feature_list,
                                         'feature_importance': grid_search.best_estimator_.named_steps['RFE'].ranking_[:132]})
            # Collect the best parameters found by the grid search
            print("Best Parameters:")
            best_parameters = grid_search.best_estimator_.get_params()
            param_dump = []
            for param_name in sorted(params.keys()):
                param_dump.append((param_name, best_parameters[param_name]))
                print("\t"+str(param_name)+": " + str(best_parameters[param_name]))
            print("****** FINISH",name," *****")
            print("")
            # Record the results
            exp_name = name
            expLog.loc[len(expLog)] = [f"{exp_name}"] + list(np.round(
                    [best_train_accuracy, 
                    #pct(accuracy_score(y_valid, model.predict(X_valid))),
                    best_valid_accuracy,
                    accuracy_score(y_test, y_test_pred),
                    best_train_roc_auc,
                    best_valid_roc_auc,
                    #roc_auc_score(y_valid, model.predict_proba(X_valid)[:, 1]),
                    roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]),
                    best_train_f1,
                    best_valid_f1,
                    f1_score(y_test, y_test_pred),
                    best_train_logloss,
                    best_valid_logloss, 
                    log_loss(y_test, y_test_pred),
                    p_value
                    ],4)) + [train_time,valid_time,test_time] \
                    + [json.dumps(param_dump)]
ConductGridSearch(classifiers[0],cnfmatrix,fprs,tprs,precisions,recalls)
gc.collect()
expLog
ConductGridSearch(classifiers[2],cnfmatrix,fprs,tprs,precisions,recalls)
gc.collect()
expLog
ConductGridSearch(classifiers[3],cnfmatrix,fprs,tprs,precisions,recalls)
expLog
ConductGridSearch(classifiers[4],cnfmatrix,fprs,tprs,precisions,recalls)
expLog
ConductGridSearch(classifiers[5],cnfmatrix,fprs,tprs,precisions,recalls)
gc.collect()
expLog
ConductGridSearch(classifiers[1],cnfmatrix,fprs,tprs,precisions,recalls)
gc.collect()
expLog
# plot feature importance by their ranking for each model
for name in names[1:-1]:
    plt.figure(figsize=(10,10), dpi= 80)
    features_df = features_df = features_list[name].sort_values(['feature_importance','feature_name'], ascending=[False, False])
    sortedNames = np.array(features_df)[0:25, 0]
    sortedImportances = np.array(features_df)[0:25, 1]
    plt.title('Feature Importance  - ' + name)
    plt.barh(range(len(sortedNames)), sortedImportances, color='g', align='center')
    plt.yticks(range(len(sortedNames)), sortedNames)  
    plt.xlabel('Low Importance                                                           High Importance')
    plt.grid()
    plt.show()
# boxplot algorithm comparison
fig = pyplot.figure()
fig.suptitle('Classification Algorithm Comparison')
ax = fig.add_subplot(111)
pyplot.boxplot(results)
ax.set_xticklabels(names,rotation=90)
pyplot.grid()
pyplot.show()
# roc curve fpr, tpr  for all classifiers 
plt.plot([0,1],[0,1], 'k--')
for i in range(len(names)-1):
    plt.plot(fprs[i],tprs[i],label = names[i] + '  ' + str(scores[i]))
plt.legend(bbox_to_anchor=(1.04,1), loc="upper left", borderaxespad=0)
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title('Receiver Operating Characteristic')
plt.show()
# precision recall curve  for all classifiers 
for i in range(len(names)-1):
    plt.plot(recalls[i],precisions[i],label = names[i])
plt.legend(bbox_to_anchor=(1.04,1), loc="upper left", borderaxespad=0)
plt.xlabel("Recall")
plt.ylabel("Precision")
plt.title('Precision-Recall Curve')
plt.show()
# plot confusion matrix for all classifiers 
f, axes = plt.subplots(1, len(names), figsize=(30, 8), sharey='row')
for i in range(len(names)):
    disp = ConfusionMatrixDisplay(cnfmatrix[i], display_labels=['0', '1'])
    disp.plot(ax=axes[i], xticks_rotation=0)
    disp.ax_.set_title("Confusion Matrix - " + names[i])
    disp.im_.colorbar.remove()
    disp.ax_.set_xlabel('')
    if i!=0:
        disp.ax_.set_ylabel('')
f.text(0.4, 0.1, 'Predicted label', ha='left')
plt.subplots_adjust(wspace=0.10, hspace=0.1)
f.colorbar(disp.im_, ax=axes)
plt.show()
pd.set_option('display.max_colwidth', None)
expLog.to_csv("/content/drive/My Drive/AML Project/Data/Phase3/expLog_RFE.csv",index=False)
expLog
final_best_clf
final_best_cl_rfe_fdf = pd.DataFrame(list(final_best_clf.items()))#,columns = ['column1','column2']) 
with open('/content/drive/My Drive/AML Project/Data/Phase3/final_best_clf_RFE.txt', 'w') as file:
     file.write(str(final_best_clf))
Build Pipeline using best models and create an ensemble model for Kaggle submission
def voting_classifier_submission(model_selection,final_best_clf,fs_type,fs_params):
  %%time 
  np.random.seed(42)
  print("Classifier with parameters")
  final_estimators = []
  for i,clf in enumerate(model_selection):
      model = final_best_clf[clf]['predictor'][0]
      print(i+1, " :",model)
      final_estimators.append((clf,make_pipeline(data_prep_pipeline,
                         RFE(estimator=model,n_features_to_select=features_used, step=feature_selection_steps),
                          model)))
  voting_classifier = Pipeline([("clf", VotingClassifier(estimators=final_estimators, voting='soft'))])
  final_X_train = finaldf[0][selected_features]
  final_y_train = finaldf[0]['TARGET']
  final_X_kaggle_test = kaggle_test
  print(final_X_train.shape,final_y_train.shape,final_X_kaggle_test.shape)
  voting_classifier.fit(final_X_train, final_y_train)
  start = time()
  train_time = round(time() - start, 4)
  print("Voting Score:{0}".format(voting_classifier.score(final_X_train, final_y_train)))
  test_class_scores = voting_classifier.predict_proba(final_X_kaggle_test)[:, 1]
  print(test_class_scores[0:10])
  
  #For each SK_ID_CURR in the test set, you must predict a probability for the TARGET variable. 
  #
  submit_df = datasets["application_test"][['SK_ID_CURR']]
  submit_df['TARGET'] = test_class_scores
  print(submit_df.head(2))
  submit_df.to_csv(f'/content/drive/My Drive/AML Project/Data/Phase3/submission_{fs_type}.csv',index=False)
final_best_clf
model_selection = ['Logistic Regression','Gradient Boosting','XGBoost','Light GBM','RandomForest']
fs_type='RFE'
voting_classifier_submission(model_selection,final_best_clf,fs_type,fs_params)
Multicollinearity highly affects the variance associated with the problem, and can also affect the interpretation of the model, as it undermines the statistical significance of independent variables.For a dataset, if some of the independent variables are highly independent of each other, it results in multicollinearity. A small change in any of the features can affect the model performance to a great extent. In other words, The coefficients of the model become very sensitive to small changes in the independent variables.The basic idea is to run a PCA on all predictors. Their ratio, the Condition Index, will be high if multicollinearity is present. 
Reference : https://www.whitman.edu/Documents/Academics/Mathematics/2017/Perez
for (name, classifier,feature_sel) in classifiers[0]:
        # Print classifier and parameters
        print('****** START', name,'*****')
        parameters = params_grid[name]
        print("Parameters:")
        for p in sorted(parameters.keys()):
            print("\t"+str(p)+": "+ str(parameters[p]))
        # generate the pipeline based on the feature selection method
        full_pipeline_with_predictor = Pipeline([
            ("preparation", data_prep_pipeline),
            ("PCA",PCA(0.95)),
            ("predictor", classifier)
            ])
 
        # Execute the grid search
        params = {}
        for p in parameters.keys():
            pipe_key = 'predictor__'+str(p)
            params[pipe_key] = parameters[p] 
        grid_search = GridSearchCV(full_pipeline_with_predictor, params, cv=cvSplits, scoring='roc_auc',
                                   n_jobs=-1,verbose=1)
        grid_search.fit(X_train, y_train)
        # Best estimator score
        best_train = pct(grid_search.best_score_)
        # Best train scores
        print("Cross validation with best estimator")
        best_train_scores = cross_validate(grid_search.best_estimator_, X_train, y_train,cv=cvSplits,scoring=metrics, 
                                           return_train_score=True, n_jobs=-1)  
        #get all scores
        best_train_accuracy = np.round(best_train_scores['train_accuracy'].mean(),4)
        best_train_f1 = np.round(best_train_scores['train_f1'].mean(),4)
        best_train_logloss = np.round(best_train_scores['train_log_loss'].mean(),4)
        best_train_roc_auc = np.round(best_train_scores['train_roc_auc'].mean(),4)
        valid_time = np.round(best_train_scores['score_time'].mean(),4)
        best_valid_accuracy = np.round(best_train_scores['test_accuracy'].mean(),4)
        best_valid_f1 = np.round(best_train_scores['test_f1'].mean(),4)
        best_valid_logloss = np.round(best_train_scores['test_log_loss'].mean(),4)
        best_valid_roc_auc = np.round(best_train_scores['test_roc_auc'].mean(),4)
        (t_stat, p_value) = stats.ttest_rel(logit_scores['train_roc_auc'], best_train_scores['train_roc_auc'])
        #test and Prediction with whole data
        # Best estimator fitting time
        print("Fit and Prediction with best estimator")
        start = time()
        model = grid_search.best_estimator_.fit(X_train, y_train)
        train_time = round(time() - start, 4)
        # Best estimator prediction time
        start = time()
        y_test_pred = model.predict(X_test)
        test_time = round(time() - start, 4)
        # Collect the best parameters found by the grid search
        print("Best Parameters:")
        best_parameters = grid_search.best_estimator_.get_params()
        param_dump = []
        for param_name in sorted(params.keys()):
            param_dump.append((param_name, best_parameters[param_name]))
            print("\t"+str(param_name)+": " + str(best_parameters[param_name]))
        print("****** FINISH",name," *****")
        print("")
        # Record the results
#        exp_name = "Logistic Regression with PCA"
        expLog.loc[len(expLog)] = [f"{exp_name}"] + list(np.round(
                [best_train_accuracy, 
                #pct(accuracy_score(y_valid, model.predict(X_valid))),
                best_valid_accuracy,
                accuracy_score(y_test, y_test_pred),
                best_train_roc_auc,
                best_valid_roc_auc,
                #roc_auc_score(y_valid, model.predict_proba(X_valid)[:, 1]),
                roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]),
                best_train_f1,
                best_valid_f1,
                f1_score(y_test, y_test_pred),
                best_train_logloss,
                best_valid_logloss, 
                log_loss(y_test, y_test_pred),
                p_value
                ],4)) + [train_time,valid_time,test_time] \
                + [json.dumps(param_dump)]
expLog
  fs_type='Logistic PCA'
  final_X_train = finaldf[0][selected_features]
  final_y_train = finaldf[0]['TARGET']
  final_X_kaggle_test = kaggle_test
  grid_search.best_estimator_.fit(final_X_train, final_y_train)
  print(final_X_train.shape,final_y_train.shape,final_X_kaggle_test.shape)
  start = time()
  train_time = round(time() - start, 4)
  print("Logistic PCA Score:{0}".format(grid_search.best_estimator_.score(final_X_train, final_y_train)))
  test_class_scores = grid_search.best_estimator_.predict_proba(final_X_kaggle_test)[:, 1]
  print(test_class_scores[0:10])
  
  #For each SK_ID_CURR in the test set, you must predict a probability for the TARGET variable. 
  #
  submit_df = datasets["application_test"][['SK_ID_CURR']]
  submit_df['TARGET'] = test_class_scores
  print(submit_df.head(2))
  submit_df.to_csv(f'/content/drive/My Drive/AML Project/Data/Phase3/submission_{fs_type}.csv',index=False)
Various Classification algorithms were used to compare with the best model. Following metrics were used to find the best model
# Clean up the  arrays
del fprs[1:]
del tprs[1:] 
del precisions[1:] 
del recalls[1:] 
del names[1:] 
del scores[1:] 
del cvscores[1:] 
del pvalues[1:] 
del accuracy[1:]
del cnfmatrix[1:] 
del results[1:]
final_best_clf,results = {}, {}
print(names)
classifiers = [
        [('Logistic Regression', LogisticRegression(solver='saga',random_state=42),"SelectKbest")],
        [('XGBoost', XGBClassifier(random_state=42),"SelectKbest")],
        [('Light GBM', LGBMClassifier(boosting_type='gbdt', random_state=42),"SelectKbest")],
        [('RandomForest', RandomForestClassifier(random_state=42),"SelectKbest")]
    ]
# Arrange grid search parameters for each classifier
params_grid = {
        'Logistic Regression': {
            'penalty': ('l1', 'l2','elasticnet'),
            'tol': (0.0001, 0.00001), 
            'C': (10, 1, 0.1, 0.01),
        }
    ,
        'XGBoost':  {
            'max_depth': [3,5], # Lower helps with overfitting
            'n_estimators':[300,500],
            'learning_rate': [0.01,0.1],
#            'objective': ['binary:logistic'],
#            'eval_metric': ['auc'],
            'eta' : [0.01,0.1],
            'colsample_bytree' : [0.2,0.5], 
        },
        'Light GBM':  {
            'max_depth': [2,5],  # Lower helps with overfitting
            'num_leaves': [5,10], # Equivalent to max depth
            'n_estimators':[1000,5000],
            'learning_rate': [0.01,0.1],
 #           'reg_alpha': [0.1,0.01,1],
 #           'reg_lambda': [0.1,0.01,1],
            'boosting_type':['goss','dart'],
 #           'metric': ['auc'],
 #           'objective':['binary'],
            'max_bin' : [100,200],  #Setting it to high values has similar effect as caused by increasing value of num_leaves 
        },                          #small numbers reduces accuracy but runs faster 
        'RandomForest':  {
            'max_depth': [5,10],
            'max_features': [15,20],
            'min_samples_split': [5, 10],
            'min_samples_leaf': [3, 5],
            'bootstrap': [True],
            'n_estimators':[1000]},
    }
results = []
results.append(logit_scores['train_accuracy'])
def ConductGridSearch(in_classifiers,cnfmatrix,fprs,tprs,precisions,recalls):
    for (name, classifier,feature_sel) in in_classifiers:
            # Print classifier and parameters
            print('****** START', name,'*****')
            parameters = params_grid[name]
            print("Parameters:")
            for p in sorted(parameters.keys()):
                print("\t"+str(p)+": "+ str(parameters[p]))
            # generate the pipeline based on the feature selection method
            full_pipeline_with_predictor = Pipeline([
                ("preparation", data_prep_pipeline),
                ('SelectKbest',SelectKBest(score_func=mutual_info_classif, k=features_used)),
                ("predictor", classifier)
                ])
            # Execute the grid search
            params = {}
            for p in parameters.keys():
                pipe_key = 'predictor__'+str(p)
                params[pipe_key] = parameters[p] 
            grid_search = GridSearchCV(full_pipeline_with_predictor, params, cv=cvSplits, scoring='roc_auc',
                                       n_jobs=-1,verbose=1)
            grid_search.fit(X_train, y_train)
            # Best estimator score
            best_train = pct(grid_search.best_score_)
            # Best train scores
            print("Cross validation with best estimator")
            best_train_scores = cross_validate(grid_search.best_estimator_, X_train, y_train,cv=cvSplits,scoring=metrics, 
                                               return_train_score=True, n_jobs=-1)  
            #get all scores
            best_train_accuracy = np.round(best_train_scores['train_accuracy'].mean(),4)
            best_train_f1 = np.round(best_train_scores['train_f1'].mean(),4)
            best_train_logloss = np.round(best_train_scores['train_log_loss'].mean(),4)
            best_train_roc_auc = np.round(best_train_scores['train_roc_auc'].mean(),4)
            valid_time = np.round(best_train_scores['score_time'].mean(),4)
            best_valid_accuracy = np.round(best_train_scores['test_accuracy'].mean(),4)
            best_valid_f1 = np.round(best_train_scores['test_f1'].mean(),4)
            best_valid_logloss = np.round(best_train_scores['test_log_loss'].mean(),4)
            best_valid_roc_auc = np.round(best_train_scores['test_roc_auc'].mean(),4)
            #append all results
            results.append(best_train_scores['train_accuracy'])
            names.append(name)
            
            # Conduct t-test with baseline logit (control) and best estimator (experiment)
            (t_stat, p_value) = stats.ttest_rel(logit_scores['train_roc_auc'], best_train_scores['train_roc_auc'])
            #test and Prediction with whole data
            # Best estimator fitting time
            print("Fit and Prediction with best estimator")
            start = time()
            model = grid_search.best_estimator_.fit(X_train, y_train)
            train_time = round(time() - start, 4)
            # Best estimator prediction time
            start = time()
            y_test_pred = model.predict(X_test)
            test_time = round(time() - start, 4)
            scores.append(roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))
            accuracy.append(accuracy_score(y_test, y_test_pred))
            # Create confusion matrix for the best model
            cnfmatrix = confusion_matrix_def(model,X_train,y_train,X_test,y_test,X_valid, y_valid,cnfmatrix)
            # Create AUC ROC curve
            fprs,tprs = roc_curve_cust(model,X_train,y_train,X_test, y_test,X_valid, y_valid,fprs,tprs,name)
            #Create Precision recall curve
            precisions,recalls = precision_recall_cust(model,X_train,y_train,X_test, y_test,X_valid, y_valid,precisions,recalls,name)
            #Best Model
            final_best_clf[name]=pd.DataFrame([{'label': grid_search.best_estimator_.named_steps['predictor'].__class__.__name__,
                                           'predictor': grid_search.best_estimator_.named_steps['predictor']}])
            
            # Collect the best parameters found by the grid search
            print("Best Parameters:")
            best_parameters = grid_search.best_estimator_.get_params()
            param_dump = []
            for param_name in sorted(params.keys()):
                param_dump.append((param_name, best_parameters[param_name]))
                print("\t"+str(param_name)+": " + str(best_parameters[param_name]))
            print("****** FINISH",name," *****")
            print("")
            # Record the results
            exp_name = name+str('SelectKbest')
            expLog.loc[len(expLog)] = [f"{exp_name}"] + list(np.round(
                    [best_train_accuracy, 
                    #pct(accuracy_score(y_valid, model.predict(X_valid))),
                    best_valid_accuracy,
                    accuracy_score(y_test, y_test_pred),
                    best_train_roc_auc,
                    best_valid_roc_auc,
                    #roc_auc_score(y_valid, model.predict_proba(X_valid)[:, 1]),
                    roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]),
                    best_train_f1,
                    best_valid_f1,
                    f1_score(y_test, y_test_pred),
                    best_train_logloss,
                    best_valid_logloss, 
                    log_loss(y_test, y_test_pred),
                    p_value
                    ],4)) + [train_time,valid_time,test_time] \
                    + [json.dumps(param_dump)]
ConductGridSearch(classifiers[0],cnfmatrix,fprs,tprs,precisions,recalls)
expLog
ConductGridSearch(classifiers[3],cnfmatrix,fprs,tprs,precisions,recalls)
expLog
ConductGridSearch(classifiers[1],cnfmatrix,fprs,tprs,precisions,recalls)
expLog
ConductGridSearch(classifiers[2],cnfmatrix,fprs,tprs,precisions,recalls)
expLog
# boxplot algorithm comparison
fig = pyplot.figure()
fig.suptitle('Classification Algorithm Comparison')
ax = fig.add_subplot(111)
pyplot.boxplot(results)
ax.set_xticklabels(names,rotation=90)
pyplot.grid()
pyplot.show()
# roc curve fpr, tpr  for all classifiers 
plt.plot([0,1],[0,1], 'k--')
for i in range(len(names)):
    plt.plot(fprs[i],tprs[i],label = names[i] + '  ' + str(scores[i]))
plt.legend(bbox_to_anchor=(1.04,1), loc="upper left", borderaxespad=0)
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title('Receiver Operating Characteristic')
plt.show()
# precision recall curve  for all classifiers 
for i in range(len(names)):
    plt.plot(recalls[i],precisions[i],label = names[i])
plt.legend(bbox_to_anchor=(1.04,1), loc="upper left", borderaxespad=0)
plt.xlabel("Recall")
plt.ylabel("Precision")
plt.title('Precision-Recall Curve')
plt.show()
# plot confusion matrix for all classifiers 
f, axes = plt.subplots(1, len(names), figsize=(30, 8), sharey='row')
for i in range(len(names)):
    disp = ConfusionMatrixDisplay(cnfmatrix[i], display_labels=['0', '1'])
    disp.plot(ax=axes[i], xticks_rotation=0)
    disp.ax_.set_title("Confusion Matrix - " + names[i])
    disp.im_.colorbar.remove()
    disp.ax_.set_xlabel('')
    if i!=0:
        disp.ax_.set_ylabel('')
f.text(0.4, 0.1, 'Predicted label', ha='left')
plt.subplots_adjust(wspace=0.10, hspace=0.1)
f.colorbar(disp.im_, ax=axes)
plt.show()
pd.set_option('display.max_colwidth', None)
expLog.to_csv("/content/drive/My Drive/AML Project/Data/Phase3/expLog_SelectKbest.csv",index=False)
expLog
def voting_classifier_submission(model_selection,final_best_clf,fs_type,fs_params):
  %%time 
  np.random.seed(42)
  print("Classifier with parameters")
  final_estimators = []
  for i,clf in enumerate(model_selection):
      model = final_best_clf[clf]['predictor'][0]
      print(i+1, " :",model)
      final_estimators.append((clf,make_pipeline(data_prep_pipeline,
                         SelectKBest(score_func=mutual_info_classif, k=features_used),
                          model)))
  voting_classifier = Pipeline([("clf", VotingClassifier(estimators=final_estimators, voting='soft'))])
  final_X_train = finaldf[0][selected_features]
  final_y_train = finaldf[0]['TARGET']
  final_X_kaggle_test = kaggle_test
  print(final_X_train.shape,final_y_train.shape,final_X_kaggle_test.shape)
  voting_classifier.fit(final_X_train, final_y_train)
  start = time()
  train_time = round(time() - start, 4)
  print("Voting Score:{0}".format(voting_classifier.score(final_X_train, final_y_train)))
  test_class_scores = voting_classifier.predict_proba(final_X_kaggle_test)[:, 1]
  print(test_class_scores[0:10])
  
  #For each SK_ID_CURR in the test set, you must predict a probability for the TARGET variable. 
  #
  submit_df = datasets["application_test"][['SK_ID_CURR']]
  submit_df['TARGET'] = test_class_scores
  print(submit_df.head(2))
  submit_df.to_csv(f'/content/drive/My Drive/AML Project/Data/Phase3/submission_{fs_type}.csv',index=False)
final_best_clf
model_selection = ['Logistic Regression','XGBoost','Light GBM','RandomForest']
fs_type='SelectKbest'
voting_classifier_submission(model_selection,final_best_clf,fs_type,fs_params)
Various Classification algorithms were used to compare with the best model using Variance Thershold feature selection technique.
Variance Threshold The variance threshold is a simple baseline approach to feature selection. It removes all features which variance doesn’t meet some threshold. By default, it removes all zero-variance features, i.e., features that have the same value in all samples. We assume that features with a higher variance may contain more useful information, but note that we are not taking the relationship between feature variables or feature and target variables into account, which is one of the drawbacks of filter methods.
Following metrics were used to find the best model
# Clean up the  arrays
del fprs[1:]
del tprs[1:] 
del precisions[1:] 
del recalls[1:] 
del names[1:] 
del scores[1:] 
del cvscores[1:] 
del pvalues[1:] 
del accuracy[1:]
del cnfmatrix[1:] 
del results[1:]
final_best_clf = {}
classifiers = [
        [('Logistic Regression', LogisticRegression(solver='saga',random_state=42),"VarianceThreshold")],
        [('XGBoost', XGBClassifier(random_state=42),"VarianceThreshold")],
        [('Light GBM', LGBMClassifier(boosting_type='gbdt', random_state=42),"VarianceThreshold")],
        [('RandomForest', RandomForestClassifier(random_state=42),"VarianceThreshold")]
    ]
# Arrange grid search parameters for each classifier
params_grid = {
        'Logistic Regression': {
            'penalty': ('l1', 'l2','elasticnet'),
            'tol': (0.0001, 0.00001), 
            'C': (10, 1, 0.1, 0.01),
        }
    ,
        'XGBoost':  {
            'max_depth': [3,5], # Lower helps with overfitting
            'n_estimators':[300,500],
            'learning_rate': [0.01,0.1],
#            'objective': ['binary:logistic'],
#            'eval_metric': ['auc'],
            'eta' : [0.01,0.1],
            'colsample_bytree' : [0.2,0.5], 
        },
        'Light GBM':  {
            'max_depth': [2,5],  # Lower helps with overfitting
            'num_leaves': [5,10], # Equivalent to max depth
            'n_estimators':[1000,5000],
            'learning_rate': [0.01,0.1],
 #           'reg_alpha': [0.1,0.01,1],
 #           'reg_lambda': [0.1,0.01,1],
            'boosting_type':['goss','dart'],
 #           'metric': ['auc'],
 #           'objective':['binary'],
            'max_bin' : [100,200],  #Setting it to high values has similar effect as caused by increasing value of num_leaves 
        },                          #small numbers reduces accuracy but runs faster 
        'RandomForest':  {
            'max_depth': [5,10],
            'max_features': [15,20],
            'min_samples_split': [5, 10],
            'min_samples_leaf': [3, 5],
            'bootstrap': [True],
            'n_estimators':[1000]},
    }
def ConductGridSearch(in_classifiers,cnfmatrix,fprs,tprs,precisions,recalls):
    for (name, classifier,feature_sel) in in_classifiers:
            # Print classifier and parameters
            print('****** START', name,'*****')
            parameters = params_grid[name]
            print("Parameters:")
            for p in sorted(parameters.keys()):
                print("\t"+str(p)+": "+ str(parameters[p]))
            # generate the pipeline based on the feature selection method
            full_pipeline_with_predictor = Pipeline([
                ("preparation", data_prep_pipeline),
                ("VarianceThreshold", VarianceThreshold(threshold=0.9)),    
                ("predictor", classifier)
                ])
            # Execute the grid search
            params = {}
            for p in parameters.keys():
                pipe_key = 'predictor__'+str(p)
                params[pipe_key] = parameters[p] 
            grid_search = GridSearchCV(full_pipeline_with_predictor, params, cv=cvSplits, scoring='roc_auc',
                                       n_jobs=-1,verbose=1)
            grid_search.fit(X_train, y_train)
            # Best estimator score
            best_train = pct(grid_search.best_score_)
            # Best train scores
            print("Cross validation with best estimator")
            best_train_scores = cross_validate(grid_search.best_estimator_, X_train, y_train,cv=cvSplits,scoring=metrics, 
                                               return_train_score=True, n_jobs=-1)  
            #get all scores
            best_train_accuracy = np.round(best_train_scores['train_accuracy'].mean(),4)
            best_train_f1 = np.round(best_train_scores['train_f1'].mean(),4)
            best_train_logloss = np.round(best_train_scores['train_log_loss'].mean(),4)
            best_train_roc_auc = np.round(best_train_scores['train_roc_auc'].mean(),4)
            valid_time = np.round(best_train_scores['score_time'].mean(),4)
            best_valid_accuracy = np.round(best_train_scores['test_accuracy'].mean(),4)
            best_valid_f1 = np.round(best_train_scores['test_f1'].mean(),4)
            best_valid_logloss = np.round(best_train_scores['test_log_loss'].mean(),4)
            best_valid_roc_auc = np.round(best_train_scores['test_roc_auc'].mean(),4)
            #append all results
            results.append(best_train_scores['train_accuracy'])
            names.append(name)
            
            # Conduct t-test with baseline logit (control) and best estimator (experiment)
            (t_stat, p_value) = stats.ttest_rel(logit_scores['train_roc_auc'], best_train_scores['train_roc_auc'])
            #test and Prediction with whole data
            # Best estimator fitting time
            print("Fit and Prediction with best estimator")
            start = time()
            model = grid_search.best_estimator_.fit(X_train, y_train)
            train_time = round(time() - start, 4)
            # Best estimator prediction time
            start = time()
            y_test_pred = model.predict(X_test)
            test_time = round(time() - start, 4)
            scores.append(roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))
            accuracy.append(accuracy_score(y_test, y_test_pred))
            # Create confusion matrix for the best model
            cnfmatrix = confusion_matrix_def(model,X_train,y_train,X_test,y_test,X_valid, y_valid,cnfmatrix)
            # Create AUC ROC curve
            fprs,tprs = roc_curve_cust(model,X_train,y_train,X_test, y_test,X_valid, y_valid,fprs,tprs,name)
            #Create Precision recall curve
            precisions,recalls = precision_recall_cust(model,X_train,y_train,X_test, y_test,X_valid, y_valid,precisions,recalls,name)
            #Best Model
            final_best_clf[name]=pd.DataFrame([{'label': grid_search.best_estimator_.named_steps['predictor'].__class__.__name__,
                                           'predictor': grid_search.best_estimator_.named_steps['predictor']}])
            
            # Collect the best parameters found by the grid search
            print("Best Parameters:")
            best_parameters = grid_search.best_estimator_.get_params()
            param_dump = []
            for param_name in sorted(params.keys()):
                param_dump.append((param_name, best_parameters[param_name]))
                print("\t"+str(param_name)+": " + str(best_parameters[param_name]))
            print("****** FINISH",name," *****")
            print("")
            # Record the results
            exp_name = name+str('Variance')
            expLog.loc[len(expLog)] = [f"{exp_name}"] + list(np.round(
                    [best_train_accuracy, 
                    #pct(accuracy_score(y_valid, model.predict(X_valid))),
                    best_valid_accuracy,
                    accuracy_score(y_test, y_test_pred),
                    best_train_roc_auc,
                    best_valid_roc_auc,
                    #roc_auc_score(y_valid, model.predict_proba(X_valid)[:, 1]),
                    roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]),
                    best_train_f1,
                    best_valid_f1,
                    f1_score(y_test, y_test_pred),
                    best_train_logloss,
                    best_valid_logloss, 
                    log_loss(y_test, y_test_pred),
                    p_value
                    ],4)) + [train_time,valid_time,test_time] \
                    + [json.dumps(param_dump)]
ConductGridSearch(classifiers[0],cnfmatrix,fprs,tprs,precisions,recalls)
expLog
ConductGridSearch(classifiers[1],cnfmatrix,fprs,tprs,precisions,recalls)
expLog
ConductGridSearch(classifiers[2],cnfmatrix,fprs,tprs,precisions,recalls)
expLog
ConductGridSearch(classifiers[3],cnfmatrix,fprs,tprs,precisions,recalls)
expLog
# boxplot algorithm comparison
fig = pyplot.figure()
fig.suptitle('Classification Algorithm Comparison')
ax = fig.add_subplot(111)
pyplot.boxplot(results)
ax.set_xticklabels(names,rotation=90)
pyplot.grid()
pyplot.show()
# roc curve fpr, tpr  for all classifiers 
plt.plot([0,1],[0,1], 'k--')
for i in range(len(names)):
    plt.plot(fprs[i],tprs[i],label = names[i] + '  ' + str(scores[i]))
plt.legend(bbox_to_anchor=(1.04,1), loc="upper left", borderaxespad=0)
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title('Receiver Operating Characteristic')
plt.show()
# precision recall curve  for all classifiers 
for i in range(len(names)):
    plt.plot(recalls[i],precisions[i],label = names[i])
plt.legend(bbox_to_anchor=(1.04,1), loc="upper left", borderaxespad=0)
plt.xlabel("Recall")
plt.ylabel("Precision")
plt.title('Precision-Recall Curve')
plt.show()
# plot confusion matrix for all classifiers 
f, axes = plt.subplots(1, len(names), figsize=(30, 8), sharey='row')
for i in range(len(names)):
    disp = ConfusionMatrixDisplay(cnfmatrix[i], display_labels=['0', '1'])
    disp.plot(ax=axes[i], xticks_rotation=0)
    disp.ax_.set_title("Confusion Matrix - " + names[i])
    disp.im_.colorbar.remove()
    disp.ax_.set_xlabel('')
    if i!=0:
        disp.ax_.set_ylabel('')
f.text(0.4, 0.1, 'Predicted label', ha='left')
plt.subplots_adjust(wspace=0.10, hspace=0.1)
f.colorbar(disp.im_, ax=axes)
plt.show()
pd.set_option('display.max_colwidth', None)
expLog.to_csv("/content/drive/My Drive/AML Project/Data/Phase3/expLog_Variance.csv",index=False)
expLog
def voting_classifier_submission(model_selection,final_best_clf,fs_type,fs_params):
  %%time 
  np.random.seed(42)
  print("Classifier with parameters")
  final_estimators = []
  for i,clf in enumerate(model_selection):
      model = final_best_clf[clf]['predictor'][0]
      print(i+1, " :",model)
      final_estimators.append((clf,make_pipeline(data_prep_pipeline,
                         (VarianceThreshold(threshold=0.9)),
                          model)))
  voting_classifier = Pipeline([("clf", VotingClassifier(estimators=final_estimators, voting='soft'))])
  final_X_train = finaldf[0][selected_features]
  final_y_train = finaldf[0]['TARGET']
  final_X_kaggle_test = kaggle_test
  print(final_X_train.shape,final_y_train.shape,final_X_kaggle_test.shape)
  voting_classifier.fit(final_X_train, final_y_train)
  start = time()
  train_time = round(time() - start, 4)
  print("Voting Score:{0}".format(voting_classifier.score(final_X_train, final_y_train)))
  test_class_scores = voting_classifier.predict_proba(final_X_kaggle_test)[:, 1]
  print(test_class_scores[0:10])
  
  #For each SK_ID_CURR in the test set, you must predict a probability for the TARGET variable. 
  #
  submit_df = datasets["application_test"][['SK_ID_CURR']]
  submit_df['TARGET'] = test_class_scores
  print(submit_df.head(2))
  submit_df.to_csv(f'/content/drive/My Drive/AML Project/Data/Phase3/submission_{fs_type}.csv',index=False)
final_best_clf
model_selection = ['Logistic Regression','XGBoost','Light GBM','RandomForest']
fs_type='Variance'
voting_classifier_submission(model_selection,final_best_clf,fs_type,fs_params)
In this section we are using the classifier XGBoost with SMOTE for imbalanced dataset with early stopping.
SMOTE: Synthetic Minority Oversampling Technique SMOTE is an oversampling technique where the synthetic samples are generated for the minority class. In a classic oversampling technique, the minority data is duplicated from the minority data population. While it increases the number of data, it does not give any new information or variation to the machine learning model. SMOTE works by taking advantage of thek-nearest neighbor algorithm to create synthetic data. This algorithm overcomes the overfitting issue raised by random oversampling. This also emphasizes on the feature space to generate new instances with the help of interpolation between the positive instances that lie together.
Early stopping is an approach to training complex machine learning models to avoid overfitting. It works by monitoring the performance of the model that is being trained on a separate test dataset and stopping the training procedure once the performance on the test dataset has not improved after a fixed number of training iterations. It avoids overfitting by attempting to automatically select the inflection point where performance on the test dataset starts to decrease while performance on the training dataset continues to improve as the model starts to overfit.The performance measure may be the loss function that is being optimized to train the model (such as logarithmic loss)
classifiers = [
        [('XGBoost SMOTE', XGBClassifier(random_state=42),"SMOTE")],]
params_grid = {
        'XGBoost SMOTE':  {
        'max_depth': [5], # Lower helps with overfitting
        'n_estimators':[1000],
        'learning_rate': [0.01],
        'objective': ['binary:logistic'],
        'eval_metric': ['auc'],
        'min_child_weight' : [15],
        'eta' : [0.01,],
        'subsample' : [0.5],
        'early_stopping_rounds':[5,10]
    },
    }
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
results=[]
def ConductGridSearch(in_classifiers,cnfmatrix,fprs,tprs,precisions,recalls):
    for (name, classifier,feature_sel) in in_classifiers:
            # Print classifier and parameters
            print('****** START', name,'*****')
            parameters = params_grid[name]
            print("Parameters:")
            for p in sorted(parameters.keys()):
                print("\t"+str(p)+": "+ str(parameters[p]))
            # generate the pipeline based on the feature selection method
            full_pipeline_with_predictor = Pipeline([
                ("preparation", data_prep_pipeline),
                ('SMOTE', SMOTE(random_state=42, sampling_strategy=0.25, k_neighbors=3)),
                ("predictor", classifier)
                ])
            # Execute the grid search
            params = {}
            for p in parameters.keys():
                pipe_key = 'predictor__'+str(p)
                params[pipe_key] = parameters[p] 
            grid_search = GridSearchCV(full_pipeline_with_predictor, params, cv=cvSplits, scoring='roc_auc',
                                       n_jobs=-1,verbose=1)
            grid_search.fit(X_train, y_train)
            # Best estimator score
            best_train = pct(grid_search.best_score_)
            # Best train scores
            print("Cross validation with best estimator")
            best_train_scores = cross_validate(grid_search.best_estimator_, X_train, y_train,cv=cvSplits,scoring=metrics, 
                                               return_train_score=True, n_jobs=-1)  
            #get all scores
            best_train_accuracy = np.round(best_train_scores['train_accuracy'].mean(),4)
            best_train_f1 = np.round(best_train_scores['train_f1'].mean(),4)
            best_train_logloss = np.round(best_train_scores['train_log_loss'].mean(),4)
            best_train_roc_auc = np.round(best_train_scores['train_roc_auc'].mean(),4)
            valid_time = np.round(best_train_scores['score_time'].mean(),4)
            best_valid_accuracy = np.round(best_train_scores['test_accuracy'].mean(),4)
            best_valid_f1 = np.round(best_train_scores['test_f1'].mean(),4)
            best_valid_logloss = np.round(best_train_scores['test_log_loss'].mean(),4)
            best_valid_roc_auc = np.round(best_train_scores['test_roc_auc'].mean(),4)
            #append all results
            results.append(best_train_scores['train_accuracy'])
            names.append(name)
            
            # Conduct t-test with baseline logit (control) and best estimator (experiment)
            (t_stat, p_value) = stats.ttest_rel(logit_scores['train_roc_auc'], best_train_scores['train_roc_auc'])
            #test and Prediction with whole data
            # Best estimator fitting time
            print("Fit and Prediction with best estimator")
            start = time()
            model = grid_search.best_estimator_.fit(X_train, y_train)
            train_time = round(time() - start, 4)
            # Best estimator prediction time
            start = time()
            y_test_pred = model.predict(X_test)
            test_time = round(time() - start, 4)
            scores.append(roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))
            accuracy.append(accuracy_score(y_test, y_test_pred))
            # Create confusion matrix for the best model
            cnfmatrix = confusion_matrix_def(model,X_train,y_train,X_test,y_test,X_valid, y_valid,cnfmatrix)
            # Create AUC ROC curve
            fprs,tprs = roc_curve_cust(model,X_train,y_train,X_test, y_test,X_valid, y_valid,fprs,tprs,name)
            #Create Precision recall curve
            precisions,recalls = precision_recall_cust(model,X_train,y_train,X_test, y_test,X_valid, y_valid,precisions,recalls,name)
            #Best Model
            final_best_clf[name]=pd.DataFrame([{'label': grid_search.best_estimator_.named_steps['predictor'].__class__.__name__,
                                           'predictor': grid_search.best_estimator_.named_steps['predictor']}])
            # Collect the best parameters found by the grid search
            print("Best Parameters:")
            best_parameters = grid_search.best_estimator_.get_params()
            param_dump = []
            for param_name in sorted(params.keys()):
                param_dump.append((param_name, best_parameters[param_name]))
                print("\t"+str(param_name)+": " + str(best_parameters[param_name]))
            print("****** FINISH",name," *****")
            print("")
            # Record the results
            exp_name = name+str('SMOTE')
            expLog.loc[len(expLog)] = [f"{exp_name}"] + list(np.round(
                    [best_train_accuracy, 
                    #pct(accuracy_score(y_valid, model.predict(X_valid))),
                    best_valid_accuracy,
                    accuracy_score(y_test, y_test_pred),
                    best_train_roc_auc,
                    best_valid_roc_auc,
                    #roc_auc_score(y_valid, model.predict_proba(X_valid)[:, 1]),
                    roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]),
                    best_train_f1,
                    best_valid_f1,
                    f1_score(y_test, y_test_pred),
                    best_train_logloss,
                    best_valid_logloss, 
                    log_loss(y_test, y_test_pred),
                    p_value
                    ],4)) + [train_time,valid_time,test_time] \
                    + [json.dumps(param_dump)]
ConductGridSearch(classifiers[0],cnfmatrix,fprs,tprs,precisions,recalls)
pd.set_option('display.max_colwidth', None)
expLog.to_csv("/content/drive/My Drive/AML Project/Data/Phase3/expLog_SMOTE.csv",index=False)
expLog
  fs_type='XGBoost SMOTE'
  final_X_train = finaldf[0][selected_features]
  final_y_train = finaldf[0]['TARGET']
  final_X_kaggle_test = kaggle_test
  grid_search.best_estimator_.fit(final_X_train, final_y_train)
  print(final_X_train.shape,final_y_train.shape,final_X_kaggle_test.shape)
  start = time()
  train_time = round(time() - start, 4)
  print("XGBoost SMOTE Score:{0}".format(grid_search.best_estimator_.score(final_X_train, final_y_train)))
  test_class_scores = grid_search.best_estimator_.predict_proba(final_X_kaggle_test)[:, 1]
  print(test_class_scores[0:10])
  
  #For each SK_ID_CURR in the test set, you must predict a probability for the TARGET variable. 
  #
  submit_df = datasets["application_test"][['SK_ID_CURR']]
  submit_df['TARGET'] = test_class_scores
  print(submit_df.head(2))
  submit_df.to_csv(f'/content/drive/My Drive/AML Project/Data/Phase3/submission_{fs_type}.csv',index=False)
#! kaggle competitions submit -c home-credit-default-risk -f submission.csv -m "Phase 2-Voting submission"
Deep Learning Model Pipeline & Workflow
Deep learning is a sub field of machine learning. Deep learning is about learning from past data using artificial neural networks with multiple hidden layers (2 or more hidden layers). Deep neural networks uncrumple complex representation of data step-by-step, layer-by-layer (hence multiple hidden layers) into a clear representation of the data. Artificial neural networks having one hidden layer apart from input and output layer is called as multi-layer perceptron (MLP) network.
import torch
import tensorflow as tf
import torch.nn as nn
import torch.nn.functional as func
import torch.optim as optim
from torch.utils.data import DataLoader
# TensorFlow and tf.keras
import tensorflow as tf
from tensorflow import keras
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Dropout
from keras.layers.normalization import BatchNormalization
import copy
from datetime import datetime
import pickle
import time
import torch
from torch.utils.data import Dataset
from torch.utils.data import DataLoader
import torch.nn as nn
import torch.nn.functional as func
import torch.optim as optim
from torch.optim import lr_scheduler
# Metrics
from sklearn.metrics import auc
Transform data using data pipeline and converted into Tensor for neural network pipeline.
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
print(device)
full_X_train = data_prep_pipeline.fit_transform(X_train)
full_X_test = data_prep_pipeline.fit_transform(X_test)
full_X_train_gpu = torch.FloatTensor(full_X_train).cuda()
full_X_test_gpu = torch.FloatTensor(full_X_test).cuda()
y_train_gpu =  torch.FloatTensor(y_train.to_numpy()).cuda()
y_test_gpu = torch.FloatTensor(y_test.to_numpy()).cuda()
full_X_test_gpu.shape,full_X_train_gpu.shape
results = pd.DataFrame(columns=["ExpID", 
              "Train Acc", "Val Acc", "Test Acc", "p-value",
              "Train AUC", "Val AUC", "Test AUC",
              "Train f1", "Val f1", "Test f1",
              "Train logloss", "Val logloss", "Test logloss",
              "Train Time(s)", "Val Time(s)", "Test Time(s)", 
              "Experiment description",
              "Top 10 Features"])
Signmoid layer is used to create the probability of prediction.
D_in = full_X_train_gpu.shape[1]
D_hidden1 = 20
D_hidden2 = 10
D_out= 1
model1 = torch.nn.Sequential( 
    torch.nn.Linear(D_in, D_out),
    nn.Sigmoid())
learning_rate = 0.01
optimizer = torch.optim.Adam(model1.parameters(), lr=learning_rate)
model1 = model1.cuda()
def return_report(y, y_prob):
  _, y_pred = torch.max(y_prob, dim = 1)
  y_pred = y_pred.cpu().numpy()
  acc = accuracy_score(y, y_pred)
  roc_auc = roc_auc_score(y, y_prob.cpu().detach().numpy())
  return_list = ([round(acc,4), round(roc_auc, 4)])
  return return_list
def print_report(y, y_prob):
  _, y_pred = torch.max(y_prob, dim = 1)
  y_pred = y_pred.cpu().numpy()
  acc = accuracy_score(y, y_pred)
  roc_auc = roc_auc_score(y, y_prob.cpu().detach().numpy())
  print(f'Accuracy : {round(acc,4)} ; ROC_AUC : {round(roc_auc, 4)}')
epochs = 500
y_train_gpu = y_train_gpu.reshape(-1, 1)
print('Train data : ')
model1.train()
for i in range(epochs):
  
  y_train_pred_prob = model1(full_X_train_gpu)
  loss = func.binary_cross_entropy(y_train_pred_prob, y_train_gpu)
  optimizer.zero_grad()
  #loss = loss_func(y_train_pred_prob, y_train_gpu)
  loss.backward()
  optimizer.step()
  if i % 50 == 49:
    print(f"Epoch {i + 1}:")
    print_report(y_train, y_train_pred_prob)
model1.eval()
y_test_gpu = y_test_gpu.reshape(-1, 1)
with torch.no_grad():
    y_test_pred_prob=model1(full_X_test_gpu)
    print('-' * 50)
    print('Test data : ')
    print_report(y_test, y_test_pred_prob)
    print('-' * 50)
final_X_kaggle_test = kaggle_test
final_X_kaggle_test = data_prep_pipeline.fit_transform(final_X_kaggle_test)
full_X_kaggle_gpu = torch.FloatTensor(final_X_kaggle_test).cuda()
full_X_kaggle_gpu.shape
  model1.eval()
  test_class_scores = model1(full_X_kaggle_gpu)
  print(test_class_scores[0:10])
  #For each SK_ID_CURR in the test set, you must predict a probability for the TARGET variable. 
  fs_type = "simple_nn"
  submit_df = datasets["application_test"][['SK_ID_CURR']]
  submit_df['TARGET'] = test_class_scores.detach().cpu().numpy()
  print(submit_df.head(2))
  submit_df.to_csv(f'/content/drive/My Drive/AML Project/Data/Phase3/submission_{fs_type}.csv',index=False)
The model contain one linear layer, one hidden layer with Relu function and sigmoid function for probability prediction.
## Model using hidden layers
class SVMNNmodel(nn.Module):
  def __init__(self, input_features , hidden1 = 80, hidden2 = 80, output_features = 1):
    super(SVMNNmodel, self).__init__()
    # self.f_connected1 = nn.Linear(input_features, hidden1)
    # self.f_connected2 = nn.Linear(hidden1, hidden2)
    # self.out = nn.Linear(hidden2, output_features)
    # self.sigmoid = nn.Sigmoid()
    self.f_connected1 = nn.Linear(input_features, hidden1)
    self.out = nn.Linear(hidden1, output_features)
  def forward(self, x):
    #x = func.relu(self.f_connected1(x))
    #x= func.relu(self.f_connected2(x))
    #x = self.out(x)
    #return self.sigmoid(x)
    h_relu = torch.relu(self.f_connected1(x))
    y_target_pred = torch.sigmoid(self.out(h_relu))
    return y_target_pred
The model contain one linear layer, one hidden layer with Relu function and sigmoid function for probability prediction.
To extend a hard SVM to cases in which the data are not linearly separable (a little noisy), we introduce the hinge loss function,
$${\displaystyle \max \left(0,1-y_{i}({\vec {w}}\cdot {\vec {x}}_{i}-b)\right).} $$This function is zero if the following constraint for a training example $y_{i}({\vec {w}}\cdot {\vec {x}}_{i}-b)\geq 1,$ is satisfied, in other words, if ${\displaystyle {\vec {x}}_{i}} $ lies on the correct side of the margin (DMZ demilitarized zone). For data on the wrong side of the margin, the function's value is proportional to the distance from the margin.
This type of data loss leads a soft-margin SVM classifier. Computing a (soft-margin) SVM classifier amounts to minimizing an expression of the form:
$$ {LinSVM}(\mathbf{w}, b) = \underset{W,b}{\operatorname{argmin}} \, \left( \overbrace{\dfrac{\lambda}{2}}^A \underbrace{\mathbf{w}^T \cdot \mathbf{w}}_B \quad + \quad \overbrace{\dfrac{1}{m} {\displaystyle \sum\limits_{i=1}^{m}max\left(0, 1 - \underbrace{y^{(i)}(\mathbf{w}^T \cdot \mathbf{x}^{(i)} + b)}_D \right)} }^{C}\right) \qquad (3) $$where the parameter ${\displaystyle \lambda } $ (corresponding to the A-zone in the above formulation) determines the tradeoff between increasing the margin-size and ensuring that the ${\displaystyle {\vec {x}}_{i}} $ lie on the correct side of the margin (corresponding to the C-zone in the above formulation).
Here choosing a sufficiently small value for $ \lambda$ yields the hard-margin classifier for linearly classifiable input data. Classically, this problem can be solved via quadratic programming (see slides/textbook/wikipedia for details). More recently approaches such as sub-gradient descent and coordinate descent have been proposed and lead to more scaleable implementations without compromising quality.
class SVMLoss(nn.Module):
  def __init__(self):
    super(SVMLoss,self).__init__()
  def forward(self,outputs,labels,model2):
     C = 0.10
    # data_loss = torch.mean(torch.clamp(1 - outputs.squeeze().t() == labels,min=0))
     data_loss = torch.mean(torch.clamp(1 - outputs.squeeze(),min=0))
     weight = model2.out.weight.squeeze()
     reg_loss = weight.t() @ weight
     reg_loss = reg_loss + ( model2.out.bias.squeeze()**2)
     hinge = data_loss +( C*reg_loss/2)
     return (hinge)
class Converttensor(Dataset):
    def __init__(self, feature, label, mode ='train', transforms=None):
        """
        Initialize data set as a list of IDs corresponding to each item of data set
        :param feature: x - numpy array
        :param label: y - numpy array
        """
        self.x = feature
        self.y = label
    
    def __len__(self):
        """
        Return the length of data set using list of IDs
        :return: number of samples in data set
        """
        return (self.x.shape[0])
    def __getitem__(self, index):
        """
        Generate one item of data set.
        :param index: index of item in IDs list
        :return: image and label and bouding box params
        """
        x = self.x[index,:]
        y_target = self.y[index]
        x = torch.FloatTensor(x)
        y_target_arr = np.array(y_target)
        return x, y_target_arr
fprs_net_train, tprs_net_train, fprs_net_valid, tprs_net_valid = [], [], [], []
roc_auc_net_train = 0.0
roc_auc_net_valid = 0.0
num_epochs=25
batch_size=256
CASE_NAME = "NN"
splits = 1
# Train Test split percentage
subsample_rate = 0.3
finaldf = np.array_split(train_dataset, splits)
X_train = finaldf[0][selected_features]
y_train = finaldf[0]['TARGET']
final_X_kaggle_test = kaggle_test
## split part of data
X_train, X_test, y_train, y_test = train_test_split(X_train, y_train, stratify=y_train,
                                                    test_size=subsample_rate, random_state=42)
X_train, X_valid, y_train, y_valid = train_test_split(X_train, y_train,stratify=y_train,test_size=0.15, random_state=42)
nn_X_train = data_prep_pipeline.fit_transform(X_train)
nn_X_valid = data_prep_pipeline.fit_transform(X_valid)
nn_X_test = data_prep_pipeline.fit_transform(X_test)
nn_X_kaggle_test = data_prep_pipeline.fit_transform(final_X_kaggle_test)
full_X_kaggle_gpu = torch.FloatTensor(nn_X_kaggle_test).cuda()
nn_y_train = np.array(y_train)
nn_y_valid = np.array(y_valid)
in_feature_cnt = nn_X_train.shape[1]
out_feature_cnt = 1
print(f"X train           shape: {nn_X_train.shape}")
print(f"X validation      shape: {nn_X_valid.shape}")
print(f"X test            shape: {nn_X_test.shape}")
print(f"X kaggle_test     shape: {nn_X_kaggle_test.shape}")
print("Feature count           : ",in_feature_cnt)
nn_dataset = {'train': nn_X_train, 'val': nn_X_valid}
dataset_sizes = {x_type : len(nn_dataset[x_type]) for x_type in ['train','val']}
dataset_sizes
## Transform dataset
nn_dataset['train'] = Converttensor(nn_dataset['train'] ,nn_y_train, mode='train')
## Transform validation dataset
nn_dataset['val'] = Converttensor(nn_dataset['val'] ,nn_y_valid, mode='validation')
nn_dataset
## Set dataloader
dataloaders = {x_type: torch.utils.data.DataLoader(nn_dataset[x_type], batch_size=batch_size,shuffle=True, num_workers=0)  
              for x_type in ['train', 'val']}  
# Set model
nn_model = SVMNNmodel(input_features = in_feature_cnt, output_features= 1).cuda()
#nn_model = nn_model.float()
#del convergence
try:
       convergence
       epoch_offset=convergence.epoch.iloc[-1]+1
except NameError:
        convergence=pd.DataFrame(columns=['epoch','phase','roc_auc','accuracy','CXE','Hinge'])
        epoch_offset=0
# Code adapted from https://pytorch.org/tutorials/beginner/transfer_learning_tutorial.html
def train(optimizer_cxe,optimizer_hinge,criteron,scheduler_cxe,scheduler_hinge,num_epochs=25, w_cel=1.0):
    
    global roc_auc_train
    global roc_auc_valid
    fac_cel=torch.tensor(w_cel)
    start = time.time()
    best_model_wts = copy.deepcopy(nn_model.state_dict())
    best_acc = 0.0
    # Store results to easier collect stats
    nn_y_pred = {x: np.zeros((dataset_sizes[x],1)) for x in ['train', 'val']}
    for epoch in range(num_epochs):
        # Each epoch has a training and validation phase
        for phase in ['train', 'val']:
            t0=time.time()
            # Reset to zero to be save
           
            nn_y_pred[phase].fill(0)
            if phase == 'train':
                nn_model.train()  # Set model to training mode
            else:
                nn_model.eval()   # Set model to evaluate mode
            running_loss = 0.0
            running_corrects = 0
            running_hinge = 0.0
            running_cxe = 0.0
            # Iterate over data.
            ix=0
            for inputs, targets in dataloaders[phase]:
                n_batch = len(targets)
                
                #nn_y_pred[phase][ix:ix+n_batch,:] = targets.detach().numpy().reshape(-1,1)
                inputs = inputs.to(device)
                targets = targets.to(device).float()
                # zero the parameter gradients
                optimizer_hinge.zero_grad()
                optimizer_cxe.zero_grad()
                # forward
                # track history if only in train
                with torch.set_grad_enabled(phase == 'train'):
                    output_target = nn_model.forward(inputs)
                    preds = torch.where((output_target > .5), 1, 0)
                    #print(output_target.squeeze(),targets)
                    ix += n_batch
                    loss_cxe = func.binary_cross_entropy(output_target.squeeze(), targets)
                    loss_hinge = criteron.forward(output_target.squeeze(), targets,nn_model)
                    # backward + optimize only if in training phase
                    if phase == 'train':
                        loss_hinge.backward()
                        optimizer_hinge.step()
                        #loss_cxe.backward()
                        optimizer_cxe.step()
                # statistics
                running_hinge += loss_hinge.item() * inputs.size(0)
                running_corrects += 1*(preds == targets.data.int()).sum().item()
                running_cxe += loss_cxe.item() * inputs.size(0)
            if phase == 'train':
                scheduler_hinge.step()
                scheduler_cxe.step()
            epoch_cxe = running_cxe / dataset_sizes[phase]
            epoch_hinge = running_hinge / dataset_sizes[phase]
            epoch_acc = running_corrects / dataset_sizes[phase]                      
            epoch_roc_auc = 0.0 
            if (phase == 'train'):
                ## Calculate 'false_positive_rate' and 'True_positive_rate' of train
    
                nn_fprs_train, nn_tpr_train, nn_thresholds = roc_curve(targets.detach().cpu().numpy(), output_target.squeeze().detach().cpu().numpy())
                fprs_net_train.append(nn_fprs_train)
                tprs_net_train.append(nn_tpr_train)
                roc_auc_train = round(auc(nn_fprs_train, nn_tpr_train), 4)  
                epoch_roc_auc = roc_auc_train
            elif (phase == 'val'):
                ## Calculate 'false_positive_rate' and 'True_positive_rate' of valid
                nn_fpr_valid, nn_tpr_valid, thresholds = roc_curve(targets.detach().cpu().numpy(), output_target.squeeze().detach().cpu().numpy())
                fprs_net_valid.append(nn_fpr_valid)
                tprs_net_valid.append(nn_tpr_valid)
                roc_auc_valid = round(auc(nn_fpr_valid, nn_tpr_valid), 4)
                epoch_roc_auc = roc_auc_valid
            dt=time.time() - t0
            fmt='{:6s} ROC_AUC: {:.4f} Acc: {:.4f} CXE: {:.4f} Hinge: {:.4f}  DT={:.1f}'
            out_list=[phase, epoch_roc_auc, epoch_acc, epoch_cxe, epoch_hinge] + [dt]
            out_str=fmt.format(*out_list)
            if phase=='train':
                epoch_str='Epoch {}/{} '.format(epoch, num_epochs)
                out_str=epoch_str + out_str
            else:
                out_str = ' '*len(epoch_str) + out_str
            print(out_str)
            if (phase == 'val') and epoch == num_epochs-1:
                 plt.plot(nn_fprs_train, nn_tpr_train, color='blue') 
                 plt.plot(nn_fpr_valid, nn_tpr_valid, color='orange')
                 plt.xlim([0.0,1.0])
                 plt.ylim([0.0,1.0])
                 plt.xlabel('False Positive Rate')
                 plt.ylabel('True Positive Rate')
                 plt.title(f'ROC Curve Comparison')
                 plt.legend([f'TrainRocAuc (AUC = {roc_auc_train})', f'TestRocAuc (AUC = {roc_auc_valid})'])
                 plt.show()
            convergence.loc[len(convergence)] = [epoch+epoch_offset,phase,   
                        epoch_roc_auc, epoch_acc, epoch_cxe, epoch_hinge]
            
            # deep copy the model
            if phase == 'val' and epoch_acc > best_acc:
                best_acc = epoch_acc
                best_model_wts = copy.deepcopy(nn_model.state_dict())
 
    time_elapsed = time.time() - start
    print('Training complete in {:.0f}m {:.0f}s'.format(
        time_elapsed // 60, time_elapsed % 60))
    print('Best val Acc: {:4f}'.format(best_acc))
    # load best model weights
    nn_model.load_state_dict(best_model_wts)   
    
# Run all Code cells first
optimizer_cxe = optim.Adam(nn_model.parameters(), lr=0.0001)
optimizer_hinge = torch.optim.SGD(nn_model.parameters(), lr=learning_rate,momentum = 0.5,weight_decay = 0.1)
nn_model = nn_model.cuda()
scheduler_cxe = lr_scheduler.StepLR(optimizer_cxe, step_size=10, gamma=0.1)  
scheduler_hinge= lr_scheduler.StepLR(optimizer_hinge, step_size=10, gamma=0.1)
criteron = SVMLoss()
train(optimizer_cxe,optimizer_hinge,criteron,scheduler_cxe,scheduler_hinge,num_epochs=num_epochs, w_cel=0.000000001)
t0=time.time()
date_time = datetime.now().strftime("--%Y-%m-%d-%H-%M-%S-%f")
pickle.dump(nn_model,open(DATA_DIR + '/' + CASE_NAME + date_time + '.p','wb'))
print('Pickled in {:.2f} sec'.format(time.time()-t0))
convergence.head(5)
from scipy.stats import iqr
def plot_convergence(figsize=(22,12)):
  conv = {phase : convergence[convergence.phase==phase] 
          for phase in ['train','val']}
  fig,axes = plt.subplots(2,2,figsize=figsize)
  cols = {'train' : 'tab:blue', 'val' : 'tab:orange'}
  # Loss
  ax=axes[0,0]
  for phase in ['train','val']:
    ax.plot(conv[phase].epoch,conv[phase].Hinge,label=phase,c=cols[phase])
  ax.set_xlabel('Epoch') 
  ax.set_ylabel('Hinge Loss')
  ax.legend()
  ax.grid()
  # CXE
  ax=axes[0,1]
  for phase in ['train','val']:
    ax.plot(conv[phase].epoch,conv[phase].CXE,label='CXE/'+phase,c=cols[phase])
  ax.set_xlabel('Epoch') 
  ax.set_ylabel('CXE')
  ax.legend()
  ax.grid()
  # Accuracy
  ax=axes[1,0]
  for phase in ['train','val']:
    ax.plot(conv[phase].epoch,conv[phase].accuracy,label='Acc/'+phase,c=cols[phase])
  ax.set_xlabel('Epoch') 
  ax.set_ylabel('Accuracy')
  ax.legend()
  ax.grid()
  # Plot ROC_AUC Curve of train and test
  ax=axes[1,1]
  for i in range(num_epochs):
      plt.plot(fprs_net_train[i],tprs_net_train[i], color='blue')
      plt.plot(fprs_net_valid[i],tprs_net_valid[i], color='orange')
  ax.set_xlim([0.0,1.0])
  ax.set_ylim([0.0,1.0])
  ax.set_xlabel('False Positive Rate')
  ax.set_ylabel('True Positive Rate')
  ax.set_title(f'ROC Curve Comparison')
  ax.legend([f'TrainRocAuc (AUC = {roc_auc_train})', f'TestRocAuc (AUC = {roc_auc_valid})'])
  ax.grid()
plot_convergence(figsize=(22,12))
  nn_model.eval()
  test_class_scores = nn_model(full_X_kaggle_gpu)
  print(test_class_scores[0:10])
  #For each SK_ID_CURR in the test set, you must predict a probability for the TARGET variable. 
  fs_type = "Multilayer_nn1"
  submit_df = datasets["application_test"][['SK_ID_CURR']]
  submit_df['TARGET'] = test_class_scores.detach().cpu().numpy()
  print(submit_df.head(2))
  submit_df.to_csv(f'/content/drive/My Drive/AML Project/Data/Phase3/submission_{fs_type}.csv',index=False)
Below is the resulting table for the various experiments we performed on the given dataset. Please refer section Final results for traditional models.
Single Layer Neural Network
Multi Layer Neural Network
As you can see in the Experimental Results Final results we have performed various feature selection technique like (RFE, PCA, Variance Threshold, SelectKBest, SMOTE) on certain specific models with 132 highly correlated features. Below is brief description of results attained in these experiments.
Our best model turned out to be Logistic Regression with SelectKBest with 74.86% ROC score. Our hopes were higher on XGBoost classifier but it stood out to be second best in our models.
Our Deep Learning of simple network preformed model better than the multilayer network. The ROC score came as 74.6% for the simple network. For multilayer network our score came as 59.38%.
Compared to traditional machine learning model, deep learning model trained on completed dataset much faster.
Below has more details or the various classifiers executed in this project.
Logistic Regression : This model was chosen as the baseline model trained with imbalanced dataset and later performed feature selection using RFE, SelectKBest, PCA & Variance Threshold technique on it. The baseline training accuracy for this model was encouraging which let us to perform the prior mentioned feature selection on these models. The best model for logistic regression we had is with Variance Threshold, with training accuracy as 92.56% and test accuracy as 92.2%. A 75.22% ROC score resulted with best parameters for this model. The same model was run with other feature selection performed very closer to the best model.
Gradient Boosting : Boosting didn't help in achieving better results than the baseline model. The results were not good enough to continue in implementing & evaluating other feature selection technique. Training accuracy of 94.75% and test accuracy of 91.95% was achieved in this model. Test ROC under the curve for this model came out to 72.12%
XGBoost : By far this model resulted in the second best model with RFE hence we continued to explore other feature selection techniques on this. The best performing model for XGBoost was with Variance Threshold. The accuracy of the training and test are 93.1% and test 92.36%. Test ROC under the curve is 73.88%. The other feature selection were very closer to the best XGBoost model. We also performed XGBoost with SMOTE as the dataset had oversampled records. The ROC score has promising result with 74.23%.
Light BGM : Our expectation was this model would give us better and faster results than XGBoost, however it was slightly lower compared to XGBoost. Both RFE and variance threshold feature selection resulted in same ROC score of 72.2. The training accuracy came as 92.81% and test accuracy 92.28% was achieved.
Random Forest : On our last decision tree models, the best Random Forest was with variance threshold which produced training accuracy of 92.51% and test accuracy of 92.36%. Test ROC score came out as 72.43%. Random forest performed better compared to LightBGM but lower than XGBoost.
SVM : This was the lowest performing model in our experiment. Hence we didn't decide to continue on SVM with other feature selection techniques. The ROC score achieved for this model was way lower i.e. is 67.21%.
In the final phase, after proving our hypothesis that tuned machine learning techniques can outperform baseline models to aid Home Credit in their evaluation of loan applications, we believe expanding our framework will create a more robust environment with improved performance.
Logistic regression, XGBoost, Random Forest and LightGBM were selected to run with RFE, PCA, SelectKBest and Variance Threshold for feature selection, and SMOTE for data imbalance. The best performance for each algorithm was included in the classification ensemble using soft voting. The resulting Kaggle score was 0.72592 ROC_AUC.
Single and Multi-layer deep learning models, including linear, sigmoid, ReLu, and hidden layers were added with binary CXE, custom hinge loss using adam & sgd optimizer. The deep learning Kaggle score fell short of the ensemble model; additional experimentation will result in a better performing deep learning models. By combining and continuing to refine our extended loss function, we can further demonstrate our effectiveness.
Phase - 1 : Kaggle Submission
Phase - 2 : Kaggle Submission
For phase-2, we did multiple submission in Kaggle with different feature setting. Below is the details.
Phase - 3 : Kaggle Submission Below submission were done for Feature Selection with RFE, PCA, Variance Threshold & XGBOOST SMOTE with early stoping..
For Deep Learning, we submitted below Kaggle submission.
Our best Kaggle score was based on Voting Classifier with SelectKBest feature selection.
Some of the material in this notebook has been adopted from following