German credit scorecard modeling with Python (code included, AAA recommended)

Python financial risk-control scorecard modeling and data analysis micro-specialization course: http://dwz.date/b9vv

The goal is to minimize risk and maximize profit on behalf of the bank.
To minimize loss from the bank's perspective, the bank needs a decision rule for whom to approve a loan and whom to refuse. Loan managers consider an applicant's demographic and socio-economic profile before deciding on a loan application.
The German Credit Data contains 20 variables for each of 1000 loan applicants, plus a classification of each applicant as a Good or a Bad credit risk. The data can be downloaded from the UCI link below (right-click and "save as"). A predictive model developed on this data is expected to guide a bank manager in deciding whether to approve a loan to a prospective applicant based on his/her profile.
Credit scoring system application
http://archive.ics.uci.edu/ml/datasets/Statlog+(German+Credit+Data)

account balance
duration of credit (card holding period)

Data Set Information:
Two datasets are provided. The original dataset, in the form provided by Prof. Hofmann, contains categorical/symbolic attributes and is in the file "german.data".
For algorithms that need numerical attributes, Strathclyde University produced the file "german.data-numeric". This file has been edited and several indicator variables added to make it suitable for algorithms which cannot cope with categorical variables. Several attributes that are ordered categorical (such as attribute 17) have been coded as integer. This was the form used by StatLog.
This dataset requires use of a cost matrix (see below):

              predicted 1    predicted 2
actual 1           0              1
actual 2           5              0

(1 = Good, 2 = Bad)
The rows represent the actual classification and the columns the predicted classification.
It is worse to class a customer as good when they are bad (5), than it is to class a customer as bad when they are good (1).
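The cost matrix can be applied directly to a list of predictions. The following is a minimal sketch in plain Python; the function name `total_cost` and the six example labels are made up for illustration and do not come from the dataset.

```python
# Cost matrix from the dataset description:
# key = (actual class, predicted class), with 1 = Good and 2 = Bad.
COST = {
    (1, 1): 0,  # good customer predicted good
    (1, 2): 1,  # good customer predicted bad
    (2, 1): 5,  # bad customer predicted good
    (2, 2): 0,  # bad customer predicted bad
}


def total_cost(actual, predicted):
    """Sum the misclassification cost over paired actual/predicted labels."""
    return sum(COST[(a, p)] for a, p in zip(actual, predicted))


# Hypothetical labels for six applicants (not taken from the dataset).
actual = [1, 1, 2, 2, 1, 2]
predicted = [1, 2, 1, 2, 1, 2]

# One good->bad error (cost 1) plus one bad->good error (cost 5).
print(total_cost(actual, predicted))  # 6
```

Two models with the same accuracy can therefore have very different total cost, depending on which kind of error they make.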
Attribute Information:
Attribute 1: (qualitative)
Status of existing checking account
A11 : ... < 0 DM
A12 : 0 <= ... < 200 DM
A13 : ... >= 200 DM / salary assignments for at least 1 year
A14 : no checking account
Attribute 2: (numerical)
Duration in month
Attribute 3: (qualitative)
Credit history
A30 : no credits taken/ all credits paid back duly
A31 : all credits at this bank paid back duly
A32 : existing credits paid back duly till now
A33 : delay in paying off in the past
A34 : critical account/ other credits existing (not at this bank)
Attribute 4: (qualitative)
Purpose
A40 : car (new)
A41 : car (used)
A42 : furniture/equipment
A43 : radio/television
A44 : domestic appliances
A45 : repairs
A46 : education
A47 : (vacation - does not exist?)
A48 : retraining
A49 : business
A410 : others
Attribute 5: (numerical)
Credit amount
Attribute 6: (qualitative)
Savings account/bonds
A61 : ... < 100 DM
A62 : 100 <= ... < 500 DM
A63 : 500 <= ... < 1000 DM
A64 : ... >= 1000 DM
A65 : unknown/ no savings account
Attribute 7: (qualitative)
Present employment since
A71 : unemployed
A72 : ... < 1 year
A73 : 1 <= ... < 4 years
A74 : 4 <= ... < 7 years
A75 : ... >= 7 years
Attribute 8: (numerical)
Installment rate in percentage of disposable income
Attribute 9: (qualitative)
Personal status and sex
A91 : male : divorced/separated
A92 : female : divorced/separated/married
A93 : male : single
A94 : male : married/widowed
A95 : female : single
Attribute 10: (qualitative)
Other debtors / guarantors
A101 : none
A102 : co-applicant
A103 : guarantor
Attribute 11: (numerical)
Present residence since
Attribute 12: (qualitative)
Property
A121 : real estate
A122 : if not A121 : building society savings agreement/ life insurance
A123 : if not A121/A122 : car or other, not in attribute 6
A124 : unknown / no property
Attribute 13: (numerical)
Age in years
Attribute 14: (qualitative)
Other installment plans
A141 : bank
A142 : stores
A143 : none
Attribute 15: (qualitative)
Housing
A151 : rent
A152 : own
A153 : for free
Attribute 16: (numerical)
Number of existing credits at this bank
Attribute 17: (qualitative)
Job
A171 : unemployed/ unskilled - non-resident
A172 : unskilled - resident
A173 : skilled employee / official
A174 : management/ self-employed/ highly qualified employee/ officer
Attribute 18: (numerical)
Number of people being liable to provide maintenance for
Attribute 19: (qualitative)
Telephone
A191 : none
A192 : yes, registered under the customer's name
Attribute 20: (qualitative)
Foreign worker
A201 : yes
A202 : no
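The attribute list above maps onto the columns of "german.data", which is space separated and has no header row. The sketch below shows one way to load it with pandas, assuming that layout. The column names are hypothetical (the file defines none), and the two sample rows are only illustrative rows in the file's format, used so the example runs without the download.

```python
import io

import pandas as pd

# Hypothetical column names, one per attribute plus the class label.
COLUMNS = [
    "checking_status", "duration_months", "credit_history", "purpose",
    "credit_amount", "savings", "employment_since", "installment_rate",
    "personal_status_sex", "other_debtors", "residence_since", "property",
    "age_years", "other_installment_plans", "housing", "existing_credits",
    "job", "dependents", "telephone", "foreign_worker", "credit_risk",
]
# Attributes 2, 5, 8, 11, 13, 16 and 18 are numerical; the rest are coded A11, A34, ...
NUMERIC = [
    "duration_months", "credit_amount", "installment_rate", "residence_since",
    "age_years", "existing_credits", "dependents",
]
CATEGORICAL = [c for c in COLUMNS[:-1] if c not in NUMERIC]

# Two illustrative rows: 20 attributes followed by the label (1 = Good, 2 = Bad).
SAMPLE = io.StringIO(
    "A11 6 A34 A43 1169 A65 A75 4 A93 A101 4 A121 67 A143 A152 2 A173 1 A192 A201 1\n"
    "A12 48 A32 A43 5951 A61 A73 2 A92 A101 2 A121 22 A143 A152 1 A173 1 A191 A201 2\n"
)

# For the real file, pass the path to german.data instead of SAMPLE.
df = pd.read_csv(SAMPLE, sep=" ", header=None, names=COLUMNS)

# Recode the label so that 1 marks a bad credit risk and 0 a good one.
df["bad"] = (df["credit_risk"] == 2).astype(int)

# One-hot encode the qualitative attributes for models that need numbers.
X = pd.get_dummies(df.drop(columns=["credit_risk", "bad"]), columns=CATEGORICAL)

print(df[["checking_status", "duration_months", "credit_amount", "bad"]])
print(X.shape)
```

Tree-based models can work with integer codes instead of one-hot columns; the encoding shown here is only one option.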


randomForest.py
Random forest with 1000 trees:
accuracy on the training subset: 1.000
accuracy on the test subset: 0.772
The test accuracy is higher than the decision tree's.
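The random forest script itself is not listed in this post. A minimal sketch of such a model, assuming scikit-learn, could look like the following. It uses synthetic stand-in data from `make_classification` so that it runs on its own; with the real data, `x` and `y` would come from German_credit.xlsx, and the accuracies would differ from the ones quoted above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the credit data: 1000 rows, 20 features,
# roughly a 30% / 70% class split. Replace with the real x and y.
x, y = make_classification(
    n_samples=1000, n_features=20, n_informative=8,
    weights=[0.3, 0.7], random_state=0,
)
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=0)

# 1000 trees, as in the result quoted above; n_jobs=-1 uses all CPU cores.
forest = RandomForestClassifier(n_estimators=1000, random_state=0, n_jobs=-1)
forest.fit(x_train, y_train)

train_acc = forest.score(x_train, y_train)
test_acc = forest.score(x_test, y_test)
print("accuracy on the training subset:{:.3f}".format(train_acc))
print("accuracy on the test subset:{:.3f}".format(test_acc))
```

A training accuracy near 1.000 is normal for fully grown trees, so the test accuracy is the figure to compare between models.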



Earlier result, for comparison: decision tree (tree diagram drawn by the author)

The accuracy is low and the model overfits severely.
accuracy on the training subset: 0.991
accuracy on the test subset: 0.680
# -*- coding: utf-8 -*-
"""
Author's WeChat public account: pythonEducation
@author: 231469242@qq.com
"""
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_graphviz

# Read the Excel file
readFileName="German_credit.xlsx"
df=pd.read_excel(readFileName)
# All columns except the last are features; the last column is the label
x=df.iloc[:,:-1]
y=df.iloc[:,-1]
names=list(x.columns)
x_train,x_test,y_train,y_test=train_test_split(x,y,random_state=0)

# Parameter tuning: limiting max_depth reduces the complexity of the tree
list_average_accuracy=[]
depth=range(1,30)
for i in depth:
    tree=DecisionTreeClassifier(max_depth=i,random_state=0)
    tree.fit(x_train,y_train)
    accuracy_training=tree.score(x_train,y_train)
    accuracy_test=tree.score(x_test,y_test)
    # Each depth is ranked by the mean of its training and test accuracy
    average_accuracy=(accuracy_training+accuracy_test)/2.0
    list_average_accuracy.append(average_accuracy)

max_value=max(list_average_accuracy)
# List indices start at 0 while depths start at 1, so add 1
best_depth=list_average_accuracy.index(max_value)+1
print("best_depth:",best_depth)
best_tree=DecisionTreeClassifier(max_depth=best_depth,random_state=0)
best_tree.fit(x_train,y_train)
print("decision tree:")
print("accuracy on the training subset:{:.3f}".format(best_tree.score(x_train,y_train)))
print("accuracy on the test subset:{:.3f}".format(best_tree.score(x_test,y_test)))

# Bar chart of the feature importances
n_features=x.shape[1]
plt.barh(range(n_features),best_tree.feature_importances_,align='center')
plt.yticks(np.arange(n_features),names)
plt.title("Decision Tree:")
plt.xlabel('Feature Importance')
plt.ylabel('Feature')
plt.show()

# Write a dot file; it can be rendered to an image later from the command line
export_graphviz(best_tree,out_file="creditTree.dot",class_names=['bad','good'],feature_names=names,impurity=False,filled=True)
'''
best_depth: 12
decision tree:
accuracy on the training subset:0.991
accuracy on the test subset:0.680
'''
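The script above picks `max_depth` by averaging training and test accuracy, which lets the test set influence the choice and favors deep, overfitted trees (hence best_depth 12 with 0.991 versus 0.680). A common alternative is to choose the depth by cross-validation on the training data only. The sketch below assumes scikit-learn's `GridSearchCV` and uses synthetic stand-in data; it is not the author's method, and its numbers will not match the ones above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the credit data; replace with the real x and y.
x, y = make_classification(
    n_samples=1000, n_features=20, n_informative=8, random_state=0,
)
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=0)

# Try the same depths as the script above, scored by 5-fold
# cross-validation inside the training set.
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": list(range(1, 30))},
    cv=5,
    scoring="accuracy",
)
search.fit(x_train, y_train)

best_depth = search.best_params_["max_depth"]
# The test set is touched only once, to report the final accuracy.
test_acc = search.score(x_test, y_test)
print("best_depth:", best_depth)
print("accuracy on the test subset:{:.3f}".format(test_acc))
```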
Support vector machine: the highest prediction accuracy
accuracy on the scaled training subset: 0.867
accuracy on the scaled test subset: 0.800, which beats the random forest by 0.800 - 0.772 = 0.028
# -*- coding: utf-8 -*-
"""
Created on Fri Mar 30 21:57:29 2018
Author's WeChat public account: pythonEducation
@author: 231469242@qq.com
An SVM needs standardized input data.
"""
import matplotlib.pyplot as plt
import pandas as pd
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Read the Excel file
readFileName="German_credit.xlsx"
df=pd.read_excel(readFileName)
# All columns except the last are features; the last column is the label
x=df.iloc[:,:-1]
y=df.iloc[:,-1]
names=x.columns
# random_state acts as the random seed
X_train,x_test,y_train,y_test=train_test_split(x,y,stratify=y,random_state=42)
# gamma="auto" was the SVC default in scikit-learn releases before 0.22
svm=SVC(gamma="auto")
svm.fit(X_train,y_train)
print("accuracy on the training subset:{:.3f}".format(svm.score(X_train,y_train)))
print("accuracy on the test subset:{:.3f}".format(svm.score(x_test,y_test)))
'''
accuracy on the training subset:1.000
accuracy on the test subset:0.700
'''
# Check whether the features are on comparable scales
plt.plot(X_train.min(axis=0).values,'o',label='Min')
plt.plot(X_train.max(axis=0).values,'v',label='Max')
plt.xlabel('Feature Index')
plt.ylabel('Feature magnitude in log scale')
plt.yscale('log')
plt.legend(loc='upper right')
plt.show()

# Standardize the data
# Note: scale() standardizes each subset with its own mean and standard deviation
X_train_scaled = preprocessing.scale(X_train)
x_test_scaled = preprocessing.scale(x_test)
svm1=SVC(gamma="auto")
svm1.fit(X_train_scaled,y_train)
print("accuracy on the scaled training subset:{:.3f}".format(svm1.score(X_train_scaled,y_train)))
print("accuracy on the scaled test subset:{:.3f}".format(svm1.score(x_test_scaled,y_test)))
'''
accuracy on the scaled training subset:0.867
accuracy on the scaled test subset:0.800
'''
# Tune by changing C; kernel selects the kernel function used to map the data,
# probability=True enables probability estimates
svm2=SVC(C=10,gamma="auto",kernel='rbf',probability=True)
svm2.fit(X_train_scaled,y_train)
print("after c parameter=10,accuracy on the scaled training subset:{:.3f}".format(svm2.score(X_train_scaled,y_train)))
print("after c parameter=10,accuracy on the scaled test subset:{:.3f}".format(svm2.score(x_test_scaled,y_test)))
'''
after c parameter=10,accuracy on the scaled training subset:0.972
after c parameter=10,accuracy on the scaled test subset:0.716
'''
# Functional distance from each sample to the separating hyperplane
#print (svm2.decision_function(X_train_scaled))
#print (svm2.decision_function(X_train_scaled)[:20]>0)
# Class labels known to the SVM
#print(svm2.classes_)
# Class probabilities, one column per class
#print(svm2.predict_proba(x_test_scaled))
# Predicted class of each sample, given as 0 or 1
#print(svm2.predict(x_test_scaled))
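In the script above the training and test subsets are standardized separately. A stricter setup learns the scaling from the training rows only and applies the same transformation to the test rows. One way to get that automatically is a scikit-learn `Pipeline`, sketched below on synthetic stand-in data (the feature rescaling is only there to mimic columns with very different magnitudes). This is an alternative arrangement, not the code that produced the figures above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the credit data; replace with the real x and y.
x, y = make_classification(n_samples=1000, n_features=20, random_state=42)
# Give the columns very different magnitudes (1, 10, 100, 1000, ...).
x = x * [10 ** (i % 4) for i in range(20)]

x_train, x_test, y_train, y_test = train_test_split(
    x, y, stratify=y, random_state=42,
)

# The scaler and the SVM are chained into one estimator.
model = make_pipeline(StandardScaler(), SVC(C=10, gamma="auto", kernel="rbf"))
model.fit(x_train, y_train)  # the scaler is fitted on x_train only

train_acc = model.score(x_train, y_train)
test_acc = model.score(x_test, y_test)  # x_test reuses the training mean and std
print("accuracy on the scaled training subset:{:.3f}".format(train_acc))
print("accuracy on the scaled test subset:{:.3f}".format(test_acc))
```

The same pipeline object can be passed to cross-validation helpers, which then refit the scaler inside every fold.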
Neural network
It performs worse than the support vector machine and the random forest.
Best result:
accuracy on the training subset: 0.916
accuracy on the test subset: 0.720

# -*- coding: utf-8 -*-
"""
Created on Sun Apr 1 11:49:50 2018
Author's WeChat public account: pythonEducation
@author: 231469242@qq.com
A neural network needs preprocessed (standardized) data.
"""
# Multi-layer perceptron
from sklearn.neural_network import MLPClassifier
# Standardize the data; otherwise the network gives poor results, as with the SVM
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import pandas as pd

# Read the Excel file
readFileName="German_credit.xlsx"
df=pd.read_excel(readFileName)
# All columns except the last are features; the last column is the label
x=df.iloc[:,:-1]
y=df.iloc[:,-1]
names=x.columns
# random_state acts as the random seed
x_train,x_test,y_train,y_test=train_test_split(x,y,stratify=y,random_state=42)
mlp=MLPClassifier(random_state=42)
mlp.fit(x_train,y_train)
print("neural network:")
print("accuracy on the training subset:{:.3f}".format(mlp.score(x_train,y_train)))
print("accuracy on the test subset:{:.3f}".format(mlp.score(x_test,y_test)))
scaler=StandardScaler()
x_train_scaled=scaler.fit(x_train).transform(x_train)
# Note: the scaler is refitted on the test subset here, so the test data is
# standardized with its own mean and standard deviation
x_test_scaled=scaler.fit(x_test).transform(x_test)
mlp_scaled=MLPClassifier(max_iter=1000,random_state=42)
mlp_scaled.fit(x_train_scaled,y_train)
print("neural network after scaled:")
print("accuracy on the training subset:{:.3f}".format(mlp_scaled.score(x_train_scaled,y_train)))
print("accuracy on the test subset:{:.3f}".format(mlp_scaled.score(x_test_scaled,y_test)))
# alpha is the L2 regularization strength; a larger value reduces overfitting
mlp_scaled2=MLPClassifier(max_iter=1000,alpha=1,random_state=42)
mlp_scaled2.fit(x_train_scaled,y_train)
print("neural network after scaled and alpha change to 1:")
print("accuracy on the training subset:{:.3f}".format(mlp_scaled2.score(x_train_scaled,y_train)))
print("accuracy on the test subset:{:.3f}".format(mlp_scaled2.score(x_test_scaled,y_test)))

# Heat map of the first-layer weights (one row per input feature)
plt.figure(figsize=(20,5))
plt.imshow(mlp_scaled.coefs_[0],interpolation="none",cmap="GnBu")
plt.yticks(range(len(names)),names)
plt.xlabel("columns in weight matrix")
plt.ylabel("input feature")
plt.colorbar()
plt.show()

'''
neural network:
accuracy on the training subset:0.700
accuracy on the test subset:0.700
neural network after scaled:
accuracy on the training subset:1.000
accuracy on the test subset:0.704
neural network after scaled and alpha change to 1:
accuracy on the training subset:0.916
accuracy on the test subset:0.720
'''
xgboost
Its discriminative power is acceptable.
AUC: 0.8134
ACC: 0.7720
Recall: 0.9521
F1-score: 0.8480
Precision: 0.7644
# -*- coding: utf-8 -*-
"""
Created on Tue Apr 24 22:42:47 2018
Author's WeChat public account: pythonEducation
@author: 231469242@qq.com
Workaround for "module 'xgboost' has no attribute 'DMatrix'":
Beginners, or anyone not yet familiar with Python, tend to make this mistake.
The rule is simple: never use a module name as a file name, for any type of file.
My error came from files named xgboost.* in the working folder; "import xgboost"
searches the current folder first, which is what caused the problem.
So, once again: do not use any module name as a file name!
"""
import pandas as pd
import xgboost as xgb
from sklearn import metrics
from sklearn.model_selection import train_test_split

# Read the Excel file
readFileName="German_credit.xlsx"
df=pd.read_excel(readFileName)
# All columns except the last are features; the last column is the label
x=df.iloc[:,:-1]
y=df.iloc[:,-1]
names=x.columns
train_x, test_x, train_y, test_y=train_test_split(x,y,random_state=0)
dtrain=xgb.DMatrix(train_x,label=train_y)
dtest=xgb.DMatrix(test_x)
params={'booster':'gbtree',
    #'objective': 'reg:linear',
    'objective': 'binary:logistic',
    'eval_metric': 'auc',
    'max_depth':4,
    'lambda':10,
    'subsample':0.75,
    'colsample_bytree':0.75,
    'min_child_weight':2,
    'eta': 0.025,
    'seed':0,
    'nthread':8,
    # 'verbosity' replaces the deprecated 'silent' parameter
    'verbosity':0}
watchlist = [(dtrain,'train')]
bst=xgb.train(params,dtrain,num_boost_round=100,evals=watchlist)
# Predicted probability of class 1 for every test row
ypred=bst.predict(dtest)
# Apply a 0.5 threshold and print some evaluation metrics
y_pred = (ypred >= 0.5)*1
# Model validation
print ('AUC: %.4f' % metrics.roc_auc_score(test_y,ypred))
print ('ACC: %.4f' % metrics.accuracy_score(test_y,y_pred))
print ('Recall: %.4f' % metrics.recall_score(test_y,y_pred))
print ('F1-score: %.4f' %metrics.f1_score(test_y,y_pred))
print ('Precision: %.4f' %metrics.precision_score(test_y,y_pred))
print (metrics.confusion_matrix(test_y,y_pred))
print("xgboost:")
# get_fscore() counts how often each feature is used in a split
print('Feature importances:{}'.format(bst.get_fscore()))
'''
AUC: 0.8135
ACC: 0.7640
Recall: 0.9641
F1-score: 0.8451
Precision: 0.7523
# The feature importances are similar to the random forest's
Feature importances:{'Account Balance': 80, 'Duration of Credit (month)': 119,
 'Most valuable available asset': 54, 'Payment Status of Previous Credit': 84,
 'Value Savings/Stocks': 66, 'Age (years)': 94, 'Credit Amount': 149,
 'Type of apartment': 20, 'Instalment per cent': 37,
 'Length of current employment': 70, 'Sex & Marital Status': 29,
 'Purpose': 67, 'Occupation': 13, 'Duration in Current address': 25,
 'Telephone': 15, 'Concurrent Credits': 23, 'No of Credits at this Bank': 7,
 'Guarantors': 28, 'No of dependents': 6}
'''
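Accuracy, recall, precision and F1 all follow from the four cells of a confusion matrix. The sketch below works through the arithmetic in plain Python. The counts are hypothetical: they were chosen to be consistent with the figures printed above for a 250-row test set (the default 25% split of 1000 rows) and are not taken from the author's run.

```python
# Hypothetical confusion-matrix counts for a 250-row test set.
tp, fn = 161, 6   # actual positives: predicted positive / predicted negative
fp, tn = 53, 30   # actual negatives: predicted positive / predicted negative

accuracy = (tp + tn) / (tp + fn + fp + tn)
precision = tp / (tp + fp)   # share of predicted positives that are correct
recall = tp / (tp + fn)      # share of actual positives that are found
f1 = 2 * precision * recall / (precision + recall)

print("ACC: %.4f" % accuracy)         # ACC: 0.7640
print("Recall: %.4f" % recall)        # Recall: 0.9641
print("F1-score: %.4f" % f1)          # F1-score: 0.8451
print("Precision: %.4f" % precision)  # Precision: 0.7523
```

With counts like these, almost all errors are false positives: recall is very high, but most of the actual negatives are predicted as positive, which accuracy alone does not reveal.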
Final conclusion:
xgboost's feature importance analysis is sometimes even more accurate than the random forest's, which shows how powerful it is.

Top factors by random forest    xgboost weight
Credit amount                   149
Age                             94
Account balance                 80
Duration of credit              119

(On duration of credit: the time a credit card may stay overdue differs from bank to bank; at China Merchants Bank, for example, the card is suspended after two months.)
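The xgboost ranking can be read directly from the fscore dictionary printed by the xgboost script above. A small sketch in plain Python, using the values reported there:

```python
# Split counts per feature, copied from the xgboost output reported above.
fscore = {
    'Account Balance': 80, 'Duration of Credit (month)': 119,
    'Most valuable available asset': 54, 'Payment Status of Previous Credit': 84,
    'Value Savings/Stocks': 66, 'Age (years)': 94, 'Credit Amount': 149,
    'Type of apartment': 20, 'Instalment per cent': 37,
    'Length of current employment': 70, 'Sex & Marital Status': 29,
    'Purpose': 67, 'Occupation': 13, 'Duration in Current address': 25,
    'Telephone': 15, 'Concurrent Credits': 23, 'No of Credits at this Bank': 7,
    'Guarantors': 28, 'No of dependents': 6,
}

# Sort the features from most to least frequently used in tree splits.
ranking = sorted(fscore.items(), key=lambda item: item[1], reverse=True)

for name, count in ranking[:5]:
    print("{:<36}{}".format(name, count))
```

The five most used features are Credit Amount, Duration of Credit (month), Age (years), Payment Status of Previous Credit and Account Balance. A feature that never appears in a split is absent from the dictionary, which is why it holds 19 entries rather than 20.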
Update, 2018-9-18
The logistic regression validation results are close to the catboost validation results, which shows how stable logistic regression is.
# -*- coding: utf-8 -*-
"""
Author's e-mail: 231469242@qq.com
Author's WeChat public account: pythonEducation
Technical documentation:
https://www.cnblogs.com/webRobot/p/7216614.html
model accuracy is: 0.755
model precision is: 0.697841726618705
model sensitivity is: 0.3233333333333333
f1_score: 0.44191343963553525
AUC: 0.7626619047619048

After dropping variables by IV value, the predictions are worse than with all variables kept:
model accuracy is: 0.724
model precision is: 0.61320754717
model sensitivity is: 0.216666666667
f1_score: 0.320197044335
AUC: 0.7031
good classifier

Results with the original German_credit data:
accuracy on the training subset:0.777
accuracy on the test subset:0.740
A: 6.7807190511263755
B: 14.426950408889635
model accuracy is: 0.74
model precision is: 0.7037037037037037
model sensitivity is: 0.38
f1_score: 0.49350649350649356
AUC: 0.7885
"""
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn import metrics
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, auc, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split

#df_german=pd.read_excel("german_woe.xlsx")
df_german=pd.read_excel("german_credit.xlsx")
#df_german=pd.read_excel("df_after_vif.xlsx")
y=df_german["target"]
x=df_german.loc[:,"Account Balance":"Foreign Worker"]
#x=df_german.loc[:,"Credit Amount":"Purpose"]
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=0)
classifier = LogisticRegression()
classifier.fit(X_train, y_train)
predictions = classifier.predict(X_test)
# Validation
print("accuracy on the training subset:{:.3f}".format(classifier.score(X_train,y_train)))
print("accuracy on the test subset:{:.3f}".format(classifier.score(X_test,y_test)))

# Scorecard constants: score = A - B*ln(odds), where odds = p/(1-p)
P0 = 50
PDO = 10
theta0 = 1.0/20
B = PDO/np.log(2)
A = P0 + B*np.log(theta0)
print("A:",A)
print("B:",B)

def Score(probability):
    # Natural logarithm (base e) of the odds
    score = A-B*np.log(probability/(1-probability))
    return score

# Scores for a whole list of probabilities
def List_score(pos_probablity_list):
    return [Score(probability) for probability in pos_probablity_list]

list_coef = list(classifier.coef_[0])
intercept= classifier.intercept_

# Predicted probabilities for all rows of x; column 0 is the good customer, column 1 the bad customer
probablity_list=classifier.predict_proba(x)
# Predicted probability of being a bad customer
pos_probablity_list=[i[1] for i in probablity_list]
# Scores of all customers
list_score=List_score(pos_probablity_list)
list_predict=classifier.predict(x)
df_result=pd.DataFrame({"label":y,"predict":list_predict,"pos_probablity":pos_probablity_list,"score":list_score})
df_result.to_excel("score_proba.xlsx")
# Feature names, in the same order as the coefficients
list_vNames=list(x.columns)
df_coef=pd.DataFrame({"variable_names":list_vNames,"coef":list_coef})
df_coef.to_excel("coef.xlsx")

y_true=y_test
y_pred=classifier.predict(X_test)
accuracyScore = accuracy_score(y_true, y_pred)
print('model accuracy is:',accuracyScore)
# Precision: TP/(TP+FP), true positives / (true positives + false positives)
precision=precision_score(y_true, y_pred)
print('model precision is:',precision)
# Recall (sensitivity): TP/(TP+FN)
sensitivity=recall_score(y_true, y_pred)
print('model sensitivity is:',sensitivity)

# F1 = 2 x (precision x recall) / (precision + recall)
# The F1 score is the harmonic mean of precision and recall; its best value is 1 and its worst is 0
f1Score=f1_score(y_true, y_pred)
print("f1_score:",f1Score)

def AUC(y_true, y_scores):
    # Compute the ROC curve first, then the area under it with auc(fpr,tpr)
    fpr, tpr, thresholds = metrics.roc_curve(y_true, y_scores, pos_label=1)
    auc_value= auc(fpr,tpr)
    # A high score means a low probability of class 1, so the raw value
    # falls below 0.5 and is mirrored
    if auc_value<0.5:
        auc_value=1-auc_value
    return auc_value

def Draw_roc(auc_value):
    fpr, tpr, thresholds = metrics.roc_curve(y, list_score, pos_label=0)
    # Diagonal reference line
    plt.plot([0, 1], [0, 1], '--', color=(0.6, 0.6, 0.6), label='Diagonal line')
    plt.plot(fpr,tpr,label='ROC curve (area = %0.2f)' % auc_value)
    plt.title('ROC curve')
    plt.legend(loc="lower right")
    plt.show()

# Rate the AUC
def AUC_performance(auc_value):
    if auc_value>=0.7:
        print("good classifier")
    elif auc_value>0.6:
        print("not very good classifier")
    elif auc_value>0.5:
        print("useless classifier")
    else:
        print("bad classifier,with sorting problems")

# AUC check, computed from the scores of all customers (x and y above)
auc_value=AUC(y, list_score)
print("AUC:",auc_value)
# Rate the AUC
AUC_performance(auc_value)
# Draw the ROC curve
Draw_roc(auc_value)
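The constants in the script above encode a simple rule: a customer whose odds of being bad equal theta0 (1:20) receives P0 = 50 points, and every doubling of those odds lowers the score by PDO = 10 points. The sketch below checks this with the standard library only; the names `score`, `p_ref` and `p_double` are made up for the illustration.

```python
import math

P0 = 50          # score assigned at the reference odds
PDO = 10         # points lost each time the odds of being bad double
THETA0 = 1 / 20  # reference odds, bad : good

B = PDO / math.log(2)
A = P0 + B * math.log(THETA0)


def score(p_bad):
    """Map a predicted probability of being a bad customer to a score."""
    odds = p_bad / (1 - p_bad)
    return A - B * math.log(odds)


# Probabilities that correspond to odds of 1:20 and 2:20.
p_ref = THETA0 / (1 + THETA0)
p_double = 2 * THETA0 / (1 + 2 * THETA0)

print(round(A, 4), round(B, 4))   # 6.7807 14.427
print(round(score(p_ref), 6))     # 50.0
print(round(score(p_double), 6))  # 40.0
```

The values of A and B match the ones printed by the script (6.78 and 14.43), and a higher predicted probability of default always yields a lower score.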
Author's online school homepage: http://dwz.date/bwes
