Banking Case Study Example, Part 4: IV and WOE

http://ucanalytics.com/blogs/information-value-and-weight-of-evidencebanking-case/
This is a continuation of our banking case study on scorecard development. In this part, we will discuss information value (IV) and weight of evidence (WOE). These concepts are useful for variable selection while developing credit scorecards. We will also learn how to use weight of evidence in logistic regression modeling. The previous three parts are available here: (Part 1), (Part 2) & (Part 3).
Experts in Expensive Suits

A couple of weeks ago I was watching this show called ‘Brain Games’ on the National Geographic Channel. In one of the segments, they had a comedian dressed up as a television news reporter. He had a whole television camera crew along with him. He was informing the people coming out of a mall in California that Texas has decided to form an independent country, not part of the United States. Additionally, while on camera he was asking for their opinion on the matter. After the initial amusement, people took him seriously and started giving their serious viewpoints. This is the phenomenon psychologists describe as ‘expert fallacy’ or obeying authority, no matter how irrational the authorities seem. Later after learning the truth, the people on this show agreed that they believed this comedian because he was in an expensive suit with a TV crew.
Nate Silver, in his book The Signal and The Noise, described a similar phenomenon. He analyzed the forecasts made by the panel of experts on the TV program The McLaughlin Group. The forecasts turned out to be true in only 50% of cases; you could have forecast just as well by tossing a coin. We do take experts in expensive suits seriously, don’t we? These are not one-off examples. Men in suits or uniforms come in all different forms, from army generals to security personnel in malls. We take them all very seriously.
We have just discovered that rather than accepting an expert’s opinion, it would be better to look at the value of the information and make decisions oneself. Let us continue with this theme and explore how to assign a value to information using information value and weight of evidence. Then we will create a simple logistic regression model using WOE (weight of evidence). However, before that, let us recap the case study we are working on.
Case Study Continues
This is a continuation of our case study on CyndiCat bank. The bank had disbursed 60816 auto loans, with a bad rate of around 2.5%, in the quarter April–June 2012. We did some exploratory data analysis (EDA) using data visualization tools in the first two parts (Part 1) & (Part 2). In the previous article, we developed a simple logistic regression model with just age as the variable (Part 3). This time, we will continue from where we left off and use weight of evidence (WOE) for age to develop a new model. Additionally, we will explore the predictive power of the variable (age) through information value.
Information Value (IV) and Weight of Evidence (WOE)
Information value is a very useful concept for variable selection during model building. The roots of information value, I think, are in the information theory proposed by Claude Shannon. The reason for my belief is the similarity information value has with entropy, a widely used concept in information theory. The chi-square value, an extensively used measure in statistics, is a good alternative to IV (information value). However, IV is the popular and widely used measure in the industry. The reason is some very convenient rules of thumb for variable selection associated with IV; these are really handy, as you will discover later in this article. The formula for information value is shown below.

$$\mathrm{IV} = \sum_{i} \left(DG_i - DB_i\right) \times \ln\!\left(\frac{DG_i}{DB_i}\right)$$

Here $DG_i$ and $DB_i$ are the distribution good and distribution bad for coarse class $i$.

What distribution good and distribution bad mean will soon become clear when we calculate IV for our case study. This is probably an opportune moment to define weight of evidence (WOE), which is the log component of information value.

$$\mathrm{WOE}_i = \ln\!\left(\frac{DG_i}{DB_i}\right)$$

Hence, IV can further be written as follows.

$$\mathrm{IV} = \sum_{i} \left(DG_i - DB_i\right) \times \mathrm{WOE}_i$$

If you examine both information value and weight of evidence carefully, you will notice that both break down when either distribution good or distribution bad goes to zero for a class: the ratio inside the logarithm becomes zero or undefined. A mathematician will hate it. The assumption, a fair one, is that this will never happen during scorecard development because of the reasonable sample size. A word of caution: if you are developing non-standardized scorecards with a smaller sample size, use IV carefully.
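The definitions above can be sketched as a small Python function. This is only an illustration, not the article's own implementation: the function name `woe_iv` and the optional `smoothing` argument (a small constant added to every count so that no class has zero goods or bads) are assumptions introduced here. With the default of no smoothing, the function refuses to compute WOE for a class with a zero count, which is the breakdown described above.

```python
import math


def woe_iv(goods, bads, smoothing=0.0):
    """Return (woe_per_class, iv_components, total_iv) for coarse classes.

    goods, bads: good and bad loan counts, one entry per coarse class.
    smoothing: hypothetical additive adjustment applied to every count;
    it is an assumption of this sketch, not part of the formulas above.
    """
    g = [x + smoothing for x in goods]
    b = [x + smoothing for x in bads]
    total_g, total_b = sum(g), sum(b)

    woes, components = [], []
    for gi, bi in zip(g, b):
        if gi == 0 or bi == 0:
            # ln(DG/DB) is undefined when either distribution is zero.
            raise ValueError("WOE is undefined for a class with zero goods or bads")
        dg = gi / total_g          # distribution good for this class
        db = bi / total_b          # distribution bad for this class
        woe = math.log(dg / db)    # weight of evidence
        woes.append(woe)
        components.append((dg - db) * woe)
    return woes, components, sum(components)
```

Each IV component is non-negative, because (DG - DB) and ln(DG/DB) always share the same sign, so a variable whose classes all have identical good and bad distributions gets an IV of exactly zero.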
Back to the Case Study
In the previous article, we created coarse classes for the variable age in our case study. Now, let us calculate both information value and weight of evidence for these coarse classes.

Let us examine this table. Here, distribution of loans is the ratio of loans in a coarse class to total loans. For the group 21-30, this is 4821/60801 = 0.079. Similarly, distribution bad (DB) = 206/1522 = 0.135 and distribution good (DG) = 4615/59279 = 0.078. Additionally, DG - DB = 0.078 - 0.135 = -0.057. Further, WOE = ln(DG/DB) = -0.553, computed from the unrounded ratios.
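The arithmetic for the 21-30 group can be checked with a few lines of Python. The counts are the ones quoted in the text; the variable names are illustrative only.

```python
import math

# Counts quoted in the text for the 21-30 age group and for the whole portfolio.
loans_class, bad_class, good_class = 4821, 206, 4615
loans_total, bad_total, good_total = 60801, 1522, 59279

dist_loans = loans_class / loans_total   # share of all loans in this class
db = bad_class / bad_total               # distribution bad
dg = good_class / good_total             # distribution good
woe = math.log(dg / db)                  # weight of evidence
iv_component = (dg - db) * woe           # this class's contribution to IV

print(f"distribution of loans = {dist_loans:.3f}")
print(f"DB = {db:.3f}, DG = {dg:.3f}, DG - DB = {dg - db:.3f}")
print(f"WOE = {woe:.3f}, IV component = {iv_component:.4f}")
```

Working from the unrounded ratios matters: plugging the three-decimal values 0.078 and 0.135 into the logarithm gives roughly -0.549 rather than -0.553.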

Download the attached Excel file to understand this calculation: Information Value (IV) and Weight of Evidence (WOE)

Finally, the component of IV for this group is (-0.057)*(-0.553) = 0.0318. Similarly, calculate the IV components for all the other coarse classes. Adding these components produces an IV of 0.1093 (last column of the table). Now the question is how to interpret this value of IV. The answer is the rule of thumb described below.

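Information value | Predictive power
< 0.02            | Useless for prediction
0.02 to 0.1       | Weak predictor
0.1 to 0.3        | Medium predictor
0.3 to 0.5        | Strong predictor
> 0.5             | Suspicious or too good to be true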
Typically, variables with medium and strong predictive power are selected for model development. However, some schools of thought would advocate using only the variables with medium IVs for broad-based model development. Notice that the information value for age is 0.1093, hence it barely falls within the medium predictors’ range.
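The rule of thumb can be written as a small lookup. The function name `iv_band` is hypothetical, and because the table does not say which band a boundary value such as exactly 0.1 belongs to, this sketch assumes each band includes its lower bound (and that 0.5 itself still counts as strong).

```python
def iv_band(iv):
    """Map an information value to the rule-of-thumb predictive power band.

    Boundary handling is an assumption of this sketch: each band includes
    its lower bound, and exactly 0.5 is treated as a strong predictor.
    """
    if iv < 0.02:
        return "useless for prediction"
    if iv < 0.1:
        return "weak predictor"
    if iv < 0.3:
        return "medium predictor"
    if iv <= 0.5:
        return "strong predictor"
    return "suspicious or too good to be true"


# IV for age from the case study.
print(iv_band(0.1093))
```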
Logistic Regression with Weight of Evidence (WOE)
Finally, let us create a logistic regression model with the weight of evidence of the coarse classes as the value of the independent variable age. The following are the results generated by statistical software.

Let us estimate the bad rate for the age group 21-30 using the above information.

This is precisely the value we obtained last time (see the previous part) and is consistent with the bad rate for the group.
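The software output referred to above is not reproduced here, so the following is only a sketch of why the estimate matches the group's bad rate. It assumes the target is coded 1 for a bad loan and that WOE (computed on the same data) is the only predictor. Under those assumptions, an intercept of ln(total bad / total good) and a slope of -1 make the fitted log-odds of every class equal to its observed ln(bad/good), so this is the fit a maximum likelihood routine converges to.

```python
import math

# Counts quoted in the text: portfolio totals and the 21-30 age group.
bad_total, good_total = 1522, 59279
bad_class, good_class = 206, 4615

# WOE for the 21-30 group, as defined earlier: ln(DG / DB).
woe = math.log((good_class / good_total) / (bad_class / bad_total))

# Assumed setup: target = 1 for a bad loan, single predictor = WOE of the age class.
intercept = math.log(bad_total / good_total)   # log-odds of bad for the whole portfolio
slope = -1.0                                   # follows from the WOE = ln(DG/DB) convention

log_odds = intercept + slope * woe
predicted_bad_rate = 1 / (1 + math.exp(-log_odds))
observed_bad_rate = bad_class / (bad_class + good_class)

print(f"predicted = {predicted_bad_rate:.4f}, observed = {observed_bad_rate:.4f}")
```

If WOE were instead defined as ln(DB/DG), the slope would come out as +1; the predicted bad rates would be the same.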
Sign-off note
I wish we had an instrument similar to information value to estimate the value of information coming from so-called experts. Until then, the next time an expert on a business channel advises you to buy a certain stock, take that advice with a pinch of salt.
Read the remaining parts of the credit scoring series
Part 1: Data visualization for scoring
Part 2: Creating ratio variables for better scoring
Part 3: Logistic regression
Part 5: Reject inference
Part 6: Population stability index for scorecard monitoring
