建炎以来系年要录:决策树

特征选择

  1. 信息熵: 信息熵 \[H(X)=-\sum_{x \in X} P(x) \log P(x) ) \] 条件熵 \[H(Y | X)=\sum_{i=1}^{n} p_{i} H\left(Y | X=x_{i}\right) \] 关于这里有空我要单独写篇文章!
  2. 信息增益 \[g(D, A)=H(D)-H(D | A) \] 其中,数据集 \(D\) 的经验熵 \[H(D)=-\sum_{k=1}^{K} \frac{\left|C_{k}\right|}{|D|} \log _{2} \frac{\left|C_{k}\right|}{|D|} \] 特征 \(A\) 对数据集 \(D\) 的经验条件熵 \[H(D | A)=\sum_{i=1}^{n} \frac{\left|D_{i}\right|}{|D|} H\left(D_{i}\right)=-\sum_{i=1}^{n} \frac{\left|D_{i}\right|}{D |} \sum_{k=1}^{K} \frac{\left|D_{i k}\right|}{\left|D_{i}\right|} \log _{2} \frac{\left|D_{i k}\right|}{\left|D_{i}\right|} \]
  3. 信息增益比 \[g_{R}(D, A)=\frac{g(D, A)}{H(D)} \]

决策树剪枝

\[H_{t}(T)=-\sum_{k} \frac{N_{t k}}{N_{t}} \log \frac{N_{t k}}{N_{t}} \]

\[C_{\alpha}(T)=\sum_{t=1}^{T I} N_{t} H_{t}(T)+\alpha|T| \]

CART

如何生成回归树

两个区域 \(R_{1}(j, s)=\left\{x | x^{(j)} \leqslant s\right\}\)\(R_{2}(j, s)=\left\{x | x^{(j)}>s\right\}\)

\[\min_{j,s} \left[\min _{c_{1}} \sum_{x_{i} \in R_{1}(j, s)}\left(y_{i}-c_{1}\right)^{2} + \min _{c_{2}} \sum_{x \in R_{R_{2}}(j, s)}\left(y_{i}-c_{2}\right)^{2} \right ] \]

\[\hat{c}_{m}=\frac{1}{N_{m}} \sum_{x_{i} \in R_{m}(j, s)} y_{i} \]

基尼指数

\[\operatorname{Gini}(p)=\sum_{k=1}^{K} p_{k}\left(1-p_{k}\right)=1-\sum_{k=1}^{K} p_{k}^{2} \] \[\operatorname{Gini}(D, A)=\frac{\left|D_{1}\right|}{|D|} \operatorname{Gini}\left(D_{1}\right)+\frac{\left|D_{2}\right|}{|D|} \operatorname{Gini}\left(D_{2}\right) \]

分类树的生成

递归

CART 剪枝

\(\alpha=\frac{C(t)-C\left(T_{t}\right)}{\left|T_{t}\right|-1}\)时,计算\(g(t)=\frac{C(t)-C\left(T_{t}\right)}{\left|T_{t}\right|-1}\)