[Machine Learning] Classification with Decision Trees and Random Forests

2020. 5. 19. 09:06 · Notes / Python: Programming

Reference:

[파이썬 라이브러리를 활용한 머신러닝] (Korean edition of Introduction to Machine Learning with Python), pp. 103-121

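1. Decision tree (DecisionTreeClassifier)

* The fitted model is easy to visualize, so even non-experts can follow it

* No preprocessing such as normalization or standardization of the features is needed, because each split thresholds a single feature and its scale does not matter

* Tends to overfit the training data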
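The claim about preprocessing can be checked directly. The following is a minimal sketch, not part of the original post: it assumes a synthetic dataset from make_classification and rescales each column by a different power of two (chosen so the rescaling is exact in floating point), then compares a tree trained on the raw features with one trained on the rescaled features.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (illustrative only, not the breast cancer data)
X, y = make_classification(n_samples=300, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Give every column a very different scale: 2, 4, 8, ...
scale = 2.0 ** np.arange(1, X.shape[1] + 1)

# Same hyperparameters and seed, one tree per version of the data
tree_raw = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
tree_scaled = DecisionTreeClassifier(random_state=0).fit(X_train * scale, y_train)

pred_raw = tree_raw.predict(X_test)
pred_scaled = tree_scaled.predict(X_test * scale)
same = np.array_equal(pred_raw, pred_scaled)
print("Identical predictions:", same)
```

Since a split only asks whether one feature is above or below a threshold, rescaling a column moves the threshold but not the resulting partition of the samples.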

Breast cancer classification example

from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

# Load the data and hold out a test set (default 75% / 25% split)
cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(cancer.data, cancer.target, random_state=42)

# Fit an unrestricted tree: it keeps splitting until every leaf is pure
tree = DecisionTreeClassifier(random_state=0)
tree.fit(X_train, y_train)
print("Training set accuracy: {:.3f}".format(tree.score(X_train, y_train)))
print("Test set accuracy: {:.3f}".format(tree.score(X_test, y_test)))

>>> 
Training set accuracy: 1.000
Test set accuracy: 0.930

cancer  # in a notebook, this displays the loaded dataset (data, target, feature_names, ...)

# Stopping tree growth at a fixed depth

# max_depth=4 stops the tree from growing once it reaches depth 4 (pre-pruning)
tree = DecisionTreeClassifier(max_depth=4, random_state=0)
tree.fit(X_train, y_train)

print("Training set accuracy: {:.3f}".format(tree.score(X_train, y_train)))
print("Test set accuracy: {:.3f}".format(tree.score(X_test, y_test)))
# Overfitting is reduced:
# training accuracy drops slightly, but test accuracy improves

>>> 
Training set accuracy: 0.995
Test set accuracy: 0.951
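To see how the depth limit trades training accuracy against test accuracy, the same experiment can be repeated over several depths. This is a rough sketch rather than part of the original post; the list of candidate depths is an arbitrary illustrative choice, and the exact scores may vary with the scikit-learn version.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, random_state=42)

# Candidate depths are an illustrative choice; None means "no limit"
results = {}
for depth in [1, 2, 4, 8, None]:
    model = DecisionTreeClassifier(max_depth=depth, random_state=0)
    model.fit(X_train, y_train)
    # Store (training accuracy, test accuracy) for this depth
    results[depth] = (model.score(X_train, y_train), model.score(X_test, y_test))
    print("max_depth={}: train {:.3f}, test {:.3f}".format(depth, *results[depth]))
```

A large gap between the two numbers for a given depth is the usual sign of overfitting. For an actual model choice, cross-validation on the training set would be a safer guide than the test score.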

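# Visualization

from sklearn.tree import export_graphviz
# Write the tree to a Graphviz .dot file; class_names follow the order of cancer.target_names
export_graphviz(tree, out_file="tree.dot", class_names=["malignant", "benign"], feature_names=cancer.feature_names,
                impurity=False, filled=True)

import os
# Add the Graphviz binaries to PATH (Windows install location; adjust to your system)
os.environ["PATH"] += os.pathsep + 'C:/Program Files (x86)/Graphviz2.38/bin/'

import graphviz
from IPython.display import display  # explicit import; notebooks provide display() automatically

with open("tree.dot", encoding="UTF-8") as f:
    dot_graph = f.read()
display(graphviz.Source(dot_graph))

# Save as an image (tree.png)
graphviz.Source(dot_graph).render('tree', format="png")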
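If Graphviz is not installed, the fitted tree can still be inspected as plain text. The sketch below swaps in scikit-learn's export_text (available in recent scikit-learn releases) in place of the Graphviz route used above. It refits the same depth-4 tree so that it runs on its own.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, random_state=42)
tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_train, y_train)

# Plain-text rendering of the tree's split rules; no Graphviz install needed
rules = export_text(tree, feature_names=list(cancer.feature_names))
print(rules)
```

Each indented line is one split condition, and the leaf lines show the predicted class (0 = malignant, 1 = benign in this dataset).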

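# Feature importances of the tree
# 0: the feature is not used at all
# 1: the feature alone predicts the target class perfectly
# (the importances sum to 1)
print("Feature importances:\n", tree.feature_importances_)

>>> 
Feature importances: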
 [0.         0.         0.         0.         0.         0.
 0.         0.73943775 0.         0.         0.013032   0.
 0.         0.         0.         0.         0.         0.01737208
 0.00684355 0.         0.06019401 0.11783988 0.         0.03522339
 0.01005736 0.         0.         0.         0.         0.        ]
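The raw array is hard to read without the feature names. As a small sketch (not in the original post), the importances can be paired with cancer.feature_names and sorted so that only the features the tree used are listed. The tree is refit here with the same settings so the block is self-contained.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, random_state=42)
tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_train, y_train)

importances = tree.feature_importances_
order = np.argsort(importances)[::-1]  # feature indices, most important first

# Print only the features the tree actually split on
for idx in order:
    if importances[idx] > 0:
        print("{:<25s} {:.3f}".format(cancer.feature_names[idx], importances[idx]))
```

A value of 0 only means this particular tree did not split on the feature. It does not mean the feature carries no information, since another correlated feature may have been picked instead.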

# Plotting the feature importances of the tree

import matplotlib.pyplot as plt
import numpy as np

%matplotlib notebook
def plot_feature_importances_cancer(model):
    # Horizontal bar chart with one bar per feature
    plt.figure(figsize=(15,5))
    n_features = cancer.data.shape[1]
    plt.barh(np.arange(n_features), model.feature_importances_, align="center")
    plt.yticks(np.arange(n_features), cancer.feature_names)
    plt.xlabel("Feature importance")
    plt.ylabel("Feature")
    plt.ylim(-1, n_features)


plot_feature_importances_cancer(tree)

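2. Random forest (RandomForestClassifier)

* To get the same result on every run, fix the random_state value

* Averaging many randomized trees avoids much of the overfitting of a single decision tree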
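The effect of random_state can be demonstrated in a few lines. This is a minimal sketch under the assumption that the forest is fit in a single process (the default n_jobs); fit_proba is a hypothetical helper introduced only for this illustration.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier

X, y = make_moons(n_samples=100, noise=0.25, random_state=3)

def fit_proba(seed):
    """Hypothetical helper: fit a small forest with the given seed and return class probabilities."""
    forest = RandomForestClassifier(n_estimators=5, random_state=seed)
    return forest.fit(X, y).predict_proba(X)

# Same seed twice, then two different seeds
same_seed = np.array_equal(fit_proba(42), fit_proba(42))
other_seed = np.array_equal(fit_proba(42), fit_proba(7))
print("Same random_state, identical probabilities:", same_seed)
print("Different random_state, identical probabilities:", other_seed)
```

With the same seed, the bootstrap samples and the feature subsets are drawn identically, so the two forests agree exactly. With a different seed they will usually, though not necessarily, differ.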

Classification examples (two_moons data first, then the breast cancer data)

from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import mglearn

X, y = make_moons(n_samples= 100, noise = 0.25, random_state = 3)
X_train, X_test , y_train, y_test = train_test_split(X, y , stratify = y, random_state = 42)

forest = RandomForestClassifier(n_estimators = 5, random_state=42)
forest.fit(X_train, y_train)

>>> 
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=None, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=5,
                       n_jobs=None, oob_score=False, random_state=42, verbose=0,
                       warm_start=False)

# Plot the decision boundary of each of the five trees, then of the whole forest
fig, axes = plt.subplots(2, 3, figsize=(20, 10))
# The loop variable is named `est` so the earlier `tree` model is not overwritten
for i, (ax, est) in enumerate(zip(axes.ravel(), forest.estimators_)):
    ax.set_title("Tree {}".format(i))
    mglearn.plots.plot_tree_partition(X, y, est, ax=ax)

mglearn.plots.plot_2d_separator(forest, X, fill=True, ax=axes[-1,-1], alpha=.4)
axes[-1,-1].set_title("Random forest")
mglearn.discrete_scatter(X[:,0], X[:,1], y)

# Breast cancer data: a forest of 100 trees
X_train, X_test, y_train, y_test = train_test_split(cancer.data, cancer.target, random_state=0)
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)

print("Training set accuracy: {:.3f}".format(forest.score(X_train, y_train)))
print("Test set accuracy: {:.3f}".format(forest.score(X_test, y_test)))

>>> 
Training set accuracy: 1.000
Test set accuracy: 0.972

Feature importances of the random forest

plot_feature_importances_cancer(forest)
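Besides the plot, one quick way to compare the two models is to count how many features receive a nonzero importance. The following is a sketch, not the author's code, that refits a depth-4 tree and a 100-tree forest on the same split. The exact counts may vary with the scikit-learn version.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, random_state=0)

tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_train, y_train)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

# Number of features with nonzero importance in each model
tree_used = int(np.count_nonzero(tree.feature_importances_))
forest_used = int(np.count_nonzero(forest.feature_importances_))
print("Features used by the single tree:", tree_used)
print("Features used by the forest:", forest_used)
```

Because each tree in the forest sees a bootstrap sample and considers a random subset of features at every split, the forest typically spreads importance over many more features than a single tree does.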
