최적의 모델 선택 | COSADAMA Curriculum

최적의 모델 선택: sklearn.model_selection

앞서 학습 데이터와 테스트 데이터를 활용해 모델 학습하고 적용한다고 말씀 드렸습니다. 이때 과적합을 방지하고 더 좋은 모델을 만들기 위해 **데이터를 섞거나, 나누거나, 무작위로 모델을 적용(교차 검증 분할)**하기도 합니다. 마찬가지로 대표적인 모듈들만 설명 드리겠습니다. 자세한 내용은 sklearn 공식 API를 확인하세요.

1. `model_selection` 모듈

과적합 방지를 위해 교차 검증 분할 및 평가
Estimator의 하이퍼 파라미터 튜닝을 위한 다양한 함수와 클래스 제공

2. `train_test_split()`: 학습/테스트 데이터 세트 분리

model_selection.train_test_split(*arrays[, ...])
train-test 혹은 train-validation 데이터를 나눠줍니다.
train_size, test_size: 학습train 또는 테스트test 데이터의 크기 설정. 보통 0.8:0.2 에서 0.7:0.3 정도의 비율
random_state: 데이터 split이 일관되게 나타나게 하기 위한 지표. 예컨대
```
random_state = 121
```
이렇게 설정하면 다음 split에서도 동일한 train/test dataset 제공
shuffle: True/False로 입력, shuffle=True인 경우 split 전에 데이터를 섞음
stratify: train/test의 비율이 원본 데이터와 유사하도록 추출
```
stratify = y_data #test(validation)에 해당하는 값
```

from sklearn.datasets import load_iris
import pandas as pd

iris = load_iris()

feature = iris.data
target = iris.target


# 학습/테스트 데이터 세트 분리
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.3, random_state=121)


print('x_train 데이터의 행, 열: ', x_train.shape)
print('y_train 데이터의 행, 열: ', y_train.shape)
print('x_test 데이터의 행, 열: ', x_test.shape)
print('y_test 데이터의 행, 열: ', y_test.shape)

>>>
x_train 데이터의 행, 열:  (105, 4)
y_train 데이터의 행, 열:  (105,)
x_test 데이터의 행, 열:  (45, 4)
y_test 데이터의 행, 열:  (45,)

3. `kfold()`: 교차 검증1

model_selection.KFold([n_splits, shuffle, ...])
학습 데이터를 k개로 나누어 그 중 하나를 test(validation) 데이터로 사용합니다. 이 과정을 k번 반복합니다.
train_test_split()이 train/test 데이터를 나눴다면, KFold는 train데이터 내부를 K개로 나눕니다.
- n_splits: kfold의 n_splits와 같습니다.
- shuffle: 기본값은 False고, True를 적용하면 Fold 전에 무작위로 섞습니다.
- n_iter: 반복 횟수를 의미합니다.
주의: kfold는 인덱스 순서대로 자르기 때문에 어떤 fold의 accuracy는 100%, 어떤 fold는 83%까지 정확도accuracy 편차가 큽니다.

kfold

# accuracy score
from sklearn.metrics import accuracy_score


features = iris.data
target = iris.target

# 모델 학습
from sklearn.tree import DecisionTreeClassifier
dt_clf = DecisionTreeClassifier(random_state = 11)

# 모델 fitting, predicting
dt_clf.fit(x_train, y_train)
pred = dt_clf.predict(x_test)

print('Decision Tree 예측 정확도: {0:.4f}'.format(accuracy_score(y_test, pred)))

>>>
Decision Tree 예측 정확도: 0.9556

from sklearn.model_selection import KFold

# 5개의 데이터 셋으로 나누기
kfold = KFold(n_splits = 5)

# 정확도accuracy 결과를 담을 리스트
cv_accuracy = []

# 반복 횟수
n_iter = 0

# KFold 객체의 split()을 호출하면 폴드 별 학습용, 검증용 테스트의 로우 인덱스를 array로 반환
# for문을 통해 K번 반복
for train_index, test_index in kfold.split(features):
    #kfold.split()으로 반환된 인덱스를 이용해 학습용, 검증용 데이터 추출
    x_train, x_test = features[train_index], features[test_index]
    y_train, y_test = target[train_index], target[test_index]
    
    #학습 및 예측
    dt_clf.fit(x_train, y_train)
    pred = dt_clf.predict(x_test)
    
    # 반복마다 정확도 측정
    n_iter += 1
    accuracy = np.round(accuracy_score(y_test, pred), 4)
    train_size = x_train.shape[0]
    test_size = x_train.shape[0]


    print('\n{0}번 교차검증 정확도 : {1}\n학습 데이터 크기 : {2}\n검증 데이터 크기 : {3}'.format(n_iter, accuracy, train_size, test_size))
    print('#{0} 검증 세트 인덱스: {1}'.format(n_iter, test_index))
    cv_accuracy.append(accuracy)
    
# 개별 iteration별 정확도를 합하여 평균 정확도 계산
print('\n## 평균검증 정확도: ', np.mean(cv_accuracy))

>>>
1번 교차검증 정확도 : 1.0
학습 데이터 크기 : 120
검증 데이터 크기 : 120
#1 검증 세트 인덱스: [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
 24 25 26 27 28 29]


2번 교차검증 정확도 : 0.9667
학습 데이터 크기 : 120
검증 데이터 크기 : 120
#2 검증 세트 인덱스: [30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53
 54 55 56 57 58 59]


3번 교차검증 정확도 : 0.8667
학습 데이터 크기 : 120
검증 데이터 크기 : 120
#3 검증 세트 인덱스: [60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83
 84 85 86 87 88 89]


4번 교차검증 정확도 : 0.9333
학습 데이터 크기 : 120
검증 데이터 크기 : 120
#4 검증 세트 인덱스: [ 90  91  92  93  94  95  96  97  98  99 100 101 102 103 104 105 106 107
 108 109 110 111 112 113 114 115 116 117 118 119]


5번 교차검증 정확도 : 0.8333
학습 데이터 크기 : 120
검증 데이터 크기 : 120
#5 검증 세트 인덱스: [120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137
 138 139 140 141 142 143 144 145 146 147 148 149]


## 평균검증 정확도:  0.9200000000000002

4. `Stratified K fold`: 교차 검증2

model_selection.RepeatedStratifiedKFold(*[, ...])
stratified KFold는 kfold의 데이터 편향을 막기 위해 target label을 고려합니다.
n_splits: kfold의 n_splits와 같습니다.
shuffle: 기본값은 False고, True를 적용하면 Fold 전에 무작위로 섞습니다.
n_iter: 반복 횟수를 의미합니다.

from sklearn.model_selection import StratifiedKFold

skf = StratifiedKFold(n_splits=3)
n_iter = 0
skf_cv_accuracy = []

# 학습train 데이터에 target label 추가
iris_df = pd.DataFrame(data = iris.data, columns = iris.feature_names)
iris_df['target'] = iris.target

# label 데이터로 골고루 섞어 교차 검증
for train_index, test_index in skf.split(iris_df, iris_df['target']):
    n_iter += 1

    accuracy = np.round(accuracy_score(y_test, pred),4)

    target_train = iris_df['target'].iloc[train_index]
    target_test = iris_df['target'].iloc[test_index]

    x_train, x_test = features[train_index], features[test_index]
    y_train, y_test = target[train_index], target[test_index]

    dt_clf.fit(x_train, y_train)
    pred = dt_clf.predict(x_test)

    print('====================\n{0}번\n교차검증 정확도: {1}\n학습데이터 크기: {2}\n검증데이터 크기: {3}'.format(n_iter, accuracy, train_size, test_size))
    print('학습 레이블target_train 데이터 분포: \n', target_train.value_counts())
    print('검증 레이블target_test 데이터 분포: \n', target_test.value_counts())
    skf_cv_accuracy.append(accuracy)

# fold별 정확도 및 평균정확도 계산
print('\n교차검증별 정확도: ', skf_cv_accuracy)
print('평균 검증 정확도: ', np.mean(skf_cv_accuracy))

>>>
====================
1번
교차검증 정확도: 0.8333
학습데이터 크기: 120
검증데이터 크기: 120
학습 레이블target_train 데이터 분포: 
 2    34
0    33
1    33
Name: target, dtype: int64
검증 레이블target_test 데이터 분포: 
 0    17
1    17
2    16
Name: target, dtype: int64
====================
2번
교차검증 정확도: 0.98
학습데이터 크기: 120
검증데이터 크기: 120
학습 레이블target_train 데이터 분포: 
 1    34
0    33
2    33
Name: target, dtype: int64
검증 레이블target_test 데이터 분포: 
 0    17
2    17
1    16
Name: target, dtype: int64
====================
3번
교차검증 정확도: 0.92
학습데이터 크기: 120
검증데이터 크기: 120
학습 레이블target_train 데이터 분포: 
 0    34
1    33
2    33
Name: target, dtype: int64
검증 레이블target_test 데이터 분포: 
 1    17
2    17
0    16
Name: target, dtype: int64


교차검증별 정확도:  [0.8333, 0.98, 0.92]
평균 검증 정확도:  0.9110999999999999

5. `cross_val_score()`: 교차 검증3

kFold, StratifiedKFold 모두 fold 설정, for loop를 통해 반복적으로 학습/예측했습니다.
cross_val_score()는 이 과정을 한번에 알아서 수행합니다.
따라서 선언 형태도 다릅니다.
- estimator: 분류/회귀 알고리즘
- X: 피처 데이터셋(train data)
- y: 타겟 데이터셋(target data)
- scoring: 예측 성능 평가 지표. accuracy, top_k_accuracy 등
- n_jobs: cpu 코어 수. -1을 입력하면 최대로 사용
- verbose: 상세 정보 출력 여부(0:출력x, 1: 자세히, 2: 간략히)
- fit_params: 분류/회귀 알고리즘에 fit할 매개변수를 추가합니다. default = None입니다.
- pre_dispatch: 병렬 실행 중에 발송되는 작업 수를 제어합니다. 이 수를 줄이면 CPU가 처리할 수 있는 것보다 더 많은 작업이 디스패치될 때 메모리 사용량이 폭발적으로 증가하는 것을 방지할 수 있습니다.
- error_score: 분류/회귀 알고리즘 fitting에 오류가 발생할 경우 score에 할당할 값을 의미합니다.

from sklearn.model_selection import cross_val_score
cross_val_score(estimator, X, y=None, scoring=None, n_jobs = 1, verbose = 0, fit_params = None, pre_dispatch = '2*n_jobs')

참고

from sklearn.model_selection import cross_val_score, cross_validate
from sklearn.datasets import load_iris

iris = load_iris()
dt_clf = DecisionTreeClassifier(random_state=11)

feature = iris.data
label = iris.target

# 성능지표는 정확도(accuracy), 교차검증 세트는 3개
scores = cross_val_score(dt_clf, feature, label, cv =3 , scoring = 'accuracy')
print('교차 검증별 정확도: ', np.round(scores, 4))
print('평균 검증 정확도: ', np.round(np.mean(scores),4))

>>>
교차 검증별 정확도:  [0.98 0.92 0.98]
평균 검증 정확도:  0.96

6. `GridSearchCV`: 교차 검증을 그리드로 나누어 적용해 최적 하이퍼 파라미터 찾기

최상의 파라미터를 찾는 방법!
모델 훈련을 자동화하고, grid별 교차 검사로 최적 값을 제공합니다.
파라미터
- estimator: 분류/회귀 알고리즘
- param_grid: dict나 list 형태의 dict 형태의 인자를 받습니다. key에는 파라미터의 이름(str)이나 파라미터 이름이 포함된 딕셔너리가 값으로 입력됩니다.
- cv: 몇개의 fold를 사용할지 정합니다. None값을 입력하면 자동으로 5 Fold가 적용됩니다.정수를 입력하면 KFold가 적용됩니다. Stratified KFold도 사용 가능합니다.
  - refit: 전체 데이터 세트에서 가장 높은 점수를 기록한 매개 변수(하이퍼 파라미터)를 사용하여 모델을 다시 작동시킵니다.

주의: 파라미터 개수 x 모델 개수만큼 수행 시간이 늘어납니다. 만약 하나의 모델 수행 시간이 1분이고, 파라미터가 10개, cv가 10개라면 이론상 100분도 걸릴 수 있습니다. 무작정 파라미터, cv를 늘리는 방법은 모델링 시간만 증가시킬수도 있습니다.
잠깐! 파라미터, 하이퍼 파라미터란?:
- 파라미터parameter: 매개변수를 의미합니다. 평균, 표준편차, 정확도accuracy,회귀 계수 등 데이터와 모델을 통해 구해지는 값입니다. 즉, 조정이 불가능한 정해진 값입니다.
- 하이퍼 파라미터hyper parameter: 하이퍼 파라미터는 모델링 할 때 분석가가 지정하는 값을 의미합니다. KFold의 K 값, n_jobs, cv 값 등이 해당합니다. 따라서 우리가 모델에서 값을 지정해줄 때 "하이퍼 파라미터를 조정한다"고 표현해야 합니다.

model_selection.GridSearchCV(estimator, param_grid, *, scoring=None, n_jobs=None, refit=True, cv=None, verbose=0, pre_dispatch='2*n_jobs', error_score=nan, return_train_score=False)

from sklearn.model_selection import GridSearchCV

# 데이터를 로딩하고 학습데이터와 테스트 데이터 분리
iris = load_iris()
x_train, x_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size = 0.2, random_state=121)

dt_clf = DecisionTreeClassifier()

## 파라미터를 딕셔너리 형태로 설정
parameters = {'max_depth' : [1, 2, 3], 'min_samples_split' : [2, 3]}

from sklearn.model_selection import GridSearchCV

# 데이터를 로딩하고 학습데이터와 테스트 데이터 분리
iris = load_iris()
x_train, x_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size = 0.2, random_state=121)

dt_clf = DecisionTreeClassifier()

## 파라미터를 딕셔너리 형태로 설정
parameters = {'max_depth' : [1, 2, 3], 'min_samples_split' : [2, 3]}


>>>

params	mean_test_score	rank_test_score	split0_test_score	split1_test_score	split2_test_score
4	{'max_depth': 3, 'min_samples_split': 2}	0.975000	1	0.975	1.0
5	{'max_depth': 3, 'min_samples_split': 3}	0.975000	1	0.975	1.0
2	{'max_depth': 2, 'min_samples_split': 2}	0.958333	3	0.925	1.0
3	{'max_depth': 2, 'min_samples_split': 3}	0.958333	3	0.925	1.0
0	{'max_depth': 1, 'min_samples_split': 2}	0.700000	5	0.700	0.7
1	{'max_depth': 1, 'min_samples_split': 3}	0.700000	5	0.700	0.7

print('GridSearch 최적 파라미터: ', grid_dt.best_params_)
print('GridSearch 최고 점수: ', grid_dt.best_score_)

>>>
GridSearch 최적 파라미터:  {'max_depth': 3, 'min_samples_split': 2}
GridSearch 최고 점수:  0.975

# GridSearchCV의 refit으로 학습된 estimator 반환
estimator = grid_dt.best_estimator_

# GridSearchCV의 best_estmator_ 는 이미 최적!
pred = estimator.predict(x_test)
print('테스트 데이터세트 정확도: {0: .4f}'.format(accuracy_score(y_test, pred)))

>>>
테스트 데이터세트 정확도:  0.9667

7. Randomized Search CV

RandomizedSearchCV는 하이퍼 파라미터를 무작위로 검색합니다.
GridSearchCV와 달리 모든 모수 값이 시도되지 않습니다. 지정된 분포에서 고정된 수의 모수 설정이 샘플링됩니다. 매개 변수의 수는 n_iter로 지정됩니다.
- cv: GridSearchCV와 같이 cv값을 통해 Fold를 결정합니다.
주의: GridSearchCV 보다 시간은 적게 걸릴 수 있지만 정확도가 더 높다고 확답할 수는 없습니다. random한 파라미터를 사용하기 때문이죠.

RandomizedSearchCV(estimator, param_distributions, *, n_iter=10, scoring=None, n_jobs=None, refit=True, cv=None, verbose=0, pre_dispatch='2*n_jobs', random_state=None, error_score=nan, return_train_score=False)

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report,confusion_matrix
from sklearn.model_selection import RandomizedSearchCV

random_search = {'criterion': ['entropy', 'gini'],
               'max_depth': [2],
               'max_features': ['auto', 'sqrt'],
               'min_samples_leaf': [4, 6, 8],
               'min_samples_split': [5, 7,10],
               'n_estimators': [20]}

rf_clf = RandomForestClassifier()

model = RandomizedSearchCV(estimator = rf_clf, param_distributions = random_search, n_iter = 10, 
                               cv = 4, verbose= 1, random_state= 121, n_jobs = -1)
model.fit(x_train,y_train)


>>>
Fitting 4 folds for each of 10 candidates, totalling 40 fits
RandomizedSearchCV(cv=4, estimator=RandomForestClassifier(), n_jobs=-1,
                   param_distributions={'criterion': ['entropy', 'gini'],
                                        'max_depth': [2],
                                        'max_features': ['auto', 'sqrt'],
                                        'min_samples_leaf': [4, 6, 8],
                                        'min_samples_split': [5, 7, 10],
                                        'n_estimators': [20]},
                   random_state=121, verbose=1)

print(model.best_params_)

>>>
{'n_estimators': 20, 'min_samples_split': 5, 'min_samples_leaf': 4, 'max_features': 'sqrt', 'max_depth': 2, 'criterion': 'gini'}

randompredict = model.best_estimator_.predict(x_test)
print(classification_report(y_test, randompredict))
print("\nRandom Search : ", accuracy_score(y_test, randompredict))

>>>

sklearn 모듈 외에도 Bayesian Optimization, Optuna 등 여러 하이퍼파라미터 최적화 방법이 있습니다. 이후 더 정확한 모델을 만드실 때 활용해 보셔도 좋을 것 같습니다.

최적의 모델 선택: sklearn.model_selection

1. model_selection 모듈

2. train_test_split(): 학습/테스트 데이터 세트 분리

3. kfold(): 교차 검증1

4. Stratified K fold: 교차 검증2

5. cross_val_score(): 교차 검증3