Feature Importances

[CodesSates] AI Bootcamp

웅탈 2021. 4. 22. 23:56

Feature Importances (MDI)

- For each feature, the mean decrease in impurity (MDI), averaged over all trees in the forest

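# Feature importances example
import matplotlib.pyplot as plt
import pandas as pd

# Assumes `pipe` was built with make_pipeline, which names the step 'randomforestclassifier'
rf = pipe.named_steps['randomforestclassifier']
importances = pd.Series(rf.feature_importances_, X_train.columns)

# Plot the n most important features as a horizontal bar chart
n = 20
plt.figure(figsize=(10, n / 2))
plt.title(f'Top {n} features')
importances.sort_values()[-n:].plot.barh();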
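
To make the impurity-decrease idea concrete, here is a minimal from-scratch sketch of the quantity that a single split contributes. It is not scikit-learn's implementation; the helper names `gini` and `impurity_decrease` and the toy labels are assumptions made for illustration. Roughly speaking, MDI adds up these per-split decreases for each feature within a tree, normalizes them, and averages the result over all trees.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels: 1 - sum(p_k^2)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def impurity_decrease(parent, left, right, n_total):
    """Weighted impurity decrease contributed by one split.

    n_total is the number of training samples in the whole tree, so splits
    that see more samples (closer to the root) weigh more.
    """
    n = len(parent)
    weighted_children = (len(left) * gini(left) + len(right) * gini(right)) / n
    return (n / n_total) * (gini(parent) - weighted_children)


# Hypothetical toy node: 8 samples split into two children by some feature.
parent = [0, 0, 0, 0, 1, 1, 1, 1]
left = [0, 0, 0, 1]
right = [0, 1, 1, 1]

decrease = impurity_decrease(parent, left, right, n_total=len(parent))
print(round(decrease, 4))  # 0.125
```

The parent has impurity 0.5 and each child 0.375, so this split is credited with 0.125. A perfect split of the same node would be credited with the full 0.5.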

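Permutation importance (Mean Decrease Accuracy, MDA)

- Add random noise to (that is, shuffle) only the feature of interest, then predict and measure how much the evaluation metric (accuracy, F1, R2, etc.) drops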
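
The shuffling procedure can be sketched without any library. This is a minimal illustration under stated assumptions, not the eli5 implementation: `toy_model`, the toy data, and the function names are hypothetical, and accuracy is the only metric used.

```python
import random


def accuracy(model, X, y):
    """Fraction of rows where the model's prediction equals the label."""
    return sum(model(row) == label for row, label in zip(X, y)) / len(y)


def permutation_importance(model, X, y, n_iter=5, seed=2):
    """Mean drop in accuracy after shuffling each column, one at a time."""
    rng = random.Random(seed)
    baseline = accuracy(model, X, y)
    importances = []
    for col in range(len(X[0])):
        drops = []
        for _ in range(n_iter):
            # Shuffle only this column; every other column stays untouched.
            shuffled = [row[col] for row in X]
            rng.shuffle(shuffled)
            X_perm = [row[:col] + [value] + row[col + 1:]
                      for row, value in zip(X, shuffled)]
            drops.append(baseline - accuracy(model, X_perm, y))
        importances.append(sum(drops) / n_iter)
    return importances


# Hypothetical toy data: the label equals feature 0; feature 1 is irrelevant.
X = [[i % 2, (i * 7) % 5] for i in range(40)]
y = [row[0] for row in X]


def toy_model(row):
    # Stand-in for an already fitted model: it only looks at feature 0.
    return row[0]


importances = permutation_importance(toy_model, X, y)
print(importances)
```

Shuffling feature 0 breaks the link between input and label, so its importance is clearly positive, while shuffling the unused feature 1 leaves accuracy unchanged and gives exactly 0.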

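# Assumption: OrdinalEncoder comes from category_encoders (sklearn.preprocessing also provides one)
from category_encoders import OrdinalEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline, make_pipeline

# Group the encoder and imputer into a single 'preprocessing' step; it is reused later for the eli5 permutation calculation
pipe = Pipeline([
    ('preprocessing', make_pipeline(OrdinalEncoder(), SimpleImputer())),
    ('rf', RandomForestClassifier(n_estimators=100, random_state=2, n_jobs=-1))
])

pipe.fit(X_train, y_train)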

import eli5
import pandas as pd
from eli5.sklearn import PermutationImportance

# Define the permuter
permuter = PermutationImportance(
    pipe.named_steps['rf'],  # model
    scoring='accuracy',  # metric
    n_iter=5,  # repeat 5 times with different random seeds
    random_state=2
)

# The permuter works on the preprocessed X_val, because it wraps only the 'rf' step
X_val_transformed = pipe.named_steps['preprocessing'].transform(X_val)

# Despite the name, this fit recomputes scores on shuffled data rather than training a model
permuter.fit(X_val_transformed, y_val);

feature_names = X_val.columns.tolist()
pd.Series(permuter.feature_importances_, feature_names).sort_values()


n_adult_r                     -0.003511
hhs_region                    -0.003108
census_region                 -0.003084
behavioral_face_mask          -0.003060
sex_i                         -0.002942
state                         -0.002918
behavioral_wash_hands         -0.002657
n_people_r                    -0.002562
behavioral_large_gatherings   -0.002538
behavioral_antiviral_meds     -0.002420
behavioral_avoidance          -0.002159
behavioral_outside_home       -0.002064
behavioral_touch_face         -0.001755
chronic_med_condition         -0.001613
census_msa                    -0.001542
behaviorals                   -0.001471
child_under_6_months          -0.001400
marital                       -0.001328
rent_own_r                    -0.000712
inc_pov                       -0.000474
raceeth4_i                    -0.000213
household_children            -0.000119
education_comp                 0.001234
health_insurance               0.002775
health_worker                  0.003060
opinion_seas_sick_from_vacc    0.004792
agegrp                         0.007733
opinion_seas_risk              0.041039
opinion_seas_vacc_effective    0.043814
doctor_recc_seasonal           0.071427
dtype: float64

Boosting

- Training a single tree deeply makes overfitting likely, so ensemble methods such as bagging or boosting are used instead

- Python libraries for Gradient Boosting

scikit-learn Gradient Tree Boosting: can be relatively slow.
- Anaconda: already installed
- Google Colab: already installed
xgboost: accepts missing values and can enforce monotonic constraints.
- Anaconda, Mac/Linux: conda install -c conda-forge xgboost
- Windows: conda install -c anaconda py-xgboost
- Google Colab: already installed
LightGBM: accepts missing values and can enforce monotonic constraints.
- Anaconda: conda install -c conda-forge lightgbm
- Google Colab: already installed
CatBoost: accepts missing values and can use categorical features without preprocessing.
- Anaconda: conda install -c conda-forge catboost
- Google Colab: pip install catboost

- Early stopping

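# eval_set is assumed to be a list of (X, y) pairs, e.g. [(X_train_encoded, y_train), (X_val_encoded, y_val)]
model.fit(X_train_encoded, y_train,
          eval_set=eval_set,
          eval_metric='error',  # (number of wrong cases) / (number of all cases)
          early_stopping_rounds=50  # stop if the score has not improved for 50 rounds
         )
# Note: xgboost 2.0 and later expect eval_metric and early_stopping_rounds in the model constructor instead of fit()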
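
The stopping rule itself is simple enough to sketch in plain Python. This is an illustration of the patience logic only, not xgboost's internal code; the function name `best_round_with_early_stopping` and the validation errors are made up for the example.

```python
def best_round_with_early_stopping(scores, patience):
    """Return (best_round, stopped_round) for a sequence of validation errors.

    Lower is better. Training stops once `patience` consecutive rounds pass
    without improving on the best error seen so far.
    """
    best_score = float("inf")
    best_round = -1
    for current_round, score in enumerate(scores):
        if score < best_score:
            best_score, best_round = score, current_round
        elif current_round - best_round >= patience:
            return best_round, current_round
    return best_round, len(scores) - 1


# Hypothetical validation error after each boosting round.
val_errors = [0.30, 0.25, 0.22, 0.21, 0.23, 0.22, 0.24, 0.25, 0.20]
best, stopped = best_round_with_early_stopping(val_errors, patience=3)
print(best, stopped)  # 3 6
```

With a patience of 3, training stops at round 6 and keeps round 3 as the best one. The lower error at round 8 is never reached, which shows the trade-off: a small patience saves time but can stop before a later improvement.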

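<Hyperparameter tuning>

Random Forest

class_weight (for imbalanced classes)
max_depth (tune by decreasing from a high value; trees that are too deep overfit)
n_estimators (too few underfits; too many lengthens training time)
min_samples_leaf (increase when overfitting)
max_features (lower values produce more diverse trees; higher values make more trees use the same features, reducing diversity)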
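
As a rough illustration of what class_weight does for imbalanced data, the sketch below reproduces the formula scikit-learn documents for class_weight='balanced', namely n_samples / (n_classes * count of each class). The helper name `balanced_class_weights` and the 90/10 label split are assumptions for the example.

```python
from collections import Counter


def balanced_class_weights(y):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(y)
    n_samples, n_classes = len(y), len(counts)
    return {label: n_samples / (n_classes * count)
            for label, count in counts.items()}


# Hypothetical imbalanced labels: 90 negatives and 10 positives.
y = [0] * 90 + [1] * 10
weights = balanced_class_weights(y)
print(weights)
```

The minority class receives a weight of 5.0 and the majority class about 0.56, so each class contributes the same total weight during training.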

XGBoost

scale_pos_weight (for imbalanced classes)
max_depth (tune by increasing from a low value; trees that are too deep overfit)
n_estimators (too few underfits; too many lengthens training time) - use early stopping!
learning_rate (too low underfits; too high overfits)
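
A common starting point for scale_pos_weight, suggested in the XGBoost documentation, is the ratio of negative to positive samples. The sketch below computes that ratio; the helper name `suggested_scale_pos_weight` and the labels are hypothetical, and the value is only a starting point to tune from.

```python
def suggested_scale_pos_weight(y, positive_label=1):
    """Ratio of negative to positive samples in a binary label list."""
    n_pos = sum(1 for label in y if label == positive_label)
    n_neg = len(y) - n_pos
    if n_pos == 0:
        raise ValueError("no positive samples in y")
    return n_neg / n_pos


# Hypothetical imbalanced labels: 90 negatives and 10 positives.
y_train_example = [0] * 90 + [1] * 10
ratio = suggested_scale_pos_weight(y_train_example)
print(ratio)  # 9.0
```

The resulting value could then be passed as scale_pos_weight when constructing the classifier.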
