Preprocessing (1) train_test_split

Porits789 2022. 7. 14. 15:43

※ The data used in this exercise is the Iris data set.

This step splits the data into a training set and a test set.

The split is done here with scikit-learn's train_test_split function.

First, import train_test_split from scikit-learn.

from sklearn.model_selection import train_test_split

The basic usage is as follows.

x_train, x_test, y_train, y_test = train_test_split(x_data, y_data, test_size=0.2, random_state=42)

The code above randomly splits the data x and y into 80% train and 20% test. Fixing random_state makes the split reproducible.
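As a quick check of the 80/20 ratio, the sketch below runs the same call on the Iris data and prints the resulting shapes. Loading the data with load_iris(return_X_y=True) is an assumption made for illustration; the post does not show how x_data and y_data were created.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Assumption: x_data / y_data come from scikit-learn's bundled Iris data set.
x_data, y_data = load_iris(return_X_y=True)

x_train, x_test, y_train, y_test = train_test_split(
    x_data, y_data, test_size=0.2, random_state=42
)

# Iris has 150 rows, so 20% of them end up in the test set.
print(x_train.shape, x_test.shape)
```

With 150 rows this should give 120 training rows and 30 test rows.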

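However, a purely random split like this can leave some classes over-represented or under-represented in the train or test set, which can lower accuracy.

Stratified sampling avoids this by drawing a proportional number of samples from each stratum (here, each class).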
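To make the idea concrete, here is a minimal sketch of stratified sampling in plain Python. The helper stratified_split is hypothetical and written only for illustration; it is not how scikit-learn implements stratification internally.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_ratio, seed=0):
    """Return (train_idx, test_idx), sampling each class at test_ratio."""
    rng = random.Random(seed)

    # Group sample indices by class label.
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)                       # random order within a class
        n_test = round(len(indices) * test_ratio)  # same ratio for every class
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx


# An imbalanced toy label list: 50 "a", 30 "b", 20 "c".
labels = ["a"] * 50 + ["b"] * 30 + ["c"] * 20
train_idx, test_idx = stratified_split(labels, test_ratio=0.2)

test_counts = {c: sum(labels[i] == c for i in test_idx) for c in "abc"}
print(test_counts)  # {'a': 10, 'b': 6, 'c': 4}
```

Each class contributes 20% of its own samples, so the test set keeps the 5:3:2 class ratio of the full data.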

In train_test_split, this is done by passing the labels to the stratify parameter.

x_train, x_test, y_train, y_test = train_test_split(x_data, y_data,
                                                    test_size=0.2, random_state=42, stratify=y_data)

Below is a hands-on example using the iris data.
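The exercise uses df and df_features without showing how they were built. One possible setup, assuming the data comes from scikit-learn's load_iris(as_frame=True) (the post may have loaded it differently), looks like this:

```python
from sklearn.datasets import load_iris

# Assumption: df holds the four Iris feature columns plus a 'target' column.
iris = load_iris(as_frame=True)
df = iris.frame                          # features + 'target'
df_features = df.drop(columns="target")  # features only

print(df.shape, df_features.shape)
```

The full Iris data has 50 samples for each of its three classes, which makes any imbalance after splitting easy to spot.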

1. Without stratify

x_train, x_test, y_train, y_test = train_test_split(df_features, df['target'], test_size=0.3, random_state=42)
y_test.hist()

The resulting histogram shows that the classes are unevenly represented in the test set.

2. With stratify

x_train2, x_test2, y_train2, y_test2 = train_test_split(df_features, df['target'], test_size=0.3, random_state=42, stratify=df['target'])
y_test2.hist()

This time the histogram shows that every class appears equally often in the test set.
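The same comparison can be made with numbers instead of histograms. The sketch below counts the classes in each test set with value_counts; as above, building df from load_iris(as_frame=True) is an assumption for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Assumption: same data layout as in the exercise (features + 'target').
df = load_iris(as_frame=True).frame
df_features = df.drop(columns="target")

# Split without stratify, keeping only the test labels.
_, _, _, y_test = train_test_split(
    df_features, df["target"], test_size=0.3, random_state=42
)
# Split with stratify on the class labels.
_, _, _, y_test2 = train_test_split(
    df_features, df["target"], test_size=0.3, random_state=42,
    stratify=df["target"],
)

plain_counts = y_test.value_counts().sort_index()
stratified_counts = y_test2.value_counts().sort_index()

print("without stratify:", plain_counts.tolist())
print("with stratify:   ", stratified_counts.tolist())
```

With stratify, each of the three classes should contribute 15 of the 45 test samples, while the counts from the plain random split depend on the seed.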