Preprocessing (1) train_test_split

Porits789 2022. 7. 14. 15:43

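※ The data used in this exercise is the Iris data set.

Before training a model, the data needs to be split into a training set and a test set.

This post shows how to do that with scikit-learn's train_test_split function.

First, import train_test_split from scikit-learn.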

from sklearn.model_selection import train_test_split

The basic usage is as follows.

x_train, x_test, y_train, y_test = train_test_split(
    x_data, y_data, test_size=0.2, random_state=42)

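The code above shuffles x_data and y_data and splits them into 80% for training and 20% for testing. Fixing random_state makes the split reproducible.

However, a purely random split can leave some classes over- or under-represented in one of the subsets, which can lower accuracy.

To avoid this, use stratified sampling, which draws samples from each class in proportion to its share of the data.

With train_test_split, you do this by passing the labels to the stratify parameter.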
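To make the 80/20 ratio concrete, here is a minimal sketch that runs the same call on the Iris data and checks the resulting sizes. Loading the data with load_iris(return_X_y=True) is an assumption made for illustration; the post does not say how x_data and y_data were prepared.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Assumption: x_data / y_data are the Iris features and labels (150 samples).
x_data, y_data = load_iris(return_X_y=True)

x_train, x_test, y_train, y_test = train_test_split(
    x_data, y_data, test_size=0.2, random_state=42
)

# 150 samples split 80/20 gives 120 training rows and 30 test rows.
print(x_train.shape, x_test.shape)  # (120, 4) (30, 4)
```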

x_train, x_test, y_train, y_test = train_test_split(
    x_data, y_data, test_size=0.2, random_state=42, stratify=y_data)
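The effect of stratify is easiest to see on imbalanced labels. The sketch below uses a hypothetical toy data set (90 samples of class 0 and 10 of class 1, not from the original post) and counts the labels in each subset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 90 samples of class 0 and 10 samples of class 1.
x_data = np.arange(100).reshape(-1, 1)
y_data = np.array([0] * 90 + [1] * 10)

x_train, x_test, y_train, y_test = train_test_split(
    x_data, y_data, test_size=0.2, random_state=42, stratify=y_data
)

# Count how many samples of each class landed in each subset.
train_counts = np.bincount(y_train)
test_counts = np.bincount(y_test)
print(train_counts, test_counts)  # [72  8] [18  2]
```

Both subsets keep the original 9:1 class ratio, which a plain random split does not guarantee.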

Below is a hands-on example using the Iris data.

1. Without stratify
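The post does not show how df and df_features were created. One way to get objects with those names, assuming scikit-learn 0.23 or later (for as_frame=True) and pandas installed, might look like this:

```python
from sklearn.datasets import load_iris

# Assumption: df holds the four feature columns plus a 'target' column,
# and df_features holds the feature columns only.
iris = load_iris(as_frame=True)
df = iris.frame
df_features = df.drop(columns="target")

print(df.shape, df_features.shape)  # (150, 5) (150, 4)
```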

x_train, x_test, y_train, y_test = train_test_split(
    df_features, df['target'], test_size=0.3, random_state=42)
y_test.hist()  # histogram of the class labels in the test set

As the histogram shows, the test set ends up with an imbalanced class distribution.

2. With stratify

x_train2, x_test2, y_train2, y_test2 = train_test_split(
    df_features, df['target'], test_size=0.3, random_state=42,
    stratify=df['target'])
y_test2.hist()  # histogram of the class labels in the stratified test set

The histogram confirms that every class is equally represented in the test set.
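If you prefer numbers to a plot (hist() needs matplotlib), value_counts() gives the same comparison. This is a minimal sketch that rebuilds df and df_features under the assumption that they come from load_iris(as_frame=True), then repeats both splits.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Assumption: df / df_features are built from scikit-learn's Iris loader.
df = load_iris(as_frame=True).frame
df_features = df.drop(columns="target")

# Split 1: plain random split.
_, _, _, y_test = train_test_split(
    df_features, df["target"], test_size=0.3, random_state=42
)
# Split 2: stratified on the class labels.
_, _, _, y_test2 = train_test_split(
    df_features, df["target"], test_size=0.3, random_state=42,
    stratify=df["target"],
)

# Per-class sample counts in each test set, ordered by class label.
plain_counts = y_test.value_counts().sort_index()
stratified_counts = y_test2.value_counts().sort_index()
print(plain_counts)
print(stratified_counts)
```

Iris has 50 samples per class, so the stratified test set of 45 samples contains 15 of each class, while the counts from the plain split depend on the shuffle.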
