[ML] sklearn / iris dataset

Data Analysis / Machine Learning 2021. 4. 14. 00:19

- There are no missing values.

- The data is clean.

- Scores were computed without any particular preprocessing.

1. Importing modules

import numpy as np
import pandas as pd

from sklearn.datasets import load_iris # loads the dataset
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score,confusion_matrix,precision_score, recall_score, f1_score, roc_auc_score, roc_curve

# suppress warnings
import warnings
warnings.filterwarnings(action='ignore')

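- Load the iris data. The dataset bundled with scikit-learn is used.

2. Inspecting the iris data

dataset = load_iris()  # load the bundled iris dataset into `dataset`
print(dataset.keys())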
print(dataset.data[:5])
print(dataset.target)
print(dataset.target_names)  #['setosa' 'versicolor' 'virginica']
print(dataset.feature_names) #['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']

cols = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
df = pd.DataFrame(data=dataset.data, columns=cols)
df["target"] = dataset.target

- Create a DataFrame from the iris dataset

3. Inspecting the created iris DataFrame

df.info()  # info() prints its summary itself and returns None
print(df.head())

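- No missing values.

- The target feature is the class label assigned from the 'sepal_length', 'sepal_width', 'petal_length', 'petal_width' measurements.

- Checking the target values shows they are stored in sorted order: 50 rows of 0, then 50 rows of 1, then 50 rows of 2.

-> Shuffling looks necessary when splitting into train and test data.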
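To make the need for shuffling concrete, here is a minimal sketch that splits the sorted labels without shuffling and checks which classes end up in each part. The variable names (`iris`, `y_train_ns`, `y_test_ns`) are my own and not from the code in this post.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
y = iris.target

# Labels are stored in class order, 50 rows per class.
class_counts = np.bincount(y)
print(class_counts)

# With shuffle=False the last 20% of the rows becomes the test set.
_, _, y_train_ns, y_test_ns = train_test_split(
    iris.data, y, test_size=0.2, shuffle=False
)
train_classes = np.unique(y_train_ns)
test_classes = np.unique(y_test_ns)
print("train classes:", train_classes)
print("test classes:", test_classes)
```

Without shuffling, the test set holds only rows from the end of the table, so it contains a single class and any score computed on it would say little about the other two.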

4. Splitting the data before training

df_y = df.iloc[:,-1] # df['target']
df_X = df.iloc[:,:-1]

print(df_X.shape,len(df_y))

- Separate the features from the target

- df_y is a Series

X_train, X_test,y_train,y_test = train_test_split(df_X,df_y, test_size=0.2, random_state=36, shuffle=True)

print('X_train',X_train.shape,'X_test',X_test.shape,'y_train',len(y_train),'y_test',len(y_test),sep=' ')

- Split into training and test data

- The split comes out as 120 train rows : 30 test rows
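As an optional variation, `train_test_split` also accepts a `stratify` argument that keeps the class ratio the same in both parts. The sketch below is not what this post ran; it reuses the same `test_size` and `random_state` only for comparison, and the variable names are my own.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# stratify=y keeps each class at the same proportion in train and test.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=36, shuffle=True, stratify=y
)

print(X_tr.shape, X_te.shape)
test_counts = np.bincount(y_te)   # rows per class in the test set
print(test_counts)
```

A plain shuffled split can leave the 30 test rows slightly unbalanced across classes, while the stratified one gives every class an equal share.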

5. Training

1) Choosing a model

from sklearn.tree import DecisionTreeClassifier

dt = DecisionTreeClassifier()  # the model used as `dt` below; default hyperparameters are assumed

- Chose the DecisionTree classifier

2) Training

dt.fit(X_train,y_train)
pred = dt.predict(X_test) # ask the model to guess -> it returns its answers
proba = dt.predict_proba(X_test) # per-class probabilities for each test row

# check
print(pred[:5])
print(proba[:5])

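- The 5 predicted answers shown are [1,2,1,2,2]

- proba looks one-hot encoded. predict_proba returns one probability column per class, and an unpruned decision tree usually ends in pure leaves, so each row is typically a single 1 with 0s elsewhere. No actual OneHotEncoder is involved.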
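The shape of the `predict_proba` output can be checked with a small sketch like the one below. It rebuilds the split from NumPy arrays instead of the DataFrame, and `clf` with `random_state=0` is my own choice rather than the post's `dt`.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=36)

clf = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)

print(proba.shape)            # (rows in X_te, number of classes)
row_sums = proba.sum(axis=1)  # each row is a probability distribution

# predict() picks the class whose column holds the largest probability.
same = np.array_equal(clf.classes_[proba.argmax(axis=1)], clf.predict(X_te))
print(same)
```

Each column corresponds to one entry of `clf.classes_`, which is why the printed rows resemble a one-hot encoding of the predicted label.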

myscore(y_test,pred,proba) # answer key vs. my answers -> compute the scores

- Using a score function written beforehand, checked the Accuracy, Precision, Recall, F1, and AUC scores.
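The body of `myscore` is not shown in this post. The sketch below is a hypothetical stand-in with the same call signature. The macro averaging and the one-vs-rest AUC are my assumptions for the three-class case, not necessarily what the original function does.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

def myscore(y_true, pred, proba):
    """Hypothetical stand-in for the post's score function."""
    scores = {
        "accuracy": accuracy_score(y_true, pred),
        # macro average: unweighted mean of the per-class scores (assumed)
        "precision": precision_score(y_true, pred, average="macro"),
        "recall": recall_score(y_true, pred, average="macro"),
        "f1": f1_score(y_true, pred, average="macro"),
        # multiclass AUC needs the full probability matrix, not the labels
        "auc": roc_auc_score(y_true, proba, multi_class="ovr"),
    }
    for name, value in scores.items():
        print(f"{name}: {value:.4f}")
    return scores

# Toy check with perfect predictions (made-up labels, not iris results).
toy_true = np.array([0, 1, 2, 2, 1, 0])
toy_pred = toy_true.copy()
toy_proba = np.eye(3)[toy_pred]   # one-hot rows standing in for predict_proba
toy_scores = myscore(toy_true, toy_pred, toy_proba)
```

For multiclass targets, `precision_score`, `recall_score`, and `f1_score` need an explicit `average` setting, and `roc_auc_score` needs `multi_class`; the defaults only cover the binary case.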

- Even after trying different test sizes, some errors(?) still show up for target=1 (versicolor).
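One way to look at where such errors fall is a confusion matrix, which `confusion_matrix` (already imported in step 1) can produce. This is a sketch under my own assumptions: it rebuilds the split from the raw arrays and fixes `random_state=0` on the tree, so its counts need not match the run described above.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X_tr, X_te, y_tr, y_te = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=36, shuffle=True
)

clf = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)

# Rows are the true classes, columns the predicted classes, in label order.
cm = confusion_matrix(y_te, clf.predict(X_te), labels=[0, 1, 2])
print(iris.target_names)
print(cm)
```

Off-diagonal entries in row 1 or column 1 are the test rows where versicolor was confused with another class.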
