
[Big Data Analysis Engineer, Practical Exam] Task Type 2: 3rd Past Exam, Insurance Purchase Prediction Model (Classification), Travel Data

by Chloe._. 2022. 6. 24.

Study material: https://class101.net/classes/6161bc52559cfb0015ef4ff1/contents/61ca8d860ef56e000dfda2be?productId=467P0ZPH0lVX9FwFBDz7

 


The instructor released this for free..! 퇴근후딴짓 is the best.

Apparently this instructor got the full 40 points with just about this much code.

Only the submitted file is scored and the code itself is not graded, so I feel less pressure to cover lots of different methods.

I just need to follow the baseline without making mistakes..!

 

Problem: With travel insurance labeled as not purchased (0) or purchased (1), predict the probability of purchase.

Unlike the 2nd past exam, the 3rd past exam did not split the data into X and y. Only Train and Test were given, and the target column that exists only in Train has to be separated later. Binary classification with predict_proba came up again this time. Counting the official example, all three have been classification models... I can't help worrying that a regression model will show up for the first time on tomorrow's exam....

Also, a clear code example for the submission file reportedly was not given, which led to many formal objections. Check the required layout carefully and get to_csv exactly right.
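As a rough sketch of what "getting to_csv right" can look like: the column names index and y_pred are the ones used later in this post, while the three probabilities and the in-memory buffer are made up purely for illustration. Reading the file back is a cheap way to confirm the layout before submitting.

```python
import io

import numpy as np
import pandas as pd

# Stand-in probabilities; in the real task these come from predict_proba(test)[:, 1].
pred = np.array([0.12, 0.85, 0.40])

submission = pd.DataFrame({"index": np.arange(len(pred)), "y_pred": pred})

# Written to an in-memory buffer here; on the exam this would be a file path.
buffer = io.StringIO()
submission.to_csv(buffer, index=False)  # index=False avoids an extra unnamed column

# Read the output back to confirm the column names and the row count.
check = pd.read_csv(io.StringIO(buffer.getvalue()))
print(check.columns.tolist())
print(check.shape)
```

Without index=False, pandas would also write the DataFrame's own row index as an unnamed first column, which is an easy way to end up with a file in the wrong shape.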

import numpy as np 
import pandas as pd

train = pd.read_csv('../input/jakuphyung23rdtest/train.csv') # 1490 * 10
test = pd.read_csv('../input/jakuphyung23rdtest/test.csv') # 497 * 9

pd.set_option('display.max_columns',None)
pd.set_option('display.max_rows', 20)

# 0. EDA
print(train)
print(train.info())
print('-'*50)
print(test.info())

# 1. Drop the id column from train, pop it from test
train.drop(columns='Unnamed: 0', inplace=True)
unnamed = test.pop('Unnamed: 0')
print(train.columns, test.columns)

# 2. Missing values: none

No missing values! That would never happen in real-world work!

I end up hardly ever using isnull().sum(), since info() already shows it all! Same goes for shape.

I keep getting SyntaxErrors because I write column names with [] in some places and with just '' in others... plus typos...

This time there was no need to reuse unnamed, but I used pop() as usual anyway.
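A small sketch of the difference between the calls mentioned above, using a made-up three-row frame (the values are hypothetical; only the column names mirror this dataset). It shows isnull().sum() with the dot in place, and how pop() returns the removed column while drop() simply discards it.

```python
import pandas as pd

# Toy data for illustration only.
df = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2],
    "Age": [25, 31, None],
    "FrequentFlyer": ["Yes", "No", "No"],
})

# isnull().sum() gives the missing count per column (note the dot between the calls).
missing = df.isnull().sum()
print(missing)

# pop() removes the column in place AND returns it, so the ids stay available.
ids = df.pop("Unnamed: 0")

# drop() takes a column name (or a list of names) and returns a new frame without it.
df_no_age = df.drop(columns="Age")

print(ids.tolist())
print(df.columns.tolist(), df_no_age.columns.tolist())
```

A single column name can be passed as a plain string; several names go in a list such as ['Age', 'AnnualIncome'].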

 

# 3. Encoding:

print(train.describe(include='object'))
print(test.describe(include='object'))
# Every column has nunique of 2, so this is simple

cols = ['Employment Type', 'GraduateOrNot', 'FrequentFlyer', 'EverTravelledAbroad']

from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
for col in cols:
    train[col] = le.fit_transform(train[col])
    test[col] = le.transform(test[col])

print(train.head().T)
print(test.head().T)

I'm running describe() here rather than during EDA, haha.

Build the habit of printing after every single step to confirm it was applied.

If you dash off the code and run it all at once, tracking down where it first went wrong becomes a hassle.
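One way to make the "print after every step" habit concrete for the encoding step, sketched on made-up toy frames. The encoders dictionary is my own hypothetical addition: it keeps one fitted LabelEncoder per column so each mapping can be printed and checked.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frames for illustration only; the real data has more rows and columns.
train = pd.DataFrame({"FrequentFlyer": ["No", "Yes", "No"],
                      "GraduateOrNot": ["Yes", "Yes", "No"]})
test = pd.DataFrame({"FrequentFlyer": ["Yes", "No"],
                     "GraduateOrNot": ["No", "Yes"]})

encoders = {}  # hypothetical helper: one fitted encoder per column
for col in ["FrequentFlyer", "GraduateOrNot"]:
    enc = LabelEncoder()
    train[col] = enc.fit_transform(train[col])  # learn the mapping on train only
    test[col] = enc.transform(test[col])        # reuse the same mapping on test
    encoders[col] = enc
    # classes_ is sorted, and the position of each label is its encoded value.
    print(col, list(enc.classes_))

print(train)
print(test)
```

Reusing a single encoder across columns, as in the code above, also works because fit_transform refits each time. Keeping them separate just makes the mappings available afterwards.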

 

# 4. Scaling

print(train.describe())
print(test.describe())
## Excluding unnamed (dropped earlier) and the target TravelInsurance, there are 4 numeric columns
## Age is usually binned into age groups after checking its distribution, but the range is narrow here, so skip
## FamilyMembers is 2~9 and ChronicDiseases is 0~1; both are small in scale, so skip scaling
## Scale only Age and AnnualIncome

from sklearn.preprocessing import RobustScaler
cols = ['Age', 'AnnualIncome']
ro = RobustScaler()
# Use the RobustScaler (ro) here, not the LabelEncoder (le) from step 3
train[cols] = ro.fit_transform(train[cols])
test[cols] = ro.transform(test[cols])
    
print(train.head().T)
print(test.head().T)

I pick the columns to scale using my own reasoning.

Isn't the whole point of scaling to put columns on comparable scales anyway..?! No need to scale every column just because it's numeric, heh..

RobustScaler is said to be a method that accounts for outliers; I don't know the details.

So on the exam I plan to use Robust rather than MinMax or Standard.
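For the curious, a minimal sketch of what RobustScaler does with its default settings: it subtracts the median and divides by the interquartile range (75th minus 25th percentile), so a single extreme value barely affects how the other values are scaled. The income figures below are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Made-up incomes with one obvious outlier.
income = np.array([[300.0], [500.0], [700.0], [900.0], [10000.0]])

scaler = RobustScaler()
scaled = scaler.fit_transform(income)

# The same computation by hand: (x - median) / IQR.
median = np.median(income)
q1, q3 = np.percentile(income, [25, 75])
manual = (income - median) / (q3 - q1)

print(scaled.ravel())
print(np.allclose(scaled, manual))
```

The outlier still ends up far from the rest after scaling, but it does not squash the ordinary values together the way a min/max-based scaling would.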

 

# 5. Split off validation data
from sklearn.model_selection import train_test_split
xx_train, x_val, yy_train, y_val = train_test_split(train.drop(columns='TravelInsurance'),
                                                    train['TravelInsurance'],
                                                    train_size=0.8,
                                                    random_state=42)
                                                   
print(xx_train.shape, x_val.shape, yy_train.shape, y_val.shape)

# 5-1. Fit on the training split
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(n_estimators=500, max_depth=9, random_state=42)
model.fit(xx_train, yy_train)

# 5-2. Predict on the validation data
pred_val = model.predict_proba(x_val)[:,1] # probability of class 1

# 5-3. Score on the validation data
from sklearn.metrics import roc_auc_score
score = roc_auc_score(y_val, pred_val)
# print(score)
# n_estimators, max_depth, validation AUC
# 500, 5, 0.792
# 1000, 5, 0.791
# 500, 7, 0.804
# 500, 9, 0.805
# 700, 9, 0.804

Here begins the part I find hard..!

For train_test_split(), pass in x_train and y_train (they weren't provided separately here),

for fit(), also pass in x_train and y_train,

for prediction, naturally only x_test,

and for the score, pass in y_test and the predictions for x_test... (from the perspective of the grader, who has y_test, of course)

After tuning the hyperparameters, I settled for 0.805.
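The split, fit, predict, score order can be sketched end to end on synthetic data. Everything here is an assumption for illustration: the data comes from make_classification rather than the exam files, and the small loop over max_depth stands in for the manual tuning described above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic binary-classification data standing in for the real train set.
X, y = make_classification(n_samples=400, n_features=8, random_state=42)

# 1) Split: features and target go in, four pieces come out.
X_tr, X_val, y_tr, y_val = train_test_split(X, y, train_size=0.8, random_state=42)

scores = {}
for depth in (3, 5, 7):
    model = RandomForestClassifier(n_estimators=100, max_depth=depth, random_state=42)
    model.fit(X_tr, y_tr)                        # 2) fit: training features + target
    proba = model.predict_proba(X_val)[:, 1]     # 3) predict: features only, class-1 column
    scores[depth] = roc_auc_score(y_val, proba)  # 4) score: true labels first, then predictions

best_depth = max(scores, key=scores.get)
print(scores)
print(best_depth)
```

Note that roc_auc_score takes the true labels first and the predicted probabilities second, the same order used in the scoring code above.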

 

# 6. Predict on the actual test data
pred = model.predict_proba(test)[:,1]

# 7. Submission: index column, y_pred column, index 0~496, mind index=False
# '수험번호' (examinee number) is a placeholder file name
output = pd.DataFrame({'index':test.index,
                       'y_pred':pred})
output.to_csv('수험번호.csv', index=False)
                       
# _____________Scoring_____________________
y_test = pd.read_csv('../input/jakuphyung23rdtest/y_test.csv')
score = roc_auc_score(y_test, pred)
print(score) # 0.78

Scoring against the provided y_test gave 0.78, slightly lower.

End
