
[Big Data Analysis Engineer, practical exam] Task Type 2: Example, house price prediction model (regression), real estate data

by Chloe._. 2022. 6. 25.

์•ž์˜ ๋‘๋ฒˆ์˜ ์‹ค๊ธฐ๋Š” ๋ถ„๋ฅ˜๊ฐ€ ๋‚˜์™”์ง€๋งŒ ํ˜น์‹œ ๋ชจ๋ฅด๋‹ˆ ํšŒ๊ท€ ํ•˜๋‚˜ ์—ฐ์Šตํ•ด๊ฐ„๋‹ค. ๋„ˆ๋ฌด ์–ด๋ ต๋‹ค...!

์˜ˆ์ธกํ•ด์•ผํ•  ์ข…์†๋ณ€์ˆ˜(ํƒ€๊ฒŸ)๊ฐ€ ๋ฒ”์ฃผํ˜•์ด๋ฉด ๋ถ„๋ฅ˜, ์ˆ˜์น˜ํ˜•์ด๋ฉด ํšŒ๊ท€๋‹ค. ๋˜๋Š” roc_aucํ‰๊ฐ€์ง€ํ‘œ๋ฅผ ์‚ฌ์šฉํ•  ๊ฑฐ๋ผ๊ณ  ๋ช…์‹œ๋ผ์žˆ์œผ๋ฉด ๋ถ„๋ฅ˜์ด๊ณ , (์•„์ง ์ถœ์ œ๋œ ์  ์—†์ง€๋งŒ) r2 score, RMSE ๋“ฑ์˜ ์ ์ˆ˜๋ฅผ ์“ธ๊ฑฐ๋ผํ•˜๋ฉด ํšŒ๊ท€๋ผ๊ณ  ์ƒ๊ฐํ•˜๋ฉด ๋œ๋‹ค. 

import pandas as pd
import numpy as np

pd.set_option("display.max_columns", None)

# 1. EDA
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
# (1168, 79) (292, 79) (1168, 2) (292, 2)

print(y_train.columns) # (['Id', 'SalePrice']) -> predict the price, so regression
print(X_train.info()) # float64(3), int64(33), object(43)
print(X_test.info()) # float64(3), int64(33), object(43)

# 2-1. Drop id if present - none here

# 2-2. Missing values - fill with 0 for now
print(X_train.isna().sum())

X_train.fillna(0, inplace=True)
X_test.fillna(0, inplace=True)

print(X_train.info())
print(X_test.info())

์ปฌ๋Ÿผ์ด 79๊ฐœ๋กœ ์ƒ๋‹นํ•œ ๋ฐ์ดํ„ฐ์˜€๋‹ค. ์ปฌ๋Ÿผ๋ช…๋„ ๋ญ˜์ง€ ์ถ”๋ก ์ด ์•ˆ๋˜๋Š” ์ด๋ฆ„์ด์—ˆ๋‹ค. ๊ฒฐ์ธก์น˜๋„ ๋งŽ์•˜๋‹ค.

๊ฒฐ์ธก์น˜๋Š”... ๋ฌธ์žํ˜•์€ ์ตœ๋นˆ๊ฐ’์œผ๋กœ ์ฑ„์šฐ๊ณ  ์ˆ˜์น˜ํ˜•์€ ํ‰๊ท ์œผ๋กœ ์ฑ„์šธ๊นŒ? ๋ผ๋Š” ์ด์ƒ์„ ๊ฟˆ๊พธ๊ธด ํ•˜์˜€์ง€๋งŒ

ํ˜„์‹ค์ ์œผ๋กœ ์งง์€์ฝ”๋“œ๋ฅผ ์จ์•ผ ์‹œํ—˜์‹œ๊ฐ„์— ์•ˆ ์ซ„๋ฆด ๊ฑฐ ๊ฐ™์•„์„œ ์ผ๋‹จ 0์œผ๋กœ ์ฑ„์› ๋‹ค.
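The mode/mean idea does not have to be long. Here is a minimal sketch, assuming two toy frames named `train` and `test` with made-up columns (they are stand-ins, not the exam data). The fill values are learned from the training frame only and then reused on the test frame.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for X_train / X_test (hypothetical columns and values).
train = pd.DataFrame({
    "LotArea": [100.0, np.nan, 300.0, 200.0],
    "Street": ["Pave", "Pave", None, "Grvl"],
})
test = pd.DataFrame({
    "LotArea": [np.nan, 150.0],
    "Street": [None, "Grvl"],
})

# Split the columns into numeric and everything else.
num_cols = train.select_dtypes(include="number").columns
cat_cols = train.columns.difference(num_cols)

# Learn the fill values from the training frame only.
fill_values = {col: train[col].mean() for col in num_cols}
fill_values.update({col: train[col].mode()[0] for col in cat_cols})

# fillna accepts a dict that maps column name -> fill value.
train = train.fillna(fill_values)
test = test.fillna(fill_values)

print(train)
print(test)
```

This is only a few lines longer than `fillna(0)`, and it avoids putting the integer 0 into string columns.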

 

# 2-3. Encoding (skipped)
print(X_train.describe(include='object'))

X_train = X_train.select_dtypes(exclude=['object']) # (1168, 36)
X_test = X_test.select_dtypes(exclude=['object']) # (292, 36)
print(X_train.info(), X_test.info())

Collecting the names of the 43 columns to encode into a list felt tedious, and the example solution simply dropped them too;; so I followed along for now.

First time using select_dtypes.

For regression, you can look at correlations and 1) drop independent variables that are highly correlated with each other, because of multicollinearity concerns, and 2) drop independent variables that have low correlation with the dependent variable, judging their influence to be negligible. I could try both, but the (79*79) table was hard to read by eye;; I did try, and ran into the question of 'how high does a correlation coefficient have to be to justify dropping?'. When I masked values above 0.9, there were none! So I gained nothing in practice from this step.

Actually, I tried to write exception-handling code that runs a label-encoding for loop over every column without distinguishing types, so that categorical columns get encoded (try) and numeric ones are just skipped (except), but for some reason it raised no error and wasn't applied either. It didn't work, so I gave up and dropped the columns.

 

## Appendix: code I tried that got me nothing ##

### Correlation
corr = np.abs(X_train.corr())
print(corr[(corr>0.9)]) # only each column's 1.0 with itself!
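The boolean mask above still prints the full square table, diagonal included, which is what makes it hard to read. Here is a minimal sketch of one way to list only the off-diagonal pairs above a threshold. It assumes a toy frame `df` with made-up columns `a`, `b`, `c`, and the 0.9 cutoff is just the value tried above, not a recommended rule.

```python
import numpy as np
import pandas as pd

# Toy data: "b" is almost a copy of "a", "c" is unrelated noise.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
df = pd.DataFrame({
    "a": a,
    "b": a * 2 + rng.normal(scale=0.01, size=200),
    "c": rng.normal(size=200),
})

corr = df.corr().abs()

# Keep only the strict upper triangle: this hides the diagonal (always 1.0)
# and the mirrored duplicate of every pair.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Reshape to one row per column pair, then keep the highly correlated ones.
pairs = upper.stack()
high_pairs = pairs[pairs > 0.9]
print(high_pairs)
```

With 36 numeric columns this gives a short list of pairs instead of a 36-by-36 grid that is mostly NaN.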

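### try-except: this doesn't work, so don't copy it
cols = list(X_train.columns)
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()

# Note: the try wraps the whole loop, so the first ValueError ends the loop for
# every remaining column, and LabelEncoder does not raise on numeric columns.
try:
    for col in cols:
        X_train[col] = le.fit_transform(X_train[col])
        X_test[col] = le.transform(X_test[col])
except ValueError:
    pass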
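For comparison, here is a minimal sketch of a loop that encodes only the non-numeric columns, so no try-except is needed. The frames `train` and `test` and their columns are toy stand-ins. Fitting each encoder on the values of both frames is my assumption for handling labels that appear only in the test set, not something the study material does.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy stand-ins for X_train / X_test (hypothetical columns and values).
train = pd.DataFrame({"Street": ["Pave", "Grvl", "Pave"], "LotArea": [100, 200, 300]})
test = pd.DataFrame({"Street": ["Grvl", "Alley"], "LotArea": [150, 250]})

# Select the columns to encode by dtype instead of typing 43 names by hand.
num_cols = train.select_dtypes(include="number").columns
cat_cols = train.columns.difference(num_cols)

for col in cat_cols:
    le = LabelEncoder()  # a fresh encoder per column
    # Fit on both frames so a label seen only in test ("Alley") does not raise.
    le.fit(pd.concat([train[col], test[col]]).astype(str))
    train[col] = le.transform(train[col].astype(str))
    test[col] = le.transform(test[col].astype(str))

print(train)
print(test)
```

The `astype(str)` call also covers columns where `fillna(0)` left a mix of strings and the integer 0.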

 

Getting back to the rest of the code...

# 2-4. Scaling
from sklearn.preprocessing import RobustScaler
ro = RobustScaler()

cols = list(X_train.columns)
for col in cols:
    X_train[col] = ro.fit_transform(X_train[[col]])
    X_test[col] = ro.transform(X_test[[col]])
    
print(X_train.head().T)
print(X_train.describe())

์Šค์ผ€์ผ๋ง์„ ํ•ด๋„ ์ตœ๋Œ€๊ฐ’์ด ์—„์ฒญ ํŠ€๋Š” ํŠน์ดํ•œ ์ปฌ๋Ÿผ๋“ค์ด 7๊ฐœ๋‚˜ ๋๋‹ค. (์˜ˆ๋กœ ๋“ค๋ฉด ์ตœ์†Œ๊ฐ’0, ์ตœ๋นˆ๊ฐ’0, ์ค‘์•™๊ฐ’0์ธ๋ฐ ์ตœ๋Œ€๊ฐ’ 1543 ์ด๋Ÿฐ์‹...)์ผ๋‹จ ๋ƒ…๋‘๊ณ  ์ ์ˆ˜๊ฐ€ ๋ชป๋งˆ๋•…ํ•˜๋ฉด ๋‹ค์‹œ ๋Œ์•„์˜ค๊ธฐ๋กœ ํ–ˆ๋‹ค.

 

# 3. Validation - split
from sklearn.model_selection import train_test_split
xx_train, x_val, yy_train, y_val = train_test_split(X_train, y_train, train_size=0.9)
print(xx_train.shape, x_val.shape, yy_train.shape, y_val.shape) # (1051, 36) (117, 36) (1051, 2) (117, 2)

# 4-1. Training
from sklearn.ensemble import RandomForestRegressor
model = RandomForestRegressor(n_estimators=500, max_depth=5, random_state=42)
model.fit(X_train, y_train['SalePrice']) # if this gets confusing, just go with the val split...

# 3-1. Validation - predict
pred_val = model.predict(x_val)

# 3-2. Validation - score
from sklearn.metrics import mean_squared_error
RMSE = np.sqrt(mean_squared_error(y_val['SalePrice'], pred_val))
print(RMSE) # 22942.418081449574

# 4-2. Predict on X_test
pred = model.predict(X_test)

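# 5. Submission file
# Id lives in the y frames here (X_test has no Id column); the file name is a placeholder for your examinee number.
pd.DataFrame({'id': y_test['Id'], 'SalePrice': pred}).to_csv("수험번호.csv", index=False)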

This is the part I get most confused by...! RandomForestRegressor is used the same way as the Classifier, so that was fine. I had to use predict, not predict_proba! Since the data given in the exam is small, unlike in real-world work, I was told it performs better to use train_test_split only to get a rough idea of my score, then fit on the original X_train (that is, the split train and val put back together) and predict on test, so that's what I tried. (The catch: the validation RMSE above is then measured on rows the model already saw during fit, so it comes out optimistic.) But I keep forgetting to index just the target column from y_train...
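To make the "score on a split, then refit on everything" idea concrete, here is a minimal sketch on synthetic data from `make_regression` (not the house price data). The names `honest`, `final`, and the `rmse` helper are mine. Trees are left at their default depth to keep the example short, so the numbers are not comparable to the ones in this post.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic regression data standing in for X_train / y_train['SalePrice'].
X, y = make_regression(n_samples=300, n_features=10, noise=20.0, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, train_size=0.9, random_state=0)


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return np.sqrt(mean_squared_error(y_true, y_pred))


# Step 1: estimate the score with a model that never sees the validation rows.
honest = RandomForestRegressor(n_estimators=100, random_state=42).fit(X_tr, y_tr)
honest_rmse = rmse(y_val, honest.predict(X_val))

# Step 2: refit on all rows for the final prediction on the test set.
final = RandomForestRegressor(n_estimators=100, random_state=42).fit(X, y)

# Scoring the refit model on X_val looks better only because it trained on those rows.
leaky_rmse = rmse(y_val, final.predict(X_val))

print(honest_rmse, leaky_rmse)
```

The first number is the one to trust as a rough estimate; the second is the kind of number you get when the fit uses the full training set before validating.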

I still haven't memorized the from sklearn.metrics imports either.. For regression there are plenty of scores that could be used for grading..! If I can't import them, I end up submitting without even knowing my own score.
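As a memory aid, here is a small sketch of the regression metrics I would expect to need, computed on four made-up values so the results can be checked by hand. RMSE is taken as the square root of `mean_squared_error`, the same way as in the code above.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up targets and predictions; every prediction is off by exactly 10.
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_pred = np.array([110.0, 190.0, 310.0, 390.0])

mse = mean_squared_error(y_true, y_pred)   # mean of squared errors
rmse = np.sqrt(mse)                        # back in the units of the target
mae = mean_absolute_error(y_true, y_pred)  # mean of absolute errors
r2 = r2_score(y_true, y_pred)              # 1.0 would be a perfect fit

print(mse, rmse, mae, r2)
```

For MSE, RMSE, and MAE lower is better, while for r2 higher is better.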

 

#______Scoring________
RMSE = np.sqrt(mean_squared_error(y_test['SalePrice'], pred))
print(RMSE) # 28657.429279872926

RMSE, the square root of MSE, means better performance the lower it is, and this seems decent enough.

 

The end.

 

Study material https://www.kaggle.com/code/blighpark/t2-4-house-prices-regression
