
[Overseas Real Estate Monthly Rent Prediction (1)] Basic EDA Practice

by ISLA! 2023. 9. 15.

Data Load

  • 8,692 records
  • ID: unique ID for each sample
  • Real estate information
  • Latitude and longitude of the building (note: using a map API to bring in additional information is not allowed)
  • target: monthlyRent(us_dollar): monthly rent, in units of 1 US dollar
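
The post uses `total_df` without showing how it was loaded. A minimal sketch of that step might look like the following; the helper name `load_rent_data`, the CSV format, and the inline demo rows are assumptions for illustration, not the author's actual file or code.

```python
import io

import pandas as pd

# Target column name as given in the post.
TARGET = "monthlyRent(us_dollar)"


def load_rent_data(source) -> pd.DataFrame:
    """Read a CSV (path or file-like object) and check that the target column exists."""
    df = pd.read_csv(source)
    if TARGET not in df.columns:
        raise ValueError(f"expected a '{TARGET}' column")
    return df


# Made-up rows standing in for the real file, which is not part of the post.
demo_csv = io.StringIO(
    "ID,bedrooms,monthlyRent(us_dollar)\n"
    "1,3,2100\n"
    "2,1,1300\n"
)
demo_total_df = load_rent_data(demo_csv)
print(demo_total_df.shape)
```

With the real data, the same helper would be called with the path to the competition file instead of the in-memory demo.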

 

Basic EDA

▶︎ Identify qualitative and quantitative variables

# Qualitative variables
qual_df = total_df[['propertyType', 'suburbName']]

# Quantitative variables
quan_df = total_df.drop(columns = ['propertyType', 'suburbName'])
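
The split above hard-codes the two qualitative column names. As an alternative sketch, the same split can be derived from the column dtypes with `select_dtypes`; the demo frame below uses made-up values, and only the column names come from the post.

```python
import pandas as pd

# Tiny illustrative frame: the values are invented, the column names are from the post.
sample_df = pd.DataFrame({
    "propertyType": ["house", "unit", "house"],
    "suburbName": ["A", "B", "A"],
    "bedrooms": [3, 1, 4],
    "monthlyRent(us_dollar)": [2100, 1300, 2600],
})

# Numeric columns are treated as quantitative, everything else as qualitative.
quan_cols = sample_df.select_dtypes(include="number").columns.tolist()
qual_cols = sample_df.select_dtypes(exclude="number").columns.tolist()

print(qual_cols)
print(quan_cols)
```

This only works when the dtypes match the meaning of the columns. A numeric code such as an ID or a postcode would still land in the quantitative group and has to be moved by hand.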

 

▶︎ Check missing values, row and column layout, data types / summary statistics
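
The post lists these checks without code. A minimal sketch using standard pandas calls is shown below; the helper name `summarize` and the demo values are assumptions for illustration.

```python
import numpy as np
import pandas as pd


def summarize(df: pd.DataFrame) -> pd.DataFrame:
    """One row per column: dtype, number of missing values, share of missing values."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "n_missing": df.isnull().sum(),
        "missing_ratio": df.isnull().mean(),
    })


# Made-up values with one missing entry per column.
demo_df = pd.DataFrame({
    "bedrooms": [3, np.nan, 4, 2],
    "propertyType": ["house", "unit", None, "house"],
})

print(demo_df.shape)       # (number of rows, number of columns)
print(summarize(demo_df))  # dtype and missing values per column
print(demo_df.describe())  # summary statistics of the numeric columns
```

`demo_df.info()` gives a similar overview in a single call if a printed report is enough.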

 

▶︎ Visualization

1) Distribution of quantitative variables: histogram

import matplotlib.pyplot as plt

# One histogram per numeric column
quan_df.hist(bins = 100, figsize = (18, 18))
plt.show()

 

 

2) Distribution of qualitative variables: countplot

  • Qualitative variables cannot be drawn as a histogram unless they are converted to numbers.
  • However, a simple numeric encoding often fails to reflect the nature of a qualitative variable,
  • so qualitative variables are usually shown with a countplot, which displays how often each value is observed.

import seaborn as sns

# One countplot per qualitative column, stacked vertically
fig, axes = plt.subplots(2, 1, figsize = (20, 15))

sns.countplot(x = qual_df['propertyType'], ax = axes[0])
sns.countplot(x = qual_df['suburbName'], ax = axes[1])

plt.show()
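
A countplot draws one bar per category, with the bar height equal to that category's frequency. The same numbers can be read directly with `value_counts`, which helps when a column such as `suburbName` has too many categories to read on the axis. The helper name `top_categories` and the demo values below are assumptions for illustration.

```python
import pandas as pd


def top_categories(series: pd.Series, n: int = 10) -> pd.Series:
    """Frequencies of the n most common values, sorted from most to least frequent."""
    return series.value_counts().head(n)


# Made-up suburb labels.
demo_suburbs = pd.Series(["A", "B", "A", "C", "A", "B"], name="suburbName")
counts = top_categories(demo_suburbs, n=2)
print(counts)
```

The resulting index can be passed to `sns.countplot` through its `order` argument to plot only the most frequent categories, in sorted order.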

 

 

3) Outlier check: box plot

# Partial code (only the first row of the 3x3 grid is filled here)
fig, axes = plt.subplots(3, 3, figsize = (15, 15))

sns.boxplot(y = quan_df['bedrooms'], ax = axes[0][0])
sns.boxplot(y = quan_df['latitude'], ax=axes[0][1])
sns.boxplot(y = quan_df['longitude'], ax=axes[0][2])
plt.show()
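
With seaborn's default settings, the points drawn beyond the whiskers of a box plot are the values more than 1.5 × IQR away from the quartiles. A minimal sketch of that rule as a numeric check is shown below; the helper name `iqr_outliers` and the demo values are assumptions for illustration.

```python
import pandas as pd


def iqr_outliers(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Return the values lying more than k * IQR below Q1 or above Q3."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return series[(series < lower) | (series > upper)]


# Made-up bedroom counts with one obviously extreme value.
demo_bedrooms = pd.Series([1, 2, 2, 3, 3, 3, 4, 20], name="bedrooms")
print(iqr_outliers(demo_bedrooms))
```

Whether a flagged value is an error or a genuine but rare property is a separate judgment. For coordinates such as latitude and longitude, an "outlier" may simply be a listing in a distant suburb.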

 

4) Correlation check: heatmap

plt.figure(figsize = (15, 15))
# Use the numeric columns only: corr() raises an error on string columns in pandas 2.0 and later
sns.heatmap(quan_df.corr(), annot = True, fmt = '.1f', linewidths = 1, cmap = 'Blues')
plt.show()
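
For a prediction task, the most relevant part of the heatmap is the row for the target. A minimal sketch that extracts that row and ranks the features by absolute correlation is shown below; the helper name `corr_with_target` and the demo values are assumptions for illustration.

```python
import pandas as pd


def corr_with_target(df: pd.DataFrame, target: str) -> pd.Series:
    """Pearson correlation of every numeric column with the target, strongest first."""
    corr = df.select_dtypes(include="number").corr()[target].drop(target)
    return corr.sort_values(key=abs, ascending=False)


# Made-up values; only the column names come from the post.
demo_df = pd.DataFrame({
    "bedrooms": [1, 2, 3, 4],
    "latitude": [2, 1, 4, 3],
    "propertyType": ["unit", "unit", "house", "house"],
    "monthlyRent(us_dollar)": [1000, 1500, 2000, 2500],
})

ranked = corr_with_target(demo_df, "monthlyRent(us_dollar)")
print(ranked)
```

Pearson correlation only captures linear relationships, so a low value here does not by itself mean that a feature is useless for the model.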
