[Spark] Loading Data and Checking Descriptive Statistics in Databricks

Uploading the data

Main screen > Data > Create Table

Checking the data in a notebook

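The path highlighted on the Create Table screen is the data path.
Load the data by calling spark.read.csv('<path>.csv').
Note that /FileStore is a directory in DBFS (the Databricks File System), so this path is resolved by Spark; plain Python file APIs such as open() will not find it as written.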

titanic_sdf = spark.read.csv('/FileStore/tables/titanic_train-1.csv', header=True, inferSchema=True)
print('titanic sdf type:', type(titanic_sdf))

# View the data as a formatted table in Databricks
display(titanic_sdf)

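The loaded object is a PySpark DataFrame, which is a different type from a pandas DataFrame.
display() confirms that the data was loaded correctly.

A Spark DataFrame can be converted to a pandas DataFrame with .select('*').toPandas(). The select('*') is optional, and toPandas() collects every row to the driver, so it only suits data that fits in driver memory.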

import pandas as pd

# Create a pandas DataFrame from the Spark DataFrame.
titanic_pdf = titanic_sdf.select('*').toPandas()
print(type(titanic_pdf))

# display() also works on a pandas DataFrame.
display(titanic_pdf)

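# Printing the first 10 rows of a DataFrame

A Spark DataFrame does have .head(n), but it returns a list of Row objects rather than a formatted table as in pandas.
Calling print() on a Spark DataFrame prints only the column names and types, not the data.
To see the rows, use df.limit(10).show() or display(df.limit(10)).

# show() prints the table itself and returns None, so wrapping it in print() adds a trailing 'None'
print(titanic_sdf.limit(10).show())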
titanic_sdf.limit(10).show()

display(titanic_sdf.limit(10))

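# Getting info()-style output in Spark

Spark has no equivalent of pandas' info().
Column names and types can be checked with df.printSchema() instead.
Null counts have to be computed with a SQL-style expression such as the one below.

# Similar to COUNT(CASE WHEN ...) in SQL
from pyspark.sql.functions import count, isnan, when, col

# Count the values that are NaN or null in each column.
# Note: isnan() raises a type-mismatch error on date/timestamp columns.
titanic_sdf.select([count(when(isnan(c) | col(c).isNull(), c)).alias(c) for c in titanic_sdf.columns]).show()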

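# Using describe() in Spark

Spark also has describe(), but its output is not identical to pandas': it reports count, mean, stddev, min, and max.
Percentiles are the only thing missing, and, unlike pandas' default, string columns are summarized as well as numeric ones.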

display(titanic_sdf.describe())

# Checking data types (dtypes)

titanic_sdf.dtypes

# Run describe() on the numeric columns only (here, every column whose type is not 'string')
number_columns = [column_name for column_name, dtype in titanic_sdf.dtypes if dtype != 'string']
print(number_columns)

titanic_sdf.select(number_columns).describe().show()
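
The filter above keeps every column whose type is not 'string', which would also keep boolean, date, or timestamp columns if the data had any. A minimal sketch of a stricter filter follows, assuming the (name, dtype) pairs look like what DataFrame.dtypes returns. The helper name numeric_columns and the sample list (including the is_adult and boarded_at columns) are made up for illustration and are not part of the original post or the Titanic file.

```python
# Spark SQL type names, as they appear in DataFrame.dtypes, that count as numeric.
NUMERIC_TYPES = {"tinyint", "smallint", "int", "bigint", "float", "double"}


def numeric_columns(dtypes):
    """Return the names of columns whose dtype string is a numeric type.

    `dtypes` is a list of (column_name, dtype_string) pairs, the same shape
    that DataFrame.dtypes returns.
    """
    return [
        name
        for name, dtype in dtypes
        # Decimal types carry precision and scale, e.g. 'decimal(10,2)'.
        if dtype in NUMERIC_TYPES or dtype.startswith("decimal")
    ]


# Hypothetical dtypes list for illustration; not read from the real file.
sample_dtypes = [
    ("PassengerId", "int"),
    ("Name", "string"),
    ("Age", "double"),
    ("Fare", "double"),
    ("is_adult", "boolean"),
    ("boarded_at", "timestamp"),
]

print(numeric_columns(sample_dtypes))
```

In a notebook this would be called as numeric_columns(titanic_sdf.dtypes), and the result passed to select() before describe().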

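# Checking the number of rows and columns (shape)

Spark has no shape attribute.
.columns returns the column names as a list, so the column count is the length of that list.

# A Spark DataFrame's columns attribute returns the column names as a list.
print('columns:', titanic_sdf.columns)
print('number of columns:', len(titanic_sdf.columns))

Use the count() method to get the total number of rows.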

print(titanic_sdf.count())
print(type(titanic_sdf.count()))

To print both at once, combine them into a tuple:
(df.count(), len(df.columns))

print('titanic_sdf shape:', (titanic_sdf.count(), len(titanic_sdf.columns)))
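
If the shape is needed in several places, the tuple can be wrapped in a small helper. This is only a sketch: spark_shape is a hypothetical name, not a Spark API, and FakeFrame is a stand-in object used here so the example runs without a cluster.

```python
def spark_shape(df):
    """Return (row_count, column_count), mirroring pandas' DataFrame.shape.

    Works with any object that offers count() and columns, which a Spark
    DataFrame does. Note that count() runs a Spark job over the data.
    """
    return (df.count(), len(df.columns))


class FakeFrame:
    """Stand-in (not a Spark class) so the sketch runs without a cluster."""

    columns = ["PassengerId", "Survived", "Pclass"]

    def count(self):
        return 5


print(spark_shape(FakeFrame()))
```

In a notebook the call would be spark_shape(titanic_sdf). Because count() scans the data each time, it is worth storing the result rather than calling it repeatedly on a large DataFrame.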
