[Learning Spark] Reading and Exporting DataFrames

복만 2023. 11. 19. 22:15

To load data from a structured external data source into a Spark DataFrame, and to write a DataFrame's data back out in a specific format, you can use the DataFrameReader and DataFrameWriter interfaces. They are accessed through SparkSession.read and DataFrame.write, respectively.

 

pyspark.sql.DataFrameReader — PySpark 3.5.0 documentation

Interface used to load a DataFrame from external storage systems (e.g. file systems, key-value stores, etc). Use SparkSession.read to access this. Changed in version 3.4.0: Supports Spark Connect.

spark.apache.org

 

pyspark.sql.DataFrameWriter — PySpark 3.5.0 documentation

Interface used to write a DataFrame to external storage systems (e.g. file systems, key-value stores, etc). Use DataFrame.write to access this. Changed in version 3.4.0: Supports Spark Connect.

spark.apache.org

 

Supported file formats include csv, json, orc, and parquet.

 

The following example reads and writes a csv file.

 

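# Assumes an active SparkSession named `spark` and a previously defined `schema`.
df = spark.read.csv("data.csv", header=True, schema=schema)
# save() creates a directory named data_copy.csv containing part files, not a single file.
df.write.format("csv").save("data_copy.csv")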

 
