
Spark 3

[Learning Spark] DataFrame Operations and Preprocessing

Spark's DataFrame operations can handle a wide range of tasks, including data preprocessing, transformation, and statistics. Below are a few operations with usage examples. Projection and filter: df = df.select(df.colA, df.colB) # projection (select only colA and colB) df = df.where(df.colB 10000")) # add a column largeA that is True when colA is at least 10000 df = df.drop("colA") # drop colA Note) alias and..

[Learning Spark] Reading and Exporting DataFrames

The DataFrameReader and DataFrameWriter interfaces can be used to read data from a structured external data source into a Spark DataFrame, and to write a DataFrame's data out in a specific format. pyspark.sql.DataFrameReader (PySpark 3.5.0 documentation): Interface used to load a DataFrame from external storage systems (e.g. file systems, key-value stores, etc). Use SparkSession.read to access this. Changed in version 3.4.0: Supports Spark Connect. spark.apache.or..

[Learning Spark] Column and Row

Column: In a Spark DataFrame, a wide range of operations can be performed by referring to a Column by name. pyspark.sql.Column (PySpark 3.5.0 documentation): A column in a DataFrame. Changed in version 3.4.0: Supports Spark Connect. Select a column out of a DataFrame >>> df.name Column >>> df["name"] Column spark.apache.org There are several ways to access a column in PySpark: one is to use the col("columnName") function, and another is to use df.columnName. The following is ... of operations using Column..
