
[Learning Spark] Column and Row

๋ณต๋งŒ 2023. 11. 19. 22:00

Column

 

์ŠคํŒŒํฌ ๋ฐ์ดํ„ฐํ”„๋ ˆ์ž„์—์„œ๋Š” Column์˜ ์ด๋ฆ„์„ ์ด์šฉํ•ด ๋‹ค์–‘ํ•œ ์—ฐ์‚ฐ์„ ์ˆ˜ํ–‰ํ•  ์ˆ˜ ์žˆ๋‹ค.

 

pyspark.sql.Column — PySpark 3.5.0 documentation (spark.apache.org)

 

There are several ways to access a column in PySpark: one is the col("columnName") function, and another is the df.columnName attribute.

 

๋‹ค์Œ์€ Column์„ ์ด์šฉํ•œ ์—ฐ์‚ฐ์˜ ๋ช‡๊ฐ€์ง€ ์˜ˆ์‹œ์ด๋‹ค.

 

from pyspark.sql.functions import col, concat

df.withColumn("colABC", concat(col("colA"), col("colB"), col("colC")))  # create a column colABC by concatenating colA, colB, and colC

df.select(col("colA"))  # select only colA

df.sort(col("colA").desc())  # sort by colA in descending order
df.sort(df.colA.desc())  # same as the line above

 

 

Row

 

์ŠคํŒŒํฌ์˜ Row๋Š” ์ˆœ์„œ๊ฐ€ ์žˆ๋Š” ํ•„๋“œ์˜ ์ง‘ํ•ฉ ๊ฐ์ฒด๋ผ๊ณ  ๋ณผ ์ˆ˜ ์žˆ๋‹ค. ๋”ฐ๋ผ์„œ ์ธ๋ฑ์Šค๋ฅผ ์ด์šฉํ•˜์—ฌ ์ ‘๊ทผํ•  ์ˆ˜ ์žˆ๋‹ค.

 

pyspark.sql.Row — PySpark 3.5.0 documentation (spark.apache.org)

 

from pyspark.sql import Row

row = Row(6, "text", ["a", "b"])

row[1]
>> "text"

 

๋‹ค์Œ๊ณผ ๊ฐ™์ด Row๋“ค์„ ๋ฐ์ดํ„ฐํ”„๋ ˆ์ž„์œผ๋กœ ๋งŒ๋“ค ์ˆ˜ ์žˆ๋‹ค.

 

rows = [Row("Alice", 11), Row("Bob", 8)]

df = spark.createDataFrame(rows, ["Name", "Age"])

 

 

๋ฐ˜์‘ํ˜•