[Learning Spark] Column and Row
Column
In a Spark DataFrame, you can perform a variety of operations by referring to a Column by its name.
Reference: pyspark.sql.Column — PySpark 3.5.0 documentation (spark.apache.org)
PySpark offers several ways to reference a column: one is the col("columnName") function, another is the df.columnName attribute (df["columnName"] works as well).
Here are a few examples of operations using Column.
from pyspark.sql.functions import col, concat

df.withColumn("colABC", concat(col("colA"), col("colB"), col("colC")))  # concatenate colA, colB, colC into a new column colABC
df.select(col("colA"))       # select only colA
df.sort(col("colA").desc())  # sort by colA in descending order
df.sort(df.colA.desc())      # same as the line above
Row
A Spark Row can be thought of as an ordered collection of fields, so its values can be accessed by index.
Reference: pyspark.sql.Row — PySpark 3.5.0 documentation (spark.apache.org)
from pyspark.sql import Row

row = Row(6, "text", ["a", "b"])
row[1]
# 'text'
Rows can be turned into a DataFrame as follows.
rows = [Row("Alice", 11), Row("Bob", 8)]
df = spark.createDataFrame(rows, ["Name", "Age"])