The usual way to add a column to a DataFrame in Spark is withColumn. On its own, however, withColumn only builds the new column from expressions over existing columns (essentially re-adding an existing column under a new name); it does not directly add an arbitrary custom column, such as an id or a timestamp.
// Create a DataFrame
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

val sparkconf = new SparkConf()
  .setMaster("local")
  .setAppName("test")
val spark = SparkSession.builder().config(sparkconf).getOrCreate()
val tempDataFrame = spark.createDataFrame(Seq(
  (1, "asf"),
  (2, "2143"),
  (3, "rfds")
)).toDF("id", "content")

// Add a column derived from an existing one
val addColDataframe = tempDataFrame.withColumn("col", tempDataFrame("id") * 0)
addColDataframe.show(10, false)
Output:
+---+-------+---+
|id |content|col|
+---+-------+---+
|1  |asf    |0  |
|2  |2143   |0  |
|3  |rfds   |0  |
+---+-------+---+
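For the specific goals mentioned above (adding an id or a timestamp), it is worth noting that Spark's built-in column functions already cover these cases without a UDF. A minimal sketch using `lit`, `monotonically_increasing_id`, and `current_timestamp` from `org.apache.spark.sql.functions` (column names here are illustrative):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{lit, monotonically_increasing_id, current_timestamp}

val spark = SparkSession.builder().master("local").appName("test").getOrCreate()
val df = spark.createDataFrame(Seq((1, "asf"), (2, "2143"))).toDF("id", "content")

val withExtras = df
  .withColumn("flag", lit(1))                          // constant column
  .withColumn("row_id", monotonically_increasing_id()) // unique (but not consecutive) id
  .withColumn("ts", current_timestamp())               // timestamp at evaluation time
withExtras.show(false)
```

Note that `monotonically_increasing_id` guarantees unique, increasing ids but not consecutive ones, since the id encodes the partition number.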
When the built-in expressions are not enough, you can write a custom function with udf and use it to add the new column:
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

// Create a DataFrame
val sparkconf = new SparkConf()
  .setMaster("local")
  .setAppName("test")
val spark = SparkSession.builder().config(sparkconf).getOrCreate()
val tempDataFrame = spark.createDataFrame(Seq(
  ("a", "asf"),
  ("b", "2143"),
  ("c", "rfds")
)).toDF("id", "content")

// Define the custom function and wrap it as a UDF
val code = (arg: String) => { if (arg.getClass.getName == "java.lang.String") 1 else 0 }
val addCol = udf(code)

// Add a column computed by the UDF
val addColDataframe = tempDataFrame.withColumn("col", addCol(tempDataFrame("id")))
addColDataframe.show(10, false)
Output:
+---+-------+---+
|id |content|col|
+---+-------+---+
|a  |asf    |1  |
|b  |2143   |1  |
|c  |rfds   |1  |
+---+-------+---+
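One caveat: the UDF above returns 1 for every non-null string (and would throw a NullPointerException on null), so it mostly demonstrates the wiring. A UDF usually computes something from the value itself; as an illustrative variation on the same pattern, a null-safe UDF that flags whether the content column is purely numeric might look like this (the `numeric` column name is hypothetical):

```scala
import org.apache.spark.sql.functions.udf

// Flag rows whose content is purely numeric; guard against null and
// empty strings explicitly, since `forall` on "" would return true.
val isNumeric = udf((s: String) =>
  if (s != null && s.nonEmpty && s.forall(_.isDigit)) 1 else 0
)

val flagged = tempDataFrame.withColumn("numeric", isNumeric(tempDataFrame("content")))
flagged.show(false)
```

Guarding for null inside the UDF matters because Spark passes nulls in the column straight through to the Scala function.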