Spark Dataset Operations (2): Filtering with filter and where

scala> val df = spark.createDataset(Seq(("aaa",1,2),("bbb",3,4),("ccc",3,5),("bbb",4,6))).toDF("key1","key2","key3")
df: org.apache.spark.sql.DataFrame = [key1: string, key2: int ... 1 more field]

scala> df.show
+----+----+----+
|key1|key2|key3|
+----+----+----+
| aaa|   1|   2|
| bbb|   3|   4|
| ccc|   3|   5|
| bbb|   4|   6|
+----+----+----+

The filter function

According to the documentation on the Spark website, filter comes in the following forms:

def filter(func: (T) ⇒ Boolean): Dataset[T]
def filter(conditionExpr: String): Dataset[T]
def filter(condition: Column): Dataset[T]
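The first signature, which takes a function from T to Boolean, can be sketched as follows. This is a minimal sketch, not taken from the original post: it assumes a local SparkSession, and the names FilterWithFunction, Record, and run are hypothetical, introduced only for illustration. On a DataFrame the element type T is Row, so the function receives a Row; after converting to a typed Dataset with as[Record], it receives a Record.

```scala
import org.apache.spark.sql.{Row, SparkSession}

object FilterWithFunction {
  // Hypothetical case class mirroring the three columns of the sample data.
  case class Record(key1: String, key2: Int, key3: Int)

  // Returns the key1 values kept by each variant of the function-based filter.
  def run(spark: SparkSession): (Seq[String], Seq[String]) = {
    import spark.implicits._
    val df = Seq(("aaa", 1, 2), ("bbb", 3, 4), ("ccc", 3, 5), ("bbb", 4, 6))
      .toDF("key1", "key2", "key3")

    // DataFrame is Dataset[Row], so the predicate is Row => Boolean.
    val byRow = df.filter((row: Row) => row.getAs[Int]("key2") == 3)

    // Typed Dataset[Record]: the predicate can use plain field access.
    val byField = df.as[Record].filter((r: Record) => r.key2 == 3)

    (byRow.collect().map(_.getAs[String]("key1")).toSeq,
     byField.collect().map(_.key1).toSeq)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("filter-func").getOrCreate()
    try println(run(spark)) finally spark.stop()
  }
}
```

Both variants should keep the same rows as the Column-based condition on key2 equal to 3; the function form trades the concise column syntax for arbitrary Scala logic in the predicate.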

So all of the following forms work:

scala> df.filter($"key1">"aaa").show
+----+----+----+
|key1|key2|key3|
+----+----+----+
| bbb|   3|   4|
| ccc|   3|   5|
| bbb|   4|   6|
+----+----+----+

scala> df.filter($"key1"==="aaa").show
+----+----+----+
|key1|key2|key3|
+----+----+----+
| aaa|   1|   2|
+----+----+----+

scala> df.filter("key1='aaa'").show
+----+----+----+
|key1|key2|key3|
+----+----+----+
| aaa|   1|   2|
+----+----+----+

scala> df.filter("key2=1").show
+----+----+----+
|key1|key2|key3|
+----+----+----+
| aaa|   1|   2|
+----+----+----+

scala> df.filter($"key2"===3).show
+----+----+----+
|key1|key2|key3|
+----+----+----+
| bbb|   3|   4|
| ccc|   3|   5|
+----+----+----+

scala> df.filter($"key2"===$"key3"-1).show
+----+----+----+
|key1|key2|key3|
+----+----+----+
| aaa|   1|   2|
| bbb|   3|   4|
+----+----+----+

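Here, === is a method defined on the Column class; its inequality counterpart is =!=. $"column name" is syntactic sugar that returns a Column object.

The where function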
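where accepts the same string and Column conditions as filter; the Dataset API documents it as an alias for filter. The sketch below is one way to check this on the sample data. It assumes a local SparkSession, and the names WhereIsFilter and run are hypothetical, not from the original post. It also uses col("key1") as the explicit equivalent of the $"key1" shorthand.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object WhereIsFilter {
  // Counts the rows with key1 = 'bbb' using filter and two forms of where.
  def run(spark: SparkSession): (Long, Long, Long) = {
    import spark.implicits._
    val df = Seq(("aaa", 1, 2), ("bbb", 3, 4), ("ccc", 3, 5), ("bbb", 4, 6))
      .toDF("key1", "key2", "key3")

    val viaFilter   = df.filter($"key1" === "bbb").count()    // Column condition via $"..."
    val viaWhereCol = df.where(col("key1") === "bbb").count() // Column condition via col(...)
    val viaWhereStr = df.where("key1 = 'bbb'").count()        // SQL expression string

    (viaFilter, viaWhereCol, viaWhereStr)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("where-is-filter").getOrCreate()
    try println(run(spark)) finally spark.stop()
  }
}
```

Choosing between filter and where is therefore a matter of style; where may read more naturally to readers coming from SQL.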

scala> df.where("key1 = 'bbb'").show
+----+----+----+
|key1|key2|key3|
+----+----+----+
| bbb|   3|   4|
| bbb|   4|   6|
+----+----+----+

scala> df.where($"key2"=!= 3).show
+----+----+----+
|key1|key2|key3|
+----+----+----+
| aaa|   1|   2|
| bbb|   4|   6|
+----+----+----+

scala> df.where($"key3">col("key2")).show
+----+----+----+
|key1|key2|key3|
+----+----+----+
| aaa|   1|   2|
| bbb|   3|   4|
| ccc|   3|   5|
| bbb|   4|   6|
+----+----+----+

scala> df.where($"key3">col("key2")+1).show
+----+----+----+
|key1|key2|key3|
+----+----+----+
| ccc|   3|   5|
| bbb|   4|   6|
+----+----+----+