https://spark.apache.org/docs/2.2.1/rdd-programming-guide.html
Spark’s API relies heavily on passing functions in the driver program to run on the cluster. There are two recommended ways to do this:
- Anonymous function syntax, which can be used for short pieces of code (see the sketch after the example below).
- Static methods in a global singleton object. For example, you can define object MyFunctions and then pass MyFunctions.func1, as follows:
```scala
object MyFunctions {
  def func1(s: String): String = { ... }
}

myRdd.map(MyFunctions.func1)
```
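For comparison, the anonymous function syntax from the first option passes a function literal directly to the transformation. A minimal sketch, assuming `myRdd` is an existing `RDD[String]`:

```scala
// Pass an anonymous function (function literal) straight to the transformation.
// Spark serializes this small closure and ships it to the executors.
val lengths = myRdd.map(s => s.length)
```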
Note that while it is also possible to pass a reference to a method in a class instance (as opposed to a singleton object), this requires sending the entire object instance that contains the method along with it. For example, consider:
```scala
class MyClass {
  def func1(s: String): String = { ... }
  def doStuff(rdd: RDD[String]): RDD[String] = { rdd.map(func1) }
}
```
Here, if we create a new MyClass instance and call doStuff on it, the map inside doStuff references the func1 method of that MyClass instance, so the whole object needs to be sent to the cluster. This is similar to writing rdd.map(x => this.func1(x)).
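To see what this costs in practice, here is a minimal sketch of the failure mode, assuming a live SparkContext named `sc` (the names and method bodies are illustrative, not from the guide). If MyClass does not extend Serializable, shipping the captured instance to executors fails:

```scala
import org.apache.spark.rdd.RDD

// Illustrative: MyClass is not Serializable, and rdd.map(func1)
// forces the closure to capture `this` (the whole instance).
class MyClass {
  def func1(s: String): String = s.toUpperCase
  def doStuff(rdd: RDD[String]): RDD[String] = rdd.map(func1) // captures `this`
}

val rdd = sc.parallelize(Seq("a", "b"))
// Typically throws org.apache.spark.SparkException: Task not serializable,
// because the entire MyClass instance must be serialized with the task.
new MyClass().doStuff(rdd).collect()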
In a similar way, accessing fields of the outer object will reference the whole object:
```scala
class MyClass {
  val field = "Hello"
  def doStuff(rdd: RDD[String]): RDD[String] = { rdd.map(x => field + x) }
}
```
is equivalent to writing rdd.map(x => this.field + x), which references all of this. To avoid this issue, the simplest way is to copy field into a local variable instead of accessing it externally:
```scala
def doStuff(rdd: RDD[String]): RDD[String] = {
  val field_ = this.field
  rdd.map(x => field_ + x)
}
```
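Putting it together, a minimal sketch of the fixed class, again assuming a live SparkContext named `sc` (the driver code at the bottom is illustrative, not from the guide):

```scala
import org.apache.spark.rdd.RDD

class MyClass {
  val field = "Hello"
  def doStuff(rdd: RDD[String]): RDD[String] = {
    // Copy the field into a local variable; the closure now captures only
    // this String, not the enclosing MyClass instance, so MyClass itself
    // never needs to be serialized.
    val field_ = this.field
    rdd.map(x => field_ + x)
  }
}

val result = new MyClass().doStuff(sc.parallelize(Seq("Spark", "RDD"))).collect()
// result: Array("HelloSpark", "HelloRDD")
```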