Build a fat JAR file (using sbt assembly)

Posted by Tim Lin on 2019-02-24

To run a Spark job automatically on a schedule on a server, you first need to package it as a fat JAR and then run it with spark-submit.

What is a fat JAR?

An uber (fat) JAR is defined as one that contains both your package and all its dependencies in one single JAR file.

Install sbt first

Windows:
https://www.scala-sbt.org/0.13/docs/Installing-sbt-on-Windows.html

Linux:
https://www.scala-sbt.org/0.13/docs/Installing-sbt-on-Linux.html

Configure sbt-assembly

See this article:
Creating Scala Fat Jars for Spark on SBT with sbt-assembly Plugin

In the project/ directory at the project root (next to build.sbt), add a file named assembly.sbt with the following content:

addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.9")

Replace the existing content of build.sbt with:

lazy val root = (project in file(".")).
  settings(
    name := "ScalaSBTTest",
    version := "1.0",
    scalaVersion := "2.11.12",
    // Entry point of the application
    mainClass in Compile := Some("WordCount")
  )

libraryDependencies += "org.apache.spark" %% "spark-core" % "2.4.0"
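
sbt-assembly also has its own scoped keys for the main class and the output file name. By default the fat JAR is named <name>-assembly-<version>.jar. The fragment below is only a sketch of optional additions to build.sbt in sbt-assembly 0.14.x syntax; the file name "wordcount-fat.jar" is a made-up example, and the rest of this post keeps the default name.

```scala
// build.sbt (optional additions, sbt-assembly 0.14.x syntax)

// Main class written into the manifest of the fat JAR
mainClass in assembly := Some("WordCount")

// Output file name; "wordcount-fat.jar" is a hypothetical name for illustration.
// Without this setting the JAR is named <name>-assembly-<version>.jar.
assemblyJarName in assembly := "wordcount-fat.jar"
```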

Open cmd and run the following from the project root (in this example, D:\workspace\scala\ScalaSBTTest):

sbt assembly

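But at this point the build fails with an error:

"deduplicate: different file contents found in the following:"...
(followed by a long list of conflicting JARs)

See this question:
Spark 2: “deduplicate: different file contents found in the following:”

Add a merge strategy to build.sbt to resolve the conflicts: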

assemblyMergeStrategy in assembly := {
  case PathList("org","aopalliance", xs @ _*) => MergeStrategy.last
  case PathList("javax", "inject", xs @ _*) => MergeStrategy.last
  case PathList("javax", "servlet", xs @ _*) => MergeStrategy.last
  case PathList("javax", "activation", xs @ _*) => MergeStrategy.last
  case PathList("org", "apache", xs @ _*) => MergeStrategy.last
  case PathList("com", "google", xs @ _*) => MergeStrategy.last
  case PathList("com", "esotericsoftware", xs @ _*) => MergeStrategy.last
  case PathList("com", "codahale", xs @ _*) => MergeStrategy.last
  case PathList("com", "yammer", xs @ _*) => MergeStrategy.last
  case "about.html" => MergeStrategy.rename
  case "META-INF/ECLIPSEF.RSA" => MergeStrategy.last
  case "META-INF/mailcap" => MergeStrategy.last
  case "META-INF/mimetypes.default" => MergeStrategy.last
  case "plugin.properties" => MergeStrategy.last
  case "log4j.properties" => MergeStrategy.last
  // Anything else falls back to the plugin's default strategy
  case x =>
    val oldStrategy = (assemblyMergeStrategy in assembly).value
    oldStrategy(x)
}
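
Because spark-submit supplies the Spark classes at run time, a commonly used alternative is to mark the Spark dependency as "provided", so that sbt-assembly leaves it out of the fat JAR. This tends to avoid most of the conflicts above and makes the JAR much smaller. The line below is a sketch of that alternative, not what this post's project does:

```scala
// build.sbt (alternative sketch, not used in this post):
// "provided" keeps spark-core on the compile classpath
// but excludes it from the assembled fat JAR.
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.4.0" % "provided"
```

One trade-off: with "provided", Spark is no longer on the classpath for a plain `sbt run`, so the job has to be started through spark-submit.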

Run it again:

sbt assembly

The build succeeds.

The JAR is generated at D:\workspace\scala\ScalaSBTTest\target\scala-2.11\ScalaSBTTest-assembly-1.0.jar
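
To check that the result really is a fat JAR, you can list its entries and confirm that it contains both the application classes and the dependency classes. The following is a minimal sketch using only the JDK's java.util.jar API; the object name InspectJar and the default path are assumptions made for illustration.

```scala
import java.util.jar.JarFile
import scala.collection.JavaConverters._

// Hypothetical helper, not part of the original project.
object InspectJar {
  /** Returns the names of all JAR entries whose path starts with the given prefix. */
  def entriesWithPrefix(jarPath: String, prefix: String): List[String] = {
    val jar = new JarFile(jarPath)
    // toList forces the iterator before the JAR is closed
    try jar.entries().asScala.map(_.getName).filter(_.startsWith(prefix)).toList
    finally jar.close()
  }

  def main(args: Array[String]): Unit = {
    // Assumption: the JAR path is the first argument, else the path from this post
    val jarPath = args.headOption.getOrElse("target/scala-2.11/ScalaSBTTest-assembly-1.0.jar")
    val sparkEntries = entriesWithPrefix(jarPath, "org/apache/spark/")
    val appEntries = entriesWithPrefix(jarPath, "WordCount")
    println(s"Spark entries: ${sparkEntries.size}, WordCount entries: ${appEntries.size}")
  }
}
```

If both counts are greater than zero, the JAR bundles the application together with its Spark dependency.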

Source code:

https://github.com/timmyBeef/SparkSbtAssemblyDemo.git

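Run with spark-submit

Go to D:\workspace\scala\ScalaSBTTest\target\scala-2.11 and try running it directly:

D:\workspace\scala\ScalaSBTTest\target\scala-2.11>spark-submit ScalaSBTTest-assembly-1.0.jar

Unsurprisingly, it fails, because word.txt has not been put there yet.

Add word.txt to that directory and run it again.

The result is printed successfully, but the temp files cannot be deleted…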
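
For context, the job being submitted is a WordCount that reads word.txt from the working directory. The real code is in the repository linked above. What follows is only a minimal sketch of what such a job might look like with Spark 2.4 and Scala 2.11. The tokenizing rule, the output format, and the local-mode fallback are all assumptions.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Sketch only; the actual WordCount in the linked repository may differ.
object WordCount {
  /** Splits a line into lowercase words, dropping empty tokens. */
  def tokenize(line: String): Seq[String] =
    line.toLowerCase.split("\\W+").filter(_.nonEmpty).toSeq

  def main(args: Array[String]): Unit = {
    // Assumption: fall back to local mode if no master was configured
    val conf = new SparkConf()
      .setAppName("WordCount")
      .setIfMissing("spark.master", "local[*]")
    val sc = new SparkContext(conf)
    try {
      sc.textFile("word.txt")            // relative path: resolved against the working directory
        .flatMap(tokenize)
        .map(word => (word, 1))
        .reduceByKey(_ + _)
        .collect()
        .foreach { case (word, count) => println(s"$word: $count") }
    } finally sc.stop()
  }
}
```

Because the path is relative, a job like this fails when word.txt is missing from the directory where spark-submit is run.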


What are key differences between sbt-pack and sbt-assembly?

https://stackoverflow.com/questions/22556499/what-are-key-differences-between-sbt-pack-and-sbt-assembly

sbt-assembly

sbt-assembly creates a fat JAR - a single JAR file containing all class files from your code and libraries. By evolution, it also contains ways of resolving conflicts when multiple JARs provide the same file path (like config or README file). It involves unzipping of all library JARs, so it’s a bit slow, but these are heavily cached.

sbt-pack

sbt-pack keeps all the library JARs intact, moves them into target/pack directory (as opposed to ivy cache where they would normally live), and makes a shell script for you to run them.

sbt-native-packager

sbt-native-packager is similar to sbt-pack but it was started by a sbt committer Josh Suereth, and now maintained by highly capable Nepomuk Seiler (also known as muuki88). The plugin supports a number of formats like Windows msi file and Debian deb file. The recent addition is a support for Docker images.

All are viable means of creating deployment images. In certain cases like deploying your application to a web framework etc., it might make things easier if you’re dealing with one file as opposed to a dozen.

Reference:

sbt official website
sbt-assembly official website