Quick Start Guide
This tutorial helps you get started with the Kotlin Spark API. It uses an example similar to the one in the official Apache Spark Quick Start Guide. You'll learn how to set up your environment and how to write, package, and execute a simple self-contained application.
Prerequisites:
- You need Java installed, with the JAVA_HOME environment variable pointing to the Java installation.
- You need Apache Spark installed, with the SPARK_HOME environment variable pointing to the Spark installation. We recommend Apache Spark 3.3.2, which you can download from the official Spark website. The build files in this guide use Scala 2.13 artifacts, so choose a Spark package built for Scala 2.13.
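Before going further, it can help to confirm that both environment variables are visible to the JVM. The following is a minimal sketch, not part of the Kotlin Spark API; the helper name `describeHome` is hypothetical and only checks that each variable is set and points to an existing directory.

```kotlin
import java.io.File

/** Returns a human-readable status line for one environment variable (hypothetical helper). */
fun describeHome(name: String, value: String?): String {
    if (value.isNullOrBlank()) return "$name is not set"
    if (!File(value).isDirectory) return "$name points to a missing directory: $value"
    return "$name = $value"
}

fun main() {
    // Check the two variables listed in the prerequisites.
    for (name in listOf("JAVA_HOME", "SPARK_HOME")) {
        println(describeHome(name, System.getenv(name)))
    }
}
```

If either line reports a problem, fix the variable in your shell profile and open a new terminal before continuing.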
For the purposes of this tutorial, let's write a Kotlin program that counts the number of lines containing 'a' and the number containing 'b' in the Spark README. Note that you'll need to replace YOUR_SPARK_HOME with the location where Spark is installed:
/* SimpleApp.kt */
@file:JvmName("SimpleApp")
import org.jetbrains.kotlinx.spark.api.*

fun main() {
    val logFile = "YOUR_SPARK_HOME/README.md" // Change to your Spark Home path
    withSpark {
        spark.read().textFile(logFile).withCached {
            val numAs = filter { it.contains("a") }.count()
            val numBs = filter { it.contains("b") }.count()
            println("Lines with a: $numAs, lines with b: $numBs")
        }
    }
}
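To make the counting logic concrete, here is a plain-Kotlin sketch of the same filter-and-count step without Spark. It is only an illustration: the function name `countLinesContaining` and the three sample lines are made up for this example, and the input is an in-memory list rather than a Spark Dataset.

```kotlin
/** Counts how many of the given lines contain the given substring (case-sensitive). */
fun countLinesContaining(lines: List<String>, needle: String): Int =
    lines.count { it.contains(needle) }

fun main() {
    // Hypothetical sample input standing in for the Spark README.
    val lines = listOf("Apache Spark", "Kotlin", "for big data")
    val numAs = countLinesContaining(lines, "a")
    val numBs = countLinesContaining(lines, "b")
    println("Lines with a: $numAs, lines with b: $numBs")
    // prints "Lines with a: 2, lines with b: 1"
}
```

The Spark program applies the same predicate to each line of the file, but runs the filter and count through Spark and caches the Dataset so that the file is read only once for both counts.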
Because the Kotlin Spark API is not yet part of the official Apache Spark distribution, adding Spark as a dependency to your build file is not enough. You need to:
- Add Spark as a dependency
- Add the Kotlin Spark API as a dependency
- Add the Kotlin Standard Library as a dependency
When packaging your project into a jar file, you need to explicitly include the Kotlin Spark API and Kotlin Standard Library dependencies. Spark itself is supplied by your Spark installation at run time, so it does not need to be bundled. The examples below show how to build the application with Maven and with Gradle.
Here's what the pom.xml looks like for this example:
<project>
    <modelVersion>4.0.0</modelVersion>
    <groupId>org.example</groupId>
    <artifactId>kotlin-spark-example</artifactId>
    <version>1.0-SNAPSHOT</version>
    <name>Sample Project</name>
    <packaging>jar</packaging>
    <properties>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
        <kotlin.version>1.8.20</kotlin.version>
        <kotlin.code.style>official</kotlin.code.style>
    </properties>
    <dependencies>
        <dependency>
            <groupId>org.jetbrains.kotlin</groupId>
            <artifactId>kotlin-stdlib</artifactId>
            <version>${kotlin.version}</version>
        </dependency>
        <dependency> <!-- Kotlin Spark API dependency -->
            <groupId>org.jetbrains.kotlinx.spark</groupId>
            <artifactId>kotlin-spark-api_3.3.2_2.13</artifactId>
            <version>1.2.4</version>
        </dependency>
        <dependency> <!-- Spark dependency -->
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_2.13</artifactId>
            <version>3.3.2</version>
        </dependency>
    </dependencies>
    <build>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-shade-plugin</artifactId>
                <version>3.2.4</version>
                <executions>
                    <execution>
                        <phase>package</phase>
                        <goals>
                            <goal>shade</goal>
                        </goals>
                        <configuration>
                            <artifactSet>
                                <includes>
                                    <include>org.jetbrains.kotlinx.spark:*</include>
                                    <include>org.jetbrains.kotlin:*</include>
                                </includes>
                            </artifactSet>
                        </configuration>
                    </execution>
                </executions>
            </plugin>
            <plugin>
                <groupId>org.jetbrains.kotlin</groupId>
                <artifactId>kotlin-maven-plugin</artifactId>
                <version>${kotlin.version}</version>
                <configuration>
                    <sourceDirs>src/main/kotlin</sourceDirs>
                    <jvmTarget>1.8</jvmTarget>
                    <myIncremental>true</myIncremental>
                </configuration>
                <executions>
                    <execution>
                        <id>compile</id>
                        <goals>
                            <goal>compile</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>
</project>
Here's what the project structure should look like:
./pom.xml
./src
./src/main
./src/main/kotlin
./src/main/kotlin/SimpleApp.kt
Now you can package the application using Maven:
mvn package
Here's what the build.gradle looks like for this example:
plugins {
    id 'org.jetbrains.kotlin.jvm' version '1.8.20'
    id 'com.github.johnrengelman.shadow' version '5.2.0'
}

group = 'org.example'
version = '1.0-SNAPSHOT'

repositories {
    mavenCentral()
}

dependencies {
    // Kotlin stdlib
    implementation 'org.jetbrains.kotlin:kotlin-stdlib:1.8.20'
    // Kotlin Spark API
    implementation 'org.jetbrains.kotlinx.spark:kotlin-spark-api_3.3.2_2.13:1.2.4'
    // Apache Spark
    compileOnly 'org.apache.spark:spark-sql_2.13:3.3.2'
}

compileKotlin {
    kotlinOptions.jvmTarget = '1.8'
}

shadowJar {
    dependencies {
        exclude(dependency {
            it.moduleGroup == 'org.apache.spark' || it.moduleGroup == 'org.scala-lang'
        })
    }
}
Now you can package the application using Gradle:
gradle shadowJar
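Once you have your jar, you can execute the packaged application with ./bin/spark-submit. Maven writes the jar to the target directory, and the Gradle shadowJar task writes it to build/libs. The class name SimpleApp comes from the @file:JvmName("SimpleApp") annotation at the top of SimpleApp.kt: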
YOUR_SPARK_HOME/bin/spark-submit --class "SimpleApp" --master local [path to your jar]
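If you prefer to launch the job from Kotlin rather than typing the command by hand, the sketch below assembles the same argument list and runs it. This is an illustration under stated assumptions: `sparkSubmitCommand` is a hypothetical helper, not part of Spark or the Kotlin Spark API, and it assumes a Unix-style path separator and that SPARK_HOME is set as described in the prerequisites.

```kotlin
/** Builds the spark-submit argument list used in this guide (hypothetical helper). */
fun sparkSubmitCommand(sparkHome: String, mainClass: String, jarPath: String): List<String> =
    listOf(
        "$sparkHome/bin/spark-submit",
        "--class", mainClass,
        "--master", "local",
        jarPath
    )

fun main(args: Array<String>) {
    val sparkHome = System.getenv("SPARK_HOME")
    val jarPath = args.firstOrNull()
    if (sparkHome == null || jarPath == null) {
        println("Set SPARK_HOME and pass the path to your jar as the first argument")
        return
    }
    val command = sparkSubmitCommand(sparkHome, "SimpleApp", jarPath)
    println(command.joinToString(" "))
    // Run spark-submit and forward its output to this console.
    val exitCode = ProcessBuilder(command).inheritIO().start().waitFor()
    println("spark-submit exited with code $exitCode")
}
```

Passing the arguments as a list avoids shell quoting problems when the jar path contains spaces.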
This example is also available as a GitHub repo (now outdated); feel free to give it a try.