
Reduce dependency on NVME storage / HDFS #8

Open
matz-e opened this issue Sep 23, 2024 · 2 comments


matz-e commented Sep 23, 2024

Because performance degrades badly with many small files, Functionalizer spawns a Hadoop file system (HDFS) cluster and stores checkpoint data there.

This blows up the Functionalizer Docker container size, since a full Hadoop installation is required, and it forces us to use larger SSD storage on the nodes. We should look into storing the checkpoints somewhere else.


matz-e commented Nov 5, 2024

I managed to create the following pom.xml:

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>

  <groupId>ch.epfl.bbp.functionalizer</groupId>
  <artifactId>Functionalizer</artifactId>
  <version>1.0-SNAPSHOT</version>

  <dependencies>
    <!-- https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-aws -->
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-aws</artifactId>
      <version>3.3.4</version>
    </dependency>
  </dependencies>
</project>

and installed the dependencies with Maven via:

mvn install dependency:copy-dependencies
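As a quick sanity check (my addition, not part of the original report), one can list what Maven dropped into target/dependency; hadoop-aws-3.3.4.jar should appear there alongside the aws-java-sdk-bundle jar it pulls in transitively:

```python
from pathlib import Path

# Sketch: verify that `mvn install dependency:copy-dependencies` populated
# target/dependency. Assumes the pom.xml above sits in the current directory.
dep_dir = Path("target") / "dependency"
jars = sorted(p.name for p in dep_dir.glob("*.jar"))
print(jars)  # hadoop-aws-3.3.4.jar and aws-java-sdk-bundle-*.jar are expected here
```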

Then using the following Python test script (called sls.py):

import pyspark
from pathlib import Path

conf = pyspark.conf.SparkConf()
conf.setMaster("local").setAppName("teschd")
jars = Path(".").resolve() / "target" / "dependency" / "*"
# extraClassPath expects a string, so cast the Path explicitly
conf.set("spark.driver.extraClassPath", str(jars))
conf.set("spark.executor.extraClassPath", str(jars))
# conf.set("spark.hadoop.fs.s3a.endpoint", "https://bbpobjectstorage.epfl.ch")
# conf.set("spark.hadoop.fs.s3a.endpoint.region", "ch-gva-1")
# conf.set("spark.hadoop.fs.s3a.access.key", "")
# conf.set("spark.hadoop.fs.s3a.secret.key", "")
# conf.set("log4j.logger.software.amazon.awssdk.request", "DEBUG")

sc = pyspark.context.SparkContext(conf=conf)
sql = pyspark.sql.SQLContext(sc)

df = sql.read.parquet("s3a://hornbach-please-delete-me/touchesData.0.parquet")
df.show()
# sql.read.parquet("s3a://access-test/dumbo.parquet")

I was able to access the S3 bucket referenced in the script with

python sls.py

with the right AWS access keys exported into the shell environment, of course. This did not work when attempting to access an S3 bucket on NetApp.
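For context (my addition, not from the issue): the S3A connector's default credential provider chain includes the standard AWS environment variables, which is why exporting them is sufficient while the fs.s3a.access.key/secret.key settings stay commented out in the script. A minimal sketch with placeholder values:

```python
import os

# Placeholder credentials -- real values would come from the actual AWS account.
os.environ["AWS_ACCESS_KEY_ID"] = "AKIAEXAMPLE"
os.environ["AWS_SECRET_ACCESS_KEY"] = "example-secret-key"

# S3A's EnvironmentVariableCredentialsProvider picks these up, so sls.py
# can then be run unchanged in this environment:
#   python sls.py
```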

This would allow us to store the checkpoints on a temporary S3 bucket rather than spawning a Hadoop cluster just for this purpose.
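A rough sketch of how that could look (bucket name and layout are my assumptions, not anything decided here): derive a unique checkpoint prefix per run on a scratch bucket and hand it to Spark instead of an HDFS path.

```python
import uuid

# Hypothetical scratch bucket -- the actual bucket name would be decided later.
bucket = "s3a://functionalizer-scratch"
# One unique prefix per run, so concurrent runs cannot collide.
checkpoint_dir = f"{bucket}/checkpoints/{uuid.uuid4()}"
print(checkpoint_dir)

# With a SparkContext `sc` configured as in sls.py, checkpoints would then
# go to S3 via:
#   sc.setCheckpointDir(checkpoint_dir)
```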
