
Reduce dependency on NVME storage / HDFS #8

Open
matz-e opened this issue Sep 23, 2024 · 2 comments


matz-e commented Sep 23, 2024

Because performance degrades badly with many small files, Functionalizer spawns a Hadoop file system (HDFS) cluster and stores checkpoint data there.

This blows up the Functionalizer Docker container size, since a full Hadoop installation is required, and it forces us to use larger SSD storage on the nodes. We should look into storing the checkpoints somewhere else.


matz-e commented Nov 5, 2024

I managed to create the following pom.xml:

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>

  <groupId>ch.epfl.bbp.functionalizer</groupId>
  <artifactId>Functionalizer</artifactId>
  <version>1.0-SNAPSHOT</version>

  <dependencies>
    <!-- https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-aws -->
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-aws</artifactId>
      <version>3.3.4</version>
    </dependency>
  </dependencies>
</project>

and installed the dependencies with Maven via:

mvn install dependency:copy-dependencies
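As a quick sanity check (my addition, not part of the original report), one can list what Maven dropped into target/dependency; hadoop-aws-3.3.4.jar should appear there alongside the aws-java-sdk-bundle jar it pulls in transitively:

```python
from pathlib import Path

# Sketch: verify that `mvn install dependency:copy-dependencies` populated
# target/dependency. Assumes the pom.xml above sits in the current directory.
dep_dir = Path("target") / "dependency"
jars = sorted(p.name for p in dep_dir.glob("*.jar"))
print(jars)  # hadoop-aws-3.3.4.jar and aws-java-sdk-bundle-*.jar are expected here
```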

Then using the following Python test script (called sls.py):

import pyspark
from pathlib import Path

conf = pyspark.conf.SparkConf()
conf.setMaster("local").setAppName("teschd")
jars = Path(".").resolve() / "target" / "dependency" / "*"
# extraClassPath expects a string, so cast the Path explicitly
conf.set("spark.driver.extraClassPath", str(jars))
conf.set("spark.executor.extraClassPath", str(jars))
# conf.set("spark.hadoop.fs.s3a.endpoint", "https://bbpobjectstorage.epfl.ch")
# conf.set("spark.hadoop.fs.s3a.endpoint.region", "ch-gva-1")
# conf.set("spark.hadoop.fs.s3a.access.key", "")
# conf.set("spark.hadoop.fs.s3a.secret.key", "")
# conf.set("log4j.logger.software.amazon.awssdk.request", "DEBUG")

sc = pyspark.context.SparkContext(conf=conf)
sql = pyspark.sql.SQLContext(sc)

df = sql.read.parquet("s3a://hornbach-please-delete-me/touchesData.0.parquet")
df.show()
# sql.read.parquet("s3a://access-test/dumbo.parquet")

I was able to access the S3 bucket referenced in the script with

python sls.py

with the right AWS access keys exported into the shell environment, of course. This did not work when attempting to access an S3 bucket on NetApp.
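For context (my addition, not from the issue): the S3A connector's default credential provider chain includes the standard AWS environment variables, which is why exporting them is sufficient while the fs.s3a.access.key/secret.key settings stay commented out in the script. A minimal sketch with placeholder values:

```python
import os

# Placeholder credentials -- real values would come from the actual AWS account.
os.environ["AWS_ACCESS_KEY_ID"] = "AKIAEXAMPLE"
os.environ["AWS_SECRET_ACCESS_KEY"] = "example-secret-key"

# S3A's EnvironmentVariableCredentialsProvider picks these up, so sls.py
# can then be run unchanged in this environment:
#   python sls.py
```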

This would allow us to store the checkpoints on a temporary S3 bucket rather than spawning a Hadoop cluster just for this purpose.
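A rough sketch of how that could look (bucket name and layout are my assumptions, not anything decided here): derive a unique checkpoint prefix per run on a scratch bucket and hand it to Spark instead of an HDFS path.

```python
import uuid

# Hypothetical scratch bucket -- the actual bucket name would be decided later.
bucket = "s3a://functionalizer-scratch"
# One unique prefix per run, so concurrent runs cannot collide.
checkpoint_dir = f"{bucket}/checkpoints/{uuid.uuid4()}"
print(checkpoint_dir)

# With a SparkContext `sc` configured as in sls.py, checkpoints would then
# go to S3 via:
#   sc.setCheckpointDir(checkpoint_dir)
```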
