Reduce dependency on NVME storage / HDFS #8
Comments
I managed to create the following `pom.xml`:

```xml
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>ch.epfl.bbp.functionalizer</groupId>
  <artifactId>Functionalizer</artifactId>
  <version>1.0-SNAPSHOT</version>
  <dependencies>
    <!-- https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-aws -->
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-aws</artifactId>
      <version>3.3.4</version>
    </dependency>
  </dependencies>
</project>
```

and installed the contents with Maven via `mvn install dependency:copy-dependencies`.

Then, using the following Python test script (called `sls.py`):

```python
import pyspark
import pyspark.sql
from pathlib import Path

conf = pyspark.conf.SparkConf()
conf.setMaster("local").setAppName("teschd")

# Put the hadoop-aws jars copied by Maven onto the driver and executor classpaths
jars = str(Path(".").resolve() / "target" / "dependency" / "*")
conf.set("spark.driver.extraClassPath", jars)
conf.set("spark.executor.extraClassPath", jars)

# conf.set("spark.hadoop.fs.s3a.endpoint", "https://bbpobjectstorage.epfl.ch")
# conf.set("spark.hadoop.fs.s3a.endpoint.region", "ch-gva-1")
# conf.set("spark.hadoop.fs.s3a.access.key", "")
# conf.set("spark.hadoop.fs.s3a.secret.key", "")
# conf.set("log4j.logger.software.amazon.awssdk.request", "DEBUG")

sc = pyspark.context.SparkContext(conf=conf)
sql = pyspark.sql.SQLContext(sc)

df = sql.read.parquet("s3a://hornbach-please-delete-me/touchesData.0.parquet")
df.show()
# sql.read.parquet("s3a://access-test/dumbo.parquet")
```

I was able to access the S3 bucket referenced in the script with `python sls.py`, of course with the right AWS access keys exported into the shell environment. This did not work when attempting to access an S3 bucket on NetApp. This would allow us to store the checkpoints on a temporary S3 bucket rather than spawning a Hadoop cluster just for this purpose.
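As a rough follow-up sketch (not tested; the bucket name below is a placeholder): with the same hadoop-aws classpath and `fs.s3a` settings as in `sls.py`, Spark's checkpoint directory can simply be pointed at an `s3a://` URI, so checkpoints would land in a temporary bucket instead of HDFS.

```python
import pyspark
from pathlib import Path

# Sketch only: reuse the hadoop-aws jars from target/dependency and point
# Spark checkpoints at a temporary S3 bucket ("some-temporary-bucket" is a
# placeholder name).
conf = pyspark.conf.SparkConf().setMaster("local").setAppName("checkpoint-to-s3")
jars = str(Path(".").resolve() / "target" / "dependency" / "*")
conf.set("spark.driver.extraClassPath", jars)
conf.set("spark.executor.extraClassPath", jars)

sc = pyspark.context.SparkContext(conf=conf)
sc.setCheckpointDir("s3a://some-temporary-bucket/functionalizer-checkpoints")

# Any subsequent rdd.checkpoint() / DataFrame.checkpoint() calls would then
# write to the bucket rather than to an HDFS cluster spawned for the job.
```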
Due to the very degraded performance with many small files, Functionalizer spawns a Hadoop file system cluster and stores checkpoint data there.
This inflates the Functionalizer Docker container size, since a full Hadoop installation is required, and it forces us to use larger SSD storage on the nodes. We should look into storing checkpoints somewhere else.
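For contrast, and purely as an illustration (the actual Functionalizer checkpoint wiring may differ), today the checkpoint location is an HDFS URI served by the cluster spawned alongside the job; the hostname, port, and path below are hypothetical.

```python
import pyspark

# Illustration only: checkpoints currently target the per-job HDFS cluster
# (hostname, port, and path here are hypothetical).
conf = pyspark.conf.SparkConf().setMaster("local").setAppName("hdfs-checkpoints")
sc = pyspark.context.SparkContext(conf=conf)
sc.setCheckpointDir("hdfs://namenode.example:8020/functionalizer/checkpoints")
```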