Getting Started
Set up Quetzal with Docker on Linux (recommended)
-
Install docker using a package manager (instructions are available on the docker website). The following instructions were tested on Ubuntu 14.10.
-
Clone the code from the GitHub repository. **Please use the latest Maven and JDK 1.8 (a JRE is not sufficient).**
-
cd ~/git/quetzal/com.ibm.research.quetzal.core/docker/postgresql
-
Build the base docker image for the project: `sudo docker build --no-cache --rm -t "ibmresearch/quetzal_postgres" .` This step builds a docker image with Ubuntu as the OS, pre-populated with the quetzal code as cloned from the git repository, and with the PostgreSQL server and client code installed. The --rm option removes intermediate containers after a successful build, -t sets the tag given to the image so we can re-use it later, and --no-cache tells the system to build from scratch without consulting leftover images.
-
Create the directory /tmp/test:
mkdir -p /tmp/test
-
Copy the nt file you want to load into /tmp/test/ by:
cp test.nt /tmp/test/test.nt
A sample nt file is provided for testing in quetzal/com.ibm.research.quetzal.core/docker/test.nt.
-
sudo docker run -i -t -v /tmp/test:/tmp/test ibmresearch/quetzal_postgres /bin/bash
This step runs the docker container, logs you in as the postgres user, and puts you in a directory called /data. The host directory /tmp/test is mapped to /tmp/test inside the container.
-
cp /tmp/test/test.nt /data
so the rest of the scripts can access the nt file to load.
-
Make sure the current directory is /data:
cd /data
-
bash quetzal/com.ibm.research.quetzal.core/docker/postgresql/runLoadPostgres.sh
You should see output like:
INSERT INTO kb_TOPKSTATS(TYPE, GRAPH , CNT)select 'graph',gid,count(*) as COUNT from quetzal.kb_DS group by GID having count(*) > 100000 order by count(*) fetch first 5000 rows only INSERT INTO kb_TOPKSTATS(TYPE , CNT)VALUES('nr_triples',5)
-
Your dataset is now loaded.
-
To run queries, run
bash quetzal/com.ibm.research.quetzal.core/docker/postgresql/run-dir.sh <query-dir>
A sample set of queries is provided in quetzal/com.ibm.research.quetzal.core/docker/queries.
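Before starting the container, it can help to sanity-check the nt file on the host. The following is a minimal sketch and not part of quetzal: `stage_nt` is a hypothetical helper that rejects a missing or empty file, does a rough check that every statement ends with a `.` (it is not a full N-Triples parser and would reject a statement followed by a trailing comment), and then copies the file to /tmp/test/test.nt as in the steps above.

```shell
# stage_nt FILE [DIR]: copy an N-Triples file to DIR/test.nt, where DIR is the
# host directory that is bind-mounted into the container (default /tmp/test).
# Hypothetical helper for illustration; it is not shipped with quetzal.
stage_nt() {
    nt_file=$1
    stage_dir=${2:-/tmp/test}

    if [ ! -s "$nt_file" ]; then
        echo "error: $nt_file is missing or empty" >&2
        return 1
    fi

    # Count non-blank, non-comment lines that do not end with '.'.
    # grep -c exits non-zero when the count is 0, hence the '|| true'.
    bad=$(grep -v -e '^[[:space:]]*#' -e '^[[:space:]]*$' "$nt_file" \
          | grep -c -v '\.[[:space:]]*$' || true)
    if [ "$bad" -ne 0 ]; then
        echo "error: $bad line(s) in $nt_file do not end with '.'" >&2
        return 1
    fi

    mkdir -p "$stage_dir" && cp "$nt_file" "$stage_dir/test.nt"
}
```

For example, `stage_nt test.nt` stages the file in /tmp/test, after which the `docker run ... -v /tmp/test:/tmp/test ...` step can be run unchanged.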
We currently support Spark 2.2.0.
-
Build quetzal by
cd ~/git/quetzal/com.ibm.research.quetzal.core/
and
mvn clean install -DskipTests=true
-
cd ~/git/quetzal/com.ibm.research.quetzal.core/docker/spark
-
Build the base docker image for the project:
sudo docker build --no-cache -t "ibmresearch/quetzal_spark" .
This step builds a docker image with Ubuntu as the OS and the Spark server and Hive client code installed. The -t option sets the tag given to the image so we can re-use it later, and --no-cache tells the system to build from scratch without consulting leftover images.
-
docker run -it -v ~/git/quetzal/:/quetzal/ -e PASSWD=<PASSWORD> ibmresearch/quetzal_spark
-
Copy the nt file you want to load into /data. A sample nt file is provided for testing in /quetzal/com.ibm.research.quetzal.core/docker/test.nt.
-
bash /quetzal/com.ibm.research.quetzal.core/docker/spark/runLoadSpark.sh
You should see output like:
INSERT INTO kb_TOPKSTATS(TYPE, GRAPH , CNT)select 'graph',gid,count(*) as COUNT from quetzal.kb_DS group by GID having count(*) > 100000 order by count(*) fetch first 5000 rows only INSERT INTO kb_TOPKSTATS(TYPE , CNT)VALUES('nr_triples',5)
-
Your dataset is now loaded.
-
If you want to check the data, run
sh /beeline.sh
on the command line. This will put you into a Hive client.
-
Type any SQL command, for example:
show tables;
-
To run queries, run
bash /quetzal/com.ibm.research.quetzal.core/docker/spark/run-dir.sh <query-dir>
A sample set of queries is provided in /quetzal/com.ibm.research.quetzal.core/docker/queries.
-
cd ~/git/quetzal/com.ibm.research.quetzal.core/docker/db2
-
Copy the installer for the DB2 Server to the current directory:
cp <pathToDb2ServerTarBall> .
-
sudo bash ./createDockerImage.sh DB2_Svr_10.5.0.3_Linux_x86-64.tar.gz
This will create a docker image for the DB2 server install, a second docker image that contains the instance for db2inst1, and a data-only container for the actual data that DB2 writes to (see the docker documentation for data-only containers). All of the scripting here relies on https://github.com/bryantsai/db2-docker. The last step generates a quetzal-specific image, which contains the quetzal code.
-
Create the directory /tmp/test:
mkdir -p /tmp/test
-
Copy the DB2 JDBC driver to /tmp/test by:
cp <path to jdbc driver> /tmp/test
-
Copy the nt file you want to load into /tmp/test/ by:
cp test.nt /tmp/test/test.nt
A sample nt file is provided for testing in quetzal/com.ibm.research.quetzal.core/docker/test.nt.
-
sudo docker run --privileged=true -it -P --volumes-from=db2_data_1 -v /tmp/test:/tmp/test --hostname=db2_inst_1 --name=db2_inst_1 ibmresearch/quetzal-db2 /bin/bash
This step will log you into the quetzal container as root.
-
su - db2inst1
to switch to the db2inst1 user.
-
cp /tmp/test/test.nt /data
so the rest of the scripts can access the nt file to load.
-
cp /tmp/test/db2jcc4.jar /data/quetzal/com.ibm.research.quetzal.core/target/lib/
-
(For datasets with hundreds of properties, edit the /data/quetzal/com.ibm.research.quetzal.core/docker/db2/runLoadDB2.sh load script as follows. Comment out the default statement so that it reads
#db2 "CREATE DATABASE QUETZAL"
and uncomment
db2 "CREATE DATABASE QUETZAL PAGESIZE 32 K"
)
-
Make sure the current directory is /data:
cd /data
-
bash /data/quetzal/com.ibm.research.quetzal.core/docker/db2/runLoadDB2.sh
to launch the load script, which will start DB2, create a new database called quetzal, and load the nt file called test.nt into a knowledge base called kb. You should see output like:
INSERT INTO kb_TOPKSTATS(TYPE, GRAPH , CNT)select 'graph',gid,count(*) as COUNT from quetzal.kb_DS group by GID having count(*) > 100000 order by count(*) fetch first 5000 rows only INSERT INTO kb_TOPKSTATS(TYPE , CNT)VALUES('nr_triples',5)
-
Your dataset is now loaded.
-
To run queries, run
bash quetzal/com.ibm.research.quetzal.core/docker/db2/run-dir.sh <query-dir>
A sample set of queries is provided in quetzal/com.ibm.research.quetzal.core/docker/queries.
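The page-size edit described in the DB2 steps above can also be scripted. This is a minimal sketch under one loud assumption: that the two statements appear in runLoadDB2.sh exactly as quoted above (optionally indented), with the 32 K variant commented out by a leading `#`. `enable_32k_pagesize` is a hypothetical name, not something the repository provides, and it keeps a copy of the original script next to it with an `.orig` suffix.

```shell
# enable_32k_pagesize SCRIPT: comment out the default CREATE DATABASE statement
# and uncomment the PAGESIZE 32 K variant. Assumes both lines look exactly like
# the ones quoted in the instructions; any other line is left untouched.
enable_32k_pagesize() {
    script=$1
    [ -f "$script" ] || { echo "error: $script not found" >&2; return 1; }

    # Keep a backup, then rewrite the script in place from the backup.
    cp -p "$script" "$script.orig" || return 1
    sed -e 's/^\([[:space:]]*\)db2 "CREATE DATABASE QUETZAL"/\1#db2 "CREATE DATABASE QUETZAL"/' \
        -e 's/^\([[:space:]]*\)#[[:space:]]*db2 "CREATE DATABASE QUETZAL PAGESIZE 32 K"/\1db2 "CREATE DATABASE QUETZAL PAGESIZE 32 K"/' \
        "$script.orig" > "$script"
}
```

Inside the container this would be called as `enable_32k_pagesize /data/quetzal/com.ibm.research.quetzal.core/docker/db2/runLoadDB2.sh` before launching the load script. It is worth diffing the result against the `.orig` copy, since the sketch changes nothing if the real lines are formatted differently.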
Once your dataset is loaded, you can run queries against that dataset from within the container. Use the command
quetzal/com.ibm.research.quetzal.core/docker/<dbbackend-engine>/run-dir.sh <test dir>
where <test dir> denotes a directory of .sparql files containing the queries to run, and <dbbackend-engine> refers to db2, postgresql or spark.
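To try this with a query of your own, a query directory only needs one or more .sparql files. The sketch below is illustrative: `make_query_dir` and `run_dir_script` are hypothetical helpers, the file name all-triples.sparql is arbitrary, and the query is a generic SPARQL SELECT rather than one taken from the sample set. Whether run-dir.sh expects anything beyond plain .sparql files is not covered here.

```shell
# run_dir_script BACKEND: print the run-dir.sh path for db2, postgresql or spark,
# following the path pattern given in the text above.
run_dir_script() {
    case $1 in
        db2|postgresql|spark)
            echo "quetzal/com.ibm.research.quetzal.core/docker/$1/run-dir.sh" ;;
        *)
            echo "error: backend must be db2, postgresql or spark" >&2
            return 1 ;;
    esac
}

# make_query_dir DIR: create DIR and put one small example query in it.
make_query_dir() {
    mkdir -p "$1" || return 1
    cat > "$1/all-triples.sparql" <<'EOF'
SELECT ?s ?p ?o
WHERE { ?s ?p ?o }
LIMIT 10
EOF
}
```

From /data inside the PostgreSQL container, for instance, `make_query_dir /data/my-queries && bash "$(run_dir_script postgresql)" /data/my-queries` would run that single query, assuming the relative path resolves from the current directory as in the earlier steps.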
There is a nice report on running database workloads in Linux containers, contrasted with performance on VMs (see here). To run databases with minimal performance degradation, use data-only containers. The DB2 Dockerfile actually uses this technique (DB2 won't even start if it's running on AUFS). For PostgreSQL and Spark, there is an extra step of using data-only container volumes for key data directories (e.g., the location of data files as specified by PGDATA, and HDFS for Spark).
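As a rough illustration of the data-only container pattern for PostgreSQL, the sketch below creates a container that is never started and only owns a volume, then starts the working container with `--volumes-from`. Everything specific here is an assumption: the container name quetzal_pgdata and the helper names are made up, and the volume path passed in must be whatever PGDATA actually points to in the image, which the instructions above do not state. Setting `DOCKER="echo docker"` turns the helpers into a dry run that only prints the commands.

```shell
# The docker command to use; override with DOCKER="echo docker" for a dry run.
DOCKER=${DOCKER:-sudo docker}

# create_data_container NAME IMAGE PATH: create (but do not start) a container
# whose only purpose is to own the volume at PATH.
create_data_container() {
    $DOCKER create -v "$3" --name "$1" "$2" /bin/true
}

# run_with_data NAME IMAGE: start an interactive container that mounts the
# volumes owned by the data-only container NAME.
run_with_data() {
    $DOCKER run -i -t --volumes-from "$1" "$2" /bin/bash
}
```

A possible call sequence would be `create_data_container quetzal_pgdata ibmresearch/quetzal_postgres <PGDATA path>` followed by `run_with_data quetzal_pgdata ibmresearch/quetzal_postgres`. The data then lives in the volume rather than in the container's layered filesystem, and it survives removal of the working container as long as the data-only container is kept.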
-
sudo docker ps -a
lists all containers, including stopped ones. In general, if you want to start over and clean up anything from quetzal, run
sudo docker rm -v db2_inst_1
(after
sudo docker stop db2_inst_1
if the container is still running) to remove the db2 instance container, and
sudo docker rm -v db2_inst_data_1
to remove the data volumes that the instance uses. Ensure nothing is left over with
sudo docker ps -a
-
If you want to create the docker images afresh because the code has changed on GitHub, run
sudo docker images
to list all images. Then remove all images related to quetzal:
sudo docker rmi ibmresearch/quetzal-db2
sudo docker rmi bryantsai/db2-server
sudo docker rmi bryantsai/db2-server:db2_inst_1
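The cleanup steps above can be collected into one function. This is only a sketch: `quetzal_cleanup` is a hypothetical name, the container and image names are copied from the text above, and each step is allowed to fail so that the function keeps going when a container or image is already gone. With `DOCKER="echo docker"` it prints the commands instead of running them.

```shell
# The docker command to use; override with DOCKER="echo docker" for a dry run.
DOCKER=${DOCKER:-sudo docker}

# quetzal_cleanup: stop and remove the DB2 containers, remove the quetzal-related
# images, then list what is left. Failures of individual steps are ignored.
quetzal_cleanup() {
    $DOCKER stop db2_inst_1 || true
    $DOCKER rm -v db2_inst_1 || true
    $DOCKER rm -v db2_inst_data_1 || true
    for image in ibmresearch/quetzal-db2 \
                 bryantsai/db2-server \
                 bryantsai/db2-server:db2_inst_1; do
        $DOCKER rmi "$image" || true
    done
    $DOCKER ps -a
}
```

Running `DOCKER="echo docker" quetzal_cleanup` first is a cheap way to review exactly what would be removed.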
## Compiling from a git clone
For anyone in need of instructions to compile the code after cloning (thanks to Amgad Madkour for putting them into a form that can be put into the wiki):
Quetzal includes two sub-projects, one of which depends on the other. To compile the project outside of docker, first clone the project and run the following commands:
cd quetzal/com.ibm.research.quetzal.core
mvn install -DskipTests=true
Please note that the install command is needed to allow the second sub-project to be compiled. To compile the second project, run the following commands.
cd quetzal/rdfstore-server
mvn install -DskipTests=true
After compiling the project, the source code can be imported in Eclipse without compilation issues. In order to work with the project under Eclipse, use the import existing project into workspace option. Make sure to revise the project properties for any additional dependencies such as db2jcc4.jar.
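The two build steps can be wrapped so that they always run in the right order and from the right directory. This is a sketch under the assumption that both sub-projects sit directly under the clone directory, as the `cd` commands above suggest. `build_quetzal` is a hypothetical helper; each module is built in a subshell so the caller's working directory never changes, and the function stops at the first failure.

```shell
# The Maven command to use; override with MVN=echo for a dry run.
MVN=${MVN:-mvn}

# build_quetzal [REPO_ROOT]: install the core module first, then the server
# module that depends on it. REPO_ROOT defaults to ./quetzal.
build_quetzal() {
    repo_root=${1:-quetzal}
    for module in com.ibm.research.quetzal.core rdfstore-server; do
        # The subshell keeps the 'cd' local to this one build step.
        ( cd "$repo_root/$module" && $MVN install -DskipTests=true ) || return 1
    done
}
```

For a clone in ~/git/quetzal, the call would be `build_quetzal ~/git/quetzal`.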