Getting Started

Set up Quetzal with Docker on Linux (recommended)

Install Docker using a package manager (instructions are available on the Docker website). The following instructions were tested on Ubuntu 14.10.
Clone the code from the GitHub repository. The commands below assume the clone is in ~/git/quetzal. **Please use the latest Maven and JDK 1.8 (a JRE is not sufficient).**

For loading into a postgresql docker container:

`cd ~/git/quetzal/com.ibm.research.quetzal.core/docker/postgresql`
Build the base docker image for the project: `sudo docker build --no-cache --rm -t "ibmresearch/quetzal_postgres" .` This step builds a docker image with Ubuntu as the OS, pre-populated with the quetzal code as cloned from the git repository, and with the PostgreSQL server and client installed. The `--rm` option removes intermediate containers after a successful build, `-t` sets the tag given to the image so it can be reused later, and `--no-cache` tells docker to build from scratch without consulting leftover images.
Create the directory /tmp/test: `mkdir -p /tmp/test`
Copy the nt file you want to load into /tmp/test/: `cp test.nt /tmp/test/test.nt`. A sample nt file is provided for testing in quetzal/com.ibm.research.quetzal.core/docker/test.nt.
`sudo docker run -i -t -v /tmp/test:/tmp/test ibmresearch/quetzal_postgres /bin/bash`. This step runs the docker container, logs you in as a postgres user, and puts you in a directory called /data. The host directory /tmp/test is mapped to /tmp/test inside the container.
`cp /tmp/test/test.nt /data` so the rest of the scripts can access the nt file to load.
Make sure the current directory is /data: `cd /data`
`bash quetzal/com.ibm.research.quetzal.core/docker/postgresql/runLoadPostgres.sh`

You should see output like:

INSERT INTO kb_TOPKSTATS(TYPE, GRAPH , CNT)select 'graph',gid,count(*) as COUNT from quetzal.kb_DS group by GID having count(*) > 100000 order by count(*) fetch first 5000 rows only INSERT INTO kb_TOPKSTATS(TYPE , CNT)VALUES('nr_triples',5)

Your dataset is now loaded.
To run queries, run `bash quetzal/com.ibm.research.quetzal.core/docker/postgresql/run-dir.sh <query-dir>`. A sample set of queries is provided in quetzal/com.ibm.research.quetzal.core/docker/queries.
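
The host-side staging steps above can be wrapped in a small shell function. This is a minimal sketch and not part of Quetzal: the function name `stage_nt_file` is hypothetical, and it assumes, as in the steps above, that the file should end up as test.nt in the host directory /tmp/test that is mounted into the container.

```shell
# Hypothetical helper: copy an N-Triples file into the host directory that
# is mounted into the container. The steps above use the name test.nt.
stage_nt_file() {
    src="$1"
    stage_dir="${2:-/tmp/test}"   # host side of -v /tmp/test:/tmp/test
    if [ ! -f "$src" ]; then
        echo "no such file: $src" >&2
        return 1
    fi
    mkdir -p "$stage_dir" && cp "$src" "$stage_dir/test.nt"
}

# Example (not run here):
#   stage_nt_file ~/git/quetzal/com.ibm.research.quetzal.core/docker/test.nt
#   sudo docker run -i -t -v /tmp/test:/tmp/test ibmresearch/quetzal_postgres /bin/bash
```

Failing early on a missing source file avoids starting the container with an empty /tmp/test.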

For loading into a spark docker container:

Spark 2.2.0 is currently supported.

Build quetzal: `cd ~/git/quetzal/com.ibm.research.quetzal.core/` and then `mvn clean install -DskipTests=true`
`cd ~/git/quetzal/com.ibm.research.quetzal.core/docker/spark`
Build the base docker image for the project: `sudo docker build --no-cache -t "ibmresearch/quetzal_spark" .` This step builds a docker image with Ubuntu as the OS and with the Spark server and Hive client installed. The `-t` option sets the tag given to the image so it can be reused later, and `--no-cache` tells docker to build from scratch without consulting leftover images.
`docker run -it -v ~/git/quetzal/:/quetzal/ -e PASSWD=<PASSWORD> ibmresearch/quetzal_spark`
Copy the nt file you want to load into /data. A sample nt file is provided for testing in /quetzal/com.ibm.research.quetzal.core/docker/test.nt.
`bash /quetzal/com.ibm.research.quetzal.core/docker/spark/runLoadSpark.sh`

You should see output like:

INSERT INTO kb_TOPKSTATS(TYPE, GRAPH , CNT)select 'graph',gid,count(*) as COUNT from quetzal.kb_DS group by GID having count(*) > 100000 order by count(*) fetch first 5000 rows only INSERT INTO kb_TOPKSTATS(TYPE , CNT)VALUES('nr_triples',5)

Your dataset is now loaded.
If you want to check the data, run `sh /beeline.sh` on the command line. This will put you into a Hive client.
Type any SQL command, for example: `show tables;`
To run queries, run `bash /quetzal/com.ibm.research.quetzal.core/docker/spark/run-dir.sh <query-dir>`. A sample set of queries is provided in /quetzal/com.ibm.research.quetzal.core/docker/queries.
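
Before loading, it can help to know how many triples the nt file holds, so that the number can be compared with the `nr_triples` value in the load output. The sketch below is not part of Quetzal; the function name `count_triples` is hypothetical, and it assumes that `nr_triples` reports the number of triples loaded, which the output above suggests but does not state.

```shell
# Hypothetical helper: count the triples in an N-Triples file.
# N-Triples is line based, so every line that is neither blank nor a
# comment (starting with #) is counted as one triple.
count_triples() {
    grep -c -v -E '^[[:space:]]*(#.*)?$' "$1"
}

# Example (not run here):
#   count_triples /data/test.nt
```

Note that `grep -c` exits with a non-zero status when the count is 0, which makes an empty file easy to detect in a script.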

For loading into a DB2 docker container:

`cd ~/git/quetzal/com.ibm.research.quetzal.core/docker/db2`
Copy the DB2 Server installer into this directory: `cp <pathToDb2ServerTarBall> .`
`sudo bash ./createDockerImage.sh DB2_Svr_10.5.0.3_Linux_x86-64.tar.gz`. This creates a docker image for the DB2 server install, a second docker image that contains the instance for db2inst1, and a data-only container for the actual data that DB2 writes to (see the docker documentation for data-only containers). All of the scripting here relies on https://github.com/bryantsai/db2-docker. The last step generates a quetzal-specific image, which contains the quetzal code.
Create the directory /tmp/test: `mkdir -p /tmp/test`
Copy the DB2 JDBC driver to /tmp/test: `cp <path to jdbc driver> /tmp/test`
Copy the nt file you want to load into /tmp/test/: `cp test.nt /tmp/test/test.nt`. A sample nt file is provided for testing in quetzal/com.ibm.research.quetzal.core/docker/test.nt.
`sudo docker run --privileged=true -it -P --volumes-from=db2_data_1 -v /tmp/test:/tmp/test --hostname=db2_inst_1 --name=db2_inst_1 ibmresearch/quetzal-db2 /bin/bash`. This step logs you into the quetzal container as root.
`su - db2inst1` to switch to the db2inst1 user.
`cp /tmp/test/test.nt /data` so the rest of the scripts can access the nt file to load.
`cp /tmp/test/db2jcc4.jar /data/quetzal/com.ibm.research.quetzal.core/target/lib/`
(For datasets with hundreds of properties, edit the load script /data/quetzal/com.ibm.research.quetzal.core/docker/db2/runLoadDB2.sh as follows: comment out the line `db2 "CREATE DATABASE QUETZAL"` and uncomment the line `db2 "CREATE DATABASE QUETZAL PAGESIZE 32 K"`.)
Make sure the current directory is /data: `cd /data`
`bash /data/quetzal/com.ibm.research.quetzal.core/docker/db2/runLoadDB2.sh` to launch the load script, which will start DB2, create a new database called quetzal, and load the nt file called test.nt into a knowledge base called kb. You should see output like:

INSERT INTO kb_TOPKSTATS(TYPE, GRAPH , CNT)select 'graph',gid,count(*) as COUNT from quetzal.kb_DS group by GID having count(*) > 100000 order by count(*) fetch first 5000 rows only INSERT INTO kb_TOPKSTATS(TYPE , CNT)VALUES('nr_triples',5)

Your dataset is now loaded.
To run queries, run `bash quetzal/com.ibm.research.quetzal.core/docker/db2/run-dir.sh <query-dir>`. A sample set of queries is provided in quetzal/com.ibm.research.quetzal.core/docker/queries.
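
The optional edit to runLoadDB2.sh for datasets with hundreds of properties can also be scripted. The following is a sketch under assumptions: the function name `use_32k_pagesize` is hypothetical, and the exact form of the two lines in the script is inferred from the description in the steps above rather than from the script itself, so try it on a copy first.

```shell
# Hypothetical helper: switch a load script to the 32 K page size variant.
# Assumes the script has one active line   db2 "CREATE DATABASE QUETZAL"
# and one commented line   #db2 "CREATE DATABASE QUETZAL PAGESIZE 32 K".
use_32k_pagesize() {
    script="$1"
    tmp="$(mktemp)" || return 1
    # First expression comments out the default CREATE DATABASE line,
    # second expression uncomments the PAGESIZE 32 K variant.
    if sed -e 's/^\([[:space:]]*\)db2 "CREATE DATABASE QUETZAL"[[:space:]]*$/\1#db2 "CREATE DATABASE QUETZAL"/' \
           -e 's/^\([[:space:]]*\)#[[:space:]]*\(db2 "CREATE DATABASE QUETZAL PAGESIZE 32 K"\)/\1\2/' \
           "$script" > "$tmp"; then
        # Write back through cat so the script keeps its permissions.
        cat "$tmp" > "$script"
        status=$?
    else
        status=1
    fi
    rm -f "$tmp"
    return "$status"
}

# Example (not run here):
#   use_32k_pagesize /data/quetzal/com.ibm.research.quetzal.core/docker/db2/runLoadDB2.sh
```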

Running queries

Once your dataset is loaded, you can run queries against that dataset from within the container. Use the command `quetzal/com.ibm.research.quetzal.core/docker/<dbbackend-engine>/run-dir.sh <test dir>`, where `<test dir>` denotes a directory of .sparql files containing the queries to run, and `<dbbackend-engine>` refers to db2, postgresql, or spark.
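
To check what a query directory contains before passing it to run-dir.sh, a small helper like the one below can be used. It is only a sketch: the name `list_sparql_queries` is hypothetical, and it looks at top-level .sparql files only, since the text above does not say whether run-dir.sh descends into subdirectories.

```shell
# Hypothetical helper: print the names of the .sparql files in a directory.
list_sparql_queries() {
    query_dir="$1"
    if [ ! -d "$query_dir" ]; then
        echo "not a directory: $query_dir" >&2
        return 1
    fi
    for query_file in "$query_dir"/*.sparql; do
        # Skip the unexpanded pattern when no file matches.
        if [ -f "$query_file" ]; then
            basename "$query_file"
        fi
    done
}

# Example (not run here):
#   list_sparql_queries quetzal/com.ibm.research.quetzal.core/docker/queries
```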

Performance considerations when running docker containers

There is a nice report of running database workloads on Linux containers, contrasted with performance on VMs (see here). To run databases with minimal performance degradation, use data-only containers. The DB2 Docker file already uses this technique (DB2 won't even start if it is running on AUFS). For PostgreSQL and Spark, an extra step is needed: use data-only container volumes for key data directories (e.g., the location of data files as specified by PGDATA for PostgreSQL, and HDFS for Spark).
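
The data-only container pattern for PostgreSQL could look roughly like the sketch below. Everything specific in it is an assumption: the function names, the container name, and the data path are placeholders that do not come from the Quetzal scripts, and the path has to match the PGDATA location inside the image. Only the `docker create -v`, `--name`, and `docker run --volumes-from` options are standard docker CLI.

```shell
# Hypothetical sketch of the data-only container pattern.
# Set DOCKER=echo for a dry run that only prints the commands.

# Create a container that exists only to own a volume at data_dir.
create_data_container() {
    name="$1"        # placeholder name, e.g. quetzal_pg_data
    data_dir="$2"    # should match PGDATA inside the image (assumption)
    image="$3"       # e.g. ibmresearch/quetzal_postgres
    # Unquoted on purpose so that "sudo docker" splits into two words.
    ${DOCKER:-sudo docker} create -v "$data_dir" --name "$name" "$image" /bin/true
}

# Start the database container with the volumes of the data-only container.
run_with_data_container() {
    name="$1"
    image="$2"
    ${DOCKER:-sudo docker} run -i -t --volumes-from "$name" \
        -v /tmp/test:/tmp/test "$image" /bin/bash
}

# Example (not run here, names and path are placeholders):
#   create_data_container quetzal_pg_data /path/to/pgdata ibmresearch/quetzal_postgres
#   run_with_data_container quetzal_pg_data ibmresearch/quetzal_postgres
```

Keeping the data directory in a volume keeps database writes off the container's layered filesystem, which is the point of the technique described above.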

Removing running docker containers

`sudo docker ps -a` lists all containers, including stopped ones. In general, if you want to start over and clean up anything from quetzal, run `sudo docker rm -v db2_inst_1` (after `sudo docker stop db2_inst_1` if the container is still running) to remove the db2 instance container and `sudo docker rm -v db2_inst_data_1` to remove the data volumes that the instance uses. Confirm that nothing is left with `sudo docker ps -a`.
If you want to create the docker images afresh because the code has changed on GitHub, run `sudo docker images` to list all images, then remove all images related to quetzal: `sudo docker rmi ibmresearch/quetzal-db2`, `sudo docker rmi bryantsai/db2-server`, `sudo docker rmi bryantsai/db2-server:db2_inst_1`.
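
The cleanup commands above can be collected into two functions, one for containers and one for images. This is a sketch: the function names are hypothetical, while the container and image names are copied from the text above.

```shell
# Hypothetical helpers wrapping the cleanup commands listed above.
# Set DOCKER=echo for a dry run that only prints the commands.

remove_quetzal_db2_containers() {
    docker_cmd="${DOCKER:-sudo docker}"
    # Stop the instance first; ignore the error if it does not exist.
    $docker_cmd stop db2_inst_1 || true
    $docker_cmd rm -v db2_inst_1
    # Data container name as given in the text above.
    $docker_cmd rm -v db2_inst_data_1
}

remove_quetzal_db2_images() {
    docker_cmd="${DOCKER:-sudo docker}"
    for image in ibmresearch/quetzal-db2 bryantsai/db2-server bryantsai/db2-server:db2_inst_1; do
        $docker_cmd rmi "$image"
    done
}

# Example (not run here):
#   remove_quetzal_db2_containers && remove_quetzal_db2_images
#   sudo docker ps -a
```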

## Compiling from a git clone

For anyone who is in need of instructions to compile the code after cloning (thanks to Amgad Madkour for putting it into a form that can be put into the wiki):

Quetzal includes two sub-projects, one of which depends on the other. To compile the project outside of docker, first clone the project and run the following commands:

cd quetzal/com.ibm.research.quetzal.core

mvn install -DskipTests=true

Please note that the install command is needed to allow the second sub-project to be compiled. To compile the second sub-project, run the following commands from the directory that contains the clone:

cd quetzal/rdfstore-server

mvn install -DskipTests=true
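
The two build steps can be combined into one function that builds the sub-projects in dependency order. This is a minimal sketch: the function name `build_quetzal` and the `MVN` override are hypothetical, while the directory names and the Maven arguments are the ones used above.

```shell
# Hypothetical helper: build both sub-projects in dependency order.
# MVN can be overridden, e.g. MVN=echo for a dry run.
build_quetzal() {
    clone_dir="$1"   # directory created by git clone, e.g. ~/git/quetzal
    for module in com.ibm.research.quetzal.core rdfstore-server; do
        # The subshell keeps the caller's working directory unchanged.
        ( cd "$clone_dir/$module" && ${MVN:-mvn} install -DskipTests=true ) || return 1
    done
}

# Example (not run here):
#   build_quetzal ~/git/quetzal
```

Stopping at the first failure matters here because the second sub-project depends on the installed artifacts of the first.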

After compiling the project, the source code can be imported into Eclipse without compilation issues. To work with the project in Eclipse, use the "Existing Projects into Workspace" import option. Make sure to revise the project properties for any additional dependencies such as db2jcc4.jar.
