Download FINRA short sale volume from Quandl, then process it to be uploaded to Quantopian Self-Serve Data.
It has been commonly understood that a sizeable short interest may cause a stock's price to explode higher. How common is this knowledge, though? In this project, we create a data pipeline to periodically pull short interest data and stock pricing data and store them in an S3 server for future analysis.
The data pipeline is built on Apache Airflow, using Apache Spark with the Hadoop plugin to pull the data and store it in a data lake on an S3 server (in Parquet format).
At the time of writing (2020-01-15), we have 3582 stocks from NASDAQ and 3092 stocks from NYSE. The earliest date is 2013-04-01, which accounts for nearly 1700 data points per stock (261 working days each year). That means we have more than 10 million data points to process at most. "At most", because many of the stocks won't have data for every day, and some of them may no longer exist. A complete run on 2020-02-05 returned approximately 6,798,996 rows of data (counted with the `rdd.countApprox` function; see `4.validate_completed_data.ipynb` for the code).
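As a quick sanity check of the estimate above, the arithmetic can be reproduced in a few lines. This is only a back-of-the-envelope sketch: the variable names are illustrative, and the 1700 figure is the rounded number of trading days quoted in the text rather than an exact count.

```python
# Figures quoted in the paragraph above.
nasdaq_stocks = 3582
nyse_stocks = 3092
trading_days = 1700  # "nearly 1700" daily data points per stock (rounded)

total_stocks = nasdaq_stocks + nyse_stocks
upper_bound = total_stocks * trading_days  # every stock present on every day

observed_rows = 6_798_996  # approximate row count from the 2020-02-05 run
coverage = observed_rows / upper_bound

print(f"{total_stocks} stocks x {trading_days} days = {upper_bound:,} rows at most")
print(f"observed rows cover about {coverage:.0%} of that upper bound")
```

The gap between the upper bound and the observed count is consistent with the note that many stocks do not have data for every day.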
To demonstrate that the pipeline works, we only use a small subset of the data, consisting of only 7 stocks configured in the `airflow/config.cfg` file.
This is an example of an analysis we may perform with this dataset.
- The analysis was performed on the Quantopian Notebook environment.
- The data covered 2013-04-02 to 2020-02-10 (Y-m-d). I took the 1-day, 5-day, 22-day, and 65-day returns (all counted in working days, so in other words the daily, weekly, monthly, and 3-month returns) and paired them with the short interest ratio on that date, that is, short interest volume divided by total volume.
- From all the data above, I took a sample of 1000 rows from each Morningstar sector. From there, I took the rows between the 25% and 75% quantiles for each returns set (as such, there are 6000 data points in total).
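The quantities described in the list above can be sketched in plain Python. This is a minimal illustration, not the notebook's actual code: the function names are hypothetical, and it assumes the returns are forward-looking (from the given date to N working days later), which matches the "return in the next month" wording used in the discussion of the results.

```python
from statistics import quantiles

def short_interest_ratio(short_volume, total_volume):
    """Short volume divided by total volume for one stock on one day."""
    return short_volume / total_volume

def forward_return(prices, i, horizon):
    """Return from day i to day i + horizon, or None if the window runs past the data."""
    if i + horizon >= len(prices):
        return None
    return prices[i + horizon] / prices[i] - 1.0

def interquartile_rows(rows, key):
    """Keep the rows whose `key` value lies between the 25% and 75% quantiles."""
    values = [row[key] for row in rows]
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return [row for row in rows if q1 <= row[key] <= q3]

# Toy data: six daily closing prices and one day's volumes (made-up numbers).
prices = [100.0, 102.0, 101.0, 105.0, 110.0, 108.0]
ratio = short_interest_ratio(40_000, 100_000)
weekly = forward_return(prices, 0, 5)

rows = [{"ret": r} for r in [1, 2, 3, 4, 5, 6, 7, 8]]
kept = [row["ret"] for row in interquartile_rows(rows, "ret")]
```

The horizons used in the analysis would be 1, 5, 22, and 65 working days.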
As the visualizations above show, up to the 1-month returns, short interest had some positive correlation with stock returns in the Technology sector. Interestingly, we also see a negative correlation for the returns of Healthcare stocks.
For the 3-month returns, the highest positive effect was in the Industrials sector.
In other words, for stocks in the Healthcare sector, for instance, the visualization suggests that the higher a stock's short interest ratio, the lower its return over the next month, but not so much over the next 3 months.
This information might be useful when deciding whether to use short interest data to choose which sectors' stocks to go long or short.
For future work, it might be interesting to see how the correlation changes in different periods. In addition to sector-based grouping, you may add time-based grouping e.g. according to business or political cycles.
The pipeline consists of the following tasks:
- Cluster DAG creates an Amazon EMR cluster, then waits for the other DAGs to complete.
- If the `STOCKS` configuration in `airflow/airflow.cfg` has not been filled with stock symbols, pull the list of stock information from the old NASDAQ links for both the NYSE and NASDAQ exchanges.
- Short Interest DAG (we will refer to this, and other non-cluster DAGs in the future, as worker DAGs):
  - Pull the Financial Industry Regulatory Authority (FINRA) short interest data from Quandl.
  - Store the data in the S3 server (or locally, depending on the setting in `config.cfg`).
  - Combine data from the FINRA/NYSE and FINRA/NASDAQ exchanges (Todo: There are also the ORF and ADF exchanges, but I'm not sure where to get the list of stocks from. Can anybody help?)
The pipeline is to be run once a day at 00:00. On the first run, it gets all data up to yesterday's date. On each subsequent run, it gets one day of data.
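The date logic described above can be sketched as follows. This is a hedged illustration only: `dates_to_pull` and `last_stored_date` are hypothetical names, and the actual DAG may derive the window differently (for example, from Airflow's execution date).

```python
from datetime import date, timedelta

EARLIEST_DATE = date(2013, 4, 1)  # earliest date mentioned in this document

def dates_to_pull(run_date, last_stored_date=None):
    """Return the inclusive (start, end) dates that a run should fetch."""
    end = run_date - timedelta(days=1)  # data is pulled up to yesterday
    if last_stored_date is None:
        start = EARLIEST_DATE  # first run: backfill everything
    else:
        start = last_stored_date + timedelta(days=1)  # later runs: only what is missing
    return start, end

first_run = dates_to_pull(date(2020, 2, 6))
next_run = dates_to_pull(date(2020, 2, 7), last_stored_date=date(2020, 2, 5))
```

With a daily schedule, each run after the first covers exactly one day. If a day is skipped, the same function would cover the gap.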
- Create a key pair in the AWS EC2 console.
- Create a CloudFormation stack from the `basic_network.yml` template. This is a generic VPC configuration with 2 private and 2 public subnets, which should be quite useful for other similar projects too. I recommend giving the stack a generic name, like "BasicNetwork". Take note of the VPC ID and the Subnet ID of the first public subnet. Todo: Find a way to combine this into the main CloudFormation stack in the next point and to remove unneeded subnets.
- Create a CloudFormation stack from `aws-cf_template.yml`. Pass in your Quandl key and your AWS access key ID and secret access key. I would give this stack a project-specific name, like "ShortInterests".
- After the stack is created, go to the "Outputs" tab to get the URL of the Airflow admin, something like http://ec2-3-219-234-248.compute-1.amazonaws.com:8080. You can get the endpoint from there, which you can use to connect via SSH:
  $ chmod 400 ~/path/to/airflow_pem.pem
  $ ssh -i ~/path/to/airflow_pem.pem ec2-user@<endpoint>
- Connect via SSH to the server. Note that the Airflow admin may not be ready just yet, because some setup code may still be running on the EC2 server. To check on the progress, SSH into the EC2 instance, then run `cat /var/log/user-data.log` to see the entire log, or `tail /var/log/user-data.log` to view the last few lines.
To kill Airflow scheduler:
$ sudo kill $(pgrep -f "airflow scheduler")
To kill Airflow webserver:
$ cat $AIRFLOW_HOME/airflow-webserver.pid | sudo xargs kill -9
A daily update should take a little over 1 hour, including 35+ minutes for gathering the short interest data.
If you are starting from a completely new database, I'm not quite sure how long it is going to take, since the project has undergone major changes from its initial version. Initially, it took about 20 minutes for every 110,000 - 120,000 data points, so about a whopping 17 hours for the whole database. It is probably down to only a few hours now. Todo: Update this section once we have the information.
By default, the Cluster DAG turns off the EMR cluster after use or if there is an error in the Short Interests DAG. If you want to keep it on, create a Variable named `keep_emr_cluster` from the Admin > Variables menu at the top. This is useful for debugging, as it saves the time of recreating the cluster every time. Don't forget to delete the variable afterwards to avoid paying for unused cluster time.
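The decision described above boils down to checking whether the Variable exists. The sketch below uses a plain dictionary as a stand-in for Airflow's Variables, so it runs without Airflow. `cluster_action` is a hypothetical name, and the project's real check may differ (in Airflow itself, one common way to read a Variable is `Variable.get("keep_emr_cluster", default_var=None)`).

```python
def cluster_action(variables):
    """Decide what the Cluster DAG does with the EMR cluster once the worker DAG ends.

    `variables` is a dict standing in for Airflow's Admin > Variables store.
    Only the existence of `keep_emr_cluster` matters, not its value.
    """
    if "keep_emr_cluster" in variables:
        return "keep"       # useful while debugging; remember to delete the Variable
    return "terminate"      # default: avoid paying for an idle cluster

default_action = cluster_action({})
debug_action = cluster_action({"keep_emr_cluster": "1"})
```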
Click on the DAG's name, then on either Graph View or Tree View, click on the currently running task, then click on "View Log". You will need to keep refreshing and view the bottom of the page to check on the progress. The status is updated every 5 minutes.
Yes, No.
You may resize the cluster from the EMR cluster page by clicking on Hardware. However, currently, increasing it to more than 3 nodes won't speed up downloading the short interest data. The bottleneck lies in the network connection between the EMR cluster's master node and the Quandl or QuoteMedia server.
Todo: Parallelizing the requests to Quandl by using a UDF (user-defined function) to perform GET requests through the worker nodes may improve the speed of this process. The initial version of the code is stored in `airflow/dags/etl/pull_short_interests-udf.py`. It does not currently report progress as we pull the data and, more critically, it creates duplicates of existing data.
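To illustrate the two missing pieces named in this Todo (progress reporting and avoiding duplicates), here is a small sketch. It is not the Spark UDF approach from the repository: it swaps in a local thread pool from Python's standard library to show the same idea on a single machine. `pull_in_parallel`, `fake_fetch`, and the row layout are all hypothetical, and `fake_fetch` stands in for a real GET request to Quandl.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def pull_in_parallel(symbols, fetch_rows, existing_keys, max_workers=8):
    """Fetch rows for many symbols concurrently, reporting progress and skipping duplicates."""
    seen = set(existing_keys)  # (symbol, date) pairs that are already stored
    new_rows = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(fetch_rows, symbol): symbol for symbol in symbols}
        for done, future in enumerate(as_completed(futures), start=1):
            for row in future.result():
                key = (row["symbol"], row["date"])
                if key not in seen:  # drop rows that already exist
                    seen.add(key)
                    new_rows.append(row)
            print(f"{done}/{len(futures)} symbols fetched")
    return new_rows

def fake_fetch(symbol):
    """Stand-in for a real API request; returns two made-up rows per symbol."""
    return [
        {"symbol": symbol, "date": "2020-02-05", "short_volume": 1},
        {"symbol": symbol, "date": "2020-02-06", "short_volume": 2},
    ]

rows = pull_in_parallel(["AAPL", "MSFT"], fake_fetch, {("AAPL", "2020-02-05")})
```

Deduplicating on a `(symbol, date)` key is one possible choice. Whether the same key suits the Parquet data lake would need to be checked against the real schema.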
The scheduler is smart enough not to re-process data that has already been stored, so there is no need to worry here. However, the data are saved after every 100 requests, so some requests will need to be redone (this should be okay, however).
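The save-every-100-requests behaviour can be sketched like this. It is a simplified illustration under stated assumptions: `pull_with_checkpoints`, `fetch_rows`, `save_rows`, and `already_saved` are hypothetical names, and progress is tracked per symbol here, whereas the real pipeline may track it differently.

```python
def pull_with_checkpoints(symbols, fetch_rows, save_rows, already_saved, batch_size=100):
    """Fetch symbols one by one, saving the collected rows after every `batch_size` requests."""
    pending = [s for s in symbols if s not in already_saved]  # skip work that is already stored
    buffer = []
    for count, symbol in enumerate(pending, start=1):
        buffer.extend(fetch_rows(symbol))
        if count % batch_size == 0:
            save_rows(buffer)  # checkpoint: a crash loses fewer than batch_size requests
            buffer = []
    if buffer:
        save_rows(buffer)  # flush the final partial batch

# Toy run: 250 symbols, one row per request, batches collected in a list.
symbols = [f"S{i}" for i in range(250)]
saved_batches = []
pull_with_checkpoints(symbols, lambda s: [s], saved_batches.append, already_saved=set())

# Simulated restart after the first 200 symbols were already stored.
resumed_batches = []
pull_with_checkpoints(symbols, lambda s: [s], resumed_batches.append, already_saved=set(symbols[:200]))
```

An interrupted run therefore only repeats the requests made since the last save.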
Unfortunately, no. To stop the server, you currently need to delete the CloudFormation stack and re-upload the template for future re-runs. On the bright side, though, the system is designed to pick up from the previous state of the database, so it is okay to recreate the whole stack multiple times. Todo: How do we update the code so that the EC2 server can be stopped and resumed?
If there is an error on any of the steps in the Short Interests DAG, the system will do the following:
- The worker DAG stops running, and its state is set to ERROR.
- The Cluster DAG detects the state, then either:
  - By default, proceeds to turn off the EMR cluster (this is to avoid exorbitant charges from leaving it running).
  - If a Variable named `keep_emr_cluster` exists, keeps the EMR cluster running.
Therefore, the final state is a SUCCESS state for the Cluster DAG and the ERROR state for the Short Interests DAG. To re-run only the step that contains the error, perform the following procedure:
Click on the delete button on the right part of the Cluster DAG:
Click on this OFF switch to turn it on:
By doing so, the Cluster DAG will restart the EMR cluster and wait for the Short Interests DAG to complete (or return another error).
The waiting state should look like so:
Take note of the colors and positions of the circles.
Click on the `short_interests_dag` link to open up its details, then click on "Tree View". Within it, click the red box (the task that contains the error):
Then click on the "Clear" button:
Now we just need to wait for the Short Interests DAG to finish running.
The main problem is with pulling the data from the APIs. Each request takes about 3 seconds. To make this process faster, we can try adding more nodes to our EMR cluster.
Update the `schedule_interval` setting accordingly for all of the DAGs.
The answer to this comes in two flavors:
- If the users need to flexibly access the database to perform any SQL queries, we opt for a Redshift cluster with auto-scaling capabilities.
- If the users would run basically a few sets of queries, use the combination of Apache Spark and Apache Cassandra to use the latter as a storage layer. Here is a link to a tutorial on this.
- Amazon EMR: Easy to scale up or shrink down, and we do not need to keep the server running all the time.
- Apache Airflow: Scheduler application. An interesting alternative is AWS Step Functions, but for this project Apache Airflow is preferable, as it is much easier to set up.
- Apache Spark: Good for working with huge datasets.
- A more comprehensive list of stock tickers is needed. Currently, the list of stock tickers is gathered from Nasdaq's site, but it does not include stocks of closed or acquired companies. ARRY, for example, was not in the list when the code was run on 2020-02-19, while Quantopian has this ticker.
- Run the code to build an empty database, and update the FAQ to include this information.
- There are a couple of Todos above. If you feel generous, post a pull request to improve them.
- The `aws-latest` branch is meant to gather pricing data from QuoteMedia and then combine them with the short interest data from Quandl. However, the branch currently still has some bugs, and it does not make sense to combine the data only to slice them apart later when preparing the data for use in Quantopian. When the need to analyze the data outside of Quantopian arises, work on this branch further.
- The `local-only` branch is for the local version of Airflow. This was needed when the system was initially developed. I don't see how it is ever going to be needed again, but I keep this branch just in case someone may need it.
- Both the `aws-latest` and `local-only` branches are currently not fast enough for use with large datasets. Diff `airflow/etl/short_interests.py` with the `quantopian-only` branch and move the needed updates to the other branches. Also, make similar updates to the `airflow/etl/pull_prices.py` file.
- Investigate whether parallelization of the GET requests would improve the ETL speed (see the "FAQs" section above).
- The `UserData` code in `aws-cf_template.yml` was made by trial and error. There is some redundant code, like the AWS configure code, and the Airflow scheduler cannot be created from the `ec2-user` (and therefore we can only shut it down with `sudo`). If you are a bash expert, this code will certainly benefit from your expertise.
- The CloudFormation stack currently has 2 public and 2 private subnets, while the scheduler only resides on a single public subnet. It took a lot of experiments to get to this working point, so I did not improve this. If you are a CloudFormation expert, your assistance here would be much appreciated.