Merge pull request #15 from 4dn-dcic/ff_utils_docs
Changes from 4DN meeting and also docs
Carl Vitzthum authored Jun 4, 2018
2 parents c8e7e3c + ab310e1 commit 6415641
Showing 6 changed files with 305 additions and 30 deletions.
6 changes: 4 additions & 2 deletions README.md
@@ -1,7 +1,9 @@
# utils
Various utility modules shared amongst several projects in the 4DN-DCIC.
This repository contains various utility modules shared amongst several projects in the 4DN-DCIC. It is meant to be used internally by the DCIC team and externally as a Python API to [Fourfront](https://data.4dnucleome.org), the 4DN data portal.

pip installable with: `pip install dcicutils`
pip installable as the `dcicutils` package with: `pip install dcicutils`

See [this document](./docs/getting_started.md) for tips on getting started. [Go here](./docs/examples.md) for examples of some of the most useful functions.

[![Build Status](https://travis-ci.org/4dn-dcic/utils.svg?branch=master)](https://travis-ci.org/4dn-dcic/utils)
[![Coverage](https://coveralls.io/repos/github/4dn-dcic/utils/badge.svg?branch=master)](https://coveralls.io/github/4dn-dcic/utils?branch=master)
2 changes: 1 addition & 1 deletion dcicutils/_version.py
@@ -1,4 +1,4 @@
"""Version information."""

# The following line *must* be the last in the module, exactly as formatted:
__version__ = "0.2.5"
__version__ = "0.2.6"
121 changes: 103 additions & 18 deletions dcicutils/ff_utils.py
@@ -1,4 +1,5 @@
from __future__ import print_function
import sys
import json
import time
import random
@@ -7,6 +8,13 @@
from uuid import UUID
from dcicutils import s3_utils, submit_utils
import requests
# urlparse import differs between py2 and 3
if sys.version_info[0] < 3:
    import urlparse
    from urllib import urlencode as urlencode
else:
    import urllib.parse as urlparse
    from urllib.parse import urlencode


HIGLASS_BUCKETS = ['elasticbeanstalk-fourfront-webprod-wfoutput',
@@ -17,6 +25,7 @@
# Widely used metadata functions #
##################################


def standard_request_with_retries(request_fxn, url, auth, verb, **kwargs):
    """
    Standard function to execute the request made by authorized_request.
@@ -200,48 +209,105 @@ def patch_metadata(patch_item, obj_id='', key=None, ff_env=None, add_on=''):

def post_metadata(post_item, schema_name, key=None, ff_env=None, add_on=''):
    '''
    Patch metadata given the post body and a string schema name.
    Post metadata given the post body and a string schema name.
    Either takes a dictionary form authentication (MUST include 'server')
    or a string fourfront-environment.
    This function checks to see if an existing object already exists
    with the same body, and if so, runs a patch instead.
    add_on is the string that will be appended to the post url (used
    with tibanna)
    '''
    auth = get_authentication_with_server(key, ff_env)
    post_url = '/'.join([auth['server'], schema_name]) + process_add_on(add_on)
    # format item to json
    post_item = json.dumps(post_item)
    response = authorized_request(post_url, auth=auth, verb='POST', data=post_item)
    return get_response_json(response)


def upsert_metadata(upsert_item, schema_name, key=None, ff_env=None, add_on=''):
    '''
    UPSERT metadata given the upsert body and a string schema name.
    UPSERT means POST or PATCH on conflict.
    Either takes a dictionary form authentication (MUST include 'server')
    or a string fourfront-environment.
    This function checks to see if an existing object already exists
    with the same body, and if so, runs a patch instead.
    add_on is the string that will be appended to the upsert url (used
    with tibanna)
    '''
    auth = get_authentication_with_server(key, ff_env)
    upsert_url = '/'.join([auth['server'], schema_name]) + process_add_on(add_on)
    # format item to json
    upsert_item = json.dumps(upsert_item)
    try:
        response = authorized_request(post_url, auth=auth, verb='POST', data=post_item)
        response = authorized_request(upsert_url, auth=auth, verb='POST', data=upsert_item)
    except Exception as e:
        # this means there was a conflict. try to patch
        if '409' in str(e):
            return patch_metadata(json.loads(post_item), key=auth, add_on=add_on)
            return patch_metadata(json.loads(upsert_item), key=auth, add_on=add_on)
        else:
            raise Exception(str(e))
    return get_response_json(response)


def search_metadata(search, key=None, ff_env=None):
def get_search_generator(search_url, auth=None, ff_env=None, page_limit=50):
    """
    Returns a generator given a search_url (which must contain server!), an
    auth and/or ff_env, and an int page_limit, which is used to determine how
    many results are returned per page (i.e. per iteration of the generator)
    Paginates by changing the 'from' query parameter, incrementing it by the
    page_limit size until fewer results than the page_limit are returned.
    If 'limit' is specified in the query, the generator will stop when that many
    results are collectively returned.
    """
    url_params = get_url_params(search_url)
    # indexing below is needed because url params are returned in lists
    curr_from = int(url_params.get('from', ['0'])[0])  # use query 'from' or 0 if not provided
    search_limit = url_params.get('limit', ['all'])[0]  # use limit=all by default
    if search_limit != 'all':
        search_limit = int(search_limit)
    url_params['limit'] = [str(page_limit)]
    if not url_params.get('sort'):  # sort needed for pagination
        url_params['sort'] = ['-date_created']
    # stop when fewer results than the limit are returned
    last_total = None
    while last_total is None or last_total == page_limit:
        if search_limit != 'all' and curr_from >= search_limit:
            break
        url_params['from'] = [str(curr_from)]  # use from to drive search pagination
        search_url = update_url_params_and_unparse(search_url, url_params)
        # use a different retry_fxn, since empty searches are returned as 400's
        response = authorized_request(search_url, auth=auth, ff_env=ff_env,
                                      retry_fxn=search_request_with_retries)
        try:
            search_res = get_response_json(response)['@graph']
        except KeyError:
            # raise a real exception; raising a bare string is a TypeError
            raise Exception('Cannot get "@graph" from the search request for %s. Response '
                            'status code is %s.' % (search_url, response.status_code))
        last_total = len(search_res)
        curr_from += last_total
        if search_limit != 'all' and curr_from > search_limit:
            limit_diff = curr_from - search_limit
            yield search_res[:-limit_diff]
        else:
            yield search_res


def search_metadata(search, key=None, ff_env=None, page_limit=50):
    """
    Make a get request of form <server>/<search> and returns the '@graph'
    key from the request json. Include all query params in the search string
    Make a get request of form <server>/<search> and returns a list of results
    using a paginated generator. Include all query params in the search string.
    Either takes a dictionary form authentication (MUST include 'server')
    or a string fourfront-environment.
    """
    auth = get_authentication_with_server(key, ff_env)
    if search.startswith('/'):
        search = search[1:]
    search_url = '/'.join([auth['server'], search])
    # use a different retry_fxn, since empty searches are returned as 400's
    response = authorized_request(search_url, auth=key, ff_env=ff_env,
                                  retry_fxn=search_request_with_retries)
    try:
        return get_response_json(response)['@graph']
    except KeyError:
        raise Exception('Cannot get "@graph" from the search request for %s. Response '
                        'status code is %s.' % (search_url, response.status_code))
    search_res = []
    for page in get_search_generator(search_url, auth=auth, page_limit=page_limit):
        search_res.extend(page)
    return search_res


def delete_field(obj_id, del_field, key=None, ff_env=None):
@@ -287,7 +353,7 @@ def fdn_connection(key='', connection=None, keyname='default'):
    return connection


def unified_authentication(auth, ff_env):
def unified_authentication(auth=None, ff_env=None):
    """
    One authentication function to rule them all.
    Has several options for authentication, which are:
@@ -319,7 +385,7 @@ def unified_authentication(auth, ff_env):
    return use_auth


def get_authentication_with_server(auth, ff_env):
def get_authentication_with_server(auth=None, ff_env=None):
    """
    Pass in authentication information and ff_env and attempts to either
    retrieve the server info from the auth, or if it cannot, get the
@@ -408,6 +474,25 @@ def process_add_on(add_on):
    return add_on


def get_url_params(url):
    """
    Returns a dictionary of url params using urlparse.parse_qs.
    Values are always returned as lists.
    Example: get_url_params('<server>/search/?type=Biosample&limit=5') returns
    {'type': ['Biosample'], 'limit': ['5']}
    """
    parsed_url = urlparse.urlparse(url)
    return urlparse.parse_qs(parsed_url.query)


def update_url_params_and_unparse(url, url_params):
    """
    Takes a string url and url params (in format of what is returned by
    get_url_params). Returns a string url with the newly formatted params
    """
    parsed_url = urlparse.urlparse(url)._replace(query=urlencode(url_params, True))
    return urlparse.urlunparse(parsed_url)


def convert_param(parameter_dict, vals_as_string=False):
    '''
    converts dictionary format {argument_name: value, argument_name: value, ...}
96 changes: 96 additions & 0 deletions docs/examples.md
@@ -0,0 +1,96 @@
# Example usage of dcicutils functions

See [getting started](./getting_started.md) for help with getting up and running with dcicutils.

As a first step, we will import our modules from the dcicutils package.

```
from dcicutils import ff_utils
```

### <a name="key"></a>Making your key

Authentication methods differ if you are an external user or an internal 4DN team member. If you are an external user, create a Python dictionary called `key` using your access key. This will be used in the examples below.

```
key = {'key': <YOUR KEY>, 'secret': <YOUR SECRET>, 'server': 'https://data.4dnucleome.org/'}
```

If you are an internal user, you may simply use the string Fourfront environment name for your metadata functions to get administrator access. For faster requests or if you want to emulate another user, you can also pass in keys manually. The examples below will use `key`, but could also use `ff_env`. The snippet below assumes you want to use the `data` Fourfront environment.

```
key = ff_utils.get_authentication_with_server(ff_env='data')
```

### <a name="metadata"></a>Examples for metadata functions

You can use `get_metadata` to get the metadata for a single object. It returns a dictionary of metadata on a successful get request. In our example, we get a publicly available HEK293 biosource, which has an internal accession of 4DNSRVF4XB1F.

```
metadata = ff_utils.get_metadata('4DNSRVF4XB1F', key=key)
# the response is a python dictionary
metadata['accession'] == '4DNSRVF4XB1F'
>> True
```

To post new data to the system, use the `post_metadata` function. You need to provide the body of data you want to post, as well as the schema name for the object. We want to post a fastq file.

```
post_body = {
    'file_format': 'fastq',
    'lab': '/labs/4dn-dcic-lab/',
    'award': '/awards/1U01CA200059-01/'
}
response = ff_utils.post_metadata(post_body, 'file_fastq', key=key)
# response is a dictionary containing info about your post
response['status']
>> 'success'
# the dictionary body of the metadata object created is in response['@graph']
metadata = response['@graph'][0]
```


If you want to edit data, use the `patch_metadata` function. Let's say that the fastq file you just made has an accession of `4DNFIP74UWGW` and we want to add a description to it.

```
patch_body = {'description': 'My cool fastq file'}
# you can explicitly pass the object ID (in this case accession)...
response = ff_utils.patch_metadata(patch_body, '4DNFIP74UWGW', key=key)
# or you can include the ID in the data you patch
patch_body['accession'] = '4DNFIP74UWGW'
response = ff_utils.patch_metadata(patch_body, key=key)
# the response has the same format as in post_metadata
metadata = response['@graph'][0]
```

Similar to `post_metadata` you can "UPSERT" metadata, which will perform a POST if the metadata doesn't yet exist within the system and will PATCH if it does. The `upsert_metadata` function takes the exact same arguments as `post_metadata` but will not raise an error on a metadata conflict.

```
upsert_body = {
    'file_format': 'fastq',
    'lab': '/labs/4dn-dcic-lab/',
    'award': '/awards/1U01CA200059-01/',
    'accession': '4DNFIP74UWGW'
}
# this will POST if file 4DNFIP74UWGW does not exist and will PATCH if it does
response = ff_utils.upsert_metadata(upsert_body, 'file_fastq', key=key)
# the response has the same format as in post_metadata
metadata = response['@graph'][0]
```
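
The POST-then-PATCH-on-conflict behaviour described above can be sketched without touching the portal. This is only an illustration of the control flow, not the library's implementation: `upsert`, `fake_post`, `fake_patch`, `ConflictError`, and the in-memory `store` are all hypothetical stand-ins. The one detail taken from `ff_utils` is that a conflict is recognised by the string `'409'` appearing in the error message.

```python
class ConflictError(Exception):
    """Hypothetical stand-in for the error raised on an HTTP 409 response."""


def upsert(item, post_fxn, patch_fxn):
    """Try to POST the item; if the error mentions 409, PATCH it instead."""
    try:
        return post_fxn(item)
    except Exception as exc:
        # a conflict is detected by looking for '409' in the error text
        if '409' in str(exc):
            return patch_fxn(item)
        raise


# In-memory stand-in for the portal, keyed by accession (illustration only)
store = {}


def fake_post(item):
    if item['accession'] in store:
        raise ConflictError('409 Conflict: %s already exists' % item['accession'])
    store[item['accession']] = dict(item)
    return {'status': 'success', 'method': 'POST'}


def fake_patch(item):
    store[item['accession']].update(item)
    return {'status': 'success', 'method': 'PATCH'}


first = upsert({'accession': '4DNFIP74UWGW', 'file_format': 'fastq'},
               fake_post, fake_patch)
second = upsert({'accession': '4DNFIP74UWGW', 'description': 'My cool fastq file'},
                fake_post, fake_patch)
print(first['method'], second['method'])
```

The first call creates the item and the second call hits the conflict branch, so the existing record is updated rather than an error being raised.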

You can use `search_metadata` to search through metadata in Fourfront. This function takes a string search URL starting with 'search', as well as the same authorization information as the other metadata functions. It returns a list of metadata results. Optionally, the `page_limit` parameter adjusts the page size used by the underlying generator that fetches the search results.

```
# let's search for all biosamples
# hits is a list of metadata dictionaries
hits = ff_utils.search_metadata('search/?type=Biosample', key=key)
# you can also specify a limit on the number of results for your search
# other valid query params are also allowed, including sorts and filters
hits = ff_utils.search_metadata('search/?type=Biosample&limit=10', key=key)
```
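
To make the role of `page_limit` more concrete, here is a rough, offline sketch of `from`/`limit` pagination as the search generator's docstring describes it: request one page at a time, advance `from` by the number of results received, and stop on a short page or once an overall limit is reached. It is written for Python 3 only, and `set_from_and_limit`, `paginate`, `fake_fetch`, `RECORDS`, and the `example.org` URL are hypothetical names used for illustration. Unlike the real function, the overall limit is passed as a `search_limit` argument rather than read from the URL.

```python
from urllib.parse import urlparse, urlunparse, parse_qs, urlencode


def set_from_and_limit(url, start, page_limit):
    """Return url with its 'from' and 'limit' query params replaced."""
    parsed = urlparse(url)
    params = parse_qs(parsed.query)  # values come back as lists
    params['from'] = [str(start)]
    params['limit'] = [str(page_limit)]
    return urlunparse(parsed._replace(query=urlencode(params, doseq=True)))


def paginate(fetch_page, url, page_limit=50, search_limit=None):
    """Yield pages from fetch_page until a short page or search_limit is hit."""
    start = 0
    last_total = None
    while last_total is None or last_total == page_limit:
        if search_limit is not None and start >= search_limit:
            break
        page = fetch_page(set_from_and_limit(url, start, page_limit))
        last_total = len(page)
        start += last_total
        if search_limit is not None and start > search_limit:
            # trim the overshoot on the final page
            page = page[:search_limit - start]
        yield page


RECORDS = ['item%d' % i for i in range(12)]  # pretend portal with 12 results


def fake_fetch(url):
    """Stand-in for an authorized request; slices RECORDS using the URL params."""
    params = parse_qs(urlparse(url).query)
    start, limit = int(params['from'][0]), int(params['limit'][0])
    return RECORDS[start:start + limit]


url = 'https://example.org/search/?type=Biosample'
all_hits = [hit for page in paginate(fake_fetch, url, page_limit=5) for hit in page]
first_seven = [hit for page in paginate(fake_fetch, url, page_limit=5, search_limit=7)
               for hit in page]
print(len(all_hits), len(first_seven))
```

A smaller `page_limit` means more, lighter requests; the combined result is the same either way.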
27 changes: 27 additions & 0 deletions docs/getting_started.md
@@ -0,0 +1,27 @@
# Getting started

The dcicutils package contains a number of utility functions that are useful both internally (for infrastructure and scripting) and for external users. Before getting into the functions themselves, we will go over how to set up your authentication as either an internal DCIC user or an external user.

First, install dcicutils using pip. Python 2.7 and 3.x are supported.

`pip install dcicutils`

### Internal DCIC set up

To fully utilize the utilities, you should have your AWS credentials set up. In addition, you should also have the `SECRET` environment variable needed for decrypting the administrator access keys stored on Amazon S3. If you would rather not set these up, using a local administrator access key generated from Fourfront is also an option; see the instructions for external set up below.

### External set up

The utilities require an access key, which is generated using your user account on Fourfront. If you do not yet have an account, the first step is to [request one](https://data.4dnucleome.org/help/user-guide/account-creation). You can then generate an access key on your [user information page](https://data.4dnucleome.org/me) when your account is set up and you are logged in. Make sure to take note of the information generated when you make an access key. Store it in a safe place, because it will be needed when you make a request to Fourfront.

The main format of the authorization used for the utilities is:

`{'key': <YOUR KEY>, 'secret': <YOUR SECRET>, 'server': 'https://data.4dnucleome.org/'}`

You can replace server with another Fourfront environment if you have an access key made on that environment.
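
One way to keep the key and secret out of your scripts is to read them from environment variables and build the dictionary at run time. The sketch below is only a suggestion, not something dcicutils provides: the helper `key_from_environment` and the variable names `FF_ACCESS_KEY` and `FF_ACCESS_SECRET` are hypothetical, so substitute whatever names you actually use.

```python
import os


def key_from_environment(environ=os.environ, server='https://data.4dnucleome.org/'):
    """Build the authorization dictionary from two environment variables."""
    try:
        return {
            'key': environ['FF_ACCESS_KEY'],        # hypothetical variable name
            'secret': environ['FF_ACCESS_SECRET'],  # hypothetical variable name
            'server': server,
        }
    except KeyError as missing:
        raise RuntimeError('Environment variable %s is not set' % missing)


# Demonstration with an explicit mapping instead of the real environment
key = key_from_environment({'FF_ACCESS_KEY': 'ABCDEFGH',
                            'FF_ACCESS_SECRET': 'example-secret'})
print(sorted(key))
```

The resulting dictionary has the same three fields as the format shown above and can be passed as the `key` argument of the metadata functions.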

### Central metadata functions

The most useful utility functions for most users are the metadata functions, which are generally used to access, create, or edit object metadata on the Fourfront portal. Since dcicutils is a pip-installable Python package, these functions can be used as an API to the portal in your scripts. All of them are contained within the `dcicutils.ff_utils` module.

See example usage of these functions [here](./examples.md#metadata).