feat: add a CLI entrypoint for upload_vcf (#4)
clintval authored Dec 14, 2024
1 parent 6f79cba commit 5659bfb
Showing 12 changed files with 215 additions and 15 deletions.
2 changes: 2 additions & 0 deletions .gitignore
@@ -1,4 +1,6 @@
.DS_Store
.vscode/
testdata/

# Byte-compiled / optimized / DLL files
__pycache__/
61 changes: 61 additions & 0 deletions README.md
@@ -18,6 +18,67 @@ The package can be installed with `pip`:
pip install tp53
```

## Upload a VCF to the Seshat TP53 Annotation Server

Upload a VCF to the [Seshat TP53 annotation server](http://vps338341.ovh.net/) using a headless browser.

```bash
❯ python -m tp53.seshat.upload_vcf \
--input "input.vcf" \
--email "[email protected]"
```
```console
INFO:tp53.seshat.upload_vcf:Uploading 0 %...
INFO:tp53.seshat.upload_vcf:Uploading 53%...
INFO:tp53.seshat.upload_vcf:Uploading 53%...
INFO:tp53.seshat.upload_vcf:Uploading 60%...
INFO:tp53.seshat.upload_vcf:Uploading 60%...
INFO:tp53.seshat.upload_vcf:Uploading 66%...
INFO:tp53.seshat.upload_vcf:Uploading 66%...
INFO:tp53.seshat.upload_vcf:Uploading 80%...
INFO:tp53.seshat.upload_vcf:Uploading 80%...
INFO:tp53.seshat.upload_vcf:Upload complete!
```

This tool programmatically configures and uploads batches of variants in VCF format to the Seshat annotation server.
It works by launching a headless Chrome browser instance and then interacting with the Seshat website directly through simulated key presses and mouse clicks.
Seshat does not provide a native programmatic API, and one could not be reverse engineered.
Seshat also uses custom JavaScript in its form processing, so a lightweight approach of interacting only with the HTML form elements was not possible either.

###### VCF Input Requirements

Seshat does not report why a VCF fails to annotate, but it has been observed that Seshat can fail to parse some structural variants (SVs) emitted by [VarDictJava](https://github.com/AstraZeneca-NGS/VarDictJava) as valid variant records.
One solution that has worked in the past is to remove SVs before uploading.
The following command will exclude all variants with a non-empty SVTYPE INFO key:

```bash
❯ bcftools view in.vcf --exclude 'SVTYPE!="."' > out.noSV.vcf
```
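
Where `bcftools` is not available, a similar filter can be approximated in plain Python. This is only a sketch and is not part of the `tp53` package: the names `has_sv_type` and `drop_sv_records` are hypothetical, and it assumes an uncompressed, tab-delimited VCF whose INFO column is the eighth field.

```python
from typing import Iterable, Iterator


def has_sv_type(info: str) -> bool:
    """Return True if the INFO string carries a non-missing SVTYPE key."""
    for entry in info.split(";"):
        key, _, value = entry.partition("=")
        if key == "SVTYPE" and value != ".":
            return True
    return False


def drop_sv_records(lines: Iterable[str]) -> Iterator[str]:
    """Yield header lines and every record without a non-missing SVTYPE."""
    for line in lines:
        if line.startswith("#"):
            # Header and column-name lines pass through untouched.
            yield line
            continue
        fields = line.rstrip("\n").split("\t")
        # INFO is the eighth column; treat a short record as having no INFO.
        info = fields[7] if len(fields) > 7 else "."
        if not has_sv_type(info):
            yield line


# Hypothetical in-memory records, for illustration only.
example = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "chr17\t7674220\t.\tC\tT\t.\tPASS\tDP=100\n",
    "chr17\t7670000\t.\tA\t<DEL>\t.\tPASS\tSVTYPE=DEL;END=7671000\n",
]
kept = list(drop_sv_records(example))
```

To filter a file, the same generator can be fed an open file handle and its output written line by line to a new file.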

###### Automation

There are no terms and conditions posted on the Seshat annotation server's website, and there is no server-side `robots.txt` rule set.
In lieu of usage terms, we strongly encourage all users of this script to respect the Seshat resource by adhering to the following best practices:

- **Minimize Load**: Limit the rate of requests to the server
- **Minimize Connections**: Limit the number of concurrent requests

If you need to batch process dozens, or hundreds, of VCF callsets, you may consider improving this underlying Python script to randomize the user agent and IP address of your headless browser session to avoid being labelled as a bot.
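
One way to follow both practices is to submit callsets strictly one at a time, with a minimum pause between submissions. The sketch below is illustrative and not part of the package: `run_sequentially`, `min_interval`, and the injectable `sleep` and `clock` callables are hypothetical names, the 60-second default is an arbitrary placeholder, and `action` stands in for a call such as `upload_vcf`.

```python
import time
from typing import Callable, Iterable, Optional, TypeVar

T = TypeVar("T")


def run_sequentially(
    items: Iterable[T],
    action: Callable[[T], None],
    min_interval: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> int:
    """Run `action` on one item at a time, starting calls at least `min_interval` seconds apart."""
    count = 0
    last_start: Optional[float] = None
    for item in items:
        if last_start is not None:
            # Wait out whatever is left of the interval since the previous start.
            remaining = min_interval - (clock() - last_start)
            if remaining > 0:
                sleep(remaining)
        last_start = clock()
        action(item)  # Only one request is ever in flight.
        count += 1
    return count
```

Because the calls never overlap, the number of concurrent connections stays at one, and `min_interval` bounds the request rate.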

###### Environment Setup

This script relies on Google Chrome:

```console
brew install --cask google-chrome
```

Some versions of macOS may require you to authenticate the Chrome driver ([link](https://stackoverflow.com/a/60362134)).

## Development and Testing

See the [contributing guide](./CONTRIBUTING.md) for more information.

## References

- [Soussi, Thierry, et al. “Recommendations for Analyzing and Reporting TP53 Gene Variants in the High-Throughput Sequencing Era.” Human Mutation, vol. 35, no. 6, 2014, pp. 766–778., doi:10.1002/humu.22561](https://doi.org/10.1002/humu.22561)
13 changes: 12 additions & 1 deletion poetry.lock

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

3 changes: 2 additions & 1 deletion pyproject.toml
@@ -31,6 +31,7 @@ classifiers = [
 [tool.poetry.dependencies]
 python = "^3.11"
 beautifulsoup4 = "~4.12"
+chromedriver-py = "*"
 google-api-python-client = "~2.151"
 google-auth-httplib2 = "~0.2"
 google-auth-oauthlib = "~1.2.1"
@@ -123,7 +124,7 @@ exclude = [
 ]

 [[tool.mypy.overrides]]
-module = "defopt"
+module = "chromedriver_py"
 ignore_missing_imports = true

 [[tool.mypy.overrides]]
2 changes: 1 addition & 1 deletion tests/seshat/test_upload.py
@@ -1,4 +1,4 @@
-from tp53.seshat import HumanGenomeAssembly
+from tp53.seshat.upload_vcf import HumanGenomeAssembly


 def test_human_genome_assembly() -> None:
3 changes: 0 additions & 3 deletions tp53/seshat/__init__.py
@@ -1,4 +1 @@
 from tp53.seshat._exceptions import SeshatError as SeshatError
-from tp53.seshat._gmail_find import find_in_gmail as find_in_gmail
-from tp53.seshat._upload import HumanGenomeAssembly as HumanGenomeAssembly
-from tp53.seshat._upload import upload_vcf as upload_vcf
1 change: 1 addition & 0 deletions tp53/seshat/find_in_gmail/__init__.py
@@ -0,0 +1 @@
from ._find_in_gmail import find_in_gmail as find_in_gmail
2 changes: 2 additions & 0 deletions tp53/seshat/find_in_gmail/__main__.py
@@ -0,0 +1,2 @@
if __name__ == "__main__":
    ...
@@ -20,7 +20,7 @@
 from google_auth_oauthlib.flow import InstalledAppFlow

Check warning on line 20 in tp53/seshat/find_in_gmail/_find_in_gmail.py (GitHub Actions / Tests 3.11, 3.12, 3.13):
Stub file not found for "google_auth_oauthlib.flow" (reportMissingTypeStubs)

 from googleapiclient.discovery import build as build_google_client

Check warning on line 21 in tp53/seshat/find_in_gmail/_find_in_gmail.py (GitHub Actions / Tests 3.11, 3.12, 3.13):
Stub file not found for "googleapiclient.discovery" (reportMissingTypeStubs)
Type of "build_google_client" is partially unknown: "(serviceName: Unknown, version: Unknown, http: Unknown | None = None, discoveryServiceUrl: Unknown | None = None, developerKey: Unknown | None = None, model: Unknown | None = None, requestBuilder: type[HttpRequest] = HttpRequest, credentials: Unknown | None = None, cache_discovery: bool = True, cache: Unknown | None = None, client_options: Unknown | None = None, adc_cert_path: Unknown | None = None, adc_key_path: Unknown | None = None, num_retries: int = 1, static_discovery: Unknown | None = None, always_use_jwt_access: bool = False) -> Unknown" (reportUnknownVariableType)

-from ._exceptions import SeshatError
+from .._exceptions import SeshatError

 logger: Logger = getLogger("tp53.seshat")
2 changes: 2 additions & 0 deletions tp53/seshat/upload_vcf/__init__.py
@@ -0,0 +1,2 @@
from ._upload_vcf import HumanGenomeAssembly as HumanGenomeAssembly
from ._upload_vcf import upload_vcf as upload_vcf
118 changes: 118 additions & 0 deletions tp53/seshat/upload_vcf/__main__.py
@@ -0,0 +1,118 @@
"""
Upload a VCF to the Seshat TP53 annotation server using a headless browser.

This tool is used to programmatically configure and upload batch variants in VCF
format to the Seshat annotation server. The tool works by building a headless
Chrome browser instance and then interacting with the Seshat website directly
through simulated key presses and mouse clicks. Unfortunately, Seshat does not
provide a native programmatic API and one could not be reverse engineered.
Seshat also utilizes custom JavaScript in its form processing, so a
lightweight approach of simply interacting with the HTML form elements was
also not possible.

#### VCF Input Requirements

Seshat will not let the user know why a VCF fails to annotate, but it has
been observed that Seshat can fail to parse some of VarDictJava's structural
variants (SVs) as valid variant records. One solution that has worked in the
past is to remove SVs. The following command will exclude all variants with a
non-empty SVTYPE INFO key:

    bcftools view in.vcf --exclude 'SVTYPE!="."' > out.noSV.vcf

#### Automation

There are no terms and conditions posted on the Seshat annotation server's
website, and there is no server-side `robots.txt` rule set. In lieu of usage
terms, we strongly encourage all users of this script to respect the Seshat
resource by adhering to the following best practices:

- Minimize Load: Limit the rate of requests to the server
- Minimize Connections: Limit the number of concurrent requests

If you need to batch process dozens, or hundreds, of VCF callsets, you may
consider improving this underlying Python script to randomize the user agent and
IP address of your headless browser session to avoid being labelled as a
bot.

#### Environment Setup

This script relies on Chrome:

    brew install --cask google-chrome

Some versions of macOS require you to authenticate the Chrome driver:

- https://stackoverflow.com/a/60362134

#### References

1. Soussi, Thierry, et al. “Recommendations for Analyzing and Reporting TP53
   Gene Variants in the High-Throughput Sequencing Era.” Human Mutation,
   vol. 35, no. 6, 2014, pp. 766–778., doi:10.1002/humu.22561.

───────
"""

import argparse
import logging
import sys
from pathlib import Path

from ._upload_vcf import DEFAULT_REMOTE_URL
from ._upload_vcf import HumanGenomeAssembly
from ._upload_vcf import upload_vcf

if __name__ == "__main__":
    formatter = argparse.RawTextHelpFormatter

    cli_args = sys.argv[1:]

    parser = argparse.ArgumentParser(
        description=__doc__,
        add_help=True,
        formatter_class=formatter,
        epilog=r"Copyright © Clint Valentine 2024",
    )

    _ = parser.add_argument(
        "--input",
        required=True,
        type=Path,
        help="The path to the VCF to upload.",
    )
    _ = parser.add_argument(
        "--email",
        required=True,
        type=str,
        help="The email address to receive annotated variants at.",
    )
    _ = parser.add_argument(
        "--assembly",
        type=HumanGenomeAssembly,
        default=HumanGenomeAssembly.hg38,
        help="The human genome assembly of the VCF.\n(default: hg38)",
    )
    _ = parser.add_argument(
        "--url",
        type=str,
        default=DEFAULT_REMOTE_URL,
        help="The Seshat TP53 web server URL.\n(default: http://vps338341.ovh.net/batch_analysis)",
    )
    _ = parser.add_argument(
        "--wait_for",
        type=int,
        default=5,
        help="Seconds to wait for upload to occur before failure.\n(default: 5)",
    )
    args = parser.parse_args(cli_args)

    logging.basicConfig(datefmt="[%X]", level=logging.INFO)

    upload_vcf(
        vcf=args.input,
        email=args.email,
        assembly=args.assembly,
        url=args.url,
        wait_for=args.wait_for,
    )
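
The `--assembly` option above passes an enum class as `type=`, which works because `argparse` calls that class on the raw command-line string and lets the enum look the member up by value. The sketch below shows the mechanism in isolation. `Assembly` and `build_parser` are hypothetical stand-ins, not the package's code: the enum is written with `str, Enum` and explicit values so the example runs on older interpreters, whereas the commit uses `StrEnum`.

```python
import argparse
from enum import Enum


class Assembly(str, Enum):
    """Hypothetical stand-in that mirrors the shape of HumanGenomeAssembly."""

    hg38 = "hg38"
    hg19 = "hg19"


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with a single enum-typed option."""
    parser = argparse.ArgumentParser()
    # argparse calls Assembly("<value>") on the raw string; an unknown value
    # raises ValueError, which argparse reports as a usage error (exit code 2).
    parser.add_argument("--assembly", type=Assembly, default=Assembly.hg38)
    return parser


args = build_parser().parse_args(["--assembly", "hg19"])
```

After parsing, `args.assembly` is an enum member, not a plain string, so downstream code can compare it against members directly.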
21 changes: 13 additions & 8 deletions tp53/seshat/_upload.py → tp53/seshat/upload_vcf/_upload_vcf.py
@@ -1,18 +1,20 @@
+import logging
+import time
 from datetime import datetime
 from datetime import timedelta
 from enum import StrEnum
 from enum import auto
 from logging import Logger
-from logging import getLogger
 from pathlib import Path

+from chromedriver_py import binary_path
 from selenium import webdriver
 from selenium.webdriver.common.by import By
 from selenium.webdriver.remote.webdriver import WebDriver as RemoteWebDriver

-from ._exceptions import SeshatError
+from .._exceptions import SeshatError

-logger: Logger = getLogger("tp53.seshat")
+logger: Logger = logging.getLogger("tp53.seshat.upload_vcf")

 DEFAULT_REMOTE_URL: str = "http://vps338341.ovh.net/batch_analysis"
 """The default remote Seshat batch analysis URL."""
@@ -34,7 +36,7 @@ class HumanGenomeAssembly(StrEnum):
     """The human genome assembly GRCh37 (hg19)."""


-def seshat_upload_status(driver: RemoteWebDriver) -> str:
+def upload_status(driver: RemoteWebDriver) -> str:
     """Query the file uploading status and return its text representation."""
     modal = driver.find_element(By.XPATH, '//*[@id="uploading-status-text"]')
     inner = modal.get_attribute("innerText")
@@ -55,16 +57,18 @@ def upload_vcf(
     Args:
         vcf: The path to the VCF to upload.
-        email: The email address to receive Seshat TP53 variant annotations.
+        email: The email address to receive annotated variants at.
         assembly: The human genome assembly of the VCF.
         url: The Seshat TP53 web server URL.
-        wait_for: The total amount of time in seconds to wait for the upload occur before failure.
+        wait_for: Seconds to wait for upload to occur before failure.
     """
     vcf = str(Path(vcf).expanduser().absolute())

+    service = webdriver.ChromeService(executable_path=binary_path)
     options = webdriver.ChromeOptions()
     options.add_argument("headless")
-    driver = webdriver.Chrome(options=options)
+
+    driver = webdriver.Chrome(service=service, options=options)

     driver.get(url)
     driver.find_element(By.XPATH, f'//select[@id="reference"]/option[@value="{assembly}"]').click()
@@ -75,8 +79,9 @@ def upload_vcf(

     status: str = ""
     while (SUCCESS not in status) and datetime.now() < upload_start + timedelta(seconds=wait_for):
-        status = seshat_upload_status(driver)
+        status = upload_status(driver)
         logger.info(status)
+        time.sleep(0.1)

     driver.quit()
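
The loop in this hunk is an instance of polling against a deadline. A generic version of the pattern is sketched here under stated assumptions: `poll_until` and all of its parameters are hypothetical names, and the sketch uses `time.monotonic` instead of `datetime.now()` so the deadline is unaffected by wall-clock adjustments, which is a swapped-in choice and not what the commit does.

```python
import time
from typing import Callable


def poll_until(
    check: Callable[[], str],
    done: Callable[[str], bool],
    timeout: float = 5.0,
    interval: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Call `check` until `done(status)` is true or `timeout` seconds elapse.

    Returns the last status seen; the caller decides whether it means success.
    """
    deadline = clock() + timeout
    status = ""
    while clock() < deadline:
        status = check()
        if done(status):
            break
        # Pause between polls so the page is not queried in a tight loop.
        sleep(interval)
    return status
```

Returning the last status, instead of raising inside the helper, leaves the choice of error type to the caller.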
