feature(dev-branch-pacbio) #3453

Merged
merged 76 commits into from
Aug 15, 2024
Commits
5018359
first draft
ChrOertlin Jul 22, 2024
b04281b
Merge branch 'master' into dev-pacbio-flow
ChrOertlin Jul 23, 2024
6c69db4
add new classes
ChrOertlin Jul 23, 2024
6137696
flesh out skeleton
ChrOertlin Jul 23, 2024
6c07296
iniital pacbio setup
ChrOertlin Jul 23, 2024
1d2c34e
Merge branch 'master' into dev-pacbio-flow
ChrOertlin Jul 23, 2024
e83b91c
Apply suggestions from code review
ChrOertlin Jul 23, 2024
32517f0
Merge branch 'master' into dev-pacbio-flow
diitaz93 Jul 24, 2024
166a830
add pacbio run data generator (#3458)
ChrOertlin Jul 24, 2024
8162060
Merge branch 'master' into dev-pacbio-flow
ChrOertlin Jul 24, 2024
dbeeeae
add PacBioRunFileManager (#3460)
diitaz93 Jul 24, 2024
53a3cd2
Move the modules to the correct location (#3462)
diitaz93 Jul 25, 2024
d2b5835
refactor pacbio metrics parser (#3467)
diitaz93 Jul 26, 2024
25a53b8
PacBio - implement parsing of failed reads metrics (#3469)
diitaz93 Jul 26, 2024
2b49c33
setup dtos (#3472)
ChrOertlin Jul 26, 2024
ee9ab63
add(hk service to pacbio flow) (#3466)
ChrOertlin Jul 26, 2024
ca4b6b1
add sample dto (#3475)
ChrOertlin Jul 29, 2024
cf5a0e4
Merge branch 'master' into dev-pacbio-flow
ChrOertlin Jul 29, 2024
1a21ecb
small change in dto
ChrOertlin Jul 29, 2024
b9c23ea
add sample id to dto
ChrOertlin Jul 29, 2024
0103b45
fix types of pacbio dto attributes
diitaz93 Jul 29, 2024
597871b
remove unused import
diitaz93 Jul 29, 2024
c0c4ce6
black
diitaz93 Jul 29, 2024
7a5df2b
Add PacBio Data Transfer service (#3477)(patch)
diitaz93 Jul 29, 2024
4b380b9
abstract classes cleanup
ChrOertlin Jul 29, 2024
e842068
fix self
ChrOertlin Jul 29, 2024
8b78253
add(pacbio store service) (#3478)
ChrOertlin Jul 29, 2024
9e7c594
Merge branch 'master' into dev-pacbio-flow
diitaz93 Jul 29, 2024
94efad2
add abstract method decorator and removed unused imports
diitaz93 Jul 29, 2024
705a0be
add PacBio post processing service (#3481)
diitaz93 Jul 29, 2024
8380cf2
remove unused function in HK service
diitaz93 Jul 29, 2024
325b045
Add cli post processing (#3485)
diitaz93 Jul 30, 2024
e4872ee
Remove pbi file from post processing (#3491)
diitaz93 Jul 31, 2024
eec50e6
rename test module
diitaz93 Jul 31, 2024
7a41cf6
add error handlers pacbio flow (#3487)
ChrOertlin Jul 31, 2024
28183de
add dry run functionality to PacBio post-processing (#3483)
diitaz93 Jul 31, 2024
266822e
black
diitaz93 Jul 31, 2024
77ce384
Merge branch 'master' into dev-pacbio-flow
diitaz93 Aug 7, 2024
9692bfa
Merge branch 'master' into dev-pacbio-flow
diitaz93 Aug 7, 2024
a0c1775
black
diitaz93 Aug 7, 2024
171eef6
Merge branch 'master' into dev-pacbio-flow
diitaz93 Aug 7, 2024
dba4415
Set PacBio sequencing times (#3518)
diitaz93 Aug 8, 2024
6940d27
Merge branch 'master' into dev-pacbio-flow
diitaz93 Aug 8, 2024
4a3ff25
Merge branch 'master' into dev-pacbio-flow
diitaz93 Aug 8, 2024
6720120
add CLI command to CLI
diitaz93 Aug 9, 2024
b0501e7
improve help docstrings
diitaz93 Aug 9, 2024
2606b19
add logging to post-process base
diitaz93 Aug 9, 2024
079d9c7
Improve error raising with wrong run name
diitaz93 Aug 9, 2024
53167d9
validate existence of run path
diitaz93 Aug 9, 2024
5ed5666
fix bug in PacBio service definition
diitaz93 Aug 9, 2024
fbe13d3
Merge branch 'master' into dev-pacbio-flow
ChrOertlin Aug 12, 2024
b04330e
Merge branch 'master' into dev-pacbio-flow
diitaz93 Aug 14, 2024
abccba0
fix error handling logging
diitaz93 Aug 14, 2024
e90789b
rename dir (#3562)
ChrOertlin Aug 14, 2024
65c1995
Merge branch 'master' into dev-pacbio-flow
diitaz93 Aug 14, 2024
487ccb8
Merge branch 'master' into dev-pacbio-flow
diitaz93 Aug 14, 2024
9d3bd2b
reordered dto, metric and model attributes
diitaz93 Aug 14, 2024
a823c99
add missing parameter to sample DTO
diitaz93 Aug 14, 2024
463b2d4
Merge branch 'master' into dev-pacbio-flow
diitaz93 Aug 14, 2024
17052d4
fix names in model
diitaz93 Aug 14, 2024
06482a3
Merge branch 'master' into dev-pacbio-flow
diitaz93 Aug 14, 2024
3f36046
switch pacbio model column names back
diitaz93 Aug 14, 2024
d4a6de1
apply column name change in crud
diitaz93 Aug 14, 2024
e26ef8e
change dry run logging from debug to info
diitaz93 Aug 14, 2024
848eba2
fix sample run metrics table name
diitaz93 Aug 14, 2024
c616758
Improve docsrings
diitaz93 Aug 14, 2024
3fbf44f
Merge branch 'master' into dev-pacbio-flow
diitaz93 Aug 15, 2024
bc46e44
Make hifi_mean_read_length int
diitaz93 Aug 15, 2024
445c21c
Make failed_mean_read_length int
diitaz93 Aug 15, 2024
ebc3ae3
Make polymerase reads int
diitaz93 Aug 15, 2024
f4eb5d2
Make sample metrics int
diitaz93 Aug 15, 2024
3af0845
Make control metrics int
diitaz93 Aug 15, 2024
f0a4d42
Add cell tag to bam file
diitaz93 Aug 15, 2024
f1ed67b
Add cell tag to bam file and fix tag concatenation (#3571)
diitaz93 Aug 15, 2024
758d862
Merge branch 'master' into dev-pacbio-flow
diitaz93 Aug 15, 2024
063bd25
Merge remote-tracking branch 'origin/dev-pacbio-flow' into dev-pacbio…
diitaz93 Aug 15, 2024
26 changes: 11 additions & 15 deletions cg/apps/housekeeper/hk.py
@@ -7,11 +7,7 @@

 from housekeeper.include import checksum as hk_checksum
 from housekeeper.include import include_version
-from housekeeper.store.database import (
-    create_all_tables,
-    drop_all_tables,
-    initialize_database,
-)
+from housekeeper.store.database import create_all_tables, drop_all_tables, initialize_database
 from housekeeper.store.models import Archive, Bundle, File, Tag, Version
 from housekeeper.store.store import Store
 from sqlalchemy.orm import Query
@@ -570,19 +566,19 @@ def add_bundle_and_version_if_non_existent(self, bundle_name: str) -> None:
         else:
             LOG.debug(f"Bundle with name {bundle_name} already exists")

-    def store_fastq_path_in_housekeeper(
+    def create_bundle_and_add_file_with_tags(
         self,
-        sample_internal_id: str,
-        sample_fastq_path: Path,
-        flow_cell_id: str,
+        bundle_name: str,
+        file_path: Path,
+        tags: list[str],
     ) -> None:
-        """Add the fastq file path with tags to a bundle and version in Housekeeper."""
-        self.add_bundle_and_version_if_non_existent(sample_internal_id)
-        self.add_tags_if_non_existent([sample_internal_id])
+        """Add a file path with tags to a bundle and version in Housekeeper."""
+        self.add_bundle_and_version_if_non_existent(bundle_name)
+        self.add_tags_if_non_existent([bundle_name])
         self.add_file_to_bundle_if_non_existent(
-            file_path=sample_fastq_path,
-            bundle_name=sample_internal_id,
-            tag_names=[SequencingFileTag.FASTQ, flow_cell_id, sample_internal_id],
+            file_path=file_path,
+            bundle_name=bundle_name,
+            tag_names=tags,
         )

     def get_archive_entries(
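The rename in this hunk turns a FASTQ-specific helper into a generic one: the caller now passes the bundle name, the file path, and the complete tag list. The sketch below illustrates that calling convention only. `FakeHousekeeper` is a hypothetical in-memory stand-in, not the real Housekeeper API, and the sample IDs, paths, and tag strings are made up for illustration.

```python
from pathlib import Path


class FakeHousekeeper:
    """Hypothetical in-memory stand-in for the Housekeeper API (illustration only)."""

    def __init__(self) -> None:
        # Maps bundle name -> {file path -> tags}.
        self.bundles: dict[str, dict[Path, list[str]]] = {}

    def create_bundle_and_add_file_with_tags(
        self, bundle_name: str, file_path: Path, tags: list[str]
    ) -> None:
        """Create the bundle if it is missing and add the file once with the given tags."""
        files = self.bundles.setdefault(bundle_name, {})
        # setdefault keeps the first entry, mimicking an "add if non-existent" call.
        files.setdefault(file_path, list(tags))


hk = FakeHousekeeper()

# Illumina-style call: the caller now builds the FASTQ tag list itself.
hk.create_bundle_and_add_file_with_tags(
    bundle_name="ACC0001",
    file_path=Path("/data/ACC0001_R1.fastq.gz"),
    tags=["FLOWCELL1", "ACC0001", "fastq"],
)

# A non-FASTQ file can reuse the same method with different tags.
hk.create_bundle_and_add_file_with_tags(
    bundle_name="ACC0002",
    file_path=Path("/data/ACC0002.hifi_reads.bam"),
    tags=["bam", "ACC0002"],
)
```

Moving the tag list to the caller is what lets the same method serve both the Illumina call sites and other file types without another file-specific helper.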
5 changes: 3 additions & 2 deletions cg/cli/base.py
@@ -9,8 +9,6 @@
 from sqlalchemy.orm import scoped_session

 import cg
-from cg.cli.sequencing_qc.sequencing_qc import sequencing_qc
-from cg.cli.validate import validate
 from cg.cli.add import add as add_cmd
 from cg.cli.archive import archive
 from cg.cli.backup import backup
@@ -22,6 +20,8 @@
 from cg.cli.downsample import downsample
 from cg.cli.generate.base import generate as generate_cmd
 from cg.cli.get import get
+from cg.cli.post_process.post_process import post_process_group as post_processing
+from cg.cli.sequencing_qc.sequencing_qc import sequencing_qc
 from cg.cli.set.base import set_cmd
 from cg.cli.store.base import store as store_cmd
 from cg.cli.transfer import transfer_group
@@ -120,5 +120,6 @@ def init(context: CGConfig, reset: bool, force: bool):
 base.add_command(demultiplex_cmd)
 base.add_command(generate_cmd)
 base.add_command(downsample)
+base.add_command(post_processing)
 base.add_command(validate)
 base.add_command(sequencing_qc)
39 changes: 39 additions & 0 deletions cg/cli/post_process/post_process.py
@@ -0,0 +1,39 @@
+"""CLI commands to post-process a sequencing run."""
+
+import logging
+
+import click
+
+from cg.cli.post_process.utils import get_post_processing_service_from_run_name
+from cg.cli.utils import CLICK_CONTEXT_SETTINGS
+from cg.constants.cli_options import DRY_RUN
+from cg.models.cg_config import CGConfig
+from cg.services.run_devices.abstract_classes import PostProcessingService
+
+LOG = logging.getLogger(__name__)
+
+
+@click.group(name="post-process", context_settings=CLICK_CONTEXT_SETTINGS)
+def post_process_group():
+    """Post-process sequencing runs from the sequencing instruments."""
+    LOG.info("Running cg post-processing.")
+
+
+@post_process_group.command(name="run")
+@DRY_RUN
+@click.argument("run-name")
+@click.pass_obj
+def post_process_sequencing_run(context: CGConfig, run_name: str, dry_run: bool):
+    """Post-process a sequencing run from the PacBio instrument.
+
+    run-name is the full name of the sequencing unit of run. For example:
+    PacBio: 'r84202_20240522_133539/1_A01'
+    """
+    post_processing_service: PostProcessingService = get_post_processing_service_from_run_name(
+        context=context, run_name=run_name
+    )
+    post_processing_service.post_process(run_name=run_name, dry_run=dry_run)
+
+
+post_process_group: click.Group
+post_process_group.add_command(post_process_sequencing_run)
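The command above forwards a `dry_run` flag to the service's `post_process` method. How the real services honour that flag is not shown in this diff, so the following is only a sketch under assumptions: a hypothetical minimal `PostProcessingService` interface and an illustrative `RecordingPostProcessingService` that logs and returns early on a dry run instead of storing anything.

```python
import logging
from abc import ABC, abstractmethod

LOG = logging.getLogger(__name__)


class PostProcessingService(ABC):
    """Hypothetical minimal interface; the real abstract class is not shown in this diff."""

    @abstractmethod
    def post_process(self, run_name: str, dry_run: bool = False) -> None:
        """Post-process the given run."""


class RecordingPostProcessingService(PostProcessingService):
    """Illustrative service that records run names instead of writing to a database."""

    def __init__(self) -> None:
        self.stored_runs: list[str] = []

    def post_process(self, run_name: str, dry_run: bool = False) -> None:
        if dry_run:
            # Assumed behaviour: report what would happen and store nothing.
            LOG.info(f"Dry run: would post-process {run_name}")
            return
        self.stored_runs.append(run_name)


service = RecordingPostProcessingService()
service.post_process(run_name="r84202_20240522_133539/1_A01", dry_run=True)
runs_after_dry_run = list(service.stored_runs)
service.post_process(run_name="r84202_20240522_133539/1_A01", dry_run=False)
```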
23 changes: 23 additions & 0 deletions cg/cli/post_process/utils.py
@@ -0,0 +1,23 @@
+from cg.exc import CgError
+from cg.models.cg_config import CGConfig
+from cg.services.run_devices.pacbio.post_processing_service import PacBioPostProcessingService
+from cg.utils.mapping import get_item_by_pattern_in_source
+
+PATTERN_TO_DEVICE_MAP: dict[str, str] = {
+    r"^r\d+_\d+_\d+/(1|2)_[^/]+$": "pacbio",
+}
+
+
+def get_post_processing_service_from_run_name(
+    context: CGConfig, run_name: str
+) -> PacBioPostProcessingService:
+    """Get the correct post-processing service based on the run name."""
+    try:
+        device: str = get_item_by_pattern_in_source(
+            source=run_name, pattern_map=PATTERN_TO_DEVICE_MAP
+        )

Review comment on lines +16 to +18

Contributor: I'm not sure we want to use the same function for mapping files to tags as we do to map directories to post-process classes. I feel like that function might be too general. I don't think having a helper function that only applies PATTERN_TO_DEVICE_MAP is problematic

Contributor Author: I think we can wrap the map plus the function that returns the device.

+    except CgError as error:
+        raise NameError(
+            f"Run name {run_name} does not match with any known sequencing run name pattern"
+        ) from error
+    return getattr(context.post_processing_services, device)
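The review thread above suggests wrapping the pattern map together with the function that returns the device. A minimal sketch of such a dedicated helper follows. The name `get_device_from_run_name` and the choice of `ValueError` are assumptions for illustration, not code from this PR; only the regular expression is copied from the diff.

```python
import re

# Pattern copied from the diff: run ID, a slash, then a cell directory starting with 1_ or 2_.
PATTERN_TO_DEVICE_MAP: dict[str, str] = {
    r"^r\d+_\d+_\d+/(1|2)_[^/]+$": "pacbio",
}


def get_device_from_run_name(run_name: str) -> str:
    """Return the device whose pattern matches the run name (hypothetical helper)."""
    for pattern, device in PATTERN_TO_DEVICE_MAP.items():
        if re.search(pattern, run_name):
            return device
    raise ValueError(
        f"Run name {run_name} does not match any known sequencing run name pattern"
    )


device = get_device_from_run_name("r84202_20240522_133539/1_A01")
```

Keeping the map private to one helper would mean the CLI no longer depends on the generic file-to-tag matcher, which is the concern raised in the first comment.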
1 change: 1 addition & 0 deletions cg/constants/constants.py
@@ -195,6 +195,7 @@ class HastaSlurmPartitions(StrEnum):


 class FileExtensions(StrEnum):
+    BAM: str = ".bam"
     BED: str = ".bed"
     COMPLETE: str = ".complete"
     CONFIG: str = ".config"
50 changes: 40 additions & 10 deletions cg/constants/pacbio.py
@@ -1,24 +1,30 @@
 """Constants related to PacBio sequencing."""

+from cg.constants import FileExtensions
+

 class PacBioDirsAndFiles:
     ADAPTER_REPORT: str = "adapter.report.json"
     BASECALLING_REPORT: str = "ccs.report.json"
+    CCS_REPORT_SUFFIX: str = "ccs_report.json"
     CONTROL_REPORT: str = "control.report.json"
     LOADING_REPORT: str = "loading.report.json"
+    HIFI_READS: str = "hifi_reads"
     RAW_DATA_REPORT: str = "raw_data.report.json"
     SMRTLINK_DATASETS_REPORT: str = "smrtlink-datasets.json"
-    CCS_REPORT_SUFFIX: str = "ccs_report.json"
     STATISTICS_DIR: str = "statistics"
     UNZIPPED_REPORTS_DIR: str = "unzipped_reports"


 class CCSAttributeIDs:
-    NUMBER_OF_READS: str = "ccs2.number_of_ccs_reads"
-    TOTAL_NUMBER_OF_BASES: str = "ccs2.total_number_of_ccs_bases"
-    MEAN_READ_LENGTH: str = "ccs2.mean_ccs_readlength"
-    MEDIAN_READ_LENGTH: str = "ccs2.median_ccs_readlength"
-    READ_LENGTH_N50: str = "ccs2.ccs_readlength_n50"
-    MEDIAN_ACCURACY: str = "ccs2.median_accuracy"
-    PERCENT_Q30: str = "ccs2.percent_ccs_bases_q30"
+    HIFI_READS: str = "ccs_processing.number_of_ccs_reads_q20"
+    HIFI_YIELD: str = "ccs_processing.total_number_of_ccs_bases_q20"
+    HIFI_MEAN_READ_LENGTH: str = "ccs_processing.mean_ccs_readlength_q20"
+    HIFI_MEDIAN_READ_LENGTH: str = "ccs_processing.median_ccs_readlength_q20"
+    HIFI_READ_LENGTH_N50: str = "ccs_processing.ccs_readlength_n50_q20"
+    HIFI_MEDIAN_READ_QUALITY: str = "ccs_processing.median_qv_q20"
+    PERCENT_Q30: str = "ccs_processing.base_percentage_q30"
+    FAILED_READS: str = "ccs_processing.number_of_ccs_reads_lq"
+    FAILED_YIELD: str = "ccs_processing.total_number_of_ccs_bases_lq"
+    FAILED_MEAN_READ_LENGTH: str = "ccs_processing.mean_ccs_readlength_lq"


 class ControlAttributeIDs:
@@ -48,5 +54,29 @@ class SmrtLinkDatabasesIDs:
     CELL_INDEX: str = "cellIndex"
     MOVIE_NAME: str = "metadataContextId"
     PATH: str = "path"
+    RUN_COMPLETED_AT = "createdAt"
     WELL_NAME: str = "wellName"
     WELL_SAMPLE_NAME: str = "wellSampleName"
+
+
+class PacBioHousekeeperTags:
+    CCS_REPORT: str = "ccs-report"
+    CONTROL_REPORT: str = "control-report"
+    LOADING_REPORT: str = "loading-report"
+    RAWDATA_REPORT: str = "raw-data-report"
+    DATASETS_REPORT: str = "datasets-report"
+
+
+class PacBioBundleTypes:
+    SAMPLE: str = "sample"
+    SMRT_CELL: str = "smrt_cell"
+
+
+file_pattern_to_bundle_type: dict[str, str] = {
+    PacBioDirsAndFiles.CONTROL_REPORT: PacBioBundleTypes.SMRT_CELL,
+    f".*{PacBioDirsAndFiles.CCS_REPORT_SUFFIX}$": PacBioBundleTypes.SMRT_CELL,
+    PacBioDirsAndFiles.LOADING_REPORT: PacBioBundleTypes.SMRT_CELL,
+    PacBioDirsAndFiles.RAW_DATA_REPORT: PacBioBundleTypes.SMRT_CELL,
+    PacBioDirsAndFiles.SMRTLINK_DATASETS_REPORT: PacBioBundleTypes.SMRT_CELL,
+    f"{PacBioDirsAndFiles.HIFI_READS}{FileExtensions.BAM}$": PacBioBundleTypes.SAMPLE,
+}
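The `file_pattern_to_bundle_type` map above pairs file-name patterns with a bundle type. The sketch below shows one way such a map could be consumed, assuming the keys are treated as regular expressions and matched with `re.search`. The helper `get_bundle_type` and the example file names are hypothetical; the pattern strings are the expanded values of the constants in the diff.

```python
import re
from typing import Optional

# Expanded values of the constants shown in the diff.
FILE_PATTERN_TO_BUNDLE_TYPE: dict[str, str] = {
    "control.report.json": "smrt_cell",
    ".*ccs_report.json$": "smrt_cell",
    "loading.report.json": "smrt_cell",
    "raw_data.report.json": "smrt_cell",
    "smrtlink-datasets.json": "smrt_cell",
    "hifi_reads.bam$": "sample",
}


def get_bundle_type(file_name: str) -> Optional[str]:
    """Return the bundle type of the first pattern found in the file name, if any."""
    for pattern, bundle_type in FILE_PATTERN_TO_BUNDLE_TYPE.items():
        if re.search(pattern, file_name):
            return bundle_type
    return None


# Hypothetical file names, used only to exercise the lookup.
bam_bundle = get_bundle_type("m84202_240522_135641_s1.hifi_reads.bam")
report_bundle = get_bundle_type("m84202_240522_135641_s1.ccs_report.json")
unknown_bundle = get_bundle_type("notes.txt")
```

If the keys are indeed used as regular expressions, the unescaped dots match any character, so the patterns are slightly more permissive than the literal file names suggest.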
4 changes: 0 additions & 4 deletions cg/exc.py
@@ -314,10 +314,6 @@ class OverrideCyclesError(CgError):
     """Exception raised when the override cycles are not correct."""


-class PacBioMetricsParsingError(CgError):
-    """Exception raised when PacBio metric files are not in place."""
-
-
 class Chanjo2APIClientError(CgError):
     """Exception related to the Chanjo2 API client."""

58 changes: 58 additions & 0 deletions cg/models/cg_config.py
@@ -30,6 +30,23 @@
     FastqConcatenationService,
 )
 from cg.services.pdc_service.pdc_service import PdcService
+from cg.services.run_devices.pacbio.data_storage_service.pacbio_store_service import (
+    PacBioStoreService,
+)
+from cg.services.run_devices.pacbio.data_transfer_service.data_transfer_service import (
+    PacBioDataTransferService,
+)
+from cg.services.run_devices.pacbio.housekeeper_service.pacbio_houskeeper_service import (
+    PacBioHousekeeperService,
+)
+from cg.services.run_devices.pacbio.metrics_parser.metrics_parser import PacBioMetricsParser
+from cg.services.run_devices.pacbio.post_processing_service import PacBioPostProcessingService
+from cg.services.run_devices.pacbio.run_data_generator.pacbio_run_data_generator import (
+    PacBioRunDataGenerator,
+)
+from cg.services.run_devices.pacbio.run_file_manager.run_file_manager import (
+    PacBioRunFileManager,
+)
 from cg.services.sequencing_qc_service.sequencing_qc_service import SequencingQCService
 from cg.services.slurm_service.slurm_cli_service import SlurmCLIService
 from cg.services.slurm_service.slurm_service import SlurmService
@@ -326,6 +343,13 @@ class RunInstruments(BaseModel):
     illumina: IlluminaConfig


+class PostProcessingServices(BaseModel):
+    pacbio: PacBioPostProcessingService
+
+    class Config:
+        arbitrary_types_allowed = True
+
+
 class CGConfig(BaseModel):
     data_input: DataInput | None = None
     database: str
@@ -382,6 +406,7 @@ class CGConfig(BaseModel):
     mutacc_auto_api_: MutaccAutoAPI = None
     pdc: CommonAppConfig | None = None
     pdc_service_: PdcService | None
+    post_processing_services_: PostProcessingServices | None = None
     pigz: CommonAppConfig | None = None
     sample_sheet_api_: IlluminaSampleSheetService | None = None
     scout: CommonAppConfig = None
@@ -425,6 +450,7 @@ class Config:
             "madeline_api_": "madeline_api",
             "mutacc_auto_api_": "mutacc_auto_api",
             "pdc_service_": "pdc_service",
+            "post_processing_services_": "post_processing_services",
             "scout_api_": "scout_api",
             "status_db_": "status_db",
             "trailblazer_api_": "trailblazer_api",
@@ -563,6 +589,38 @@ def mutacc_auto_api(self) -> MutaccAutoAPI:
             self.mutacc_auto_api_ = api
         return api

+    @property
+    def post_processing_services(self) -> PostProcessingServices:
+        services = self.__dict__.get("post_processing_services_")
+        if services is None:
+            LOG.debug("Instantiating post-processing services")
+            services = PostProcessingServices(
+                pacbio=self.get_pacbio_post_processing_service(),
+            )
+            self.post_processing_services_ = services
+        return services
+
+    def get_pacbio_post_processing_service(self) -> PacBioPostProcessingService:
+        LOG.debug("Instantiating PacBio post-processing service")
+        run_data_generator = PacBioRunDataGenerator()
+        file_manager = PacBioRunFileManager()
+        metrics_parser = PacBioMetricsParser(file_manager=file_manager)
+        transfer_service = PacBioDataTransferService(metrics_service=metrics_parser)
+        store_service = PacBioStoreService(
+            store=self.status_db, data_transfer_service=transfer_service
+        )
+        hk_service = PacBioHousekeeperService(
+            hk_api=self.housekeeper_api,
+            file_manager=file_manager,
+            metrics_parser=metrics_parser,
+        )
+        return PacBioPostProcessingService(
+            run_data_generator=run_data_generator,
+            hk_service=hk_service,
+            store_service=store_service,
+            sequencing_dir=self.run_instruments.pacbio.data_dir,
+        )
+
     @property
     def pdc_service(self) -> PdcService:
         service = self.__dict__.get("pdc_service_")
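The `post_processing_services` property above builds its services on first access and then reuses them. The same lazy-initialisation pattern can be sketched without Pydantic or any of the `cg` services. `ExpensiveService` and `LazyConfig` are hypothetical names used only to make the caching behaviour visible.

```python
from typing import Optional


class ExpensiveService:
    """Hypothetical service that counts how many times it has been instantiated."""

    instances_created: int = 0

    def __init__(self) -> None:
        ExpensiveService.instances_created += 1


class LazyConfig:
    """Illustrative config object that builds its service on first access only."""

    def __init__(self) -> None:
        self._service: Optional[ExpensiveService] = None

    @property
    def service(self) -> ExpensiveService:
        if self._service is None:
            # The first access pays the construction cost; later accesses reuse the object.
            self._service = ExpensiveService()
        return self._service


config = LazyConfig()
created_before_access = ExpensiveService.instances_created
first = config.service
second = config.service
```

With this pattern, commands that never touch post-processing do not pay for wiring the file manager, metrics parser, and store services together.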
16 changes: 8 additions & 8 deletions cg/services/illumina/post_processing/housekeeper_storage.py
@@ -78,10 +78,10 @@ def add_sample_fastq_files_to_housekeeper(
                 device_internal_id=run_directory_data.id,
                 store=store,
             ):
-                hk_api.store_fastq_path_in_housekeeper(
-                    sample_internal_id=sample_internal_id,
-                    sample_fastq_path=sample_fastq_path,
-                    flow_cell_id=run_directory_data.id,
+                hk_api.create_bundle_and_add_file_with_tags(
+                    bundle_name=sample_internal_id,
+                    file_path=sample_fastq_path,
+                    tags=[run_directory_data.id, sample_internal_id, SequencingFileTag.FASTQ],
                 )


@@ -106,10 +106,10 @@ def store_undetermined_fastq_files(
                 device_internal_id=run_directory_data.id,
                 store=store,
             ):
-                hk_api.store_fastq_path_in_housekeeper(
-                    sample_internal_id=sample_id,
-                    sample_fastq_path=fastq_path,
-                    flow_cell_id=run_directory_data.id,
+                hk_api.create_bundle_and_add_file_with_tags(
+                    bundle_name=sample_id,
+                    file_path=fastq_path,
+                    tags=[run_directory_data.id, SequencingFileTag.FASTQ, sample_id],
                 )


39 changes: 0 additions & 39 deletions cg/services/pacbio/metrics/metrics_parser.py

This file was deleted.
