Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Add Interval constructor from UCSC-formatted string #29

Merged
merged 8 commits into from
Jul 15, 2024

Conversation

msto
Copy link
Contributor

@msto msto commented Apr 4, 2024

See function docstring for details

@msto msto requested review from jdidion, nh13 and clintval April 4, 2024 08:24
Copy link
Member

@nh13 nh13 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good. My major concern is that it is too restrictive on parsing contig names

pybedlite/overlap_detector.py Outdated Show resolved Hide resolved
pybedlite/overlap_detector.py Outdated Show resolved Hide resolved
negative = False

# Then parse the location
position_re = re.compile(r"^(chr(\d+|X|Y|M|MT)(?:_[A-Za-z0-9]+_alt)?):(\d+)-(\d+)$")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. ditto about pre-compiling
  2. I am confused by this regex and why it has so much assumed about the contig name? Have you thought about this deeply for non-human or alternate/decoy/custom contig names? For example chrUn_JTFH01001499v1_decoy doesn't parse and neither does HLA-DRB1*15:01:01:02.

Copy link
Contributor Author

@msto msto Apr 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good points both.

Should we accept any string of characters followed by the start/end (i.e. ^(.*):(\d+)-(\d+)$) ? Or are there any constraints you think we should impose on the contig name?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lets use the regex in the SAM spec? Alternatively, why not just split on the last colon?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you're fine splitting on the last colon, this is functionally equivalent and imo slightly more elegant since every element (refname, start, and end) is represented in a matched group

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you have a strong preference I'll put in the SAM spec regex. It wouldn't be my preference as it's somewhat gnarly and I don't think python supports defining custom character classes to simplify, the way [:rname:] is used in the spec

[0-9A-Za-z!#$%&+./:;?@^_|~-][0-9A-Za-z!#$%&*+./:;=?@^_|~-]*

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nh13 bump 🙂

pybedlite/overlap_detector.py Outdated Show resolved Hide resolved
pybedlite/overlap_detector.py Outdated Show resolved Hide resolved
@msto msto requested a review from nh13 April 4, 2024 15:14
Copy link

codecov bot commented Apr 4, 2024

Codecov Report

Attention: Patch coverage is 92.30769% with 2 lines in your changes are missing coverage. Please review.

Project coverage is 91.81%. Comparing base (674567c) to head (9e9eb97).

Files Patch % Lines
pybedlite/overlap_detector.py 88.23% 1 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main      #29      +/-   ##
==========================================
+ Coverage   91.79%   91.81%   +0.02%     
==========================================
  Files           8        8              
  Lines         524      550      +26     
  Branches       92       95       +3     
==========================================
+ Hits          481      505      +24     
- Misses         26       27       +1     
- Partials       17       18       +1     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Member

@nh13 nh13 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the advantage of these regexes over some simple string splitting? I am thinking about maintenance, performance, etc.

strand = string[-2] if string[-3:] == "(-)" or string[-3:] == "(+)" else None    
refname, rest = string[:-3].rsplit(":", maxsplit=1)
start, end = rest.split("-", maxsplit=1)
return cls(refname=refname, start=int(start), end=int(end), negative=negative == "-", name=name)

@@ -55,6 +56,24 @@
from pybedlite.bed_source import BedSource
from pybedlite.bed_record import BedRecord

UCSC_STRAND_REGEX = re.compile(r".*\((\+|-)\)$")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add typing here and below typing.Pattern


# Check strand
assert Interval.from_ucsc("chr1:101-200(+)") == Interval("chr1", 100, 200, negative=False)
assert Interval.from_ucsc("chr1:101-200(-)") == Interval("chr1", 100, 200, negative=True)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use @pytest.mark.parametrize?

@msto msto force-pushed the ms_interval-from-string branch from 5d34be5 to 2f0f545 Compare July 15, 2024 12:53
@msto msto had a problem deploying to github-action-ci July 15, 2024 12:54 — with GitHub Actions Failure
@msto msto had a problem deploying to github-action-ci July 15, 2024 12:54 — with GitHub Actions Error
@msto msto had a problem deploying to github-action-ci July 15, 2024 12:54 — with GitHub Actions Error
@msto msto had a problem deploying to github-action-ci July 15, 2024 12:54 — with GitHub Actions Failure
@msto msto temporarily deployed to github-action-ci July 15, 2024 12:57 — with GitHub Actions Inactive
@msto msto temporarily deployed to github-action-ci July 15, 2024 12:57 — with GitHub Actions Inactive
@msto msto temporarily deployed to github-action-ci July 15, 2024 12:57 — with GitHub Actions Inactive
@msto msto temporarily deployed to github-action-ci July 15, 2024 12:57 — with GitHub Actions Inactive
@msto msto requested a review from nh13 July 15, 2024 12:58
@msto msto temporarily deployed to github-action-ci July 15, 2024 13:00 — with GitHub Actions Inactive
@msto msto temporarily deployed to github-action-ci July 15, 2024 13:00 — with GitHub Actions Inactive
@msto msto temporarily deployed to github-action-ci July 15, 2024 13:00 — with GitHub Actions Inactive
@msto msto temporarily deployed to github-action-ci July 15, 2024 13:00 — with GitHub Actions Inactive
Copy link
Member

@nh13 nh13 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @msto, this LGTM!

@msto msto merged commit 592fb8e into main Jul 15, 2024
4 checks passed
@msto msto deleted the ms_interval-from-string branch July 15, 2024 14:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants