feat: Add Interval constructor from UCSC-formatted string #29
Looking good. My major concern is that it is too restrictive on parsing contig names
pybedlite/overlap_detector.py
Outdated
negative = False

# Then parse the location
position_re = re.compile(r"^(chr(\d+|X|Y|M|MT)(?:_[A-Za-z0-9]+_alt)?):(\d+)-(\d+)$")
- ditto about pre-compiling
- I am confused by this regex and why it assumes so much about the contig name. Have you thought about this deeply for non-human or alternate/decoy/custom contig names? For example, chrUn_JTFH01001499v1_decoy doesn't parse, and neither does HLA-DRB1*15:01:01:02.
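A quick way to check the concern above is to run the quoted pattern against a few contig names. This is only a sketch: the 1-100 coordinates are made up for illustration, and the alt contig name is an assumed example of the style the pattern was written for.

```python
import re

# Pattern quoted from the diff under review.
position_re = re.compile(r"^(chr(\d+|X|Y|M|MT)(?:_[A-Za-z0-9]+_alt)?):(\d+)-(\d+)$")

# Coordinates (1-100) are placeholders, not taken from the PR.
candidates = [
    "chr1:1-100",
    "chr1_KI270762v1_alt:1-100",         # assumed example of an alt contig
    "chrUn_JTFH01001499v1_decoy:1-100",  # decoy contig named in the review
    "HLA-DRB1*15:01:01:02:1-100",        # HLA contig named in the review
]

# Map each candidate to whether the pattern matches it.
results = {c: position_re.match(c) is not None for c in candidates}
for candidate, ok in results.items():
    print(f"{candidate!r}: {'parses' if ok else 'rejected'}")
```

The first two names match; the decoy and HLA names are rejected because the pattern requires a `chr` prefix followed by a number, X, Y, M, or MT.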
Good points both.
Should we accept any string of characters followed by the start/end (i.e. ^(.*):(\d+)-(\d+)$)? Or are there any constraints you think we should impose on the contig name?
Let's use the regex in the SAM spec? Alternatively, why not just split on the last colon?
If you're fine splitting on the last colon, this is functionally equivalent and imo slightly more elegant, since every element (refname, start, and end) is represented in a matched group.
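To illustrate the equivalence claimed above, here is a minimal sketch comparing the permissive regex from this thread with splitting on the last colon. The function names are hypothetical, and both return the coordinates exactly as written in the string, without any coordinate-system conversion.

```python
import re

# Permissive pattern proposed in the thread: anything, then ":start-end".
permissive_re = re.compile(r"^(.*):(\d+)-(\d+)$")


def parse_with_regex(ucsc):
    """Hypothetical helper: parse via the permissive regex."""
    match = permissive_re.match(ucsc)
    if match is None:
        raise ValueError(f"Not a valid position string: {ucsc!r}")
    refname, start, end = match.groups()
    return refname, int(start), int(end)


def parse_with_rsplit(ucsc):
    """Hypothetical helper: parse by splitting on the last colon."""
    refname, rest = ucsc.rsplit(":", maxsplit=1)
    start, end = rest.split("-", maxsplit=1)
    return refname, int(start), int(end)


# The greedy (.*) backtracks to the last colon, so colons in the
# contig name end up in the refname group in both approaches.
example = "HLA-DRB1*15:01:01:02:101-200"
regex_result = parse_with_regex(example)
rsplit_result = parse_with_rsplit(example)
print(regex_result, rsplit_result)
```

On well-formed input both approaches return the same tuple, including for contig names that themselves contain colons.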
If you have a strong preference I'll put in the SAM spec regex. It wouldn't be my preference, as it's somewhat gnarly and I don't think Python supports defining custom character classes to simplify it, the way [:rname:] is used in the spec:
[0-9A-Za-z!#$%&+./:;?@^_|~-][0-9A-Za-z!#$%&*+./:;=?@^_|~-]*
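One way to keep the SAM-spec pattern readable without custom character classes is to store it in a named string and interpolate it. This is a sketch only; the constant names `RNAME` and `UCSC_POSITION_RE` are hypothetical and not from the PR.

```python
import re

# Reference-name pattern quoted above from the SAM spec.
RNAME = r"[0-9A-Za-z!#$%&+./:;?@^_|~-][0-9A-Za-z!#$%&*+./:;=?@^_|~-]*"

# Hypothetical composed pattern: refname, then ":start-end".
UCSC_POSITION_RE = re.compile(rf"^({RNAME}):(\d+)-(\d+)$")

# Colons and "*" are allowed inside the name, so an HLA contig parses.
hla_match = UCSC_POSITION_RE.match("HLA-DRB1*15:01:01:02:101-200")
print(hla_match.groups())

# Names starting with "*" or "=" are rejected by the first character class.
star_match = UCSC_POSITION_RE.match("*chr1:1-10")
equals_match = UCSC_POSITION_RE.match("=chr1:1-10")
print(star_match, equals_match)
```

The name class includes `:`, `-`, and digits, so the match relies on backtracking to find the final `:start-end` suffix.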
@nh13 bump 🙂
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@            Coverage Diff             @@
##             main      #29      +/-   ##
==========================================
+ Coverage   91.79%   91.81%   +0.02%
==========================================
  Files           8        8
  Lines         524      550      +26
  Branches       92       95       +3
==========================================
+ Hits          481      505      +24
- Misses         26       27       +1
- Partials       17       18       +1
What's the advantage of these regexes over some simple string splitting? I am thinking about maintenance, performance, etc.
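has_strand = string.endswith(("(+)", "(-)"))
strand = string[-2] if has_strand else None
location = string[:-3] if has_strand else string
refname, rest = location.rsplit(":", maxsplit=1)
start, end = rest.split("-", maxsplit=1)
# The tests in this PR expect "chr1:101-200" to give start=100, so shift the 1-based start.
return cls(refname=refname, start=int(start) - 1, end=int(end), negative=strand == "-", name=name)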
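One maintenance-relevant difference between the two approaches is how strictly they validate. The sketch below uses the permissive regex from earlier in this thread and two hypothetical helper functions; it shows that `int()` tolerates surrounding whitespace and underscore separators, which `\d+` does not.

```python
import re

# Permissive pattern from earlier in the thread.
position_re = re.compile(r"^(.*):(\d+)-(\d+)$")


def regex_accepts(ucsc):
    """Hypothetical helper: True if the regex matches."""
    return position_re.match(ucsc) is not None


def split_accepts(ucsc):
    """Hypothetical helper: True if split-and-int parsing succeeds."""
    try:
        refname, rest = ucsc.rsplit(":", maxsplit=1)
        start, end = rest.split("-", maxsplit=1)
        int(start)
        int(end)
    except ValueError:
        # Raised by failed unpacking or by int() on a non-numeric string.
        return False
    return True


inputs = ["chr1:101-200", "chr1: 101-200", "chr1:1_000-2_000", "chr1:101"]
comparison = {s: (regex_accepts(s), split_accepts(s)) for s in inputs}
for text, (by_regex, by_split) in comparison.items():
    print(f"{text!r}: regex={by_regex}, split={by_split}")
```

Both approaches accept the well-formed string and reject the one with no end coordinate, but only the splitting version accepts the inputs with a leading space or underscores, so it would need extra checks to be equally strict.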
pybedlite/overlap_detector.py
Outdated
@@ -55,6 +56,24 @@
from pybedlite.bed_source import BedSource
from pybedlite.bed_record import BedRecord

UCSC_STRAND_REGEX = re.compile(r".*\((\+|-)\)$")
Add typing here and below: typing.Pattern
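A minimal sketch of what the requested annotation could look like, using the constant from the diff above. Note that `typing.Pattern` has been a deprecated alias of `re.Pattern` since Python 3.9, so which spelling fits depends on the Python versions the project supports (an assumption not settled in this thread).

```python
import re
from typing import Pattern

# Annotated, precompiled module-level pattern (regex taken from the diff).
UCSC_STRAND_REGEX: Pattern[str] = re.compile(r".*\((\+|-)\)$")

# Group 1 captures the strand symbol when a "(+)" or "(-)" suffix is present.
stranded = UCSC_STRAND_REGEX.match("chr1:101-200(-)")
strand = stranded.group(1) if stranded else None
unstranded = UCSC_STRAND_REGEX.match("chr1:101-200")
print(strand, unstranded)
```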
# Check strand
assert Interval.from_ucsc("chr1:101-200(+)") == Interval("chr1", 100, 200, negative=False)
assert Interval.from_ucsc("chr1:101-200(-)") == Interval("chr1", 100, 200, negative=True)
Use @pytest.mark.parametrize?
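A sketch of how the two strand assertions might look when parametrized. To keep it runnable on its own, `Interval` and `from_ucsc` here are hypothetical stand-ins, not pybedlite's actual implementation; the test name is also an assumption.

```python
from typing import NamedTuple

import pytest


class Interval(NamedTuple):
    """Hypothetical stand-in for pybedlite's Interval."""

    refname: str
    start: int
    end: int
    negative: bool = False


def from_ucsc(ucsc: str) -> Interval:
    """Hypothetical stand-in for Interval.from_ucsc."""
    negative = ucsc.endswith("(-)")
    if ucsc.endswith(("(+)", "(-)")):
        ucsc = ucsc[:-3]  # drop the strand suffix
    refname, rest = ucsc.rsplit(":", maxsplit=1)
    start, end = rest.split("-", maxsplit=1)
    # Shift the 1-based start to 0-based, matching the expected values below.
    return Interval(refname, int(start) - 1, int(end), negative)


@pytest.mark.parametrize(
    "ucsc, expected",
    [
        ("chr1:101-200(+)", Interval("chr1", 100, 200, negative=False)),
        ("chr1:101-200(-)", Interval("chr1", 100, 200, negative=True)),
    ],
)
def test_from_ucsc_strand(ucsc: str, expected: Interval) -> None:
    assert from_ucsc(ucsc) == expected
```

Each tuple becomes its own test case, so a failure report names the exact input string that broke.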
5d34be5 to 2f0f545
Thanks @msto, this LGTM!
See function docstring for details