feat: Add Interval constructor from UCSC-formatted string #29

msto · 2024-04-04T08:24:29Z

See the function docstring for details.

nh13

Looking good. My main concern is that the parsing is too restrictive about contig names.

nh13 · 2024-04-04T14:28:29Z

pybedlite/overlap_detector.py

+            negative = False
+
+        # Then parse the location
+        position_re = re.compile(r"^(chr(\d+|X|Y|M|MT)(?:_[A-Za-z0-9]+_alt)?):(\d+)-(\d+)$")


ditto about pre-compiling

I am confused by this regex: why does it assume so much about the contig name? Have you thought deeply about non-human, alternate, decoy, or custom contig names? For example, chrUn_JTFH01001499v1_decoy doesn't parse, and neither does HLA-DRB1*15:01:01:02.
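To make the concern concrete, here is a small sketch that runs the pattern from the diff against the contig names mentioned in the comment. The coordinates (101-200) and the chr1_KI270762v1_alt example are illustrative assumptions, not values taken from the PR.

```python
import re

# Pattern copied from the diff under review.
POSITION_RE = re.compile(r"^(chr(\d+|X|Y|M|MT)(?:_[A-Za-z0-9]+_alt)?):(\d+)-(\d+)$")

# Coordinates are arbitrary; only the decoy and HLA names come from the review comment.
examples = [
    "chr1:101-200",
    "chr1_KI270762v1_alt:101-200",
    "chrUn_JTFH01001499v1_decoy:101-200",
    "HLA-DRB1*15:01:01:02:101-200",
]

# Map each input to whether the pattern accepts it.
results = {s: POSITION_RE.match(s) is not None for s in examples}

for s, ok in results.items():
    print(f"{s!r}: {'parses' if ok else 'does not parse'}")
```

The first two inputs are accepted; the decoy and HLA names are rejected because the pattern requires "chr" followed by a number, X, Y, M, or MT.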

Good points both.

Should we accept any string of characters followed by the start/end, i.e. ^(.*):(\d+)-(\d+)$? Or are there any constraints you think we should impose on the contig name?

Let's use the regex in the SAM spec? Alternatively, why not just split on the last colon?

If you're fine splitting on the last colon, the regex I suggested above is functionally equivalent (the greedy .* runs up to the last colon) and imo slightly more elegant, since every element (refname, start, and end) is captured in its own group.
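A minimal sketch of that equivalence, assuming well-formed input. The function names are hypothetical and the coordinates are made up; on malformed input the two differ only in how they fail (no match versus a ValueError).

```python
import re

GREEDY_RE = re.compile(r"^(.*):(\d+)-(\d+)$")


def parse_with_regex(s: str) -> tuple:
    """Parse using the permissive regex; each field is a capture group."""
    m = GREEDY_RE.match(s)
    if m is None:
        raise ValueError(f"Could not parse {s!r}")
    refname, start, end = m.groups()
    return refname, int(start), int(end)


def parse_with_rsplit(s: str) -> tuple:
    """Parse by splitting on the last colon, then on the dash."""
    refname, rest = s.rsplit(":", maxsplit=1)
    start, end = rest.split("-", maxsplit=1)
    return refname, int(start), int(end)


# A contig name that itself contains colons and a dash.
query = "HLA-DRB1*15:01:01:02:101-200"
print(parse_with_regex(query))
print(parse_with_rsplit(query))
```

Both calls return the same tuple, with the contig name kept whole up to the last colon.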

If you have a strong preference, I'll put in the SAM spec regex. It wouldn't be my preference, as it's somewhat gnarly and I don't think Python supports defining custom character classes to simplify it, the way [:rname:] is used in the spec:

[0-9A-Za-z!#$%&+./:;?@^_|~-][0-9A-Za-z!#$%&*+./:;=?@^_|~-]*
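For reference, one way the SAM pattern could be made less gnarly in Python is plain string composition, since the re module has no user-defined named classes. This is only a sketch: the constant names are hypothetical, and the character sets are transcribed from the pattern quoted above.

```python
import re

# Character sets transcribed from the pattern quoted above.
RNAME_FIRST = r"[0-9A-Za-z!#$%&+./:;?@^_|~-]"
RNAME_REST = r"[0-9A-Za-z!#$%&*+./:;=?@^_|~-]"
RNAME = f"{RNAME_FIRST}{RNAME_REST}*"

# The contig name may itself contain ':' and '-', so the engine relies on
# backtracking to find the final ":start-end" suffix.
UCSC_RE = re.compile(rf"^({RNAME}):(\d+)-(\d+)$")

match = UCSC_RE.match("HLA-DRB1*15:01:01:02:101-200")
if match is not None:
    refname, start, end = match.groups()
    print(refname, int(start), int(end))

# A name starting with '*' is rejected by the first-character set.
print(UCSC_RE.match("*chr1:101-200") is None)
```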

@nh13 bump 🙂

codecov · 2024-04-04T15:16:00Z

Codecov Report

Attention: Patch coverage is 92.30769%, with 2 lines in your changes missing coverage. Please review.

Project coverage is 91.81%. Comparing base (674567c) to head (9e9eb97).

Files	Patch %	Lines
pybedlite/overlap_detector.py	88.23%	1 Missing and 1 partial ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main      #29      +/-   ##
==========================================
+ Coverage   91.79%   91.81%   +0.02%     
==========================================
  Files           8        8              
  Lines         524      550      +26     
  Branches       92       95       +3     
==========================================
+ Hits          481      505      +24     
- Misses         26       27       +1     
- Partials       17       18       +1

nh13

What's the advantage of these regexes over some simple string splitting? I am thinking about maintenance, performance, etc.

strand = string[-2] if string.endswith(("(+)", "(-)")) else None
location = string[:-3] if strand else string
refname, rest = location.rsplit(":", maxsplit=1)
start, end = rest.split("-", maxsplit=1)
# UCSC coordinates are 1-based inclusive; Interval is 0-based half-open
return cls(refname=refname, start=int(start) - 1, end=int(end), negative=strand == "-", name=name)
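A self-contained sketch of the splitting approach, for anyone who wants to run it outside the class. ParsedInterval and parse_ucsc are hypothetical stand-ins (the PR returns an Interval), and the error handling is an assumption rather than something discussed in the thread. The 1-based to 0-based conversion follows the tests in this PR, where chr1:101-200 maps to start 100 and end 200.

```python
from typing import NamedTuple, Optional


class ParsedInterval(NamedTuple):
    """Hypothetical container; the PR itself builds an Interval."""

    refname: str
    start: int  # 0-based, inclusive
    end: int  # 0-based, exclusive
    negative: bool


def parse_ucsc(string: str) -> ParsedInterval:
    """Parse 'refname:start-end' with an optional '(+)' or '(-)' suffix."""
    strand: Optional[str] = None
    location = string
    # Peel off the optional strand suffix first.
    if string.endswith(("(+)", "(-)")):
        strand, location = string[-2], string[:-3]
    try:
        # Split on the LAST colon so contig names may contain colons.
        refname, rest = location.rsplit(":", maxsplit=1)
        start, end = rest.split("-", maxsplit=1)
        return ParsedInterval(refname, int(start) - 1, int(end), strand == "-")
    except ValueError as exc:
        raise ValueError(f"Not a valid UCSC position string: {string!r}") from exc


print(parse_ucsc("chr1:101-200(-)"))
print(parse_ucsc("HLA-DRB1*15:01:01:02:101-200"))
```

Unpacking failures and non-numeric coordinates both surface as ValueError, so callers have a single exception type to handle.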

nh13 · 2024-07-13T16:37:35Z

pybedlite/overlap_detector.py

@@ -55,6 +56,24 @@
 from pybedlite.bed_source import BedSource
 from pybedlite.bed_record import BedRecord

+UCSC_STRAND_REGEX = re.compile(r".*\((\+|-)\)$")


Add type annotations here and below (typing.Pattern).
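A sketch of what the annotation might look like. The extract_strand helper is hypothetical and exists only to show the typed constant in use; on Python 3.9+ re.Pattern[str] is an equivalent spelling.

```python
import re
from typing import Optional, Pattern

# Module-level, pre-compiled, and annotated as suggested in the review.
UCSC_STRAND_REGEX: Pattern[str] = re.compile(r".*\((\+|-)\)$")


def extract_strand(value: str) -> Optional[str]:
    """Return '+' or '-' if the string ends with a strand suffix, else None."""
    match = UCSC_STRAND_REGEX.match(value)
    return match.group(1) if match else None


print(extract_strand("chr1:101-200(-)"))
print(extract_strand("chr1:101-200"))
```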

nh13 · 2024-07-13T16:39:10Z

pybedlite/tests/test_overlap_detector.py

+
+    # Check strand
+    assert Interval.from_ucsc("chr1:101-200(+)") == Interval("chr1", 100, 200, negative=False)
+    assert Interval.from_ucsc("chr1:101-200(-)") == Interval("chr1", 100, 200, negative=True)


use @pytest.mark.parametrize?
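Roughly what a parametrized version of the strand checks could look like. The Interval class here is a reduced, hypothetical stand-in so the example runs without pybedlite, and the first case (no strand suffix) is an assumption based on the negative = False default in the diff.

```python
from dataclasses import dataclass

import pytest


@dataclass(frozen=True)
class Interval:
    """Hypothetical stand-in for pybedlite's Interval, reduced to four fields."""

    refname: str
    start: int
    end: int
    negative: bool = False

    @classmethod
    def from_ucsc(cls, string: str) -> "Interval":
        negative = string.endswith("(-)")
        if string.endswith(("(+)", "(-)")):
            string = string[:-3]
        refname, rest = string.rsplit(":", maxsplit=1)
        start, end = rest.split("-", maxsplit=1)
        # 1-based inclusive start becomes 0-based.
        return cls(refname, int(start) - 1, int(end), negative)


@pytest.mark.parametrize(
    "ucsc, expected",
    [
        ("chr1:101-200", Interval("chr1", 100, 200, negative=False)),
        ("chr1:101-200(+)", Interval("chr1", 100, 200, negative=False)),
        ("chr1:101-200(-)", Interval("chr1", 100, 200, negative=True)),
    ],
)
def test_from_ucsc_strand(ucsc: str, expected: Interval) -> None:
    # Each tuple above is collected and reported as its own test case.
    assert Interval.from_ucsc(ucsc) == expected
```

With this layout a failure names the exact input that broke, instead of stopping at the first failing assert in a shared test body.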

nh13

Thanks @msto, this LGTM!

msto requested review from jdidion, nh13 and clintval April 4, 2024 08:24

nh13 requested changes Apr 4, 2024

msto commented Apr 4, 2024

msto commented Apr 4, 2024

msto requested a review from nh13 April 4, 2024 15:14

nh13 requested changes Jul 13, 2024

msto added 6 commits July 15, 2024 08:53

feat: add Interval constructor from UCSC formatted string

39991bf

feat: permit strand

5abaf8a

fix: don't make assumptions about contig names

037fbdd

refactor: rename value and update strand extraction

a40b87f

refactor: string splitting

38561e0

test: refactor tests

2f0f545

msto force-pushed the ms_interval-from-string branch from 5d34be5 to 2f0f545 Compare July 15, 2024 12:53

msto had a problem deploying to github-action-ci July 15, 2024 12:54 — with GitHub Actions Failure

msto had a problem deploying to github-action-ci July 15, 2024 12:54 — with GitHub Actions Error

msto had a problem deploying to github-action-ci July 15, 2024 12:54 — with GitHub Actions Failure

refactor: cleanup

73be62b

msto temporarily deployed to github-action-ci July 15, 2024 12:57 — with GitHub Actions Inactive

msto requested a review from nh13 July 15, 2024 12:58

msto assigned nh13 Jul 15, 2024

test: add doc

4ebdba4

msto temporarily deployed to github-action-ci July 15, 2024 13:00 — with GitHub Actions Inactive

nh13 approved these changes Jul 15, 2024

msto merged commit 592fb8e into main Jul 15, 2024
4 checks passed

msto deleted the ms_interval-from-string branch July 15, 2024 14:39
