
Signal Documentation Coverage Endpoint #1584

Open · wants to merge 13 commits into base: dev
Conversation

@nolangormley (Contributor) commented Jan 27, 2025

Addresses issue #1583

Summary:

Create a new endpoint that returns a list of signals when given a geo. Other endpoints did this in the past, but the tables they read from no longer exist in epi v4. Starting from scratch will be much quicker and rely less on legacy code.

Once this is merged, I will also make issues to delete the old coverage code so that we don't have duplicate endpoints.

Prerequisites:

  • Unless it is a documentation hotfix it should be merged against the dev branch
  • Branch is up-to-date with the branch to be merged with, i.e. dev
  • Build is successful
  • Code is cleaned up and formatted

@nolangormley (Contributor, Author) commented:

Proposed SQL for this endpoint

CREATE TABLE coverage_crossref (
    signal_key_id bigint NOT NULL,
    geo_key_id bigint NOT NULL,
    min_time_value int NOT NULL,
    max_time_value int NOT NULL
)
SELECT
    el.signal_key_id,
    el.geo_key_id,
    MIN(el.time_value) as min_time_value,
    MAX(el.time_value) as max_time_value
FROM epimetric_latest el
GROUP BY el.signal_key_id, el.geo_key_id;

CREATE INDEX coverage_crossref_signal_key_id ON coverage_crossref (signal_key_id);
CREATE INDEX coverage_crossref_geo_key_id ON coverage_crossref (geo_key_id);


CREATE VIEW coverage_crossref_v AS
SELECT
    sd.source,
    sd.signal,
    gd.geo_type,
    gd.geo_value,
    cc.min_time_value,
    cc.max_time_value
FROM coverage_crossref cc
JOIN signal_dim sd USING (signal_key_id)
JOIN geo_dim gd USING (geo_key_id);
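The proposed table is just a pre-aggregation of `epimetric_latest`: one row per (signal, geo) pair with the first and last date seen. A toy sketch of that semantics, using an in-memory SQLite database with made-up key ids and dates (the real schema is MySQL):

```python
# Illustrates what each coverage_crossref row holds: the MIN/MAX
# time_value per (signal_key_id, geo_key_id) pair. Toy data, not real ids.
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE epimetric_latest (signal_key_id, geo_key_id, time_value)")
cur.executemany(
    "INSERT INTO epimetric_latest VALUES (?, ?, ?)",
    [(1, 10, 20200101), (1, 10, 20200301), (1, 11, 20200215)],
)
rows = cur.execute(
    """SELECT signal_key_id, geo_key_id,
              MIN(time_value) AS min_time_value,
              MAX(time_value) AS max_time_value
       FROM epimetric_latest
       GROUP BY signal_key_id, geo_key_id
       ORDER BY signal_key_id, geo_key_id"""
).fetchall()
print(rows)  # [(1, 10, 20200101, 20200301), (1, 11, 20200215, 20200215)]
```

The view then just joins these ids back to `signal_dim` and `geo_dim` to expose human-readable source/signal/geo names.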

@nolangormley nolangormley self-assigned this Jan 27, 2025
@nolangormley nolangormley changed the title first pass at coverage endpoint Signal Documentation Coverage Endpoint Jan 27, 2025
@nolangormley nolangormley marked this pull request as ready for review January 29, 2025 21:17
@nolangormley (Contributor, Author) commented:

Sonar is failing on a security flaw about using HTTP instead of HTTPS, but the flagged code is in a test; this is how it's done in other tests too.

@melange396 (Collaborator) left a comment:

this is good stuff, it's nice and tight! I wouldn't have even thought about going with QueryBuilder and a db VIEW, but together they work well in getting the job done.

I've got a number of smallish suggestions that you can hopefully just hit "accept" for, a couple that you can ignore if you want, and a couple for removing duplicate results. For a bigger change, you should add at least one more test case (which shouldn't be that complicated if you use CovidcastTestRow).

Comment on lines +576 to +581
el.signal_key_id,
el.geo_key_id,
MIN(el.time_value) as min_time_value,
MAX(el.time_value) as max_time_value
FROM covid.epimetric_latest el
GROUP BY el.signal_key_id, el.geo_key_id;

you can make this less busy without the table alias:

Suggested change
el.signal_key_id,
el.geo_key_id,
MIN(el.time_value) as min_time_value,
MAX(el.time_value) as max_time_value
FROM covid.epimetric_latest el
GROUP BY el.signal_key_id, el.geo_key_id;
signal_key_id,
geo_key_id,
MIN(time_value) AS min_time_value,
MAX(time_value) AS max_time_value
FROM covid.epimetric_latest
GROUP BY signal_key_id, geo_key_id;

@@ -561,3 +561,32 @@ def retrieve_covidcast_meta_cache(self):
for entry in cache:
cache_hash[(entry['data_source'], entry['signal'], entry['time_type'], entry['geo_type'])] = entry
return cache_hash

def compute_coverage_crossref(self):
"""Compute coverage_crossref table."""

Suggested change
"""Compute coverage_crossref table."""
"""Compute coverage_crossref table, for looking up available signals per geo or vice versa."""

Comment on lines +585 to +590
self._cursor.execute(coverage_crossref_delete_sql)
logger.info(f"coverage_crossref_delete_sql:{self._cursor.rowcount}")

self._cursor.execute(coverage_crossref_update_sql)
logger.info(f"coverage_crossref_update_sql:{self._cursor.rowcount}")
self.commit()

logging changes to make the value easier to extract and use in elastic, and another log message to get commit timing:

Suggested change
self._cursor.execute(coverage_crossref_delete_sql)
logger.info(f"coverage_crossref_delete_sql:{self._cursor.rowcount}")
self._cursor.execute(coverage_crossref_update_sql)
logger.info(f"coverage_crossref_update_sql:{self._cursor.rowcount}")
self.commit()
self._cursor.execute(coverage_crossref_delete_sql)
logger.info("coverage_crossref_delete", rows=self._cursor.rowcount)
self._cursor.execute(coverage_crossref_update_sql)
logger.info("coverage_crossref_update", rows=self._cursor.rowcount)
self.commit()
logger.info("coverage_crossref committed")
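The point of swapping `f"...:{rowcount}"` for keyword fields is that a structured logger emits each field separately, so the row count stays a queryable number instead of text that needs regex to recover. A toy illustration using plain `json` (the real code uses delphi's `get_structured_logger`, which this only approximates):

```python
# Contrast f-string interpolation with structured key-value fields.
# log_event is a hypothetical stand-in for a structured logger's output.
import json

def log_event(event, **fields):
    """Render one log line as JSON so tools like Elastic can index fields."""
    return json.dumps({"event": event, **fields}, sort_keys=True)

# f-string style: the count is fused into the message string.
flat = f"coverage_crossref_update_sql:{123}"
# structured style: the count is a separate, typed field.
structured = log_event("coverage_crossref_update", rows=123)
print(structured)  # {"event": "coverage_crossref_update", "rows": 123}
```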

@@ -78,6 +78,22 @@ def test_update_covidcast_meta_cache_query(self):
self.assertIn('timestamp', sql)
self.assertIn('epidata', sql)

def test_compute_coverage_crossref_query(self):

this test doesn't add a lot of value, but I can't see a good way to effectively unit test a method that's just a few static sql statements, so you can delete this if you want (but leaving it around doesn't really hurt either) ¯\_(ツ)_/¯

@@ -0,0 +1,87 @@
"""Integration tests for covidcast's metadata caching."""

Suggested change
"""Integration tests for covidcast's metadata caching."""
"""Integration tests for the covidcast `geo_coverage` endpoint."""


q.apply_geo_filters("geo_type", "geo_value", geo_sets)
q.set_sort_order("source", "signal")


this is important to add, otherwise the response will grow ~linearly with the number of geos requested.

if we were writing straight SQL, I think using SELECT DISTINCT would be preferable to GROUP BY for clarity, though they should be virtually equivalent... but I don't think we can do that without modifying the QueryBuilder class -- this should work while using existing QB features:

Suggested change
q.group_by = fields_string # this condenses duplicate results, similar to `SELECT DISTINCT`
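For intuition on why this works: grouping by every selected column collapses duplicate rows exactly like `SELECT DISTINCT` would. A quick sqlite3 check with made-up rows (toy table name, not the real schema):

```python
# Demonstrates GROUP BY over all selected columns == SELECT DISTINCT.
# Two rows share the same (source, signal); both queries return it once.
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE coverage (source, signal, geo_value)")
cur.executemany(
    "INSERT INTO coverage VALUES (?, ?, ?)",
    [("src", "sig", "pa"), ("src", "sig", "ny")],
)
distinct = cur.execute("SELECT DISTINCT source, signal FROM coverage").fetchall()
grouped = cur.execute(
    "SELECT source, signal FROM coverage GROUP BY source, signal"
).fetchall()
print(distinct == grouped, grouped)  # True [('src', 'sig')]
```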

Comment on lines +84 to +85
'epidata': [{'signal': 'sig', 'source': 'src'},
{'signal': 'sig', 'source': 'src'}],

Suggested change
'epidata': [{'signal': 'sig', 'source': 'src'},
{'signal': 'sig', 'source': 'src'}],
'epidata': [{'signal': 'sig', 'source': 'src'}],

I have a suggestion in src/server/endpoints/covidcast.py for doing a "group by" that should remove this duplication

# update the coverage crossref table
main()

results = self._make_request()

you should do an additional non-wildcard request (or two) as well



logger.info(
"Generated and updated covidcast metadata",

Suggested change
"Generated and updated covidcast metadata",
"Generated and updated covidcast geo/signal coverage",

Comment on lines +14 to +19
def main(database_impl=Database):
"""Updates the table for the `coverage_crossref`."""

logger = get_structured_logger("coverage_crossref_updater")
start_time = time.time()
database = database_impl()

you should remove this since the dependency injection doesn't actually get used elsewhere

Suggested change
def main(database_impl=Database):
"""Updates the table for the `coverage_crossref`."""
logger = get_structured_logger("coverage_crossref_updater")
start_time = time.time()
database = database_impl()
def main():
"""Updates the table for the `coverage_crossref`."""
logger = get_structured_logger("coverage_crossref_updater")
start_time = time.time()
database = Database()
