
Signal Documentation Coverage Endpoint #1584

Open · wants to merge 13 commits into base: dev
Conversation

@nolangormley (Contributor) commented Jan 27, 2025

Addresses issue #1583

Summary:

Create a new endpoint that returns a list of signals when given a geo. Other endpoints did this in the past, but the tables they read from no longer exist in epi v4. Starting from scratch will be much quicker and rely less on legacy code.

Once this is merged, I will also make issues to delete the old coverage code so that we don't have duplicate endpoints.

Prerequisites:

  • Unless it is a documentation hotfix it should be merged against the dev branch
  • Branch is up-to-date with the branch to be merged with, i.e. dev
  • Build is successful
  • Code is cleaned up and formatted

@nolangormley (Contributor, Author) commented:

Proposed SQL for this endpoint

CREATE TABLE coverage_crossref (
    signal_key_id bigint NOT NULL,
    geo_key_id bigint NOT NULL,
    min_time_value int NOT NULL,
    max_time_value int NOT NULL
)
SELECT
    el.signal_key_id,
    el.geo_key_id,
    MIN(el.time_value) as min_time_value,
    MAX(el.time_value) as max_time_value
FROM epimetric_latest el
GROUP BY el.signal_key_id, el.geo_key_id;

CREATE INDEX coverage_crossref_signal_key_id ON coverage_crossref (signal_key_id);
CREATE INDEX coverage_crossref_geo_key_id ON coverage_crossref (geo_key_id);


CREATE VIEW coverage_crossref_v AS
SELECT
    sd.source,
    sd.signal,
    gd.geo_type,
    gd.geo_value,
    cc.min_time_value,
    cc.max_time_value
FROM coverage_crossref cc
JOIN signal_dim sd USING (signal_key_id)
JOIN geo_dim gd USING (geo_key_id);
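The proposed table is just a pre-aggregation of `epimetric_latest`: one row per (signal, geo) pair with the first and last date seen. A toy sketch of that semantics, using an in-memory SQLite database with made-up key ids and dates (the real schema is MySQL):

```python
# Illustrates what each coverage_crossref row holds: the MIN/MAX
# time_value per (signal_key_id, geo_key_id) pair. Toy data, not real ids.
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE epimetric_latest (signal_key_id, geo_key_id, time_value)")
cur.executemany(
    "INSERT INTO epimetric_latest VALUES (?, ?, ?)",
    [(1, 10, 20200101), (1, 10, 20200301), (1, 11, 20200215)],
)
rows = cur.execute(
    """SELECT signal_key_id, geo_key_id,
              MIN(time_value) AS min_time_value,
              MAX(time_value) AS max_time_value
       FROM epimetric_latest
       GROUP BY signal_key_id, geo_key_id
       ORDER BY signal_key_id, geo_key_id"""
).fetchall()
print(rows)  # [(1, 10, 20200101, 20200301), (1, 11, 20200215, 20200215)]
```

The view then just joins these ids back to `signal_dim` and `geo_dim` to expose human-readable source/signal/geo names.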

@nolangormley nolangormley self-assigned this Jan 27, 2025
@nolangormley nolangormley changed the title first pass at coverage endpoint Signal Documentation Coverage Endpoint Jan 27, 2025
@nolangormley nolangormley marked this pull request as ready for review January 29, 2025 21:17
@nolangormley (Contributor, Author) commented:

Sonar is failing on a security flaw about using HTTP instead of HTTPS, but the flagged code is in a test; this is how it's done in other tests too.

@melange396 (Collaborator) left a comment:

this is good stuff, it's nice and tight! I wouldn't have even thought about going with QueryBuilder and a db VIEW, but together they work well in getting the job done.

I've got a number of smallish suggestions that you can hopefully just hit "accept" for, a couple that you can ignore if you want, and a couple for removing duplicate results. For a bigger change, you should add at least one more test case (which shouldn't be that complicated if you use CovidcastTestRow).

Comment on lines +576 to +581
el.signal_key_id,
el.geo_key_id,
MIN(el.time_value) as min_time_value,
MAX(el.time_value) as max_time_value
FROM covid.epimetric_latest el
GROUP BY el.signal_key_id, el.geo_key_id;

you can make this less busy without the table alias:

Suggested change
el.signal_key_id,
el.geo_key_id,
MIN(el.time_value) as min_time_value,
MAX(el.time_value) as max_time_value
FROM covid.epimetric_latest el
GROUP BY el.signal_key_id, el.geo_key_id;
signal_key_id,
geo_key_id,
MIN(time_value) AS min_time_value,
MAX(time_value) AS max_time_value
FROM covid.epimetric_latest
GROUP BY signal_key_id, geo_key_id;

@@ -561,3 +561,32 @@ def retrieve_covidcast_meta_cache(self):
for entry in cache:
cache_hash[(entry['data_source'], entry['signal'], entry['time_type'], entry['geo_type'])] = entry
return cache_hash

def compute_coverage_crossref(self):
"""Compute coverage_crossref table."""

Suggested change
"""Compute coverage_crossref table."""
"""Compute coverage_crossref table, for looking up available signals per geo or vice versa."""

Comment on lines +585 to +590
self._cursor.execute(coverage_crossref_delete_sql)
logger.info(f"coverage_crossref_delete_sql:{self._cursor.rowcount}")

self._cursor.execute(coverage_crossref_update_sql)
logger.info(f"coverage_crossref_update_sql:{self._cursor.rowcount}")
self.commit()

logging changes to make the value easier to extract and use in elastic, and another log message to get commit timing:

Suggested change
self._cursor.execute(coverage_crossref_delete_sql)
logger.info(f"coverage_crossref_delete_sql:{self._cursor.rowcount}")
self._cursor.execute(coverage_crossref_update_sql)
logger.info(f"coverage_crossref_update_sql:{self._cursor.rowcount}")
self.commit()
self._cursor.execute(coverage_crossref_delete_sql)
logger.info("coverage_crossref_delete", rows=self._cursor.rowcount)
self._cursor.execute(coverage_crossref_update_sql)
logger.info("coverage_crossref_update", rows=self._cursor.rowcount)
self.commit()
logger.info("coverage_crossref committed")
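The point of swapping `f"...:{rowcount}"` for keyword fields is that a structured logger emits each field separately, so the row count stays a queryable number instead of text that needs regex to recover. A toy illustration using plain `json` (the real code uses delphi's `get_structured_logger`, which this only approximates):

```python
# Contrast f-string interpolation with structured key-value fields.
# log_event is a hypothetical stand-in for a structured logger's output.
import json

def log_event(event, **fields):
    """Render one log line as JSON so tools like Elastic can index fields."""
    return json.dumps({"event": event, **fields}, sort_keys=True)

# f-string style: the count is fused into the message string.
flat = f"coverage_crossref_update_sql:{123}"
# structured style: the count is a separate, typed field.
structured = log_event("coverage_crossref_update", rows=123)
print(structured)  # {"event": "coverage_crossref_update", "rows": 123}
```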

@@ -78,6 +78,22 @@ def test_update_covidcast_meta_cache_query(self):
self.assertIn('timestamp', sql)
self.assertIn('epidata', sql)

def test_compute_coverage_crossref_query(self):

this test doesn't add a lot of value, but I can't see a good way to effectively unit test a method that's just a few static sql statements, so you can delete this if you want (but leaving it around doesn't really hurt either) ¯\_(ツ)_/¯

@@ -0,0 +1,87 @@
"""Integration tests for covidcast's metadata caching."""

Suggested change
"""Integration tests for covidcast's metadata caching."""
"""Integration tests for the covidcast `geo_coverage` endpoint."""


q.apply_geo_filters("geo_type", "geo_value", geo_sets)
q.set_sort_order("source", "signal")


this is important to add, otherwise the response will grow ~linearly with the number of geos requested.

if we were writing straight SQL, I think using SELECT DISTINCT would be preferable to GROUP BY for clarity, though they should be virtually equivalent... but I don't think we can do that without modifying the QueryBuilder class -- this should work while using existing QB features:

Suggested change
q.group_by = fields_string # this condenses duplicate results, similar to `SELECT DISTINCT`
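For intuition on why this works: grouping by every selected column collapses duplicate rows exactly like `SELECT DISTINCT` would. A quick sqlite3 check with made-up rows (toy table name, not the real schema):

```python
# Demonstrates GROUP BY over all selected columns == SELECT DISTINCT.
# Two rows share the same (source, signal); both queries return it once.
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE coverage (source, signal, geo_value)")
cur.executemany(
    "INSERT INTO coverage VALUES (?, ?, ?)",
    [("src", "sig", "pa"), ("src", "sig", "ny")],
)
distinct = cur.execute("SELECT DISTINCT source, signal FROM coverage").fetchall()
grouped = cur.execute(
    "SELECT source, signal FROM coverage GROUP BY source, signal"
).fetchall()
print(distinct == grouped, grouped)  # True [('src', 'sig')]
```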

Comment on lines +84 to +85
'epidata': [{'signal': 'sig', 'source': 'src'},
{'signal': 'sig', 'source': 'src'}],

Suggested change
'epidata': [{'signal': 'sig', 'source': 'src'},
{'signal': 'sig', 'source': 'src'}],
'epidata': [{'signal': 'sig', 'source': 'src'}],

I have a suggestion in src/server/endpoints/covidcast.py for doing a "group by" that should remove this duplication

# update the coverage crossref table
main()

results = self._make_request()

you should do an additional non-wildcard request (or two) as well



logger.info(
"Generated and updated covidcast metadata",

Suggested change
"Generated and updated covidcast metadata",
"Generated and updated covidcast geo/signal coverage",

Comment on lines +14 to +19
def main(database_impl=Database):
"""Updates the table for the `coverage_crossref`."""

logger = get_structured_logger("coverage_crossref_updater")
start_time = time.time()
database = database_impl()

you should remove this since the dependency injection doesn't actually get used elsewhere

Suggested change
def main(database_impl=Database):
"""Updates the table for the `coverage_crossref`."""
logger = get_structured_logger("coverage_crossref_updater")
start_time = time.time()
database = database_impl()
def main():
"""Updates the table for the `coverage_crossref`."""
logger = get_structured_logger("coverage_crossref_updater")
start_time = time.time()
database = Database()
