-
Notifications
You must be signed in to change notification settings - Fork 11
[WIP] [#604] Normalize organisations names #116
base: master
Are you sure you want to change the base?
[WIP] [#604] Normalize organisations names #116
Conversation
9d9841a
to
5f2ef1c
Compare
58e9ca7
to
5f2ef1c
Compare
So, played a bit with the failing build issue, here are some observations:
There must be a |
Just for the record, this discussion summarizes the issue we are facing here. Apparently the most elegant solution comes from the libs (fastcluster and pyhacrf) which shouldn't use numpy as requirement on their setup.py @nightsh both workarounds you've suggested work, and if we choose to go for any of them, I would also suggest to avoid the tox deps and add the custom command on the commands target, right before the tests itselves |
2c70917
to
5f2ef1c
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Apart from my comments, could you explain why we need a training file? I thought we would simply read the contents from the DB when re-generating the clusters.
processors/base/helpers/__init__.py
Outdated
""" | ||
CLUSTER_QUERY = "SELECT canonical " + \ | ||
"FROM organisation_clusters " + \ | ||
"WHERE '%s'=ANY(variations)" % organisation |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of binding the organisation
parameter via string interpolation, rely on SQLAlchemy for that. The query would become SELECT canonical FROM organisation_clusters WHERE '?'=ANY(variations)
and you'd call it as conn['warehouse'].query(CLUSTER_QUERY, organisation)
(if I remember correctly)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Resolved
processors/base/helpers/__init__.py
Outdated
logger.debug('Organisation "%s" normalized as "%s"', organisation, normalized_form) | ||
except StopIteration: | ||
logger.debug('Organisation "%s" not normalized', organisation) | ||
pass |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Forgot to remove the pass
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Resolved
cluster_ghent = { | ||
'canonical': 'Ghent University Hospital', | ||
'variations': ['Ghent University Hospital', 'Ghent University'] | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These variations feel like should be in different clusters, as they look like different organisations.
|
||
@pytest.mark.usefixtures('organisation_cluster') | ||
def test_organisation_normalizer(self, conn, test_input, expected): | ||
assert helpers.get_canonical_organisation_name(conn, test_input) == expected |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this test is doing too much. First of all, the pattern of the organisation_cluster
fixture is confusing, as it simply adds two fixtures to the DB without returning anything (which is probably the reason you've used @pytest.mark.usefixtures
instead of adding it as a parameter). There's also the issue that we're hardcoding the data we expect to be created by the fixture here. I think a fixture is the wrong abstraction here, at least with the very simple fixture tools we have.
Instead of doing this, please remove the organisation_cluster
fixture and create the data you need manually in the tests for the 3 cases you're testing. You could create a _create_organisation_cluster(self, conn, canonical_name, variations)
to help keep the code DRY.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Resolved. Please check if the current approach meets your suggestion
All code issues were resolved. We're still looking into how to train the clustering algorithm to avoid incorrect clusters |
opentrials/opentrials#604