This repository has been archived by the owner on Jan 29, 2022. It is now read-only.

[WIP] [#604] Normalize organisations names #116

Open: felipevieira wants to merge 24 commits into master from feature/organization_normalization


felipevieira
Contributor

@felipevieira felipevieira changed the title Normalize organisation names Normalize organisations names Feb 9, 2017
@felipevieira felipevieira changed the title Normalize organisations names [WIP] #604 Normalize organisations names Feb 9, 2017
@felipevieira felipevieira changed the title [WIP] #604 Normalize organisations names [WIP] [#604] Normalize organisations names Feb 9, 2017
@felipevieira felipevieira force-pushed the feature/organization_normalization branch 5 times, most recently from 9d9841a to 5f2ef1c Compare February 13, 2017 04:29
@felipevieira felipevieira force-pushed the feature/organization_normalization branch 3 times, most recently from 58e9ca7 to 5f2ef1c Compare February 13, 2017 20:42
@nightsh
Member

nightsh commented Feb 13, 2017

I played a bit with the failing build; here are some observations:

  • Invoked without the cache (pip install --no-cache-dir -r requirements.txt), the install fails at fastcluster because numpy cannot be found.
  • Calling pip install -r requirements.txt uses the pip cache, so once the issue has been fixed locally (see below), it no longer reproduces.
  • Installing numpy with pip beforehand seems to fix it (just numpy, for now).
  • Installing the requirements line by line also fixes it. There are two ways to do it:
    • iterate over package names in requirements.txt:

      $ for line in `grep -o '^[^#]\{1,\}' requirements.txt`; do pip install $line; done
    • iterate over requirements.in then do the remaining requirements.txt normally:

      $ while read p; do pip install --no-cache-dir $p; done < requirements.in
      $ pip install --no-cache-dir -r requirements.txt

There must be a pip gotcha I'm missing, though. Whoever finds the elegant fix for this (or the root cause, for that matter), please let me know 😉
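
For reference, the line-by-line workaround above can also be sketched in Python. This is only an illustration, not part of the PR: the function names `iter_requirements` and `install_one_by_one` are hypothetical, and the sketch assumes requirements.txt contains only plain requirement specifiers (no option lines such as -r or -e).

```python
import re
import subprocess
import sys

# Same comment rule as a requirements file: '#' at line start or after whitespace.
COMMENT_RE = re.compile(r'(^|\s+)#.*$')


def iter_requirements(lines):
    """Yield requirement specifiers, skipping blank lines and comments."""
    for raw in lines:
        line = COMMENT_RE.sub('', raw).strip()
        if line:
            yield line


def install_one_by_one(path, no_cache=True, dry_run=False):
    """Run one `pip install` per requirement so each build sees the earlier installs.

    Returns the list of commands; with dry_run=True nothing is executed.
    """
    with open(path) as handle:
        requirements = list(iter_requirements(handle))

    commands = []
    for requirement in requirements:
        command = [sys.executable, '-m', 'pip', 'install']
        if no_cache:
            command.append('--no-cache-dir')
        command.append(requirement)
        commands.append(command)
        if not dry_run:
            # Stops at the first failing package, like the shell loop would not.
            subprocess.check_call(command)
    return commands
```

Calling `install_one_by_one('requirements.txt', dry_run=True)` returns the commands without running them, which is handy for checking the order before touching the environment.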

@felipevieira
Contributor Author

Just for the record, this discussion summarizes the issue we are facing here. Apparently, the most elegant solution has to come from the libraries themselves (fastcluster and pyhacrf), which shouldn't use numpy as a requirement in their setup.py.

@nightsh both workarounds you've suggested work, and if we go with either of them, I would also suggest avoiding the tox deps and adding the custom command to the commands target, right before the tests themselves.

@felipevieira felipevieira force-pushed the feature/organization_normalization branch from 2c70917 to 5f2ef1c Compare February 14, 2017 14:50
Contributor

@vitorbaptista vitorbaptista left a comment


Apart from my comments, could you explain why we need a training file? I thought we would simply read the contents from the DB when re-generating the clusters.

"""
CLUSTER_QUERY = "SELECT canonical " + \
                "FROM organisation_clusters " + \
                "WHERE '%s'=ANY(variations)" % organisation
Contributor


Instead of binding the organisation parameter via string interpolation, rely on SQLAlchemy for that. The query would become SELECT canonical FROM organisation_clusters WHERE ?=ANY(variations), with the placeholder left unquoted so that it is bound as a parameter rather than read as the literal string '?', and you'd call it as conn['warehouse'].query(CLUSTER_QUERY, organisation) (if I remember correctly).
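
To illustrate why bound parameters are preferable to string interpolation, here is a minimal, self-contained sketch. It deliberately does not use the project's stack: it uses the standard-library sqlite3 module, and since SQLite has no array type, the variations live in a hypothetical organisation_variations table instead of the variations array column. The organisation names and function names are made up for the example.

```python
import sqlite3

# Bound-parameter version: the driver passes the value separately from the SQL.
CLUSTER_QUERY = (
    'SELECT canonical FROM organisation_variations WHERE variation = ?'
)


def build_demo_db():
    """Create an in-memory database with one hypothetical cluster."""
    conn = sqlite3.connect(':memory:')
    conn.execute(
        'CREATE TABLE organisation_variations (canonical TEXT, variation TEXT)'
    )
    conn.executemany(
        'INSERT INTO organisation_variations VALUES (?, ?)',
        [
            ("St. Mary's Hospital", "St. Mary's Hospital"),
            ("St. Mary's Hospital", "St. Mary's Hosp."),
        ],
    )
    return conn


def get_canonical(conn, organisation):
    """Look up the canonical name using a bound parameter."""
    row = conn.execute(CLUSTER_QUERY, (organisation,)).fetchone()
    return row[0] if row else None


def get_canonical_interpolated(conn, organisation):
    """Same lookup with string interpolation; breaks on names containing a quote."""
    query = (
        "SELECT canonical FROM organisation_variations "
        "WHERE variation = '%s'" % organisation
    )
    row = conn.execute(query).fetchone()
    return row[0] if row else None
```

With a name such as "St. Mary's Hosp.", the bound version finds the cluster, while the interpolated version produces malformed SQL because of the apostrophe. The same mechanism is what makes interpolated queries open to SQL injection.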

Contributor Author


Resolved

    logger.debug('Organisation "%s" normalized as "%s"', organisation, normalized_form)
except StopIteration:
    logger.debug('Organisation "%s" not normalized', organisation)
    pass
Contributor


Forgot to remove the pass

Contributor Author


Resolved

cluster_ghent = {
    'canonical': 'Ghent University Hospital',
    'variations': ['Ghent University Hospital', 'Ghent University']
}
Contributor


These variations feel like they should be in different clusters, as they look like different organisations.


@pytest.mark.usefixtures('organisation_cluster')
def test_organisation_normalizer(self, conn, test_input, expected):
    assert helpers.get_canonical_organisation_name(conn, test_input) == expected
Contributor


I think this test is doing too much. First of all, the pattern of the organisation_cluster fixture is confusing, as it simply adds two fixtures to the DB without returning anything (which is probably the reason you've used @pytest.mark.usefixtures instead of adding it as a parameter). There's also the issue that we're hardcoding the data we expect to be created by the fixture here. I think a fixture is the wrong abstraction here, at least with the very simple fixture tools we have.

Instead of doing this, please remove the organisation_cluster fixture and create the data you need manually in the tests for the 3 cases you're testing. You could create a _create_organisation_cluster(self, conn, canonical_name, variations) helper to keep the code DRY.
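
A rough sketch of what that could look like, under several assumptions: the warehouse is replaced by a plain in-memory list, `get_canonical_organisation_name` is a stand-in that returns the input unchanged when no cluster matches (the real helper may behave differently), and the three cases and organisation names are guesses made up for illustration.

```python
def get_canonical_organisation_name(conn, name):
    """Stand-in for helpers.get_canonical_organisation_name (assumed behaviour)."""
    for cluster in conn['warehouse']:
        if name in cluster['variations']:
            return cluster['canonical']
    # Assumption: names without a cluster are returned unchanged.
    return name


def make_conn():
    """Hypothetical replacement for the real `conn` fixture."""
    return {'warehouse': []}


class TestOrganisationNormalizer(object):
    def _create_organisation_cluster(self, conn, canonical_name, variations):
        """Insert one cluster so each test states the data it depends on."""
        cluster = {'canonical': canonical_name, 'variations': list(variations)}
        conn['warehouse'].append(cluster)
        return cluster

    def test_variation_is_normalized(self):
        conn = make_conn()
        self._create_organisation_cluster(
            conn, 'Acme Hospital', ['Acme Hospital', 'ACME Hosp.']
        )
        assert get_canonical_organisation_name(conn, 'ACME Hosp.') == 'Acme Hospital'

    def test_canonical_name_is_kept(self):
        conn = make_conn()
        self._create_organisation_cluster(
            conn, 'Acme Hospital', ['Acme Hospital', 'ACME Hosp.']
        )
        assert get_canonical_organisation_name(conn, 'Acme Hospital') == 'Acme Hospital'

    def test_unknown_name_is_returned_unchanged(self):
        conn = make_conn()
        assert get_canonical_organisation_name(conn, 'Unknown Org') == 'Unknown Org'
```

Each test now creates exactly the cluster it needs, so the expected values sit next to the data that produces them instead of being hardcoded against a shared fixture.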

Contributor Author


Resolved. Please check whether the current approach matches your suggestion.

@felipevieira
Contributor Author

All code issues have been resolved. We're still looking into how to train the clustering algorithm to avoid incorrect clusters.

5 participants