[WIP] [#604] Normalize organisations names #116

felipevieira · 2017-02-09T18:45:10Z

nightsh · 2017-02-13T21:29:43Z

So, I played a bit with the failing build issue. Here are some observations:

Invoked without the cache (pip install --no-cache-dir -r requirements.txt), it fails around fastcluster because numpy cannot be found.
Calling pip install -r requirements.txt uses the pip cache, so once the issue has been fixed locally (see below), it no longer reproduces.
Installing numpy with pip beforehand seems to fix it (just numpy, for now).

Installing line by line fixes it. Two ways to do it:

iterate over package names in requirements.txt:

$ for line in `grep -o '^[^#]\{1,\}' requirements.txt`; do pip install $line; done

iterate over requirements.in then do the remaining requirements.txt normally:

$ while read p; do pip install --no-cache-dir $p; done < requirements.in
$ pip install --no-cache-dir -r requirements.txt

There must be a pip gotcha I'm missing, though. Whoever finds the elegant fix for this (or the root cause, for that matter), please let me know 😉
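For illustration, the line-by-line workaround above could also be scripted in Python. This is a minimal sketch, not something from this PR: the function names (read_requirements, pip_commands, install_one_by_one) are hypothetical, and it assumes requirements.txt holds plain specifiers, one per line, with no option lines such as -r or -e.

```python
import re
import subprocess
import sys

# Same idea as pip's own comment handling: a '#' at line start or after whitespace.
COMMENT_RE = re.compile(r"(^|\s+)#.*$")


def read_requirements(lines):
    """Yield requirement specifiers, skipping blank lines and comments."""
    for line in lines:
        requirement = COMMENT_RE.sub("", line.strip()).strip()
        if requirement:
            yield requirement


def pip_commands(lines, no_cache=True):
    """Build one 'pip install' command per requirement, in file order."""
    base = [sys.executable, "-m", "pip", "install"]
    if no_cache:
        base.append("--no-cache-dir")
    return [base + [requirement] for requirement in read_requirements(lines)]


def install_one_by_one(path="requirements.txt"):
    """Install each requirement separately so earlier ones (e.g. numpy)
    are already present when later packages are built."""
    with open(path) as handle:
        for command in pip_commands(handle):
            subprocess.check_call(command)
```

Calling install_one_by_one() from the repository root would then mirror the first shell loop. It only helps if numpy appears before fastcluster and pyhacrf in the file.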

felipevieira · 2017-02-14T13:55:36Z

Just for the record, this discussion summarizes the issue we are facing here. Apparently the most elegant solution would have to come from the libs (fastcluster and pyhacrf), which shouldn't require numpy in their setup.py.

@nightsh both workarounds you suggested work. If we go with either of them, I would also suggest skipping the tox deps and adding the custom command to the commands target, right before the tests themselves.

vitorbaptista

Apart from my comments, could you explain why we need a training file? I thought we would simply read the contents from the DB when re-generating the clusters.

vitorbaptista · 2017-02-22T16:30:27Z

processors/base/helpers/__init__.py

+    """
+    CLUSTER_QUERY = "SELECT canonical " + \
+                    "FROM organisation_clusters " + \
+                    "WHERE '%s'=ANY(variations)" % organisation


Instead of binding the organisation parameter via string interpolation, rely on SQLAlchemy for that. The query would become SELECT canonical FROM organisation_clusters WHERE ? = ANY(variations) (with the placeholder left unquoted, since a quoted '?' is just a string literal), and you'd call it as conn['warehouse'].query(CLUSTER_QUERY, organisation) (if I remember correctly).
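To show why binding matters, here is a small self-contained sketch. It uses the standard-library sqlite3 module as a stand-in for the warehouse connection, and a hypothetical organisation_variations table with one row per variation (sqlite3 has no array columns), so neither the table nor the "?" placeholder style should be read as what this project's database layer expects.

```python
import sqlite3

# Hypothetical flattened layout: one row per (canonical, variation) pair.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE organisation_variations (canonical TEXT, variation TEXT)")
conn.executemany(
    "INSERT INTO organisation_variations (canonical, variation) VALUES (?, ?)",
    [
        ("St. Mary's Hospital", "St. Mary's Hospital"),
        ("St. Mary's Hospital", "St Mary's"),
    ],
)

CLUSTER_QUERY = "SELECT canonical FROM organisation_variations WHERE variation = ?"


def get_canonical_name(conn, organisation):
    """Look up the canonical name, passing the value as a bound parameter."""
    # The driver handles quoting, so apostrophes in names are safe.
    row = conn.execute(CLUSTER_QUERY, (organisation,)).fetchone()
    return row[0] if row else None


def get_canonical_name_interpolated(conn, organisation):
    """The pattern under review: the value is pasted into the SQL text."""
    query = "SELECT canonical FROM organisation_variations WHERE variation = '%s'" % organisation
    row = conn.execute(query).fetchone()
    return row[0] if row else None
```

With this setup, get_canonical_name(conn, "St Mary's") finds the row, while the interpolated version raises sqlite3.OperationalError because the apostrophe ends the string literal early.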

vitorbaptista · 2017-02-22T16:33:50Z

processors/base/helpers/__init__.py

+        logger.debug('Organisation "%s" normalized as "%s"', organisation, normalized_form)
+    except StopIteration:
+        logger.debug('Organisation "%s" not normalized', organisation)
+        pass


Forgot to remove the pass
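As a sketch of what the block looks like without it: an except body that already contains a statement does not need a trailing pass. The function name first_canonical, the rows argument, and the None return value are assumptions made for this example, not the PR's actual code.

```python
import logging

logger = logging.getLogger(__name__)


def first_canonical(rows, organisation):
    """Return the 'canonical' value of the first row, or None if rows is empty."""
    try:
        normalized_form = next(iter(rows))['canonical']
        logger.debug('Organisation "%s" normalized as "%s"', organisation, normalized_form)
        return normalized_form
    except StopIteration:
        # The logging call is a statement on its own, so no `pass` is required.
        logger.debug('Organisation "%s" not normalized', organisation)
        return None
```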

vitorbaptista · 2017-02-22T16:37:00Z

tests/fixtures/warehouse/organisation_clusters.py

+    cluster_ghent = {
+        'canonical': 'Ghent University Hospital',
+        'variations': ['Ghent University Hospital', 'Ghent University']
+    }


These variations feel like they should be in different clusters, as they look like different organisations.

vitorbaptista · 2017-02-22T16:46:17Z

tests/processors/base/helpers/test_helpers.py

+
+    @pytest.mark.usefixtures('organisation_cluster')
+    def test_organisation_normalizer(self, conn, test_input, expected):
+        assert helpers.get_canonical_organisation_name(conn, test_input) == expected


I think this test is doing too much. First of all, the pattern of the organisation_cluster fixture is confusing, as it simply adds two fixtures to the DB without returning anything (which is probably the reason you've used @pytest.mark.usefixtures instead of adding it as a parameter). There's also the issue that we're hardcoding the data we expect to be created by the fixture here. I think a fixture is the wrong abstraction here, at least with the very simple fixture tools we have.

Instead of doing this, please remove the organisation_cluster fixture and create the data you need manually in the tests for the 3 cases you're testing. You could create a _create_organisation_cluster(self, conn, canonical_name, variations) to help keep the code DRY.
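A rough sketch of that shape, under several assumptions: sqlite3 and the flattened organisation_variations table stand in for the real warehouse, _make_conn replaces the project's conn fixture, and the stand-in get_canonical_organisation_name returns unknown names unchanged. The three cases shown are guesses at what the tests cover.

```python
import sqlite3


def get_canonical_organisation_name(conn, organisation):
    """Stand-in for the helper under test (assumed behaviour, not the PR's code)."""
    row = conn.execute(
        "SELECT canonical FROM organisation_variations WHERE variation = ?",
        (organisation,),
    ).fetchone()
    # Assumption: names without a cluster are returned unchanged.
    return row[0] if row else organisation


class TestOrganisationNormalizer(object):
    def _make_conn(self):
        # Hypothetical replacement for the project's `conn` fixture.
        conn = sqlite3.connect(":memory:")
        conn.execute(
            "CREATE TABLE organisation_variations (canonical TEXT, variation TEXT)"
        )
        return conn

    def _create_organisation_cluster(self, conn, canonical_name, variations):
        # Each test creates exactly the rows it relies on.
        conn.executemany(
            "INSERT INTO organisation_variations (canonical, variation) VALUES (?, ?)",
            [(canonical_name, variation) for variation in variations],
        )

    def test_variation_is_normalized(self):
        conn = self._make_conn()
        self._create_organisation_cluster(
            conn, "Ghent University Hospital",
            ["Ghent University Hospital", "Ghent Univ. Hospital"],
        )
        assert get_canonical_organisation_name(
            conn, "Ghent Univ. Hospital") == "Ghent University Hospital"

    def test_canonical_name_is_kept(self):
        conn = self._make_conn()
        self._create_organisation_cluster(
            conn, "Ghent University Hospital", ["Ghent University Hospital"],
        )
        assert get_canonical_organisation_name(
            conn, "Ghent University Hospital") == "Ghent University Hospital"

    def test_unknown_name_is_unchanged(self):
        conn = self._make_conn()
        assert get_canonical_organisation_name(conn, "Unknown Org") == "Unknown Org"
```

The point is that the expected values sit next to the data that produces them, instead of depending on what a shared fixture happened to insert.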

felipevieira · 2017-02-23

Resolved. Please check whether the current approach matches your suggestion.

felipevieira · 2017-02-23T18:56:42Z

All code issues were resolved. We're still looking into how to train the clustering algorithm to avoid incorrect clusters

felipevieira and others added 16 commits January 31, 2017 17:42

Building orgs clusters and suggesting normalized entries

354fea9

Saving cluster in warehouse to optimize further access

3d37b6f

Update organisations clustering training file

28d491d

Add unit tests to the organization normalization feature

ebe9c25

Remove unused constants on trial processor

76a86d0

Add unit tests set

5f56433

Add org cluster fixture to the trial test set

257e66e

Increase code quality by following some lint rules

54b0ec2

Create new cluster updater processor

19e77ad

Merge branch 'master' into feature/organization_normalization

b997a7c

Add raven depedency

0669314

Add unidecode depedency

160a996

Cleanup code and add proper documentation

57e5b9e

Remove unused logger

ee8ebda

Avoiding anonymous variable

c07a0bf

Add docs to organisation tests

81a44c5

roll added the 4. Ready for Review label Feb 9, 2017

felipevieira changed the title ~~Normalize organisation names~~ Normalize organisations names Feb 9, 2017

felipevieira changed the title ~~Normalize organisations names~~ [WIP] #604 Normalize organisations names Feb 9, 2017

felipevieira changed the title ~~[WIP] #604 Normalize organisations names~~ [WIP] [#604] Normalize organisations names Feb 9, 2017

felipevieira added 5 commits February 9, 2017 16:30

Update schema files

f50a992

Revert api database schema

c2835f7

Alter numpy requirement installation

a1ef975

Fix location test

687ba35

Organize requirements file

5f2ef1c

felipevieira force-pushed the feature/organization_normalization branch 5 times, most recently from 9d9841a to 5f2ef1c Compare February 13, 2017 04:29

felipevieira force-pushed the feature/organization_normalization branch 3 times, most recently from 58e9ca7 to 5f2ef1c Compare February 13, 2017 20:42

felipevieira added 2 commits February 14, 2017 11:42

Avoid to install all depedencies on tox deps

1d452c8

Add comment explaining why tox's not installing requirements on deps

25c2618

felipevieira force-pushed the feature/organization_normalization branch from 2c70917 to 5f2ef1c Compare February 14, 2017 14:50

vitorbaptista suggested changes Feb 22, 2017


Refact code

129f16b
