Skip to content
This repository has been archived by the owner on Dec 14, 2023. It is now read-only.

Add sample ML-based topic modeling support #170

Open
wants to merge 107 commits into
base: master
Choose a base branch
from
Open
Changes from 1 commit
Commits
Show all changes
107 commits
Select commit Hold shift + click to select a range
e24f3b7
Create token_pool.py
DonggeLiu Jun 29, 2017
9535b81
added the file created last time
DonggeLiu Jul 3, 2017
934da4b
Merge branch 'master' of github.com:berkmancenter/mediacloud into top…
DonggeLiu Jul 3, 2017
2a8a0f2
1. Two LDA model (with different package, not sure which one is bette…
DonggeLiu Jul 3, 2017
e888805
Merge branch 'topic_modelling' of github.com:berkmancenter/mediacloud…
DonggeLiu Jul 3, 2017
a23aa13
Merge branch 'master' of github.com:berkmancenter/mediacloud into top…
DonggeLiu Jul 3, 2017
bc462ba
General
DonggeLiu Jul 10, 2017
83a31a7
1. Define types for parameters and return values
DonggeLiu Jul 11, 2017
ced8bb4
Merge branch 'master' of github.com:berkmancenter/mediacloud into top…
DonggeLiu Jul 11, 2017
943c696
isolate import gensim to see if it causes failure #3839
DonggeLiu Jul 17, 2017
3db49ee
verifying the reason of errors
DonggeLiu Jul 17, 2017
06d1d37
reformat the output of model_gensim to make it in the same format as …
DonggeLiu Jul 17, 2017
e027dad
1. updated tests according to the changes I made in model_gensim.py
DonggeLiu Jul 17, 2017
336c0d8
added tests for model_lda.py
DonggeLiu Jul 17, 2017
178226b
trying to fix the 'module' object has no attribute 'plugin' problem
DonggeLiu Jul 18, 2017
ebc4715
reference topic_model module with full path
DonggeLiu Jul 18, 2017
39c5e8c
Merge branch 'master' into topic_modelling
pypt Jul 20, 2017
716fe91
added the requirement for sklearn, which supports the NMF algorithm
DonggeLiu Jul 24, 2017
f66ead6
Added msg for each assertion
DonggeLiu Jul 24, 2017
2d6c12d
added msg for each assertion
DonggeLiu Jul 24, 2017
6c50ed2
added model_nmf.py to model topics with the NMF algorithm
DonggeLiu Jul 24, 2017
679fef0
test cases for model_nmf.py
DonggeLiu Jul 24, 2017
3ab2124
Merge branch 'master' of github.com:berkmancenter/mediacloud into top…
DonggeLiu Jul 24, 2017
025dece
Merge branch 'topic_modelling' of github.com:berkmancenter/mediacloud…
DonggeLiu Jul 24, 2017
61517d1
sorted requirements.txt in alphabetical order
DonggeLiu Jul 24, 2017
36817b9
cache WordNet
DonggeLiu Jul 24, 2017
b5562ad
install the WordNet via NLTK
DonggeLiu Jul 24, 2017
e6b126c
relocate test files
DonggeLiu Jul 24, 2017
c93fe63
remove uncessary files after test suits relocation
DonggeLiu Jul 24, 2017
730a4e9
1. removed josn serialization after fetching sentences from database
DonggeLiu Jul 24, 2017
3b38dff
add .close to open file
DonggeLiu Jul 24, 2017
154f96d
add .close() to opened file
DonggeLiu Jul 24, 2017
5ea449a
suppress warning message caused by NLTK built-in method lemmatize()
DonggeLiu Jul 24, 2017
34fdcbc
restore the file (its content was mysteriously deleted)
DonggeLiu Jul 24, 2017
baca56c
removed path_helper.py and related codes
DonggeLiu Jul 24, 2017
fe78de8
add a file containing sample stories (can replace DB in tests)
DonggeLiu Jul 24, 2017
91d725e
1. Change the SQL query to be the same as suggested in previous PR re…
DonggeLiu Jul 24, 2017
0ca1eca
Seperated test cases for three models from db_connection
DonggeLiu Jul 24, 2017
dc0b73b
added explanation for each of the three modules used
DonggeLiu Jul 24, 2017
96f566c
removed redundant textblob in requirements
DonggeLiu Jul 24, 2017
c488c08
separate test_token_pool.py from database
DonggeLiu Jul 24, 2017
6d8555e
remove import path_helper
DonggeLiu Jul 26, 2017
6182c4f
Rearraged NLTK installation to make it system-wide
DonggeLiu Jul 26, 2017
9c68669
Use wget instead of nltk.download() to avoid 405 error
DonggeLiu Jul 26, 2017
0e04ff1
silent wget
DonggeLiu Jul 26, 2017
d995cb8
adding more echos and comments
DonggeLiu Jul 26, 2017
a361b01
turn on -n switch of unzip gh-pages.zip, preventing rewrite existing …
DonggeLiu Jul 27, 2017
db1c584
added COMMAND_PREFIX to use sudo on linux
DonggeLiu Jul 27, 2017
2a88eab
restore missing log4perl.conf
DonggeLiu Jul 27, 2017
b62e71d
Don't --force-reinstall stuff needlessly
pypt Jul 27, 2017
7922d3c
Install only WordNet data from NLTK data
pypt Jul 27, 2017
7ce27cc
Revert "added COMMAND_PREFIX to use sudo on linux"
pypt Jul 27, 2017
29d460c
Revert "turn on -n switch of unzip gh-pages.zip, preventing rewrite e…
pypt Jul 27, 2017
4008366
Revert "adding more echos and comments"
pypt Jul 27, 2017
c1da604
Revert "silent wget"
pypt Jul 27, 2017
7b6beaf
Revert "Use wget instead of nltk.download() to avoid 405 error"
pypt Jul 27, 2017
bf2c962
Install NLTK data from own mirror on S3
pypt Jul 27, 2017
482f01e
Install only WordNet data from NLTK data
pypt Jul 27, 2017
00633aa
Don't --force-reinstall stuff needlessly
pypt Jul 27, 2017
6f09e31
added punkt into nltk dependencies
DonggeLiu Aug 1, 2017
179da05
use sample handler to separate access to sample file from others
DonggeLiu Aug 7, 2017
1cf5601
1. make use of sample_handler.py to access sample file
DonggeLiu Aug 7, 2017
1d3ad5e
Merge branch 'master' of github.com:berkmancenter/mediacloud into top…
DonggeLiu Aug 7, 2017
81d6892
use full path of sample_handler.py
DonggeLiu Aug 7, 2017
8861d9e
Temporarily disable unit tests for Travis to cache dependencies
pypt Aug 8, 2017
c732a50
Revert "cache WordNet"
pypt Aug 8, 2017
65c505b
Revert "Temporarily disable unit tests for Travis to cache dependencies"
pypt Aug 8, 2017
73f7e2e
added a new abstract method for topic model classes to evaluate curre…
DonggeLiu Aug 9, 2017
ef35923
unify the name of models used in each class to self._model as in the …
DonggeLiu Aug 9, 2017
89882cd
implement the evaluation method based on the buit-in method likelihood()
DonggeLiu Aug 9, 2017
73e518c
Merge branch 'master' of github.com:berkmancenter/mediacloud into top…
DonggeLiu Aug 9, 2017
e2d6655
use the sample file instead of DB in Travis
DonggeLiu Aug 9, 2017
5289a85
Merge branch 'topic_modelling' of github.com:berkmancenter/mediacloud…
DonggeLiu Aug 9, 2017
00831af
edit the total number of topics
DonggeLiu Aug 9, 2017
59bcb50
Merge branch 'master' of github.com:berkmancenter/mediacloud into top…
DonggeLiu Aug 12, 2017
2c8e6eb
added tuning steps to find out the optimal topic number
DonggeLiu Aug 12, 2017
d1129a6
a finder that can identify the max/min points of a polynomial compute…
DonggeLiu Aug 13, 2017
4d5b9e4
added two methods tune_*() to find out the optimal number of topics
DonggeLiu Aug 13, 2017
8e77ed4
removed some print()s and rewrote evaluation()
DonggeLiu Aug 14, 2017
809aad7
added more test cases on checking the accuracy of the model via likel…
DonggeLiu Aug 14, 2017
f819366
improved polynomial tuning algorithm
DonggeLiu Aug 19, 2017
9869ca8
no longer test tune_with_iteration as polynomial has a sigificant bet…
DonggeLiu Aug 19, 2017
e185dd0
larger sample for Travis to test against
DonggeLiu Aug 19, 2017
3545e0e
modify tests accroding to change in sample_stories.txt
DonggeLiu Aug 19, 2017
7816ec8
use smaller sample size so that Travis will not fail
DonggeLiu Aug 20, 2017
94ebc24
do not test limit if limit is not specified
DonggeLiu Aug 20, 2017
c1c257e
improved tune with polynomial algorithm
DonggeLiu Aug 20, 2017
6d09265
removed uncessary tune_with_iteration as its advantage/feature has be…
DonggeLiu Aug 20, 2017
2479107
fixed the algorithm of optimal point finder
DonggeLiu Aug 20, 2017
51dd0ec
removed useless codes
DonggeLiu Aug 20, 2017
620afb4
Merge branch 'master' of github.com:berkmancenter/mediacloud into top…
DonggeLiu Aug 20, 2017
5ead4f2
Disable unit tests temporarily for Travis to have a chance to compile…
DonggeLiu Aug 20, 2017
0fb4e4a
Cache WordNet of NLTK
DonggeLiu Aug 20, 2017
87efd01
set test cases back
DonggeLiu Aug 20, 2017
6ea203b
revert the changes made on .travis.yml
DonggeLiu Aug 20, 2017
b675559
added more story samples
DonggeLiu Aug 21, 2017
8753442
new commits from git pull origin master
DonggeLiu Aug 21, 2017
e39415b
removed unnecessary code to keep higher level of accuracy
DonggeLiu Aug 21, 2017
a674d26
changed sample file name
DonggeLiu Aug 21, 2017
6267f72
this sample file has been replaced by 3 files with different size
DonggeLiu Aug 21, 2017
d4e9d48
use a smaller sample to test on Travis due to limit restriction
DonggeLiu Aug 21, 2017
0c3f7ee
1. break large block of codes up to more funcitons
DonggeLiu Aug 21, 2017
4c12748
remove uncessary code
DonggeLiu Aug 21, 2017
720dd7a
restructured tests to reduce running time
DonggeLiu Aug 21, 2017
97afc48
further improvements on the code structure
DonggeLiu Aug 22, 2017
016d01c
remove redudent code
DonggeLiu Aug 22, 2017
9ff15ff
Merge branch 'master' into topic_modelling
pypt Sep 1, 2017
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Prev Previous commit
Next Next commit
no longer test tune_with_iteration as polynomial has a sigificant bet…
…ter efficiency and performance

I will combine these two later
  • Loading branch information
DonggeLiu committed Aug 19, 2017
commit 9869ca88e72b00844f33aa96d1ba6b9f070efcf3
48 changes: 24 additions & 24 deletions mediacloud/mediawords/util/topic_modeling/test_model_lda.py
Original file line number Diff line number Diff line change
@@ -21,18 +21,18 @@ def setUp(self):
self.OFFSET = 1
# token_pool = TokenPool(connect_to_db())
token_pool = TokenPool(SampleHandler())
# self._story_tokens = token_pool.output_tokens(limit=self.LIMIT, offset=self.OFFSET)

self._story_tokens = token_pool.output_tokens()
self._flat_story_tokens = self._flatten_story_tokens()
self._lda_model = ModelLDA()
self._lda_model.add_stories(self._story_tokens)
self._optimal_topic_num_poly = self._lda_model.tune_with_polynomial()
self._optimal_topic_num_iter = self._lda_model.tune_with_iteration()
# self._optimal_topic_num_iter = self._lda_model.tune_with_iteration()

self._topics_via_poly \
= self._lda_model.summarize_topic(total_topic_num=self._optimal_topic_num_poly)
self._topics_via_iter \
= self._lda_model.summarize_topic(total_topic_num=self._optimal_topic_num_iter)
# self._topics_via_iter \
# = self._lda_model.summarize_topic(total_topic_num=self._optimal_topic_num_iter)

logging.getLogger("lda").setLevel(logging.WARNING)
logging.getLogger("gensim").setLevel(logging.WARNING)
@@ -54,8 +54,8 @@ def test_one_to_one_relationship(self):
"""
Pass topics generated by both methods to _check_one_to_one_relationship()
"""
# self._check_one_to_one_relationship(topics=self._topics_via_iter)
self._check_one_to_one_relationship(topics=self._topics_via_poly)
self._check_one_to_one_relationship(topics=self._topics_via_iter)

def _check_one_to_one_relationship(self, topics: Dict[int, List]):
"""
@@ -77,12 +77,12 @@ def _check_one_to_one_relationship(self, topics: Dict[int, List]):
expr=(article_id in topic_ids),
msg="Missing article id: {}".format(article_id))

def test_story_contains_topic_word(self):
"""
Pass topics generated by both methods to _check_story_contains_topic_word()
"""
self._check_story_contains_topic_word(topics=self._topics_via_poly)
self._check_story_contains_topic_word(topics=self._topics_via_iter)
# def test_story_contains_topic_word(self):
# """
# Pass topics generated by both methods to _check_story_contains_topic_word()
# """
# self._check_story_contains_topic_word(topics=self._topics_via_poly)
# self._check_story_contains_topic_word(topics=self._topics_via_iter)

def _check_story_contains_topic_word(self, topics: Dict[int, List]):
"""
@@ -110,8 +110,8 @@ def test_default_topic_params(self):
"""
Pass topics generated by both methods to _check_default_topic_params()
"""
# self._check_default_topic_params(topics=self._topics_via_iter)
self._check_default_topic_params(topics=self._topics_via_poly)
self._check_default_topic_params(topics=self._topics_via_iter)

def _check_default_topic_params(self, topics: Dict[int, List[str]]):
"""
@@ -125,14 +125,14 @@ def _check_default_topic_params(self, topics: Dict[int, List[str]]):
.format(default_word_num, len(topics), topics))

def test_highest_likelihood(self):
self._check_highest_likelihood(num=self._optimal_topic_num_iter, name="Iteration")
# self._check_highest_likelihood(num=self._optimal_topic_num_iter, name="Iteration")
self._check_highest_likelihood(num=self._optimal_topic_num_poly, name="Polynomial")

def _check_highest_likelihood(self, num: int, name: str):
"""
Test if the result is the most accurate one
"""
optimal_likelihood = self._lda_model.evaluate()[1]
optimal_likelihood = self._lda_model.evaluate(topic_num=num)[1]
other_nums = [0, 1, num-1, num+1, num*2]

for other_num in other_nums:
@@ -146,16 +146,16 @@ def _check_highest_likelihood(self, num: int, name: str):
msg="Topic num {} has a better likelihood {} than {} with {}:{}"
.format(other_num, other_likelihood, name, num, optimal_likelihood))

def test_equal_likelihood(self):
"""
The likelihood of both methods should be the same (i.e. the max),
However, the total topic nums do not have to be the same
"""
unittest.TestCase.assertEqual(
self=self, first=self._topics_via_iter, second=self._topics_via_poly,
msg="Iter: {}\nPoly: {}"
.format(self._lda_model.evaluate(topic_num=self._optimal_topic_num_iter)[1],
self._lda_model.evaluate(topic_num=self._optimal_topic_num_poly)[1]))
# def test_equal_likelihood(self):
# """
# The likelihood of both methods should be the same (i.e. the max),
# However, the total topic nums do not have to be the same
# """
# unittest.TestCase.assertEqual(
# self=self, first=self._topics_via_iter, second=self._topics_via_poly,
# msg="Iter: {}\nPoly: {}"
# .format(self._lda_model.evaluate(topic_num=self._optimal_topic_num_iter)[1],
# self._lda_model.evaluate(topic_num=self._optimal_topic_num_poly)[1]))


if __name__ == '__main__':