A program to mimic the behavior of https://twitter.com/anagramatron
- a file which parses input tweet data -- anagrams.py
- a file to make dummy tweet data given a dictionary -- make_tweet_json.py
- a small dictionary of the 1,000 most common words -- common_word_dictionary.txt
- a large dictionary of ~170,000 words -- dictionary.txt
- these are for passing to make_tweet_json.py to construct tweet data
- a small .json file of sample tweet data -- tweets.json
- the command-line entry points (make-tweet-data and find-anagrams); added to your PATH by
pip install
-
Clone repository:
git clone https://github.com/killakam3084/anagrams.git
-
Install the anagrams package:
cd anagrams/ && pip install .
-
Make sample tweet JSON:
make-tweet-data -l 3 -n 100000 tweets.json dictionary.txt
usage: make-tweet-data [-h] [-l, --word-length WORD_LENGTH]
                       [-n, --number-of-tweets NUM_TWEETS]
                       output_file dictionary_file

A program to make dummy tweet json for anagrams

positional arguments:
  output_file           An output file to write tweet data to.
  dictionary_file       The dictionary to construct anagrams from.

optional arguments:
  -h, --help            show this help message and exit
  -l, --word-length WORD_LENGTH
                        The max character count for a word.
  -n, --number-of-tweets NUM_TWEETS
                        The number of tweets to construct.
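For reference, generating dummy tweets from a dictionary can be as simple as the sketch below. The two-word tweets and the record layout (an "id" and a "text" field inside a top-level list) are assumptions for illustration, not necessarily what make_tweet_json.py writes.

import json
import random

def make_tweets(dictionary_file, output_file, word_length=3, num_tweets=100):
    # Keep only words no longer than the requested maximum length.
    with open(dictionary_file) as f:
        words = [w.strip() for w in f if 0 < len(w.strip()) <= word_length]
    # Each dummy "tweet" is two random dictionary words; the field names
    # ("id", "text") are assumed here for illustration only.
    tweets = [{"id": i, "text": " ".join(random.sample(words, 2))}
              for i in range(num_tweets)]
    with open(output_file, "w") as f:
        json.dump(tweets, f)

make_tweets("dictionary.txt", "tweets.json", word_length=3, num_tweets=100)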
-
Run the anagram finder:
find-anagrams tweets.json
usage: find-anagrams [-h] input_file

A vanilla program to detect anagrams in tweet json

positional arguments:
  input_file  The input file of tweets.

optional arguments:
  -h, --help  show this help message and exit

(Certainly this isn't how Anagramatron does it.)
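As a rough illustration of the idea (not necessarily how anagrams.py implements it), anagrams can be grouped by using each tweet's sorted letters as a signature:

from collections import defaultdict

def group_anagrams(texts):
    # Texts whose lowercase letters sort to the same string are anagrams.
    groups = defaultdict(list)
    for text in texts:
        signature = "".join(sorted(c for c in text.lower() if c.isalpha()))
        groups[signature].append(text)
    # Only signatures shared by more than one text are interesting.
    return [group for group in groups.values() if len(group) > 1]

print(group_anagrams(["listen up", "silent up", "hello world"]))
# -> [['listen up', 'silent up']]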
- ijson is an iterative JSON parser with a standard Python iterator interface
- ijson provides several implementations of the actual parsing in the form of backends located in ijson/backends (for example, a pure-Python backend and faster yajl-based backends)
- The script should output memory_profiler data showing the effectiveness of ijson's lazy loading (see the combined sketch after these notes)
- Processed 50,000,000 two-word tweets with words of length 4 in roughly 2 hours; the tweet JSON file was about 3.5 GB
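A minimal sketch of the two points above: streaming tweets with ijson while memory_profiler reports line-by-line memory usage. It assumes tweets.json is a top-level JSON array of objects (the same assumed layout as the generator sketch) and uses whatever ijson backend is installed by default.

import ijson
from memory_profiler import profile

@profile  # prints a line-by-line memory usage report when the function runs
def count_tweets(path):
    count = 0
    with open(path, "rb") as f:
        # 'item' yields each element of a top-level JSON array lazily,
        # so the whole file is never held in memory at once.
        for _ in ijson.items(f, "item"):
            count += 1
    return count

if __name__ == "__main__":
    print(count_tweets("tweets.json"))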