Python snippets for educational purposes
cat text.txt | python statistics/charcount.py > charcount.out
Time it!
time cat text.txt | python statistics/charcount.py > charcount.out
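The source of charcount.py is not shown here, so the following is only a minimal sketch of what a character counter along these lines might look like. The function names and the tab-separated, most-frequent-first output format are assumptions, not the script's confirmed behaviour.

```python
import sys
from collections import Counter


def count_chars(lines):
    """Count every character in an iterable of lines, ignoring the newline."""
    counts = Counter()
    for line in lines:
        counts.update(line.rstrip("\n"))
    return counts


def write_counts(counts, out=sys.stdout):
    # Assumed output format: "character<TAB>count", most frequent first.
    for char, count in counts.most_common():
        out.write(f"{char}\t{count}\n")


# Small in-memory demo; to use it as a filter, pass sys.stdin instead:
#     write_counts(count_chars(sys.stdin))
write_counts(count_chars(["alma\n", "fa\n"]))
```

Counter does the bookkeeping in a single pass, so the memory use depends on the number of distinct characters rather than on the size of the input.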
The same as charcount.py,
plus it prints a log message at every 100,000th line processed.
If an integer argument N is supplied, it only outputs characters that occur at least N times.
This only affects the output; it does not reduce the script's running time.
time cat text.txt | python statistics/charcount.py 10 > charcount_fancy.out
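A hedged sketch of the two extras described above, progress logging and the minimum-count filter. The names count_chars_logged and filter_counts are hypothetical, and writing the log to stderr is an assumption (it keeps the messages out of the redirected output file).

```python
import sys
from collections import Counter

LOG_EVERY = 100000  # log interval taken from the description above


def count_chars_logged(lines, log=sys.stderr, log_every=LOG_EVERY):
    """Count characters and report progress every log_every lines."""
    counts = Counter()
    for lineno, line in enumerate(lines, start=1):
        counts.update(line.rstrip("\n"))
        if lineno % log_every == 0:
            # Assumption: progress goes to stderr, not to the output file.
            print(f"{lineno} lines processed", file=log)
    return counts


def filter_counts(counts, min_count=0):
    """Keep only the characters seen at least min_count times."""
    return {char: n for char, n in counts.items() if n >= min_count}


# In-memory demo with a tiny log interval so that the message is visible.
demo = count_chars_logged(["aaa\n", "ab\n"], log_every=2)
print(filter_counts(demo, min_count=2))
```

The filter runs only after the whole input has been counted, which is why the threshold changes the output but not the running time.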
Statistics on Hungarian diacritics:
- number of tokens
- number of types
- ratio of words with at least one diacritic
- lexdif: average number of words that map to the same latinized word (word with the diacritics removed)
The input is expected to contain one word per line. Example usage:
cat words | python statistics/diacritic_stats.py
Output format:
8351 tokens, 3737 types
1.00214075462 lexdif, 0.490360435876 diacritic ratio
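The four statistics can be sketched as follows. This is not the code of diacritic_stats.py: the function names are hypothetical, diacritics are removed by Unicode decomposition (an assumed technique), the diacritic ratio is taken over tokens, and lexdif is averaged over types, which is one plausible reading of the definition above.

```python
import unicodedata
from collections import Counter


def latinize(word):
    """Remove diacritics by decomposing and dropping combining marks."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def diacritic_stats(lines):
    """Return (tokens, types, lexdif, diacritic ratio) for one word per line."""
    tokens = [w for w in (line.strip() for line in lines) if w]
    types = set(tokens)
    if not tokens:
        return 0, 0, 0.0, 0.0
    # Assumption: the ratio is computed over tokens, not over types.
    with_diacritic = sum(1 for w in tokens if latinize(w) != w)
    # Number of distinct words (types) behind each latinized form.
    group_sizes = Counter(latinize(w) for w in types)
    # Assumption: lexdif averages, over types, the size of the type's group.
    lexdif = sum(group_sizes[latinize(w)] for w in types) / len(types)
    return len(tokens), len(types), lexdif, with_diacritic / len(tokens)


def format_stats(tokens, types, lexdif, ratio):
    # Mirrors the two-line output format shown above.
    return (f"{tokens} tokens, {types} types\n"
            f"{lexdif} lexdif, {ratio} diacritic ratio")


# In-memory demo; pass sys.stdin to diacritic_stats to use it as a filter.
print(format_stats(*diacritic_stats(["kor", "kór", "kör", "alma", "alma"])))
```

In the demo, "kor", "kór" and "kör" all latinize to "kor", so each of those three types counts a group of three, while "alma" counts a group of one.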