documentation.md

Documentation

📥 Step 0: Extraction

Description

A small program that extracts key-passage texts based on their (start/end) positions in the text. Inputs are a literary text and the corresponding citation_sources file created with Lotte. Returns a .csv or .pkl file containing the texts of all cited passages plus additional information. For now, this script only works when called from the lotte-develop repo and may need to be rewritten for other purposes.

Usage

extract_passages.py [-h] [-o {.csv,.pkl}] -c {.json} -t {.txt}

required named arguments:

  • -c, --citations: citation_sources path/file name (file type: {.json})
  • -t, --text: literary text path/file name (file type: {.txt})

optional arguments:

  • -h, --help: show this help message and exit
  • -o, --output: output path / file name (file type: {.csv,.pkl})
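
A hypothetical invocation might look as follows (all file names are placeholders):

```shell
python extract_passages.py -c citation_sources.json -t novel.txt -o cited_passages.pkl
```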

Description

This Python script divides the literary text into two groups: one consisting of potential key passages ("cited") and one containing the rest ("not cited"). A .pkl file created with extract_passages.py is needed as input; another .pkl file containing all text passages is returned.

Usage

group_passages.py [-h] [-w WORK] -i {.pkl} -t {.txt}

required named arguments:

  • -i, --input: input path/file name (file type: {.pkl})
  • -t, --text: literary text path/file name (file type: {.txt})

optional arguments:

  • -h, --help: show this help message and exit
  • -w, --work: title of the work, used for output file names (type: {str([WORK])})
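
A hypothetical invocation might look as follows (file names and the work title are placeholders):

```shell
python group_passages.py -i cited_passages.pkl -t novel.txt -w "Effi Briest"
```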

📄 Step 1: Textual Statistics

Notebook

Functions

Preparation

text/df = read_file(filepath)
Reads a .txt or a .pkl file.

| name | description |
| --- | --- |
| filepath | name of the file/filepath; type: str |
| text/df | file content; depending on filepath, either type: str or type: pandas.DataFrame |

sents_listed/sentences = split_sentences(text)
Splits an input text into sentences.

| name | description |
| --- | --- |
| text | type: str or list |
| sents_listed/sentences | depending on whether the input is a single text or a list of texts, either a list of sentences or a list of lists of sentences; type: list |
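
The input/output shapes can be illustrated with a small sketch (a hypothetical, naive regex splitter stands in for the notebook's actual implementation, which likely uses a proper sentence tokenizer):

```python
import re

def split_sentences(text):
    """Naive illustrative splitter: breaks on ., ! or ? followed by whitespace."""
    if isinstance(text, list):  # list of texts -> list of lists of sentences
        return [split_sentences(t) for t in text]
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

split_sentences("Effi kam. Sie ging wieder!")  # → ["Effi kam.", "Sie ging wieder!"]
```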

Textual Statistics

stats_df, sumstats = get_stats(text, sents, cit_num)
Returns a pandas.DataFrame of text statistics (character length, token count, token length and sentence length per passage) as well as a pandas.DataFrame with the corresponding summary statistics. For the functions prepare_stats() and summary_stats() as well as the calculation of the individual text statistics (get_char_len(), get_token_count(), get_token_len(), get_sent_len()), please take a look at stats.py.

| name | description |
| --- | --- |
| text | list of texts (per passage); type: list |
| sents | list of sentences (per passage); type: list |
| cit_num | list of citation frequencies (per passage); type: list |
| stats_df | output from prepare_stats(); type: pandas.DataFrame |
| sumstats | output from summary_stats(); type: pandas.DataFrame |
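
A minimal sketch of how such per-passage statistics can be computed with pandas (whitespace tokenization and the column names are assumptions for illustration, not the code of stats.py):

```python
import pandas as pd

passages = ["Effi kam nach Hause.", "Sie war sehr glücklich und lachte laut."]

stats_df = pd.DataFrame({
    "char_len": [len(p) for p in passages],             # character length per passage
    "token_count": [len(p.split()) for p in passages],  # naive whitespace tokens
})
# mean token length per passage
stats_df["token_len"] = [sum(len(t) for t in p.split()) / len(p.split()) for p in passages]

sumstats = stats_df.describe()  # summary statistics (count, mean, std, quartiles, ...)
```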

Vis

fig = box_plot(df_cols, df_names, attribute)
Returns a box plot for the given columns and attribute.

| name | description |
| --- | --- |
| df_cols | pandas.DataFrame column names to compare; type: list |
| df_names | str names for the columns in df_cols, used for labelling purposes; type: list |
| attribute | attribute (e.g. textual statistic) to inspect; type: str |
| fig | figure; type: plotly.graph_objects.Figure |

fig = scatter_plot(df, df_name, colx, coly)
Returns a scatter plot for the given columns in df.

| name | description |
| --- | --- |
| df | type: pandas.DataFrame |
| df_name | name of df, used for the title text; type: str |
| colx | first column of df to inspect; type: str |
| coly | second column of df to inspect; type: str |
| fig | figure; type: plotly.express.scatter |

🏷 Step 2: Part-of-Speech (POS) Tagging

Notebook

Functions

Preparation

read_file()

split_sents()

(Relative) Frequencies

pos_tagged_dict, pos_tagged_list = get_pos_tags(sents)
Returns a dict and a list of POS Tags for all of the passages.

| name | description |
| --- | --- |
| sents | list of lists of sentences, as created using split_sents(); type: list |
| pos_tagged_dict | all individual terms (= each term once per passage) and their associated POS Tags; type: dict |
| pos_tagged_list | all POS Tags per passage in their original order; type: list |

sorted_tag_freqs, tags_used = count_tag_freqs(pos_tagged)
Counts the individual POS Tag frequencies per passage in pos_tagged.

| name | description |
| --- | --- |
| pos_tagged | output pos_tagged_list from get_pos_tags; type: list |
| sorted_tag_freqs | DataFrame where row = tag name, column = passage index, values = relative frequencies; type: pandas.DataFrame |
| tags_used | all individual POS Tags used within an input text/document; type: list |
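
The relative-frequency structure can be sketched as follows (assumed semantics; the toy tags and the use of collections.Counter are illustrations, not the notebook's code):

```python
from collections import Counter
import pandas as pd

# one list of POS Tags per passage, as in pos_tagged_list
pos_tagged = [["NE", "VVFIN", "NE"], ["ART", "NN", "VVFIN", "NN"]]

freqs = {i: {tag: n / len(tags) for tag, n in Counter(tags).items()}
         for i, tags in enumerate(pos_tagged)}
sorted_tag_freqs = pd.DataFrame(freqs).fillna(0)  # row = tag name, column = passage index
tags_used = sorted(sorted_tag_freqs.index)
```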

fig = pos_heatmap(df)
Returns a heatmap visualization for the input df.

| name | description |
| --- | --- |
| df | output sorted_tag_freqs from count_tag_freqs; type: pandas.DataFrame |
| fig | figure; type: plotly.graph_objects.Figure |

df = calculate_weights(df, cit_num)
Calculates weighted values for sorted_tag_freqs.

| name | description |
| --- | --- |
| df (input) | output sorted_tag_freqs from count_tag_freqs; type: pandas.DataFrame |
| cit_num | citation frequencies per passage; type: list |
| df (output) | equals df (input) but with newly calculated values; type: pandas.DataFrame |

n-grams

ngrams_list = find_ngrams(pos_tagged, n)
Returns a list of n-grams for pos_tagged.

| name | description |
| --- | --- |
| pos_tagged | output pos_tagged from get_pos_tags; type: list |
| n | n as in n-gram; type: int |
| ngrams_list | all the corresponding n-grams found; type: list |
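
The n-gram extraction itself can be sketched in one line (an illustrative stand-in for the notebook's function; the tag names are placeholders):

```python
def find_ngrams(tags, n):
    """All contiguous n-grams of one passage's POS-tag sequence."""
    return [tuple(tags[i:i + n]) for i in range(len(tags) - n + 1)]

find_ngrams(["ART", "NN", "VVFIN", "NN"], 2)
# → [('ART', 'NN'), ('NN', 'VVFIN'), ('VVFIN', 'NN')]
```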

df = ngram_count(ngrams)
Counts the frequencies of each n-gram in ngrams and returns them as a pandas.DataFrame.

| name | description |
| --- | --- |
| ngrams | output ngrams_list from find_ngrams; type: list |
| df | DataFrame containing the columns "ngram" and "count"; type: pandas.DataFrame |

ngrams, names = get_n_ngrams(n, topn, pos_tagged)
Calls find_ngrams and ngram_count for more than one n and limits the results to the topn highest counts.

| name | description |
| --- | --- |
| n | list of ints, n as in n-gram; type: list([int_1, int_2, int_n]) |
| topn | how many of the highest values to return; type: int |
| pos_tagged | output pos_tagged from get_pos_tags; type: list |
| ngrams | one pandas.DataFrame similar to output df from ngram_count for each n; type: list |
| names | names for each of the nested pandas.DataFrames in ngrams, used for visualization purposes; type: list |

fig = vis_subplots(subtitles, dataframes, rowcount, colcount, showlabels, rel_yaxis)
Creates a plotly.graph_objects.Figure consisting of several bar subplots, one for each DataFrame in dataframes.

| name | description |
| --- | --- |
| subtitles | list of strs, one per subtitle; type: list |
| dataframes | list of pandas.DataFrames, output ngrams from get_n_ngrams; type: list |
| rowcount | number of rows for the subplots; type: int |
| colcount | number of columns for the subplots; type: int |
| showlabels | whether to show labels for the subplots; type: bool |
| rel_yaxis | if True, all following subplots share the y-axis scale of the first one; type: bool |
| fig | type: plotly.graph_objects.Figure |

all_grams_list = list_individual_grams(df_lists)
Returns all individual n-grams over all input df_lists.

| name | description |
| --- | --- |
| df_lists | list of pandas.DataFrames that equal the output ngrams from get_n_ngrams; type: list |
| all_grams_list | all individual n-grams over the different input DataFrames in df_lists; type: list |

check_grams = grams_matrix_prep(grams, all_grams, type)
Checks for each n-gram in grams whether it occurs ("binary") or how often it occurs ("count"), depending on type. Returns a list of values.

| name | description |
| --- | --- |
| grams | equals output ngrams from get_n_ngrams; type: pandas.DataFrame |
| all_grams | output all_grams_list from list_individual_grams; type: list |
| type | options are "binary" and "count"; the value is currently not validated; type: str |
| check_grams | a list of binary/count values for each n-gram in all_grams; type: list |
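
The "binary"/"count" distinction can be sketched as follows (assumed semantics: plain lists stand in for the DataFrame input, and the parameter is named kind here to avoid shadowing Python's built-in type):

```python
def grams_matrix_prep(grams, all_grams, kind):
    """One matrix row per passage: occurrence flag or occurrence count per n-gram."""
    if kind == "binary":
        return [1 if g in grams else 0 for g in all_grams]
    return [grams.count(g) for g in all_grams]  # kind == "count"

grams = [("ART", "NN"), ("NN", "VVFIN"), ("ART", "NN")]
all_grams = [("ART", "NN"), ("NN", "VVFIN"), ("VVFIN", "NN")]
grams_matrix_prep(grams, all_grams, "binary")  # → [1, 1, 0]
grams_matrix_prep(grams, all_grams, "count")   # → [2, 1, 0]
```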

index_list = find_ngram_index(pos_tagged, ngram)
Finds all indices of passages in pos_tagged that contain a certain ngram at least once.

| name | description |
| --- | --- |
| pos_tagged | output pos_tagged_list from get_pos_tags; type: list |
| ngram | must be in the following format (use , as delimiter): "[pos_tag],[pos_tag_2],[pos_tag_n]"; type: str |
| index_list | type: list |

Diversity

div_df = get_pos_diversity(df)
Calculates the Shannon entropy of the POS Tag frequencies per passage in df and returns them in a pandas.DataFrame.

| name | description |
| --- | --- |
| df | output sorted_tag_freqs from count_tag_freqs; type: pandas.DataFrame |
| div_df | DataFrame containing the column "pos_diversity"; type: pandas.DataFrame |
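
Shannon entropy over the relative tag frequencies of one passage can be sketched as (a hypothetical helper, not the notebook's code):

```python
import math

def shannon_entropy(freqs):
    """Shannon entropy (base 2) of a list of relative frequencies."""
    return -sum(p * math.log2(p) for p in freqs if p > 0)

shannon_entropy([0.5, 0.5])  # → 1.0 (two tags, evenly used)
shannon_entropy([1.0])       # → 0 (only one tag occurs: no diversity)
```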

🙃 Step 3: Sentiment Analysis

Notebook

Functions

Preparation

read_file()

split_sents()

SentiWS

sentiment_glossary = sentiws_glossary(positive_lines, negative_lines)
Returns all SentiWS data in the form of a pandas.DataFrame.

| name | description |
| --- | --- |
| positive_lines | file SentiWS_v2.0_Positive.txt, read with .readlines(); type: list |
| negative_lines | file SentiWS_v2.0_Negative.txt, read with .readlines(); type: list |
| sentiment_glossary | processable glossary of words and their sentiment values; type: pandas.DataFrame |

sentiment_vals = get_polarity_values(text, sentiment_df)
Returns a sentiment value for each passage in text.

| name | description |
| --- | --- |
| text | type: list |
| sentiment_df | output sentiment_glossary from sentiws_glossary; type: pandas.DataFrame |
| sentiment_vals | all polarity values for text based on sentiment_df; type: list |
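
A sketch of such a polarity lookup (the glossary entries, weights and column names are invented placeholders for illustration, not actual SentiWS values or the notebook's code):

```python
import pandas as pd

glossary = pd.DataFrame({"word": ["gut", "schlecht"], "value": [0.37, -0.77]})
weights = dict(zip(glossary["word"], glossary["value"]))

def polarity(passage):
    # sum the weights of all glossary words found in the passage
    return sum(weights.get(tok.lower(), 0.0) for tok in passage.split())

polarity("das Buch ist gut")  # → 0.37
```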

dataframe = apply_scaling(dataframe, col, scale_range)
Applies a new scale to all data in one col of dataframe (input).

| name | description |
| --- | --- |
| dataframe (input) | type: pandas.DataFrame |
| col | column in dataframe (input) that the scaling should be applied to; type: str |
| scale_range | (currently) one of two options: "zero_pos" equals range [0, 1] (using sklearn.preprocessing.MinMaxScaler), "neg_pos" equals range [-1, 1] (using sklearn.preprocessing.MaxAbsScaler); type: str |
| dataframe (output) | type: pandas.DataFrame |
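
The two scale_range options can be reproduced directly with the named scikit-learn scalers (a minimal sketch; the column name and example values are placeholders):

```python
import pandas as pd
from sklearn.preprocessing import MaxAbsScaler, MinMaxScaler

df = pd.DataFrame({"sentiws": [-2.0, 0.0, 2.0]})

zero_pos = MinMaxScaler().fit_transform(df[["sentiws"]])  # "zero_pos": range [0, 1]
neg_pos = MaxAbsScaler().fit_transform(df[["sentiws"]])   # "neg_pos": range [-1, 1]
```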

germansentiment

sentiment = get_germansentiment(text_col)
Calculates sentiment scores for each passage in text_col and returns a DataFrame.

| name | description |
| --- | --- |
| text_col | type: pandas.DataFrame[column] |
| sentiment | germansentiment.SentimentModel().predict_sentiment() for each text in text_col; type: pandas.DataFrame |

compare_sentiment(passage_loc, df)
Prints the germansentiment and SentiWS scores for a given passage_loc in df.

| name | description |
| --- | --- |
| passage_loc | location (index) of the passage to inspect; type: int |
| df | must contain the columns "text", "germansentiment" and "rel_sentiws" for them to be compared; type: pandas.DataFrame |

df = map_sentiment(df)
Transforms the germansentiment values in df to a [-1, 0, 1] scale.

| name | description |
| --- | --- |
| df (input/output) | must contain the column "germansentiment", on which the function is applied; type: pandas.DataFrame |
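
The mapping itself can be sketched as follows (the label strings match germansentiment's output classes; the exact dict is an assumption about the notebook's mapping):

```python
mapping = {"negative": -1, "neutral": 0, "positive": 1}

labels = ["positive", "neutral", "negative"]  # e.g. a "germansentiment" column
[mapping[label] for label in labels]  # → [1, 0, -1]
```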

🗂 Step 4: Summary

Notebook

Functions

read_file()

apply_scaling()