documentation.md

Documentation

📥 Step 0: Extraction

Description

A small program that extracts key-passage texts based on their (start/end) positions in the text. Inputs are a literary text and the corresponding citation_sources file created with Lotte. Returns a .csv or .pkl file containing the texts of all cited passages plus additional information. For now, this script only works when called from the lotte-develop repo and may need to be rewritten for other purposes.

Usage

extract_passages.py [-h] [-o {.csv,.pkl}] -c {.json} -t {.txt}

required named arguments:

  • -c, --citations: citation_sources path/file name (file type: {.json})
  • -t, --text: literary text path/file name (file type: {.txt})

optional arguments:

  • -h, --help: show this help message and exit
  • -o, --output: output path / file name (file type: {.csv,.pkl})
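
A hypothetical invocation might look as follows (all file names are placeholders):

```shell
python extract_passages.py -c citation_sources.json -t novel.txt -o cited_passages.pkl
```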

Description

This Python script divides the literary text into two groups: one consisting of potential key passages ("cited") and one containing the rest ("not cited"). A .pkl file created with extract_passages.py is needed as input; another .pkl file containing all text passages is returned.

Usage

group_passages.py [-h] [-w WORK] -i {.pkl} -t {.txt}

required named arguments:

  • -i, --input: input path/file name (file type: {.pkl})
  • -t, --text: literary text path/file name (file type: {.txt})

optional arguments:

  • -h, --help: show this help message and exit
  • -w, --work: title of the work, used for output file names (type: {str([WORK])})
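
A hypothetical invocation might look as follows (file names and the work title are placeholders):

```shell
python group_passages.py -i cited_passages.pkl -t novel.txt -w "Effi Briest"
```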

📄 Step 1: Textual Statistics

Notebook

Functions

Preparation

text/df = read_file(filepath)
Reads a .txt or a .pkl file.

| name | description |
| --- | --- |
| filepath | name of the file/filepath; type: str |
| text/df | file content; depending on filepath, either type: str or type: pandas.DataFrame |

sents_listed/sentences = split_sentences(text)
Splits an input text into sentences.

| name | description |
| --- | --- |
| text | type: str or list |
| sents_listed/sentences | depending on whether the input is a single text or a list of texts, either a list of sentences or a list of lists of sentences; type: list |
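
The input/output shapes can be illustrated with a small sketch (a hypothetical, naive regex splitter stands in for the notebook's actual implementation, which likely uses a proper sentence tokenizer):

```python
import re

def split_sentences(text):
    """Naive illustrative splitter: breaks on ., ! or ? followed by whitespace."""
    if isinstance(text, list):  # list of texts -> list of lists of sentences
        return [split_sentences(t) for t in text]
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

split_sentences("Effi kam. Sie ging wieder!")  # → ["Effi kam.", "Sie ging wieder!"]
```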

Textual Statistics

stats_df, sumstats = get_stats(text, sents, cit_num)
Returns a pandas.DataFrame of text statistics (character length, token count, token length and sentence length per passage) as well as a pandas.DataFrame with the corresponding summary statistics. For the functions prepare_stats() and summary_stats() as well as the calculation of the individual text statistics (get_char_len(), get_token_count(), get_token_len(), get_sent_len()), please take a look at stats.py.

| name | description |
| --- | --- |
| text | list of texts (per passage); type: list |
| sents | list of sentences (per passage); type: list |
| cit_num | list of citation frequencies (per passage); type: list |
| stats_df | output from prepare_stats(); type: pandas.DataFrame |
| sumstats | output from summary_stats(); type: pandas.DataFrame |
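
A minimal sketch of how such per-passage statistics can be computed with pandas (whitespace tokenization and the column names are assumptions for illustration, not the code of stats.py):

```python
import pandas as pd

passages = ["Effi kam nach Hause.", "Sie war sehr glücklich und lachte laut."]

stats_df = pd.DataFrame({
    "char_len": [len(p) for p in passages],             # character length per passage
    "token_count": [len(p.split()) for p in passages],  # naive whitespace tokens
})
# mean token length per passage
stats_df["token_len"] = [sum(len(t) for t in p.split()) / len(p.split()) for p in passages]

sumstats = stats_df.describe()  # summary statistics (count, mean, std, quartiles, ...)
```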

Vis

fig = box_plot(df_cols, df_names, attribute)
Returns a box plot for the given columns and attribute.

| name | description |
| --- | --- |
| df_cols | pandas.DataFrame column names to compare; type: list |
| df_names | str names for the columns in df_cols, used for labelling purposes; type: list |
| attribute | attribute (e.g. textual statistic) to inspect; type: str |
| fig | figure; type: plotly.graph_objects.Figure |

fig = scatter_plot(df, df_name, colx, coly)
Returns a scatter plot for the given columns in df.

| name | description |
| --- | --- |
| df | type: pandas.DataFrame |
| df_name | name of df, used for the title text; type: str |
| colx | first column of df to inspect; type: str |
| coly | second column of df to inspect; type: str |
| fig | figure; type: plotly.express.scatter |

🏷 Step 2: Part-of-Speech (POS) Tagging

Notebook

Functions

Preparation

read_file()

split_sents()

(Relative) Frequencies

pos_tagged_dict, pos_tagged_list = get_pos_tags(sents)
Returns a dict and a list of POS Tags for all of the passages.

| name | description |
| --- | --- |
| sents | list of lists of sentences, as created using split_sents(); type: list |
| pos_tagged_dict | all individual terms (= each term once per passage) and their associated POS Tags; type: dict |
| pos_tagged_list | all POS Tags per passage in their original order; type: list |

sorted_tag_freqs, tags_used = count_tag_freqs(pos_tagged)
Counts the individual POS Tag frequencies per passage in pos_tagged.

| name | description |
| --- | --- |
| pos_tagged | output pos_tagged_list from get_pos_tags; type: list |
| sorted_tag_freqs | DataFrame where row = tag name, column = passage index, values = relative frequencies; type: pandas.DataFrame |
| tags_used | all individual POS Tags used within an input text/document; type: list |
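
The relative-frequency structure can be sketched as follows (assumed semantics; the toy tags and the use of collections.Counter are illustrations, not the notebook's code):

```python
from collections import Counter
import pandas as pd

# one list of POS Tags per passage, as in pos_tagged_list
pos_tagged = [["NE", "VVFIN", "NE"], ["ART", "NN", "VVFIN", "NN"]]

freqs = {i: {tag: n / len(tags) for tag, n in Counter(tags).items()}
         for i, tags in enumerate(pos_tagged)}
sorted_tag_freqs = pd.DataFrame(freqs).fillna(0)  # row = tag name, column = passage index
tags_used = sorted(sorted_tag_freqs.index)
```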

fig = pos_heatmap(df)
Returns a heatmap visualization for the input df.

| name | description |
| --- | --- |
| df | output sorted_tag_freqs from count_tag_freqs; type: pandas.DataFrame |
| fig | figure; type: plotly.graph_objects.Figure |

df = calculate_weights(df, cit_num)
Calculates weighted values for sorted_tag_freqs.

| name | description |
| --- | --- |
| df (input) | output sorted_tag_freqs from count_tag_freqs; type: pandas.DataFrame |
| cit_num | citation frequencies per passage; type: list |
| df (output) | equals df (input) but with newly calculated values; type: pandas.DataFrame |

n-grams

ngrams_list = find_ngrams(pos_tagged, n)
Returns a list of n-grams for pos_tagged.

| name | description |
| --- | --- |
| pos_tagged | output pos_tagged from get_pos_tags; type: list |
| n | n as in n-gram; type: int |
| ngrams_list | all the corresponding n-grams found; type: list |
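
The n-gram extraction itself can be sketched in one line (an illustrative stand-in for the notebook's function; the tag names are placeholders):

```python
def find_ngrams(tags, n):
    """All contiguous n-grams of one passage's POS-tag sequence."""
    return [tuple(tags[i:i + n]) for i in range(len(tags) - n + 1)]

find_ngrams(["ART", "NN", "VVFIN", "NN"], 2)
# → [('ART', 'NN'), ('NN', 'VVFIN'), ('VVFIN', 'NN')]
```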

df = ngram_count(ngrams)
Counts the frequencies of each n-gram in ngrams and returns them as a pandas.DataFrame.

| name | description |
| --- | --- |
| ngrams | output ngrams_list from find_ngrams; type: list |
| df | DataFrame containing the columns "ngram" and "count"; type: pandas.DataFrame |

ngrams, names = get_n_ngrams(n, topn, pos_tagged)
Calls find_ngrams and ngram_count for more than one n and limits the results to the topn highest counts.

| name | description |
| --- | --- |
| n | list of ints, n as in n-gram; type: list([int_1, int_2, int_n]) |
| topn | how many of the highest values to return; type: int |
| pos_tagged | output pos_tagged from get_pos_tags; type: list |
| ngrams | one pandas.DataFrame similar to output df from ngram_count for each n; type: list |
| names | names for each of the nested pandas.DataFrames in ngrams, used for visualization purposes; type: list |

fig = vis_subplots(subtitles, dataframes, rowcount, colcount, showlabels, rel_yaxis)
Creates a plotly.graph_objects.Figure consisting of several bar subplots, one for each DataFrame in dataframes.

| name | description |
| --- | --- |
| subtitles | list of strs, one per subtitle; type: list |
| dataframes | list of pandas.DataFrames, output ngrams from get_n_ngrams; type: list |
| rowcount | number of rows for the subplots; type: int |
| colcount | number of columns for the subplots; type: int |
| showlabels | whether to show labels for the subplots; type: bool |
| rel_yaxis | if True, all following subplots share the y-axis scale of the first one; type: bool |
| fig | type: plotly.graph_objects.Figure |

all_grams_list = list_individual_grams(df_lists)
Returns all individual n-grams over all input df_lists.

| name | description |
| --- | --- |
| df_lists | list of pandas.DataFrames that equal the output ngrams from get_n_ngrams; type: list |
| all_grams_list | all individual n-grams over the different input DataFrames in df_lists; type: list |

check_grams = grams_matrix_prep(grams, all_grams, type)
Checks for each n-gram in grams whether it occurs ("binary") or how often it occurs ("count"), depending on type. Returns a list of values.

| name | description |
| --- | --- |
| grams | equals output ngrams from get_n_ngrams; type: pandas.DataFrame |
| all_grams | output all_grams_list from list_individual_grams; type: list |
| type | options are "binary" and "count"; the value is currently not validated; type: str |
| check_grams | a list of binary/count values for each n-gram in all_grams; type: list |
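
The "binary"/"count" distinction can be sketched as follows (assumed semantics: plain lists stand in for the DataFrame input, and the parameter is named kind here to avoid shadowing Python's built-in type):

```python
def grams_matrix_prep(grams, all_grams, kind):
    """One matrix row per passage: occurrence flag or occurrence count per n-gram."""
    if kind == "binary":
        return [1 if g in grams else 0 for g in all_grams]
    return [grams.count(g) for g in all_grams]  # kind == "count"

grams = [("ART", "NN"), ("NN", "VVFIN"), ("ART", "NN")]
all_grams = [("ART", "NN"), ("NN", "VVFIN"), ("VVFIN", "NN")]
grams_matrix_prep(grams, all_grams, "binary")  # → [1, 1, 0]
grams_matrix_prep(grams, all_grams, "count")   # → [2, 1, 0]
```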

index_list = find_ngram_index(pos_tagged, ngram)
Finds all indices of passages in pos_tagged that contain a certain ngram at least once.

| name | description |
| --- | --- |
| pos_tagged | output pos_tagged_list from get_pos_tags; type: list |
| ngram | must be in the following format (use , as delimiter): "[pos_tag],[pos_tag_2],[pos_tag_n]"; type: str |
| index_list | type: list |

Diversity

div_df = get_pos_diversity(df)
Calculates the Shannon entropy of the POS Tag frequencies per passage in df and returns them in a pandas.DataFrame.

| name | description |
| --- | --- |
| df | output sorted_tag_freqs from count_tag_freqs; type: pandas.DataFrame |
| div_df | DataFrame containing the column "pos_diversity"; type: pandas.DataFrame |
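
Shannon entropy over the relative tag frequencies of one passage can be sketched as (a hypothetical helper, not the notebook's code):

```python
import math

def shannon_entropy(freqs):
    """Shannon entropy (base 2) of a list of relative frequencies."""
    return -sum(p * math.log2(p) for p in freqs if p > 0)

shannon_entropy([0.5, 0.5])  # → 1.0 (two tags, evenly used)
shannon_entropy([1.0])       # → 0 (only one tag occurs: no diversity)
```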

🙃 Step 3: Sentiment Analysis

Notebook

Functions

Preparation

read_file()

split_sents()

SentiWS

sentiment_glossary = sentiws_glossary(positive_lines, negative_lines)
Returns all SentiWS data in the form of a pandas.DataFrame.

| name | description |
| --- | --- |
| positive_lines | file SentiWS_v2.0_Positive.txt, read with .readlines(); type: list |
| negative_lines | file SentiWS_v2.0_Negative.txt, read with .readlines(); type: list |
| sentiment_glossary | processable glossary of words and their sentiment values; type: pandas.DataFrame |

sentiment_vals = get_polarity_values(text, sentiment_df)
Returns a sentiment value for each passage in text.

| name | description |
| --- | --- |
| text | type: list |
| sentiment_df | output sentiment_glossary from sentiws_glossary; type: pandas.DataFrame |
| sentiment_vals | all polarity values for text based on sentiment_df; type: list |
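
A sketch of such a polarity lookup (the glossary entries, weights and column names are invented placeholders for illustration, not actual SentiWS values or the notebook's code):

```python
import pandas as pd

glossary = pd.DataFrame({"word": ["gut", "schlecht"], "value": [0.37, -0.77]})
weights = dict(zip(glossary["word"], glossary["value"]))

def polarity(passage):
    # sum the weights of all glossary words found in the passage
    return sum(weights.get(tok.lower(), 0.0) for tok in passage.split())

polarity("das Buch ist gut")  # → 0.37
```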

dataframe = apply_scaling(dataframe, col, scale_range)
Applies a new scale to all data in one col of dataframe (input).

| name | description |
| --- | --- |
| dataframe (input) | type: pandas.DataFrame |
| col | column in dataframe (input) that the scaling should be applied to; type: str |
| scale_range | (currently) one of two options: "zero_pos" equals range [0, 1] (using sklearn.preprocessing.MinMaxScaler), "neg_pos" equals range [-1, 1] (using sklearn.preprocessing.MaxAbsScaler); type: str |
| dataframe (output) | type: pandas.DataFrame |
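
The two scale_range options can be reproduced directly with the named scikit-learn scalers (a minimal sketch; the column name and example values are placeholders):

```python
import pandas as pd
from sklearn.preprocessing import MaxAbsScaler, MinMaxScaler

df = pd.DataFrame({"sentiws": [-2.0, 0.0, 2.0]})

zero_pos = MinMaxScaler().fit_transform(df[["sentiws"]])  # "zero_pos": range [0, 1]
neg_pos = MaxAbsScaler().fit_transform(df[["sentiws"]])   # "neg_pos": range [-1, 1]
```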

germansentiment

sentiment = get_germansentiment(text_col)
Calculates sentiment scores for each passage in text_col and returns a DataFrame.

| name | description |
| --- | --- |
| text_col | type: pandas.DataFrame[column] |
| sentiment | germansentiment.SentimentModel().predict_sentiment() for each text in text_col; type: pandas.DataFrame |

compare_sentiment(passage_loc, df)
Prints the germansentiment and SentiWS scores for a given passage_loc in df.

| name | description |
| --- | --- |
| passage_loc | location (index) of the passage to inspect; type: int |
| df | must contain the columns "text", "germansentiment" and "rel_sentiws" for them to be compared; type: pandas.DataFrame |

df = map_sentiment(df)
Transforms the germansentiment values in df to a [-1, 0, 1] scale.

| name | description |
| --- | --- |
| df (input/output) | must contain the column "germansentiment", on which the function is applied; type: pandas.DataFrame |
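
The mapping itself can be sketched as follows (the label strings match germansentiment's output classes; the exact dict is an assumption about the notebook's mapping):

```python
mapping = {"negative": -1, "neutral": 0, "positive": 1}

labels = ["positive", "neutral", "negative"]  # e.g. a "germansentiment" column
[mapping[label] for label in labels]  # → [1, 0, -1]
```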

🗂 Step 4: Summary

Notebook

Functions

read_file()

apply_scaling()