- POEMNet
- Video statistics
- Inter-annotator stats
- Model Results
- Questions, comments, concerns, need help?
- Citation
This repository holds the software used in *Automated operative phase identification in peroral endoscopic myotomy*. In particular, it houses the scripts used to generate the statistics and visualizations in the Results section of the paper. Documentation for the scripts consists of heavy commenting within each script, an informative commit message per script, and this README. If you still have questions, please contact me.
Python >= 3.6 is required. The following extra packages will be needed for some of the scripts; install them through your distribution's package manager, PyPI, etc.

- `docopt` (verified to work with version 0.6.2_5)
- `numpy` (verified to work with version 1.19.1)
All calculations for the paper were performed with R 3.6. The following packages will also be required (recommend adding with `install.packages()`):

- `tidyverse` (verified to work with version 1.3.0)
- `irr` (verified to work with version 0.84.1)
- `caret` (verified to work with version 6.0-86)
- `docopt` (verified to work with version 0.6.1)
- `e1071` (a missed dependency in the `caret` package)
The `video_lengths.py` script will require `ffprobe`, which is typically packaged with `ffmpeg`.
The following sections address how to calculate the results shared in the Results subsection Video information of the paper.
To calculate the mean, min, max, median, and pstdev of the overall lengths of a directory full of videos, use `video_lengths.py`. Results are output to `stdout` in units of seconds, rounded to two decimal places. Of note, this will require `ffprobe`.
```
video_lengths.py /data/directory_holding_videos
```

will output:
```
For the videos in '/data/directory_holding_videos', in seconds:
The mean is: 1467.5
The pstdev is: 645.5
The min is: 822
The median is: 1467.5
The max is: 2113
```
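If you cannot use the script directly, the underlying approach is small enough to adapt. Below is a minimal sketch of the same calculation, assuming `ffprobe` is on your `PATH` and the directory contains only video files (both assumptions; adjust as needed):

```python
# Sketch of the video_lengths.py approach: ask ffprobe for each duration,
# then summarize with the statistics module. Paths are illustrative.
import statistics
import subprocess
from pathlib import Path

def video_length_seconds(path):
    """Return the container duration in seconds, as reported by ffprobe."""
    out = subprocess.check_output(
        ["ffprobe", "-v", "error", "-show_entries", "format=duration",
         "-of", "default=noprint_wrappers=1:nokey=1", str(path)],
        universal_newlines=True)
    return float(out.strip())

videos = sorted(Path("/data/directory_holding_videos").iterdir())
lengths = [video_length_seconds(v) for v in videos]
print("The mean is:", round(statistics.mean(lengths), 2))
print("The pstdev is:", round(statistics.pstdev(lengths), 2))
print("The min is:", round(min(lengths), 2))
print("The median is:", round(statistics.median(lengths), 2))
print("The max is:", round(max(lengths), 2))
```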
Phase duration statistics are calculated from annotated ground truth, with one annotation per video. For historical reasons, our annotation files are all converted to the Anvil file format, which is XML.
All statistics, though, are calculated from a CSV file (detailed in the section below). Therefore, to generate the statistics and the boxplot of phase durations: if your annotations are in Anvil format, I provide scripts to convert that information into a CSV of the correct structure. If you stored your data in another format, I recommend skipping the headache of your annotation format -> Anvil -> CSV and instead writing a program to generate a CSV directly from your annotation format (a sketch of this is given after the table below). If you need help with this, please contact me; I am happy to offer guidance and/or write a script.
The below table shows the general structure of the CSV that contains annotated ground truth for an entire set of videos:
variable | class | description |
---|---|---|
video | character | Filename of video annotated |
frame | double | Frame number for the video. Starts at 0. Equivalent to seconds for this case |
annotator | character | Annotation for each frame (eg clip, cut, or NA if not annotated) done by the annotator generically named "annotator." Any column variable name is fine. |
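If your annotations are in neither Anvil format nor this CSV structure already, the conversion can be quite small. Below is a minimal sketch that emits the CSV above from a hypothetical in-memory format of `(phase, start_second, end_second)` triples per video; the input format and phase names are illustrative, so adapt the parsing to your own files:

```python
# Hypothetical converter: your annotation format -> the CSV structure above.
import csv

# Illustrative input: one list of (phase, start_second, end_second) per video.
annotations = {
    "poem_01.mp4": [("submuc_inject", 0, 120), ("muc_incis", 121, 180)],
}

with open("phase_duration.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["video", "frame", "annotator"])
    for video, phases in annotations.items():
        for phase, start, end in phases:
            for second in range(start, end + 1):
                # One row per second. Only annotated seconds are written here;
                # emit NA rows instead if your tooling expects every second.
                writer.writerow([video, second, phase])
```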
- Into a directory (eg `annotations_dir`), move a single Anvil-format annotation per video in the dataset
  - It is OK if the "coder" field in the Anvil annotations has different annotator names; the next step will standardize these
- Rename all "coder" fields in the Anvil annotation files to "annotator":

  ```
  make_anvil_same_annotator.py annotations_dir standardized_name_dir
  ```

- Generate `phase_duration.csv`:

  ```
  anvil_to_frame_csv.py 1 standardized_name_dir phase_duration.csv
  ```

  - the `1` tells the script to generate only one row (aka frame) per second
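The first few rows of a `phase_duration.csv` generated this way might look like the following (filename and phase labels are illustrative):

```
video,frame,annotator
poem_01.mp4,0,submuc_inject
poem_01.mp4,1,submuc_inject
poem_01.mp4,2,submuc_inject
```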
Phase names in annotation files are typically abbreviated (eg `muc_incis`) rather than their full human-readable name (eg "Mucosotomy"). Analysis programs also tend to order them alphabetically in the display (eg "Mucosotomy" will occur before "Submucosal Injection"), which does not make sense, as we would prefer the phases to be temporally ordered. To fix both these issues, the statistics script is fed a simple CSV (`phase_factors.csv`) that specifies a short-to-full name translation. In addition, the rows' order matters: the first row will be displayed first, and the last row will be displayed last, in any future graphs. You can use this to order your phases temporally, for example, or apply any other arbitrary sorting.
The `phase_factors.csv` has the following structure:
variable | class | description |
---|---|---|
name | character | Variable name used for the annotated phase in the annotation file, eg mucos_close |
full_name | character | Full human readable name of the phase, eg "Mucosotomy Closure" |
An example `phase_factors.csv` for POEM is present in the `examples/` directory.
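For orientation, a POEM `phase_factors.csv` might plausibly look like the block below. The short names are taken from the class list shown later in this README, but the full names and temporal ordering here are my illustration, so treat the file in `examples/` as canonical:

```
name,full_name
submuc_inject,Submucosal Injection
muc_incis,Mucosotomy
tunnel,Submucosal Tunnel
myotomy,Myotomy
mucos_close,Mucosotomy Closure
```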
The script `overall_phase_stats.R` is used. It will generate a TSV with the following summary statistics for each phase:
- mean
- sd (standard deviation)
- min
- Q1 (1st quartile)
- med (median)
- Q3 (3rd quartile)
- max
- mad (median absolute deviation)
It can optionally generate a boxplot. With the command, you can optionally make the time scale logarithmic. Phases will be displayed on the Y-axis in the order they are specified in `phase_factors.csv`. The image output format (eg png, jpg, pdf) is automatically inferred from the file extension given. In general, `pdf` will provide optimal image quality (text will look the best upon import into LaTeX). You can also specify the height and width of the output boxplot. Below is an example invocation that outputs a TSV with summary statistics in `durations.tsv` and a boxplot `boxplot.png` that is a 12×8 cm image with a log scale for the time axis:
```
overall_phase_stats.R -f phase_factors.csv -o durations.tsv \
    -p -l 12 8 boxplot.png phase_duration.csv
```
Invoking `overall_phase_stats.R -h` will print to stdout a help message explaining all the command-line options.
The following sections address how to calculate the results in the Results subsection Inter-annotator reliability and agreement.
As with the phase duration statistics above, all statistics are calculated from a CSV file that holds all the annotations (detailed in the section below). Therefore, to generate inter-annotator statistics, you can either generate the CSV from Anvil annotations (instructions below) or generate it from your own annotations. If you need help with that, please let me know!
The below table shows the general structure of the CSV that contains annotated ground truth for an entire set of videos by multiple annotators. Note that it is nearly identical to `phase_duration.csv`, except that it can hold an arbitrary number of annotator name columns:
variable | class | description |
---|---|---|
video | character | Filename of video annotated |
frame | double | Frame number for the video. Starts at 0. Equivalent to seconds for this case |
annotator_1's name | character | Annotation for each frame (eg clip, cut, or NA if not annotated) done by the first annotator. |
annotator_2's name | character | Annotation for each frame (eg clip, cut, or NA if not annotated) done by the second annotator. |
annotator_N's name | character | Annotation for each frame (eg clip, cut, or NA if not annotated) done by the Nth annotator. |
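As a concrete (illustrative) example with annotators named DAH, ORM, and tmw, the first rows of such a CSV could look like:

```
video,frame,DAH,ORM,tmw
poem_01.mp4,0,submuc_inject,submuc_inject,NA
poem_01.mp4,1,submuc_inject,submuc_inject,submuc_inject
```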
- Create a directory and move the annotations for a set of videos by multiple annotators into the directory
  - We have included the annotations performed by `DAH`, `ORM`, and `tmw` in the `examples/multiple_annotator_annotations` directory
  - Note that you will need each annotator's name in the Anvil XML file to be the same between files (eg do not have TMW in one and TW in another; these will count as different annotators)
  - Also, each video must be annotated by all annotators, otherwise the CSV generation script may not output correctly
- Generate `multiple_annotators.csv` from the annotations stored in the directory `multiple_annotator_annotations`:

  ```
  anvil_to_frame_csv.py 1 multiple_annotator_annotations \
      multiple_annotators.csv
  ```
The script `interannotator_stats.R` is used to generate Krippendorff's alpha coefficient (to calculate inter-annotator reliability over the entire video) and Fleiss' kappa (to calculate inter-annotator agreement on a per-phase basis). It uses the annotation data extracted into `multiple_annotators.csv`. It also requires the `phase_factors.csv` generated in previous portions of this README in order to translate phase names and order them by user preference. Each second is treated as a different "diagnosis" by an annotator and then compared between annotators. An example invocation that calculates the statistics and outputs Fleiss' kappa into `fleiss.csv` and Krippendorff's alpha into `kripp.csv` is below:
```
interannotator_stats.R -f fleiss.csv -k kripp.csv phase_factors.csv \
    multiple_annotators.csv
```
Of note, to keep things consistent across the calculation of Krippendorff's alpha and Fleiss' kappa, we now count "Idle" (aka "NA") time the same across the two statistics. Previously, these times were not included in the Fleiss' kappa calculations. You will notice slightly different results compared to our paper's table (in particular, the overall Fleiss' kappa is the same as the Krippendorff's alpha, as anticipated).
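If you want to sanity-check the Fleiss' kappa outside of R, a rough cross-check is possible in Python with `statsmodels` (this is not the paper's calculation, just an independent estimate; note how it codes unannotated time as an explicit "Idle" category, matching the convention described above):

```python
# Rough Fleiss' kappa cross-check on multiple_annotators.csv (assumed layout
# as in the table above: video, frame, then one column per annotator).
import pandas as pd
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

df = pd.read_csv("multiple_annotators.csv")
# Each row (one video second) is a "subject"; each annotator column is a rater.
ratings = df.drop(columns=["video", "frame"]).fillna("Idle")
counts, _categories = aggregate_raters(ratings.to_numpy())
print(fleiss_kappa(counts))
```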
This section will show you how to do the following:
- Generate individual and side-by-side surgical fingerprint plots for each video in the test set (eg Fig 3 in the paper)
- Generate performance across phases statistics (per video in test set and all videos combined) (eg Table 3 in the paper)
- Generate a confusion matrix per video and across all videos in the test set (eg Fig 4 in the paper)
- Generate performance across phase-duration statistics on user-specified duration intervals (eg Table 4 in the paper)
As in previous sections, calculations are done from a CSV/TSV file; in this instance, it's one TSV file per video in the test set for the model. I will outline the structure of that TSV file and also show how to generate it from our typical model's output. Then I will show how to generate the figures and statistics from the TSV files.
To generate summary statistics, surgical fingerprint plots, and confusion matrices, there should be a directory full of TSVs. Each TSV contains a row per video second, with each row containing the information below:
variable | class | description |
---|---|---|
second | integer | second of the video |
vid_num | integer | video's number in the test set |
block | integer | phase's block number. each time a phase transitions to another phase this increments |
gt | character | ground truth label for the phase |
predicted | character | predicted (most likely) phase |
phase_1 name | double | model's probability estimate that the current video second is phase_1 name |
phase_2 name | double | model's probability estimate that the current video second is phase_2 name |
phase_... name | double | model's probability estimate that the current video second is phase_... name |
phase_N name | double | model's probability estimate that the current video second is phase_N name |
Replace the column variable names `phase_1 name`, `phase_2 name`, etc with whatever the phase labels are that your model is trying to identify.
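If your model does not produce our pkl format (covered next), you can also write these TSVs directly. Below is a minimal sketch, assuming you already hold per-second ground-truth labels and a `(seconds, phases)` probability array; the names, shapes, and starting block index are illustrative:

```python
# Hypothetical writer for one test-set video's TSV, matching the table above.
import csv
import numpy as np

class_names = ["submuc_inject", "muc_incis", "tunnel", "myotomy", "mucos_close"]
gt = ["submuc_inject", "submuc_inject", "muc_incis"]  # one label per second
probs = np.array([[0.90, 0.05, 0.03, 0.01, 0.01],
                  [0.80, 0.15, 0.03, 0.01, 0.01],
                  [0.20, 0.70, 0.07, 0.02, 0.01]])    # (seconds, phases)

with open("video_01.tsv", "w", newline="") as f:
    writer = csv.writer(f, delimiter="\t")
    writer.writerow(["second", "vid_num", "block", "gt", "predicted"] + class_names)
    block = 0  # starting index is a guess; match whatever your tooling expects
    for second, (label, p) in enumerate(zip(gt, probs)):
        if second > 0 and label != gt[second - 1]:
            block += 1  # a new block begins at every ground-truth phase change
        writer.writerow([second, 1, block, label,
                         class_names[int(np.argmax(p))]] + list(p))
```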
This section describes how to generate the TSVs detailed in the prior section. In particular, it shows how to generate them from a `pkl` file that our model outputs.
POEMNet's results on the test set are published in `examples/poemnet.pkl`. The pkl file contains a single Python dictionary with keys `class_names`, `gt`, `lstm:`, and `lstm_hmm`. These contain:
- `class_names`: list of class (phase) names, eg `mucos_close`
- `gt`: a dictionary with keys `[1, 2, ..., 20]`, one for each video. Each key's value is a list of the ground truth labels, with the list's index corresponding to the video second.
- `lstm:`: a dictionary with keys `[1, 2, ..., 20]`, one for each video. Each key's value is a `numpy.ndarray`, with each index holding the likelihoods of that video's second being categorized as each of the different class names (aka phases).
- `lstm_hmm`: same as `lstm:`, except the `lstm:` results with additional HMM smoothing added
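To poke at the file interactively rather than through `pkl_dump.py` (shown next), a few lines of Python suffice, following the structure just described:

```python
# Quick interactive look at the published results pickle.
import pickle

with open("examples/poemnet.pkl", "rb") as f:
    results = pickle.load(f)

print(results["class_names"])     # eg ['muc_incis', 'mucos_close', ...]
print(results["gt"][1][:5])       # first 5 seconds of ground truth for video 1
print(results["lstm:"][1].shape)  # per-second phase likelihoods for video 1
```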
To quickly ascertain the contents of the pkl, you can run the script `pkl_dump.py` such as below:
```
pkl_dump.py poemnet_results.pkl
```

that will output:
```
Class names found in 'class_names' are:
    muc_incis
    mucos_close
    myotomy
    submuc_inject
    tunnel
A model's results for 20 videos found in: 'lstm_hmm'
Ground truth for 20 videos found in: 'gt'
A model's results for 20 videos found in: 'lstm:'
```
This information will come in handy when you run `make_model_result_tsvs.py` later.
Now that we know the pkl structure, we can run `make_model_result_tsvs.py` to generate a directory full of TSVs for the model's results (without HMM smoothing), called `video_tsvs`, for further analysis:
```
make_model_result_tsvs.py -c 'class_names' -p poemnet_results.pkl \
    -g 'gt' -m 'lstm:' video_tsvs
```
The command-line options are documented by invoking the script with `-h`.
Surgical fingerprints are a way to display the model's phase likelihoods for each second of the video compared to the annotated ground truth. An example is in Fig 3 of the paper. Below we will show you how to generate a fingerprint for each video the model analysed, and how to create a "side-by-side" fingerprint like in Fig 3.
Both use the same script, `fingerprints.R`. It requires a `phase_factors.csv` file already written as documented above and a directory full of model result TSVs (one per video).
Creating a fingerprint for each video allows you to rapidly analyze the model's performance on a per-video basis and target areas that need improvement. To create one per video, do the following:
- Make a directory to contain the script's output:

  ```
  mkdir results
  ```

- Run `fingerprints.R all` to generate a fingerprint per video, into `results/`, eg:

  ```
  fingerprints.R all -o results/ -f phase_factors.csv -t video_tsvs/
  ```
Optionally, you can also specify image width, height, and format (use the `-h` command line switch to see fully how to use the script).
As in the POEMNet paper, you may want to generate a side-by-side comparison of two fingerprints to highlight differences in model analysis of each case. To do so (such as we did to generate the figure in the paper), do the following:
```
fingerprints.R two -W 15 -H 7.5 -o results/ \
    -f phase_factors.csv -t video_tsvs \
    video_08.tsv "Straightforward" video_10.tsv "Tortuous Esophagus"
```
This will generate a 15×7.5 image called `results/video_08_video_10_fingerprint.png` that shows the fingerprint for video 08 on the left with the title "Straightforward" and the fingerprint for video 10 on the right with the title "Tortuous Esophagus".
To generate per-phase precision, recall, F1 score, and prevalence statistics (Table 3 in the paper) and recall/precision confusion matrices (seen in POEMNet's Fig 4), follow the below steps:
- Create a directory of TSVs, with one per model result on a test set video, as detailed above. For example, create a `video_tsvs` directory.

- Make a directory to store the per-phase and confusion matrix results:

  ```
  mkdir results_dir
  ```

- Generate model metrics and confusion matrices:

  ```
  model_metrics.R -c -o results_dir/ -f phase_factors.csv video_tsvs/
  ```

  Of note, in the paper we used a width of 12 and a height of 8, which you can specify as a command option (see output from `-h` for help).
`model_metrics.R` generates a number of files that analyze the model's performance on the test set. Assuming you followed the above, `results_dir/` will contain, for each video and for all videos combined, the following files (labeled either `combine_foo` or `video_NN_foo`), where `foo` can be:
- `raw.tsv`: Raw results from analysis with `caret` passed through `broom::tidy()`. This includes overall accuracy (with CI) and per-phase recall, precision, etc.
- `perclass.tsv`: A TSV that shows, per-phase and overall, the model's precision, recall, F1 score, and prevalence. The one for combined is identical to Table 3 in the paper.
- `confusion.tsv`: A TSV that puts the caret results into a "tidy" format from which to generate the plots. It includes both the number and proportion of phases classified by the model as each phase.
- `precision.png`: If specified to generate, an image of the precision confusion matrix.
- `recall.png`: If specified to generate, an image of the recall confusion matrix. The one for combined is identical to the confusion matrix in Fig 4 of the paper.
We noticed that our model had lower performance on the shorter phases. To convey this clearly, we created a script, `per_block_duration_accuracy.R`, that generates accuracy statistics. It examines the model's accuracy across blocks of different lengths. A block is a continuous phase in the surgery that is bounded by different phases. For example, if a surgeon "injects the submucosa," then "does a mucosotomy," notices that they need more injection, and so "injects the submucosa" again, this counts as three different blocks. This allows us to analyze short continuous segments, even if they belong to a typically long phase.
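For concreteness, here is a minimal sketch of the block idea (an illustration, not the R script itself): consecutive seconds with the same ground-truth label share a block number, and every transition starts a new block:

```python
# Illustrative block numbering: a new block begins at every phase transition.
def block_numbers(gt_labels):
    blocks, current = [], 0
    for i, label in enumerate(gt_labels):
        if i > 0 and label != gt_labels[i - 1]:
            current += 1
        blocks.append(current)
    return blocks

gt = ["submuc_inject", "submuc_inject", "muc_incis", "submuc_inject"]
print(block_numbers(gt))  # [0, 0, 1, 2]: three blocks, two of the same phase
```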
To analyze different block lengths (eg from 1-30 seconds, 31-60 seconds, etc), you need to create a CSV file that holds your preferred block lengths, with the following structure:
variable | class | description |
---|---|---|
start | double | Start time, in seconds, for the current interval |
end | double | End time, in seconds, for the current interval. To specify until the end of the video, use 'Inf', which stands for infinity in R |
An example `durations.csv` exists in `examples/`.
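A `durations.csv` that mirrors the intervals in the example output further below would look like this (the shipped file in `examples/` is authoritative):

```
start,end
1,30
31,60
61,300
301,600
600,Inf
```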
To generate per-duration model accuracy statistics (as seen in Table 4 of the Results section of the POEMNet paper), do the following:
- Create a directory of TSVs, with one per model result on a test set video, as detailed previously. For example, create a `video_tsvs` directory.

- Make a `durations.csv` as above

- Run `per_block_duration_accuracy.R`:

  ```
  per_block_duration_accuracy.R -i durations.csv \
      -o per_block_duration_accuracy.tsv video_tsvs
  ```
The output file `per_block_duration_accuracy.tsv` will contain:
```
start end accuracy
1 30 0.4178082191780822
31 60 0.6431818181818182
61 300 0.7601941747572816
301 600 0.9359582542694497
600 Inf 0.8777989802916009
```
This is identical to the Table 4 results for POEMNet in the paper! If you also want to generate the results after HMM post-processing, you will need to generate another TSV directory for the `lstm_hmm` model and then run `per_block_duration_accuracy.R` on that directory as well.
Please contact me via the communication medium of your preference, as listed on my Contact page.
If you found the code helpful for your research, please cite our paper:
```bibtex
@article{wardAutomatedOperativePhase2020,
  title = {Automated Operative Phase Identification in Peroral Endoscopic Myotomy},
  author = {Ward, Thomas M. and Hashimoto, Daniel A. and Ban, Yutong and Rattner, David W. and Inoue, Haruhiro and Lillemoe, Keith D. and Rus, Daniela L. and Rosman, Guy and Meireles, Ozanan R.},
  year = {2020},
  month = jul,
  issn = {1432-2218},
  doi = {10.1007/s00464-020-07833-9},
  journal = {Surgical Endoscopy},
  language = {en}
}
```