adding missing explanations and legends to the figures
alix-tz committed Aug 22, 2024
1 parent 4ed5c6e commit 117a6bb
Showing 1 changed file with 24 additions and 9 deletions.
33 changes: 24 additions & 9 deletions visualize_catalog_content.qmd
@@ -1,6 +1,5 @@
---
title: "What is in HTR-United Catalog?"
author: "Alix Chagué"
title: "Plotting HTR-United's catalog"
format:
html:
code-fold: true
@@ -9,18 +8,26 @@ format:
toc-title: "Contents"
toc-location: left
theme: lumen
other-links:
- text: HTR-United's website
href: https://htr-united.github.io
code-links: repo
project:
type: default
output-dir: ./computed
jupyter: python3
---


# HTR-United
This page offers an overview of the latest content of the [HTR-United catalog](http://htr-united.github.io). The visualizations are updated regularly. Feel free to check that the catalog version listed below matches the latest version available [here](https://github.com/HTR-United/htr-united).

```txt
TODO: Paragraphs introducing HTR-United in a few words.
```
### What is HTR-United?

HTR-United is a catalog that lists highly documented training datasets used for automatic transcription or segmentation models. HTR-United standardizes dataset descriptions using a schema, offers guidelines for organizing data repositories, and provides tools for quality control and continuous documentation. It's an open and transparent ecosystem hosted on GitHub, designed for easy maintenance. HTR-United was created to help projects quickly access diverse ground truth data for training models on smaller collections.

### What is shown here?

This page only provides a general overview of the content of the catalog, mainly in the form of plots. To browse the datasets listed in the catalog, a more suitable interface is available [here](https://htr-united.github.io/catalog.html).

<!-- library import -->
```{python}
@@ -101,7 +108,7 @@ if r.status_code == 200:
# now let's download the yaml file
with open("htr-united.yml", "w", encoding="utf8") as fh:
fh.write(r_yml.text)
print("We are currently computing the content of HTR-United's catalog", htr_united_version)
print("We are currently computing \nthe content of HTR-United's \ncatalog", htr_united_version)
else:
print("Couldn't connect to", github_url, "got status code", r_yml.status_code)
@@ -122,6 +129,8 @@ if os.path.exists(yaml_file_path):
## Language coverage

```{python}
#| fig-cap: "A bar-plot figuring the different languages covered in HTR-United's catalog"
languages = []
for entry in data:
if entry.get("language"):
@@ -140,6 +149,8 @@ make_bar_plot(counted_lgges, title='Language Distribution', xlabel="Languages")
## Script coverage

```{python}
#| fig-cap: "A bar-plot figuring the different writing systems (scripts) covered in HTR-United's catalog"
scripts_dict = []
for entry in data:
if entry.get("script"):
@@ -159,19 +170,23 @@ for cs in counted_scripts.most_common(5):
make_bar_plot(counted_scripts, title='Script Distribution', xlabel="Scripts")
```

## Script type coverage
## Writing type coverage

```{python}
#| fig-cap: "A bar-plot figuring the different writing categories covered in HTR-United's catalog ('script-type')"
script_types = [entry.get("script-type") for entry in data if entry.get("script-type")]
counted_script_types = Counter(script_types)
pprint(counted_script_types)
make_bar_plot(counted_script_types, title='Script type Distribution', xlabel="Script Type")
make_bar_plot(counted_script_types, title='Writing type Distribution', xlabel="Writing Type")
```
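
The counting step shared by these cells can be illustrated on its own. The following is a minimal sketch, not the notebook's actual code: the `sample_entries` list, its values, and the `count_field` helper are hypothetical stand-ins, and only the `script-type` key and the use of `Counter` come from the cell above.

```python
from collections import Counter

# Hypothetical sample standing in for the parsed catalog (`data` in the notebook).
# The field values are illustrative only.
sample_entries = [
    {"title": "A", "script-type": "only-manuscript"},
    {"title": "B", "script-type": "only-typed"},
    {"title": "C", "script-type": "only-manuscript"},
    {"title": "D"},  # entry without the field; it is skipped
]


def count_field(entries, key):
    """Count the values of `key` across entries, ignoring missing or empty values."""
    return Counter(entry[key] for entry in entries if entry.get(key))


counted = count_field(sample_entries, "script-type")
print(counted.most_common())
```

Filtering with `entry.get(key)` keeps entries that lack the field from raising a `KeyError` or adding a `None` bucket to the plot.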

## Software variety

```{python}
#| fig-cap: "A bar-plot figuring the different software used to create the datasets listed in HTR-United's catalog"
softwares = [entry.get("production-software") for entry in data if entry.get("production-software")]
counted_softwares = Counter(softwares)