diff --git a/visualize_catalog_content.qmd b/visualize_catalog_content.qmd
index 78065c1..2f3641a 100644
--- a/visualize_catalog_content.qmd
+++ b/visualize_catalog_content.qmd
@@ -1,6 +1,5 @@
 ---
-title: "What is in HTR-United Catalog?"
-author: "Alix Chagué"
+title: "Plotting HTR-United's catalog"
 format:
   html:
     code-fold: true
@@ -9,6 +8,10 @@ format:
     toc-title: "Contents"
     toc-location: left
     theme: lumen
+    other-links:
+      - text: HTR-United's website
+        href: https://htr-united.github.io
+    code-links: repo
 project:
   type: default
   output-dir: ./computed
@@ -16,11 +19,15 @@ jupyter: python3
 ---
 
 
-# HTR-United
+This page offers an overview of the latest content of the [HTR-United catalog](http://htr-united.github.io). The visualizations are normally updated frequently. Feel free to check that the HTR-United's catalog version listed below does indeed correspond to the latest version available for the catalog ([here](https://github.com/HTR-United/htr-united)).
 
-```txt
-TODO: Paragraphs introducing HTR-United in a few words.
-```
+### What is HTR-United?
+
+HTR-United is a catalog that lists highly documented training datasets used for automatic transcription or segmentation models. HTR-United standardizes dataset descriptions using a schema, offers guidelines for organizing data repositories, and provides tools for quality control and continuous documentation. It's an open and transparent ecosystem hosted on GitHub, designed for easy maintenance. HTR-United was created to help projects quickly access diverse ground truth data for training models on smaller collections.
+
+### What is shown here?
+
+This page is only dedicated to a generic oversight of the content of the catalog, mainly in the form of plots. If you want to browse the datasets listed in the catalog, there is a more suitable interface for that [here](https://htr-united.github.io/catalog.html).
 
 ```{python}
 
@@ -101,7 +108,7 @@ if r.status_code == 200:
         # now let's download the yaml file
         with open("htr-united.yml", "w", encoding="utf8") as fh:
             fh.write(r_yml.text)
-        print("We are currently computing the content of HTR-United's catalog", htr_united_version)
+        print("We are currently computing \nthe content of HTR-United's \ncatalog", htr_united_version)
     else:
         print("Couldn't connect to", github_url, "got status code", r_yml.status_code)
 
@@ -122,6 +129,8 @@ if os.path.exists(yaml_file_path):
 ## Language coverage
 
 ```{python}
+#| fig-cap: "A bar-plot figuring the different languages covered in HTR-United's catalog"
+
 languages = []
 for entry in data:
     if entry.get("language"):
@@ -140,6 +149,8 @@ make_bar_plot(counted_lgges, title='Language Distribution', xlabel="Languages")
 ## Script coverage
 
 ```{python}
+#| fig-cap: "A bar-plot figuring the different writing systems (scripts) covered in HTR-United's catalog"
+
 scripts_dict = []
 for entry in data:
     if entry.get("script"):
@@ -159,19 +170,23 @@ for cs in counted_scripts.most_common(5):
 make_bar_plot(counted_scripts, title='Script Distribution', xlabel="Scripts")
 ```
 
-## Script type coverage
+## Writing type type coverage
 
 ```{python}
+#| fig-cap: "A bar-plot figuring the different writing categories covered in HTR-United's catalog ('script-type')"
+
 script_types = [entry.get("script-type") for entry in data if entry.get("script-type")]
 
 counted_script_types = Counter(script_types)
 pprint(counted_script_types)
-make_bar_plot(counted_script_types, title='Script type Distribution', xlabel="Script Type")
+make_bar_plot(counted_script_types, title='Writing type Distribution', xlabel="Writing Type")
 ```
 
 ## Software variety
 
 ```{python}
+#| fig-cap: "A bar-plot figuring the different software used to create the datasets listed in HTR-United's catalog"
+
 softwares = [entry.get("production-software") for entry in data if entry.get("production-software")]
 
 counted_softwares = Counter(softwares)