From 104a7ad99a3cd5991a22ddb35a7298af9835760f Mon Sep 17 00:00:00 2001
From: Kaito Sugimoto
Date: Sun, 19 Nov 2023 17:50:10 +0900
Subject: [PATCH] add JapaneseSpokenLanguageBERT

---
 README.md    | 1 +
 README_en.md | 1 +
 README_fr.md | 1 +
 3 files changed, 3 insertions(+)

diff --git a/README.md b/README.md
index f23fa4e..d6cfb52 100644
--- a/README.md
+++ b/README.md
@@ -148,6 +148,7 @@
 | [Laboro BERT](https://laboro.ai/activity/column/engineer/laboro-bert/) | BERT (base, large) | 日本語 Web コーパス<br>(ニュースサイトやブログなど<br>計4,307のWebサイト、2,605,280ページ (12GB)) | Laboro.AI | CC BY-NC 4.0 | ✕ |
 | [Laboro DistilBERT](https://laboro.ai/activity/column/engineer/laboro-distilbert/) | DistilBERT | - (Laboro BERT(base) を親モデルとして知識蒸留)| Laboro.AI | CC BY-NC 4.0 | [◯](https://huggingface.co/laboro-ai/distilbert-base-japanese) |
 | [日本語ブログELECTRA](https://www.anlp.jp/proceedings/annual_meeting/2022/pdf_dir/E2-5.pdf) | ELECTRA (small) | 日本語ブログコーパス(3億5,400万文) | 北見工大 桝井・プタシンスキ研 | CC BY-SA 4.0 | [◯](https://huggingface.co/ptaszynski/yacis-electra-small-japanese) |
+| [日本語話し言葉BERT](https://tech.retrieva.jp/entry/2021/04/01/114943) | BERT (base) | 東北大BERTに対して日本語話し言葉コーパス(CSJ)を用いて追加学習<br>(DAPTモデルでは国会議事録データも使用) | レトリバ | Apache 2.0 | [◯](https://huggingface.co/retrieva-jp/japanese-spoken-language-bert) |
 | [日本語金融BERT](https://sites.google.com/socsim.org/izumi-lab/tools/language-model) | BERT (small, base) [^9] | 日本語 Wikipedia<br>+ 日本語金融コーパス (約2,700万文 (5.2GB)) | 東大 和泉・坂地研 | CC BY-SA 4.0 |◯ ([small](https://huggingface.co/izumi-lab/bert-small-japanese-fin), [base](https://huggingface.co/izumi-lab/bert-base-japanese-fin-additional)) |
 | [日本語金融ELECTRA](https://sites.google.com/socsim.org/izumi-lab/tools/language-model) | ELECTRA (small) | 日本語 Wikipedia (約2,000万文 (2.9GB))<br>+ 日本語金融コーパス (約2,700万文 (5.2GB)) | 東大 和泉・坂地研 | CC BY-SA 4.0 | [◯](https://huggingface.co/izumi-lab/electra-small-japanese-fin-discriminator) |
 | [UTH-BERT](https://ai-health.m.u-tokyo.ac.jp/home/research/uth-bert) | BERT (base) | 日本語診療記録(約1億2,000万行) | 東大病院<br>医療AI開発学講座 | CC BY-NC-SA 4.0 | △ |

diff --git a/README_en.md b/README_en.md
index 5201a31..99b094f 100644
--- a/README_en.md
+++ b/README_en.md
@@ -147,6 +147,7 @@ Please point out any errors on the [issues page](https://github.com/llm-jp/aweso
 | [Laboro BERT](https://laboro.ai/activity/column/engineer/laboro-bert/) | BERT (base, large) | Japanese Web Corpus<br>(News and blogs, etc) (12GB) | Laboro.AI | CC BY‑NC 4.0 | ✕ |
 | [Laboro DistilBERT](https://laboro.ai/activity/column/engineer/laboro-distilbert/) | DistilBERT | (Distillation of Laboro BERT(base)) | Laboro.AI | CC BY‑NC 4.0 | [◯](https://huggingface.co/laboro-ai/distilbert-base-japanese) |
 | [JapaneseBlogELECTRA](https://www.anlp.jp/proceedings/annual_meeting/2022/pdf_dir/E2-5.pdf) | ELECTRA (small) | Japanese Blog Corpus (354M sentences) | Kitami Institute of Technology Masui-Ptaszynski Lab | CC BY‑SA 4.0 | [◯](https://huggingface.co/ptaszynski/yacis-electra-small-japanese) |
+| [JapaneseSpokenLanguageBERT](https://huggingface.co/retrieva-jp/japanese-spoken-language-bert) | BERT (base) | Additional pre-training of TohokuUniversityBERT using the Corpus of Spontaneous Japanese (CSJ)<br>(the DAPT model also uses National Diet proceedings) | Retrieva | Apache 2.0 | [◯](https://huggingface.co/retrieva-jp/japanese-spoken-language-bert) |
 | [JapaneseFinancialBERT](https://sites.google.com/socsim.org/izumi-lab/tools/language-model) | BERT (small, base)[^9] | Japanese Wikipedia, Japanese Financial Corpus (27M sentences/5.2GB) | University of Tokyo Izumi-Sakaji Lab | CC BY‑SA 4.0 |◯<br>([small](https://huggingface.co/izumi-lab/bert-small-japanese-fin), [base](https://huggingface.co/izumi-lab/bert-base-japanese-fin-additional)) |
 | [JapaneseFinancialELECTRA](https://sites.google.com/socsim.org/izumi-lab/tools/language-model) | ELECTRA (small) | Japanese Wikipedia (20M sentences/2.9GB), Japanese Financial Corpus (27M sentences/5.2GB) | University of Tokyo Izumi-Sakaji Lab | CC BY‑SA 4.0 | [◯](https://huggingface.co/izumi-lab/electra-small-japanese-fin-discriminator) |
 | [UTH-BERT](https://ai-health.m.u-tokyo.ac.jp/home/research/uth-bert) | BERT (base) | Japanese Medical Records(120M lines) | University of Tokyo Hospital<br>Medical AI Development Course | CC BY‑NC‑SA 4.0 | △ |

diff --git a/README_fr.md b/README_fr.md
index 695a558..abc03ad 100644
--- a/README_fr.md
+++ b/README_fr.md
@@ -147,6 +147,7 @@ N'hésitez pas à signaler les erreurs sur la page [issues](https://github.com/l
 | [Laboro BERT](https://laboro.ai/activity/column/engineer/laboro-bert/) | BERT (base, large) | Corpus web en japonais<br>(Actualités, blogs, etc) (12GB) | Laboro.AI | CC BY‑NC 4.0 | ✕ |
 | [Laboro DistilBERT](https://laboro.ai/activity/column/engineer/laboro-distilbert/) | DistilBERT | (Distillation of Laboro BERT(base)) | Laboro.AI | CC BY‑NC 4.0 | [◯](https://huggingface.co/laboro-ai/distilbert-base-japanese) |
 | [JapaneseBlogELECTRA](https://www.anlp.jp/proceedings/annual_meeting/2022/pdf_dir/E2-5.pdf) | ELECTRA (small) | Corpus de blogs en japonais (354M sentences) | Université de technologie de Kitami - Laboratoire de Masui-Ptaszynski | CC BY‑SA 4.0 | [◯](https://huggingface.co/ptaszynski/yacis-electra-small-japanese) |
+| [JapaneseSpokenLanguageBERT](https://huggingface.co/retrieva-jp/japanese-spoken-language-bert) | BERT (base) | Pré-entraînement supplémentaire de TohokuUniversityBERT à l'aide du Corpus of Spontaneous Japanese (CSJ)<br>(le modèle DAPT utilise également les comptes rendus de la Diète nationale) | Retrieva | Apache 2.0 | [◯](https://huggingface.co/retrieva-jp/japanese-spoken-language-bert) |
 | [JapaneseFinancialBERT](https://sites.google.com/socsim.org/izumi-lab/tools/language-model) | BERT (small, base)[^9] | Wikipédia en japonais, Japanese Financial Corpus (27M sentences/5.2GB) | Université de Tokyo Izumi-Sakaji Lab | CC BY‑SA 4.0 |◯<br>([small](https://huggingface.co/izumi-lab/bert-small-japanese-fin), [base](https://huggingface.co/izumi-lab/bert-base-japanese-fin-additional)) |
 | [JapaneseFinancialELECTRA](https://sites.google.com/socsim.org/izumi-lab/tools/language-model) | ELECTRA (small) | Wikipédia en japonais (20M sentences/2.9GB), Japanese Financial Corpus (27M sentences/5.2GB) | Université de Tokyo Izumi-Sakaji Lab | CC BY‑SA 4.0 | [◯](https://huggingface.co/izumi-lab/electra-small-japanese-fin-discriminator) |
 | [UTH-BERT](https://ai-health.m.u-tokyo.ac.jp/home/research/uth-bert) | BERT (base) | Dossiers médicaux en japonais (120M lignes) | Université de Tokyo Hôpital<br>Cours de développement en IA pour la médecine | CC BY‑NC‑SA 4.0 | △ |
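
Since the new entry is marked ◯ (weights published on Hugging Face), a minimal usage sketch follows. It assumes the retrieva-jp/japanese-spoken-language-bert repository can be loaded through the standard `transformers` Auto classes; the actual checkpoint layout (e.g. whether the CSJ-only and DAPT variants live in subfolders) and the tokenizer dependencies should be confirmed on the model card before relying on it.

```python
# Minimal sketch: extract sentence features with the newly listed checkpoint.
# Assumption: the hub repo exposes a standard BERT checkpoint and tokenizer;
# since the model is additional pre-training of Tohoku BERT, the tokenizer
# may require the fugashi and unidic-lite packages to be installed.
import torch
from transformers import AutoModel, AutoTokenizer

model_id = "retrieva-jp/japanese-spoken-language-bert"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)
model.eval()

text = "えーっと、今日は話し言葉のモデルについて説明します。"
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Use the [CLS] vector as a simple sentence-level representation.
cls_embedding = outputs.last_hidden_state[:, 0]
print(cls_embedding.shape)  # torch.Size([1, 768]) for a BERT-base model
```

The same pattern should apply to the other ◯ entries in the table (e.g. laboro-ai/distilbert-base-japanese), modulo each model's own tokenizer requirements.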