- FIX: Fixed `hyperglot-data` error
- DATA: All language yaml documents now have their `contributors` listed, some have `reviewers` listed
- DATA: Massive improvement of language `sources` with proper source citations where possible
- DATA: Added `punctuation`, `numerals` and `currency` attributes to orthographies - checking for these attributes will be added in the next update!
- DATA: Added `lib/hyperglot/extra_data/default.yaml` to include inheritable defaults per script
- DATA: Refined `jpn`, `ryu` and `ain` Katakana orthographies
- FEATURE: Orthography attributes can inherit from other languages with `<iso>` syntax, see README
- TWEAK: Improved loading time for repeat access by saving parsed language data cache file
- TWEAK: Orthographies can no longer have an `inherit` attribute
- TWEAK: Improved loading speed for repeat queries and individual language queries
- TWEAK: Refactored `Languages`, `Language` and `Orthography` object instantiation to always return parsed and defaulted nested objects
- TWEAK: Removed the `--speakers` and `--autonym` CLI options
- TWEAK: Removed the `--comparison` CLI option (see `examples` instead)
- TWEAK: Removed the `--languages` CLI option, use `hyperglot-info LanguageName/ISO` instead
- TWEAK: Removed the `--strict_iso` CLI option; use the python library to access this option, particularly `Language.get_name(script, strict_iso=True)`
- FIX: Fixed an issue where trying to log missing shaping glyphs would crash in `FontChecker`
- FIX: Improved mark shaping detection to interpret ccmp substitutions of base + mark as correctly shaping (thanks @arialcrime)
- TWEAK: Cleaned up `hyperglot.language.Language` class and added attribute properties for dict properties with computed defaults (as opposed to writing defaults for missing attributes) as well as more code annotation
- TWEAK: `hyperglot.orthography.Orthography` object has `script_iso` attribute returning the mapped ISO 15924 script tag
- DATA: Added `lib/extra_data/script-names.yaml` with a list of all current Hyperglot scripts and a mapping to their ISO 15924 code equivalent
- DATA: Added di/tri-graphs to Czech and Hungarian orthographies and fixed their order
- DATA: Added Squamish (`squ`) (thanks @justinpenner)
- DATA: Unified "Geʽez" script with reversed comma, as opposed to previous mixed use of "Ge'ez/Fidel" and "Ge'ez"
- DATA: Amended spelling "Tai Viet" script in title case to match other script names
- DATA: Corrected spelling of "Bamum" script and language (instead of less used "Bamun" used in Hyperglot)
- DATA: Use "Coptic" instead of "Coptic/Numbian" script name
- DATA: Use "Burmese" script for language "Mon"
- DATA: Use "Baybayin" script name instead of "Tagalog (Baybayin, Alibata)"
- DATA: Fixed Toki Pona (`tok`) file name
- TWEAK: Make sure `Orthography.base_chars` and `Orthography.aux_chars` return no duplicates for decomposed character sequences
- TWEAK: Define `Languages`, `Language` and `Orthography` as module top level exports for easier importing, e.g. now: `from hyperglot import Language`
- FIX: Set correct default values for `Language.status` and `Orthography.preferred_as_group` and provide validation and tests for these.
- TWEAK: Deprecated plain list `SUPPORTLEVELS, VALIDITYLEVELS, STATUSES, ORTHOGRAPHY_STATUSES` and replaced them with `SupportLevel, LanguageValidity, LanguageStatus, OrthographyStatus` enums throughout the code base. The deprecated values will be removed in the next minor version.
- TESTS: Added simple tox config for running test on all supported minor python versions
- FIX: Fixed type hinting issue causing failure on python 3.8.x
- DATA: Added Banjar (`bjn`) (thanks @mahalisyarifuddin)
- DATA: Expanded Xavánte (`xav`) data (thanks @moyogo)
- DATA: Refined Romanian by adding `design_alternates` explicitly
- DATA: Refined Klingon (`tlh`) orthography and added a draft version of Toki Pona (`tok`)
- FEATURE: Implemented shaping checks for mark positioning when required by unencoded base + mark combinations or `--decompose`
- FEATURE: Implemented shaping checks for connecting scripts to detect presence of required positional forms
- FEATURE: Implemented `hyperglot-report` command with same options as `hyperglot` and additional `--report-missing n`, `--report-marks n` and `--report-joining n` (or `--report-all n` to toggle all aforementioned) parameters/flags for outputting languages almost supported by the font
- TWEAK: Support checking is now done via `hyperglot.checker` objects for cleaner separation between language data and checking fonts
- TWEAK: Various python APIs and objects changed and refactored
- TWEAK: Bumped required python version to 3.8.0
- DATA: Added Tlingit `tli` language data (thanks @jcrippen)
- DATA: Fixed inconsistent note about `Ŋ` in various languages (thanks @moyogo)
- TWEAK: Improved `hyperglot-validate` to spot lookalike characters in the wrong script, e.g. `a` (Latin U+0061) vs `а` (Cyrillic U+0430)
- TWEAK: Explicitly ignore non-yaml files (e.g. operating system or other) in the data when parsing
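
The lookalike check mentioned above can be illustrated with a small sketch. This is not the `hyperglot-validate` implementation; it only assumes that the first word of a character's Unicode name is a usable stand-in for its script, and the helper names `script_of` and `find_foreign_letters` are hypothetical.

```python
import unicodedata


def script_of(char):
    """Rough script guess: the first word of the Unicode character name."""
    try:
        return unicodedata.name(char).split(" ")[0]
    except ValueError:
        return None  # code points without a name, e.g. control characters


def find_foreign_letters(text, expected_script):
    """Return (char, code point, script) for letters outside expected_script."""
    hits = []
    for char in text:
        if not char.isalpha():
            continue
        script = script_of(char)
        if script is not None and script != expected_script:
            hits.append((char, f"U+{ord(char):04X}", script))
    return hits


# The middle letter is Cyrillic U+0430, which looks like Latin U+0061.
suspicious = find_foreign_letters("b\u0430c", "LATIN")
print(suspicious)
```

A real validator would more likely rely on proper Unicode script properties; the name prefix is only a convenient approximation available in the standard library.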
- TWEAK: Improved `hyperglot-validate` command to better catch yaml issues (thanks for reporting @jcrippen)
- DATA: Removed orthography status `deprecated` and using `historical` for those instances
- DATA: Added Ethiopic languages `awn`, `byn`, `gez`, `har`, `sgw`, `tig`, `xan` and updated `tir` (thanks @dyacob and @NeilSureshPatel)
- DATA: Added Avestan
- DATA: Corrections to `jbo` (thanks @berrymot)
- DATA: Updated `sco` primary orthography (thanks @moyogo)
- DATA: Some fixes to `kkj` orthography (thanks @moyogo)
- DATA: Small note added to `Dagbani` (thanks @clauseggers and @moyogo)
- DATA: Fix to Shan (`shn`) containing some stray Latin characters
- FIX: Fix issue with file name conflicts on Windows systems
- FIX: Fix pypi missing data files
- FEATURE: Added `-l/--language` flag to show supported/not supported glyphs of a font for specific languages
- DATA: Restructured `hyperglot.yaml` into individual files for each language in `hyperglot/data/xxx/xxx.yaml`
- DATA: Fix two auxiliary glyphs in Georgian which were swapped uppercase / lowercase by mistake
- DATA: Small charset fixes to Kom `bkm` and Southern Samo `sbd` (thanks @moyogo)
- DATA: Small tweak to Afrikaans `afr` (thanks @iandoug)
- DATA: Added languages and scripts for: Ainu, Akkadian, Ancient Egyptian, Mycenaean Greek, Linear A, Linear B, Minoan, Pontic Greek, Okinawan, Sumerian, Klingon, Minaen, Hadramautic, Qatabanian and Sabaean (big thanks to @gusbemacbe !)
- DATA: Added Kayah autonym
- DATA: Added design requirement note for `Ŋ`
- DATA: Improved Georgian, added Mtavruli and auxiliary
- DATA: Added historical orthographies for German and English that use `ſ`
- FIX: Fixed orthography of Thai to not require `◌̍ ◌̎` in base checks
- FIX: Fixed missing script attribute in 'lee' orthography
- FIX: Fixed typo in 'Oriya' script name
- FEATURE: Implemented `hyperglot-data` CLI command to search and display language information returned by Hyperglot
- FEATURE: Implemented more convenient language access via attributes on `hyperglot.languages.Languages`, e.g. `Languages().eng` to access a `hyperglot.language.Language` object for "eng"
- DATA: Fix in Standard Malay encoding of `'` (thanks M. Mahali Syarifuddin and Caleb Maclennan)
- DATA: Added numerous Burkina Faso and other African languages (another huge thanks to @moyogo !)
- DATA: Added Oriya
- DATA: Added Kartvelian languages (kat, sva, xmf, lzz) (thanks Ana)
- DATA: Dozens of African and North-American languages added and refined (thanks @moyogo !)
- DATA: Refined English `auxiliary`
- DATA: Fix for Pinyin
- CLI: Introduced `--sort` (`alphabetic`, default, or `speakers`) and `--sort-dir` (`asc`, default, or `desc`)
- DATA: Fix for Skolt Sami (soft sign)
- DATA: Fix for Hawaiian (okina)
- DATA: Fix for Thai including several missing marks and letters
- DATA: Fix in Buginese
- DATA: Updates to Indonesian and Standard Malay
- DATA: Fix for Turkish orthography
- DATA: Fix for Afrikaans orthography
- DATA: Corrected ISO code for Gen language
- DATA: Added Benin languages
- DATA: Small fix to Portuguese
- DATA: Revised Tamil orthography
- DATA: Added Apinayé, Karo and Awetí languages
- FIX: Fixed an encoding issue affecting Windows environments
- DATA: Fixed typos in Buginese
- DATA: Reviewed Minangkabau orthography
- DATA: Added Batak languages and refined Balinese
- FIX: Further improvement to detection of orthographies with unencoded base + mark combinations
- TWEAK: Refined the returned properties of `hyperglot.language.Orthography` to include base and auxiliary lists of encoded characters as well as required marks for
- TOOLS: Added scraper for fetching a mapping of Opentype language systems to ISO codes and saving them in `other/languagesystems.yaml`
- DATA: Renamed `design_note` to `design_requirements` and made its data structure a list
- DATA: Introduced `design_alternates` - a list of characters which may require special design in a font supporting an orthography
- DATA: Added `design_alternates` for several Cyrillic and Latin languages
- DATA: Corrected speaker count for Manipuri
- DATA: Updates to Andaandi and Old Nubian
- DATA: Minor formatting and duplicate fixes
- FIX: Fixed a parsing issue that caused some languages to require marks for support as if the `--marks` flag was used
- TWEAK: `hyperglot.language.Language` no longer prunes or parses any character lists; this is instead done when running the support checks by instantiating an `Orthography` object and using it for checking, leaving the dict representation of the yaml data in the `Language` untouched
- FEATURE: Introduced `hyperglot.language.Orthography` abstraction for easier access of parsed lists vs yaml raw character strings
- TESTS: More refactored Languages, Language and new Orthography tests
- DATA: Changed the way `marks` and decomposition are handled in the data entry and saving
- DATA: `base` and `auxiliary` may now contain unencoded base + mark character combinations without those getting decomposed on saving
- DATA: Updated approximately 50-100 languages which previously had unencoded base + mark combinations not saved in their character sets, since those were not unicode characters - this update added and retains those unencoded combinations for more comprehensive listing of the orthographies
- DATA: Marks are now always placed on `◌` in the data for easier readability
- CLI: Default checking (without `-m`) no longer requires implicit combining marks, meaning those which are retrieved from decomposing the characters - the default check will still require those marks, which are explicitly listed in `marks` and are not the result of decomposing the characters
- CLI: Introduced `-m/--marks` as a flag to require all marks for a support level check
- CLI: Changed `-m/--mode` to `-c/--comparison`
- TWEAK: Removed `hyperglot.parse.prune_superflous_marks` as no longer needed
- TWEAK: Introduced `hyperglot.parse.parse_marks`
- TWEAK: Removed `prune` and `pruneRetainDecomposed` flags from `Languages()` and changed default call to `Languages()` to no longer prune or parse its dict contents
- TWEAK: Only calls to `Language()` now parse the orthography data (with default `True` for argument `parse`)
- TWEAK: Renamed methods `hyperglot.languages.get_support_from_chars` to `supported` and `hyperglot.languages.has_support` to `supported`
- TWEAK: Added warnings and validation checks for multiple inheritance levels (e.g. A inherits from B inherits from C should instead be A inherits from C)
- DATA: Updated Ter Sami orthography as inheriting from Kildin Sami
- DATA: Fixes to Kildin Sami
- DATA: Some fixes to Marshallese
- DATA: Added Ottoman Turkish and a transliteration orthography for it
- DATA: Added Hanunoo
- DATA: Replaced Single right comma (and other variants) with Modifier letter apostrophe for some Sami languages
- FIX: Fixed an issue that prevented some fonts from being parsed (#24)
- TWEAK: Allow inheriting an orthography without explicitly having a script present in the orthography, this will inherit the primary script orthography of the parent
- DATA: Updated language data for Nubian languages and Japanese
- DATA: Introduced `transliteration` orthography status (started in 0.2.10)
- DATA: Updated language data for Minang (xrg), Tamil (tam), Cherokee (chr), Tagalog (tgl), Aja (ajg), Khmer (khm), Madurese (mad), Javanese (jav) and others
- FIX: Reverted hotfix from 0.2.9 and implemented validation to use iso yaml file only for editable package installs and emit warning
- FIX: Refined `--decompose` and fixed an issue where the decompose option ended up returning more stringent matches than the default
- FIX: `--output` output refactored to no longer expect the result to be structured by support levels
- TWEAK: Refactored multiple file input result intersection and union
- TESTS: Better tests relating to decomposed output
- TESTS: Added tests for multiple file input intersection and union results
- HOTFIX: Prevent error message about missing file in CLI use
- FIX: Fixed inheritance when it chains, e.g. Algerian Arabic inheriting from Tunisian Arabic which inherits from Standard Arabic
- FIX: Fixed inheritance missing `marks`, `design_notes` and `note`
- TWEAK: Make sure `marks` are saved in ordered form, so saving does not arbitrarily alter the order
- TESTS: Added tests for orthography inheritance
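
As an illustration of chained inheritance, here is a minimal sketch. It is not Hyperglot's resolver: the function `resolve`, the flat dict layout and the language keys are all hypothetical, and real orthography data is nested per language and script.

```python
def resolve(iso, data, seen=()):
    """Return the attributes for iso, following chained `inherit` entries."""
    if iso in seen:
        raise ValueError("circular inheritance: " + " -> ".join(seen + (iso,)))
    entry = dict(data[iso])
    parent = entry.pop("inherit", None)
    if parent is None:
        return entry
    resolved = resolve(parent, data, seen + (iso,))
    resolved.update(entry)  # the child's own attributes win over inherited ones
    return resolved


# Toy data (hypothetical keys): "leaf" inherits from "mid", which inherits from "std".
data = {
    "std": {"base": "a b c", "marks": "\u25cc\u0301", "note": "standard"},
    "mid": {"inherit": "std", "note": "middle"},
    "leaf": {"inherit": "mid"},
}

leaf = resolve("leaf", data)
print(leaf)
```

Because each level is resolved before the child's own attributes are applied, fields such as `marks` and `note` survive across more than one inheritance step.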
- DATA: Constrained speaker counts to integers only
- DATA: Fixed various speaker counts containing malformed data
- DATA: More design notes for Latin-script languages
- DATA: Khmer added as draft, Armenian, Buginese, Georgian, Burmese, Lao and Thai refined
- TWEAK: Implemented validation for speaker count data
- DATA: Various status updates, notes and reviewed orthographies
- DATA: Introduced `marks` attribute containing all combining marks needed for an orthography
- FEATURE: Automatically extract and save `marks` from `base` data, plus retain any explicitly added `marks` in the data
- TWEAK: For default `hyperglot-save` calls automatically run validation to flag any remaining issues
- TWEAK: Flag legacy marks being used in charset data
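
Extracting marks from `base` data can be approximated with Unicode canonical decomposition. The sketch below is an assumption about the general technique, not the code `hyperglot-save` runs; `extract_marks` is a hypothetical helper.

```python
import unicodedata


def extract_marks(chars):
    """Collect the combining marks found by decomposing each character (NFD)."""
    marks = []
    for char in chars:
        for part in unicodedata.normalize("NFD", char):
            # Unicode general categories Mn, Mc and Me are all marks.
            if unicodedata.category(part).startswith("M") and part not in marks:
                marks.append(part)
    return marks


# a, a-acute, c-caron, e, e-acute
base = ["a", "\u00e1", "\u010d", "e", "\u00e9"]
marks = extract_marks(base)

# Show each mark on a dotted circle, as the data does for readability.
print(" ".join("\u25cc" + mark for mark in marks))
```

Marks that never occur in a precomposed character cannot be found this way, which is presumably why explicitly added `marks` are retained alongside the extracted ones.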
- DATA: Introduced `design_note` parameter
- DATA: Various language data updates and smaller fixes
- DATA: Several orthography fixes, thanks Denis Moyogo Jacquerye
- TWEAK: Changed orthography status names to `todo, draft, preliminary, verified`
- TWEAK: Improved `Language.get_orthography` to return better default picks and allow getting orthographies of specific script or status
- First `pip` release :)
- FEATURE: Implemented `--include-all-orthographies` to check all but `deprecated` orthographies and changed default behaviour to only list `primary` orthographies
- TWEAK: Implemented treating orthographies with `preferred_as_group` as one for checks
- TWEAK: Languages with multiple `primary` orthographies will match if one is supported
- TWEAK: `Languages` can be initiated with `pruneRetainDecomposed` to keep any precomposed characters from the database when using `prune` (which decomposes them to base + mark)
- TWEAK: Improved tests for CLI and improved and fixed some parsing tests
- FIX: Marginal cases fixed where using `parse_chars` and already parsed lists would merge a mark with a preceding base glyph and result in an erroneous list of base/aux characters
- DATA: Added uppercase to bicameral scripts
- DATA: All languages now have a `primary` orthography
- DATA: Introduced `preferred_as_group` orthography attribute
- TESTS: Config to ignore other library's warnings
- TWEAK: `Languages()` now takes a `validity` argument to filter by validity ('weak' or better by default)
- TWEAK: `parse_chars` now will put decomposition components on in the input list to the end of the list
- TWEAK: Languages require an orthography that has status `primary`
- DATA: Updated and added many scripts and languages and their speaker counts
- FEATURE: Added `--decomposed` flag that determines if a font is required to have all glyphs of a language as code points, or if supporting all combining marks is sufficient
- TWEAK: Renamed module and database to `hyperglot`
- TWEAK: `--strict-support` refactored to `--validity` with default `weak` to pick the level of required validity on the languages that should get matched
- TWEAK: Saving and validating enforces removal of superfluous mark characters that are getting implicitly extracted via glyph decomposition
- TWEAK: Detection automatically extracts all required mark glyphs for languages and the database has been pruned of any no longer required mark glyphs listed. Using `hyperglot-save` will apply this pruning and save the database in its cleaned up state
- TESTS: Added tests for the Language and Languages class
- TESTS: Added test for the CLI options running against actual font files
- DOCS: Overhauled and updated the README to all latest changes
- FIX: Refined character parsing to also include the encoded form of any decomposable glyphs
- FIX: Improved character set parsing from database properly decomposing any combining characters into their parts and checking against those
- TESTS: Added first pytest for above case
- FEATURE: Added `--strict-support` flag (default False) to explicitly trigger warning about languages with unconfirmed status. Since those languages have well researched charset information but just have not been confirmed by several expert sources we still want to include them in the count. Using `--strict-support` excludes (but lists separately) all those languages which we have not been able to confirm
- TWEAK: Renamed `--strict` flag to `--strict-iso` to be more descriptive
- TWEAK: Database file linking, one more time... as per 0.1.10
- TWEAK: Added validation check to prevent non-space separators in character list data
- FEATURE: Implemented `fontlang-export` CLI script to export the rosetta.yaml with expanded inherits to a file, usage: `$ fontlang-export thefile.yaml`
- FIX: Refactored `setup.py` to include the database file relative to the package
- FIX: "Inverted" the `preferred_as_individual` outcome, e.g. those languages should suppress any included languages from being listed and be listed as one language instead
- FIX: Made sure `preferred_as_individual` in fact also removes the language that is being inherited from the matches
- TWEAK: Update `fontlang-save` and sorting to not include inherited attributes
- TWEAK: Updated and fixed validation for `status` attributes
- FEATURE: Implemented support for the `preferred_as_individual` attribute on macro languages
- FEATURE: Added `--strict` flag to display language names and macrolanguages as per ISO data
- TWEAK: Implemented orthography attribute `inherit` to inherit another language's orthography for that `script` (if one exists)
- FIX: Language names with countries in brackets no longer have their closing parenthesis cut off
- TWEAK: Updated `fontlang-validate` to spec
- FIX: More robust relative file path loading for database file
- TWEAK: `-o` output is now of same structure for single file input, and indexed by file name for several file input
- TWEAK: `-o` filters the languages' orthographies to only supported ones
- TWEAK: Added validation check to confirm orthographies have a 'script'
- TWEAK: Refactored validation script to `fontlang-validate` CLI command
- TWEAK: Languages without orthographies that are included in macrolanguages that do have orthographies silently inherit the macrolanguage's orthographies
- FEATURE: Added `fontlang-save` CLI command to re-save the `rosetta.yaml` sorted alphabetically
- FEATURE: Added `--include-historical` and `--include-constructed` flags to include those languages in results
- FEATURE: Added `--version` and `--verbose` flags
- FEATURE: Added `-m` option ('individual', 'union', 'intersection') to compute a support comparison of several passed in fonts
- TWEAK: You can now pass in any number of font paths. By default one after the other is analyzed
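
The three comparison modes can be sketched as set operations on the fonts' character sets. The changelog does not say how the modes are computed, so this is one plausible reading rather than the tool's implementation; `supported`, `compare`, the font names and the language keys are all hypothetical.

```python
def supported(required, available):
    """Languages whose required characters are all present in `available`."""
    return {lang for lang, chars in required.items() if chars <= available}


def compare(fonts, required, mode="individual"):
    """Compare language support across several fonts' character sets."""
    if mode == "individual":
        return {name: supported(required, chars) for name, chars in fonts.items()}
    charsets = list(fonts.values())
    if mode == "union":
        pooled = set().union(*charsets)  # characters found in any font
    elif mode == "intersection":
        pooled = set.intersection(*charsets) if charsets else set()  # in every font
    else:
        raise ValueError(f"unknown mode: {mode}")
    return supported(required, pooled)


# Toy data, not real Hyperglot character sets.
required = {"lang_a": set("abc"), "lang_b": set("abc\u00e4")}
fonts = {"one.otf": set("abcd"), "two.otf": set("ab\u00e4")}

print(compare(fonts, required, "individual"))
print(compare(fonts, required, "union"))
print(compare(fonts, required, "intersection"))
```

In this reading, `union` can report languages that no single font covers on its own, while `intersection` reports only languages every font covers.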
- TWEAK: Make sure to print `preferred_name` if available
- TWEAK: Merged `fontlang` with Rosetta Language DB repo
- TWEAK: Updated data structure in YAML and added `Language` class for convenience
- FEATURE: `-o` flag to specify an output yaml file path
- FEATURE: `-n` flag to display language names in native spelling (where available)
- FEATURE: `-u` flag to display language users (if available)
- TWEAK: Updated `rosetta.yaml` language database
- FEATURE: `-s` flag with "base" or "aux" values to set support level to check
- TWEAK: Support output sorted by scripts and language DB status (informs about not "done" langs being matched)
- TWEAK: Added basic font validation for the passed in file path
- TWEAK: Fixed relative imports and cli usage for dev
- FIX: Language database typo fix
- FEATURE: MVP with basic `$ fontlang path/to/font` command