The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.
- Add new `table` importer for CSV files.
- Allow to set the order of output columns in `table` export with the parameter `column_names` and to skip the column header with the `skip_header` param.
- `remove_match` option in `revise` now allows to delete the annotation but not the referenced node.
- Export tokens in `table` exporter instead of ignoring them. You can disable exporting the tokens with the `skip_token` parameter.
- Fixed `find_connected` calls with `Bound::Included(usize::MAX)`, which can lead to invalid results when using the linear graph storage. Replaced with the correct `Bound::Unbounded`.
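The following is only an illustrative sketch of why an inclusive bound of `usize::MAX` can misbehave; it is not the graphANNIS code. The helper names `naive_exclusive_limit` and `within_limit` are hypothetical, and the wrapping addition is an assumed failure mode, not a confirmed one.

```rust
use std::ops::Bound;

/// Hypothetical conversion of an upper bound on the path length into an
/// exclusive limit, as a range-based storage might need it.
/// `None` means "no limit".
fn naive_exclusive_limit(max: Bound<usize>) -> Option<usize> {
    match max {
        // `usize::MAX + 1` wraps around to 0 here, so the "largest possible"
        // inclusive bound silently turns into an empty range.
        Bound::Included(n) => Some(n.wrapping_add(1)),
        Bound::Excluded(n) => Some(n),
        Bound::Unbounded => None,
    }
}

/// Returns true if a path of length `len` is accepted under the bound.
fn within_limit(len: usize, max: Bound<usize>) -> bool {
    match naive_exclusive_limit(max) {
        Some(limit) => len < limit,
        None => true,
    }
}
```

Under these assumptions `within_limit(3, Bound::Included(usize::MAX))` is rejected while `within_limit(3, Bound::Unbounded)` is accepted, which is why expressing "no upper limit" as `Bound::Unbounded` is the safer choice.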
- Add `meta` as export format
- `xlsx` importer did not give the correct node name to segmentation tokens. Due to this inconsistency, span annotations on segmentation nodes were not connected to the segmentation token.
- unknown keys in toml configurations are now denied not only in config context, but globally in a workflow file
- `table` export has a feature to customize the n/a-value, which by default is the empty string
- Add `conllu` as export format
- import of `conllu` now supports enhanced dependencies
- Adds `saltxml` export format
- Adds `time` graph op to add or enrich time annotations
- The `table` exporter now supports the `id_column` parameter to enable/disable the ID column.
- Importers that map directories to (sub)-corpora and files to documents can now also import the corpus if the `path` argument points to a single file.
- `xlsx` importer now maps columns as spans if the column is not configured to be a `token_column`.
- `exmaralda` import now ranks order of tlis higher than sorting by time value (more compatible with modern EXMARaLDA files)
- `xlsx` importer will connect spans to their corresponding segmentation node with coverage edges instead of connecting them with the base tokens generated for the timeline items. Thus, the configured connection between spans and base text is not lost.
- `exmaralda` import keeps events with missing time values
- New command line argument `--in-memory` that has the same meaning as setting `ANNATTO_IN_MEMORY` to true but is easier to discover.
- `map` manipulator can now add annotated spans and copy values from existing annotations. The copied values can be manipulated using regular expressions and replacement values.
- Adds `saltxml` import format
- Adds `table` export format
- Adds `filter` graph op
- `table` export can include in- and outgoing edges
- Using the same type of manipulator in a workflow now shows the correct progress.
- Make the progress report in `revise` mode indeterminate because it is unclear how many operations are actually performed and finishing some steps instantly and others in minutes will cause extremely unrealistic time estimations.
- `table` export uses one additional criterion to identify timeline tokens: no outgoing coverage tokens (apart from being a member of Ordering/annis/)
- `revise` now offers to delete nodes that match a query using list entry `[[remove_match]]` with keys `query` and `remove`.
- internal changes in deserialization of annotation components and annotation keys (keys can be provided as string or in map notation), which changes the api and the way some workflow configurations are organised. Use `annatto info [module]` for more details. It does not affect behaviour once older workflows are adapted to the new interface.
- when an exmaralda file contains an empty url attribute for a referenced file, this does not raise a warning anymore, as this is the way exmaralda encodes it when no media file is used.
- `textgrid` export creates intervals from global xmin to global xmax for all tiers
- `textgrid` export did not generate intermediate empty intervals when xmax of an interval did not match xmin of the subsequent interval, which leads to hardly editable intervals in praat. This has been fixed.
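The two `textgrid` export entries above can be pictured with a minimal sketch. It assumes a tier is a list of intervals sorted by `xmin` without overlaps; the `Interval` type and the `fill_gaps` function are hypothetical names for illustration and are not taken from the annatto source.

```rust
#[derive(Debug, Clone, PartialEq)]
struct Interval {
    xmin: f64,
    xmax: f64,
    text: String,
}

/// Pads a tier with empty intervals so that it covers
/// [global_xmin, global_xmax] without holes.
/// Assumes `intervals` is sorted by `xmin` and non-overlapping.
fn fill_gaps(intervals: &[Interval], global_xmin: f64, global_xmax: f64) -> Vec<Interval> {
    let mut result = Vec::new();
    let mut cursor = global_xmin;
    for interval in intervals {
        // A gap exists if this interval starts after the previous one ended.
        if interval.xmin > cursor {
            result.push(Interval { xmin: cursor, xmax: interval.xmin, text: String::new() });
        }
        result.push(interval.clone());
        cursor = cursor.max(interval.xmax);
    }
    // Pad the end of the tier up to the global maximum.
    if cursor < global_xmax {
        result.push(Interval { xmin: cursor, xmax: global_xmax, text: String::new() });
    }
    result
}
```

For a tier with intervals 1 to 2 and 3 to 4 and a global range of 0 to 5, this sketch produces five intervals, three of them empty.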
- `visualize` graph operation that allows to output the current graph (somewhere in the conversion process) to SVG or DOT for debugging.
- removed debug output
- `collapse` uses deserializable component, thus attributes `ctype`, `layer`, and `name` are now under key `component`
- `collapse` only keeps annotations with namespace `annis` for nodes that were terminals in the collapsed components when transferring to the merged nodes to keep node status intact (e. g. token vs. not a token in terms of `annis::tok`).
- `textgrid` export considers time annotations of covered nodes as well
- `textgrid` export can now handle `annis::time` intervals with an undefined right boundary (such intervals will be skipped)
- `collapse` now also transfers annotations with namespace "annis" with the exception of "annis::node_name". This could lead to unstable results in case of conflicting values, such as for "annis::layer", but for most use cases this is not relevant yet. Not adding many of the previously dropped annotations, though, was much more severe.
- `textgrid` export now creates PRAAT TextGrid files from annotation graphs
- `textgrid` export can be configured for a desired order of tiers in the output files; the order of tiers can be incomplete, attribute `ignore_others` can be used to interpret the order as an allowlist
- `textgrid` export also looks into `point_tiers` when `ignore_others` is on, since it is a reasonable expectation the user could have. Thus, setting `ignore_others = true` with an empty `tier_order` would result in an export of all point tiers if at least one is set.
- `exmaralda` export can now be configured for the annotation key that provides a clue to which subgraph is relevant for a file
- code is more robust and more transparent to the user in case of unexpected data
- `textgrid` import now allows correct file type specification for short files
- `revise` now deserializes components directly and uses different syntax. They are provided as a list of `from` and `to` component specifications.
- preconfiguration of `arch_dependency` via `guess_vis` field of graphml export now only sets `node_key` mapping for named orderings. Setting it with an empty value did not address `annis::tok` contrary to what was expected to happen.
- some bare unwraps have been removed, thus exporting graphml is now more robust.
- New `annatto document <OUTPUT_DIR>` command that allows to generate markdown files with the module documentation in a given output directory. This command is executed in every pull request to keep the documentation up to date.
- `conllu` format now properly imports sentence comments, i. e. sentence level annotations that are not delimited by "=". This also requires such annotations to not contain a "=" at all. Such comments will be by default imported as values of `conll::comment` annotations. The annotation name can be adapted using attribute `comment_anno` of toml type `map` with keys `ns` and `name` (a serialization of graphANNIS' `AnnoKey`).
- `link`, `map`, `enumerate`, and `collapse` have documentation visible to the user.
- documentation for import of `xlsx` showed wrong config doc string
- `link` does not use default `0` for `source_node` and `target_node` attributes anymore, since they are 1-based indices (instead, there is no default)
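A short sketch of why `0` cannot serve as a default for a 1-based index. The function `node_at` is a hypothetical helper for illustration only and does not reflect how `link` is implemented.

```rust
/// Picks the node at a 1-based position from the nodes of a query match.
/// Returns `None` for position 0 (not a valid 1-based index) and for
/// positions beyond the end of the match.
fn node_at<'a>(match_nodes: &[&'a str], one_based: usize) -> Option<&'a str> {
    // `checked_sub` fails for 0 instead of wrapping around.
    let zero_based = one_based.checked_sub(1)?;
    match_nodes.get(zero_based).copied()
}
```

With a match of two nodes, positions 1 and 2 resolve to a node, while position 0 never does, so a default of `0` would always point at nothing.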
- `sequence` export for horizontal data now also works in models with multiple segmentation and empty tokens
- `check` can now save without a panic when `report` attribute is omitted. `list` is the default report level which only applies to `save`, not to the `report` attribute itself, where the default is not to print.
- `sequence` export for horizontal mode now works
- Importer for the relANNIS format (http://korpling.github.io/ANNIS/3.7/developer-guide/annisimportformat.html)
- progress reports for `enumerate`, `link`, and `map`
- `revise` can now rename nodes using attribute `node_names`, e. g. for renaming (top level) corpus nodes. The syntax is equivalent to renaming annotations, thus renaming with an empty value will lead to deletion. Renaming with an existing value (also rename with self) will lead to an error.
- Add `zip` option to GraphML export to directly export as ZIP file which can be more easily imported in ANNIS.
- update dependencies to latest graphANNIS version
- Fix non-resolved relative path when importing EXMARaLDA files.
- Limit the table width when listing the module properties, so they fit in the current terminal.
- `sequence` exports connected node's annotation values (e. g. ordered nodes) as vertical or horizontal sequences.
- `split` breaks up conflated annotation values into parts
- `revise` now offers to delete an entire subgraph from a node in the inverse direction of part of edges
- `enumerate` can prefix the numeric annotation it generates with an annotation value from the query match (use attribute `value` to point in the match list with a 1-based index)
- `enumerate` uses u64 internally (to be in line with graphANNIS and to be deserializable)
- `collapse` now uses node ids that indicate the node names that entered the merge, the parent node is not indicated anymore
- `split` has default configuration/behaviour (do nothing); attribute `keep` is now `delete` to adhere to boolean default logic
- no more `annis::tok` labels for non-terminal coverage nodes in `xlsx` import
- hypernode ids are unified; in older versions annotations could get distributed across two or more hypernode instances due to invalid determination of the parent (part of-child)
- Added simple chunker module based on text-splitter.
- `check` can write check report to file
- `check` can test a corpus graph comparing results to an external corpus graph loaded from a graphANNIS database
- import `ptb` can now split node annotations to derive a label for the incoming edge, when a delimiter is provided using `edge_delimiter`. E. g., `NP-sbj` will create a node of category `NP`, whose incoming edge has function `sbj`, given the following config is used: `edge_delimiter = "-"`
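The `NP-sbj` example can be sketched as follows. This is a minimal illustration, not the importer's code: the function name `split_label` is hypothetical, and splitting at the first occurrence of the delimiter is an assumption.

```rust
/// Splits a constituent label such as "NP-sbj" into a category and an
/// optional edge function. Without a delimiter (or with an empty one) the
/// whole label is the category.
/// Assumption: the label is split at the FIRST occurrence of the delimiter.
fn split_label<'a>(label: &'a str, edge_delimiter: Option<&str>) -> (&'a str, Option<&'a str>) {
    let parts = edge_delimiter
        .filter(|d| !d.is_empty())
        .and_then(|d| label.split_once(d));
    match parts {
        Some((category, function)) => (category, Some(function)),
        None => (label, None),
    }
}
```

Here `split_label("NP-sbj", Some("-"))` yields the category `NP` and the function `sbj`, while a label without the delimiter stays unchanged and gets no edge label.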
- config attribute `stable_order` for exporting graphml enforces stable ordering of edges and nodes in output
- toml workflow files now strictly need to stick to known fields of module structs
- command line interface now has the `list` subcommand to list all modules and the `info` subcommand to show the description and parameters of a module.
- The `check` module can now query the `AnnotationGraph` directly without using the `CorpusStorageManager`.
- `chunk` deserializes with empty config to default values
- Don't throw error if output directory for any workflow does not exist.
- import `ptb`: Constituents also get `PartOf` edges to their respective document node.
- improve progress reporting by reporting each conversion step separately
- graph_op `collapse` can collapse an edge component, i. e., it merges all nodes in a connected subgraph in said component
- `collapse` can be accelerated when all edges of the component to be collapsed are known to be disjoint by providing `disjoint = true` in the step config
- `collapse` provides more feedback on current process
- `collapse` gives hypernodes proper names that allow to identify the subgraph they belong to. Furthermore, already existing hypernode ids are not reused (in case multiple collapse operations are run on a graph).
- `CorpusStorage` is now quiet
- importing `exmaralda` now has more features
- `exmaralda` can be exported
- `xlsx` import creates part of-edges between tokens and document nodes
- all imports add PartOf edges from nodes to their respective document (lowest corpus node)
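To make the `collapse` entries above more concrete, here is a rough sketch of grouping nodes that are connected by the edges of one component, so that each group could become one merged node. It uses a plain union-find over node indices; this is a swapped-in textbook technique for illustration, not a description of how `collapse` is implemented, and `connected_groups` is a hypothetical name.

```rust
use std::collections::BTreeMap;

/// Finds the representative of `x`, shortening the path while walking up.
fn find(parent: &mut [usize], mut x: usize) -> usize {
    while parent[x] != x {
        let grandparent = parent[parent[x]];
        parent[x] = grandparent;
        x = grandparent;
    }
    x
}

/// Groups the nodes `0..n` into sets connected by `edges`.
/// Assumes every edge endpoint is smaller than `n`.
fn connected_groups(n: usize, edges: &[(usize, usize)]) -> Vec<Vec<usize>> {
    let mut parent: Vec<usize> = (0..n).collect();
    for &(a, b) in edges {
        let ra = find(&mut parent, a);
        let rb = find(&mut parent, b);
        if ra != rb {
            // Attach the larger root to the smaller one to keep results stable.
            parent[ra.max(rb)] = ra.min(rb);
        }
    }
    let mut groups: BTreeMap<usize, Vec<usize>> = BTreeMap::new();
    for node in 0..n {
        let root = find(&mut parent, node);
        groups.entry(root).or_default().push(node);
    }
    groups.into_values().collect()
}
```

Each resulting group corresponds to one connected subgraph; nodes without any edge in the component end up in a group of their own.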
- `link` now considers all matching nodes for the same value, so the correct amount of edges is created
- `exmaralda` returns error when there is no time value for a timeline item
- fixed and simplified import of corpus node annotations
- `exmaralda` import's paths to linked media files are relative to the working directory
- `xlsx` importer now adds `PartOf` relations to the document nodes
- a separator for joining node values in `link` can be set with attribute `value_sep`
- spreadsheet imports can now be configured with a fallback token column for annotation names not mentioned in a column map, an empty string means map to timeline directly
- graph_op `check` can now be configured to not let the entire processing chain fail, when a test fails, by setting `policy = "warn"` (default is `fail`)
- metadata can be imported from spreadsheets alongside the linguistic data in the workbook, a data and a metadata spreadsheet name or number can now be specified for importing xlsx
- add heuristic for KWIC visualizer in graphml export
- `re` is now `revise`
- `revise` can modify components
- `path` as an import format now triggers the embedding of path names as nodes into the graph; this is supposed to help to represent configuration files for ANNIS
- import `path` adds an `annis::file` annotation
- import `path` adds part-of edges
- very basic implementation of a generic xml importer
- import opus sentence alignments
- graph op `enumerate` to enumerate nodes, i. e., add numeric annotations to results of one or multiple queries
- add importer for the format used by the TreeTagger
- mapping annotations now correctly extracts the id of the node to apply a new annotation to
- linking nodes failed to extract node names when graphANNIS responded with a node name only (e. g. in case of "tok" or "node" in a query)
- linking nodes did not concatenate the values of multiple nodes properly, this is now fixed
- fixed code of spreadsheet import (merged cells might not have an end column reference)
- relative import and export paths are interpreted as relative to the parent directory of the workflow file
- the spreadsheet importer will use the correct namespace `default_ns` for segmentation ordering relations
- fixed ordering of token nodes in spreadsheet import
- removed `show-documentation` subcommand and moved the documentation from mdBook to the crate documentation in the source code
- Documentation was not included in release binaries.
- CLI binary renamed from `annatto-cli` to `annatto`
- To execute a workflow file, use `annatto run <workflow-file>`
- module properties are now struct attributes of the importer, manipulator, or exporter, which facilitates deserialization and also use (undefined properties are no longer accepted, required properties cannot be omitted)
- only TOML workflow files are now supported, xml workflows can no longer be processed
- TOML support adds a command `annatto validate <workflow-file>` that checks if a workflow description can be deserialized to an internal workflow
- not all modules have a default implementation anymore (path attributes have no default value that makes sense)
- there is a default operation for each step type. Import: Create empty corpus, Manipulation: Do nothing, Export: Write GraphML
- map properties of some modules (such as `tier_group` for importing Textgrid) are no longer String codings, since TOML supports providing maps directly
- flattened TOML for workflow files
- TOML workflows: module config has to be singled out in separate table
- `check` tests are now configured in main workflow as TOML fragment
- `check` report table contains number of matches in case of failure
- linker takes list of node indices for value nodes (source and target)
- return an error in Workflow::execute on conversion error instead of relying on status messages
- collected errors in status messages `Failed` are now all reported at the end of the job
- an annotation mapper can create annotations from existing annotations using AQL for defining target nodes
- New command `show-documentation` for CLI, which starts a browser with the user guide.
- after running `check`, the test results can be printed as a table (default: off)
- `check` displays matching nodes for tests in new verbose mode
- `check` now comes with a higher level test ("Layer test") that is internally converted into atomic aql tests. The test can be applied to nodes and edges. It tests if a layer exists and only valid annotation values have been used.
- using flag `--env` allows to resolve environmental variables in workflow definitions which enables the use of template workflow definitions
- node linker: with two queries the resulting nodes can be linked via edges of a configurable type, layer, and name
- boolean environment variable `ANNATTO_IN_MEMORY` influences whether graphs will be stored on disk or in memory
- fixed panics caused by undefined attributes in tier tag or missing speaker table / wrong speaker id
- exmaralda import did not properly forward errors through the status sender, which it now does
- added import module for exmaralda partitur files
- set annis::layer with speaker name when importing exmaralda files
- spreadsheet import builds regular ANNIS coverage-based model
- import CoNLL-U files
- fix character buffer in exmaralda import
- order names are no longer part of the guessed visualisation when exporting GraphML
- if audio file linked in an exmaralda file cannot be found, no audio source will be linked
- exmaralda import: Multiple tlis with the same time value are now merged into a single tli token
- Upgrade to quick-xml 0.28 to avoid issues in future versions of Rust.
- exmaralda: catch flipped time values (start >= end)
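The condition named in the entry above (start >= end) can be written as a small check. This is only a sketch: the function name `check_time_span` is hypothetical, and whether the importer reports an error, warns, or skips the event is not stated here.

```rust
/// Validates a time span given in seconds.
/// Returns the span unchanged if `start < end`, otherwise a message
/// describing the flipped or empty interval. NaN values are rejected too,
/// because every comparison with NaN is false.
fn check_time_span(start: f64, end: f64) -> Result<(f64, f64), String> {
    if start < end {
        Ok((start, end))
    } else {
        Err(format!("invalid time span: start {start} is not before end {end}"))
    }
}
```

Writing the test as `start < end` rather than negating `start >= end` is a deliberate choice in this sketch, since it also catches undefined (NaN) values.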
- allow to leak graph updates to text file
- import textgrid, ptb, graphml, corpus annotations (metadata), spreadsheets
- check documents with list of AQL queries and expected results
- merge multiple imports to single graph
- merge policy for merge: fail on error, forward error (corrupted graph), drop subgraph (document) with errors
- apply single combined update after imports are finished to avoid multiple calls to `apply_update`
- replace annotation names, namespaces, move annotations, delete annotations (re)
- export graphml