Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adds a Macroize tool to enable conversion of Ion 1.0 data to Ion 1.1 using specified macros and text patterns. #990

Merged
merged 1 commit into from
Dec 4, 2024

Conversation

tgregg
Copy link
Contributor

@tgregg tgregg commented Nov 11, 2024

Description of changes:

This is an early (v0.1) version of the tool with many TODOs and known limitations (see the JavaDoc on MacroizeSpec.java). However, I think it's useful enough to make available now while we continue iterating. I also have a version of the tool that does basic macro detection, but that will be added in a future PR.

The tool may be invoked by building the ion-java jar and using

java -cp ion-java.jar com.amazon.ion.apps.macroize.Macroize

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@@ -24,7 +24,7 @@ public abstract class IonEncodingVersion<BinaryWriterBuilder, TextWriterBuilder>
* Ion 1.0, see the <a href="https://amazon-ion.github.io/ion-docs/docs/binary.html">binary</a> and
* <a href="https://amazon-ion.github.io/ion-docs/docs/text.html">text</a> specification.
*/
public static IonEncodingVersion<IonBinaryWriterBuilder, IonTextWriterBuilder> ION_1_0 = new IonEncodingVersion<IonBinaryWriterBuilder, IonTextWriterBuilder>(0) {
public static final IonEncodingVersion<IonBinaryWriterBuilder, IonTextWriterBuilder> ION_1_0 = new IonEncodingVersion<IonBinaryWriterBuilder, IonTextWriterBuilder>(0) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure why Spotless just found these now, but this is a good change, though unrelated to the rest of the PR.

@@ -21,7 +21,7 @@
* in an IndexOutOfBoundsException. The number 10 is chosen because it is the maximum number of bytes required to write
* a long value as a FlexInt or VarInt.
*/
/*package*/ final class WriteBuffer implements Closeable
public final class WriteBuffer implements Closeable
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is being used in Macroize, which is in a different Java package. We can rename this class using the _Private_ prefix if we want to maintain the existing convention to denote that this shouldn't be used by users of the library.

Comment on lines +326 to +333
if (((IonValueLite) element)._context.getContextSymbolTable() != getContextForIndex(null, index).getContextSymbolTable()) {
// Note: this isn't impossible to support, but it requires care in the case where 'element' may depend
// on symbol table mappings unique to its own context. In order to sidestep this complexity until a use
// case is identified for it, only setting the element at an index that uses the same symbol table is
// currently supported.
throw new UnsupportedOperationException();
}
return super.set(index, element);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The JavaDoc for this method still says it's unsupported. I haven't felt the need to change that to mark this as officially supported since it has a known limitation. I'm happy to move this functionality to an internal-only method if we prefer that.

Comment on lines +12 to +15
/**
* Writes a String value as a make_string invocation whose argument is a symbol. This allows recurring text to be
* added to the symbol table and encoded using an ID while retaining the String type.
*/
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not just create a constant macro for it? It would be be more space efficient to have a constant macro (even if it has a multi-byte address) than it is to convert a symbol using make_string.

In the best case, it requires a minimum of 4 bytes to invoke make_string with a single SID. If there are more than 255 symbols, then you could have a 2-byte SID here, and if make_string is not one of the first 64 macros in the default module, then it will require 2 bytes to encode the macro address.

03      // macro `make_string`, assuming it can be invoked using a one-byte user-space address
01      // AEB - one expression present
E1 XX   // Symbol with 1-byte SID

Whereas constant macros up to (approximately) address 1MM can always be invoked using 3 or fewer bytes.

The overhead of the macro definition itself is about 4-5 bytes (1 byte to start s-expression, 2 bytes for macro symbol, 1 byte for empty signature, payload, optional 1 byte to end macro definition if using delimited sexp), so the break-even point is 5 usages or fewer in the worst case. If e.g. the constant macro is encoded with a 2 byte macro address and the SID would be 2 bytes, the break-even point is 2 usages.


@Override
public void invoke(String match, ManualEncodingContext table, IonRawWriter_1_1 writer, boolean isBinary) {
// TODO consider whether these could/should be written using a custom macro that itself calls make_string.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Like VerbatimTextPattern, this is always going to be more compact in the data stream if it uses a custom macro.

Invoking make_string directly, in the best possible case, has a minimum of 5 bytes of overhead. If the suffix is long enough, then it needs at least 6 bytes of overhead in order to use a delimited expression group instead of writing out a long length prefix.

03      // Invoke make_string using single-byte address
02      // AEB - expression group
XX      // Expression group length
E1 XX   // prefix - Symbol with 1-byte SID

F9 ...  // suffix

A custom macro with an address less than ~1MM always has 3 bytes or less of overhead.

XX YY ZZ      // Macro address up to ~1MM

F9 ...  // suffix

The overhead for defining the macro is going to be 18 bytes, so the worst-case break even point is 9 usages.

AA      // start sexp
BB BB   // system symbol macro
CC      // signature sexp
DD DD   // 1 byte inline symbol for the parameter name
EE      // (
FF FF   // .
GG GG   // make_string
HH HH   // prefix symbol
II      // (
JJ JJ   // %
KK KK   // variable name

Comment on lines +10 to +14
/**
* Writes a String value as a make_string invocation with a prefix, a recurring substring, and a suffix. This allows for
* strings with common substrings to be written compactly, even if they may have high-cardinality prefixes and/or
* suffixes.
*/
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Like PrefixTextPattern this will be more space efficient in the data stream if using a custom macro such as (macro (prefix suffix) (.make_string (%prefix) "middle" (%suffix))). I haven't done the calculations on the break-even point, though.

@popematt
Copy link
Contributor

popematt commented Dec 4, 2024

Just to be clear, I'm not saying that you need to switch the text patterns to macros in this PR. I'm just pointing out that macros are more efficient in most cases.

@tgregg
Copy link
Contributor Author

tgregg commented Dec 4, 2024

Just to be clear, I'm not saying that you need to switch the text patterns to macros in this PR. I'm just pointing out that macros are more efficient in most cases.

Yes, agreed on the feedback. This is what I wanted to do at first, but I didn't have support for invoking macros in TDL in the tool yet, and wanted to leave this for future improvement. I've opened #1006 for this.

@tgregg tgregg merged commit 24e82fa into ion-11-encoding Dec 4, 2024
16 of 17 checks passed
@tgregg tgregg deleted the ion-11-encoding-macroize-basic branch December 4, 2024 21:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants