-
Notifications
You must be signed in to change notification settings - Fork 111
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Adds a Macroize tool to enable conversion of Ion 1.0 data to Ion 1.1 using specified macros and text patterns. #990
Conversation
…using specified macros and text patterns.
@@ -24,7 +24,7 @@ public abstract class IonEncodingVersion<BinaryWriterBuilder, TextWriterBuilder> | |||
* Ion 1.0, see the <a href="https://amazon-ion.github.io/ion-docs/docs/binary.html">binary</a> and | |||
* <a href="https://amazon-ion.github.io/ion-docs/docs/text.html">text</a> specification. | |||
*/ | |||
public static IonEncodingVersion<IonBinaryWriterBuilder, IonTextWriterBuilder> ION_1_0 = new IonEncodingVersion<IonBinaryWriterBuilder, IonTextWriterBuilder>(0) { | |||
public static final IonEncodingVersion<IonBinaryWriterBuilder, IonTextWriterBuilder> ION_1_0 = new IonEncodingVersion<IonBinaryWriterBuilder, IonTextWriterBuilder>(0) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure why Spotless just found these now, but this is a good change, though unrelated to the rest of the PR.
@@ -21,7 +21,7 @@ | |||
* in an IndexOutOfBoundsException. The number 10 is chosen because it is the maximum number of bytes required to write | |||
* a long value as a FlexInt or VarInt. | |||
*/ | |||
/*package*/ final class WriteBuffer implements Closeable | |||
public final class WriteBuffer implements Closeable |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is being used in Macroize, which is in a different Java package. We can rename this class using the _Private_
prefix if we want to maintain the existing convention to denote that this shouldn't be used by users of the library.
if (((IonValueLite) element)._context.getContextSymbolTable() != getContextForIndex(null, index).getContextSymbolTable()) { | ||
// Note: this isn't impossible to support, but it requires care in the case where 'element' may depend | ||
// on symbol table mappings unique to its own context. In order to sidestep this complexity until a use | ||
// case is identified for it, only setting the element at an index that uses the same symbol table is | ||
// currently supported. | ||
throw new UnsupportedOperationException(); | ||
} | ||
return super.set(index, element); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The JavaDoc for this method still says it's unsupported. I haven't felt the need to change that to mark this as officially supported since it has a known limitation. I'm happy to move this functionality to an internal-only method if we prefer that.
/** | ||
* Writes a String value as a make_string invocation whose argument is a symbol. This allows recurring text to be | ||
* added to the symbol table and encoded using an ID while retaining the String type. | ||
*/ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not just create a constant macro for it? It would be be more space efficient to have a constant macro (even if it has a multi-byte address) than it is to convert a symbol using make_string
.
In the best case, it requires a minimum of 4 bytes to invoke make_string
with a single SID. If there are more than 255 symbols, then you could have a 2-byte SID here, and if make_string
is not one of the first 64 macros in the default module, then it will require 2 bytes to encode the macro address.
03 // macro `make_string`, assuming it can be invoked using a one-byte user-space address
01 // AEB - one expression present
E1 XX // Symbol with 1-byte SID
Whereas constant macros up to (approximately) address 1MM can always be invoked using 3 or fewer bytes.
The overhead of the macro definition itself is about 4-5 bytes (1 byte to start s-expression, 2 bytes for macro
symbol, 1 byte for empty signature, payload, optional 1 byte to end macro definition if using delimited sexp), so the break-even point is 5 usages or fewer in the worst case. If e.g. the constant macro is encoded with a 2 byte macro address and the SID would be 2 bytes, the break-even point is 2 usages.
|
||
@Override | ||
public void invoke(String match, ManualEncodingContext table, IonRawWriter_1_1 writer, boolean isBinary) { | ||
// TODO consider whether these could/should be written using a custom macro that itself calls make_string. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Like VerbatimTextPattern
, this is always going to be more compact in the data stream if it uses a custom macro.
Invoking make_string
directly, in the best possible case, has a minimum of 5 bytes of overhead. If the suffix is long enough, then it needs at least 6 bytes of overhead in order to use a delimited expression group instead of writing out a long length prefix.
03 // Invoke make_string using single-byte address
02 // AEB - expression group
XX // Expression group length
E1 XX // prefix - Symbol with 1-byte SID
F9 ... // suffix
A custom macro with an address less than ~1MM always has 3 bytes or less of overhead.
XX YY ZZ // Macro address up to ~1MM
F9 ... // suffix
The overhead for defining the macro is going to be 18 bytes, so the worst-case break even point is 9 usages.
AA // start sexp
BB BB // system symbol macro
CC // signature sexp
DD DD // 1 byte inline symbol for the parameter name
EE // (
FF FF // .
GG GG // make_string
HH HH // prefix symbol
II // (
JJ JJ // %
KK KK // variable name
/** | ||
* Writes a String value as a make_string invocation with a prefix, a recurring substring, and a suffix. This allows for | ||
* strings with common substrings to be written compactly, even if they may have high-cardinality prefixes and/or | ||
* suffixes. | ||
*/ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Like PrefixTextPattern
this will be more space efficient in the data stream if using a custom macro such as (macro (prefix suffix) (.make_string (%prefix) "middle" (%suffix)))
. I haven't done the calculations on the break-even point, though.
Just to be clear, I'm not saying that you need to switch the text patterns to macros in this PR. I'm just pointing out that macros are more efficient in most cases. |
Yes, agreed on the feedback. This is what I wanted to do at first, but I didn't have support for invoking macros in TDL in the tool yet, and wanted to leave this for future improvement. I've opened #1006 for this. |
Description of changes:
This is an early (
v0.1
) version of the tool with many TODOs and known limitations (see the JavaDoc on MacroizeSpec.java). However, I think it's useful enough to make available now while we continue iterating. I also have a version of the tool that does basic macro detection, but that will be added in a future PR.The tool may be invoked by building the ion-java jar and using
java -cp ion-java.jar com.amazon.ion.apps.macroize.Macroize
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.