Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(csharp/src/Apache.Arrow.Adbc): correct StandardSchemas.ColumnSchema data types #1731

Conversation

birschick-bq
Copy link
Contributor

@birschick-bq birschick-bq commented Apr 19, 2024

  1. Corrects the two fields in StandardSchemas.ColumnSchema
  2. Adds unit tests for the StandardSchemas
  3. Adds extension methods to validate data using Schema or IReadOnlyList<Field>.
  4. Add extension method to CreateNestedListArray

Resolves #1729

@github-actions github-actions bot added this to the ADBC Libraries 1.0.0 milestone Apr 19, 2024
@lidavidm lidavidm changed the title bug(csharp/src/Apache.Arrow.Adbc): correct StandardSchemas.ColumnSchema data types fix(csharp/src/Apache.Arrow.Adbc): correct StandardSchemas.ColumnSchema data types Apr 19, 2024
@@ -128,8 +128,8 @@ public static class StandardSchemas
new Field("xdbc_scope_catalog", StringType.Default, true),
new Field("xdbc_scope_schema", StringType.Default, true),
new Field("xdbc_scope_table", StringType.Default, true),
new Field("xdbc_is_autoincrement", StringType.Default, true),
new Field("xdbc_is_generatedcolumn", StringType.Default, true)
new Field("xdbc_is_autoincrement", BooleanType.Default, true),
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Corrected data types.

@birschick-bq birschick-bq changed the title fix(csharp/src/Apache.Arrow.Adbc): correct StandardSchemas.ColumnSchema data types fix(csharp/src/Apache_Arrow_Adbc): correct StandardSchemas.ColumnSchema data types Apr 19, 2024
@lidavidm
Copy link
Member

Oh hmm, the regex I'm using to validate the title is too strict - periods should probably be allowed. You can ignore that check for now, please use the correct path.

@birschick-bq birschick-bq changed the title fix(csharp/src/Apache_Arrow_Adbc): correct StandardSchemas.ColumnSchema data types fix(csharp/src/Apache.Arrow.Adbc): correct StandardSchemas.ColumnSchema data types Apr 19, 2024
.pre-commit-config.yaml Outdated Show resolved Hide resolved
@birschick-bq birschick-bq marked this pull request as draft April 19, 2024 17:27
@birschick-bq birschick-bq marked this pull request as ready for review April 19, 2024 21:03
Copy link
Contributor

@CurtHagenlocher CurtHagenlocher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the whole, I'm not excited about exposing two public and non-ADBC-specific functions from this assembly -- particularly without a stronger specification for them. Can we make them internal and then use InternalsVisibleTo? This is already being used in the C# Arrow implementation, so the build processes should support it.

It would also be nice to eliminate the large number of nullability warnings by either removing the nullability annotations or by putting the code into a #nullable block.

/// <param name="schema">The schema to validate against.</param>
/// <param name="data">The data array to validate.</param>
/// <exception cref="ArgumentException">Throws an exception if the number of columns or type data types in the data array do not match the schema fields.</exception>
public static IReadOnlyList<IArrowArray> Validate(this Schema schema, IReadOnlyList<IArrowArray> data)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This does a very superficial comparison of types. For instance, it doesn't check the precision and scale of decimals, length of a fixed-size list, the units of dates, times or intervals, whether or not a union is dense or sparse, the name or nullability of a field, etc. That seems mostly okay for an internal method being used privately against a relatively-known schema, but problematic for a public function that's being exported for anyone to use.

How many of the gaps in the above list are features (for this particular use case) rather than flaws? Should this be standard functionality in the core Arrow library? It seems to overlap with what the (internal) class ArrayDataTypeComparer does.

Copy link
Contributor Author

@birschick-bq birschick-bq Apr 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd be happy to use the existing class, if it were exposed. I could also make this better and more rigorous, if you believe it might be used outside of its current context.

/// <param name="arrayList">The list of data.</param>
/// <param name="dataType">The data type of the contained data.</param>
/// <returns>A <see cref="ListArray"/> of the data.</returns>
public static ListArray CreateNestedListArray(this IReadOnlyList<IArrowArray?> arrayList, IArrowType dataType)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The summary seems underspecified for a public function. I can't tell from the text what it does or what I need to pass to it to make it work.

Is this functionality which should be in the core Arrow library?

Copy link
Contributor Author

@birschick-bq birschick-bq Apr 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. I believe this should likely be included in the Arrow library. Not sure of it proper placement in the namespace, yet.

Copy link
Contributor Author

@birschick-bq birschick-bq left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed the classed in the new extensions to be internal and exposed them in the AssemblyInfo.cs properties.


namespace Apache.Arrow.Adbc.Extensions
{
internal static class ListArrayExtensions
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This extensions is marked as internal.


namespace Apache.Arrow.Adbc.Extensions
{
internal static class StandardSchemaExtensions
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This extension is marked as internal.


using System.Runtime.CompilerServices;

[assembly: InternalsVisibleTo("Apache.Arrow.Adbc.Drivers.Apache, PublicKey=0024000004800000940000000602000000240000525341310004000001000100e504183f6d470d6b67b6d19212be3e1f598f70c246a120194bc38130101d0c1853e4a0f2232cb12e37a7a90e707aabd38511dac4f25fcb0d691b2aa265900bf42de7f70468fc997551a40e1e0679b605aa2088a4a69e07c117e988f5b1738c570ee66997fba02485e7856a49eca5fd0706d09899b8312577cbb9034599fc92d4")]
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Expose internal interfaces to three libraries, assuming the current signing key.

/// <param name="schema">The schema to validate against.</param>
/// <param name="data">The data array to validate.</param>
/// <exception cref="ArgumentException">Throws an exception if the number of columns or type data types in the data array do not match the schema fields.</exception>
public static IReadOnlyList<IArrowArray> Validate(this Schema schema, IReadOnlyList<IArrowArray> data)
Copy link
Contributor Author

@birschick-bq birschick-bq Apr 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd be happy to use the existing class, if it were exposed. I could also make this better and more rigorous, if you believe it might be used outside of its current context.

/// <param name="arrayList">The list of data.</param>
/// <param name="dataType">The data type of the contained data.</param>
/// <returns>A <see cref="ListArray"/> of the data.</returns>
public static ListArray CreateNestedListArray(this IReadOnlyList<IArrowArray?> arrayList, IArrowType dataType)
Copy link
Contributor Author

@birschick-bq birschick-bq Apr 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. I believe this should likely be included in the Arrow library. Not sure of it proper placement in the namespace, yet.

@birschick-bq
Copy link
Contributor Author

birschick-bq commented Apr 26, 2024

On the whole, I'm not excited about exposing two public and non-ADBC-specific functions from this assembly -- particularly without a stronger specification for them. Can we make them internal and then use InternalsVisibleTo? This is already being used in the C# Arrow implementation, so the build processes should support it.

It would also be nice to eliminate the large number of nullability warnings by either removing the nullability annotations or by putting the code into a #nullable block.

RE: nullability annotations

I believe we need a separate ticket for this. It will require hundreds of small changes to mark current non-nullable types as nullable. I see you have already created a new issue for this.

@CurtHagenlocher CurtHagenlocher merged commit 3173a1d into apache:main Apr 27, 2024
6 checks passed
cocoa-xu pushed a commit to meowcraft-dev/arrow-adbc that referenced this pull request May 8, 2024
…ma data types (apache#1731)

1. Corrects the two fields in `StandardSchemas.ColumnSchema`
2. Adds unit tests for the `StandardSchemas`
3. Adds extension methods to validate data using `Schema` or
`IReadOnlyList<Field>`.
4. Add extension method to `CreateNestedListArray`

Resolves apache#1729
@birschick-bq birschick-bq deleted the dev/birschick-bq/1729/inconsistent-columnschema-types branch May 9, 2024 19:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

csharp/src/Apache.Arrow.Adbc: incorrect data type used in StandardSchemas.ColumnSchema
3 participants