Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Encoding problem with column names using Amazon Athena ODBC 2.x driver #10

Open
jeffsantos opened this issue Jan 3, 2024 · 3 comments

Comments

@jeffsantos
Copy link

I'm trying to run a query on a database using the Amazon Athena ODBC 2.x driver (https://docs.aws.amazon.com/athena/latest/ug/odbc-v2-driver.html). I can correctly connect to the Athena server (via DSN) and also execute queries. The problem is that the returned column names appear to be in the wrong encoding system. The contents of the columns themselves are correct, the problem is only in the name of the columns.

The Athena driver requires forwardOnly cursors, so I used the query:forwardOnly: message from the connection object to execute the queries. I debugged the execution of this message and identified the following:

  1. As I am using Windows, the ODBCConnection object is always instantiated with utf-16 as the string encoding system. This is done in the ODBCConnection class >> determineStringEncoder which always returns an instance of ODBCUTF16Encoder.

  2. When executing the query, the problem occurs in the method
    ODBCAbstractStatement >> describeCols:

This is the part of the code where the problem occurs:

col := ODBCColAttr new
columnNumber: each;
name: (self stringEncoder decodeStringFrom: name characterCount: colNameLen value);
type: dataType value;
precision: columnSize value;
scale: decimalDigits value;
yourself. 

I can get the proper column names using other odbc libraries (pyodbc for Python, for example).

I tried to understand how the ODBCUTF16Encoder >> decodeStringFrom:characterCount: method works, but it is very complicated, and I still haven't been able to understand what might be happening. I would appreciate it if you have any tips that could help me.

This is a screenshot of my playground inspecting an ODBCRow object returned by the query where the problem can be seen more clearly:

odbc-issue

jeffsantos added a commit to jeffsantos/Pharo-ODBC that referenced this issue Jan 5, 2024
…ement >> describeCols:`

In some cases (i.e. Athena ODBC 2.x driver) could be necessary to reallocate the column name buffer to a bigger size.
Resolves: pharo-rdbms#10
@jeffsantos
Copy link
Author

I identified the problem. In fact, it has nothing to do with the encoding system, but rather with the external call to the ODBC API function SQLDescribeColW, in the ODBCAbstractStatement>>describeCols: method. At the beginning of this method, a buffer of a certain size is allocated to support the name of a column. Theoretically, the initially allocated size should be enough to support any column name. It works great with the relational database drivers I've tested. But for some reason, it doesn't work with the Amazon Athena ODBC 2.x driver.

Checking the output parameters after the SQLDescribeColW call, I noticed that the value of the param that indicates the size of the column name returned by the function is larger than the one previously allocated before the function call.

The signature of the SQLDescribeColW function in the official documentation is as follows:

SQLRETURN SQLDescribeCol(
       SQLHSTMT StatementHandle,
       SQLUSMALLINT ColumnNumber,
       SQLCHAR * ColumnName,
       SQLSMALLINT BufferLength,
       SQLSMALLINT * NameLengthPtr,
       SQLSMALLINT * DataTypePtr,
       SQLULEN * ColumnSizePtr,
       SQLSMALLINT * DecimalDigitsPtr,
       SQLSMALLINT * NullablePtr);

The description of the NameLengthPtr parameter gives a clue to the solution:

[Output] Pointer to a buffer in which to return the total number of characters (excluding the null termination) available to return in *ColumnName. If the number of characters available to return is greater than or equal to BufferLength, the column name in *ColumnName is truncated to BufferLength minus the length of a null-terminated character.

I forked the project in my personal account and changed ODBCAbstractStatement>>describeCols: to check the return values after calling SQLDescribeColW. If the returned size is larger than the previously allocated one, I resize the buffer and call SQLDescribeColW again to get the untruncated column name.

If you think this could be a permanent solution, please let me know. I can submit a pull request for your evaluation.

@astares
Copy link
Contributor

astares commented Jan 18, 2024

Would be good to provide a PR (ideally with a test case) so the users of the original project can profit from your fix too.

@jeffsantos
Copy link
Author

Great! I will prepare that.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants