[BLAS] Fix symbol conflicts between backends and reference #251

Merged on Nov 22, 2022 (4 commits)

dnhsieh-intel
Contributor

Description

When a backend uses CBLAS symbols, the symbol conflict causes the current test structure to compare the reference against itself instead of comparing the backend against the reference libcblas library. This PR resolves the issue by loading the reference functions at runtime into a local namespace.
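
As a rough illustration of loading reference functions at runtime, here is a minimal POSIX sketch using dlopen and dlsym. It is not the code in this PR: the namespace ref_blas, the helpers open_local and load_symbol, the library name "libcblas.so", and the assumption that CBLAS_INT is int are all hypothetical choices made for the example.

```cpp
#include <dlfcn.h>   // dlopen, dlsym, dlerror (POSIX)
#include <iostream>

// Hypothetical namespace and helper names, for illustration only.
namespace ref_blas {

// cblas_saxpy signature, assuming CBLAS_INT is int (ILP64 builds differ).
using saxpy_fn = void (*)(int n, float alpha, const float *x, int incx,
                          float *y, int incy);

// Open a shared library with RTLD_LOCAL so that its symbols are not made
// available for resolving references from other loaded objects.
inline void *open_local(const char *lib_name) {
    void *handle = dlopen(lib_name, RTLD_NOW | RTLD_LOCAL);
    if (handle == nullptr) {
        std::cerr << "failed to load " << lib_name << ": " << dlerror() << '\n';
    }
    return handle;
}

// Look up one symbol through a specific library handle and cast it to Fn.
// Returns nullptr if the handle is null or the symbol is missing.
template <typename Fn>
Fn load_symbol(void *handle, const char *symbol) {
    if (handle == nullptr) return nullptr;
    return reinterpret_cast<Fn>(dlsym(handle, symbol));
}

// Call the reference saxpy through the locally held pointer.
// Returns false when the library or the symbol could not be loaded.
inline bool reference_saxpy(int n, float alpha, const float *x, int incx,
                            float *y, int incy) {
    static void *handle = open_local("libcblas.so");  // assumed library name
    static saxpy_fn fn = load_symbol<saxpy_fn>(handle, "cblas_saxpy");
    if (fn == nullptr) return false;
    fn(n, alpha, x, incx, y, incy);
    return true;
}

} // namespace ref_blas
```

Because the function pointer is obtained from an explicit library handle, the call does not depend on which cblas_saxpy definition the dynamic linker picked for the rest of the process. On older glibc versions this sketch may need to be linked with -ldl.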

Fixes #204

Tests

Diagnostic messages were inserted in the Netlib cblas_saxpy and the oneMKL axpy unit test to check when the reference was called. More specifically:

  • In the Netlib cblas_saxpy.c
    void cblas_saxpy( const CBLAS_INT N, const float alpha, const float *X,
                           const CBLAS_INT incX, float *Y, const CBLAS_INT incY)
    {
       printf("   Netlib cblas_saxpy called\n");
    #ifdef F77_INT
       F77_INT F77_N=N, F77_incX=incX, F77_incY=incY;
    #else
       #define F77_N N
       #define F77_incX incX
       #define F77_incY incY
    #endif
       F77_saxpy( &F77_N, &alpha, X, &F77_incX, Y, &F77_incY);
    }
  • In the oneMKL unit test axpy.cpp
    template <typename fp>
    int test(device *dev, oneapi::mkl::layout layout, int N, int incx, int incy, fp alpha) {
        // Prepare data.
        vector<fp> x, y, y_ref;
    
        rand_vector(x, N, incx);
        rand_vector(y, N, incy);
        y_ref = y;
    
        // Call Reference AXPY.
        std::cout << "Call reference axpy" << std::endl;
        using fp_ref = typename ref_type_info<fp>::type;
        const int N_ref = N, incx_ref = incx, incy_ref = incy;
    
        ::axpy(&N_ref, (fp_ref *)&alpha, (fp_ref *)x.data(), &incx_ref, (fp_ref *)y_ref.data(),
               &incy_ref);
    
        // Call DPC++ AXPY.
        std::cout << "Call DPC++ axpy" << std::endl;
        ......
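
For context, the test above computes AXPY twice, once through the reference and once through DPC++, and then compares the two results. The sketch below shows that idea with plain loops. It is not oneMKL test code: naive_axpy, all_close, and the tolerance handling are hypothetical names and choices made for the example.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Plain-loop AXPY (y := alpha * x + y) following the BLAS stride convention:
// a negative increment walks the vector backwards from the far end.
template <typename fp>
void naive_axpy(int n, fp alpha, const std::vector<fp> &x, int incx,
                std::vector<fp> &y, int incy) {
    if (n <= 0) return;
    int ix = (incx >= 0) ? 0 : (1 - n) * incx;
    int iy = (incy >= 0) ? 0 : (1 - n) * incy;
    for (int i = 0; i < n; ++i, ix += incx, iy += incy) {
        y[iy] += alpha * x[ix];
    }
}

// Element-wise comparison with a relative tolerance.
// The tolerance scheme is an assumption, not the one used by the test suite.
template <typename fp>
bool all_close(const std::vector<fp> &a, const std::vector<fp> &b, fp rel_tol) {
    if (a.size() != b.size()) return false;
    for (std::size_t i = 0; i < a.size(); ++i) {
        fp scale = std::max(std::abs(a[i]), std::abs(b[i]));
        if (std::abs(a[i] - b[i]) > rel_tol * std::max(scale, fp(1))) return false;
    }
    return true;
}
```

If both sides of such a comparison end up in the same implementation, the check passes trivially and verifies nothing, which is why the diagnostic messages track which function actually runs.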

The following uses the _ct test as an example. The results of the _rt test are the same.

Because PR #210 already fixed the symbol conflicts between the MKL backend and the reference, the reference function is called only from the reference call, both with and without this PR:

$ ./bin/test_main_blas_ct --gtest_filter=AxpyTestSuite/AxpyTests.RealSinglePrecision/Column_Major*CPU*
Run this program with --terse_output to change the way it prints its output.
Note: Google Test filter = AxpyTestSuite/AxpyTests.RealSinglePrecision/Column_Major*CPU*
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from AxpyTestSuite/AxpyTests
[ RUN      ] AxpyTestSuite/AxpyTests.RealSinglePrecision/Column_Major_Intel_R__Core_TM__i7_6770HQ_CPU___2_60GHz
Call reference axpy
   Netlib cblas_saxpy called
Call DPC++ axpy
Call reference axpy
   Netlib cblas_saxpy called
Call DPC++ axpy
Call reference axpy
   Netlib cblas_saxpy called
Call DPC++ axpy
[       OK ] AxpyTestSuite/AxpyTests.RealSinglePrecision/Column_Major_Intel_R__Core_TM__i7_6770HQ_CPU___2_60GHz (125 ms)
[----------] 1 test from AxpyTestSuite/AxpyTests (125 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test suite ran. (125 ms total)
[  PASSED  ] 1 test.

To simulate the situation of conflicting symbols, the commit 634d7d2 right before PR #210 was used. At that commit, the diagnostic messages show the Netlib function also being called inside the DPC++ call:

$ ./bin/test_main_blas_ct --gtest_filter=AxpyTestSuite/AxpyTests.RealSinglePrecision/Column_Major*CPU*
Run this program with --terse_output to change the way it prints its output.
Note: Google Test filter = AxpyTestSuite/AxpyTests.RealSinglePrecision/Column_Major*CPU*
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from AxpyTestSuite/AxpyTests
[ RUN      ] AxpyTestSuite/AxpyTests.RealSinglePrecision/Column_Major_Intel_R__Core_TM__i7_6770HQ_CPU___2_60GHz
Call reference axpy
   Netlib cblas_saxpy called
Call DPC++ axpy
   Netlib cblas_saxpy called
Call reference axpy
   Netlib cblas_saxpy called
Call DPC++ axpy
   Netlib cblas_saxpy called
Call reference axpy
   Netlib cblas_saxpy called
Call DPC++ axpy
   Netlib cblas_saxpy called
[       OK ] AxpyTestSuite/AxpyTests.RealSinglePrecision/Column_Major_Intel_R__Core_TM__i7_6770HQ_CPU___2_60GHz (104 ms)
[----------] 1 test from AxpyTestSuite/AxpyTests (104 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test suite ran. (104 ms total)
[  PASSED  ] 1 test.

We then applied the changes in this PR on top of that commit. The messages show that the Netlib function is again called only for the reference, so the changes fix the issue:

$ ./bin/test_main_blas_ct --gtest_filter=AxpyTestSuite/AxpyTests.RealSinglePrecision/Column_Major*CPU*
Run this program with --terse_output to change the way it prints its output.
Note: Google Test filter = AxpyTestSuite/AxpyTests.RealSinglePrecision/Column_Major*CPU*
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from AxpyTestSuite/AxpyTests
[ RUN      ] AxpyTestSuite/AxpyTests.RealSinglePrecision/Column_Major_Intel_R__Core_TM__i7_6770HQ_CPU___2_60GHz
Call reference axpy
   Netlib cblas_saxpy called
Call DPC++ axpy
Call reference axpy
   Netlib cblas_saxpy called
Call DPC++ axpy
Call reference axpy
   Netlib cblas_saxpy called
Call DPC++ axpy
[       OK ] AxpyTestSuite/AxpyTests.RealSinglePrecision/Column_Major_Intel_R__Core_TM__i7_6770HQ_CPU___2_60GHz (116 ms)
[----------] 1 test from AxpyTestSuite/AxpyTests (116 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test suite ran. (117 ms total)
[  PASSED  ] 1 test.

Checklist

All Submissions

  • Do all unit tests pass locally? Attach a log: blas_lnx_log.txt
  • Have you formatted the code using clang-format?

Contributor

andrewtbarker left a comment

Thanks for this pull request and the thorough investigation! I have a question about this:

To simulate the situation of conflicting symbols, the commit 634d7d2 right before PR #210 was used.

Why this particular commit? The problem of using reference blas for both baseline and test persists even after #210.

izamax_res = cblas_izamax_p(n, x, incx);
}
return izamax_res;
}
Contributor

This file is already huge, so I suggest putting everything above this point in a separate file (maybe named reference_blas_wrappers.hpp or something similar) and including it in reference_blas_templates.hpp. I actually wish there was a way to avoid so much wrapping and indirection, but this is probably the best we can do.

Contributor Author

The problem of using reference blas for both baseline and test persists even after #210.

That's weird. I didn't see the problem in the develop branch using either gdb or output messages. Could you please share what you observed?

Contributor

This open source product calls the oneMKL CPU backend via DPC++ interfaces, which then call cblas symbols. I recall that those symbols were resolving to netlib even after #210. This is what I meant by this comment. But you have investigated this more closely than I did, so I might be wrong.

Contributor Author

I think we might have dodged the conflict. After PR #210, it looks to me that it is the _64 APIs that are called in the mklcpu backend, e.g., cblas_saxpy_64.

The wrappers were moved to reference_blas_wrappers.hpp. (commit f17ce6c)

Contributor

Oh, that makes sense, the _64 wrappers were a separate change. Thanks!

Contributor

mkrainiuk left a comment

Thanks for the PR! I have several questions about the changes.

tests/unit_tests/CMakeLists.txt (outdated)
PROPERTIES TEST_PREFIX ${DOMAIN_PREFIX}/RT/
DISCOVERY_TIMEOUT 30
)
endif()

gtest_discover_tests(test_main_${domain}_ct
PROPERTIES BUILD_RPATH ${CMAKE_BINARY_DIR}/lib
- PROPERTIES ENVIRONMENT LD_LIBRARY_PATH=${CMAKE_BINARY_DIR}/lib:$ENV{LD_LIBRARY_PATH}
+ PROPERTIES ENVIRONMENT LD_LIBRARY_PATH=${CMAKE_BINARY_DIR}/lib:${CBLAS_LIB_DIR}:$ENV{LD_LIBRARY_PATH}
Contributor

Please add a domain check here if possible.

Contributor Author

Addressed in the commit f762e6f.

Contributor

Thank you! I have one more question regarding the Windows build. We define LD_LIBRARY_PATH here for the Netlib libraries, but we don't define PATH on Windows. Does it make sense to update PATH instead of LD_LIBRARY_PATH on Windows?

Contributor Author

I thought this would be straightforward. I was wrong...

On Windows, if we try to set PATH, which is a semicolon-separated list, in gtest_discover_tests, for example:

gtest_discover_tests(test_main_${domain}_ct PROPERTIES ENVIRONMENT "PATH=C:\path1;C:\path2;C:\path3")

After CMake's internal processing, it turns out that the actual setting will only be PATH=C:\path1. This seems to be a known CMake issue.
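
One plausible way to picture the truncation, assuming it comes from the semicolon also being CMake's list separator (the thread does not confirm the exact mechanism), is the small C++ model below. It only mimics the splitting step and is not CMake code; split_cmake_list is a hypothetical helper name.

```cpp
#include <string>
#include <vector>

// Split a string on ';', the separator CMake uses between list elements.
// This models only the splitting step, not CMake's escaping rules.
std::vector<std::string> split_cmake_list(const std::string &value) {
    std::vector<std::string> items;
    std::string::size_type start = 0;
    while (true) {
        std::string::size_type pos = value.find(';', start);
        if (pos == std::string::npos) {
            items.push_back(value.substr(start));  // last element
            break;
        }
        items.push_back(value.substr(start, pos - start));
        start = pos + 1;
    }
    return items;
}
```

Under this model, the string "PATH=C:\path1;C:\path2;C:\path3" becomes three separate elements, and only the first one, PATH=C:\path1, still looks like a variable assignment.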

Do you think we should open an issue to investigate a workaround? Right now, because users have to include Netlib in PATH to build on Windows, it's fine if they run tests directly after building. But they would encounter the "failed to load CBLAS library" error if they run tests in a separate session.

Contributor

I think it's worth adding a note to our README that Netlib is required in PATH on Windows. We can also investigate a potential workaround as a separate issue. I don't think we need to mix the current issue and the Windows one in this PR.

Contributor Author

Will be tracked in issue #252.

dnhsieh-intel
Contributor Author

Log of the commit f17ce6c: blas_lnx_log_f17ce6c.txt
100% tests passed, 0 tests failed out of 3530

Contributor

mkrainiuk left a comment

The changes look good to me. Thank you!

Successfully merging this pull request may close these issues.

BLAS: tests may not be executing the right backend