Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Test failure: TestStorageTieringContinueInxMigration.test_put_object_limit_lt #246

Open
alanking opened this issue Dec 19, 2023 · 3 comments
Assignees
Milestone

Comments

@alanking
Copy link
Contributor

I have observed two modes of failure for test_plugin_unified_storage_tiering.TestStorageTieringContinueInxMigration.test_put_object_limit_lt as of commit 511cba8 (NOTE: This commit is not necessarily the commit which introduced the problem, just the earliest one with which I tested).

  1. The test intermittently fails in about the same amount of time as a passing test (~135 seconds).
  2. The test intermittently hangs forever after migrating a few objects out of the 255 objects it was supposed to migrate.

We need to investigate whether this is due to the test or if this is a real issue. We at least need to ensure that the test will fail eventually by removing the infinite loop found here:

def wait_for_empty_queue(function):
done = False
while done == False:
out, err, rc = lib.execute_command_permissive(['iqstat'])
if -1 != out.find('No delayed rules pending'):
function()
done = True
else:
print(out)
sleep(1)

@alanking
Copy link
Contributor Author

For more context, I see hundreds of messages like this in the log:

{
  "log_category": "database",
  "log_level": "error",
  "log_message": "Could not find a delay rule with id [10356].",
  "request_api_name": "",
  "request_api_number": 20013,
  "request_api_version": "d",
  "request_client_user": "rods",
  "request_host": "172.25.0.3",
  "request_proxy_user": "rods",
  "request_release_version": "rods4.3.1",
  "server_host": "2d81d8c90eae",
  "server_pid": 2505,
  "server_timestamp": "2023-12-19T22:54:19.628Z",
  "server_type": "agent",
  "server_zone": "tempZone"
}
{
  "log_category": "api",
  "log_level": "error",
  "log_message": "Could not get delay rule information [rule id=[10356]]",
  "request_api_name": "",
  "request_api_number": 20013,
  "request_api_version": "d",
  "request_client_user": "rods",
  "request_host": "172.25.0.3",
  "request_proxy_user": "rods",
  "request_release_version": "rods4.3.1",
  "server_host": "2d81d8c90eae",
  "server_pid": 2505,
  "server_timestamp": "2023-12-19T22:54:19.628Z",
  "server_type": "agent",
  "server_zone": "tempZone"
}

@alanking alanking modified the milestones: 4.3.1.1, 4.3.1.2 Dec 22, 2023
@korydraughn korydraughn removed this from the 4.3.1.2 milestone Apr 26, 2024
@alanking alanking added this to the 4.3.2.1 milestone May 15, 2024
@alanking alanking removed this from the 4.3.2.1 milestone Aug 13, 2024
@alanking alanking added this to the 4.3.3.1 milestone Dec 5, 2024
@alanking alanking self-assigned this Dec 10, 2024
alanking added a commit to alanking/irods_capability_storage_tiering that referenced this issue Dec 10, 2024
wait_for_empty_queue now allows the caller to specify a timeout
in seconds as well as a function to run when a timeout occurs.
If None is given for the function to run when a timeout occurs,
the function raises a TimeoutError Exception. Existing uses of
this function pass self.fail() to run on timeout with a brief
message explaining the failure.
alanking added a commit that referenced this issue Dec 10, 2024
wait_for_empty_queue now allows the caller to specify a timeout
in seconds as well as a function to run when a timeout occurs.
If None is given for the function to run when a timeout occurs,
the function raises a TimeoutError Exception. Existing uses of
this function pass self.fail() to run on timeout with a brief
message explaining the failure.
alanking added a commit to alanking/irods_capability_storage_tiering that referenced this issue Dec 11, 2024
wait_for_empty_queue now allows the caller to specify a timeout
in seconds as well as a function to run when a timeout occurs.
If None is given for the function to run when a timeout occurs,
the function raises a TimeoutError Exception. Existing uses of
this function pass self.fail() to run on timeout with a brief
message explaining the failure.
alanking added a commit that referenced this issue Dec 11, 2024
wait_for_empty_queue now allows the caller to specify a timeout
in seconds as well as a function to run when a timeout occurs.
If None is given for the function to run when a timeout occurs,
the function raises a TimeoutError Exception. Existing uses of
this function pass self.fail() to run on timeout with a brief
message explaining the failure.
@alanking
Copy link
Contributor Author

Test has been updated such that it will eventually fail now. Still seeing intermittent failures, so the issue is real.

@alanking
Copy link
Contributor Author

Saw this test fail and am posting some findings here...

I saw an agent crash with signal 11, similar to what was described in #193, but not the same stacktrace. Here are the log messages with references to the PID of the agent which crashed:

{
  "log_category": "legacy",
  "log_level": "error",
  "log_message": "apply_policy_for_tier_group :: no resources found for group [example_group_g2]",
  "request_api_name": "EXEC_RULE_EXPRESSION_AN",
  "request_api_number": 1206,
  "request_api_version": "d",
  "request_client_user": "rods",
  "request_host": "192.168.96.3",
  "request_proxy_user": "rods",
  "request_release_version": "rods4.3.3",
  "server_host": "202cf5def270",
  "server_pid": 19597,
  "server_timestamp": "2024-12-11T15:07:03.711Z",
  "server_type": "agent",
  "server_zone": "tempZone"
}
{
  "log_category": "agent_factory",
  "log_level": "error",
  "log_message": "Agent process [19597] exited with status [11].",
  "server_host": "202cf5def270",
  "server_pid": 19249,
  "server_timestamp": "2024-12-11T15:07:22.661Z",
  "server_type": "agent_factory",
  "server_zone": "tempZone"
}
{
  "log_category": "server",
  "log_level": "critical",
  "log_message": " 0# stacktrace_signal_handler in /lib/libirods_server.so.4.3.3\n 1# 0x00007F8807D2D320 in /lib/x86_64-linux-gnu/libc.so.6\n 2# 0x00007F8803C148AB\n 3# 0x00007F8803C148B3\n 4# 0x00007F8803C148B3\n 5# 0x00007F8803C148B3\n 6# 0x00007F8803C148B3\n 7# 0x00007F8803C148B3\n 8# 0x00007F8803C148B3\n 9# 0x00007F8803C148B3\n10# 0x00007F8803C148B3\n11# 0x00007F8803C148B3\n12# 0x00007F8803C148B3\n13# 0x00007F8803C148C4\n14# 0x00007F8803C148B3\n15# 0x00007F8803C148B3\n16# 0x00007F8803C148B3\n17# 0x00007F8803C148B3\n18# 0x00007F8803C148B3\n19# 0x00007F8803C14875\n20# 0x00007F8803C029F5\n21# 0x00007F8803BF652C\n22# 0x00007F8803BFA291\n23# 0x00007F8803B8B15C\n24# 0x00007F8803BF0B0F\n25# 0x00007F8803BF0A64\n26# 0x00007F8803BF09D4\n27# 0x00007F8803BEFAB3\n28# 0x00007F880A648C3C in /lib/libirods_server.so.4.3.3\n29# std::__1::function<irods::error (std::__1::tuple<>&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, MsParamArray*, irods::callback)>::operator()(std::__1::tuple<>&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, MsParamArray*, irods::callback) const in /lib/libirods_server.so.4.3.3\n30# irods::pluggable_rule_engine<std::__1::tuple<> >::exec_rule_expression(std::__1::tuple<>&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, MsParamArray*, irods::callback) in /lib/libirods_server.so.4.3.3\n31# irods::rule_engine_context_manager<std::__1::tuple<>, RuleExecInfo*, (irods::rule_execution_manager_pack)0>::exec_rule_expression(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, MsParamArray*) in /lib/libirods_server.so.4.3.3\n32# rsExecRuleExpression(RsComm*, ExecRuleExpression*) in /lib/libirods_server.so.4.3.3\n33# 0x00007F880AD8591B in /lib/libirods_server.so.4.3.3\n34# int std::__1::__invoke_void_return_wrapper<int, false>::__call<int (*&)(RsComm*, ExecRuleExpression*), RsComm*, ExecRuleExpression*>(int (*&)(RsComm*, ExecRuleExpression*), RsComm*&&, ExecRuleExpression*&&) in /lib/libirods_server.so.4.3.3\n35# 0x00007F880AD85877 in /lib/libirods_server.so.4.3.3\n36# std::__1::__function::__func<int (*)(RsComm*, ExecRuleExpression*), std::__1::allocator<int (*)(RsComm*, ExecRuleExpression*)>, int (RsComm*, ExecRuleExpression*)>::operator()(RsComm*&&, ExecRuleExpression*&&) in /lib/libirods_server.so.4.3.3\n37# 0x00007F880AC71D6F in /lib/libirods_server.so.4.3.3\n38# std::__1::function<int (RsComm*, ExecRuleExpression*)>::operator()(RsComm*, ExecRuleExpression*) const in /lib/libirods_server.so.4.3.3\n39# irods::api_call_adaptor<ExecRuleExpression*>::operator()(irods::plugin_context&, RsComm*, ExecRuleExpression*) in /lib/libirods_server.so.4.3.3\n40# 0x00007F880AC719D1 in /lib/libirods_server.so.4.3.3\n41# irods::error std::__1::__invoke_void_return_wrapper<irods::error, false>::__call<irods::api_call_adaptor<ExecRuleExpression*>&, irods::plugin_context&, RsComm*, ExecRuleExpression*>(irods::api_call_adaptor<ExecRuleExpression*>&, irods::plugin_context&, RsComm*&&, ExecRuleExpression*&&) in /lib/libirods_server.so.4.3.3\n42# 0x00007F880AC718DF in /lib/libirods_server.so.4.3.3\n43# std::__1::__function::__func<irods::api_call_adaptor<ExecRuleExpression*>, std::__1::allocator<irods::api_call_adaptor<ExecRuleExpression*> >, irods::error (irods::plugin_context&, RsComm*, ExecRuleExpression*)>::operator()(irods::plugin_context&, RsComm*&&, ExecRuleExpression*&&) in /lib/libirods_server.so.4.3.3\n44# 0x00007F880AC71E87 in /lib/libirods_server.so.4.3.3\n45# std::__1::function<irods::error (irods::plugin_context&, RsComm*, ExecRuleExpression*)>::operator()(irods::plugin_context&, RsComm*, ExecRuleExpression*) const in /lib/libirods_server.so.4.3.3\n46# int irods::api_entry::call_handler<ExecRuleExpression*>(RsComm*, ExecRuleExpression*) in /lib/libirods_server.so.4.3.3\n47# call_execRuleExpressionInp(irods::api_entry*, RsComm*, ExecRuleExpression*) in /lib/libirods_server.so.4.3.3\n48# rsApiHandler(RsComm*, int, BytesBuf*, BytesBuf*) in /lib/libirods_server.so.4.3.3\n49# readAndProcClientMsg(RsComm*, int) in /lib/libirods_server.so.4.3.3\n50# agentMain(RsComm*) in /lib/libirods_server.so.4.3.3\n51# runIrodsAgentFactory(sockaddr_un) in /lib/libirods_server.so.4.3.3\n52# main::$_5::operator()() const at /irods_source/server/main_server/src/rodsServer.cpp:1320\n53# main at /irods_source/server/main_server/src/rodsServer.cpp:1387\n54# 0x00007F8807D121CA in /lib/x86_64-linux-gnu/libc.so.6\n55# __libc_start_main in /lib/x86_64-linux-gnu/libc.so.6\n56# _start in /usr/sbin/irodsServer\n",
  "server_host": "202cf5def270",
  "server_pid": 19248,
  "server_timestamp": "2024-12-11T15:07:32.570Z",
  "server_type": "server",
  "server_zone": "tempZone",
  "stacktrace_agent_pid": "19597",
  "stacktrace_timestamp_epoch_milliseconds": "651",
  "stacktrace_timestamp_epoch_seconds": "1733929642",
  "stacktrace_timestamp_utc": "2024-12-11T15:07:22.651Z"
}

The agent executes apply_policy_for_tier_group, which shows an error message that we can ignore (the executing rule refers to a tier group which does not exist). Then, about 20 seconds later, the agent factory reports that the agent exited with signal 11 (SIGSEGV - bad).

Shortly after, a stacktrace is logged. Here's what that looks like with the newlines actually expanded:

0# stacktrace_signal_handler in /lib/libirods_server.so.4.3.3
 1# 0x00007F8807D2D320 in /lib/x86_64-linux-gnu/libc.so.6
 2# 0x00007F8803C148AB
 3# 0x00007F8803C148B3
 4# 0x00007F8803C148B3
 5# 0x00007F8803C148B3
 6# 0x00007F8803C148B3
 7# 0x00007F8803C148B3
 8# 0x00007F8803C148B3
 9# 0x00007F8803C148B3
10# 0x00007F8803C148B3
11# 0x00007F8803C148B3
12# 0x00007F8803C148B3
13# 0x00007F8803C148C4
14# 0x00007F8803C148B3
15# 0x00007F8803C148B3
16# 0x00007F8803C148B3
17# 0x00007F8803C148B3
18# 0x00007F8803C148B3
19# 0x00007F8803C14875
20# 0x00007F8803C029F5
21# 0x00007F8803BF652C
22# 0x00007F8803BFA291
23# 0x00007F8803B8B15C
24# 0x00007F8803BF0B0F
25# 0x00007F8803BF0A64
26# 0x00007F8803BF09D4
27# 0x00007F8803BEFAB3
28# 0x00007F880A648C3C in /lib/libirods_server.so.4.3.3
29# std::__1::function<irods::error (std::__1::tuple<>&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, MsParamArray*, irods::callback)>::operator()(std::__1::tuple<>&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, MsParamArray*, irods::callback) const in /lib/libirods_server.so.4.3.3
30# irods::pluggable_rule_engine<std::__1::tuple<> >::exec_rule_expression(std::__1::tuple<>&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, MsParamArray*, irods::callback) in /lib/libirods_server.so.4.3.3
31# irods::rule_engine_context_manager<std::__1::tuple<>, RuleExecInfo*, (irods::rule_execution_manager_pack)0>::exec_rule_expression(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, MsParamArray*) in /lib/libirods_server.so.4.3.3
32# rsExecRuleExpression(RsComm*, ExecRuleExpression*) in /lib/libirods_server.so.4.3.3
33# 0x00007F880AD8591B in /lib/libirods_server.so.4.3.3
34# int std::__1::__invoke_void_return_wrapper<int, false>::__call<int (*&)(RsComm*, ExecRuleExpression*), RsComm*, ExecRuleExpression*>(int (*&)(RsComm*, ExecRuleExpression*), RsComm*&&, ExecRuleExpression*&&) in /lib/libirods_server.so.4.3.3
35# 0x00007F880AD85877 in /lib/libirods_server.so.4.3.3
36# std::__1::__function::__func<int (*)(RsComm*, ExecRuleExpression*), std::__1::allocator<int (*)(RsComm*, ExecRuleExpression*)>, int (RsComm*, ExecRuleExpression*)>::operator()(RsComm*&&, ExecRuleExpression*&&) in /lib/libirods_server.so.4.3.3
37# 0x00007F880AC71D6F in /lib/libirods_server.so.4.3.3
38# std::__1::function<int (RsComm*, ExecRuleExpression*)>::operator()(RsComm*, ExecRuleExpression*) const in /lib/libirods_server.so.4.3.3
39# irods::api_call_adaptor<ExecRuleExpression*>::operator()(irods::plugin_context&, RsComm*, ExecRuleExpression*) in /lib/libirods_server.so.4.3.3
40# 0x00007F880AC719D1 in /lib/libirods_server.so.4.3.3
41# irods::error std::__1::__invoke_void_return_wrapper<irods::error, false>::__call<irods::api_call_adaptor<ExecRuleExpression*>&, irods::plugin_context&, RsComm*, ExecRuleExpression*>(irods::api_call_adaptor<ExecRuleExpression*>&, irods::plugin_context&, RsComm*&&, ExecRuleExpression*&&) in /lib/libirods_server.so.4.3.3
42# 0x00007F880AC718DF in /lib/libirods_server.so.4.3.3
43# std::__1::__function::__func<irods::api_call_adaptor<ExecRuleExpression*>, std::__1::allocator<irods::api_call_adaptor<ExecRuleExpression*> >, irods::error (irods::plugin_context&, RsComm*, ExecRuleExpression*)>::operator()(irods::plugin_context&, RsComm*&&, ExecRuleExpression*&&) in /lib/libirods_server.so.4.3.3
44# 0x00007F880AC71E87 in /lib/libirods_server.so.4.3.3
45# std::__1::function<irods::error (irods::plugin_context&, RsComm*, ExecRuleExpression*)>::operator()(irods::plugin_context&, RsComm*, ExecRuleExpression*) const in /lib/libirods_server.so.4.3.3
46# int irods::api_entry::call_handler<ExecRuleExpression*>(RsComm*, ExecRuleExpression*) in /lib/libirods_server.so.4.3.3
47# call_execRuleExpressionInp(irods::api_entry*, RsComm*, ExecRuleExpression*) in /lib/libirods_server.so.4.3.3
48# rsApiHandler(RsComm*, int, BytesBuf*, BytesBuf*) in /lib/libirods_server.so.4.3.3
49# readAndProcClientMsg(RsComm*, int) in /lib/libirods_server.so.4.3.3
50# agentMain(RsComm*) in /lib/libirods_server.so.4.3.3
51# runIrodsAgentFactory(sockaddr_un) in /lib/libirods_server.so.4.3.3
52# main::$_5::operator()() const at /irods_source/server/main_server/src/rodsServer.cpp:1320
53# main at /irods_source/server/main_server/src/rodsServer.cpp:1387
54# 0x00007F8807D121CA in /lib/x86_64-linux-gnu/libc.so.6
55# __libc_start_main in /lib/x86_64-linux-gnu/libc.so.6
56# _start in /usr/sbin/irodsServer

I thought I built with debugging symbols on, but, apparently not... Will try that again in a bit.

I suspect that these issues may be resolved by taking a sweep over the codebase with ASAN.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

No branches or pull requests

2 participants