UI testing randomly getting stuck in CI (OSOE-464) #228

Piedone · 2022-11-11T03:25:46Z

See e.g. this build. The NuGetTest UI testing of OSOCE got stuck and thus timed out in 30 minutes, while all the other builds worked fine. Rerunning just that build fixed the issue. This problem started way before these recent examples and the Orchard Core 1.5 upgrade. Reruns resolve the issue every time.

Other examples unrelated to the 1.5 upgrade: here, here, here, here.

Troubleshooting this is made harder by it apparently not being possible to finish the test-dotnet action and upload artifacts when the build is canceled, see Lombiq/GitHub-Actions#77.

It seems this bug didn't solve itself in the end: Lombiq/Open-Source-Orchard-Core-Extensions#126

Jira issue

See comments for notes.

To be done:

Check if this only happens if tests fail and are then retired, due to Task.Delay() potentially having an effect. - No, this may also happen if there are no failing tests, see here.
Test fix for snyc-over-async in WebApplicationFactory. See:
Test input redirection.
- Lombiq/Open-Source-Orchard-Core-Extensions@206ff3e
- Lombiq/GitHub-Actions@83f47fd
Test disabled node reuse.
- OSOE-464: OSOCE NuGetTest UI testing randomly getting stuck under Windows in Lombiq.UITestingToolbox Open-Source-Orchard-Core-Extensions#273
- OSOE-464: Disabling node reuse during dotnet build and test GitHub-Actions#105
Test ui-test-parallelism: 0 for NuGetTest builds. Note though that this is already the case for root builds using the larger runners, and it still happened there under Ubuntu.
- OSOE-464: Instead of the null input for dotnet test that doesn't help, let's try if ui-test-parallelism: 0 will fix NuGetTest test execution hangs Open-Source-Orchard-Core-Extensions#303
Test xUnit "maxParallelThreads": -1.
- OSOE-464: Surfacing configuration for xUnit maxParallelThreads in build workflow Open-Source-Orchard-Core-Extensions#309
- OSOE-464: Surfacing configuration for xUnit maxParallelThreads GitHub-Actions#133
Investigate hangs with dump file, see this blog post for tips.
- OSOE-464: Adding blame-hang-timeout GitHub-Actions#215
- Investigation: UI testing randomly getting stuck in CI (OSOE-464) #228 (comment)
Fix sync over async in OC RecipeHarvester
- OSOE-464: Replacing IRecipeHarvester implementations #263
- Here is an open issue for this in OC: Fix bad async usage in RecipeHarvester OrchardCMS/OrchardCore#10329
Check all processes (eg. smtp4dev) stopped properly. Hunting for zombie processes...
- It looks like the hang timeout does not help
- Maybe we can use Wait-Process or something similar

Stuck tests outside the scope of `blame-hang`

Comment	OS	Runner	Build
#228 (comment)	Windows	standard	NuGet
#228 (comment), #228 (comment)	Windows	standard	root
#228 (comment)	Windows	standard	NuGet

The text was updated successfully, but these errors were encountered:

Piedone · 2022-11-12T22:30:35Z

I couldn't reproduce this yet with 140 builds, see the https://github.com/Lombiq/Open-Source-Orchard-Core-Extensions/actions/workflows/build-and-test.yml?query=branch%3Aissue%2FOSOE-464. However, the currently added logging shows the details of UI test execution, even if the build is canceled due to a timeout, see here.

So, I think we should merge the new logging, and wait for the issue to randomly arise again. At that point, we'll be able to determine the cause better.

Piedone · 2022-11-16T20:51:43Z

Notes:

Until now, This happened exclusively with the NuGetTest build under Windows. This is a new case that might be the same: With the new, streaming test logging, this failed for the root solution and under Ubuntu (i.e. the polar opposite...).

The log here shows that out of the 6 test projects, only tests for 2 were run: Lombiq.TrainingDemo.Tests and Lombiq.Tests.UI.Samples. There, apparently, all 26 tests have run:

2022-11-16T11:34:32.5767718Z Results File: /home/runner/work/Open-Source-Orchard-Core-Extensions/Open-Source-Orchard-Core-Extensions/test/Lombiq.UITestingToolbox/Lombiq.Tests.UI.Samples/TestResults/test-results.trx
2022-11-16T11:34:32.5768672Z 
2022-11-16T11:34:32.5816191Z Test Run Successful.
2022-11-16T11:34:32.5816903Z Total tests: 26
2022-11-16T11:34:32.5818566Z      Passed: 26
2022-11-16T11:34:32.5820511Z  Total time: 4.9606 Minutes

The following tests I checked and indeed have run (and they all passed) before the build timed out. There are actually 28 "Finishing execution of" runs for the supposedly 26 tests!

AccessibilityTest.FrontendPagesShouldBeAccessible
BasicVisualVerificationTests.VerifyHomePage x2, for Chrome and Edge
BasicTests.LoginShouldWork
ErrorHandlingTests.ServerSideErrorOnLoadedPageShouldHaltTest
EmailTests.SendingTestEmailShouldWork
MonkeyTests.TestAdminPagesAsMonkeyRecursivelyShouldWorkWithAdminUser
SqlServerTests.TogglingFeaturesShouldWorkWithSqlServer
TenantTests.CreatingTenantShouldWork
AzureBlobStorageTests.AnonymousHomePageShouldExistWithAzureBlobStorage
BasicTests.TogglingFeaturesShouldWork
ErrorHandlingTests.BrowserLogsShouldPersist
MonkeyTests.TestAdminBackgroundTasksAsMonkeyRecursivelyShouldWorkWithAdminUser
BasicOrchardFeaturesTests.BasicOrchardFeaturesShouldWork - Two "Finishing execution" messages since it failed once but was retried.
SqlServerTests.AnonymousHomePageShouldExistWithSqlServer
BasicTests.ShortcutsShouldWork
AzureBlobStorageTests.TogglingFeaturesShouldWorkWithAzureBlobStorage
ErrorHandlingTests.ErrorDuringSetupShouldHaltTest
BasicVisualVerificationTests.VerifyNavbar x2, for Chrome and Edge
BasicTests.AnonymousHomePageShouldExist
MonkeyTests.TestCurrentPageAsMonkeyRecursivelyShouldWorkWithAnonymousUser
ErrorHandlingTests.ClientSideErrorOnLoadedPageShouldHaltTest
DatabaseSnapshotTests.BasicOrchardFeaturesShouldWorkWithExistingDatabase - Two "Finishing execution" messages but not because fails and retries. It was actually started twice, 1:15 mins apart. This is because to test resume from an existing DB, it starts a test within a test.
MonkeyTests.TestCurrentPageAsMonkeyShouldWorkWithConfiguredRandomSeed
BasicVisualVerificationTests.VerifyBlogImage

In the end, all tests have run from the project, just a couple of them more than once. The above list contains 24 tests + 2 Edge executions (normal until now) + 2 duplicated "Finishing execution" messages.

Log:
9_Tests.txt

So, nothing strange here. It's just that nothing happens after the "Total time" line. Then we should have a "Test successful" or "Test failed". But, this never arrives, the script is stuck between this line and this one. I guess it doesn't get stuck on if ($?) but somewhere in dotnet test.

Piedone · 2022-11-16T22:37:54Z

dotnet test hanging is not an issue just for us.

Tests hang from dotnet test microsoft/vstest#2080 Even James Newton-King himself had this.
- For JNK this was only with failing tests. However, in our case, tests aren't failing. After Lombiq.Tests.UI.Samples comes Lombiq.HelpfulLibraries.Tests (until the set of test projects, or at least the projects in the solution don't change, then the order should be deterministic), but it's not reached, thus unreported tests can't fail there. Nor can they in Lombiq.Tests.UI.Samples, since that project completed without issues.
- noprogress logging might work, but then we lose visible progress. Redirecting the input from /dev/null might solve this. See Tests hang from dotnet test microsoft/vstest#2080 (comment) for both. For that person this was only an issue under Linux and bash, so again the polar opposite of ours (for the most part).
- The input redirect trick didn't help others, but more CPU cores did (the last example of our problem happened on an 4-core Ubuntu machine, but usually happens under Windows with 2 cores). Unlimited threads for xUnit might help too.
- This corresponds to the previous one: Running tests with WebApplicationFactory can cause them to deadlock, because there's still a sync-over-async in TestServer. Here's the fix. I experienced hangs when changing the thread count for xUnit too, see OSOE-149. But! Our tests don't hang actually, since they supposedly fully finish. However, maybe it's worth trying out since perhaps this issue started to manifest after OSOE-46: Use WebApplicationFactory directly instead of running the tested app with the dotnet CLI #186. Corresponding ASP.NET Core issue: Remove blocking APIs in asp.net core integration tests code dotnet/aspnetcore#43353.
- We have GetResult() in a couple of places. The only it seems that's not inevitable is in TestReverseProxy. But Dispose(bool disposing) doesn't seem to actually be called.
Regression: dotnet test hangs for 15 minutes after test run completed on mac and linux dotnet/sdk#9452 This is exactly the same symptom as ours.
- A possible tip is disabling node reuse, also see the env var approach here. This comment hints that this is a bug in Process, but still, disabling node reuse can help. dotnet test is incompatible with MSBuild node reuse microsoft/vstest#1503 shows that this has been a known incompatibility between MSBuild and dotnet test since 2018.
- The same thing apparently happens with dotnet run roo.
- Maybe [--blame-hang-timeout](https://learn.microsoft.com/en-us/dotnet/core/tools/dotnet-test#:~:text=blame%2Dhang.-,%2D%2Dblame%2Dhang%2Dtimeout,-%3CTIMESPAN%3E%20(Available) would be useful to figure out the cause. But then again, it's not the tests that hang.
- Adding a Microsoft.NET.Test.Sdk package reference to all tests might help. We already have this (for projects actually containing test classes, not UI tests extension methods).
.Net (DotNet) tests get stuck with no reported errors or warning and I could not find the cause on Gitlab Runner's Kubernetes executor dotnet/sdk#27106 Same issue, but no new info.
Deadlock in integration tests after upgrading to ASP.NET Core 6 dotnet/aspnetcore#38649 Related.
- We have a couple of Task.Delay()s. Removing all of them is not an option, however, only in UITestExecutionSession it could be moved outside of the app's scope.

BTW @dministro since this hanging seems to correlate with #186, does something ring a bell?

dministro · 2022-11-16T23:17:01Z

Yes, i faced deadlock in NuGetTests some times on TC, but newer GHA. I investigated it, and replaced all lock() statements inside async methods using SemaphorSlim and thread safe collections. I can collect all relevant commits.
It was really simple to reproduce locally, I started the NuGetTests like TC does, and usually the deadlock come up after 2-3 runs. After I removed lock() the way as I mentioned above, the issue has gone locally and on TC too.

Piedone · 2022-11-16T23:41:57Z

So you could repro by basically running dotnet test for all test DLLs (what TC did)? Simply repeated runs of the testing script running in GHA locally doesn't do anything strange on my machine.

I'm testing disabling node reuse under Lombiq/Open-Source-Orchard-Core-Extensions#273. We'll see if it reliably fixes the issue, but what's for sure, is that for the root solution, build takes about twice as much (~4 minutes vs ~8)... If it helps, then we can check if removing any of the switches makes it faster but still retains the fix (or perhaps only leave the env var for the test-dotnet action.

Piedone · 2022-11-17T02:30:32Z

I just remembered about dotnet build-server shutdown. We don't need no-node reuse throughout, it's enough to just shut down the build servers after dotnet build. So, builds are not slower.

dministro · 2022-11-17T08:05:41Z

So you could repro by basically running dotnet test for all test DLLs (what TC did)?

Yes exactly.

Piedone · 2022-11-17T14:40:37Z

OK, thanks.

Piedone · 2022-11-28T13:17:40Z

Happened again here.

So, until now, disabled node reuse (or rather, shutting down build servers) and input redirection didn't help.

Piedone · 2022-11-30T12:49:46Z

Perhaps the problem will get some attention from Microsoft now? microsoft/vstest#2080 (comment)

Piedone · 2022-12-05T16:04:53Z

ui-test-parallelism: 0 didn't help, since the issue happened again here. I'll try xUnit "maxParallelThreads": -1 though that needs some restructuring of how our GHA workflows handle configuration.

Piedone · 2022-12-08T23:27:07Z

Apparently, the xUnit "maxParallelThreads": -1 trick didn't make the issue go away, but it did change it. This run and even its rerun timed out, during Tests as usual. However, while formerly the step got stuck after finishing all tests of Lombiq.OSOCE.NuGet.Tests.UI\, now it was in the middle of the test execution:

Timing out after starting executing tests in the project but no output from the actual tests:

Getting stuck between tests:

Interesting, and this makes a case for the "fix for snyc-over-async in WebApplicationFactory" attempt, since this can be related to what the tests do (as opposed to just what dotnet test does).

I'll wait for some more data to come in.

Piedone · 2022-12-09T17:48:29Z

@0liver you told about this one happening, thanks.

Since this wasn't a NuGetTest build, nor did it use the -1 xUnit config (but 0, as before, since that came with ui-test-parallelism before too; see here), it also suggests that we need the WebApplicationFactory fix. However, it didn't start test execution at all, not even reaching this line. Most possibly it got stuck here, which we can do nothing about, since it's not running tests yet at all...

Piedone · 2022-12-11T23:34:35Z

This run got stuck between two tests. This is suggestive of the WebApplicationFactory fix.

Piedone · 2022-12-13T21:53:19Z

Now we wait to see if the WebApplicationFactory fix solves the issue.

Piedone · 2022-12-14T00:31:14Z

And it doesn't, this build failed. It hung between two tests. So, the xUnit "maxParallelThreads": -1 config apparently didn't cause this, i.e. to hang between tests as opposed to between test projects.

I'm out of ideas since I've tried everything that I could think of and that everyone suggested. Perhaps any other suggestions from you @BenedekFarkas @DAud-IcI @dministro @0liver?

0liver · 2022-12-14T00:42:56Z

This has been beyond me for quite some time. So, unfortunately, no.

BenedekFarkas · 2022-12-14T12:58:32Z

I've started a build in TC, just in case it yields any useful information about timings: https://ci.lombiq.com/buildConfiguration/OrchardExtensions_OSOCE_Developer_PullAndBuild/1143340
We can build an updated runner image and spin up a VM based on it, so you can monitor the runner machine real-time.
For local testing, try changing your BIOS settings to match the core/thread count of the runner machines.
.NET SDK downgrade? We could stress-test against multiple different versions to see which one(s) produce this problem.

dministro · 2022-12-14T14:10:42Z

The only thing that comes to my mind, running the tests locally in a loop until the error comes, then attaching the debugger to the running process. It looks like the issue affects windows runner which means for me, we can reproduce it on windows(locally) so this simplifies it.

What about the distribution between standard and larger runners? Is there a significant difference?
As I see the linked failed builds, most of them are on standard-runner, except the OSOE-84 one.

Piedone · 2022-12-15T01:29:05Z

Thank you for your tips! I'll follow up with these if the issue still surfaces. Because I just noticed that the NuGetTest build, that recently failed, wasn't actually using the new code, since I didn't NuGet-publish and update the UI Testing Toolbox... I just did that now, it's in dev, so waiting for results.

Yes, this is by far the most of an issue with the Windows NuGetTest build, which happens on the 2-core standard runners, It much more rarely happens, but sometimes it does, with the full solution builds, also under Ubuntu (I don't know it every happening for the NuGetTest build under Ubuntu though). So yeah, we should be able to reproduce if constrained to a 2-core CPU.

0liver · 2022-12-15T14:31:11Z

I’ve had Tests hanging in a root build just last night, on the dev branch (Windows 2core): https://github.com/Lombiq/Open-Source-Orchard-Core-Extensions/actions/runs/3700330664/jobs/6268653535#step:10:3968

And another root build timed out just now, on a Ubuntu-22.04-4core runner: https://github.com/Lombiq/Open-Source-Orchard-Core-Extensions/actions/runs/3704714311/jobs/6277678502#step:10:63

Piedone · 2022-12-15T23:50:59Z

Thank you. This makes it clear that the WebApplicationFactory change didn't fix this. I'll look into this more, then.

Piedone · 2022-12-21T11:34:06Z

This will come handy: https://www.meziantou.net/generating-a-dump-file-when-tests-hang-on-a-ci-machine.htm

0liver · 2022-12-27T09:09:22Z

Let's hit this:

I think we could become a GitHub sponsor for this guy - we're using his Analyzers in our daily work already, and he provides some really valuable content.

Piedone · 2022-12-27T15:56:33Z

We'll definitely consider him as well during our yearly review of OSS projects/contributors to sponsor.

Piedone · 2023-01-03T22:53:37Z

@meziantou would you be open to investigating and fixing this dotnet test hang as contract work? I don't see a way to get in touch with you privately, so if you would, then please write me an e-mail, it's visible under my profile.

Piedone · 2024-01-24T18:03:06Z

So, what's the plan @dministro @sarahelsaig? Builds continue to randomly fail in dev.

Piedone · 2024-02-04T21:38:05Z

There's a hang here with a huge hang dump @dministro: https://github.com/Lombiq/Open-Source-Orchard-Core-Extensions/actions/runs/7776721615?pr=691 Download it before it gets cleaned up.

Piedone · 2024-02-04T22:22:56Z

BTW now I expect more of these, since we'll have to run issue branch Windows builds on standard runners too after GitRunners shutting down and us having to use Standard Runners for issue branch builds too: Lombiq/GitHub-Actions#320

Piedone · 2024-02-04T22:25:57Z

Another dump: https://github.com/Lombiq/Open-Source-Orchard-Core-Extensions/actions/runs/7776941706?pr=691

Piedone · 2024-02-04T23:17:32Z

Nah, these are red herrings, and the build now runs to completion after changing dotnet-test-process-timeout to 720000.

dministro · 2024-02-07T20:38:03Z

Surprising: https://github.com/Lombiq/Open-Source-Orchard-Core-Extensions/actions/runs/7791752953

The only running process: C:\Program Files\PowerShell\7\pwsh.exe -c "dotnet test --configuration Debug --nologo --logger 'trx;LogFileName=test-results.trx' --logger 'console;verbosity=detailed' --verbosity quiet --filter FullyQualifiedName!~SecurityScanningTests test\Lombiq.UITestingToolbox\Lombiq.Tests.UI.Samples\Lombiq.Tests.UI.Samples.csproj 2>&1"
Running time: 00:12:00.1811181

We will see.

dministro · 2024-02-07T20:46:07Z

This was finished in 7 min the dotnet-test-process-timeout: 360000 which is 6 min.

Piedone · 2024-02-12T22:05:50Z

Looks like an actual hang: https://github.com/Lombiq/Open-Source-Orchard-Core-Extensions/actions/runs/7877646651?pr=699

dministro · 2024-02-13T20:28:20Z

Looks like an actual hang: https://github.com/Lombiq/Open-Source-Orchard-Core-Extensions/actions/runs/7877646651?pr=699

Not really, this is monkey test again:

The chrome mem dump is extremely big.

Piedone · 2024-02-13T20:41:36Z

I see. I made that test inspect only a single page under https://github.com/Lombiq/UI-Testing-Toolbox/pull/342/files#diff-56d06a46558b30038bf02e6c2e823aa208dc42bd2cb8125aab592bf4a8ef67a9 as opposed to randomly potentially the whole admin, so that shouldn't happen anymore.

Piedone · 2024-03-09T23:01:08Z

Invoke-SolutionTests buffers the test output, so no output only means that the test process didn’t finish, not that it didn’t produce any output while running, but we don’t see that. This causes an issue for me in another project, where I try to debug a test-running workflow timing out, and I don't see what's happening after the tests start (only that there's no output before the workflow times out).

Can the output be streaming instead?

dministro · 2024-03-09T23:31:03Z

I'm not sure. I'll check it in the upcoming days.

Piedone · 2024-03-09T23:40:46Z

Thanks!

dministro · 2024-03-19T22:24:55Z

Lombiq/GitHub-Actions#338 FYI @Piedone

Piedone · 2024-03-19T22:46:46Z

Awesome!

Piedone · 2024-03-28T22:00:41Z

Does this add new info? https://github.com/Lombiq/Open-Source-Orchard-Core-Extensions/actions/runs/8473959367?pr=720

Piedone · 2024-03-28T23:59:07Z

So, I guess this can be closed, but please see my previous message.

dministro · 2024-03-29T10:45:07Z

Does this add new info? https://github.com/Lombiq/Open-Source-Orchard-Core-Extensions/actions/runs/8473959367?pr=720

Not really, here are the execution times per method from 3_Build and Test - root solution (larger runners) _ Build and Test (buildjet-4vcpu-ubuntu-2204).txt.

Method name	Execution time
Lombiq.OSOCE.Tests.UI.Tests.UITestingToolboxTests.SecurityShortcutsTests.AddUserToFakeRoleShouldThrow	0:00:07
Lombiq.OSOCE.Tests.UI.Tests.ModuleTests.BehaviorUIKitShowcaseTests.UIKitShowcasePageShouldBeCorrect	0:00:23
Lombiq.OSOCE.Tests.UI.Tests.ModuleTests.BehaviorEmailQuotaTests.EmailQuotaShouldBlockEmailsOverLimitAndWarn	0:00:18
Lombiq.OSOCE.Tests.UI.Tests.ModuleTests.BehaviorVueTests.RecipeDataShouldBeDisplayedCorrectly	0:00:26
Lombiq.OSOCE.Tests.UI.Tests.ModuleTests.BehaviorAsyncEditorTests.DemoContentItemAsyncEditorShouldWork	0:00:31
Lombiq.OSOCE.Tests.UI.Tests.ModuleTests.BehaviorHelpfulExtensionsTests.FeatureContentSets	0:00:05
Lombiq.OSOCE.Tests.UI.Tests.ModuleTests.BehaviorApplicationInsightsTests.ApplicationInsightsTrackingInOfflineOperationShouldWork	0:00:02
Lombiq.OSOCE.Tests.UI.Tests.ThemeTests.BehaviorBaseThemeTests.ThemeFeaturesShouldWork	0:00:08
Lombiq.OSOCE.Tests.UI.Tests.UITestingToolboxTests.WorkflowShortcutsTests.GenerateHttpEventUrlShouldWork	0:00:00
Lombiq.OSOCE.Tests.UI.Tests.ModuleTests.BehaviorMediaStorageManagementTests.MediaQuotaShouldWork	0:00:02
Lombiq.OSOCE.Tests.UI.Tests.ModuleTests.BehaviorTenantManagementTests.TenantShellSettingsEditorShouldSaveSettings	0:00:05
Lombiq.OSOCE.Tests.UI.Tests.ModuleTests.BehaviorMaintenanceTests.AddSiteOwnerPermissionToRoleMaintenanceTaskShouldBeExecutedSuccessfully	0:00:01
Lombiq.OSOCE.Tests.UI.Tests.ModuleTests.BehaviorPrivacyTests.ConsentBannerShouldWorkAnonymous	0:00:02
Lombiq.OSOCE.Tests.UI.Tests.ModuleTests.BehaviorChartJsTests.RecipeDataShouldBeDisplayedCorrectly	0:00:04
Lombiq.OSOCE.Tests.UI.Tests.VisualVerificationTests.VisualVerificationTests.VerifyHomePageAndLayout	0:00:06
Lombiq.OSOCE.Tests.UI.Tests.ModuleTests.BehaviorLoginAsAnybodyTests.SwitchingUserShouldWorkCorrectly	0:00:13
Lombiq.OSOCE.Tests.UI.Tests.ModuleTests.BehaviorEnvironmentRobotsTests.RobotsMetaTagShouldBePresent	0:00:01
Lombiq.OSOCE.Tests.UI.Tests.ModuleTests.BehaviorFeaturesGuardTests.ConditionallyEnabledFeaturesShouldWorkCorrectlyOnTenants	0:00:21
Lombiq.OSOCE.Tests.UI.Tests.UITestingToolboxTests.CustomAdminPrefixTests.NavigationWithCustomAdminPrefixShouldWork	0:00:02
Lombiq.OSOCE.Tests.UI.Tests.ThemeTests.BehaviorMediaThemeTests.MediaThemeShouldWorkWhenDeployed	0:00:04
Lombiq.OSOCE.Tests.UI.Tests.ModuleTests.BehaviorDataTablesTests.RecipeDataShouldBeDisplayedCorrectly	0:00:05
Lombiq.OSOCE.Tests.UI.Tests.ModuleTests.RetrievalBuildVersionDisplayTests.BuildVersionShouldBeBeDisplayedCorrectly	0:00:01
Lombiq.OSOCE.Tests.UI.Tests.ModuleTests.BehaviorJsonEditorTests.JsonEditorShouldWorkCorrectly	0:00:10
Lombiq.OSOCE.Tests.UI.Tests.UITestingToolboxTests.SecurityShortcutsTests.AddUserToRoleShouldWork	0:00:01
Lombiq.OSOCE.Tests.UI.Tests.LibraryTests.BehaviorOrchardCoreApiClientTests.OrchardCoreApiClientShouldWork	0:00:07
Lombiq.OSOCE.Tests.UI.Tests.ModuleTests.BehaviorEmailQuotaTests.EmailQuotaShouldNotBlockEmailsWhenDifferentHostIsUsedThanOriginalFromConfig	0:00:05
Lombiq.OSOCE.Tests.UI.Tests.ModuleTests.BehaviorVueTests.QrCardScanShouldWorkAsync	0:00:12
Lombiq.OSOCE.Tests.UI.Tests.ModuleTests.BehaviorHelpfulExtensionsTests.FeatureCodeGeneration	0:00:11
Lombiq.OSOCE.Tests.UI.Tests.ModuleTests.BehaviorAsyncEditorTests.DemoFrontEndAsyncEditorShouldWork	0:00:22
Lombiq.OSOCE.Tests.UI.Tests.ModuleTests.BehaviorPrivacyTests.ConsentBannerShouldWorkAdmin	0:00:12
Lombiq.OSOCE.Tests.UI.Tests.ModuleTests.BehaviorMaintenanceTests.ChangeUserSensitiveContentMaintenanceTaskShouldBeExecutedSuccessfully	0:00:15
Lombiq.OSOCE.Tests.UI.Tests.ModuleTests.BehaviorEnvironmentRobotsTests.RobotsMetaTagShouldBeMissingWithoutConfiguration	0:00:01
Lombiq.OSOCE.Tests.UI.Tests.ModuleTests.BehaviorFeaturesGuardTests.ForbiddenFeaturesShouldNotBeActivatableOnTenants	0:00:05
Lombiq.OSOCE.Tests.UI.Tests.ModuleTests.BehaviorLoginAsAnybodyTests.PermissionCheckShouldWorkCorrectly	0:00:12
Lombiq.OSOCE.Tests.UI.Tests.UITestingToolboxTests.SecurityShortcutsTests.AllowFakePermissionToRoleShouldThrow	0:00:00
Lombiq.OSOCE.Tests.UI.Tests.ThemeTests.BehaviorMediaThemeTests.MediaThemeShouldWorkLocally	0:00:04
Lombiq.OSOCE.Tests.UI.Tests.ModuleTests.BehaviorVueTests.QrCardScanShouldReportNotFoundAsync	0:00:12
Lombiq.OSOCE.Tests.UI.Tests.ModuleTests.BehaviorPrivacyTests.FormConsentCheckboxShouldWork	0:00:03
Lombiq.OSOCE.Tests.UI.Tests.ModuleTests.BehaviorMaintenanceTests.MaintenanceTaskShouldBeExecutedSuccessfully	0:00:01
Lombiq.OSOCE.Tests.UI.Tests.ModuleTests.BehaviorEnvironmentRobotsTests.RobotsMetaTagShouldBeMissing	0:00:01
Lombiq.OSOCE.Tests.UI.Tests.ModuleTests.BehaviorPrivacyTests.RegistrationConsentCheckboxShouldWork	0:00:04
Lombiq.OSOCE.Tests.UI.Tests.ModuleTests.BehaviorIdleTenantsTests.ShuttingDownIdleTenantsShouldWork	0:02:16
Lombiq.OSOCE.Tests.UI.Tests.ModuleTests.BehaviorHelpfulExtensionsTests.FeatureWidgets	0:00:35
Lombiq.OSOCE.Tests.UI.Tests.ModuleTests.BehaviorPrivacyTests.ConsentBannerShouldWorkWithRazorAndLiquidBasedThemes	0:00:12
Lombiq.OSOCE.Tests.UI.Tests.ModuleTests.BehaviorPrivacyTests.ConsentBannerShouldWorkWithRazorAndLiquidBasedThemes	0:00:12
Lombiq.OSOCE.Tests.UI.Tests.ModuleTests.BehaviorHelpfulExtensionsTests.FeatureFlow	0:00:12
Lombiq.OSOCE.Tests.UI.Tests.ModuleTests.BehaviorWalkthroughsTests.WalkthroughsShouldWorkCorrectly	0:07:57
Lombiq.OSOCE.Tests.UI.Tests.ThemeTests.BehaviorBlogBaseThemeTests.ThemeWithoutSetupShouldWork	0:00:06
Sum	0:16:55

Maybe we can increase the timeout again.

sarahelsaig · 2024-03-29T10:51:53Z

Maybe we can increase the timeout again.

That failure was from just before I merged the latest PR, so probably not yet.

Piedone · 2024-03-30T00:57:38Z

Ah OK, thanks.

Do we still need the build-create-binary-log and dotnet-test-process-timeout in both OSOCE workflows? Please also notify the team about what to do with these for other projects (at least Lombiq Tenants is affected).

Piedone · 2024-04-02T19:23:57Z

And this? https://github.com/Lombiq/Open-Source-Orchard-Core-Extensions/actions/runs/8526370146/job/23355457980

dministro · 2024-04-02T20:13:23Z

Almost finished. Using sed -En "s/(.*Finishing execution of )(([0-9a-zA-Z_]+[\.]{0,1})*), total time: (([0-9]{2}[:]{0,1}){3})(.*)/\2,\4/p" '.\4_Build and Test Windows - NuGetTest solution _ Build and Test (windows-2022).txt'

Method name	Execution time
Lombiq.OSOCE.NuGet.Tests.UI.Tests.ThemeTests.BehaviorThemeTests.BaseThemeShouldWork	0:03:44
Lombiq.OSOCE.NuGet.Tests.UI.Tests.ThemeTests.BehaviorMediaThemeTests.MediaThemeShouldRenderTemplatesFromMediaLibrary	0:00:07
Lombiq.OSOCE.NuGet.Tests.UI.Tests.ModuleTests.RetrievalBuildVersionDisplayTests.BuildVersionShouldBeBeDisplayedCorrectly	0:00:04
Lombiq.OSOCE.NuGet.Tests.UI.Tests.ModuleTests.BehaviorVueTests.RecipeDataShouldBeDisplayedCorrectly	0:04:00
Lombiq.OSOCE.NuGet.Tests.UI.Tests.BehaviorFileUploadTests.SampleFilesShouldBeAccessible	0:00:05
Lombiq.OSOCE.NuGet.Tests.UI.Tests.ModuleTests.BehaviorFeaturesGuardTests.ConditionallyEnabledFeaturesShouldWorkCorrectlyOnTenants	0:04:04
Lombiq.OSOCE.NuGet.Tests.UI.Tests.ModuleTests.BehaviorChartJsTests.RecipeDataShouldBeDisplayedCorrectly	0:00:11
Lombiq.OSOCE.NuGet.Tests.UI.Tests.ModuleTests.BehaviorMaintenanceTests.ChangeUserSensitiveContentMaintenanceTaskShouldBeExecutedSuccessfully	0:00:14
Lombiq.OSOCE.NuGet.Tests.UI.Tests.ModuleTests.BehaviorEnvironmentRobotsTests.RobotsMetaTagShouldBeMissing	0:00:05
Lombiq.OSOCE.NuGet.Tests.UI.Tests.ModuleTests.BehaviorApplicationInsightsTests.ApplicationInsightsTrackingInOfflineOperationShouldWork	0:00:07
Lombiq.OSOCE.NuGet.Tests.UI.Tests.ModuleTests.BehaviorTenantManagementTests.TenantShellSettingsEditorShouldSaveSettings	0:00:19
Lombiq.OSOCE.NuGet.Tests.UI.Tests.ModuleTests.BehaviorHelpfulExtensionsTests.FeatureFlow	0:00:20
Lombiq.OSOCE.NuGet.Tests.UI.Tests.ModuleTests.BehaviorJsonEditorTests.JsonEditorShouldWorkCorrectly	0:00:20
Lombiq.OSOCE.NuGet.Tests.UI.Tests.ModuleTests.BehaviorLoginAsAnybodyTests.SwitchingUserShouldWorkCorrectly	0:00:09
Lombiq.OSOCE.NuGet.Tests.UI.Tests.ModuleTests.BehaviorDataTableTests.DataTableShouldWork	0:00:12
Lombiq.OSOCE.NuGet.Tests.UI.Tests.BasicOrchardFeaturesTests.BasicOrchardFeaturesShouldWork	0:00:20
Lombiq.OSOCE.NuGet.Tests.UI.Tests.LibraryTests.BehaviorOrchardCoreApiClientTests.OrchardCoreApiClientShouldWork	0:00:16
Lombiq.OSOCE.NuGet.Tests.UI.Tests.ModuleTests.BehaviorEmailQuotaTests.EmailQuotaShouldBlockEmailsOverLimitAndWarn	0:01:59
Lombiq.OSOCE.NuGet.Tests.UI.Tests.ModuleTests.IdleTenantTests.ShuttingDownIdleTenantsShouldWork	0:06:00
Lombiq.OSOCE.NuGet.Tests.UI.Tests.ModuleTests.BehaviorAsyncEditorTests.DemoFrontEndAsyncEditorShouldWork	0:00:21
Lombiq.OSOCE.NuGet.Tests.UI.Tests.ModuleTests.BehaviorPrivacyTests.ConsentBannerShouldWorkAnonymous	0:00:10
Lombiq.OSOCE.NuGet.Tests.UI.Tests.ModuleTests.BehaviorFeaturesGuardTests.ForbiddenFeaturesShouldNotBeActivatableOnTenants	0:00:11
Lombiq.OSOCE.NuGet.Tests.UI.Tests.ModuleTests.BehaviorEnvironmentRobotsTests.RobotsMetaTagShouldBeMissingWithoutConfiguration	0:00:09
Lombiq.OSOCE.NuGet.Tests.UI.Tests.ModuleTests.BehaviorLoginAsAnybodyTests.PermissionCheckShouldWorkCorrectly	0:00:06
Lombiq.OSOCE.NuGet.Tests.UI.Tests.ModuleTests.BehaviorAsyncEditorTests.DemoContentItemAsyncEditorShouldWork	0:00:41
Lombiq.OSOCE.NuGet.Tests.UI.Tests.ModuleTests.BehaviorHelpfulExtensionsTests.FeatureWidgets	0:00:54
Lombiq.OSOCE.NuGet.Tests.UI.Tests.ModuleTests.BehaviorEnvironmentRobotsTests.RobotsMetaTagShouldBePresent	0:00:05
Lombiq.OSOCE.NuGet.Tests.UI.Tests.ModuleTests.BehaviorEmailQuotaTests.EmailQuotaShouldNotBlockEmailsWhenDifferentHostIsUsedThanOriginalFromConfig	0:00:55
Lombiq.OSOCE.NuGet.Tests.UI.Tests.ModuleTests.BehaviorHelpfulExtensionsTests.FeatureCodeGeneration	0:00:04
Lombiq.OSOCE.NuGet.Tests.UI.Tests.ModuleTests.BehaviorPrivacyTests.ConsentBannerShouldWorkAdmin	0:00:16
Lombiq.OSOCE.NuGet.Tests.UI.Tests.ModuleTests.BehaviorPrivacyTests.RegistrationConsentCheckboxShouldWork	0:00:09
Lombiq.OSOCE.NuGet.Tests.UI.Tests.ModuleTests.BehaviorWalkthroughsTests.WalkthroughsShouldWorkCorrectly	0:02:35
Lombiq.OSOCE.NuGet.Tests.UI.Tests.ModuleTests.BehaviorPrivacyTests.ConsentBannerShouldWorkWithRazorAndLiquidBasedThemes	0:00:15
Lombiq.OSOCE.NuGet.Tests.UI.Tests.ModuleTests.BehaviorPrivacyTests.ConsentBannerShouldWorkWithRazorAndLiquidBasedThemes	0:00:15
Lombiq.OSOCE.NuGet.Tests.UI.Tests.ModuleTests.BehaviorPrivacyTests.FormConsentCheckboxShouldWork	0:00:28
Sum	0:30:10

I would highlight this one: Lombiq.OSOCE.NuGet.Tests.UI.Tests.ModuleTests.IdleTenantTests.ShuttingDownIdleTenantsShouldWork -> 0:06:00
In this test happends nothing, only waits 1 or 2 mins for a log message. No client activities, no calculations...

Do we have any hisorical data about execution time for each methods?

dministro · 2024-04-02T20:21:43Z

Ah OK, thanks.

Do we still need the build-create-binary-log and dotnet-test-process-timeout in both OSOCE workflows? Please also notify the team about what to do with these for other projects (at least Lombiq Tenants is affected).

Lets keep the current configuration for a whilw (1-2 months). I created an issue for this to keep track. Lombiq/Open-Source-Orchard-Core-Extensions#736

Piedone · 2024-04-03T00:35:31Z

Here's the log of ShuttingDownIdleTenantsShouldWork (from the workflow output):

ShuttingDownIdleTenantsShouldWork output.txt

Note these lines:

2024-04-02T16:58:03.2535678Z  2024-04-02 16:48:55.0193 - Starting waiting for the setup operation.
2024-04-02T16:58:03.2537135Z  2024-04-02 16:52:09.3170 - Finished waiting for the setup operation.

(the first timestamp is when the test's log was flushed to the output, the second timestamp matters)

I.e., the test started together with all the others, then waited ~3 minutes for the common setup to be finished, by the test that happened to start that the first time (ExecuteTestAfterSetupAsync makes sure that the setup is only run once for matching tests, then snapshot, and re-used, but this async waiting adds to the wall clock time of the other tests). The test, after setting up another tenant and waiting for it to idle shutdown, finished at 16:54:55, i.e. within ~3,5 minutes, which is expected.

So, this looks normal, though the setup itself was slow: it should finish within seconds, just the tenant setup finished within 5s. However, I'd chalk this up to a random slowness of the GitHub-hosted runner, which is a random Azure VM running Windows (which is like two times slower for these already).

Piedone · 2024-07-17T20:46:56Z

So, this doesn't seem to be fixed. @dministro let's continue the investigation here instead of Lombiq/Open-Source-Orchard-Core-Extensions#736 since this is where the rest of the stuff is.

As mentioned under Lombiq/Open-Source-Orchard-Core-Extensions#80, Lombiq/GitHub-Actions#373 can help you diagnose this.

This run here has the same issue, but with test diag logs: https://github.com/Lombiq/Open-Source-Orchard-Core-Extensions/actions/runs/9980235741

This looks telling:

TpTrace Verbose: 0 : 19163, 10, 2024/07/17, 19:33:42.634, 1134212959530, vstest.console.dll, TestRequestSender.EndSession: Sending end session.
TpTrace Verbose: 0 : 19163, 10, 2024/07/17, 19:33:42.634, 1134213101092, vstest.console.dll, ProxyOperationManager.Close: waiting for test host to exit for 100 ms
TpTrace Warning: 0 : 19163, 10, 2024/07/17, 19:33:42.734, 1134313269943, vstest.console.dll, ProxyOperationManager: Timed out waiting for test host to exit. Will terminate process.

Looks related: microsoft/vstest#2952. Especially microsoft/vstest#2952 (comment) and the "infamous 100ms bug". That looks like this line.

Setting the VSTEST_TESTHOST_SHUTDOWN_TIMEOUT to 10s (like this to 60s, see this run) still causes the process to hang, despite the config actually being uses:

TpTrace Verbose: 0 : 19393, 7, 2024/07/17, 20:18:39.861, 790229310102, vstest.console.dll, ProxyOperationManager.Close: waiting for test host to exit for 10000 ms
TpTrace Warning: 0 : 19393, 7, 2024/07/17, 20:18:49.861, 800229492443, vstest.console.dll, ProxyOperationManager: Timed out waiting for test host to exit. Will terminate process.

With 60s it still hangs. So, it seems that this is an actual hang. The last message is always "Finishing execution of Lombiq.OSOCE.Tests.UI.Tests.ThemeTests.BehaviorBlogBaseThemeTests.ContentMenuItemShouldWorkCorrectly".

@sarahelsaig check this out, it might help you too.

Piedone added the bug Something isn't working label Nov 11, 2022

github-actions bot changed the title ~~OSOCE NuGetTest UI testing randomly getting stuck under Windows~~ OSOCE NuGetTest UI testing randomly getting stuck under Windows (OSOE-464) Nov 11, 2022

Piedone self-assigned this Nov 12, 2022

Piedone mentioned this issue Nov 28, 2022

OSOE-464: Instead of the null input for dotnet test that doesn't help, let's try if ui-test-parallelism: 0 will fix NuGetTest test execution hangs Lombiq/Open-Source-Orchard-Core-Extensions#303

Merged

sarahelsaig mentioned this issue Jan 24, 2024

OSOE-464: Executing dotnet test with process timeout Lombiq/Open-Source-Orchard-Core-Extensions#676

Merged

dministro mentioned this issue Apr 2, 2024

Remove build-create-binary-log and dotnet-test-process-timeout configurations from OSOCE workflows (OSOE-834) Lombiq/Open-Source-Orchard-Core-Extensions#736

Open

Piedone mentioned this issue Apr 28, 2024

Build and Test Windows workflows are timing out (OSOE-840) #358

Closed

Piedone mentioned this issue Jul 8, 2024

In certain cases UI test hangs indefinitely instead of fails (OSOE-880) #387

Closed

Piedone changed the title ~~OSOCE NuGetTest UI testing randomly getting stuck under Windows (OSOE-464)~~ UI testing randomly getting stuck in CI (OSOE-464) Jul 17, 2024

UI testing randomly getting stuck in CI (OSOE-464) #228

UI testing randomly getting stuck in CI (OSOE-464) #228

Comments

Piedone commented Nov 11, 2022 • edited by dministro Loading

Stuck tests outside the scope of blame-hang

Piedone commented Nov 12, 2022 • edited Loading

Piedone commented Nov 16, 2022

Piedone commented Nov 16, 2022

dministro commented Nov 16, 2022

Piedone commented Nov 16, 2022

Piedone commented Nov 17, 2022

dministro commented Nov 17, 2022

Piedone commented Nov 17, 2022

Piedone commented Nov 28, 2022

Piedone commented Nov 30, 2022

Piedone commented Dec 5, 2022

Piedone commented Dec 8, 2022

Piedone commented Dec 9, 2022

Piedone commented Dec 11, 2022

Piedone commented Dec 13, 2022

Piedone commented Dec 14, 2022

0liver commented Dec 14, 2022

BenedekFarkas commented Dec 14, 2022 • edited Loading

dministro commented Dec 14, 2022

Piedone commented Dec 15, 2022

0liver commented Dec 15, 2022

Piedone commented Dec 15, 2022

Piedone commented Dec 21, 2022

0liver commented Dec 27, 2022 • edited Loading

Piedone commented Dec 27, 2022

Piedone commented Jan 3, 2023

Piedone commented Jan 24, 2024

Piedone commented Feb 4, 2024

Piedone commented Feb 4, 2024

Piedone commented Feb 4, 2024

Piedone commented Feb 4, 2024 • edited Loading

dministro commented Feb 7, 2024

dministro commented Feb 7, 2024

Piedone commented Feb 12, 2024

dministro commented Feb 13, 2024

Piedone commented Feb 13, 2024

Piedone commented Mar 9, 2024

dministro commented Mar 9, 2024

Piedone commented Mar 9, 2024

dministro commented Mar 19, 2024

Piedone commented Mar 19, 2024

Piedone commented Mar 28, 2024

Piedone commented Mar 28, 2024

dministro commented Mar 29, 2024

sarahelsaig commented Mar 29, 2024

Piedone commented Mar 30, 2024

Piedone commented Apr 2, 2024

dministro commented Apr 2, 2024

dministro commented Apr 2, 2024 • edited Loading

Piedone commented Apr 3, 2024

Piedone commented Jul 17, 2024

Piedone commented Nov 11, 2022 •

edited by dministro

Loading

Stuck tests outside the scope of `blame-hang`

Piedone commented Nov 12, 2022 •

edited

Loading

BenedekFarkas commented Dec 14, 2022 •

edited

Loading

0liver commented Dec 27, 2022 •

edited

Loading

Piedone commented Feb 4, 2024 •

edited

Loading

dministro commented Apr 2, 2024 •

edited

Loading