PowerBIDatasetRefreshOperator task fails whereas dataset refresh succeeds #44618
Can you just specify a retry with the following parameters on the PowerBIDatasetRefreshOperator?
Thank you for the suggestion! (Sorry, I didn't know these fields were available.)
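For reference, task-level retries in Airflow are configured as arguments on the operator itself (`retries`, `retry_delay`, and `retry_exponential_backoff` are standard BaseOperator arguments). A minimal sketch of the kind of settings meant here, with illustrative values, shown as a plain dict so it stays self-contained:

```python
from datetime import timedelta

# Illustrative task-level retry settings (the keys are real BaseOperator
# arguments; the values are just examples, not recommendations).
retry_args = {
    "retries": 3,                         # re-run the whole task up to 3 times
    "retry_delay": timedelta(minutes=2),  # wait between attempts
    "retry_exponential_backoff": True,    # grow the delay on each retry
}

# These would be passed to PowerBIDatasetRefreshOperator(**retry_args, ...)
# alongside its own parameters.
print(retry_args["retries"])
```

Note that this retries the entire task, not just the failing status request, which is exactly the distinction discussed below.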
Those parameters apply to the whole operator. You could also try to specify the check_interval and timeout parameters, which both default to 60 seconds; the timeout should be greater than the check_interval, otherwise the trigger will stop polling due to the timeout.
OK I see, so the retry is not specific to the "get status" request.
Indeed, you're correct about that. I still don't understand your issue completely, as from what I see in the code of the
I understand the issue now. I've checked the code of the PowerBITrigger: the check_interval is set to 60 seconds, but the timeout also defaults to 60 seconds, which means the status is only fetched once before the trigger fails, as in your example log. So try reducing the check_interval to something like 10 seconds, or even a bit less. Maybe we should do a PR that checks the check_interval is smaller than the timeout and defaults it to 10 seconds.
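The arithmetic behind that suggestion can be sketched as follows. This is a simplified model of the trigger's polling loop, not the actual provider code:

```python
def polls_before_timeout(timeout: float, check_interval: float) -> int:
    """Roughly how many status checks fit in the timeout window.

    Simplified model: one check every `check_interval` seconds until
    `timeout` seconds have elapsed, with at least one check performed.
    """
    if check_interval <= 0:
        raise ValueError("check_interval must be positive")
    return max(1, int(timeout // check_interval))

# With the defaults (both 60s) the status is effectively checked only once:
print(polls_before_timeout(60, 60))   # 1
# Lowering check_interval to 10s allows several attempts within the timeout:
print(polls_before_timeout(60, 10))   # 6
```

This is why a single transient failure of the status request is enough to fail the whole task under the default settings.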
Hello! Also, I thought the default timeout was 1 week because of this line:
But maybe I'm mistaken. Regarding the bug, I think the issue may come from the very first request to get the refresh status, at this line:
This request is made by the PowerBI hook (using the method
and the hook raises the error "Unable to fetch the details of dataset refresh with Request Id" (which is also what I find in my Airflow logs). Does that make sense? I did try adding a retry on this request (and it seemed to fix the bug), but I understand that you don't want to introduce this type of custom retry mechanism. Maybe another option could be to wait check_interval seconds BEFORE sending the first request. What do you think?
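The "wait before the first request" idea can be sketched with a generic helper. All names here are illustrative: `fetch_status` stands in for the hook call that raises when the refresh id is not yet visible on the Power BI side, and LookupError stands in for the provider's exception:

```python
import time

def get_status_with_initial_retry(fetch_status, check_interval: float, attempts: int = 3):
    """Call fetch_status(), waiting check_interval seconds and retrying
    when the refresh id is not yet visible on the Power BI side.

    `fetch_status` is a hypothetical stand-in for the hook method that
    raises while the refresh details cannot be found yet.
    """
    last_error = None
    for _ in range(attempts):
        try:
            return fetch_status()
        except LookupError as exc:  # stand-in for PowerBIDatasetRefreshException
            last_error = exc
            time.sleep(check_interval)
    raise last_error

# Simulated hook: fails once (id not propagated yet), then succeeds.
calls = {"n": 0}
def fake_fetch():
    calls["n"] += 1
    if calls["n"] == 1:
        raise LookupError("Unable to fetch the details of dataset refresh")
    return "InProgress"

print(get_status_with_initial_retry(fake_fetch, check_interval=0.01))  # InProgress
```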
@Ohashiro from what I can see from my smartphone (no laptop atm), your explanation totally makes sense. Waiting before doing the actual invocation could make sense, but I would suggest catching the PowerBIDatasetRefreshException and reinvoking the method after waiting for the interval when this error occurs within the code.
@Ohashiro also, thank you for investigating this thoroughly. It would also be nice to have a unit test which reproduces this behaviour: the first invocation raises a PowerBIDatasetRefreshException, then the code waits for the interval, and on the second invocation the call succeeds. This should be doable by mocking the PowerBI hook.
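A unit test along those lines could look like this. It is a generic sketch using unittest.mock's side_effect; the real test would patch PowerBIHook and exercise the actual trigger, so the function, class, and exception names here are illustrative:

```python
from unittest import mock

class RefreshNotFound(Exception):
    """Stand-in for PowerBIDatasetRefreshException."""

def get_status(hook, retries: int = 2) -> str:
    """Toy version of the logic under test: retry once on failure."""
    for attempt in range(retries):
        try:
            return hook.get_refresh_details()
        except RefreshNotFound:
            if attempt == retries - 1:
                raise
    raise AssertionError("unreachable")

def test_first_call_fails_second_succeeds():
    hook = mock.Mock()
    # First invocation raises, second returns a status.
    hook.get_refresh_details.side_effect = [RefreshNotFound("not found"), "Completed"]
    assert get_status(hook) == "Completed"
    assert hook.get_refresh_details.call_count == 2

test_first_call_fails_second_succeeds()
print("ok")
```

The side_effect list is the key trick: mock raises the first item (an exception) and returns the second, reproducing "fails once, then succeeds" without touching the real API.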
@dabla I have a quick question regarding best practices. Given that the PowerBI Hook can raise
I opened a PR with both the retry in case of an exception on the first request and the corresponding unit test. I'd be happy to have your feedback on this when you have some time!
Hi @Ohashiro, thank you for bringing this up and creating the PR to address it. After reviewing the conversation, I see the issue lies in
So I suggest we add the retry mechanism to the get refresh history function only, as below: it will retry fetching the history, and if it still fails, we just throw the exception. That would also mean we don't need the extra exception class you created ("PowerBIDatasetRefreshStatusExecption").
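That suggestion (retrying only the history fetch) could be sketched as below. To keep the sketch dependency-free, it uses a small hand-rolled loop that mirrors what a tenacity @retry decorator would do; the function and exception names are illustrative, and `fetch` stands in for the Power BI REST call:

```python
import time

class RefreshHistoryError(Exception):
    """Stand-in for the provider's 'unable to fetch refresh details' error."""

def get_refresh_history(fetch, refresh_id: str, retries: int = 3, wait: float = 1.0):
    """Fetch the refresh history, retrying a few times before giving up.

    `fetch` is a hypothetical callable wrapping the Power BI REST call;
    with tenacity this loop would collapse to a @retry(...) decorator.
    """
    for attempt in range(1, retries + 1):
        try:
            return fetch(refresh_id)
        except RefreshHistoryError:
            if attempt == retries:
                raise  # still failing after the last attempt: propagate
            time.sleep(wait)

# Simulated REST call that only succeeds on the third attempt.
state = {"calls": 0}
def flaky_fetch(refresh_id):
    state["calls"] += 1
    if state["calls"] < 3:
        raise RefreshHistoryError(refresh_id)
    return {"status": "Completed"}

print(get_refresh_history(flaky_fetch, "abc-123", wait=0.01))
```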
I don't think it's good practice to use the tenacity retry mechanism while operators (i.e. task instances) have their own retry mechanism managed by Airflow tasks, but I could be wrong. I also went back to the code, and I still don't understand why the call would still fail when the task instance for that operator is retried a second time by Airflow: since some time would have passed between the first and the second attempt, I would expect the second call to succeed, but apparently it still doesn't. Maybe it's because of this part:
So maybe, @Ohashiro, a delay before calling get_refresh_details_by_refresh_id would be a possible easy fix to avoid the issue on the second call, even though I don't like it that much and it doesn't ensure the call will eventually succeed. What I suggest is to refactor the PowerBITrigger and PowerBIDatasetRefreshOperator. The PowerBITrigger should get an extra dataset_refresh_id parameter in the constructor, so the run method can be called with and without the dataset_refresh_id parameter; that way it can handle both scenarios and return corresponding TriggerEvents for the executed flow (i.e. is a dataset refresh being triggered, or do we want to get the dataset refresh details). Code modifications could possibly look like this in PowerBITrigger:
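A minimal sketch of the two-mode run() idea described above, as a generic async generator rather than the provider's actual trigger code (the id value and event dicts are illustrative):

```python
import asyncio
from typing import AsyncIterator, Optional

class TwoModeTrigger:
    """Toy trigger: without a refresh id it 'triggers' a refresh and yields
    the new id; with one it 'polls' and yields the final status."""

    def __init__(self, dataset_refresh_id: Optional[str] = None):
        self.dataset_refresh_id = dataset_refresh_id

    async def run(self) -> AsyncIterator[dict]:
        if self.dataset_refresh_id is None:
            # Mode 1: trigger the refresh and hand the id back to the operator.
            yield {"status": "triggered", "dataset_refresh_id": "abc-123"}
        else:
            # Mode 2: poll for the refresh status (simulated here).
            yield {"status": "Completed", "dataset_refresh_id": self.dataset_refresh_id}

async def demo():
    first = [e async for e in TwoModeTrigger().run()][0]
    second = [e async for e in TwoModeTrigger(first["dataset_refresh_id"]).run()][0]
    return first, second

first, second = asyncio.run(demo())
print(first["status"], second["status"])  # triggered Completed
```

The point of the split is that the operator can re-defer only the second mode on failure, instead of re-triggering a whole new refresh.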
Then, in the PowerBIDatasetRefreshOperator, both TriggerEvents should be handled accordingly, which means the PowerBITrigger will be called (i.e. deferred) twice: once to trigger the dataset refresh and once to poll get_refresh_details_by_refresh_id. On the first attempt, if the trigger succeeds, the dataset_refresh_id should be persisted as an XCom within the operator, so that when the second call fails the operator can directly retry the second Trigger call. This is of course a more complex approach due to the deferrable aspect, but it would be more in line with how operators should work imho, instead of hiding failures and doing retries outside the Airflow flow using the tenacity library. From what I see in the code base, tenacity is only used in the retries module of Airflow and the CLI commands, not in the operators/hooks/triggers.
Hello @dabla, thank you for your investigation and suggestion! Regarding your first point:
From what I understand, when the task fails (for example, in our case, because the refreshId was not found), the operator cancels the refresh (using
Regarding the fix implementation, the delay solution we discussed seems to work well (which, imho, confirms the root cause of the bug), but I agree that this fix is more of a "quick fix" than a clean one. I can work on the refactor you suggested; I just have a few questions:
Regarding the retry mechanism you are suggesting in the operator, how would you do it? Regarding the trigger, if I understand correctly, the run method would look like this:

```python
async def run(self) -> AsyncIterator[TriggerEvent]:
    if not self.dataset_refresh_id:
        # Just yield a TriggerEvent with the dataset_refresh_id; the operator
        # will then retrigger the PowerBITrigger with that id, so the trigger
        # knows it only has to get the refresh details in case of failure.
        yield TriggerEvent(
            ...
        )
    else:
        # Handle the "while" loop polling for the refresh status.
        ...
```

And regarding the operator, if I understand correctly, it would look like this:

```python
class PowerBIDatasetRefreshOperator(BaseOperator):
    def execute(self, context: Context):
        """Refresh the Power BI Dataset."""
        if self.wait_for_termination:
            self.defer(
                trigger=PowerBITrigger(...),
                method_name=self.push_refreshId.__name__,
            )

    def push_refreshId(self, context: Context, event: dict[str, str]):
        # Push the refresh Id to XCom.
        self.xcom_push(
            context=context, key="powerbi_dataset_refresh_Id", value=event["dataset_refresh_id"]
        )
        self.defer(
            trigger=PowerBITrigger(...),
            method_name=self.execute_complete.__name__,
        )

    def execute_complete(self, context: Context, event: dict[str, str]):
        # Exit the operator as currently done.
        ...
```
Indeed, that's the main issue: because the refresh-details call fails and given how the operator is implemented today, instead of directly fetching the refresh details on the second attempt, it will again trigger a new dataset refresh and then again try to get its refresh details, which of course, as you mentioned, will fail again, so the main problem persists. That's why I suggested the refactor, so that the PowerBITrigger can handle both cases separately and the operator can retry the second case directly instead of redoing the whole flow. I also saw, after a remark from my colleague @joffreybienvenu-infrabel, that tenacity is actually also used in the KubernetesPodOperator, so I was wrong that it's not used by operators. But I still think it's better to use the retry mechanism implemented by the TaskInstance instead of bypassing it and doing it directly within the hook/operator, as imho that's not its purpose or good practice, but again I could be wrong. Also, regarding your retry question: the retry mechanism is implemented by default by the task handling in Airflow, which is why I want to avoid the usage of tenacity (I see it more as a hack or quick fix than an actual good solution), as there is already a solution for that in Airflow. So if you do the refactor as I suggested, the retry mechanism will work fine, provided xcom_push still works when a task fails afterwards, so that's something to be tested.
Btw, I just (locally) implemented your refactor and the separation between the 2 flows (refresh trigger and get refresh status) in 2 deferrable triggers seems to let enough time between the refresh creation and the first "get refresh status" request. It seems that I don't get the error anymore (not that the bug is really fixed, but in my environment, the duration between both events is enough to prevent the error).
Regarding this, how/where would you retry the second case?
Indeed, that's also a consequence of using the trigger twice: as more time passes automatically between both invocations, the error will probably not occur anymore and everything will succeed in one attempt. For the second case, in the operator, you should do an xcom_pull in the execute method to see if there is an existing dataset_refresh_id: if not, you know you have to execute the whole flow; if it's there, you know you should only trigger the second part.
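That decision logic in the operator's execute method could be sketched as a plain function. `persisted_refresh_id` stands in for the result of the xcom_pull; the flow names are illustrative:

```python
def choose_flow(persisted_refresh_id):
    """Decide which part of the flow to (re)run on a task attempt.

    `persisted_refresh_id` stands in for
    xcom_pull(key="powerbi_dataset_refresh_Id"): if a refresh was already
    triggered on a previous attempt, skip straight to status polling.
    """
    if persisted_refresh_id is None:
        return "trigger_refresh_then_poll"
    return "poll_refresh_status"

print(choose_flow(None))       # trigger_refresh_then_poll
print(choose_flow("abc-123"))  # poll_refresh_status
```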
After a bit of investigation, I think it won't be possible to pass XCom messages between different task executions of the same operator (if that was what you meant). According to the Airflow XComs documentation, "If the first task run is not succeeded then on every retry task XComs will be cleared to make the task run idempotent." From what I understand, it is not possible for the second task attempt (after a retry) to access the first attempt's XCom message. (See for comparison airflow/providers/src/airflow/providers/amazon/aws/operators/dms.py, lines 720 to 723 at 9ba279d.)
Though I'm not sure if this is what you'd prefer. After looking at the codebase, I see retry mechanisms in some hooks (ex:
Not sure which solution is best, to be honest. What do you think?
You're right about the XComs being flushed when a task fails. I like the above approach of defining a dedicated retry method which is passed as the next_method argument of the Trigger; that could be an option. If it's too difficult, then I would do what @ambika-garg proposed, which I think you also proposed at the beginning of this issue.
Hi @dabla

```python
def execute(self, context: Context):
    """Refresh the Power BI Dataset."""
    if self.wait_for_termination:
        self.defer(
            trigger=PowerBITrigger(
                conn_id=self.conn_id,
                group_id=self.group_id,
                dataset_id=self.dataset_id,
                timeout=self.timeout,
                proxies=self.proxies,
                api_version=self.api_version,
                check_interval=self.check_interval,
                wait_for_termination=self.wait_for_termination,
            ),
            method_name=self.get_refresh_status.__name__,
        )

def get_refresh_status(self, context: Context, event: dict[str, str] | None = None):
    """Push the refresh Id to XCom, then run the Trigger to wait for refresh completion."""
    if event:
        if (
            event["status"] == "error"
            and "Unable to fetch the details of dataset refresh with Request Id" not in event["message"]
            and "not found" not in event["message"]
        ):
            raise AirflowException(event["message"])
        self.xcom_push(context=context, key="powerbi_dataset_refresh_Id", value=event["dataset_refresh_id"])
    dataset_refresh_id = self.xcom_pull(context=context, key="powerbi_dataset_refresh_Id")
    if dataset_refresh_id:
        self.defer(
            trigger=PowerBITrigger(
                conn_id=self.conn_id,
                group_id=self.group_id,
                dataset_id=self.dataset_id,
                dataset_refresh_id=dataset_refresh_id,
                timeout=self.timeout,
                proxies=self.proxies,
                api_version=self.api_version,
                check_interval=self.check_interval,
                wait_for_termination=self.wait_for_termination,
            ),
            method_name=self.execute_complete.__name__,
        )

def retry_execution(self, context: Context):
    retries = self.xcom_pull(context=context, key="retries")
    if retries and retries >= self.max_retries:
        raise AirflowException("Max number of retries reached!")
    if not retries:
        retries = 0
    self.xcom_push(context=context, key="retries", value=retries + 1)
    self.get_refresh_status(context)

def execute_complete(self, context: Context, event: dict[str, str]) -> Any:
    """
    Return immediately - callback for when the trigger fires.

    Relies on the trigger to throw an exception; otherwise it assumes execution was successful.
    """
    if event:
        if event["status"] == "error":
            if (
                "Unable to fetch the details of dataset refresh with Request Id" in event["message"]
                or "not found" in event["message"]
            ):
                self.retry_execution(context)
            else:
                raise AirflowException(event["message"])
        self.xcom_push(context=context, key="powerbi_dataset_refresh_status", value=event["status"])
```

Note: in addition to these changes, we should add a new way to handle the refresh cancellation. By default, if the trigger encounters an exception, it cancels the refresh (which is not compatible with the retry made by the operator), so if we keep this solution, we have to change that behavior. I think this solution can work, but it might add a little too much complexity to the operator compared to a simple retry, though I do think the separation between the refresh trigger and the status fetch is nice.
The separation of the two flows handled by the PowerBITrigger is a good thing indeed, so the effort there is not lost and is in fact a cleaner design. Now, if it's too complex to handle the retry within the operator, then I propose you use tenacity on that refresh_history method of the PowerBIHook.
Hi @dabla |
Apache Airflow Provider(s)
microsoft-azure
Versions of Apache Airflow Providers
apache-airflow-providers-microsoft-azure==11.1.0
Apache Airflow version
2.10.2
Operating System
linux
Deployment
Google Cloud Composer
Deployment details
No response
What happened
We use the operator PowerBIDatasetRefreshOperator to refresh our PowerBI datasets. Sometimes, the task quickly fails with the following error:
However, on the PowerBI side, the dataset is refreshed (timezone GMT+1; both the time in the refresh and the time in the log are about the same):
If I understand correctly, this corresponds to an error in this function, which may mean that there was an error when trying to fetch the refresh status even though the refresh was running. Maybe the error comes from the request to the Microsoft API, which could be improved with a retry (but we have not tested that yet).
What you think should happen instead
The task should succeed (since the dataset refresh succeeded).
How to reproduce
On our side, we use the following configuration.
The dataset ID and group ID correspond to a valid PowerBI dataset and workspace.
Anything else
This bug occurs about once every 3 task runs (the task fails but the refresh succeeds). Sometimes it fails several times in a row and then works again; other times it works on the first run. I think this means the configuration is good (since the operator often works fine).
I haven't been able to identify any specific pattern.
Are you willing to submit PR?