Skip to content
This repository has been archived by the owner on Mar 1, 2024. It is now read-only.

Wordlift #556

Open
wants to merge 19 commits into
base: main
Choose a base branch
from
Open

Wordlift #556

wants to merge 19 commits into from

Conversation

msftwarelab
Copy link
Contributor

@msftwarelab msftwarelab commented Oct 2, 2023

Description

Modifies the usage of the reader and methods of url handling.

How Has This Been Tested?

I tested with the sample data and all worked fined.

  • I stared at the code and made sure it makes sense

Copy link
Collaborator

@EmanuelCampos EmanuelCampos left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did one comment, adding async not sounds a big deal. but I'm curious to understand better what you want to achieve

Comment on lines +84 to +89
response = await asyncio.to_thread(
requests.post,
self.endpoint,
json={"query": query},
headers=self.headers,
)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the reason to run the request on another thread? or add async operations?

@EmanuelCampos
Copy link
Collaborator

quick bump @msftwarelab

@EmanuelCampos
Copy link
Collaborator

bump @msftwarelab

@msftwarelab
Copy link
Contributor Author

Sorry for late response @EmanuelCampos
I used async for Multiple HTTP requests.
In the is_valid_html and clean_html functions, if the content is a URL, the function will make an HTTP request. This can lead to a lot of outgoing requests if many fields in the returned data from the WordLift GraphQL API are URLs. This can potentially cause the entire process to be stuck, especially if the URLs are non-responsive or slow.
Do you have any suggestions?

@EmanuelCampos
Copy link
Collaborator

EmanuelCampos commented Oct 10, 2023

Thanks for the explanation @msftwarelab

I understand the problem, but we have some issues with eg: OnDemandLoader runs synchronously

And probably on other Tools which I'm not able to find now but that needs to run synchronously

Maybe a solution would be to set a read timeout on the is_valid_html and clean_html functions, what do you think? https://requests.readthedocs.io/en/latest/user/advanced/#timeouts

will cc: @logan-markewich on this one

@msftwarelab
Copy link
Contributor Author

Thank you for your kind response! @EmanuelCampos
Ok. Let me try it.

@logan-markewich
Copy link
Collaborator

@msftwarelab Hmm, yea I'd say to implement both load_data and aload_data

In most cases, load_data() can call the aload_data() using asyncio.run() to run the async method. You could even use asyncio.run() for the underlying fetch_data function if you wanted.

We should probably update the base class to make this more clear

@EmanuelCampos
Copy link
Collaborator

quick bump @msftwarelab

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants