Cleaning and compression of raw html #44

Winsome-A · 2024-09-26T01:53:54Z

Dear authors, I would like to ask how I can use part of your work to clean and compress a raw HTML file into a new, smaller HTML file. After reading the official website, I tried using the Python library functions you provide to do this. Is that the right approach?
Another question: if I use those Python libraries, does this involve selecting candidates, that is, keeping only the content that corresponds to each candidate? Wouldn't that require manually setting the candidates in advance?
Ideally, I would input an HTML file, process it, and get the processed HTML file as output. If I need to set the candidates ahead of time, wouldn't I lose the automation?
I look forward to your reply, and thank you for your insights!

xhluca (Maintainer) · 2024-09-27T15:53:38Z

> I would like to ask how I can utilize part of your work to clean and compress a raw html file to get a new compressed html file.

Although you may find the library useful for cleaning HTML files, that is not its primary goal; rather, our goal is to process HTML files so they can be ingested by LLMs for predicting web actions.
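If cleaning and shrinking the HTML is all you need, a generic approach outside this library may be enough. The following is a minimal sketch using only Python's standard library; it is not this library's API, and the names `HTMLCleaner`, `clean_html`, `SKIP_TAGS`, and `KEEP_ATTRS` are hypothetical. It assumes you want to drop scripts, styles, comments, most attributes, and redundant whitespace.

```python
import html
import re
from html.parser import HTMLParser

# Tags whose whole content is dropped (assumption: not needed downstream).
SKIP_TAGS = {"script", "style", "noscript"}
# Void elements never get a closing tag.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}
# Hypothetical allow-list of attributes worth keeping.
KEEP_ATTRS = {"id", "class", "href", "name", "type", "value", "alt", "title"}


class HTMLCleaner(HTMLParser):
    """Re-emits HTML without skipped tags, comments, or extra attributes."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0  # > 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
            return
        if self.skip_depth:
            return
        kept = "".join(
            f' {name}="{html.escape(value, quote=True)}"'
            for name, value in attrs
            if name in KEEP_ATTRS and value is not None
        )
        self.out.append(f"<{tag}{kept}>")

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
            return
        if self.skip_depth or tag in VOID_TAGS:
            return
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if self.skip_depth:
            return
        # Collapse whitespace runs; whitespace-only text is dropped (lossy).
        text = re.sub(r"\s+", " ", data)
        if text.strip():
            self.out.append(html.escape(text, quote=False))

    # Comments are dropped: the inherited handle_comment does nothing.


def clean_html(raw: str) -> str:
    """Return a smaller HTML string built from `raw`."""
    parser = HTMLCleaner()
    parser.feed(raw)
    parser.close()
    return "".join(parser.out)
```

Calling `clean_html(open("page.html").read())` and writing the result to a new file gives the input-file-to-output-file flow described in the question. Note that dropping whitespace-only text can merge adjacent inline elements, so adjust that rule if exact text spacing matters.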

> if those python libraries are utilized, would this involve the selection of candidates, that is, keeping the content corresponding to the candidate? Wouldn't that require artificially setting the candidate in advance?

You can use the DMR retriever to dynamically find relevant candidates, given a context (an action history similar to those in WebLINX). For a concrete example, see: https://github.com/McGill-NLP/webllama/blob/main/examples/complete/run_all.py

> One of my ideal effects would be to input the html file, then process it through, and the output would be the html file after the operation, so if I need to set the candidate ahead of time wouldn't I be missing automation?

If I understand your question correctly: you can use the DMR model as part of your automation pipeline, so you can get candidates automatically given the raw HTML, bounding box coordinates, and action/dialogue history.
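To make the idea of automatic candidate selection concrete, here is a toy sketch of the general pattern: score every element in the page against the context and keep the top few. It does not use DMR or any webllama/weblinx API, and it ignores bounding boxes; a simple word-overlap (Jaccard) score stands in for the learned retriever. The names `ElementCollector`, `tokenize`, and `rank_candidates` are hypothetical.

```python
import re
from html.parser import HTMLParser

VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class ElementCollector(HTMLParser):
    """Collects one record per element: tag, attributes, and direct text."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.elements = []
        self.stack = []  # indices of currently open elements

    def handle_starttag(self, tag, attrs):
        self.elements.append({"tag": tag, "attrs": dict(attrs), "text": ""})
        if tag not in VOID_TAGS:
            self.stack.append(len(self.elements) - 1)

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag (tolerates sloppy HTML).
        for i in range(len(self.stack) - 1, -1, -1):
            if self.elements[self.stack[i]]["tag"] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        if self.stack and data.strip():
            self.elements[self.stack[-1]]["text"] += " " + data.strip()


def tokenize(text):
    """Lowercase alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rank_candidates(raw_html, context, top_k=3):
    """Return up to top_k elements that best overlap with the context.

    Word overlap is a stand-in here; a real pipeline would use a trained
    retriever model instead.
    """
    collector = ElementCollector()
    collector.feed(raw_html)
    collector.close()

    query = set(tokenize(context))
    scored = []
    for element in collector.elements:
        attr_values = " ".join(v or "" for v in element["attrs"].values())
        words = set(tokenize(element["text"] + " " + attr_values))
        if not words or not query:
            continue
        score = len(query & words) / len(query | words)  # Jaccard similarity
        if score > 0:
            scored.append((score, element))

    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [element for _, element in scored[:top_k]]
```

Because the candidates come from scoring rather than from a hand-written list, nothing has to be set in advance: the same function can run on every new page and context in the pipeline.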
