polite for robots.txt #5

Open
PallHaraldsson opened this issue Aug 29, 2023 · 3 comments
Labels
enhancement New feature or request

Comments

@PallHaraldsson

PallHaraldsson commented Aug 29, 2023

Hi,

I like seeing this new package. I knew of Tidier but was clearly ignorant of rvest, and wasn't expecting it to include web scraping.

Are you reimplementing it 100% just to have the same API as Tidier, or would this now be the go-to Julia package for web scraping? It seems your dependencies do not do it. I seem to recall Julia people already doing this, but they may have done so by calling Beautiful Soup (would that be the best Python package for it, and maybe the best of all [including, at least previously, native Julia packages]?)

I see:

> If you’re scraping multiple pages, I highly recommend using rvest in concert with polite. The polite package ensures that you’re respecting the robots.txt and not hammering the site with too many requests.

To be clear, is that not yet implemented or ported in another package, or included in this one? If it belongs here, please add it to the to-do list at #1.

[I like the name and pun, and the logo; it's just very obscure what this package is about. I hope the package will not be overlooked for that reason. It should not be; the name Beautiful Soup doesn't seem to have harmed that package.]

@kdpsingh
Member

Thanks for the feedback. The original "rvest" name is a play on words on the word "harvest" (as in to harvest/scrape a web page). TidierVest is a further play on words on that.

We will make the README (and eventually the documentation) clear on the purpose of the package.

The TidierVest package is fully implemented in Julia so there is no dependency on R. I love the suggestion of respecting robots.txt and rate-limiting requests. To do this, we would need to implement the concepts from the polite R package in Julia.

Will let @jdiaz97 weigh in on his thoughts and whether he has the bandwidth to work on this.

@PallHaraldsson
Author

PallHaraldsson commented Aug 29, 2023

The original "rvest" is for harvest; I see it now, though the h is not silent. :)

I believe the package is useful even without Tidier.jl (or its R equivalent, or R itself), since it gives you a DataFrame, so I would make that clear in the docs/README. [And possibly mention that you can still use Tidier with it.]

Given that the name is even more obscure, and I don't think you want to rename the package, I also suggest explaining rvest and TidierVest in the README. I think it might make the name likelier to stick in your mind once you learn this.

Why polite is a separate package in R, I don't know, but if it's not too big, then maybe its functionality fits here as an (optional) feature. I'm not pressing for it to be implemented (soon); I was also curious whether it is already available elsewhere in Julia. Do you think this is the best (non-polite) web scraping (or only?) package in Julia yet? How would you rate this package, or the original vs Beautiful Soup or any best-in-class web scraping package?

[You can of course use polite as is, if really needed, i.e. in R together with rvest, called from Julia; though most likely not (that) polite with TidierVest.jl.]

I see now:
https://stackoverflow.com/questions/59825336/how-can-i-do-web-scraping-in-julia

and Cascadia.jl to finally scrape using a CSS selector API.

So maybe some functionality belongs there, in that "CSS selector" library. I didn't know what that was, so I overlooked it; it did not seem like web scraping functionality at all. I thought I sort of knew what CSS is about, though.

@jdiaz97
Collaborator

jdiaz97 commented Feb 27, 2024

Hi @PallHaraldsson, thanks for the comments.

> I also suggest explaining rvest and TidierVest in the README

Will do

> I don't know, but if it's not too big, then maybe its functionality fits here as an (optional) feature.

I was thinking the same thing; I'm pretty sure we could implement the core functions bow() and scrape() without bloating TidierVest. See https://github.com/dmi3kno/polite
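
For illustration only, here is a minimal sketch of the idea behind bow() and scrape(): read the robots.txt rules once, then check each URL and space out requests. It is written in Python with only the standard library, not in Julia, and it is not how polite or TidierVest implement this. `PoliteSession`, its methods, the sample robots.txt, the `example-bot` user agent, and the 5-second fallback delay are all hypothetical names and values chosen for the example.

```python
import time
from urllib.robotparser import RobotFileParser

# Assumed robots.txt content, kept in memory so the sketch needs no network.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


class PoliteSession:
    """Hypothetical helper: remembers robots.txt rules and spaces out requests."""

    def __init__(self, robots_lines, user_agent="example-bot", default_delay=5.0):
        self.user_agent = user_agent
        self.rules = RobotFileParser()
        self.rules.parse(robots_lines)
        # Use the site's Crawl-delay when present, else the assumed default.
        delay = self.rules.crawl_delay(user_agent)
        self.delay = float(delay) if delay is not None else default_delay
        self._last_request = None

    def allowed(self, url):
        """Return True when robots.txt permits this user agent to fetch url."""
        return self.rules.can_fetch(self.user_agent, url)

    def wait_time(self, now):
        """Seconds still to wait before the next request may be sent."""
        if self._last_request is None:
            return 0.0
        return max(0.0, self.delay - (now - self._last_request))

    def scrape(self, url, fetch, clock=time.monotonic, sleep=time.sleep):
        """Call fetch(url) if allowed, after waiting out the delay.

        Returns None for disallowed URLs. fetch is any callable the caller
        supplies, for example a function wrapping an HTTP client.
        """
        if not self.allowed(url):
            return None
        pause = self.wait_time(clock())
        if pause > 0:
            sleep(pause)
        self._last_request = clock()
        return fetch(url)


# Usage with a stand-in fetch function instead of a real HTTP request.
session = PoliteSession(ROBOTS_TXT.splitlines())
page = session.scrape("https://example.com/index.html", lambda u: "<html>stub</html>")
blocked = session.scrape("https://example.com/private/data", lambda u: "<html>stub</html>")
```

Passing `fetch`, `clock`, and `sleep` in as arguments keeps the rate-limiting logic separate from any particular HTTP client, which is one way such a feature could stay optional.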

> Do you think this is the best (non-polite) web scraping (or only?) package in Julia yet? How would you rate this package, or the original vs Beautiful Soup or any best-in-class web scraping package?

I think TidierVest has the best syntax right now. It's just syntactic sugar though; it doesn't add features that didn't exist before. But I like it, and it sits in the same spot as rvest, imo.
I haven't used Beautiful Soup, so I can't compare, but we're missing some key features that rvest also lacks and Selenium has, so maybe we should take a look at that and see how to implement them.

jdiaz97 added the enhancement label on Jul 2, 2024