polite for robots.txt #5
Comments
Thanks for the feedback. The original "rvest" name is a play on the word "harvest" (as in to harvest/scrape a web page). TidierVest is a further play on words on that. We will make the README (and eventually the documentation) clear on the purpose of the package. The TidierVest package is fully implemented in Julia, so there is no dependency on R. I love the suggestion of respecting robots.txt and rate-limiting requests. To do this, we would need to implement the concepts from the polite R package in Julia. Will let @jdiaz97 weigh in with his thoughts and whether he has the bandwidth to work on this.
The original "rvest" for harvest, I see it now, though the h isn't silent. :) I believe the package is useful even without Tidier.jl (or its R equivalent, or R), since it gives you a DataFrame, so I would make that clear in the docs/README. [And possibly mention that you can still use Tidier with it.] Given that the name is even more obscure, and I don't think you want to rename the package, I also suggest explaining rvest and TidierVest in the README. I think that might make the name likelier to stick in your mind once you learn it.
Why polite is a separate package in R, I don't know, but if it's not too big, then maybe its functionality fits here as an (optional) feature. I'm not pressing for it to be implemented (soon); I was also curious whether it's already available elsewhere in Julia. Do you think this is the best (non-polite) web scraping package in Julia yet, or the only one? How would you rate this package, or the original, against Beautiful Soup or any best-in-class web scraping package? [You can of course use polite as is, if really needed, i.e. in R together with rvest, called from Julia; though most likely not (that) politely with TidierVest.jl.]
I see now:
So maybe some functionality belongs there in that "CSS selector" library. I didn't know what that was, so I overlooked it; it did not seem like web scraping functionality at all. I thought I sort of knew what CSS is about, though.
Hi @PallHaraldsson, thanks for the comments.
Will do
I was thinking the same thing. I'm pretty sure we could implement the core functions bow() and scrape() without bloating TidierVest. https://github.com/dmi3kno/polite
I think TidierVest has the best syntax right now, though it's just syntactic sugar; it doesn't add features that didn't exist before. But I like it, and it sits in the same spot as rvest, imo.
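To make the bow() and scrape() idea above concrete, here is a minimal sketch of the two behaviours discussed in this thread: checking robots.txt before a request and spacing requests out. It is written in Python using only the standard library, purely as an illustration; it is not the polite API and not TidierVest code. The class `PoliteSession`, its methods, the `example-bot` user agent, and the 5-second default delay are all hypothetical names and values chosen for this example, and a Julia port would need its own robots.txt parser.

```python
import time
from urllib import robotparser


class PoliteSession:
    """Hypothetical helper: remembers robots.txt rules and rate-limits requests."""

    def __init__(self, robots_lines, user_agent="example-bot",
                 default_delay=5.0, clock=time.monotonic, sleep=time.sleep):
        self.user_agent = user_agent
        # Parse robots.txt content that has already been downloaded.
        self._parser = robotparser.RobotFileParser()
        self._parser.parse(robots_lines)
        # Use the site's Crawl-delay if it declares one, else an assumed default.
        declared = self._parser.crawl_delay(user_agent)
        self.delay = float(declared) if declared is not None else default_delay
        # clock and sleep are injectable so the timing logic can be tested.
        self._clock = clock
        self._sleep = sleep
        self._last_request = None

    def allowed(self, url):
        """Return True if robots.txt permits this user agent to fetch url."""
        return self._parser.can_fetch(self.user_agent, url)

    def _wait(self):
        """Sleep until at least self.delay seconds have passed since the last request."""
        if self._last_request is not None:
            remaining = self.delay - (self._clock() - self._last_request)
            if remaining > 0:
                self._sleep(remaining)
        self._last_request = self._clock()

    def scrape(self, url, fetch):
        """Call fetch(url) only if allowed, after waiting; return None if disallowed."""
        if not self.allowed(url):
            return None
        self._wait()
        return fetch(url)
```

In use, the session would be created once per host from that host's robots.txt lines, and every page request would go through `scrape`, with `fetch` being whatever function actually downloads and parses the page.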
Hi,
I like seeing this new package. I did know of Tidier but was clearly ignorant of rvest, and wasn't expecting it to have web scraping.
Are you reimplementing it 100% just to have the same API as Tidier, or would this now be the go-to Julia package for web scraping? It seems your dependencies do not do it. I seem to recall Julia people doing web scraping already, but they may have done so by calling Beautiful Soup (would that be the best Python package for it, and maybe the best of all, including, at least previously, native Julia packages?)
I see:
To be clear, is that not yet implemented/ported in another package, or included in this one? If it belongs here, please add it to the to-do list at #1.
[I like the name and pun, and the logo; it's just very obscure what this package is about. I hope the package will not be overlooked for that reason. It should not be; the name of Beautiful Soup doesn't seem to have harmed it.]