Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update func isDomainAllowed #420

Open
wants to merge 1 commit into
base: master
Choose a base branch
from
Open

Update func isDomainAllowed #420

wants to merge 1 commit into from

Conversation

jiguangsdf
Copy link

like scrapy. if c.Visit("http://www.xxxx.com.cn/"), colly.AllowedDomains("xxxx.com.cn") should be work for main domain. so if we request 'www.xxxx.com.cn' and AllowedDomains are 'xxxx.com.cn'. This means that we have access to all subdomains under the top-level domain name

like scrapy. if c.Visit("http://www.xxxx.com.cn/"), colly.AllowedDomains("xxxx.com.cn") should be work for main domain. so if we request 'www.xxxx.com.cn' and AllowedDomains are 'xxxx.com.cn'. This means that we have access to all subdomains under the top-level domain name
@@ -693,6 +693,11 @@ func (c *Collector) isDomainAllowed(domain string) bool {
if c.AllowedDomains == nil || len(c.AllowedDomains) == 0 {
return true
}
for _, d2 := range c.AllowedDomains {
if strings.Contains(domain, d2) {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I recommend to use "HasSuffix" method instead of "Contains". Does that make sense?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah. "HasSuffix" is well

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

HasSuffix sounds better for me, could you change it? Also, please update the docstring with the new behavior.

@maaronking
Copy link

This is a great change, I was just looking for a good solution for this issue. 👍

@edoardottt
Copy link

Any update about this? ...

@WGH-
Copy link
Collaborator

WGH- commented Jul 14, 2021

I dunno, this is a pretty breaking change. Depending on your use case, you might want this behaviour, or not.

In any case, one can do arbitrary origin check of the URL just before calling Visit, having it built in into Collector is not strictly necessary.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants