Should the crawler respect the <meta name="robots" content="noindex,nofollow">? #401
Comments
I do think that creating a dedicated tag would be nice to avoid crawling a small subset of pages. However, let's try to avoid making it a regular practice, since it might increase the load of the crawl by downloading more content than required. Providing a dedicated sitemap might be wiser: we would then only follow the links from this dedicated documentation sitemap. WDYT?
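As a rough illustration of that idea, a site could expose a documentation-only sitemap and point the scraper at it. The `sitemap_urls` key below follows the docsearch-scraper config format, but treat the exact spelling as an assumption; the index name and URLs are made up:

```json
{
  "index_name": "example_docs",
  "start_urls": ["https://example.com/docs/"],
  "sitemap_urls": ["https://example.com/docs/docsearch-sitemap.xml"]
}
```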
That could be a great idea. Some kind of
Scrapy has built-in support for robots.txt (the ROBOTSTXT_OBEY setting): https://doc.scrapy.org/en/latest/topics/settings.html?highlight=robot#std:setting-ROBOTSTXT_OBEY

It should be easy to enable that setting here, but it might impact existing configurations.
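For reference, turning that setting on in a Scrapy spider would look something like the sketch below. The spider name is made up, and gating it behind a per-site config flag is an assumption, not how DocSearch currently behaves:

```python
import scrapy


class DocumentationSpider(scrapy.Spider):
    # Hypothetical spider name, used only for this sketch.
    name = "documentation"

    # ROBOTSTXT_OBEY is a standard Scrapy setting; enabling it makes Scrapy
    # fetch and respect robots.txt before crawling, which is why switching
    # it on by default would change behaviour for existing configurations.
    custom_settings = {"ROBOTSTXT_OBEY": True}
```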
Yep, I think we should not follow robots.txt by default (because changing that would not be backward compatible). My suggestion was that maybe we could reuse the robots.txt syntax to add custom DocSearch information. Maybe something like:
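(A purely hypothetical sketch of the idea, reusing robots.txt syntax; none of these directives are an implemented DocSearch feature.)

```
# Hypothetical "docsearch.txt", borrowing the robots.txt format.
User-agent: DocSearch
Disallow: /docs/drafts/
Disallow: /docs/archive/
Allow: /
```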
Good idea, but let's put this into the configuration. Shall we wait for the codebase refactor (the migration to Python 3)?
Yeah, let's wait for the refactor. We can then add a new middleware inspired by the built-in one from Scrapy: https://github.com/scrapy/scrapy/blob/master/scrapy/downloadermiddlewares/robotstxt.py#L88
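As a rough illustration of that direction (not DocSearch's actual implementation), a downloader middleware could drop responses carrying a robots noindex meta tag; handling nofollow would additionally require skipping link extraction, which is left out here:

```python
from scrapy.exceptions import IgnoreRequest
from scrapy.http import HtmlResponse


class RobotsMetaTagMiddleware:
    """Sketch of a middleware that drops pages marked noindex via the
    <meta name="robots"> tag. Hypothetical; not part of docsearch-scraper."""

    def process_response(self, request, response, spider):
        if isinstance(response, HtmlResponse):
            # Simple lookup of <meta name="robots" content="...">
            # (case-sensitive match kept for brevity).
            content = response.xpath('//meta[@name="robots"]/@content').get() or ""
            if "noindex" in content.lower():
                # Discard the page so it never reaches the indexing pipeline.
                raise IgnoreRequest(f"robots meta noindex: {response.url}")
        return response
```

It would be wired in through the DOWNLOADER_MIDDLEWARES setting, presumably behind an opt-in config flag so existing configurations keep their current behaviour.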
Regarding the discussion above from 2018: any updates on this issue? It would be great to at least know a decision from Algolia on whether adding support for respecting the noindex/nofollow meta tag is intended at all or not.
Hi @nkuehn
Sure: our docs site generator supports a markdown frontmatter flag that triggers the standard meta noindex HTML tagging to ensure a given page is not indexed by search engines. There are varying use cases: pre-release documentation, deprecated features that are only documented as an archive, pages that are just lists or navigation aids and should not appear in search results, etc. These pages are often in that state only temporarily and do not follow a specific "regex'able" pattern that we could put into the DocSearch config. We also need immediate control over adding and removing them without constantly bothering you (the Algolia DocSearch team) with a PR to your configs repo for every individual change.

We have now understood that DocSearch only relies on whether a page is reachable through crawling. So we are teaching docs authors the different behavior of the on-site search vs. public search engines, and we live with some pages appearing in search results that we would ideally prefer not to see there. It's an acceptable situation: anything we absolutely want to hide would not be linked anyway.

TL;DR: the main downside is the additional mental workload for authors to understand the subtle difference between excluding a page from "search" (on-site) vs. "search" (public search engines). IMHO absolutely acceptable for a free product that is great in all other respects.

PS: I personally think that de-facto standard HTML headers should be respected by a crawler by default, not only via customization. But that's probably feedback for Scrapy rather than DocSearch.
Legit. cc @shortcuts, we should have a look at the current state of this.
A user expected the crawler to respect the `<meta name="robots" content="noindex,nofollow">` meta tag, which tells crawlers to skip a page. We don't honor this tag at all (nor do we honor `robots.txt`).

I've always considered DocSearch an opt-in crawler, so it isn't bound to respect those rules: everything it will or won't crawl is configured in the config file that each website owner can edit. So I don't think we should respect this.
That being said, maybe we should introduce a new DocSearch meta tag to exclude pages, giving owners more fine-grained control without requiring a PR to their config (see the sketch below).
Thoughts @Shipow @s-pace @clemfromspace?
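To make that idea concrete, such an opt-out tag could look like the following; the `docsearch` name and `noindex` value are purely illustrative, not an existing or proposed spec:

```html
<!-- Hypothetical page-level opt-out; not an existing DocSearch meta tag. -->
<meta name="docsearch" content="noindex">
```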