Should the crawler respect the <meta name="robots" content="noindex,nofollow">? #401
Comments
I do think that creating a dedicated tag would be nice to avoid crawling a small subset of pages. However, let's try to avoid making it a regular practice, since it might increase the load of the crawl by downloading more content than required. Providing a dedicated sitemap might be wiser: we would then only follow the links from this dedicated documentation sitemap. WDYT?
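As a rough illustration of that idea, a site could expose a documentation-only sitemap and point the scraper at it. The `sitemap_urls` key below follows the docsearch-scraper config format, but treat the exact spelling as an assumption; the index name and URLs are made up:

```json
{
  "index_name": "example_docs",
  "start_urls": ["https://example.com/docs/"],
  "sitemap_urls": ["https://example.com/docs/docsearch-sitemap.xml"]
}
```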
That could be a great idea. Some kind of
Scrapy has built-in support for robots.txt (the ROBOTSTXT_OBEY setting): https://doc.scrapy.org/en/latest/topics/settings.html?highlight=robot#std:setting-ROBOTSTXT_OBEY

It should be easy to enable that setting here, but it might impact existing configurations.
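For reference, turning that setting on in a Scrapy spider would look something like the sketch below. The spider name is made up, and gating it behind a per-site config flag is an assumption, not how DocSearch currently behaves:

```python
import scrapy


class DocumentationSpider(scrapy.Spider):
    # Hypothetical spider name, used only for this sketch.
    name = "documentation"

    # ROBOTSTXT_OBEY is a standard Scrapy setting; enabling it makes Scrapy
    # fetch and respect robots.txt before crawling, which is why switching
    # it on by default would change behaviour for existing configurations.
    custom_settings = {"ROBOTSTXT_OBEY": True}
```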
Yep, I think we should not follow robots.txt by default (because changing that would not be backward compatible). My suggestion was that maybe we could reuse the robots.txt syntax to add custom DocSearch information. Maybe something like:
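(A purely hypothetical sketch of the idea, reusing robots.txt syntax; none of these directives are an implemented DocSearch feature.)

```
# Hypothetical "docsearch.txt", borrowing the robots.txt format.
User-agent: DocSearch
Disallow: /docs/drafts/
Disallow: /docs/archive/
Allow: /
```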
Good idea, but let's put this into the configuration. Shall we wait for the codebase refactor (the migration to Python 3)?
Yeah, let's wait for the refactor. We can then add a new middleware inspired by the built-in one from Scrapy: https://github.com/scrapy/scrapy/blob/master/scrapy/downloadermiddlewares/robotstxt.py#L88
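As a rough illustration of that direction (not DocSearch's actual implementation), a downloader middleware could drop responses carrying a robots noindex meta tag; handling nofollow would additionally require skipping link extraction, which is left out here:

```python
from scrapy.exceptions import IgnoreRequest
from scrapy.http import HtmlResponse


class RobotsMetaTagMiddleware:
    """Sketch of a middleware that drops pages marked noindex via the
    <meta name="robots"> tag. Hypothetical; not part of docsearch-scraper."""

    def process_response(self, request, response, spider):
        if isinstance(response, HtmlResponse):
            # Simple lookup of <meta name="robots" content="...">
            # (case-sensitive match kept for brevity).
            content = response.xpath('//meta[@name="robots"]/@content').get() or ""
            if "noindex" in content.lower():
                # Discard the page so it never reaches the indexing pipeline.
                raise IgnoreRequest(f"robots meta noindex: {response.url}")
        return response
```

It would be wired in through the DOWNLOADER_MIDDLEWARES setting, presumably behind an opt-in config flag so existing configurations keep their current behaviour.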
Regarding the discussion above from 2018: any updates on this issue? It would be great to at least know a decision from Algolia on whether adding support for respecting the noindex/nofollow meta tag is intended at all or not.
Hi @nkuehn
Sure: our docs site generator supports a markdown frontmatter flag that triggers the standard meta noindex HTML tagging to ensure a given page is not indexed by search engines. There are varying use cases: pre-release documentation, deprecated features that are only documented as an archive, pages that are just lists or navigation aids and should not appear in search results, etc. These pages are often in that state only temporarily and do not follow a specific "regex'able" pattern that we could put into the DocSearch config. We also need immediate control over adding and removing them without constantly bothering you (the Algolia DocSearch team) with a PR to your configs repo for every individual change.

We have now understood that DocSearch only relies on whether a page is reachable through crawling. So we are teaching docs authors the different behavior of the on-site search vs. public search engines, and we live with some pages appearing in search results that we would ideally prefer not to see there. It's an acceptable situation: anything we absolutely want to hide would not be linked anyway.

TL;DR: the main downside is the additional mental workload for authors to understand the subtle difference between excluding a page from "search" (on-site) vs. "search" (public search engines). IMHO absolutely acceptable for a free product that is great in all other respects.

PS: I personally think that de-facto standard HTML headers should be respected by a crawler by default, not only via customization. But that's probably feedback for Scrapy rather than DocSearch.
Legit. cc @shortcuts, we should have a look at the current state of this.
A user expected the crawler to respect the `<meta name="robots" content="noindex,nofollow">` meta tag, which tells crawlers to skip a page. We don't honor this tag at all (nor do we honor `robots.txt`).

I've always considered DocSearch an opt-in crawler, so it isn't bound to respect those rules: everything it will or won't crawl is configured in the config file that each website owner can edit. So I don't think we should respect this.
That being said, maybe we should introduce a new DocSearch meta tag to exclude pages, giving owners more fine-grained control without requiring a PR to their config (see the sketch below).
Thoughts @Shipow @s-pace @clemfromspace?
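To make that idea concrete, such an opt-out tag could look like the following; the `docsearch` name and `noindex` value are purely illustrative, not an existing or proposed spec:

```html
<!-- Hypothetical page-level opt-out; not an existing DocSearch meta tag. -->
<meta name="docsearch" content="noindex">
```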