docs: add HyperbrowserLoader docs (#29143)

### Description

This PR adds docs for the [langchain-hyperbrowser](https://pypi.org/project/langchain-hyperbrowser/) package. It includes a document loader that uses Hyperbrowser to scrape or crawl URLs and return formatted markdown or HTML content as well as relevant metadata.

[Hyperbrowser](https://hyperbrowser.ai) is a platform for running and scaling headless browsers. It lets you launch and manage browser sessions at scale and provides easy-to-use solutions for web scraping needs, such as scraping a single page or crawling an entire site.

### Issue

None

### Dependencies

None

### Twitter Handle

`@hyperbrowser`
langchain-ai · Jan 13, 2025 · 335ca3a
1 parent 4c02176
commit 335ca3a
Showing 4 changed files with 299 additions and 0 deletions.
diff --git a/docs/docs/integrations/document_loaders/hyperbrowser.ipynb b/docs/docs/integrations/document_loaders/hyperbrowser.ipynb
@@ -0,0 +1,221 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# HyperbrowserLoader"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "[Hyperbrowser](https://hyperbrowser.ai) is a platform for running and scaling headless browsers. It lets you launch and manage browser sessions at scale and provides easy-to-use solutions for web scraping needs, such as scraping a single page or crawling an entire site.\n",
+    "\n",
+    "Key Features:\n",
+    "- Instant Scalability - Spin up hundreds of browser sessions in seconds without infrastructure headaches\n",
+    "- Simple Integration - Works seamlessly with popular tools like Puppeteer and Playwright\n",
+    "- Powerful APIs - Easy-to-use APIs for scraping/crawling any site, and much more\n",
+    "- Bypass Anti-Bot Measures - Built-in stealth mode, ad blocking, automatic CAPTCHA solving, and rotating proxies\n",
+    "\n",
+    "This notebook provides a quick overview for getting started with the Hyperbrowser [document loader](https://python.langchain.com/docs/concepts/#document-loaders).\n",
+    "\n",
+    "For more information about Hyperbrowser, visit the [Hyperbrowser website](https://hyperbrowser.ai) or the [Hyperbrowser docs](https://docs.hyperbrowser.ai).\n",
+    "\n",
+    "## Overview\n",
+    "### Integration details\n",
+    "\n",
+    "| Class | Package | Local | Serializable | JS support|\n",
+    "| :--- | :--- | :---: | :---: |  :---: |\n",
+    "| HyperbrowserLoader | langchain-hyperbrowser | ❌ | ❌ | ❌ | \n",
+    "### Loader features\n",
+    "| Source | Document Lazy Loading | Native Async Support |\n",
+    "| :---: | :---: | :---: | \n",
+    "| HyperbrowserLoader | ✅ | ✅ | \n",
+    "\n",
+    "## Setup\n",
+    "\n",
+    "To access the Hyperbrowser document loader, you'll need to install the `langchain-hyperbrowser` integration package, create a Hyperbrowser account, and get an API key.\n",
+    "\n",
+    "### Credentials\n",
+    "\n",
+    "Head to [Hyperbrowser](https://app.hyperbrowser.ai/) to sign up and generate an API key. Once you've done this, set the `HYPERBROWSER_API_KEY` environment variable.\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Installation\n",
+    "\n",
+    "Install **langchain-hyperbrowser**."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "%pip install -qU langchain-hyperbrowser"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Initialization\n",
+    "\n",
+    "Now we can instantiate our loader object and load documents:\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from langchain_hyperbrowser import HyperbrowserLoader\n",
+    "\n",
+    "loader = HyperbrowserLoader(\n",
+    "    urls=\"https://example.com\",\n",
+    "    api_key=\"YOUR_API_KEY\",\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Load"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "Document(metadata={'title': 'Example Domain', 'viewport': 'width=device-width, initial-scale=1', 'sourceURL': 'https://example.com'}, page_content='Example Domain\\n\\n# Example Domain\\n\\nThis domain is for use in illustrative examples in documents. You may use this\\ndomain in literature without prior coordination or asking for permission.\\n\\n[More information...](https://www.iana.org/domains/example)')"
+      ]
+     },
+     "execution_count": null,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "docs = loader.load()\n",
+    "docs[0]"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "print(docs[0].metadata)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Lazy Load"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "page = []\n",
+    "for doc in loader.lazy_load():\n",
+    "    page.append(doc)\n",
+    "    if len(page) >= 10:\n",
+    "        # do some paged operation, e.g.\n",
+    "        # index.upsert(page)\n",
+    "\n",
+    "        page = []"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Advanced Usage\n",
+    "\n",
+    "You can specify the operation to be performed by the loader. The default operation is `scrape`. For `scrape`, you can provide a single URL or a list of URLs to be scraped. For `crawl`, you can only provide a single URL. The `crawl` operation will crawl the provided page and subpages and return a document for each page."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "loader = HyperbrowserLoader(\n",
+    "    urls=\"https://hyperbrowser.ai\", api_key=\"YOUR_API_KEY\", operation=\"crawl\"\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Optional params for the loader can also be provided in the `params` argument. For more information on the supported params, visit https://docs.hyperbrowser.ai/reference/sdks/python/scrape#start-scrape-job-and-wait or https://docs.hyperbrowser.ai/reference/sdks/python/crawl#start-crawl-job-and-wait."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "loader = HyperbrowserLoader(\n",
+    "    urls=\"https://example.com\",\n",
+    "    api_key=\"YOUR_API_KEY\",\n",
+    "    operation=\"scrape\",\n",
+    "    params={\"scrape_options\": {\"include_tags\": [\"h1\", \"h2\", \"p\"]}},\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## API reference\n",
+    "\n",
+    "- [GitHub](https://github.com/hyperbrowserai/langchain-hyperbrowser/)\n",
+    "- [PyPI](https://pypi.org/project/langchain-hyperbrowser/)\n",
+    "- [Hyperbrowser Docs](https://docs.hyperbrowser.ai/)"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.9.16"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+}
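
The Lazy Load cell in the notebook above collects documents into pages of 10, leaves the paged operation as a comment, and does not handle a final partial page. A minimal sketch of that batching pattern in plain Python might look like the following. The `batched` helper and the `fake_lazy_load` generator are assumptions made for illustration and are not part of `langchain-hyperbrowser`; with a real loader, the iterable would be `loader.lazy_load()`.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batched(items: Iterable[T], size: int) -> Iterator[List[T]]:
    """Yield lists of up to `size` items, including the final partial batch."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(items)
    while True:
        # islice pulls at most `size` items without consuming the rest.
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def fake_lazy_load(count: int) -> Iterator[str]:
    """Hypothetical stand-in for loader.lazy_load(); yields placeholder docs."""
    for i in range(count):
        yield f"doc-{i}"


# With a real loader this would be: for page in batched(loader.lazy_load(), 10)
page_sizes = [len(page) for page in batched(fake_lazy_load(25), 10)]
print(page_sizes)  # [10, 10, 5]
```

Because the helper only pulls from the iterator as each batch is requested, it keeps the memory profile of lazy loading while still flushing the trailing documents.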
diff --git a/docs/docs/integrations/providers/hyperbrowser.mdx b/docs/docs/integrations/providers/hyperbrowser.mdx
@@ -0,0 +1,67 @@
+# Hyperbrowser
+
+> [Hyperbrowser](https://hyperbrowser.ai) is a platform for running and scaling headless browsers. It lets you launch and manage browser sessions at scale and provides easy-to-use solutions for web scraping needs, such as scraping a single page or crawling an entire site.
+>
+> Key Features:
+>
+> - Instant Scalability - Spin up hundreds of browser sessions in seconds without infrastructure headaches
+> - Simple Integration - Works seamlessly with popular tools like Puppeteer and Playwright
+> - Powerful APIs - Easy-to-use APIs for scraping/crawling any site, and much more
+> - Bypass Anti-Bot Measures - Built-in stealth mode, ad blocking, automatic CAPTCHA solving, and rotating proxies
+
+For more information about Hyperbrowser, visit the [Hyperbrowser website](https://hyperbrowser.ai) or the [Hyperbrowser docs](https://docs.hyperbrowser.ai).
+
+## Installation and Setup
+
+To get started with `langchain-hyperbrowser`, you can install the package using pip:
+
+```bash
+pip install langchain-hyperbrowser
+```
+
+Then configure credentials by setting the following environment variable:
+
+`HYPERBROWSER_API_KEY=<your-api-key>`
+
+You can get your API key from https://app.hyperbrowser.ai/
+
+## Document Loader
+
+The `HyperbrowserLoader` class in `langchain-hyperbrowser` loads content from a single page or multiple pages, and can also crawl an entire site.
+The content can be loaded as markdown or HTML.
+
+```python
+from langchain_hyperbrowser import HyperbrowserLoader
+
+loader = HyperbrowserLoader(urls="https://example.com")
+docs = loader.load()
+
+print(docs[0])
+```
+
+## Advanced Usage
+
+You can specify the operation to be performed by the loader. The default operation is `scrape`. For `scrape`, you can provide a single URL or a list of URLs to be scraped. For `crawl`, you can only provide a single URL. The `crawl` operation will crawl the provided page and subpages and return a document for each page.
+
+```python
+loader = HyperbrowserLoader(
+    urls="https://hyperbrowser.ai", api_key="YOUR_API_KEY", operation="crawl"
+)
+```
+
+Optional params for the loader can also be provided in the `params` argument. For more information on the supported params, visit https://docs.hyperbrowser.ai/reference/sdks/python/scrape#start-scrape-job-and-wait or https://docs.hyperbrowser.ai/reference/sdks/python/crawl#start-crawl-job-and-wait.
+
+```python
+loader = HyperbrowserLoader(
+    urls="https://example.com",
+    api_key="YOUR_API_KEY",
+    operation="scrape",
+    params={"scrape_options": {"include_tags": ["h1", "h2", "p"]}},
+)
+```
+
+## Additional Resources
+
+- [Hyperbrowser Docs](https://docs.hyperbrowser.ai/)
+- [GitHub](https://github.com/hyperbrowserai/langchain-hyperbrowser/)
+- [PyPI](https://pypi.org/project/langchain-hyperbrowser/)
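
The provider page above states two constraints in prose: credentials come from `HYPERBROWSER_API_KEY`, and `crawl` accepts only a single URL while `scrape` accepts one or several. A rough sketch of checking both before constructing the loader is shown here. `build_loader_kwargs` is a hypothetical helper and not part of `langchain-hyperbrowser`, which may well perform its own validation; the sketch also assumes that `scrape` and `crawl` are the only operations, since those are the only two the page names. The keyword names (`urls`, `api_key`, `operation`, `params`) are the ones used in the examples in this diff.

```python
import os
from typing import Any, Dict, List, Mapping, Optional, Union


def build_loader_kwargs(
    urls: Union[str, List[str]],
    operation: str = "scrape",
    params: Optional[Dict[str, Any]] = None,
    env: Optional[Mapping[str, str]] = None,
) -> Dict[str, Any]:
    """Hypothetical helper: validate arguments before building a loader."""
    env = os.environ if env is None else env
    api_key = env.get("HYPERBROWSER_API_KEY")
    if not api_key:
        raise RuntimeError("HYPERBROWSER_API_KEY is not set")
    # Assumption: only the two operations named in the docs are supported.
    if operation not in ("scrape", "crawl"):
        raise ValueError(f"unsupported operation: {operation!r}")
    # Per the docs, crawl takes exactly one URL; scrape takes one or many.
    if operation == "crawl" and not isinstance(urls, str):
        if len(urls) != 1:
            raise ValueError("crawl accepts only a single URL")
        urls = urls[0]
    kwargs: Dict[str, Any] = {
        "urls": urls,
        "api_key": api_key,
        "operation": operation,
    }
    if params is not None:
        kwargs["params"] = params
    return kwargs


# Demo with an explicit environment mapping so no real key is needed.
demo_env = {"HYPERBROWSER_API_KEY": "YOUR_API_KEY"}
kwargs = build_loader_kwargs(
    ["https://hyperbrowser.ai"], operation="crawl", env=demo_env
)
print(kwargs["urls"])  # https://hyperbrowser.ai
```

The resulting dictionary could then be passed on as `HyperbrowserLoader(**kwargs)`, which keeps the fail-fast checks separate from the network call.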
diff --git a/docs/src/theme/FeatureTables.js b/docs/src/theme/FeatureTables.js
@@ -815,6 +815,13 @@ const FEATURE_TABLES = {
                 source: "Uses Docling to load and parse web pages",
                 api: "Package",
                 apiLink: "https://python.langchain.com/docs/integrations/document_loaders/docling/"
+            },
+            {
+                name: "Hyperbrowser",
+                link: "hyperbrowser",
+                source: "Platform for running and scaling headless browsers, can be used to scrape/crawl any site",
+                api: "API",
+                apiLink: "https://python.langchain.com/docs/integrations/document_loaders/hyperbrowser/"
             }
         ]
     },

diff --git a/libs/packages.yml b/libs/packages.yml
@@ -337,3 +337,7 @@ packages:
   path: .
   repo: AlwaysBluer/langchain-lindorm-integration
   downloads: 0
+- name: langchain-hyperbrowser
+  path: .
+  repo: hyperbrowserai/langchain-hyperbrowser
+  downloads: 0