docs: add HyperbrowserLoader docs (#29143)
### Description
This PR adds docs for the
[langchain-hyperbrowser](https://pypi.org/project/langchain-hyperbrowser/)
package. It includes a document loader that uses Hyperbrowser to scrape
or crawl any URLs and return formatted Markdown or HTML content as well
as relevant metadata.
[Hyperbrowser](https://hyperbrowser.ai) is a platform for running and
scaling headless browsers. It lets you launch and manage browser
sessions at scale and provides easy-to-use solutions for any web scraping
needs, such as scraping a single page or crawling an entire site.

### Issue
None

### Dependencies
None

### Twitter Handle
`@hyperbrowser`
NikhilShahi authored Jan 13, 2025
1 parent 4c02176 commit 335ca3a
Showing 4 changed files with 299 additions and 0 deletions.
221 changes: 221 additions & 0 deletions docs/docs/integrations/document_loaders/hyperbrowser.ipynb
@@ -0,0 +1,221 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# HyperbrowserLoader"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"[Hyperbrowser](https://hyperbrowser.ai) is a platform for running and scaling headless browsers. It lets you launch and manage browser sessions at scale and provides easy-to-use solutions for any web scraping needs, such as scraping a single page or crawling an entire site.\n",
"\n",
"Key Features:\n",
"- Instant Scalability - Spin up hundreds of browser sessions in seconds without infrastructure headaches\n",
"- Simple Integration - Works seamlessly with popular tools like Puppeteer and Playwright\n",
"- Powerful APIs - Easy-to-use APIs for scraping/crawling any site, and much more\n",
"- Bypass Anti-Bot Measures - Built-in stealth mode, ad blocking, automatic CAPTCHA solving, and rotating proxies\n",
"\n",
"This notebook provides a quick overview for getting started with the Hyperbrowser [document loader](https://python.langchain.com/docs/concepts/#document-loaders).\n",
"\n",
"For more information, visit the [Hyperbrowser website](https://hyperbrowser.ai) or the [Hyperbrowser docs](https://docs.hyperbrowser.ai).\n",
"\n",
"## Overview\n",
"### Integration details\n",
"\n",
"| Class | Package | Local | Serializable | JS support |\n",
"| :--- | :--- | :---: | :---: | :---: |\n",
"| HyperbrowserLoader | langchain-hyperbrowser | ❌ | ❌ | ❌ |\n",
"\n",
"### Loader features\n",
"| Source | Document Lazy Loading | Native Async Support |\n",
"| :---: | :---: | :---: |\n",
"| HyperbrowserLoader | ✅ | ✅ |\n",
"\n",
"## Setup\n",
"\n",
"To access the Hyperbrowser document loader, you'll need to install the `langchain-hyperbrowser` integration package, create a Hyperbrowser account, and get an API key.\n",
"\n",
"### Credentials\n",
"\n",
"Head to [Hyperbrowser](https://app.hyperbrowser.ai/) to sign up and generate an API key. Once you've done this, set the `HYPERBROWSER_API_KEY` environment variable.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Installation\n",
"\n",
"Install **langchain-hyperbrowser**."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%pip install -qU langchain-hyperbrowser"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Initialization\n",
"\n",
"Now we can instantiate the loader object and load documents:\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from langchain_hyperbrowser import HyperbrowserLoader\n",
"\n",
"loader = HyperbrowserLoader(\n",
"    urls=\"https://example.com\",\n",
"    api_key=\"YOUR_API_KEY\",\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Load"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Document(metadata={'title': 'Example Domain', 'viewport': 'width=device-width, initial-scale=1', 'sourceURL': 'https://example.com'}, page_content='Example Domain\\n\\n# Example Domain\\n\\nThis domain is for use in illustrative examples in documents. You may use this\\ndomain in literature without prior coordination or asking for permission.\\n\\n[More information...](https://www.iana.org/domains/example)')"
]
},
"execution_count": null,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"docs = loader.load()\n",
"docs[0]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(docs[0].metadata)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Lazy Load"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"page = []\n",
"for doc in loader.lazy_load():\n",
"    page.append(doc)\n",
"    if len(page) >= 10:\n",
"        # do some paged operation, e.g.\n",
"        # index.upsert(page)\n",
"\n",
"        page = []"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Advanced Usage\n",
"\n",
"You can specify the operation to be performed by the loader. The default operation is `scrape`. For `scrape`, you can provide a single URL or a list of URLs to be scraped. For `crawl`, you can only provide a single URL. The `crawl` operation will crawl the provided page and subpages and return a document for each page."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"loader = HyperbrowserLoader(\n",
"    urls=\"https://hyperbrowser.ai\", api_key=\"YOUR_API_KEY\", operation=\"crawl\"\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Optional params for the loader can also be provided in the `params` argument. For more information on the supported params, visit https://docs.hyperbrowser.ai/reference/sdks/python/scrape#start-scrape-job-and-wait or https://docs.hyperbrowser.ai/reference/sdks/python/crawl#start-crawl-job-and-wait."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"loader = HyperbrowserLoader(\n",
"    urls=\"https://example.com\",\n",
"    api_key=\"YOUR_API_KEY\",\n",
"    operation=\"scrape\",\n",
"    params={\"scrape_options\": {\"include_tags\": [\"h1\", \"h2\", \"p\"]}},\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## API reference\n",
"\n",
"- [GitHub](https://github.com/hyperbrowserai/langchain-hyperbrowser/)\n",
"- [PyPI](https://pypi.org/project/langchain-hyperbrowser/)\n",
"- [Hyperbrowser Docs](https://docs.hyperbrowser.ai/)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.16"
}
},
"nbformat": 4,
"nbformat_minor": 4
}
67 changes: 67 additions & 0 deletions docs/docs/integrations/providers/hyperbrowser.mdx
@@ -0,0 +1,67 @@
# Hyperbrowser

> [Hyperbrowser](https://hyperbrowser.ai) is a platform for running and scaling headless browsers. It lets you launch and manage browser sessions at scale and provides easy-to-use solutions for any web scraping needs, such as scraping a single page or crawling an entire site.
>
> Key Features:
>
> - Instant Scalability - Spin up hundreds of browser sessions in seconds without infrastructure headaches
> - Simple Integration - Works seamlessly with popular tools like Puppeteer and Playwright
> - Powerful APIs - Easy-to-use APIs for scraping/crawling any site, and much more
> - Bypass Anti-Bot Measures - Built-in stealth mode, ad blocking, automatic CAPTCHA solving, and rotating proxies

For more information, visit the [Hyperbrowser website](https://hyperbrowser.ai) or the [Hyperbrowser docs](https://docs.hyperbrowser.ai).

## Installation and Setup

To get started with `langchain-hyperbrowser`, you can install the package using pip:

```bash
pip install langchain-hyperbrowser
```

Then configure your credentials by setting the following environment variable:

`HYPERBROWSER_API_KEY=<your-api-key>`

You can generate an API key at https://app.hyperbrowser.ai/.
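
If you prefer to set the key from Python instead of your shell, something like the following may work. This is a minimal sketch: `ensure_api_key` is a hypothetical helper and not part of `langchain-hyperbrowser`; it only assumes that the loader reads `HYPERBROWSER_API_KEY` from the environment, as described above.

```python
import getpass
import os


def ensure_api_key(env=None, prompt=getpass.getpass):
    """Make sure HYPERBROWSER_API_KEY is set, prompting only if it is missing.

    Hypothetical helper for illustration; not part of langchain-hyperbrowser.
    `env` and `prompt` are injectable so the function can be tested offline.
    """
    env = os.environ if env is None else env
    if not env.get("HYPERBROWSER_API_KEY"):
        # getpass hides the key while it is typed.
        env["HYPERBROWSER_API_KEY"] = prompt("Enter your Hyperbrowser API key: ")
    return env["HYPERBROWSER_API_KEY"]
```

Calling `ensure_api_key()` once before constructing the loader keeps the key out of your source code.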

## Document Loader

The `HyperbrowserLoader` class in `langchain-hyperbrowser` loads content from a single page or multiple pages, and can also crawl an entire site.
The content can be returned as Markdown or HTML.

```python
from langchain_hyperbrowser import HyperbrowserLoader

loader = HyperbrowserLoader(urls="https://example.com")
docs = loader.load()

print(docs[0])
```
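
Each loaded document carries the page text in `page_content` and page details in `metadata`. The sketch below shows one way to summarize a document. It assumes the `title` and `sourceURL` metadata keys seen in the example output for `https://example.com`; other pages may return different keys, so the code falls back to placeholders. `summarize` is a hypothetical helper, and a stand-in object is used so the example runs without an API key.

```python
from types import SimpleNamespace


def summarize(doc, preview_chars=80):
    """Return a one-line summary of a loaded document.

    Assumes `doc` has `metadata` (a dict) and `page_content` (a str),
    as LangChain `Document` objects do. Hypothetical helper for illustration.
    """
    source = doc.metadata.get("sourceURL", "<unknown source>")
    title = doc.metadata.get("title", "<untitled>")
    # Collapse newlines and repeated spaces, then truncate to a short preview.
    preview = " ".join(doc.page_content.split())[:preview_chars]
    return f"{title} ({source}): {preview}"


# Stand-in for a document returned by `loader.load()`, so the sketch runs offline.
example = SimpleNamespace(
    metadata={"title": "Example Domain", "sourceURL": "https://example.com"},
    page_content="Example Domain\n\n# Example Domain\n\nThis domain is for use in illustrative examples.",
)
print(summarize(example))
```

With real results you would call `summarize(doc)` for each `doc` in `loader.load()`.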

## Advanced Usage

You can specify the operation to be performed by the loader. The default operation is `scrape`. For `scrape`, you can provide a single URL or a list of URLs to be scraped. For `crawl`, you can only provide a single URL. The `crawl` operation will crawl the provided page and subpages and return a document for each page.

```python
loader = HyperbrowserLoader(
    urls="https://hyperbrowser.ai", api_key="YOUR_API_KEY", operation="crawl"
)
```
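
The URL rules above (one URL or a list for `scrape`, exactly one URL for `crawl`) can be checked before a loader is created. The following is a sketch of such a pre-check, not the loader's own validation: `normalize_urls` is a hypothetical helper, and it assumes `scrape` and `crawl` are the only operations, since those are the only ones this page mentions.

```python
def normalize_urls(urls, operation="scrape"):
    """Return `urls` as a list after checking it against the operation rules.

    Hypothetical helper for illustration; the loader may validate differently.
    """
    if operation not in ("scrape", "crawl"):
        raise ValueError(f"unsupported operation: {operation!r}")
    # Accept either a single URL string or an iterable of URL strings.
    url_list = [urls] if isinstance(urls, str) else list(urls)
    if not url_list:
        raise ValueError("at least one URL is required")
    if operation == "crawl" and len(url_list) != 1:
        raise ValueError("crawl accepts exactly one URL")
    return url_list
```

Failing early like this avoids starting a remote job that is going to be rejected anyway.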

Optional params for the loader can also be provided in the `params` argument. For more information on the supported params, visit https://docs.hyperbrowser.ai/reference/sdks/python/scrape#start-scrape-job-and-wait or https://docs.hyperbrowser.ai/reference/sdks/python/crawl#start-crawl-job-and-wait.

```python
loader = HyperbrowserLoader(
    urls="https://example.com",
    api_key="YOUR_API_KEY",
    operation="scrape",
    params={"scrape_options": {"include_tags": ["h1", "h2", "p"]}},
)
```

## Additional Resources

- [Hyperbrowser Docs](https://docs.hyperbrowser.ai/)
- [GitHub](https://github.com/hyperbrowserai/langchain-hyperbrowser/)
- [PyPI](https://pypi.org/project/langchain-hyperbrowser/)
7 changes: 7 additions & 0 deletions docs/src/theme/FeatureTables.js
@@ -815,6 +815,13 @@ const FEATURE_TABLES = {
      source: "Uses Docling to load and parse web pages",
      api: "Package",
      apiLink: "https://python.langchain.com/docs/integrations/document_loaders/docling/"
    },
    {
      name: "Hyperbrowser",
      link: "hyperbrowser",
      source: "Platform for running and scaling headless browsers, can be used to scrape/crawl any site",
      api: "API",
      apiLink: "https://python.langchain.com/docs/integrations/document_loaders/hyperbrowser/"
    }
  ]
},
4 changes: 4 additions & 0 deletions libs/packages.yml
@@ -337,3 +337,7 @@ packages:
    path: .
    repo: AlwaysBluer/langchain-lindorm-integration
    downloads: 0
  - name: langchain-hyperbrowser
    path: .
    repo: hyperbrowserai/langchain-hyperbrowser
    downloads: 0
