docs: add HyperbrowserLoader docs (#29143)
### Description

This PR adds docs for the [langchain-hyperbrowser](https://pypi.org/project/langchain-hyperbrowser/) package. It includes a document loader that uses Hyperbrowser to scrape or crawl any URLs and return formatted markdown or HTML content, as well as relevant metadata.

[Hyperbrowser](https://hyperbrowser.ai) is a platform for running and scaling headless browsers. It lets you launch and manage browser sessions at scale and provides easy-to-use solutions for any web scraping needs, such as scraping a single page or crawling an entire site.

### Issue

None

### Dependencies

None

### Twitter Handle

`@hyperbrowser`
1 parent 4c02176, commit 335ca3a
Showing 4 changed files with 299 additions and 0 deletions.
221 changes: 221 additions & 0 deletions
docs/docs/integrations/document_loaders/hyperbrowser.ipynb
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,221 @@ | ||
{ | ||
"cells": [ | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"# HyperbrowserLoader" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"[Hyperbrowser](https://hyperbrowser.ai) is a platform for running and scaling headless browsers. It lets you launch and manage browser sessions at scale and provides easy-to-use solutions for any web scraping needs, such as scraping a single page or crawling an entire site.\n", | ||
"\n", | ||
"Key Features:\n", | ||
"- Instant Scalability - Spin up hundreds of browser sessions in seconds without infrastructure headaches\n", | ||
"- Simple Integration - Works seamlessly with popular tools like Puppeteer and Playwright\n", | ||
"- Powerful APIs - Easy-to-use APIs for scraping/crawling any site, and much more\n", | ||
"- Bypass Anti-Bot Measures - Built-in stealth mode, ad blocking, automatic CAPTCHA solving, and rotating proxies\n", | ||
"\n", | ||
"This notebook provides a quick overview for getting started with the Hyperbrowser [document loader](https://python.langchain.com/docs/concepts/#document-loaders).\n", | ||
"\n", | ||
"For more information about Hyperbrowser, visit the [Hyperbrowser website](https://hyperbrowser.ai) or the [Hyperbrowser docs](https://docs.hyperbrowser.ai).\n", | ||
"\n", | ||
"## Overview\n", | ||
"### Integration details\n", | ||
"\n", | ||
"| Class | Package | Local | Serializable | JS support |\n", | ||
"| :--- | :--- | :---: | :---: | :---: |\n", | ||
"| HyperbrowserLoader | langchain-hyperbrowser | ❌ | ❌ | ❌ | \n", | ||
"### Loader features\n", | ||
"| Source | Document Lazy Loading | Native Async Support |\n", | ||
"| :---: | :---: | :---: | \n", | ||
"| HyperbrowserLoader | ✅ | ✅ | \n", | ||
"\n", | ||
"## Setup\n", | ||
"\n", | ||
"To access the Hyperbrowser document loader, you'll need to install the `langchain-hyperbrowser` integration package, create a Hyperbrowser account, and get an API key.\n", | ||
"\n", | ||
"### Credentials\n", | ||
"\n", | ||
"Head to [Hyperbrowser](https://app.hyperbrowser.ai/) to sign up and generate an API key. Once you've done this, set the `HYPERBROWSER_API_KEY` environment variable.\n" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"### Installation\n", | ||
"\n", | ||
"Install **langchain-hyperbrowser**." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"%pip install -qU langchain-hyperbrowser" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"## Initialization\n", | ||
"\n", | ||
"Now we can instantiate our loader object and load documents:\n" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"from langchain_hyperbrowser import HyperbrowserLoader\n", | ||
"\n", | ||
"loader = HyperbrowserLoader(\n", | ||
" urls=\"https://example.com\",\n", | ||
" api_key=\"YOUR_API_KEY\",\n", | ||
")" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"## Load" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"data": { | ||
"text/plain": [ | ||
"Document(metadata={'title': 'Example Domain', 'viewport': 'width=device-width, initial-scale=1', 'sourceURL': 'https://example.com'}, page_content='Example Domain\\n\\n# Example Domain\\n\\nThis domain is for use in illustrative examples in documents. You may use this\\ndomain in literature without prior coordination or asking for permission.\\n\\n[More information...](https://www.iana.org/domains/example)')" | ||
] | ||
}, | ||
"execution_count": null, | ||
"metadata": {}, | ||
"output_type": "execute_result" | ||
} | ||
], | ||
"source": [ | ||
"docs = loader.load()\n", | ||
"docs[0]" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"print(docs[0].metadata)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"## Lazy Load" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"page = []\n", | ||
"for doc in loader.lazy_load():\n", | ||
" page.append(doc)\n", | ||
" if len(page) >= 10:\n", | ||
" # do some paged operation, e.g.\n", | ||
" # index.upsert(page)\n", | ||
"\n", | ||
" page = []" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"## Advanced Usage\n", | ||
"\n", | ||
"You can specify the operation to be performed by the loader. The default operation is `scrape`. For `scrape`, you can provide a single URL or a list of URLs to be scraped. For `crawl`, you can only provide a single URL. The `crawl` operation will crawl the provided page and subpages and return a document for each page." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"loader = HyperbrowserLoader(\n", | ||
" urls=\"https://hyperbrowser.ai\", api_key=\"YOUR_API_KEY\", operation=\"crawl\"\n", | ||
")" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"Optional params for the loader can also be provided in the `params` argument. For more information on the supported params, visit https://docs.hyperbrowser.ai/reference/sdks/python/scrape#start-scrape-job-and-wait or https://docs.hyperbrowser.ai/reference/sdks/python/crawl#start-crawl-job-and-wait." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"loader = HyperbrowserLoader(\n", | ||
" urls=\"https://example.com\",\n", | ||
" api_key=\"YOUR_API_KEY\",\n", | ||
" operation=\"scrape\",\n", | ||
" params={\"scrape_options\": {\"include_tags\": [\"h1\", \"h2\", \"p\"]}},\n", | ||
")" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"## API reference\n", | ||
"\n", | ||
"- [GitHub](https://github.com/hyperbrowserai/langchain-hyperbrowser/)\n", | ||
"- [PyPI](https://pypi.org/project/langchain-hyperbrowser/)\n", | ||
"- [Hyperbrowser Docs](https://docs.hyperbrowser.ai/)" | ||
] | ||
} | ||
], | ||
"metadata": { | ||
"kernelspec": { | ||
"display_name": "Python 3 (ipykernel)", | ||
"language": "python", | ||
"name": "python3" | ||
}, | ||
"language_info": { | ||
"codemirror_mode": { | ||
"name": "ipython", | ||
"version": 3 | ||
}, | ||
"file_extension": ".py", | ||
"mimetype": "text/x-python", | ||
"name": "python", | ||
"nbconvert_exporter": "python", | ||
"pygments_lexer": "ipython3", | ||
"version": "3.9.16" | ||
} | ||
}, | ||
"nbformat": 4, | ||
"nbformat_minor": 4 | ||
} |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,67 @@ | ||
# Hyperbrowser | ||
|
||
> [Hyperbrowser](https://hyperbrowser.ai) is a platform for running and scaling headless browsers. It lets you launch and manage browser sessions at scale and provides easy-to-use solutions for any web scraping needs, such as scraping a single page or crawling an entire site. | ||
> | ||
> Key Features: | ||
> | ||
> - Instant Scalability - Spin up hundreds of browser sessions in seconds without infrastructure headaches | ||
> - Simple Integration - Works seamlessly with popular tools like Puppeteer and Playwright | ||
> - Powerful APIs - Easy-to-use APIs for scraping/crawling any site, and much more | ||
> - Bypass Anti-Bot Measures - Built-in stealth mode, ad blocking, automatic CAPTCHA solving, and rotating proxies | ||
For more information about Hyperbrowser, visit the [Hyperbrowser website](https://hyperbrowser.ai) or the [Hyperbrowser docs](https://docs.hyperbrowser.ai). | ||
|
||
## Installation and Setup | ||
|
||
To get started with `langchain-hyperbrowser`, you can install the package using pip: | ||
|
||
```bash | ||
pip install langchain-hyperbrowser | ||
``` | ||
|
||
Then configure credentials by setting the following environment variable: | ||
|
||
`HYPERBROWSER_API_KEY=<your-api-key>` | ||
|
||
Make sure to get your API Key from https://app.hyperbrowser.ai/ | ||
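
One way to set this variable from Python before constructing the loader is sketched below. It assumes an interactive session where prompting for the key is acceptable. The helper name `ensure_api_key` is hypothetical and is not part of `langchain-hyperbrowser`.

```python
import getpass
import os


def ensure_api_key(env_var: str = "HYPERBROWSER_API_KEY") -> str:
    """Return the key stored in `env_var`, prompting once if it is unset or empty.

    Hypothetical helper for illustration; not part of langchain-hyperbrowser.
    """
    if not os.environ.get(env_var):
        # getpass hides the typed value so the key does not end up in notebook output.
        os.environ[env_var] = getpass.getpass(f"Enter your {env_var}: ")
    return os.environ[env_var]
```

Because an existing value is left untouched, a key exported in the shell or injected as a CI secret takes precedence over the prompt.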
|
||
## Document Loader | ||
|
||
The `HyperbrowserLoader` class in `langchain-hyperbrowser` loads content from a single page or from multiple pages, and can also crawl an entire site. | ||
The content can be loaded as Markdown or HTML. | ||
|
||
```python | ||
from langchain_hyperbrowser import HyperbrowserLoader | ||
|
||
loader = HyperbrowserLoader(urls="https://example.com") | ||
docs = loader.load() | ||
|
||
print(docs[0]) | ||
``` | ||
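
The notebook added in this commit also iterates over documents with `loader.lazy_load()` and handles them in pages of 10. A minimal sketch of that batching pattern follows, written against any iterable so that it runs without an API key. `batched` is a hypothetical helper, not part of the package, and unlike the notebook cell it also yields the final partial batch.

```python
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batched(items: Iterable[T], size: int) -> Iterator[List[T]]:
    """Yield lists of up to `size` items, including a shorter final batch.

    Hypothetical helper for illustration; not part of langchain-hyperbrowser.
    """
    if size < 1:
        raise ValueError("size must be at least 1")
    batch: List[T] = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []  # start a fresh list so yielded batches are never mutated
    if batch:
        # Flush whatever is left when the input is not a multiple of `size`.
        yield batch
```

With a real loader, the loop would read `for page in batched(loader.lazy_load(), 10): ...`, with the paged operation (for example an index upsert) in the loop body.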
|
||
## Advanced Usage | ||
|
||
You can specify the operation to be performed by the loader. The default operation is `scrape`. For `scrape`, you can provide a single URL or a list of URLs to be scraped. For `crawl`, you can only provide a single URL. The `crawl` operation will crawl the provided page and subpages and return a document for each page. | ||
|
||
```python | ||
loader = HyperbrowserLoader( | ||
urls="https://hyperbrowser.ai", api_key="YOUR_API_KEY", operation="crawl" | ||
) | ||
``` | ||
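
The URL rules described for `scrape` and `crawl` can be checked before a loader is constructed. The sketch below is a hypothetical pre-flight helper (`check_loader_args` is not part of `langchain-hyperbrowser`). It assumes the strictest reading of the text: `crawl` takes exactly one URL string, while `scrape` takes a string or a non-empty list of strings. The loader itself may be more lenient.

```python
from typing import Any, Dict, List, Union


def check_loader_args(
    urls: Union[str, List[str]], operation: str = "scrape"
) -> Dict[str, Any]:
    """Validate `urls` against `operation` and return them as keyword arguments.

    Hypothetical helper for illustration; not part of langchain-hyperbrowser.
    """
    if operation not in ("scrape", "crawl"):
        raise ValueError(f"unknown operation: {operation!r}")
    if isinstance(urls, str):
        # A single URL string is valid for both operations.
        return {"urls": urls, "operation": operation}
    if operation == "crawl":
        # Assumption: a list is rejected for crawl, even a one-element list.
        raise ValueError("crawl expects a single URL, not a list")
    if not urls or not all(isinstance(u, str) for u in urls):
        raise ValueError("scrape expects a URL string or a non-empty list of URL strings")
    return {"urls": list(urls), "operation": operation}
```

The returned dictionary can be unpacked into the constructor, for example `HyperbrowserLoader(**check_loader_args("https://hyperbrowser.ai", "crawl"), api_key="YOUR_API_KEY")`.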
|
||
Optional params for the loader can also be provided in the `params` argument. For more information on the supported params, visit https://docs.hyperbrowser.ai/reference/sdks/python/scrape#start-scrape-job-and-wait or https://docs.hyperbrowser.ai/reference/sdks/python/crawl#start-crawl-job-and-wait. | ||
|
||
```python | ||
loader = HyperbrowserLoader( | ||
urls="https://example.com", | ||
api_key="YOUR_API_KEY", | ||
operation="scrape", | ||
params={"scrape_options": {"include_tags": ["h1", "h2", "p"]}} | ||
) | ||
``` | ||
|
||
## Additional Resources | ||
|
||
- [Hyperbrowser Docs](https://docs.hyperbrowser.ai/) | ||
- [GitHub](https://github.com/hyperbrowserai/langchain-hyperbrowser/) | ||
- [PyPI](https://pypi.org/project/langchain-hyperbrowser/) |