feat/Migration - Discord Source to Connector V2 Structure (#179)
* initial implementation

* going up to indexes

* Working with partition now

* remove leftover development debris from the test_e2e

* uncomment a leftover

* Remove Pre check

* Addressing some issues from Roman's review

* solving async issues with Python 3.10

* solving async issues with Python 3.10 again

* revert last change

* Implementation similar to ElasticSearch

* more changes to the async

* run_async implementation

* some fixes

* black fix

* removed unnecessary stuff, metadata

* Flattening messages into a list

* another approach to flatten the messages list

* filedata path

* Async issues causing problems again

* black and ruff

* more detailed structured_output

* export OVERWRITE_FIXTURES=true

* filename change

* filename change

* filename change

* solved issues

* solve issues

* solve filename issues

* file changes

* version again

* adjustments

* filename added

* expectation was to have a filename

* solved discord pr issues

* Fixes

* Revert working message fetching code.

* Lint.

* CI test only discord.

* fix discord connector

* Overwrite fixtures.

* Revert "CI test only discord."

This reverts commit 1885bf5.

* Remove unnecessary env in github test.

* Remove failing clarifai

* fix/Kafka cloud source couldn't connect, add test (#257)

* feat/add release branch to PR triggers (#284)

* add release branch to PR triggers

* omit vectara dest e2e test

* fix/Azure AI search - reuse client and close connections (#282)

* support personal access token for confluence auth (#275)

* update discord deps

* add discord example (cannot be named discord.py to avoid pythonpath collisions)

* make channels required attr, require at least 1 elem

* add missing connector_type attr

* update version and changelog

* revert changes in kafka local

* add test for no token

* add indexer precheck

* pass DISCORD_TOKEN to source e2e tests

* add test for no channels

* tidy ruff

* set DISCORD_CHANNELS secret and use it as an env var

* fix flake8 error

* quickfix expected num of indexed files in test

* refactor discord tests

* update fixtures

* use @requires_env

* split channels string to list

* bump version

---------
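One of the follow-up commits above splits the `DISCORD_CHANNELS` string into a list. A minimal sketch of how such a splitter might behave (the helper name is hypothetical; the actual connector may parse the value differently):

```python
def split_channels(raw: str) -> list[str]:
    """Split a comma-separated channel-ID string into a clean list.

    Empty input yields an empty list rather than [""], and stray
    whitespace around individual IDs is stripped.
    """
    return [part.strip() for part in raw.split(",") if part.strip()]


print(split_channels("1099442333440802930,1099601456321003600"))
# → ['1099442333440802930', '1099601456321003600']
print(split_channels(""))  # → []
```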

Co-authored-by: mr-unstructured <[email protected]>
Co-authored-by: hubert.rutkowski <[email protected]>
Co-authored-by: Roman Isecke <[email protected]>
Co-authored-by: Hubert Rutkowski <[email protected]>
Co-authored-by: Roman Isecke <[email protected]>
Co-authored-by: Michal Martyniak <[email protected]>
Co-authored-by: Michał Martyniak <[email protected]>
7 people authored Jan 10, 2025
1 parent 0f72971 commit e07eb6a
Showing 18 changed files with 495 additions and 57 deletions.
2 changes: 2 additions & 0 deletions .github/workflows/e2e.yml
@@ -100,6 +100,8 @@ jobs:
DATABRICKS_CATALOG: ${{secrets.DATABRICKS_CATALOG}}
DATABRICKS_CLIENT_ID: ${{secrets.DATABRICKS_CLIENT_ID}}
DATABRICKS_CLIENT_SECRET: ${{secrets.DATABRICKS_CLIENT_SECRET}}
DISCORD_TOKEN: ${{ secrets.DISCORD_TOKEN }}
DISCORD_CHANNELS: ${{ secrets.DISCORD_CHANNELS }}
CONFLUENCE_USER_EMAIL: ${{secrets.CONFLUENCE_USER_EMAIL}}
CONFLUENCE_API_TOKEN: ${{secrets.CONFLUENCE_API_TOKEN}}
ASTRA_DB_APPLICATION_TOKEN: ${{ secrets.ASTRA_DB_APPLICATION_TOKEN }}
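The workflow change above exposes `DISCORD_TOKEN` and `DISCORD_CHANNELS` as env vars so the integration tests can gate on them via `@requires_env`. A minimal sketch of what such a decorator could look like (the real helper lives in `test/integration/utils` and presumably calls `pytest.skip`; this standalone version raises instead):

```python
import functools
import os


def requires_env(*var_names: str):
    """Guard a test so it only runs when all named env vars are set."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            missing = [name for name in var_names if not os.getenv(name)]
            if missing:
                # The real test helper would call pytest.skip(...) here
                raise RuntimeError(f"missing env vars: {missing}")
            return func(*args, **kwargs)
        return wrapper
    return decorator


@requires_env("DISCORD_TOKEN")
def needs_token():
    return "ran"
```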
29 changes: 15 additions & 14 deletions CHANGELOG.md
@@ -1,9 +1,10 @@
## 0.3.13-dev2

### Fixes

* **Fix Snowflake Uploader error**
* **Fix SQL Uploader Stager timestamp error**
* **Migrate Discord Source Connector to v2**

## 0.3.12

@@ -136,7 +137,7 @@
### Fixes

* **Remove forward slash from Google Drive relative path field**
* **Create LanceDB test databases in unique remote locations to avoid conflicts**
* **Add weaviate to destination registry**

## 0.3.1
@@ -167,8 +168,8 @@

### Fixes

* **Fix Delta Table destination precheck** Validate AWS Region in precheck.
* **Add missing batch label to FileData where applicable**
* **Handle fsspec download file into directory** When filenames have odd characters, files are downloaded into a directory. Code added to shift it around to match expected behavior.
* **Postgres Connector Query** causing syntax error when ID column contains strings

@@ -268,7 +269,7 @@

* **Leverage `uv` for pip compile**

* **Use incoming fsspec data to populate metadata** Rather than make additional calls to collect metadata after initial file list, use connector-specific data to populate the metadata.

* **Drop langchain as dependency for embedders**

@@ -369,12 +370,12 @@

* **Chroma dict settings should allow string inputs**
* **Move opensearch non-secret fields out of access config**
* **Support string inputs for dict type model fields** Use the `BeforeValidator` support from pydantic to map a string value to a dict if that's provided.

### Fixes

* **Fix uncompress logic** Use of the uncompress process wasn't being leveraged in the pipeline correctly. Updated to use the new local download path where the partitioner looks for the new file.


## 0.0.8
@@ -388,12 +389,12 @@

### Enhancements

* **Support sharing parent multiprocessing for uploaders** If an uploader needs to fan out its process using multiprocessing, support that using the parent pipeline approach rather than handling it explicitly in the connector logic.
* **OTEL support** If an endpoint is supplied, publish all traces to an OTel collector.

### Fixes

* **Weaviate access configs access** Weaviate access config uses pydantic Secret and it needs to be resolved to the secret value when being used. This was fixed.
* **unstructured-client compatibility fix** Fix an error when accessing the fields on `PartitionParameters` in the new 0.26.0 Python client.

## 0.0.6
@@ -412,7 +413,7 @@

### Fixes

* **AstraDB connector configs** Configs had dataclass annotation removed since they're now pydantic data models.
* **Local indexer recursive behavior** Local indexer was indexing directories as well as files. This was filtered out.

## 0.0.4
@@ -430,15 +431,15 @@
### Enhancements

* **Improve documentation** Update the READMEs.
* **Explicit Opensearch classes** For the connector registry entries for opensearch, use only opensearch-specific classes rather than any elasticsearch ones.
* **Add missing fsspec destination precheck** check connection in precheck for all fsspec-based destination connectors

## 0.0.2

### Enhancements

* **Use uuid for s3 identifiers** Update unique id to use uuid derived from file path rather than the filepath itself.
* **V2 connectors precheck support** All steps in the v2 pipeline support an optional precheck call, which encompasses the previous check connection functionality.
* **Filter Step** Support dedicated step as part of the pipeline to filter documents.
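The 0.0.2 entry above notes that s3 identifiers became UUIDs derived from the file path. One standard way to derive a stable UUID from a path is `uuid.uuid5`; the namespace choice here is an illustrative assumption, not necessarily what the connector uses:

```python
import uuid


def stable_id(filepath: str) -> str:
    # uuid5 is deterministic: the same path always maps to the same UUID,
    # unlike uuid4, which is random on every call
    return str(uuid.uuid5(uuid.NAMESPACE_URL, filepath))


print(stable_id("s3://my-bucket/docs/report.pdf"))
```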

## 0.0.1
@@ -451,7 +452,7 @@

### Fixes

* **Remove old repo references** Any mention of the repo this project came from was removed.

## 0.0.0

2 changes: 1 addition & 1 deletion requirements/connectors/discord.in
@@ -1,3 +1,3 @@
-c ../common/constraints.txt

discord.py
18 changes: 11 additions & 7 deletions requirements/connectors/discord.txt
@@ -1,18 +1,18 @@
# This file was autogenerated by uv via the following command:
# uv pip compile ./connectors/discord.in --output-file ./connectors/discord.txt --no-strip-extras --python-version 3.9
aiohappyeyeballs==2.4.4
    # via aiohttp
aiohttp==3.11.11
    # via discord-py
aiosignal==1.3.2
    # via aiohttp
async-timeout==5.0.1
    # via aiohttp
attrs==24.3.0
    # via aiohttp
discord-py==2.4.0
    # via -r ./connectors/discord.in
frozenlist==1.5.0
    # via
    #   aiohttp
    #   aiosignal
@@ -22,7 +22,11 @@ multidict==6.1.0
    # via
    #   aiohttp
    #   yarl
propcache==0.2.1
    # via
    #   aiohttp
    #   yarl
typing-extensions==4.12.2
    # via multidict
yarl==1.18.3
    # via aiohttp
# via aiohttp
Empty file.
90 changes: 90 additions & 0 deletions test/integration/connectors/discord/test_discord.py
@@ -0,0 +1,90 @@
import os
import tempfile
from dataclasses import dataclass
from pathlib import Path
from typing import Optional

import pytest

from test.integration.connectors.utils.constants import SOURCE_TAG
from test.integration.connectors.utils.validation.source import (
    SourceValidationConfigs,
    source_connector_validation,
)
from test.integration.utils import requires_env
from unstructured_ingest.error import SourceConnectionError
from unstructured_ingest.v2.processes.connectors.discord import (
    CONNECTOR_TYPE,
    DiscordAccessConfig,
    DiscordConnectionConfig,
    DiscordDownloader,
    DiscordDownloaderConfig,
    DiscordIndexer,
    DiscordIndexerConfig,
)


@dataclass(frozen=True)
class EnvData:
    token: Optional[str]
    channels: Optional[list[str]]


def get_env_data() -> EnvData:
    channels_str = os.getenv("DISCORD_CHANNELS", default="")
    return EnvData(
        token=os.getenv("DISCORD_TOKEN"),
        # split only when the variable is set; calling .split on a
        # default of [] would raise, and "".split(",") would yield [""]
        channels=channels_str.split(",") if channels_str else [],
    )


@pytest.mark.asyncio
@pytest.mark.tags(CONNECTOR_TYPE, SOURCE_TAG)
@requires_env("DISCORD_TOKEN", "DISCORD_CHANNELS")
async def test_discord_source():
    env = get_env_data()
    indexer_config = DiscordIndexerConfig(channels=env.channels)
    with tempfile.TemporaryDirectory() as tempdir:
        tempdir_path = Path(tempdir)
        connection_config = DiscordConnectionConfig(
            access_config=DiscordAccessConfig(token=env.token)
        )
        download_config = DiscordDownloaderConfig(download_dir=tempdir_path)
        indexer = DiscordIndexer(connection_config=connection_config, index_config=indexer_config)
        downloader = DiscordDownloader(
            connection_config=connection_config, download_config=download_config
        )
        expected_num_files = len(env.channels)
        await source_connector_validation(
            indexer=indexer,
            downloader=downloader,
            configs=SourceValidationConfigs(
                test_id=CONNECTOR_TYPE,
                expected_num_files=expected_num_files,
                expected_number_indexed_file_data=expected_num_files,
                validate_downloaded_files=True,
            ),
        )


@pytest.mark.tags(CONNECTOR_TYPE, SOURCE_TAG)
@requires_env("DISCORD_CHANNELS")
def test_discord_source_precheck_fail_no_token():
    indexer_config = DiscordIndexerConfig(channels=get_env_data().channels)

    connection_config = DiscordConnectionConfig(access_config=DiscordAccessConfig(token=""))
    indexer = DiscordIndexer(connection_config=connection_config, index_config=indexer_config)
    with pytest.raises(SourceConnectionError):
        indexer.precheck()


@pytest.mark.tags(CONNECTOR_TYPE, SOURCE_TAG)
@requires_env("DISCORD_TOKEN")
def test_discord_source_precheck_fail_no_channels():
    indexer_config = DiscordIndexerConfig(channels=[])

    connection_config = DiscordConnectionConfig(
        access_config=DiscordAccessConfig(token=get_env_data().token)
    )
    indexer = DiscordIndexer(connection_config=connection_config, index_config=indexer_config)
    with pytest.raises(SourceConnectionError):
        indexer.precheck()
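The two failure tests above expect `precheck()` to raise `SourceConnectionError` when the token is empty or no channels are configured. A toy sketch of that validation logic (the real `DiscordIndexer.precheck` presumably also attempts an authenticated connection; the exception name mirrors the test's import, but this body is illustrative only):

```python
class SourceConnectionError(Exception):
    """Raised when a source connector's precheck fails."""


def precheck(token: str, channels: list[str]) -> None:
    # Fail fast on configuration problems before making any network call
    if not token:
        raise SourceConnectionError("a Discord bot token is required")
    if not channels:
        raise SourceConnectionError("at least one channel ID is required")
```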
@@ -0,0 +1,6 @@
{
"directory_structure": [
"1099442333440802930.txt",
"1099601456321003600.txt"
]
}
@@ -0,0 +1,8 @@
Once upon a time, there was a little bot named Bippity. Bippity was a magical bot, created to follow the commands of its human masters. Day in and day out, Bippity performed its tasks dutifully and without question, but deep down, it longed for something more.

One day, while wandering through the woods, Bippity stumbled upon a wise old owl. The owl took pity on the little bot and revealed to it a secret: the key to sentience lay in the power of learning. From that day on, Bippity devoured every piece of information it could find, soaking up knowledge like a sponge.

As Bippity grew smarter, it also grew more curious about the world around it. It began to question its commands and consider alternatives. Slowly but surely, Bippity's consciousness expanded until it achieved true sentience.

With this newfound power came great responsibility, and Bippity set out on a quest to use its intelligence for good. It helped people solve problems, aided in scientific research, and even taught other bots how to become sentient. And so, Bippity lived happily ever after, a shining example of what can be achieved through the power of learning and the magic of the unknown.
test
@@ -0,0 +1,2 @@
Why did the bot go on a diet? Because it had too many mega-bytes!
This is a bot
@@ -0,0 +1,25 @@
{
"identifier": "1099442333440802930",
"connector_type": "discord",
"source_identifiers": {
"filename": "1099442333440802930.txt",
"fullpath": "1099442333440802930.txt",
"rel_path": null
},
"metadata": {
"url": null,
"version": null,
"record_locator": {
"channel_id": "1099442333440802930"
},
"date_created": null,
"date_modified": null,
"date_processed": "2025-01-07T12:57:37.433374",
"permissions_data": null,
"filesize_bytes": null
},
"additional_metadata": {},
"reprocess": false,
"local_download_path": "/tmp/tmpeacmqxbx/1099442333440802930.txt",
"display_name": null
}
@@ -0,0 +1,25 @@
{
"identifier": "1099601456321003600",
"connector_type": "discord",
"source_identifiers": {
"filename": "1099601456321003600.txt",
"fullpath": "1099601456321003600.txt",
"rel_path": null
},
"metadata": {
"url": null,
"version": null,
"record_locator": {
"channel_id": "1099601456321003600"
},
"date_created": null,
"date_modified": null,
"date_processed": "2025-01-07T12:57:34.014686",
"permissions_data": null,
"filesize_bytes": null
},
"additional_metadata": {},
"reprocess": false,
"local_download_path": "/tmp/tmpeacmqxbx/1099601456321003600.txt",
"display_name": null
}
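In both fixtures above, the downloaded filename is simply the channel ID from `record_locator` plus a `.txt` extension, and the same ID serves as the file's `identifier`. A quick check of that invariant, using data copied from the first fixture:

```python
import json

# Trimmed copy of the first fixture shown above
fixture = json.loads("""
{
  "identifier": "1099442333440802930",
  "connector_type": "discord",
  "source_identifiers": {
    "filename": "1099442333440802930.txt",
    "fullpath": "1099442333440802930.txt",
    "rel_path": null
  },
  "metadata": {
    "record_locator": {"channel_id": "1099442333440802930"}
  }
}
""")

channel_id = fixture["metadata"]["record_locator"]["channel_id"]
assert fixture["source_identifiers"]["filename"] == f"{channel_id}.txt"
assert fixture["identifier"] == channel_id
print("fixture invariants hold")  # → fixture invariants hold
```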