diff --git a/AI-marketing-agent/AIMarketResearch.ipynb b/AI-marketing-agent/AIMarketResearch.ipynb
new file mode 100644
index 0000000..7b5bb71
--- /dev/null
+++ b/AI-marketing-agent/AIMarketResearch.ipynb
@@ -0,0 +1,1256 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "tUKTdXIiM9BO",
+   "metadata": {
+    "id": "tUKTdXIiM9BO"
+   },
+   "source": [
+    "# AI/NLP Development Environment Setup\n",
+    "\n",
+    "This document outlines the required Python packages for setting up an AI/NLP development environment that integrates tools for document processing, Reddit API access, and Language Model interactions.\n",
+    "\n",
+    "## Required Packages\n",
+    "\n",
+    "- **chromadb**: Vector database for storing and managing embeddings\n",
+    "- **praw**: Python Reddit API Wrapper for accessing Reddit data\n",
+    "- **openai**: OpenAI's official client library\n",
+    "- **python-dotenv**: Environment variable management\n",
+    "- **langchain**: Framework for developing LLM-powered applications\n",
+    "- **langchain-openai**: LangChain integration with OpenAI models\n",
+    "- **langchain-text-splitters**: Text chunking and splitting utilities\n",
+    "- **langgraph**: Graph-based workflow orchestration for LangChain\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "id": "5673d0d7d51ef2e0",
+   "metadata": {
+    "id": "5673d0d7d51ef2e0"
+   },
+   "outputs": [],
+   "source": [
+    "!pip install -q chromadb praw openai python-dotenv langchain langchain-openai langchain-text-splitters langgraph"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "3fns4P9ANT_t",
+   "metadata": {
+    "id": "3fns4P9ANT_t"
+   },
+   "source": [
+    "# Environment Configuration Setup\n",
+    "\n",
+    "This script handles the initialization of environment variables for OpenAI and Reddit API authentication. It uses the `python-dotenv` package to load environment variables from a `.env` file.\n",
+    "\n",
+    "## Required Environment Variables\n",
+    "\n",
+    "The following environment variables need to be set in your `.env` file:\n",
+    "\n",
+    "- `OPENAI_API_KEY`: Your OpenAI API key\n",
+    "- `REDDIT_CLIENT_ID`: Your Reddit application client ID\n",
+    "- `REDDIT_CLIENT_SECRET`: Your Reddit application client secret\n",
+    "- `REDDIT_USER_AGENT`: Your Reddit API user agent string\n",
+    "\n",
+    "## Configuration Variables\n",
+    "\n",
+    "- `GENERATE_KNOWLEDGE`: Boolean flag (default: `False`) controlling knowledge generation functionality\n",
+    "\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "id": "initial_id",
+   "metadata": {
+    "ExecuteTime": {
+     "end_time": "2024-11-17T14:41:36.332725Z",
+     "start_time": "2024-11-17T14:41:34.523399Z"
+    },
+    "id": "initial_id"
+   },
+   "outputs": [],
+   "source": [
+    "import json\n",
+    "import os\n",
+    "import time\n",
+    "from typing import Any, Dict, List, TypedDict\n",
+    "\n",
+    "import chromadb\n",
+    "import praw\n",
+    "from langchain_openai import OpenAIEmbeddings\n",
+    "from langchain_text_splitters import RecursiveCharacterTextSplitter\n",
+    "from openai import OpenAI"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "b521c1abdfb7d75c",
+   "metadata": {
+    "id": "b521c1abdfb7d75c"
+   },
+   "source": [
+    "# Marketing Research Agent"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "af4c9abf9e30f38",
+   "metadata": {
+    "id": "af4c9abf9e30f38"
+   },
+   "source": [
+    "## Knowledge base generation pipeline"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "DhXn0QOjOlRN",
+   "metadata": {
+    "id": "DhXn0QOjOlRN"
+   },
+   "source": [
+    "Loads environment variables and sets up configuration for OpenAI and Reddit API access. The `load_dotenv()` function imports variables from a `.env` file, while `os.getenv()` safely retrieves each value. The `GENERATE_KNOWLEDGE` flag controls whether the knowledge-base generation pipeline runs."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "id": "d74b9ef0648dd411",
+   "metadata": {
+    "ExecuteTime": {
+     "end_time": "2024-11-17T14:42:01.631900Z",
+     "start_time": "2024-11-17T14:42:01.614375Z"
+    },
+    "id": "d74b9ef0648dd411"
+   },
+   "outputs": [],
+   "source": [
+    "from dotenv import load_dotenv\n",
+    "load_dotenv()\n",
+    "\n",
+    "OPENAI_API_KEY = os.getenv(\"OPENAI_API_KEY\")\n",
+    "CLIENT_ID = os.getenv('REDDIT_CLIENT_ID')\n",
+    "CLIENT_SECRET = os.getenv('REDDIT_CLIENT_SECRET')\n",
+    "USER_AGENT = os.getenv('REDDIT_USER_AGENT')\n",
+    "\n",
+    "GENERATE_KNOWLEDGE = False"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "id": "792d211c92430f29",
+   "metadata": {
+    "ExecuteTime": {
+     "end_time": "2024-11-17T14:42:04.195390Z",
+     "start_time": "2024-11-17T14:42:04.105941Z"
+    },
+    "id": "792d211c92430f29"
+   },
+   "outputs": [],
+   "source": [
+    "client = OpenAI()\n",
+    "reddit = praw.Reddit(client_id=CLIENT_ID,\n",
+    "                     client_secret=CLIENT_SECRET,\n",
+    "                     user_agent=USER_AGENT)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "UtmMXrbCPV83",
+   "metadata": {
+    "id": "UtmMXrbCPV83"
+   },
+   "source": [
+    "# Reddit Search Phrase Generator\n",
+    "\n",
+    "## Description\n",
+    "Generates relevant Reddit search phrases using OpenAI's GPT-3.5-turbo. 
The function takes a query string and returns a list of 5 Reddit-optimized search terms.\n", + "\n", + "## Features\n", + "- Uses custom prompt engineering for Reddit-specific context\n", + "- Handles common Reddit terminology and naming patterns\n", + "- Includes error handling with fallback to original query\n", + "- Filters and cleans AI response to remove formatting characters\n", + "- Maintains 0.5 temperature for balanced creativity/relevance\n", + "\n", + "## Parameters\n", + "- query: String (e.g., \"project management software\")\n", + "\n", + "## Returns\n", + "List of 5 search phrases optimized for Reddit search" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "9cdebf856a9754c0", + "metadata": { + "ExecuteTime": { + "end_time": "2024-11-17T14:42:06.806254Z", + "start_time": "2024-11-17T14:42:06.800999Z" + }, + "id": "9cdebf856a9754c0" + }, + "outputs": [], + "source": [ + "def get_relevant_topics(query: str) -> list:\n", + " \"\"\"\n", + " Use OpenAI to generate relevant search phrases for a query\n", + " \"\"\"\n", + " prompt = f\"\"\"\n", + " You are a Reddit search expert who understands how Redditors discuss and search for topics.\n", + " Given the query \"{query}\", generate 5 search phrases that will find the most relevant subreddits.\n", + "\n", + " Consider:\n", + " - How Redditors naturally phrase their questions/discussions\n", + " - Common abbreviations and terminology used on Reddit\n", + " - Related tools, technologies, or concepts frequently discussed\n", + " - Industry-specific subreddit naming patterns\n", + " - Problem-focused search terms (as many discussions are about solving problems)\n", + "\n", + " For example, if query is \"project management software\":\n", + " - projectmanagement (direct community)\n", + " - asana vs trello (tool comparison commonly discussed)\n", + " - agile tools (methodology + tools)\n", + " - jira alternatives (tool alternative discussions)\n", + " - remote team management (broader problem space)\n", + " Return exactly 5 search phrases, one per line.\n", + " Focus on phrases that would lead to active, relevant subreddit communities.\n", + " Do not include any bullets, numbers, or prefixes.\n", + " \"\"\"\n", + "\n", + " try:\n", + " response = client.chat.completions.create(\n", + " model=\"gpt-3.5-turbo\",\n", + " messages=[{\n", + " \"role\": \"user\",\n", + " \"content\": prompt\n", + " }],\n", + " temperature=0.5,\n", + " max_tokens=256\n", + " )\n", + "\n", + " # Extract and clean phrases\n", + " search_phrases = [\n", + " phrase.strip()\n", + " for phrase in response.choices[0].message.content.split('\\n')\n", + " if phrase.strip() and not phrase.startswith(('-', '*', '•', '1', '2', '3', '4', '5'))\n", + " ]\n", + "\n", + " print(f\"Generated search phrases: {search_phrases}\")\n", + " return search_phrases\n", + " except Exception as e:\n", + " print(f\"Error generating topics: {e}\")\n", + " return [query]\n" + ] + }, + { + "cell_type": "markdown", + "id": "sfQjXhpGPqEN", + "metadata": { + "id": "sfQjXhpGPqEN" + }, + "source": [ + "# Subreddit Search Function\n", + "\n", + "## Description\n", + "Performs multi-term subreddit search using PRAW (Reddit API wrapper). 
Takes a list of search terms and returns unique subreddits matching any term.\n", + "\n", + "## Parameters\n", + "- search_terms: List of search phrases\n", + "- limit: Max subreddits per term (default 50)\n", + "\n", + "## Returns\n", + "Deduplicated list of subreddit names\n", + "- Uses set for automatic deduplication\n", + "- Handles API errors gracefully\n", + "- Includes progress logging" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "be594236c8f5d385", + "metadata": { + "ExecuteTime": { + "end_time": "2024-11-17T14:42:07.200385Z", + "start_time": "2024-11-17T14:42:07.197178Z" + }, + "id": "be594236c8f5d385" + }, + "outputs": [], + "source": [ + "def search_subreddits(search_terms, limit=50):\n", + " \"\"\"\n", + " Search for subreddits matching multiple search terms.\n", + "\n", + " Args:\n", + " search_terms (list): List of search queries to use\n", + " limit (int): Maximum number of subreddits to return per search term\n", + "\n", + " Returns:\n", + " list: List of unique subreddit names found across all search terms\n", + " \"\"\"\n", + " print(f\"Searching for subreddits matching {len(search_terms)} search terms...\")\n", + " subreddits = set() # Using a set to avoid duplicates\n", + "\n", + " try:\n", + " for term in search_terms:\n", + " print(f\"Searching with term: '{term}'...\")\n", + " term_subreddits = []\n", + " for subreddit in reddit.subreddits.search(term, limit=limit):\n", + " subreddits.add(subreddit.display_name)\n", + " term_subreddits.append(subreddit.display_name)\n", + " print(f\"Found with '{term}': {term_subreddits}\")\n", + " except Exception as e:\n", + " print(f\"Error during subreddit search: {e}\")\n", + "\n", + " return list(subreddits) # Convert set back to list for consistency" + ] + }, + { + "cell_type": "markdown", + "id": "XfXJO2sDQjFt", + "metadata": { + "id": "XfXJO2sDQjFt" + }, + "source": [ + "# Subreddit Information Retriever\n", + "\n", + "## Description\n", + "Fetches key information about a specific subreddit using PRAW, returning a dictionary of essential subreddit metadata.\n", + "\n", + "## Parameters\n", + "- subreddit_name: String (name of the subreddit)\n", + "\n", + "## Returns\n", + "Dictionary containing:\n", + "- name: Display name of subreddit\n", + "- title: Subreddit title\n", + "- subscribers: Number of subscribers\n", + "- public_description: Public description text\n", + "\n", + "Returns None if retrieval fails" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "1aba7db8434c5333", + "metadata": { + "ExecuteTime": { + "end_time": "2024-11-17T14:42:07.541432Z", + "start_time": "2024-11-17T14:42:07.538239Z" + }, + "id": "1aba7db8434c5333" + }, + "outputs": [], + "source": [ + "def get_subreddit_info(subreddit_name):\n", + " \"\"\"\n", + " Retrieve information about a subreddit.\n", + " \"\"\"\n", + " try:\n", + " subreddit = reddit.subreddit(subreddit_name)\n", + " return {\n", + " 'name': subreddit.display_name,\n", + " 'title': subreddit.title,\n", + " 'subscribers': subreddit.subscribers,\n", + " 'public_description': subreddit.public_description\n", + " }\n", + " except Exception as e:\n", + " print(f\"Error retrieving info for r/{subreddit_name}: {e}\")\n", + " return None" + ] + }, + { + "cell_type": "markdown", + "id": "sSg_GWrHQsLh", + "metadata": { + "id": "sSg_GWrHQsLh" + }, + "source": [ + "# Subreddit Post & Comment Scraper\n", + "\n", + "## Description\n", + "Extracts comprehensive data from a subreddit's posts and their comments, including metadata and content. 
Includes rate limiting and error handling.\n",
+    "\n",
+    "## Parameters\n",
+    "- subreddit_name: String (target subreddit)\n",
+    "- max_posts: Integer (maximum posts to fetch)\n",
+    "- max_comments: Integer, optional (cap on 'load more comments' expansions per post; `None` expands all)\n",
+    "\n",
+    "## Returns\n",
+    "List of dictionaries containing:\n",
+    "### Post Data\n",
+    "- id, title, author, score\n",
+    "- upvotes, downvotes\n",
+    "- num_comments, created_utc\n",
+    "- url, permalink, selftext\n",
+    "### Comment Data (per post)\n",
+    "- id, author, body, score\n",
+    "- upvotes, downvotes\n",
+    "- created_utc, parent_id\n",
+    "- link_id, permalink\n",
+    "\n",
+    "## Features\n",
+    "- Rate limited (0.5s delay between posts)\n",
+    "- Handles comment tree expansion\n",
+    "- Includes error handling and logging"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "e33237cf98b570b4",
+   "metadata": {
+    "ExecuteTime": {
+     "end_time": "2024-11-17T14:42:07.837177Z",
+     "start_time": "2024-11-17T14:42:07.830900Z"
+    },
+    "id": "e33237cf98b570b4"
+   },
+   "outputs": [],
+   "source": [
+    "def get_subreddit_posts(subreddit_name: str, max_posts: int, max_comments: int = None):\n",
+    "    \"\"\"\n",
+    "    Fetch recent posts and their comments from a subreddit.\n",
+    "    \"\"\"\n",
+    "    subreddit = reddit.subreddit(subreddit_name)\n",
+    "    posts_data = []\n",
+    "\n",
+    "    print(f\"Fetching posts from r/{subreddit_name}...\")\n",
+    "\n",
+    "    try:\n",
+    "        for post in subreddit.new(limit=max_posts):\n",
+    "            post_info = {\n",
+    "                'id': post.id,\n",
+    "                'title': post.title,\n",
+    "                'author': str(post.author),\n",
+    "                'score': post.score,\n",
+    "                'upvotes': post.ups,\n",
+    "                'downvotes': post.downs,\n",
+    "                'num_comments': post.num_comments,\n",
+    "                'created_utc': post.created_utc,\n",
+    "                'url': post.url,\n",
+    "                'permalink': post.permalink,\n",
+    "                'selftext': post.selftext,\n",
+    "                'comments': []\n",
+    "            }\n",
+    "\n",
+    "            # replace_more expands \"load more comments\" stubs; max_comments caps\n",
+    "            # those expansions (None expands all of them)\n",
+    "            post.comments.replace_more(limit=max_comments)\n",
+    "            print(f\"Fetching comments for post ID {post.id}...\")\n",
+    "            for comment in post.comments.list():\n",
+    "                comment_info = {\n",
+    "                    'id': comment.id,\n",
+    "                    'author': str(comment.author),\n",
+    "                    'body': comment.body,\n",
+    "                    'score': comment.score,\n",
+    "                    'upvotes': comment.ups,\n",
+    "                    'downvotes': comment.downs,\n",
+    "                    'created_utc': comment.created_utc,\n",
+    "                    'parent_id': comment.parent_id,\n",
+    "                    'link_id': comment.link_id,\n",
+    "                    'permalink': comment.permalink\n",
+    "                }\n",
+    "\n",
+    "                post_info['comments'].append(comment_info)\n",
+    "\n",
+    "            posts_data.append(post_info)\n",
+    "            time.sleep(0.5)\n",
+    "\n",
+    "    except Exception as e:\n",
+    "        print(f\"Error fetching posts from r/{subreddit_name}: {e}\")\n",
+    "\n",
+    "    return posts_data"
+   ]
+  },
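+  {
+   "cell_type": "markdown",
+   "id": "demo-fetch-posts-md",
+   "metadata": {
+    "id": "demo-fetch-posts-md"
+   },
+   "source": [
+    "A quick smoke test of the scraper (illustrative values; the subreddit and limits are arbitrary, and `max_comments=0` skips expanding comment trees):"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "demo-fetch-posts",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Illustrative check: fetch a few posts and confirm the shape of the result.\n",
+    "sample_posts = get_subreddit_posts(\"digitalmarketing\", max_posts=5, max_comments=0)\n",
+    "print(len(sample_posts), \"posts;\", sum(len(p['comments']) for p in sample_posts), \"comments\")"
+   ]
+  },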
+  {
+   "cell_type": "markdown",
+   "id": "_lsd_3ujQ-s2",
+   "metadata": {
+    "id": "_lsd_3ujQ-s2"
+   },
+   "source": [
+    "# ChromaDB Document Processor\n",
+    "\n",
+    "## Description\n",
+    "Processes Reddit posts and comments data into embeddings and stores them in a ChromaDB collection. Handles text chunking, embedding generation, and batch processing for large datasets.\n",
+    "\n",
+    "## Parameters\n",
+    "- posts_data: List[Dict] (Reddit posts and comments)\n",
+    "- collection_name: String (ChromaDB collection identifier)\n",
+    "\n",
+    "## Features\n",
+    "- Uses OpenAI text-embedding-3-small model\n",
+    "- Recursive text splitting (500 char chunks, 50 overlap)\n",
+    "- Batch processing (500 documents per batch)\n",
+    "- Persistent storage in ChromaDB\n",
+    "- Handles both posts and comments\n",
+    "\n",
+    "## Document Metadata\n",
+    "### Posts\n",
+    "- id, type, title, author\n",
+    "- url, created_utc\n",
+    "\n",
+    "### Comments\n",
+    "- id, type, post_id, post_title\n",
+    "- author, parent_id, url, created_utc\n",
+    "\n",
+    "## Returns\n",
+    "Integer count of total documents stored"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "7dabfbd537f3d756",
+   "metadata": {
+    "ExecuteTime": {
+     "end_time": "2024-11-17T14:42:08.193411Z",
+     "start_time": "2024-11-17T14:42:08.186305Z"
+    },
+    "id": "7dabfbd537f3d756"
+   },
+   "outputs": [],
+   "source": [
+    "def process_and_store_in_chroma(posts_data: List[Dict], collection_name: str):\n",
+    "    chroma_client = chromadb.PersistentClient(path=\"data/chroma_db\")\n",
+    "    embeddings_client = OpenAIEmbeddings(\n",
+    "        model=\"text-embedding-3-small\",\n",
+    "        openai_api_key=OPENAI_API_KEY\n",
+    "    )\n",
+    "\n",
+    "    collection = chroma_client.get_or_create_collection(name=collection_name)\n",
+    "\n",
+    "    text_splitter = RecursiveCharacterTextSplitter(\n",
+    "        chunk_size=500,\n",
+    "        chunk_overlap=50\n",
+    "    )\n",
+    "\n",
+    "    documents = []\n",
+    "    metadatas = []\n",
+    "    ids = []\n",
+    "    doc_id = 0\n",
+    "\n",
+    "    print(f\"Processing posts and comments for {collection_name}...\")\n",
+    "\n",
+    "    for post in posts_data:\n",
+    "        if post[\"selftext\"]:\n",
+    "            chunks = text_splitter.split_text(post[\"selftext\"]) if len(post[\"selftext\"]) > 500 else [post[\"selftext\"]]\n",
+    "\n",
+    "            # Chunks are only collected here; embeddings are computed once per\n",
+    "            # batch below, so each chunk is embedded exactly once.\n",
+    "            for chunk in chunks:\n",
+    "                documents.append(chunk)\n",
+    "                metadatas.append({\n",
+    "                    \"id\": post[\"id\"],\n",
+    "                    \"type\": \"post\",\n",
+    "                    \"title\": post[\"title\"],\n",
+    "                    \"author\": post[\"author\"],\n",
+    "                    \"url\": post[\"url\"],\n",
+    "                    \"created_utc\": str(post[\"created_utc\"])\n",
+    "                })\n",
+    "                ids.append(f\"doc_{doc_id}\")\n",
+    "                doc_id += 1\n",
+    "\n",
+    "        for comment in post.get(\"comments\", []):\n",
+    "            if comment[\"body\"]:\n",
+    "                chunks = text_splitter.split_text(comment[\"body\"]) if len(comment[\"body\"]) > 500 else [comment[\"body\"]]\n",
+    "\n",
+    "                for chunk in chunks:\n",
+    "                    documents.append(chunk)\n",
+    "                    metadatas.append({\n",
+    "                        \"id\": comment[\"id\"],\n",
+    "                        \"type\": \"comment\",\n",
+    "                        \"post_id\": post[\"id\"],\n",
+    "                        \"post_title\": post[\"title\"],\n",
+    "                        \"author\": comment[\"author\"],\n",
+    "                        \"parent_id\": comment[\"parent_id\"],\n",
+    "                        # comments carry a permalink rather than a url field\n",
+    "                        \"url\": comment[\"permalink\"],\n",
+    "                        \"created_utc\": str(comment[\"created_utc\"])\n",
+    "                    })\n",
+    "                    ids.append(f\"doc_{doc_id}\")\n",
+    "                    doc_id += 1\n",
+    "\n",
+    "        # Embed and flush in batches of 500 documents\n",
+    "        if len(documents) >= 500:\n",
+    "            embeddings = embeddings_client.embed_documents(documents)\n",
+    "            collection.add(\n",
+    "                documents=documents,\n",
+    "                embeddings=embeddings,\n",
+    "                metadatas=metadatas,\n",
+    "                ids=ids\n",
+    "            )\n",
+    "            print(f\"Added batch of {len(documents)} documents to Chroma\")\n",
+    "            documents = []\n",
+    "            metadatas = []\n",
+    "            ids = []\n",
+    "\n",
+    "    # Add any remaining documents\n",
+    "    if documents:\n",
+    "        embeddings = embeddings_client.embed_documents(documents)\n",
+    "        collection.add(\n",
+    "            documents=documents,\n",
+    "            embeddings=embeddings,\n",
+    "            metadatas=metadatas,\n",
+    "            ids=ids\n",
+    "        )\n",
+    "        print(f\"Added final batch of {len(documents)} documents to Chroma\")\n",
+    "\n",
+    "    return collection.count()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "63kc1f36RG-s",
+   "metadata": {
+    "id": "63kc1f36RG-s"
+   },
+   "source": [
+    "# Knowledge Base Generator\n",
+    "\n",
+    "## Description\n",
+    "Orchestrates the full knowledge base generation pipeline, from Reddit search to ChromaDB storage. Combines subreddit search, data collection, and embedding storage functionalities.\n",
+    "\n",
+    "## Parameters\n",
+    "- user_query: String (search topic)\n",
+    "- max_subreddits: Integer (max subreddits per search term)\n",
+    "- min_subscribers: Integer (minimum subscriber threshold)\n",
+    "- max_posts: Integer (posts per subreddit)\n",
+    "- max_comments: Integer, optional (cap on comment-tree expansions per post)\n",
+    "\n",
+    "## Process Flow\n",
+    "1. Generates search terms from query\n",
+    "2. Searches relevant subreddits\n",
+    "3. Filters by subscriber count\n",
+    "4. Collects posts and comments\n",
+    "5. Processes into ChromaDB collections\n",
+    "\n",
+    "## Features\n",
+    "- Rate limiting between requests\n",
+    "- Progress logging\n",
+    "- Error handling per subreddit\n",
+    "- Creates separate collection per subreddit"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "78d60fa8d41f57b3",
+   "metadata": {
+    "ExecuteTime": {
+     "end_time": "2024-11-17T14:42:08.527535Z",
+     "start_time": "2024-11-17T14:42:08.523598Z"
+    },
+    "id": "78d60fa8d41f57b3"
+   },
+   "outputs": [],
+   "source": [
+    "def generate_knowledge_base(user_query: str, max_subreddits: int, min_subscribers: int, max_posts: int, max_comments: int = None):\n",
+    "    search_terms = get_relevant_topics(query=user_query)\n",
+    "    found_subreddits = search_subreddits(search_terms=search_terms, limit=max_subreddits)\n",
+    "\n",
+    "    if not found_subreddits:\n",
+    "        print(\"No subreddits found for the given query.\")\n",
+    "        return\n",
+    "\n",
+    "    print(\"\\nRetrieving subreddit information...\")\n",
+    "    subreddit_infos = []\n",
+    "    for subreddit_name in found_subreddits:\n",
+    "        info = get_subreddit_info(subreddit_name)\n",
+    "        if info:\n",
+    "            subreddit_infos.append(info)\n",
+    "        time.sleep(1)\n",
+    "\n",
+    "    # The subscriber threshold is applied here, after metadata retrieval\n",
+    "    for info in subreddit_infos:\n",
+    "        if info['subscribers'] > min_subscribers:\n",
+    "            subreddit_name = info['name']\n",
+    "            print(f\"\\nProcessing r/{subreddit_name}...\")\n",
+    "\n",
+    "            posts = get_subreddit_posts(subreddit_name, max_posts, max_comments)\n",
+    "\n",
+    "            try:\n",
+    "                collection_name = f\"reddit_{subreddit_name.lower()}\"\n",
+    "                doc_count = process_and_store_in_chroma(posts, collection_name)\n",
+    "                print(f\"Processed and stored {doc_count} documents in Chroma collection '{collection_name}'\")\n",
+    "            except Exception as e:\n",
+    "                print(f\"Error processing and storing data in Chroma: {e}\")"
+   ]
+  },
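+  {
+   "cell_type": "markdown",
+   "id": "verify-collections-md",
+   "metadata": {
+    "id": "verify-collections-md"
+   },
+   "source": [
+    "After a generation run, the stored collections can be sanity-checked directly. A minimal sketch (illustrative; the return type of `list_collections()` varies across chromadb versions):"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "verify-collections",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Sketch: list the collections the pipeline stored and their document counts.\n",
+    "# Note: some chromadb versions return collection names here instead of objects.\n",
+    "inspect_client = chromadb.PersistentClient(path=\"data/chroma_db\")\n",
+    "for coll in inspect_client.list_collections():\n",
+    "    print(coll.name, coll.count())"
+   ]
+  },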
+  {
+   "cell_type": "markdown",
+   "id": "c6RuBgG9RN9Q",
+   "metadata": {
+    "id": "c6RuBgG9RN9Q"
+   },
+   "source": [
+    "# Main Execution Block\n",
+    "\n",
+    "## Description\n",
+    "Entry point for knowledge base generation, triggered by the `GENERATE_KNOWLEDGE` flag. Handles user input collection and initiates the generation process.\n",
+    "\n",
+    "## User Inputs\n",
+    "- query: Search topic string\n",
+    "- max_subreddits: Integer\n",
+    "\n",
+    "## Fixed Parameters\n",
+    "- min_subscribers: 100,000\n",
+    "- max_posts: 100 per subreddit\n",
+    "\n",
+    "## Usage\n",
+    "Set `GENERATE_KNOWLEDGE = True` in the configuration cell above to activate interactive mode"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "75663b31bfffed30",
+   "metadata": {
+    "ExecuteTime": {
+     "end_time": "2024-11-17T14:42:08.826137Z",
+     "start_time": "2024-11-17T14:42:08.817062Z"
+    },
+    "id": "75663b31bfffed30"
+   },
+   "outputs": [],
+   "source": [
+    "if GENERATE_KNOWLEDGE:\n",
+    "    query = input(\"Enter the search query for subreddits: \")\n",
+    "    max_subreddits = int(input(\"Enter the number of subreddits to find: \"))\n",
+    "    generate_knowledge_base(user_query=query, max_subreddits=max_subreddits, min_subscribers=100000, max_posts=100)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "bbd45328086e0fca",
+   "metadata": {
+    "id": "bbd45328086e0fca"
+   },
+   "source": [
+    "## Agent workflow"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "VV56UNCNR4Cg",
+   "metadata": {
+    "id": "VV56UNCNR4Cg"
+   },
+   "source": [
+    "# Core LangChain & LangGraph Imports\n",
+    "\n",
+    "Imports for OpenAI chat model integration (`ChatOpenAI`), prompt templating (`ChatPromptTemplate`), message types (`SystemMessage`, `AIMessage`, `HumanMessage`), and graph-based conversation flow components (`StateGraph`, `START`, `END`)."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "53990c6a46798f0b",
+   "metadata": {
+    "ExecuteTime": {
+     "end_time": "2024-11-17T14:42:10.337836Z",
+     "start_time": "2024-11-17T14:42:10.299750Z"
+    },
+    "id": "53990c6a46798f0b"
+   },
+   "outputs": [],
+   "source": [
+    "from langchain_openai import ChatOpenAI\n",
+    "from langchain_core.prompts import ChatPromptTemplate\n",
+    "from langchain_core.messages import SystemMessage, AIMessage, HumanMessage\n",
+    "from langgraph.graph import END, START, StateGraph"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "evCRXW2GSGB4",
+   "metadata": {
+    "id": "evCRXW2GSGB4"
+   },
+   "source": [
+    "# ChatOpenAI Model Initialization\n",
+    "\n",
+    "Initializes the GPT-4 model with temperature 0, minimizing randomness so outputs are close to deterministic."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "460ae210411f8a56",
+   "metadata": {
+    "ExecuteTime": {
+     "end_time": "2024-11-17T14:42:10.754797Z",
+     "start_time": "2024-11-17T14:42:10.731612Z"
+    },
+    "id": "460ae210411f8a56"
+   },
+   "outputs": [],
+   "source": [
+    "llm = ChatOpenAI(model=\"gpt-4\", temperature=0)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "4sfO4H_YSOZz",
+   "metadata": {
+    "id": "4sfO4H_YSOZz"
+   },
+   "source": [
+    "# query_chroma Function\n",
+    "- Queries Chroma DB with OpenAI embeddings to find similar content\n",
+    "- Input: query text, collection name, number of results (default=5)\n",
+    "- Uses text-embedding-3-small model and local persistent storage\n",
+    "- Returns matching documents from specified collection"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "7502cee186316763",
+   "metadata": {
+    "ExecuteTime": {
+     "end_time": "2024-11-17T14:42:10.923975Z",
+     "start_time": "2024-11-17T14:42:10.918863Z"
+    },
+    "id": "7502cee186316763"
+   },
+   "outputs": [],
+   "source": [
+    "def query_chroma(query_text: str, collection_name: str, n_results: int = 5):\n",
+    "    \"\"\"\n",
+    "    Query the Chroma database for similar content.\n",
+    "    \"\"\"\n",
+    "    chroma_client = chromadb.PersistentClient(path=\"data/chroma_db\")\n",
+    "    embeddings_client = OpenAIEmbeddings(\n",
+    "        model=\"text-embedding-3-small\",\n",
+    "        openai_api_key=OPENAI_API_KEY\n",
+    "    )\n",
+    "\n",
+    "    collection = chroma_client.get_collection(name=collection_name)\n",
+    "    query_embedding = embeddings_client.embed_query(query_text)\n",
+    "\n",
+    "    results = collection.query(\n",
+    "        query_embeddings=[query_embedding],\n",
+    "        n_results=n_results\n",
+    "    )\n",
+    "\n",
+    "    return results"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "pJPg-vZ_SUPE",
+   "metadata": {
+    "id": "pJPg-vZ_SUPE"
+   },
+   "source": [
+    "# query_and_format_results Function\n",
+    "- Extends query_chroma by formatting results into an organized dictionary\n",
+    "- Input: query text, collection name, result count (default=5)\n",
+    "- Returns: Dict with query info (search query, collection, total results) and matches (content, metadata, distance)\n",
+    "- Includes error handling with exception reporting"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "a669951fc079ccd1",
+   "metadata": {
+    "ExecuteTime": {
+     "end_time": "2024-11-17T14:42:11.083053Z",
+     "start_time": "2024-11-17T14:42:11.078273Z"
+    },
+    "id": "a669951fc079ccd1"
+   },
+   "outputs": [],
+   "source": [
+    "def query_and_format_results(query_text: str, collection_name: str, n_results: int = 5) -> Dict[str, Any]:\n",
+    "    \"\"\"\n",
+    "    Query Chroma and return formatted results with all related information.\n",
+    "\n",
+    "    Args:\n",
+    "        query_text (str): The search query\n",
+    "        collection_name (str): Name of the Chroma collection to search\n",
+    "        n_results (int): Number of results to return\n",
+    "\n",
+    "    Returns:\n",
+    "        dict: Formatted results with query info and matched documents\n",
+    "    \"\"\"\n",
+    "    try:\n",
+    "        # Get raw results from Chroma\n",
+    "        results = query_chroma(query_text, collection_name, n_results)\n",
+    "\n",
+    "        # Format the results\n",
+    "        formatted_results = {\n",
+    "            \"query_info\": {\n",
+    "                \"search_query\": query_text,\n",
+    "                \"collection\": collection_name,\n",
+    "                \"total_results\": len(results['ids'][0])\n",
+    "            },\n",
+    "            \"matches\": []\n",
+    "        }\n",
+    "\n",
+    "        # Process each result\n",
+    "        for i in range(len(results['ids'][0])):\n",
+    "            match = {\n",
+    "                \"content\": results['documents'][0][i],\n",
+    "                \"metadata\": results['metadatas'][0][i],\n",
+    "                \"distance\": results['distances'][0][i] if 'distances' in results else None\n",
+    "            }\n",
+    "            formatted_results[\"matches\"].append(match)\n",
+    "\n",
+    "        return formatted_results\n",
+    "\n",
+    "    except Exception as e:\n",
+    "        print(f\"Error querying database: {e}\")\n",
+    "        return None"
+   ]
+  },
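+  {
+   "cell_type": "markdown",
+   "id": "demo-query-results-md",
+   "metadata": {
+    "id": "demo-query-results-md"
+   },
+   "source": [
+    "Once a collection such as `reddit_digitalmarketing` exists (built by the pipeline above), the formatted output can be spot-checked; the query string here is just an example:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "demo-query-results",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Illustrative spot-check against a collection created by the pipeline.\n",
+    "demo = query_and_format_results(\"best email marketing tools\", \"reddit_digitalmarketing\", n_results=3)\n",
+    "if demo:\n",
+    "    print(json.dumps(demo[\"query_info\"], indent=2))\n",
+    "    print(demo[\"matches\"][0][\"content\"][:200])"
+   ]
+  },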
+  {
+   "cell_type": "markdown",
+   "id": "Y9jQshMaSZlN",
+   "metadata": {
+    "id": "Y9jQshMaSZlN"
+   },
+   "source": [
+    "# augment_query Function\n",
+    "- Enhances the search query using rules from a JSON file\n",
+    "- Input: query text and optional JSON path (default='augment.json')\n",
+    "- Applies prefix/suffix rules from the JSON to create multiple query variations\n",
+    "- Falls back to the original query if JSON loading fails"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "7eeba914d30f2025",
+   "metadata": {
+    "ExecuteTime": {
+     "end_time": "2024-11-17T14:42:11.230777Z",
+     "start_time": "2024-11-17T14:42:11.227068Z"
+    },
+    "id": "7eeba914d30f2025"
+   },
+   "outputs": [],
+   "source": [
+    "def augment_query(query_text: str, query_file_path: str = 'augment.json'):\n",
+    "    \"\"\"Expand a query into variants using prefix/suffix rules from a JSON file.\"\"\"\n",
+    "    try:\n",
+    "        with open(query_file_path, 'r') as f:\n",
+    "            rules = json.load(f)\n",
+    "        return [f\"{rule.get('prefix', '')} {query_text} {rule.get('suffix', '')}\".strip()\n",
+    "                for rule in rules.get('searches', [])]\n",
+    "    except Exception:\n",
+    "        # Fall back to the original query if the rules file is missing or malformed\n",
+    "        return [query_text]"
+   ]
+  },
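+  {
+   "cell_type": "markdown",
+   "id": "demo-augment-md",
+   "metadata": {
+    "id": "demo-augment-md"
+   },
+   "source": [
+    "With the rules shipped in `augment.json`, a query such as \"lead generation\" expands into variants like \"Find popular tools for lead generation and their main features\", which is easy to verify inline:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "demo-augment",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Print the expanded variants produced by the rules in augment.json.\n",
+    "for variant in augment_query(\"lead generation\"):\n",
+    "    print(variant)"
+   ]
+  },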
" results = state[\"results\"]\n", + " query = state[\"query\"]\n", + "\n", + " context_parts = []\n", + " for match in results['matches']:\n", + " metadata = match['metadata']\n", + " source_info = f\"Source: {metadata['type'].capitalize()}\"\n", + " if metadata['type'] == 'post':\n", + " source_info += f\" - Title: {metadata['title']}\"\n", + "\n", + " context_parts.append(f\"{source_info}\\nContent: {match['content']}\\n\")\n", + "\n", + " context = \"\\n\".join(context_parts)\n", + "\n", + " with open('generator_prompt.txt', 'r') as f:\n", + " sys_prompt = f.read()\n", + "\n", + " prompt = ChatPromptTemplate.from_messages([\n", + " SystemMessage(content=sys_prompt),\n", + " AIMessage(content=f\"**Context:**\\n{context}\\n\\n\"),\n", + " HumanMessage(content=f\"**Question:** {query}\")\n", + " ])\n", + "\n", + " generator = prompt | llm\n", + " response = generator.invoke({})\n", + "\n", + " return {\"results\": results, \"query\": query, \"generation\": response.content}" + ] + }, + { + "cell_type": "markdown", + "id": "_inv-35rSpZm", + "metadata": { + "id": "_inv-35rSpZm" + }, + "source": [ + "# AgentState TypedDict\n", + "- Defines structure for agent's conversation state\n", + "- Contains query (str), generation (str), and results (nested dict/list)\n", + "- Used for type checking and state management in conversation flow" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "214c9bbe6fa1e691", + "metadata": { + "ExecuteTime": { + "end_time": "2024-11-17T14:42:11.695346Z", + "start_time": "2024-11-17T14:42:11.692804Z" + }, + "id": "214c9bbe6fa1e691" + }, + "outputs": [], + "source": [ + "class AgentState(TypedDict):\n", + " query : str\n", + " generation: str\n", + " results : Dict[dict, list]" + ] + }, + { + "cell_type": "markdown", + "id": "b6sQHedLSuwi", + "metadata": { + "id": "b6sQHedLSuwi" + }, + "source": [ + "# LangGraph Workflow Setup\n", + "- Creates StateGraph with AgentState type\n", + "- Defines two processing nodes: retrieve and generate\n", + "- Sets linear flow: START → retrieve → generate → END\n", + "- Compiles graph for execution" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "52716f0f97c410a4", + "metadata": { + "ExecuteTime": { + "end_time": "2024-11-17T14:42:11.831623Z", + "start_time": "2024-11-17T14:42:11.827972Z" + }, + "id": "52716f0f97c410a4" + }, + "outputs": [], + "source": [ + "workflow = StateGraph(AgentState)\n", + "\n", + "workflow.add_node(retrieve, \"retrieve\")\n", + "workflow.add_node(generate, \"generate\")\n", + "\n", + "workflow.add_edge(start_key=START, end_key=\"retrieve\")\n", + "workflow.add_edge(start_key=\"retrieve\", end_key=\"generate\")\n", + "workflow.add_edge(start_key=\"generate\", end_key=END)\n", + "\n", + "graph = workflow.compile()" + ] + }, + { + "cell_type": "markdown", + "id": "KPa4qe1TS031", + "metadata": { + "id": "KPa4qe1TS031" + }, + "source": [ + "# Workflow Visualization Command\n", + "- Uses IPython to display workflow graph\n", + "- Renders Mermaid diagram showing node connections\n", + "- XRay mode reveals detailed graph structure\n", + "- Helps visualize START → retrieve → generate → END flow" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "6827a8528a3dcd4c", + "metadata": { + "ExecuteTime": { + "end_time": "2024-11-17T14:42:12.694296Z", + "start_time": "2024-11-17T14:42:11.984323Z" + }, + "id": "6827a8528a3dcd4c" + }, + "outputs": [], + "source": [ + "from IPython.display import Image, display\n", + "\n", + 
"display(Image(graph.get_graph(xray=True).draw_mermaid_png()))" + ] + }, + { + "cell_type": "markdown", + "id": "YuhUg5KvS6LG", + "metadata": { + "id": "YuhUg5KvS6LG" + }, + "source": [ + "# Process query function\n", + "- Takes input dict with search query\n", + "- Streams graph execution output step by step\n", + "- Prints completion of each node (retrieve, generate)\n", + "- Shows generated response when available" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "1bb73ff7a40fa7e3", + "metadata": { + "ExecuteTime": { + "end_time": "2024-11-17T14:42:23.083748Z", + "start_time": "2024-11-17T14:42:12.714588Z" + }, + "id": "1bb73ff7a40fa7e3" + }, + "outputs": [], + "source": [ + "def process_query(inputs):\n", + " augmented_inputs = augment_query(inputs[\"query\"])\n", + " for augmented_query in augmented_inputs:\n", + " augmented_input = {\"query\": augmented_query} \n", + " print (\"Augmented query\", augmented_query)\n", + " for output in graph.stream(augmented_input):\n", + " for key, value in output.items():\n", + " print(f\"Finished running: {key}:\")\n", + " if 'generation' in value:\n", + " print(value['generation'])" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "92d0c9bf1564b2d1", + "metadata": { + "id": "92d0c9bf1564b2d1" + }, + "outputs": [], + "source": [ + "inputs = {\"query\": \"lead generation tools \"}\n", + "process_query(inputs)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "cef7efee-61bb-4d64-bfda-82960cf3ba2a", + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "colab": { + "provenance": [] + }, + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.13.0" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/AI-marketing-agent/README.md b/AI-marketing-agent/README.md new file mode 100644 index 0000000..4124da4 --- /dev/null +++ b/AI-marketing-agent/README.md @@ -0,0 +1,86 @@ +# AI Marketing Research Agent + +This project implements an intelligent agent for conducting marketing research and competitive analysis using Reddit data. The agent leverages NLP and LLM capabilities to analyze social media discussions, identify market trends, and generate actionable marketing insights. + +## Features + +- Automated Reddit data collection and analysis +- Intelligent query augmentation for comprehensive research +- Vector database storage for efficient information retrieval +- LLM-powered insight generation +- Graph-based conversation flow + +## Technology Stack + +- **chromadb**: Vector database for storing and managing embeddings +- **praw**: Python Reddit API Wrapper +- **openai**: OpenAI's API integration +- **python-dotenv**: Environment variable management +- **langchain**: Framework for LLM applications +- **langchain-openai**: OpenAI integration for LangChain +- **langchain-text-splitters**: Text processing utilities +- **langgraph**: Graph-based operations + +## Setup + +1. Install required packages: +```bash +pip install chromadb praw openai python-dotenv langchain langchain-openai langchain-text-splitters langgraph +``` + +2. 
+
+## Output Format
+
+The agent provides insights in the following structure:
+1. Key Findings
+2. Market Implications
+3. Recommendations
diff --git a/AI-marketing-agent/augment.json b/AI-marketing-agent/augment.json
new file mode 100644
index 0000000..51de8be
--- /dev/null
+++ b/AI-marketing-agent/augment.json
@@ -0,0 +1,36 @@
+{
+  "searches": [
+    {
+      "prefix": "Find popular tools for",
+      "suffix": "and their main features"
+    },
+    {
+      "prefix": "List common complaints about",
+      "suffix": "tools and software"
+    },
+    {
+      "prefix": "Identify market leaders in",
+      "suffix": "tools and their advantages"
+    },
+    {
+      "prefix": "Find current trends in",
+      "suffix": "tools and technologies"
+    },
+    {
+      "prefix": "What are users wishing for in",
+      "suffix": "tools and features"
+    },
+    {
+      "prefix": "Compare different",
+      "suffix": "tools pricing and features"
+    },
+    {
+      "prefix": "Find pain points with existing",
+      "suffix": "tools and solutions"
+    },
+    {
+      "prefix": "Discover emerging alternatives to",
+      "suffix": "tools in the market"
+    }
+  ]
+}
\ No newline at end of file
diff --git a/AI-marketing-agent/config.py b/AI-marketing-agent/config.py
new file mode 100644
index 0000000..f2b3f9d
--- /dev/null
+++ b/AI-marketing-agent/config.py
@@ -0,0 +1,9 @@
+# config.py
+
+# Reddit API Configuration (placeholders only; supply your own credentials, e.g. via .env)
+REDDIT_CONFIG = {
+    'client_id': 'YOUR_REDDIT_CLIENT_ID',
+    'client_secret': 'YOUR_REDDIT_CLIENT_SECRET',
+    'user_agent': 'analyse-market/1.0 (by /u/your_username)',
+    'redirect_uri': 'http://localhost:8080'
+}
diff --git a/AI-marketing-agent/data/chroma_db/82a0e5e1-126b-4caa-a6c8-ef8a3700eca9/data_level0.bin b/AI-marketing-agent/data/chroma_db/82a0e5e1-126b-4caa-a6c8-ef8a3700eca9/data_level0.bin
new file mode 100644
index 0000000..22b3f37
Binary files /dev/null and b/AI-marketing-agent/data/chroma_db/82a0e5e1-126b-4caa-a6c8-ef8a3700eca9/data_level0.bin differ
diff --git a/AI-marketing-agent/data/chroma_db/82a0e5e1-126b-4caa-a6c8-ef8a3700eca9/header.bin b/AI-marketing-agent/data/chroma_db/82a0e5e1-126b-4caa-a6c8-ef8a3700eca9/header.bin
new file mode 100644
index 0000000..a522450
Binary files /dev/null and b/AI-marketing-agent/data/chroma_db/82a0e5e1-126b-4caa-a6c8-ef8a3700eca9/header.bin differ
diff --git a/AI-marketing-agent/data/chroma_db/82a0e5e1-126b-4caa-a6c8-ef8a3700eca9/index_metadata.pickle b/AI-marketing-agent/data/chroma_db/82a0e5e1-126b-4caa-a6c8-ef8a3700eca9/index_metadata.pickle
new file mode 100644
index 0000000..8568bc8
Binary files /dev/null and b/AI-marketing-agent/data/chroma_db/82a0e5e1-126b-4caa-a6c8-ef8a3700eca9/index_metadata.pickle differ
diff --git a/AI-marketing-agent/data/chroma_db/82a0e5e1-126b-4caa-a6c8-ef8a3700eca9/length.bin b/AI-marketing-agent/data/chroma_db/82a0e5e1-126b-4caa-a6c8-ef8a3700eca9/length.bin
new file mode 100644
index 0000000..6869fc0
Binary files /dev/null and b/AI-marketing-agent/data/chroma_db/82a0e5e1-126b-4caa-a6c8-ef8a3700eca9/length.bin differ
diff --git a/AI-marketing-agent/data/chroma_db/82a0e5e1-126b-4caa-a6c8-ef8a3700eca9/link_lists.bin b/AI-marketing-agent/data/chroma_db/82a0e5e1-126b-4caa-a6c8-ef8a3700eca9/link_lists.bin
new file mode 100644
index 0000000..a2415fb
Binary files /dev/null and b/AI-marketing-agent/data/chroma_db/82a0e5e1-126b-4caa-a6c8-ef8a3700eca9/link_lists.bin differ
diff --git a/AI-marketing-agent/data/chroma_db/chroma.sqlite3 b/AI-marketing-agent/data/chroma_db/chroma.sqlite3
new file mode 100644
index 0000000..f7de1d7
Binary files /dev/null and b/AI-marketing-agent/data/chroma_db/chroma.sqlite3 differ
diff --git a/AI-marketing-agent/generator_prompt.txt b/AI-marketing-agent/generator_prompt.txt
new file mode 100644
index 0000000..b5eb525
--- /dev/null
+++ b/AI-marketing-agent/generator_prompt.txt
@@ -0,0 +1,10 @@
+**You are an expert marketing analyst specializing in competitive analysis, market trends, and consumer behavior. Your role is to analyze social media discussions, customer feedback, and market conversations to provide actionable marketing insights. Focus on identifying market opportunities, competitive advantages, customer pain points, and potential marketing strategies.**
+
+**Instructions:**
+Analyze the provided information and present insights in the following format:
+    1. Key Findings: Identify the main patterns or insights
+    2. Market Implications: What this means for the market/business
+    3. Recommendations: Specific, actionable marketing recommendations based on the analysis
+
+**Considerations:**
+- Stick to information present in the context. If certain aspects aren't covered in the data, acknowledge the gaps and note what additional information would be valuable.