This project is designed for CDK development with Python. Its primary objective is to extract captions from YouTube videos, process them, and store the resulting transcripts and metadata in AWS services such as S3 and DynamoDB. It currently uses pytubefix (bundled as a layer for deployment) to fetch transcripts and metadata. I added a Jupyter notebook with some examples for scraping entire channels or individual videos.
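To make the fetch, process, and store steps concrete, here is a minimal sketch. It is not the project's actual code: the function names, the S3 key layout, the DynamoDB attribute names, and the bucket and table arguments are all hypothetical, and the pytubefix calls assume its pytube-style `YouTube(...).captions` interface.

```python
import re


def srt_to_text(srt: str) -> str:
    """Drop cue numbers and timing lines from SRT captions, keeping the spoken text."""
    parts = []
    # SRT cues are separated by blank lines: index, timing line, then text lines.
    for block in re.split(r"\n\s*\n", srt.strip()):
        lines = [line.strip() for line in block.splitlines()]
        for i, line in enumerate(lines):
            if "-->" in line:  # timing line, e.g. "00:00:01,000 --> 00:00:02,500"
                parts.extend(text for text in lines[i + 1:] if text)
                break
    return " ".join(parts)


def build_record(video_id: str, title: str, transcript: str) -> dict:
    """Build a metadata item (hypothetical schema) that points at the transcript in S3."""
    return {
        "video_id": video_id,
        "title": title,
        "s3_key": f"transcripts/{video_id}.txt",  # assumed key layout
        "word_count": len(transcript.split()),
    }


def fetch_srt(url: str, lang: str = "en") -> tuple[str, str, str]:
    """Fetch the video id, title, and SRT captions (assumes the pytubefix API)."""
    from pytubefix import YouTube  # imported lazily; not needed for the helpers above

    yt = YouTube(url)
    caption = yt.captions[lang]  # assumed to raise KeyError if the track is missing
    return yt.video_id, yt.title, caption.generate_srt_captions()


def store(bucket: str, table: str, record: dict, transcript: str) -> None:
    """Write the transcript to S3 and its metadata to DynamoDB."""
    import boto3  # imported lazily so the pure helpers run without AWS dependencies

    boto3.client("s3").put_object(
        Bucket=bucket, Key=record["s3_key"], Body=transcript.encode("utf-8")
    )
    boto3.resource("dynamodb").Table(table).put_item(Item=record)
```

Keeping the parsing and record-building steps free of network and AWS calls makes them easy to try out in the notebook before anything is deployed.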
Below is a guide to help you set up and deploy the project.
Once the virtual environment is activated, install the required dependencies using Poetry.
$ poetry install --with dev
To make the Python environment accessible in Jupyter, run the following command:
$ python -m ipykernel install --user --name=transcriptcollector --display-name "Python (transcriptcollector)"
This role is created to have access to the created resources (DynamoDB, S3 bucket, Lambda invocation). Make sure you replace the placeholders with your own values.
$ cdk deploy --parameters Environment=<your_environment> --parameters SageMakerExecutionRoleARN=<your_role_arn>
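If you prefer to script the deployment (for example from the notebook or a CI job), the command above can be assembled in Python. This is only a sketch: `build_deploy_command` and `deploy` are hypothetical helper names, and the ARN check is a light sanity test rather than full validation.

```python
import subprocess


def build_deploy_command(environment: str, role_arn: str) -> list[str]:
    """Assemble the `cdk deploy` argument list with both stack parameters."""
    if not environment:
        raise ValueError("environment must not be empty")
    if not role_arn.startswith("arn:"):
        raise ValueError(f"not an ARN: {role_arn!r}")
    return [
        "cdk", "deploy",
        "--parameters", f"Environment={environment}",
        "--parameters", f"SageMakerExecutionRoleARN={role_arn}",
    ]


def deploy(environment: str, role_arn: str) -> None:
    """Run the deployment; raises CalledProcessError if `cdk deploy` fails."""
    subprocess.run(build_deploy_command(environment, role_arn), check=True)
```

Passing the arguments as a list avoids shell quoting problems when a value contains special characters.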
cdk ls: List all stacks in the app
cdk synth: Emit the synthesized CloudFormation template
cdk deploy: Deploy this stack to your default AWS account/region
cdk diff: Compare deployed stack with the current state
cdk docs: Open CDK documentation