GitHub - JanB1989/TranscriptCollector: Deployable service to scrape and store YouTube channel transcripts

This is an AWS CDK project written in Python. Its primary objective is to extract captions from YouTube videos, process them, and store the resulting transcripts and metadata in AWS services such as S3 and DynamoDB. It currently uses pytubefix (bundled as a layer for deployment) to fetch transcripts and metadata. A Jupyter notebook (scan_channels.ipynb) with some examples for scraping entire channels or individual videos is included.
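To make the "process" step more concrete, here is a minimal sketch of turning raw captions into a plain transcript plus a small metadata record. It assumes the captions arrive as SRT text (the format pytubefix can produce via `generate_srt_captions()`); the function names, record fields, and the S3 key layout are hypothetical and not taken from this repository.

```python
import re

# Matches an SRT timing line such as "00:00:01,000 --> 00:00:03,500".
_TIMING = re.compile(r"^\d{2}:\d{2}:\d{2},\d{3} --> \d{2}:\d{2}:\d{2},\d{3}$")


def srt_to_text(srt: str) -> str:
    """Strip cue numbers and timing lines, then join the caption text."""
    parts = []
    # SRT cues are separated by blank lines.
    for block in re.split(r"\n\s*\n", srt.strip()):
        lines = [ln.strip() for ln in block.splitlines() if ln.strip()]
        # Skip the cue index and the timing line when they are present.
        if lines and lines[0].isdigit():
            lines = lines[1:]
        if lines and _TIMING.match(lines[0]):
            lines = lines[1:]
        parts.extend(lines)
    return " ".join(parts)


def build_record(video_id: str, title: str, srt: str) -> dict:
    """Build an illustrative metadata record (field names are assumptions)."""
    text = srt_to_text(srt)
    return {
        "video_id": video_id,
        "title": title,
        "transcript_key": f"transcripts/{video_id}.txt",  # assumed S3 key layout
        "word_count": len(text.split()),
    }


sample = (
    "1\n00:00:00,000 --> 00:00:02,000\nHello world\n\n"
    "2\n00:00:02,000 --> 00:00:04,000\nthis is a test\n"
)
print(srt_to_text(sample))
print(build_record("abc123", "Demo", sample))
```

In a deployed setup, the transcript text would typically be written to S3 under the key and the record stored in DynamoDB; those calls are left out here so the sketch runs without AWS credentials.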

Below is a guide to help you set up and deploy the project.

Installing Dependencies

Once the virtual environment is activated, install the required dependencies using Poetry.

$ poetry install --with dev

Making the Kernel Accessible to Jupyter

To make the Python environment accessible in Jupyter, run the following command:

$ python -m ipykernel install --user --name=transcriptcollector --display-name "Python (transcriptcollector)"

Deploying to a Stage and Creating a Role

This role is created so that it has access to the deployed resources (DynamoDB table, S3 bucket, Lambda invocation). Replace the placeholders below with your own values.

$ cdk deploy --parameters Environment=<your_environment> --parameters SageMakerExecutionRoleARN=<your_role_arn>
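If you prefer to script the deployment (for example from a notebook), the command above can be assembled in Python. This is only a sketch: `build_deploy_command` is a hypothetical helper, the ARN shown is a made-up placeholder, and the parameter names are the ones used in the command above.

```python
import shlex


def build_deploy_command(environment: str, role_arn: str) -> list[str]:
    """Return the argument list for the `cdk deploy` call shown above."""
    # Basic sanity check so a bare role name is not passed by mistake.
    if not role_arn.startswith("arn:"):
        raise ValueError("role_arn should be a full ARN, e.g. arn:aws:iam::...")
    return [
        "cdk", "deploy",
        "--parameters", f"Environment={environment}",
        "--parameters", f"SageMakerExecutionRoleARN={role_arn}",
    ]


# Placeholder values for illustration only.
cmd = build_deploy_command("dev", "arn:aws:iam::123456789012:role/ExampleRole")
print(shlex.join(cmd))
# To actually deploy: subprocess.run(cmd, check=True)
```

Building the command as a list rather than a single string avoids shell quoting problems when it is later handed to `subprocess.run`.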

Useful Commands

cdk ls      List all stacks in the app
cdk synth   Emit the synthesized CloudFormation template
cdk deploy  Deploy this stack to your default AWS account/region
cdk diff    Compare deployed stack with the current state
cdk docs    Open CDK documentation

