From daad911e233e6a3e6ae84eef8b581bdef4ecba41 Mon Sep 17 00:00:00 2001 From: Kanani Nirav Date: Sun, 20 Oct 2024 17:46:26 +0900 Subject: [PATCH] [Modify/Add] Add Databases & Analytics, and Other Compute Service Doc. --- README.md | 4 + sections/databases.md | 306 ++++++++++++++++++++++++++++++++++++++ sections/elb_asg.md | 10 +- sections/iam.md | 3 + sections/other_compute.md | 192 ++++++++++++++++++++++++ sections/s3.md | 1 - 6 files changed, 510 insertions(+), 6 deletions(-) create mode 100644 sections/databases.md create mode 100644 sections/other_compute.md diff --git a/README.md b/README.md index 0fe2dcc..e08ed4b 100644 --- a/README.md +++ b/README.md @@ -21,6 +21,10 @@ - Scalability & High Availability, Vertical Scalability, Horizontal Scalability, High Availability, High Availability & Scalability For EC2, Scalability vs Elasticity (vs Agility), What is load balancing?, What’s an Auto Scaling Group? - [Amazon S3](./sections/s3.md) - S3 Use cases, Amazon S3 Overview - Buckets, Amazon S3 Overview - Objects, S3 Websites, S3 Storage Classes, S3 Object Lock & Glacier Vault Lock, Shared Responsibility Model for S3, AWS Snow Family, What is Edge Computing?, Snow Family - Edge Computing, AWS OpsHub, Hybrid Cloud for Storage, AWS Storage Gateway +- [Databases & Analytics](./sections/databases.md) + - Databases Intro, Relational Databases, NoSQL Databases, Databases & Shared Responsibility on AWS, AWS RDS Overview, Amazon Aurora, Amazon ElastiCache Overview, DynamoDB, Redshift Overview, Amazon EMR, Amazon Athena, Amazon QuickSight, DocumentDB, Amazon Neptune, Amazon QLDB +- [Other Compute Section](./sections/other_compute.md) + - What is Docker?, ECS, Fargate, ECR, What’s serverless?, Why AWS Lambda ?, Amazon API Gateway, AWS Batch, Batch vs Lambda, Amazon Lightsail, Lambda Summary ## Practice Exams ( dumps ) diff --git a/sections/databases.md b/sections/databases.md new file mode 100644 index 0000000..74a669e --- /dev/null +++ b/sections/databases.md @@ -0,0 +1,306 
@@ +# Databases & Analytics + +- [Databases \& Analytics](#databases--analytics) + - [Databases Intro](#databases-intro) + - [Relational Databases (SQL)](#relational-databases-sql) + - [NoSQL Databases](#nosql-databases) + - [NoSQL data example: JSON](#nosql-data-example-json) + - [Databases \& Shared Responsibility on AWS](#databases--shared-responsibility-on-aws) + - [AWS RDS Overview](#aws-rds-overview) + - [Advantage over using RDS versus deploying DB on EC2](#advantage-over-using-rds-versus-deploying-db-on-ec2) + - [RDS Deployments](#rds-deployments) + - [RDS Deployments: Read Replicas, Multi-AZ](#rds-deployments-read-replicas-multi-az) + - [RDS Deployments: Multi-Region](#rds-deployments-multi-region) + - [Amazon Aurora](#amazon-aurora) + - [Amazon ElastiCache Overview](#amazon-elasticache-overview) + - [DynamoDB](#dynamodb) + - [DynamoDB Accelerator (DAX)](#dynamodb-accelerator-dax) + - [DynamoDB Global Tables](#dynamodb-global-tables) + - [Redshift Overview](#redshift-overview) + - [Amazon EMR (Elastic MapReduce)](#amazon-emr-elastic-mapreduce) + - [Amazon Athena](#amazon-athena) + - [Amazon QuickSight](#amazon-quicksight) + - [DocumentDB (with MongoDB Compatibility)](#documentdb-with-mongodb-compatibility) + - [Amazon Neptune](#amazon-neptune) + - [Amazon QLDB](#amazon-qldb) + - [Amazon Managed Blockchain](#amazon-managed-blockchain) + - [AWS Glue](#aws-glue) + - [DMS - Database Migration Service](#dms---database-migration-service) + - [Databases \& Analytics Summary](#databases--analytics-summary) + +## Databases Intro + +- Storing data on disk (EFS, EBS, EC2 Instance Store, S3) can have its limits +- Sometimes, you want to store data in a database… +- You can structure the data +- You build indexes to efficiently query / search through the data +- You define relationships between your datasets +- Databases are optimized for a purpose and come with different features, shapes and constraints +- **Managed Databases**: AWS takes care of maintenance, backups, 
and security for databases. +- **Benefits**: Reduced operational complexity, built-in high availability, disaster recovery, scalability, and enhanced security. +- **Types**: + - **Relational Databases** (SQL) + - **NoSQL Databases** + - **Data Warehousing** + - **In-memory Caching** + +## Relational Databases (SQL) + +- **Structured Data**: Stored in predefined schema tables, managed with SQL. +- **Use Cases**: Transactional applications, financial systems. +- **Examples**: MySQL, PostgreSQL, Oracle, SQL Server, MariaDB. + +## NoSQL Databases + +- **Flexible Schema**: No predefined schema, designed for fast and scalable data storage. +- **Use Cases**: Real-time applications, IoT, mobile apps. +- Benefits: + - Flexibility: easy to evolve data model + - Scalability: designed to scale-out by using distributed clusters + - High-performance: optimized for a specific data model + - Highly functional: types optimized for the data model +- **Examples**: DynamoDB, MongoDB (DocumentDB), Key-value, document, graph, in-memory, search databases + +### NoSQL data example: JSON + +- JSON is a common form of data that fits into a NoSQL model +- Data can be nested +- Fields can change over time +- Support for new types: arrays, etc… + +```json +{ + "name": "Abc", + "age": 30, + "cars": [ + "Ford", + "BMW", + "Fiat" + ], + "address": { + "type": "house", + "number": 23, + "street": "Abc Road" + } +} +``` + +## Databases & Shared Responsibility on AWS + +| **AWS Responsibility** | **Customer Responsibility** | +| ------------------------------------------- | ------------------------------------------------ | +| Infrastructure management, backups, patches | Data security, encryption, access controls (IAM) | +| Availability and failover | Data management, monitoring, performance tuning | + +## AWS RDS Overview + +- **RDS (Relational Database Service)**: Fully managed service for relational databases. + - It’s a managed DB service for databases that use SQL as a query language. 
+ - Supports **MySQL**, **PostgreSQL**, **MariaDB**, **Oracle**, **SQL Server**. + - Handles **backup**, **patching**, **high availability** (Multi-AZ), and **scaling**. + +### Advantage over using RDS versus deploying DB on EC2 + +- RDS is a managed service: + - Automated provisioning, OS patching + - Continuous backups and restore to specific timestamp (Point in Time Restore)! + - Monitoring dashboards + - Read replicas for improved read performance + - Multi AZ setup for DR (Disaster Recovery) + - Maintenance windows for upgrades + - Scaling capability (vertical and horizontal) + - Storage backed by EBS (gp2 or io1) +- BUT you can’t SSH into your instances + +### RDS Deployments + +- **Read Replicas**: Improves read performance, **asynchronous** replication. +- **Multi-AZ**: Automatic failover, high availability for production environments. +- **Multi-Region**: Disaster recovery across regions, global availability. + +### RDS Deployments: Read Replicas, Multi-AZ + +| Read Replicas | Multi-AZ | +| ----------------------------------- | ------------------------------------------------- | +| Scale the read workload of your DB | Failover in case of AZ outage (high availability) | +| Can create up to 5 Read Replicas | Data is only read/written to the main database | +| Data is only written to the main DB | Can only have 1 other AZ as failover | + +![Read Replicas Multi-AZ](../images/read_replicas_multi_AZ.png) + +### RDS Deployments: Multi-Region + +- Multi-Region (Read Replicas) + - Disaster recovery in case of region issue + - Local performance for global reads + - Replication cost + +![Multi-Region](../images/multi_region.png) + +## Amazon Aurora + +- **Amazon Aurora**: High-performance RDS database. + - Compatible with **MySQL** and **PostgreSQL**. + - **5x faster** than MySQL, **3x faster** than PostgreSQL. + - **Auto-scaling** storage up to **64 TB**. + - Supports **Multi-AZ** and up to **15 read replicas**. 
+ - Great for **enterprise-grade** applications requiring high availability and performance. + - Aurora costs more than RDS (20% more) – but is more efficient + +## Amazon ElastiCache Overview + +- **ElastiCache**: In-memory data caching service. + - **Redis**: Advanced key-value store with replication and persistence. + - **Memcached**: Simple, memory-only caching service. + - Reduces database load and speeds up applications by **caching frequent queries**. + - Caches are in-memory databases with high performance, low latency + - AWS takes care of OS maintenance / patching, optimizations, setup, configuration, monitoring, failure recovery and backup + +## DynamoDB + +- Fully managed, serverless NoSQL database. +- Supports key-value and document data models. +- Automatically scales based on demand. +- Provides high availability and durability with replication across 3 AZs +- Millions of requests per second, trillions of rows, 100s of TB of storage +- Fast and consistent in performance +- Single-digit millisecond latency – low latency retrieval +- Integrated with IAM for security, authorization and administration +- Low cost and auto scaling capabilities +- Standard & Infrequent Access (IA) Table Class + +### DynamoDB Accelerator (DAX) + +- In-memory caching for DynamoDB. +- **10x faster** read performance: from single-digit millisecond latency to microsecond latency when accessing your DynamoDB tables +- Secure, highly scalable & highly available +- Ideal for use cases where **low-latency reads** are critical. + +### DynamoDB Global Tables + +- Multi-region replication for **global** applications. +- **Low-latency** reads and writes across multiple regions. +- Ensures data availability globally with **multi-master replication**. + +## Redshift Overview + +- Managed data warehousing service. +- Optimized for **online analytical processing (OLAP)** and big data analytics. +- Uses **columnar storage** for fast query performance. 
+- 10x better performance than other data warehouses, scale to PBs of data +- Columnar storage of data (instead of row based) +- Supports integration with **BI tools** (QuickSight, Tableau). +- Massively Parallel Query Execution (MPP), highly available. +- Has a SQL interface for performing the queries. +- Pay-per-query or **reserved instances** for cost savings. +- Designed for **massive datasets**. + +## Amazon EMR (Elastic MapReduce) + +- Managed big data processing service. +- Uses **Hadoop**, **Apache Spark**, and **Hive** for processing large data sets. +- Ideal for **data transformation**, **machine learning**, and **ETL** (Extract, Transform, Load). +- Integration with **S3**, **DynamoDB**, and **Redshift**. +- The clusters can be made of hundreds of EC2 instances +- EMR takes care of all the provisioning and configuration +- Auto-scaling and integrated with Spot instances +- Use cases: data processing, machine learning, web indexing, big data + +## Amazon Athena + +- Serverless query service +- Use **SQL** to query structured and unstructured data stored in **S3**. +- No infrastructure to manage, pay-per-query. +- Supports various formats like **CSV**, **JSON**, **Parquet**, and **ORC**. +- Pricing: $5.00 per TB of data scanned +- Use compressed or columnar data for cost-savings (less scan) +- Use cases: Business intelligence / analytics / reporting, analyze & query VPC Flow Logs, ELB Logs, CloudTrail trails, etc... +- Analyze data in S3 using serverless SQL, use Athena + +## Amazon QuickSight + +- Business Intelligence (BI) tool for data visualization. +- Serverless machine learning-powered business intelligence service to create interactive dashboards +- Fast, automatically scalable, embeddable, with per-session pricing +- Supports data from S3, Redshift, RDS, and other AWS data sources. +- **Pay-per-session** pricing model for cost efficiency. 
+- Use cases: + - Business analytics + - Building visualizations + - Perform ad-hoc analysis + - Get business insights using data + +## DocumentDB (with MongoDB Compatibility) + +- Managed document database, **MongoDB-compatible**. +- DocumentDB is to MongoDB (which is a NoSQL database) what Aurora is to MySQL and PostgreSQL +- Highly scalable and durable with **Multi-AZ**. +- Built for **JSON** document storage. +- DocumentDB storage automatically grows in increments of 10 GB, up to 64 TB. +- Automatically scales to workloads with millions of requests per second +- Use cases: Content management, cataloging, and mobile backends. + +## Amazon Neptune + +- Fully managed graph database +- A popular graph dataset would be a social network + - Users have friends + - Posts have comments + - Comments have likes from users + - Users share and like posts… +- Highly available across 3 AZs, with up to 15 read replicas +- Build and run applications working with highly connected datasets – optimized for these complex and hard queries +- Can store up to billions of relations and query the graph with millisecond latency +- Highly available with replication across multiple AZs +- Great for knowledge graphs (Wikipedia), fraud detection, recommendation engines, social networking + +## Amazon QLDB + +- QLDB stands for “Quantum Ledger Database” +- A ledger is a book **recording financial transactions** +- Fully managed, serverless, highly available, replication across 3 AZs +- Used to **review history of all the changes made to your application data** over time +- **Immutable** system: no entry can be removed or modified, cryptographically verifiable +- 2-3x better performance than common ledger blockchain frameworks, manipulate data using SQL +- Difference with Amazon Managed Blockchain: no decentralization component, in accordance with financial regulation rules + +## Amazon Managed Blockchain + +- Blockchain makes it possible to build applications where multiple parties can execute transactions without the need for 
a trusted, central authority. +- Amazon Managed Blockchain is a managed service to: + - Join public blockchain networks + - Or create your own scalable private network +- Compatible with the frameworks Hyperledger Fabric & Ethereum + +## AWS Glue + +- Managed extract, transform, and load (ETL) service +- Useful to prepare and transform data for analytics +- Fully serverless service +- Glue Data Catalog: catalog of datasets + - can be used by Athena, Redshift, EMR + +## DMS - Database Migration Service + +- Quickly and securely migrate databases to AWS, resilient, self healing +- The source database remains available during the migration +- Supports: + - Homogeneous migrations: ex Oracle to Oracle + - Heterogeneous migrations: ex Microsoft SQL Server to Aurora + +## Databases & Analytics Summary + +- Relational Databases - OLTP: RDS & Aurora (SQL) +- Differences between Multi-AZ, Read Replicas, Multi-Region +- In-memory Database: ElastiCache +- Key/Value Database: DynamoDB (serverless) & DAX (cache for DynamoDB) +- Warehouse - OLAP: Redshift (SQL) +- Hadoop Cluster: EMR +- Athena: query data on Amazon S3 (serverless & SQL) +- QuickSight: dashboards on your data (serverless) +- DocumentDB: “Aurora for MongoDB” (JSON – NoSQL database) +- Amazon QLDB: Financial Transactions Ledger (immutable journal, cryptographically verifiable) +- Amazon Managed Blockchain: managed Hyperledger Fabric & Ethereum blockchains +- Glue: Managed ETL (Extract Transform Load) and Data Catalog service +- Database Migration: DMS +- Neptune: graph database diff --git a/sections/elb_asg.md b/sections/elb_asg.md index e852cae..e0fef2b 100644 --- a/sections/elb_asg.md +++ b/sections/elb_asg.md @@ -57,11 +57,11 @@ ## Scalability vs Elasticity (vs Agility) -| **Term** | **Definition** | -|--------------------|--------------------------------------------------------------------------------------------------| -| **Scalability** | Ability to increase or decrease the capacity to handle varying levels of 
traffic or load. | -| **Elasticity** | Automatically adjusts resources up or down based on the load in real-time, preventing under or over-provisioning. | -| **Agility** | The ability to deploy and manage resources quickly and efficiently in response to changing demands. | +| **Term** | **Definition** | +| --------------- | ----------------------------------------------------------------------------------------------------------------- | +| **Scalability** | Ability to increase or decrease the capacity to handle varying levels of traffic or load. | +| **Elasticity** | Automatically adjusts resources up or down based on the load in real-time, preventing under or over-provisioning. | +| **Agility** | The ability to deploy and manage resources quickly and efficiently in response to changing demands. | ## What is Load Balancing? diff --git a/sections/iam.md b/sections/iam.md index 35f773f..72e74f1 100644 --- a/sections/iam.md +++ b/sections/iam.md @@ -38,6 +38,7 @@ - **Users**: Represent individual identities that interact with AWS services. Users have unique credentials (username, password, access keys). - **Groups**: Logical grouping of users to simplify permission management. - Permissions assigned to a group are automatically inherited by its users. +- Flexibility in user management: in IAM, users do not have to belong to a group, and a user can belong to multiple groups. This allows access permissions to be managed in a granular and efficient manner. For example, a user could belong to both the “QAs” group and the “Developers” group, inheriting permissions from both. | **IAM Users** | **IAM Groups** | |------------------------------------------------------------|----------------------------------------------------------| @@ -55,6 +56,8 @@ ### IAM Policies Inheritance +![IAM Policies Inheritance](../images/IAM_Policies_inheritance.png) + - Policies are evaluated together for a user, including: - **Directly attached policies**. - **Group policies**. 
diff --git a/sections/other_compute.md b/sections/other_compute.md new file mode 100644 index 0000000..bb5b27e --- /dev/null +++ b/sections/other_compute.md @@ -0,0 +1,192 @@ +# Other Compute + +## What is Docker? + +- Docker is a software development platform to deploy apps +- Apps are packaged in containers that can be run on any OS +- Apps run the same, regardless of where they’re run + - Any machine + - No compatibility issues + - Predictable behavior + - Less work + - Easier to maintain and deploy + - Works with any language, any OS, any technology +- Scale containers up and down very quickly (seconds) + +### Where are Docker images stored? + +- **Docker Hub**: Centralized public repository for storing Docker images. +- Public: Docker Hub +- Private: **Amazon ECR (Elastic Container Registry)**: AWS service for storing, managing, and deploying container images. + +### Docker versus Virtual Machines + +- Docker is ”sort of” a virtualization technology, but not exactly +- Resources are shared with the host => many containers on one server + +| **Docker Containers** | **Virtual Machines (VMs)** | +| -------------------------------------- | ----------------------------------------- | +| Lightweight, shares the host OS kernel | Heavier, includes full OS | +| Starts in seconds | Slower startup (minutes) | +| Portable, fast scaling | Not as portable, more resource-intensive | +| Best for microservices & modern apps | Best for running multiple OS environments | + +## ECS (Elastic Container Service) + +- Fully managed container orchestration service. +- Supports Docker containers. +- Launch Docker containers on AWS +- AWS takes care of starting / stopping containers +- **Two launch modes**: **EC2** (self-managed instances) and **Fargate** (serverless). +- Provides integration with IAM, VPC, ELB, and ECR. + +## Fargate + +- Serverless compute engine for containers, works with ECS and EKS. +- No need to manage EC2 instances. +- Pay for resources used (vCPU and memory). 
+- AWS just runs containers for you based on the CPU / RAM you need + +## ECR (Elastic Container Registry) + +- Fully managed Docker container registry. +- Stores, manages, and secures Docker images. +- Integrated with **ECS**, **EKS**, and **Fargate** for easy deployment. +- This is where you store your Docker images so they can be run by ECS or Fargate + +## What’s Serverless? + +- No need to provision, scale, or manage servers. +- Resources are automatically provisioned and scaled by AWS. +- Serverless is a new paradigm in which the developers don’t have to manage servers anymore… +- They just deploy code +- They just deploy… functions ! +- Initially... Serverless == FaaS (Function as a Service) +- Serverless was pioneered by AWS Lambda but now also includes anything that’s managed: “databases, messaging, storage, etc.” +- Serverless does not mean there are no servers… +- it means you just don’t manage / provision / see them +- Ideal for event-driven and stateless applications. + +## Why AWS Lambda? + +- Serverless compute service to run code without managing infrastructure. +- Executes code in response to events (e.g., API calls, file uploads). +- Scales automatically and you only pay for usage. + +| EC2 | Lambda | +| -------------------------------------------------- | ----------------------------------------- | +| Virtual Servers in the Cloud | Virtual functions – no servers to manage! | +| Limited by RAM and CPU | Limited by time - short executions | +| Continuously running | Run on-demand | +| Scaling means intervention to add / remove servers | Scaling is automated! | + +### Benefits of AWS Lambda + +- **No server management**: AWS handles the infrastructure. +- **Automatic scaling**: Scales based on event triggers. +- **Flexible scaling**: Runs from a few requests per day to thousands per second. +- **Event-driven architecture**: Ideal for apps that need to respond to events. 
+- Easy Pricing: + - Pay per request and compute time + - Free tier of 1,000,000 AWS Lambda requests and 400,000 GB-seconds of compute time +- Integrated with the whole AWS suite of services +- Event-Driven: functions get invoked by AWS when needed +- Integrated with many programming languages +- Easy monitoring through AWS CloudWatch +- Easy to get more resources per function (up to 10 GB of RAM!) +- Increasing RAM will also improve CPU and network! + +### AWS Lambda Language Support + +- Node.js +- Python +- Ruby +- Java +- Go +- .NET Core +- Custom runtime via the Lambda Runtime API (community supported, example Rust) +- Lambda Container Image + - The container image must implement the Lambda Runtime API + - ECS / Fargate is preferred for running arbitrary Docker images + +### AWS Lambda Pricing: Example + +- Based on number of requests and execution time. +- You can find overall pricing information here: +- First **1 million requests/month** are free. +- After that, **$0.20 per million requests**. +- **Execution duration**: $0.00001667 for every GB-second used (first 400,000 GB-seconds free per month). +- Pay per duration (in increments of 1 ms): + - 400,000 GB-seconds of compute time per month for FREE + - == 400,000 seconds if function is 1 GB RAM + - == 3,200,000 seconds if function is 128 MB RAM + - After that $1.00 for 60,000 GB-seconds (60,000 × $0.00001667 ≈ $1.00) +- It is usually **very cheap** to run AWS Lambda so it’s **very popular** + +## Amazon API Gateway + +- Managed service for creating, publishing, and monitoring REST, HTTP, and WebSocket APIs. +- Integrates with AWS Lambda for fully serverless APIs. +- Serverless and scalable +- Support for security, user authentication, API throttling, API keys, monitoring. +- **Throttling**, **caching**, and **authorization** features built-in. + +## AWS Batch + +- Fully managed service for running batch processing workloads. +- Dynamically provisions compute resources based on job requirements. 
+- Suitable for large-scale data processing, such as machine learning and rendering tasks. +- Efficiently run 100,000s of computing batch jobs on AWS +- A “batch” job is a job with a start and an end (opposed to continuous) +- Batch will dynamically launch EC2 instances or Spot Instances +- AWS Batch provisions the right amount of compute / memory +- You submit or schedule batch jobs and AWS Batch does the rest! +- Batch jobs are defined as Docker images and run on ECS +- Helpful for cost optimizations and focusing less on the infrastructure + +## Batch vs Lambda + +| **AWS Batch** | **AWS Lambda** | +| ------------------------------------------- | ------------------------------------------ | +| Designed for **batch processing** | Designed for **event-driven** architecture | +| Handles large-scale compute jobs | Executes short-lived functions | +| Custom EC2 instances or Fargate tasks | Fully serverless, no server management | +| Jobs may take minutes to hours | Max execution time of 15 minutes | +| Rely on EBS / instance store for disk space | Limited temporary disk space | + +## Amazon Lightsail + +- Virtual servers, storage, databases, and networking +- Low & predictable pricing +- Simpler alternative to using EC2, RDS, ELB, EBS, Route 53… +- Great for people with little cloud experience! 
+- Can set up notifications and monitoring of your Lightsail resources +- Use cases: + - Simple web applications (has templates for LAMP, Nginx, MEAN, Node.js…) + - Websites (templates for WordPress, Magento, Plesk, Joomla) + - Dev / Test environment +- Has high availability but no auto-scaling, limited AWS integrations + +## Lambda Summary + +- Lambda is Serverless, Function as a Service, seamless scaling, reactive +- Lambda Billing: + - By the time run multiplied by the RAM provisioned + - By the number of invocations +- Language Support: many programming languages except (arbitrary) Docker +- Invocation time: up to 15 minutes +- Use cases: + - Create Thumbnails for images uploaded onto S3 + - Run a Serverless cron job +- API Gateway: expose Lambda functions as HTTP API + +## Other Compute Summary + +- Docker: container technology to run applications +- ECS: run Docker containers on EC2 instances +- Fargate: +- Run Docker containers without provisioning the infrastructure +- Serverless offering (no EC2 instances) +- ECR: Private Docker Images Repository +- Batch: run batch jobs on AWS across managed EC2 instances +- Lightsail: predictable & low pricing for simple application & DB stacks diff --git a/sections/s3.md b/sections/s3.md index 622508f..533e2e5 100644 --- a/sections/s3.md +++ b/sections/s3.md @@ -421,4 +421,3 @@ - Snow Family: import data onto S3 through a physical device, edge computing - OpsHub: desktop application to manage Snow Family devices - Storage Gateway: hybrid solution to extend on-premises storage to S3 -