Homework 3: Architectural Changes and Merra Service
Add the following entry to /etc/hosts:

```
sudo vi /etc/hosts
```

```
149.165.155.17 elves.weatherapp.com
```
Access the application using the link: Elves Weather App
Access the Swagger-UI for the gateway: Gateway Swagger
Access the Swagger-UI for the registry: Registry Swagger
Nexrad: Data is available for almost all the listed stations up to the current date.
Merra: Data is available up to 01/31/2022.
Following are the key changes implemented in this iteration:
- Redesign of the registry database schema.
- Implementing a multi-level cache to store converted data.
- CI/CD implementation.
In the previous version, events and requests were coupled, which forced unrelated information into a single table. This made it difficult to distinguish the events generated by the user from the requests. Further, a single request (requestId) goes through multiple states during its lifetime, whereas an eventId is unique per event. This version of the database also made it hard to keep track of a user's search history.
As part of this redesign, the registry database was simplified. The considerations are as follows:
- Requests and events are separated into individual tables.
- The Requests table keeps track of a requestId, the search parameters, and the data source, along with the status and result details.
- The Events table keeps track of only an eventId, the name of the event, and its time of occurrence.
- Both requests and events are tagged to the user who generated them via userId.
- Two new tables, NexradData and MeraData, have been added; they hold the current status of data availability in the local object store.
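For illustration, here is a minimal sketch of the two separated tables as SQLAlchemy models. The exact column types, and any field not named above, are assumptions rather than the actual registry code:

```python
from sqlalchemy import Column, DateTime, String
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class Request(Base):
    __tablename__ = "requests"
    request_id = Column(String, primary_key=True)
    user_id = Column(String, index=True)         # tags the request to the user
    data_source = Column(String)                 # e.g. "nexrad" or "merra"
    search_params = Column(String)               # serialized search parameters
    status = Column(String)                      # a request moves through several states
    result_details = Column(String)

class Event(Base):
    __tablename__ = "events"
    event_id = Column(String, primary_key=True)  # unique per event
    user_id = Column(String, index=True)         # tags the event to the user
    name = Column(String)                        # name of the event
    occurred_at = Column(DateTime)               # time of occurrence
```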
Following is the database schema:
We have implemented a two-level cache to improve the response time of requests. The cache stores the converted data files from both Nexrad and Merra.
The two levels of cache are:
- L1-Cache
- L2-Cache
The Nexrad data is downloaded for a time range and a particular radar station. Each request results in multiple file downloads, which are then converted and used to plot the graphs. The converted files are cached in the local S3 and tagged in the registry against the specified time range and radar station. When a new request arrives for the same radar station and an overlapping time range, the data for the overlapping portion is downloaded from the local S3 and the remaining data (if any) is downloaded from the NEXRAD AWS S3 bucket.
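A rough sketch of that overlap logic, assuming the registry returns the cached intervals as (start, end) pairs; split_request is a hypothetical helper, not the actual ingestor code:

```python
from datetime import datetime

def split_request(start: datetime, end: datetime,
                  cached: list[tuple[datetime, datetime]]):
    """Split a requested time range into sub-ranges served from the local S3
    (hits) and gaps that must still be fetched from the NEXRAD AWS S3 bucket."""
    hits, gaps, cursor = [], [], start
    for c_start, c_end in sorted(cached):
        if c_end <= cursor or c_start >= end:
            continue                            # no overlap with what is still needed
        if c_start > cursor:
            gaps.append((cursor, c_start))      # uncached gap before this interval
        hits.append((max(c_start, cursor), min(c_end, end)))
        cursor = min(c_end, end)
    if cursor < end:
        gaps.append((cursor, end))              # trailing uncached gap
    return hits, gaps
```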
The Merra data is downloaded for a date range and a particular variable (one file per day). Each request results in the download of as many files as there are days in the requested date range. Each downloaded file is converted, uploaded to S3, and tagged in the registry against its date and variable. When a new request arrives, the cache is looked up for each date in the range and any available files are downloaded from it; the remaining files are downloaded and converted from the Merra source.
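The per-date lookup can be sketched as follows; registry_client and its lookup method are hypothetical stand-ins for the registry API:

```python
from datetime import date, timedelta

def resolve_merra_files(variable: str, start: date, end: date, registry_client):
    """For each day in the range, use the cached file if the registry has one;
    otherwise mark the day for a fresh download and conversion from Merra."""
    cached, missing = [], []
    day = start
    while day <= end:
        entry = registry_client.lookup(source="merra", variable=variable, date=day)
        if entry is not None:
            cached.append(entry.s3_key)   # converted file already in the local S3
        else:
            missing.append(day)           # must be downloaded and converted
        day += timedelta(days=1)
    return cached, missing
```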
The L1 cache stores recently downloaded files. It implements an LRU policy based on a fixed file count, which can be configured through the 'l1_cache_capacity' environment variable.
- When a request arrives, check whether the file is present in the cache for the requested dates. If it is, its last access time is refreshed and the file is not deleted immediately.
- At the end of the request, update the cache and remove items according to the policy: if the current file count exceeds the configured capacity, the least recently used files are removed from the cache.
Mutexes are used to ensure that data remains available when two requests are in the update (1) and delete (2) phases respectively. This guarantees that a file is still available in the cache when one process is accessing it while another process has scheduled it for deletion, as sketched below.
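A minimal sketch of this behaviour, assuming the files are tracked in an in-memory map guarded by a single mutex; the L1Cache class and its method names are hypothetical:

```python
import os
import threading
import time

class L1Cache:
    """Count-bounded LRU cache over recently downloaded files."""

    def __init__(self):
        self.capacity = int(os.environ.get("l1_cache_capacity", "100"))
        self.last_access: dict[str, float] = {}  # file path -> last access time
        self.lock = threading.Lock()             # guards update vs. delete phases

    def touch(self, path: str) -> bool:
        """Refresh the last access time; returns True on a cache hit."""
        with self.lock:
            if path in self.last_access:
                self.last_access[path] = time.time()
                return True
            return False

    def add(self, path: str) -> None:
        with self.lock:
            self.last_access[path] = time.time()

    def evict(self) -> None:
        """Called at the end of a request: drop LRU files above the fixed count."""
        with self.lock:
            excess = len(self.last_access) - self.capacity
            if excess <= 0:
                return
            by_age = sorted(self.last_access.items(), key=lambda kv: kv[1])
            for path, _ in by_age[:excess]:
                del self.last_access[path]
                os.remove(path)                  # safe: no reader holds the lock now
```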
Only if the file is not found in the L1 cache is it looked up in the L2 cache.
L2-Cache is implemented using a combination of a local S3-like object store, the registry microservice, and Redis.
- Local S3-like object store: We have deployed an S3-like object store that stores the converted files reliably for a longer duration and makes them available across all instances in the system.
- Registry microservice: The registry microservice persists metadata about the files stored in the local object store. APIs are implemented to query and update the current state of the cache.
- Redis: A centrally hosted Redis instance is used to synchronize cache updates and cache invalidation across all instances.
L2-Cache is implemented as an LRU cache. Each time a file is accessed (queried or downloaded), its last access time is updated in the registry. When the cache capacity is reached, the contents of the cache are sorted in ascending order of last access time and the oldest entries are deleted to make room for new ones, as sketched below.
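A sketch of that eviction step; registry_client and object_store, along with their methods, are hypothetical stand-ins for the registry API and the local object store:

```python
def evict_l2(registry_client, object_store, capacity: int) -> None:
    """Delete the least recently used converted files from the local object
    store (and their registry metadata) until the cache fits its capacity."""
    entries = registry_client.list_cache_entries()    # each entry carries last_access
    excess = len(entries) - capacity
    if excess <= 0:
        return
    entries.sort(key=lambda e: e.last_access)         # ascending: oldest first
    for entry in entries[:excess]:
        object_store.delete(entry.s3_key)             # remove the converted file
        registry_client.delete_cache_entry(entry.id)  # drop its metadata
```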
The L2-Cache is accessed by multiple instances of the ingestor microservice. When overlapping requests are received at the same time by different instances, work may be duplicated, i.e., the same data can be downloaded and converted multiple times. It may also corrupt the cache. Hence, it is important to ensure synchronized access to the cache, especially when writing to or deleting from it.
Following are the considerations in designing the L2-Cache:
- When there are duplicate/overlapping requests, only one worker (consumer) must update the cache (i.e., download, convert, and upload to S3).
- If a similar request is under process anywhere across the system, the worker (consumer) must wait for it to conclude.
- Every worker must specify an estimated time to complete the request (i.e., download, convert, and upload to S3), beyond which another worker must be allowed to process the data.
Synchronized access to the cache is achieved using a Redis-based distributed locking algorithm called Redlock. Updates to the cache (writes and deletes) are treated as a critical section. The Redlock algorithm ensures that only one worker (among multiple threads of the same node or across multiple nodes) acquires the lock. For a given piece of data, we generate a lock name that is unique to the requested data; this ensures that even across multiple nodes, requests for the same data contend for the same lock. Redlock guarantees that only one thread gets the lock and hence gets to process the data and update the cache. After it completes, the other threads find the data available in the cache and therefore do not trigger a fresh download.
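A simplified sketch of this flow using redis-py's built-in lock against the single central Redis (the full Redlock algorithm generalizes this across several Redis nodes); the lock-name format and the read_from_cache/process_and_cache helpers are hypothetical:

```python
import redis

r = redis.Redis(host="redis-host", port=6379)  # the centrally hosted Redis

def fetch_with_lock(source, key, ttl_seconds, read_from_cache, process_and_cache):
    """Only one worker across all instances processes a given piece of data;
    the others wait on the same lock and then read the result from the cache."""
    lock_name = f"l2:{source}:{key}"           # same data -> same lock on every node
    # Acquisition blocks while a similar request is in flight elsewhere;
    # timeout is the estimated processing time, after which the lock expires
    # so that another worker is allowed to process the data.
    with r.lock(lock_name, timeout=ttl_seconds):
        cached = read_from_cache(source, key)
        if cached is not None:                 # another worker already did the work
            return cached
        return process_and_cache(source, key)  # download, convert, upload to S3
```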
The following diagram illustrates the lock-acquisition process.