AAS deployment (#117)

* Broker updates (#108) * Updates to docker setup * testing adding event harvesters * Updated parsing of events to be strict to align with marshmallow version 3 * removed unnecesary debug message * Added keywords to ads harvester * Updated to Python 3.8 and latest invenio packages * Added metadata endpoint for metadata query * Updated Pipfile.lock and install using it * Updated metadata queries to ignore cases for journals and allow simple query strings on keywords * Fixed mistake * Added automatic harvesting after event is finished * Added error monitoring for harvesters and events * Skip retrying tasks and publish error * Fixed simple query syntax * Fixed bug with extra definition of query filter * Fixed bug where target scheme was overwritten due to duplicate variable names * Updated error message with payload on failed validation * Moved error handling to cover the full harvesting process * Moved error handling to cover the full harvesting process * Readded retries on failed harvest and event tasks * Added harvest taks monitoring * Updated error catching on DOI harvester * Updated harvester monitoring to add which harvester has been run * Added error monitoring on updating metadata * Added error monitoring on updating metadata * Added aggregations on metadata search on amount of softwares cited * Updated the reruns to have a progressive increase in time for each retry * Added slackclient to write status reports every week * Moved error catching to cli so the api fucntions transfer the error to the harvesters * Updated Piplock file * Updated metadata search to only show aggregated Target results Co-authored-by: Mattias Wångblad <[email protected]> * Added vscode files to gitignore * Added azure dev docker compose for lets encrypt ssl * Updated path for letsencrypt to account for symlinks * Updated Azure Dev HAproxy config to enable Letsencrypt SSL certificate * Added additional docker compose file for Azure dev instance to enable SSL through Lets Encrypt (#110) * Added azure dev docker compose for lets encrypt ssl * Updated path for letsencrypt to account for symlinks * Updated Azure Dev HAproxy config to enable Letsencrypt SSL certificate * Initial compose file * Broker updates2 (#111) * Upgraded to elasticsearch v7 * Updated formatting for slack monitoring messages * Added a github harvester * Updated pagination for metadata search * Added mappings for es7 * Fixed parsing bugs in github harvester * Fixed issues with adding github relations in zeondo harvester * Made sure to order the metadata aggregations correctly on count * Updated slack notifications to send to correct channel * Fixed 2 more bugs on github parsing of github urls * Fixed bug where the providers where dictionaried twice * Added check to ensure it won't fail if github tags doesn't exists * Updated payload on error monitoring to always be with the identifiers * Fixed bug mixing up source and target in github relations * Added possibility to have a github token for more queries per hour * Increased rate limit for logined users to allow more events per hour * Fixed bug setting github api token * Fixed bug setting github api token * Fixed bug setting github api token * Added message in case there is nothing to report on monitoring * Addec CLI commands to trigger monitor report * Addded unique key requirement on groupm2m subgroup_id column * Fixed bug on monitoring reporting * Fixed another montioring report bug on getting the errors to dict * Split the error monitoring into chunks because of Slack limitations * Added payload to error report * Fixed typo in error report * Fixed bug where github url ending with / would cause error * Added rollbacks to make sure event reporting is fixed * Fixed bug on rolling back DB errors * Fixed bug which caused monitoring message to fail on some occasions * Added CLi to rerun failed events * Added possibility to rerun specific id * Reomved erorrneus import * Fixed typo * Made ID optional in event rerun CLI * Added rerun CLI to harvester * Fixed bug on parsing github release id * Updated names of harvesters to be able to rerun them from CLI * Added the unique key on subgroup in groupm2m Co-authored-by: Mattias Wångblad <[email protected]> * Changed over to production SSL certificates * Inital Docker Compose production script * Keeping all networking in links rather than networks * typo fix * Changed networking for "elastic" and "bridge" * Troubleshooting elasticsearch cluster * es debug * es debug * es docker debug * es debug * es debug * es debug * es debug * Fixed networking for es * Aas deployment (#112) * Initial compose file * Changed over to production SSL certificates * Inital Docker Compose production script * Keeping all networking in links rather than networks * typo fix * Changed networking for "elastic" and "bridge" * Troubleshooting elasticsearch cluster * es debug * es debug * es docker debug * es debug * es debug * es debug * es debug * Fixed networking for es * Added keyword and journal filter to relation search * Added monitoring and harvesting to documentation and fixed some typos on the naming and comments * Set specific version of rabbitmq and increased timeout on acknowledging to 1 week to not imeout on retries * Added memory limits to rabbitmq * Broker updates (#115) * Removed retry of tasks after 1 week since that could fill up rabbitmq channels * Aas deployment (#114) * Initial compose file * Changed over to production SSL certificates * Inital Docker Compose production script * Keeping all networking in links rather than networks * typo fix * Changed networking for "elastic" and "bridge" * Troubleshooting elasticsearch cluster * es debug * es debug * es docker debug * es debug * es debug * es debug * es debug * Fixed networking for es * Added memory limits to rabbitmq * Added a 1s sleep on harvesters to avoid query overload * Changed retry of tasks to only once after 10min and added cronjob to rerun last two days of errors every night * Possibility to only rerun errors between timestamps on CLI * Added link between error_monitoring, event and harvester_event to avoid having double creation of errors * Fixed inconsistency in import * Added check on github link on event parsing * Allow null values of LicenseUrl on event json parsing * Lowered acknowledge time to 30min * Refactored to avoid circular references * Removed unused imports * Refactored to avoid circular references when checking github events * Another move to avoid circular references * Fixed bug pointing to wrong rerun function * Small fixes on the github validation * Changed error reporting to only take errors that have not successfully been rerun Co-authored-by: mattiasw <[email protected]> * Broker updates (#116) * Removed retry of tasks after 1 week since that could fill up rabbitmq channels * Aas deployment (#114) * Initial compose file * Changed over to production SSL certificates * Inital Docker Compose production script * Keeping all networking in links rather than networks * typo fix * Changed networking for "elastic" and "bridge" * Troubleshooting elasticsearch cluster * es debug * es debug * es docker debug * es debug * es debug * es debug * es debug * Fixed networking for es * Added memory limits to rabbitmq * Added a 1s sleep on harvesters to avoid query overload * Changed retry of tasks to only once after 10min and added cronjob to rerun last two days of errors every night * Possibility to only rerun errors between timestamps on CLI * Added link between error_monitoring, event and harvester_event to avoid having double creation of errors * Fixed inconsistency in import * Added check on github link on event parsing * Allow null values of LicenseUrl on event json parsing * Lowered acknowledge time to 30min * Refactored to avoid circular references * Removed unused imports * Refactored to avoid circular references when checking github events * Another move to avoid circular references * Fixed bug pointing to wrong rerun function * Small fixes on the github validation * Changed error reporting to only take errors that have not successfully been rerun * Make sure that the error message aren't larger than slack limits * Added unique constraint on groupm2m subgroup since it's never intended to have multiple of them Co-authored-by: mattiasw <[email protected]> * Update AUTHORS.rst Added new authors * Update Dockerfile Adding Ignore pipfile to only use Pipfile.lock for docker install * Added ignore pipfile to dockerfile * Added flask version to Pipfile * Added specific Invenio versions to ensure compatibility Co-authored-by: mattiaswangblad <[email protected]> Co-authored-by: Mattias Wångblad <[email protected]> Co-authored-by: mattiasw <[email protected]>
asclepias · Mar 4, 2022 · afcf095 · afcf095
1 parent 7fa422a
commit afcf095
Show file tree

Hide file tree

Showing 61 changed files with 4,175 additions and 992 deletions.
diff --git a/.gitignore b/.gitignore
@@ -60,3 +60,6 @@ target/
 
 # Vim swapfiles
 .*.sw?
+
+# VScode Docs
+.vscode/
diff --git a/AUTHORS.rst b/AUTHORS.rst
@@ -11,3 +11,6 @@ Authors
 - Chiara Bigarella
 - Krzysztof Nowak
 - Thomas P. Robitaille
+- Mattias Wångblad
+- Mubdi Rahman
+
diff --git a/Dockerfile b/Dockerfile
@@ -4,16 +4,16 @@
 #
 # Asclepias Broker is free software; you can redistribute it and/or modify it
 # under the terms of the MIT License; see LICENSE file for more details.
-FROM inveniosoftware/centos7-python:3.6
+FROM python:3.8
 
 RUN mkdir /app
 WORKDIR /app
 
 COPY Pipfile Pipfile
 COPY Pipfile.lock Pipfile.lock
-RUN pipenv install --deploy --system
-
+RUN pip install pipenv
+RUN pipenv install --deploy --system --ignore-pipfile
 ADD . /app
-RUN pip install .
+RUN pipenv install . --ignore-pipfile
 
 CMD ["asclepias-broker", "shell"]
diff --git a/Pipfile b/Pipfile
@@ -4,23 +4,31 @@ verify_ssl = true
 name = "pypi"
 
 [packages]
-celery = "<4.3"
+celery = "==5.1.2"
 idutils = "*"
-invenio = {version = ">=3.1.0,<3.2.0",extras = ["elasticsearch6", "postgresql"]}
-invenio-accounts = ">=1.1.1,<1.2.0"
-invenio-indexer = ">=1.0.1,<1.1.0"
-invenio-logging = ">=1.1.0,<1.2.0"
-invenio-oauth2server = ">=1.0.3,<1.1.0"
-invenio-queues = "==1.0.0a1"
-invenio-records-rest = ">=1.4.0,<1.5.0"
-invenio-rest = ">=1.0.0,<1.1.0"
+invenio = {version = "==3.4.1",extras = ["elasticsearch7", "postgresql"]}
+invenio-app = "==1.3.3"
+invenio-base = "==1.2.5"
+invenio-celery = "==1.2.3"
+invenio-accounts = "==1.4.9"
+invenio-indexer = ">=1.2.1"
+invenio-logging = "==1.3.0"
+invenio-oauth2server = "==1.3.4"
+invenio-queues = "==1.0.0a3"
+invenio-records-rest = "==1.9.0"
+invenio-search = "==1.4.2"
+invenio-rest = ">=1.2.3"
+invenio-theme = "==1.3.10"
+flask = "==2.0.2"
 jsonschema = "*"
-marshmallow = ">=2.0.0,<3.0.0"
+marshmallow = ">=3.0.0,<4.0.0"
 raven = {version = "*",extras = ["flask"]}
 requests = "*"
-uwsgi = "*"
+uwsgi = ">= 2.0.19"
 uwsgi-tools = "*"
 uwsgitop = "*"
+ipython = "*"
+slackclient = "*"
 
 [dev-packages]
 check-manifest = "*"
@@ -43,7 +51,7 @@ sphinxcontrib-httpdomain = "*"
 sphinxcontrib-httpexample = "*"
 
 [requires]
-python_version = "3.6"
+python_version = "3.8"
 
 [scripts]
 test = "python setup.py test"

diff --git a/Pipfile.lock b/Pipfile.lock
diff --git a/asclepias_broker/config.py b/asclepias_broker/config.py
@@ -21,9 +21,10 @@
 from invenio_records_rest.facets import range_filter, terms_filter
 from invenio_records_rest.utils import deny_all
 from invenio_search.api import RecordsSearch
+from celery.schedules import crontab
 
 from .search.query import enum_term_filter, nested_range_filter, \
-    nested_terms_filter
+    nested_terms_filter, simple_query_string_filter
 
 
 def _parse_env_bool(var_name, default=None):
@@ -125,8 +126,20 @@ def _parse_env_bool(var_name, default=None):
         'schedule': timedelta(minutes=60),
     },
     'reindex': {
-        'task': 'asclepias_broker.tasks.reindex_all_relationships',
-        'schedule': timedelta(hours=24)
+        'task': 'asclepias_broker.search.tasks.reindex_all_relationships',
+        'schedule':  crontab(hour=23, minute=0)
+    },
+    'notify': {
+        'task': 'asclepias_broker.monitoring.tasks.sendMonitoringReport',
+        'schedule':  crontab(hour=0, minute=0, day_of_week=0)
+    },
+    'rerun_harvester_errors': {
+        'task': 'asclepias_broker.monitoring.tasks.rerun_harvest_errors',
+        'schedule':  crontab(hour=22, minute=0)
+    },
+    'rerun_event_errors': {
+        'task': 'asclepias_broker.monitoring.tasks.rerun_event_errors',
+        'schedule':  crontab(hour=21, minute=0)
     },
 }
 
@@ -167,6 +180,36 @@ def _parse_env_bool(var_name, default=None):
         max_result_window=10000,
         error_handlers=dict(),
     ),
+    meta=dict(
+        pid_type='meta',
+        pid_minter='relid',
+        pid_fetcher='relid',
+        search_class=RecordsSearch,
+        indexer_class=None,
+        search_index='relationships',
+        search_type=None,
+        search_factory_imp='asclepias_broker.search.query.meta_search_factory',
+        # Only the List GET view is available
+        create_permission_factory_imp=deny_all,
+        delete_permission_factory_imp=deny_all,
+        update_permission_factory_imp=deny_all,
+        read_permission_factory_imp=deny_all,
+        links_factory_imp=lambda p, **_: None,
+        record_serializers={
+            'application/json': ('invenio_records_rest.serializers'
+                                 ':json_v1_response'),
+        },
+        # TODO: Implement marshmallow serializers
+        search_serializers={
+            'application/json': ('invenio_records_rest.serializers'
+                                 ':json_v1_search'),
+        },
+        list_route='/metadata',
+        item_route='/metadata/<pid(meta):pid_value>',
+        default_media_type='application/json',
+        max_result_window=10000,
+        error_handlers=dict(),
+    ),
 )
 RECORDS_REST_FACETS = dict(
     relationships=dict(
@@ -205,12 +248,41 @@ def _parse_env_bool(var_name, default=None):
                     'isRelatedTo': 'IsRelatedTo'
                 }
             ),
+            keyword=simple_query_string_filter('Source.Keywords_all'),
+            journal=nested_terms_filter('Source.Publisher.Name','Source.Publisher'),
         ),
         post_filters=dict(
             type=terms_filter('Source.Type.Name'),
             publication_year=range_filter(
-                'Source.PublicationDate', format='yyyy', end_date_math='/y'),
+                'Source.PublicationDate', format='yyyy', start_date_math='/y', end_date_math='/y'),
         )
+    ),
+
+    # The topHits agg can't be addded here due to limitations in elasticsearch_dsl aggs function so that is added in query.py
+    metadata=dict(
+        aggs=dict(
+            NumberOfTargets=dict(
+                cardinality=dict(field='Target.ID')
+            ),
+            publication_year=dict(
+                date_histogram=dict(
+                    field='Source.PublicationDate',
+                    interval='year',
+                    format='yyyy',
+                ),
+            ),
+        ),
+        filters=dict(
+            group_by=enum_term_filter(
+                label='group_by',
+                field='Grouping',
+                choices={'identity': 'identity', 'version': 'version'}
+            ),
+            keyword=simple_query_string_filter('Source.Keywords_all'),
+            journal=nested_terms_filter('Source.Publisher.Name','Source.Publisher'),
+            publication_year=range_filter(
+                'Source.PublicationDate', format='yyyy', start_date_math='/y', end_date_math='/y'),
+        ),
     )
 )
 # TODO: See if this actually works
@@ -228,6 +300,7 @@ def _parse_env_bool(var_name, default=None):
     }
 }
 RATELIMIT_STORAGE_URL = f'{REDIS_BASE_URL}/3'
+RATELIMIT_AUTHENTICATED_USER = '20000 per hour;500 per minute'
 
 APP_DEFAULT_SECURE_HEADERS['force_https'] = True
 APP_DEFAULT_SECURE_HEADERS['session_cookie_secure'] = True

diff --git a/asclepias_broker/events/api.py b/asclepias_broker/events/api.py
@@ -48,7 +48,7 @@ def validate_payload(cls, event):
         for payload in event:
             errors = RelationshipSchema(check_existing=True).validate(payload)
             if errors:
-                raise MarshmallowValidationError(errors)
+                raise MarshmallowValidationError(str(errors) + "payload" +  str(payload))
 
     @classmethod
     def handle_event(cls, event: dict, no_index: bool = False,
@@ -69,3 +69,16 @@ def handle_event(cls, event: dict, no_index: bool = False,
         else:
             task.apply_async()
         return event_obj
+
+    @classmethod
+    def rerun_event(cls, event: Event, no_index: bool, eager:bool = False):
+        event_uuid = str(event.id)
+        idx_enabled = current_app.config['ASCLEPIAS_SEARCH_INDEXING_ENABLED'] \
+            and (not no_index)
+        task = process_event.s(
+            event_uuid=event_uuid, indexing_enabled=idx_enabled)
+        if eager:
+            task.apply(throw=True)
+        else:
+            task.apply_async()
+        return event
diff --git a/asclepias_broker/events/cli.py b/asclepias_broker/events/cli.py
@@ -8,14 +8,18 @@
 """Events CLI."""
 
 from __future__ import absolute_import, print_function
+import datetime
 
 import json
 
 import click
 from flask.cli import with_appcontext
+from flask import current_app
 
 from ..utils import find_ext
 from .api import EventAPI
+from ..graph.tasks import process_event
+from .models import Event, EventStatus
 
 
 @click.group()
@@ -41,3 +45,47 @@ def load(jsondir_or_file: str, no_index: bool = False, eager: bool = False):
                 EventAPI.handle_event(data, no_index=no_index, eager=eager)
             except ValueError:
                 pass
+
+@events.command('rerun')
+@click.option('-i','--id', default=None)
+@click.option('-a', '--all', default=False, is_flag=True)
+@click.option('-e', '--errors', default=False, is_flag=True)
+@click.option('-p', '--processing', default=False, is_flag=True)
+@click.option('--no-index', default=False, is_flag=True)
+@click.option('--eager', default=False, is_flag=True)
+@with_appcontext
+def rerun(id: str = None, all: bool = False, errors: bool = True, processing: bool = False, no_index: bool = False, eager: bool = False):
+    """Rerun failed or stuck events."""
+    if id:
+        rerun_id(id, no_index, eager)
+        return
+    if all:
+        errors = True
+        processing = True
+    if processing:
+        rerun_processing(no_index, eager)
+        rerun_new(no_index, eager)
+    if errors:
+        rerun_errors(no_index, eager)
+
+def rerun_id(id:str, no_index: bool, eager:bool = False):
+        event = Event.get(id)
+        if event:
+            EventAPI.rerun_event(event, no_index=no_index, eager=eager)
+
+def rerun_processing(no_index: bool, eager:bool = False):
+        yesterday = datetime.datetime.now() - datetime.timedelta(days = 1)
+        resp = Event.query.filter(Event.status == EventStatus.Processing, Event.created < str(yesterday)).all()
+        for event in resp:
+            EventAPI.rerun_event(event, no_index=no_index, eager=eager)
+
+def rerun_new(no_index: bool, eager:bool = False):
+        yesterday = datetime.datetime.now() - datetime.timedelta(days = 1)
+        resp = Event.query.filter(Event.status == EventStatus.New, Event.created < str(yesterday)).all()
+        for event in resp:
+            EventAPI.rerun_event(event, no_index=no_index, eager=eager)
+
+def rerun_errors(no_index: bool, eager:bool = False):
+        resp = Event.query.filter(Event.status == EventStatus.Error).all()
+        for event in resp:
+            EventAPI.rerun_event(event, no_index=no_index, eager=eager)
diff --git a/asclepias_broker/events/models.py b/asclepias_broker/events/models.py
@@ -10,12 +10,14 @@
 import enum
 import uuid
 from typing import Union
+import datetime
 
 from invenio_accounts.models import User
 from invenio_db import db
 from sqlalchemy.schema import PrimaryKeyConstraint
 from sqlalchemy_utils.models import Timestamp
 from sqlalchemy_utils.types import JSONType, UUIDType
+from sqlalchemy import func
 
 from ..core.models import Identifier, Relationship
 
@@ -53,7 +55,14 @@ class Event(db.Model, Timestamp):
     def get(cls, id: str = None, **kwargs):
         """Get the event from the database."""
         return cls.query.filter_by(id=id).one_or_none()
-
+
+    @classmethod
+    def getStatsFromLastWeek(cls):
+        """Gets the stats from the last 7 days"""
+        last_week = datetime.datetime.now() - datetime.timedelta(days = 7)
+        resp = db.session.query(cls.status, func.count('*')).filter(cls.updated > str(last_week)).group_by(cls.status).all()
+        return resp
+
     def __repr__(self):
         """String representation of the event."""
         return f"<{self.id}: {self.created}>"

diff --git a/asclepias_broker/graph/models.py b/asclepias_broker/graph/models.py
@@ -163,6 +163,7 @@ class GroupM2M(db.Model, Timestamp):
     __tablename__ = 'groupm2m'
     __table_args__ = (
         PrimaryKeyConstraint('group_id', 'subgroup_id', name='pk_groupm2m'),
+        UniqueConstraint('subgroup_id', name='uq_groupm2m_subgroup_id'),
     )
     group_id = db.Column(
         UUIDType,