Skip to content

Commit

Permalink
AAS deployment (#117)
Browse files Browse the repository at this point in the history
* Broker updates (#108)

* Updates to docker setup

* testing adding event harvesters

* Updated parsing of events to be strict to align with marshmallow version 3

* removed unnecesary debug message

* Added keywords to ads harvester

* Updated to Python 3.8 and latest invenio packages

* Added metadata endpoint for metadata query

* Updated Pipfile.lock and install using it

* Updated metadata queries to ignore cases for journals and allow simple query strings on keywords

* Fixed mistake

* Added automatic harvesting after event is finished

* Added error monitoring for harvesters and events

* Skip retrying tasks and publish error

* Fixed simple query syntax

* Fixed bug with extra definition of query filter

* Fixed bug where target scheme was overwritten due to duplicate variable names

* Updated error message with payload on failed validation

* Moved error handling to cover the full harvesting process

* Moved error handling to cover the full harvesting process

* Readded retries on failed harvest and event tasks

* Added harvest taks monitoring

* Updated error catching on DOI harvester

* Updated harvester monitoring to add which harvester has been run

* Added error monitoring on updating metadata

* Added error monitoring on updating metadata

* Added aggregations on metadata search on amount of softwares cited

* Updated the reruns to have a progressive increase in time for each retry

* Added slackclient to write status reports every week

* Moved error catching to cli so the api fucntions transfer the error to the harvesters

* Updated Piplock file

* Updated metadata search to only show aggregated Target results

Co-authored-by: Mattias Wångblad <[email protected]>

* Added vscode files to gitignore

* Added azure dev docker compose for lets encrypt ssl

* Updated path for letsencrypt to account for symlinks

* Updated Azure Dev HAproxy config to enable Letsencrypt SSL certificate

* Added additional docker compose file for Azure dev instance to enable SSL through Lets Encrypt (#110)

* Added azure dev docker compose for lets encrypt ssl

* Updated path for letsencrypt to account for symlinks

* Updated Azure Dev HAproxy config to enable Letsencrypt SSL certificate

* Initial compose file

* Broker updates2 (#111)

* Upgraded to elasticsearch v7

* Updated formatting for slack monitoring messages

* Added a github harvester

* Updated pagination for metadata search

* Added mappings for es7

* Fixed parsing bugs in github harvester

* Fixed issues with adding github relations in zeondo harvester

* Made sure to order the metadata aggregations correctly on count

* Updated slack notifications to send to correct channel

* Fixed 2 more bugs on github parsing of github urls

* Fixed bug where the providers where dictionaried twice

* Added check to ensure it won't fail if github tags doesn't exists

* Updated payload on error monitoring to always be with the identifiers

* Fixed bug mixing up source and target in github relations

* Added possibility to have a github token for more queries per hour

* Increased rate limit for logined users to allow more events per hour

* Fixed bug setting github api token

* Fixed bug setting github api token

* Fixed bug setting github api token

* Added message in case there is nothing to report on monitoring

* Addec CLI commands to trigger monitor report

* Addded unique key requirement on groupm2m subgroup_id column

* Fixed bug on monitoring reporting

* Fixed another montioring report bug on getting the errors to dict

* Split the error monitoring into chunks because of Slack limitations

* Added payload to error report

* Fixed typo in error report

* Fixed bug where github url ending with / would cause error

* Added rollbacks to make sure event reporting is fixed

* Fixed bug on rolling back DB errors

* Fixed bug which caused monitoring message to fail on some occasions

* Added CLi to rerun failed events

* Added possibility to rerun specific id

* Reomved erorrneus import

* Fixed typo

* Made ID optional in event rerun CLI

* Added rerun CLI to harvester

* Fixed bug on parsing github release id

* Updated names of harvesters to be able to rerun them from CLI

* Added the unique key on subgroup in groupm2m

Co-authored-by: Mattias Wångblad <[email protected]>

* Changed over to production SSL certificates

* Inital Docker Compose production script

* Keeping all networking in links rather than networks

* typo fix

* Changed networking for "elastic" and "bridge"

* Troubleshooting elasticsearch cluster

* es debug

* es debug

* es docker debug

* es debug

* es debug

* es debug

* es debug

* Fixed networking for es

* Aas deployment (#112)

* Initial compose file

* Changed over to production SSL certificates

* Inital Docker Compose production script

* Keeping all networking in links rather than networks

* typo fix

* Changed networking for "elastic" and "bridge"

* Troubleshooting elasticsearch cluster

* es debug

* es debug

* es docker debug

* es debug

* es debug

* es debug

* es debug

* Fixed networking for es

* Added keyword and journal filter to relation search

* Added monitoring and harvesting to documentation and fixed some typos on the naming and comments

* Set specific version of rabbitmq and increased timeout on acknowledging to 1 week to not imeout on retries

* Added memory limits to rabbitmq

* Broker updates (#115)

* Removed retry of tasks after 1 week since that could fill up rabbitmq channels

* Aas deployment (#114)

* Initial compose file

* Changed over to production SSL certificates

* Inital Docker Compose production script

* Keeping all networking in links rather than networks

* typo fix

* Changed networking for "elastic" and "bridge"

* Troubleshooting elasticsearch cluster

* es debug

* es debug

* es docker debug

* es debug

* es debug

* es debug

* es debug

* Fixed networking for es

* Added memory limits to rabbitmq

* Added a 1s sleep on harvesters to avoid query overload

* Changed retry of tasks to only once after 10min and added cronjob to rerun last two days of errors every night

* Possibility to only rerun errors between timestamps on CLI

* Added link between error_monitoring, event and harvester_event to avoid having double creation of errors

* Fixed inconsistency in import

* Added check on github link on event parsing

* Allow null values of LicenseUrl on event json parsing

* Lowered acknowledge time to 30min

* Refactored to avoid circular references

* Removed unused imports

* Refactored to avoid circular references when checking github events

* Another move to avoid circular references

* Fixed bug pointing to wrong rerun function

* Small fixes on the github validation

* Changed error reporting to only take errors that have not successfully been rerun

Co-authored-by: mattiasw <[email protected]>

* Broker updates (#116)

* Removed retry of tasks after 1 week since that could fill up rabbitmq channels

* Aas deployment (#114)

* Initial compose file

* Changed over to production SSL certificates

* Inital Docker Compose production script

* Keeping all networking in links rather than networks

* typo fix

* Changed networking for "elastic" and "bridge"

* Troubleshooting elasticsearch cluster

* es debug

* es debug

* es docker debug

* es debug

* es debug

* es debug

* es debug

* Fixed networking for es

* Added memory limits to rabbitmq

* Added a 1s sleep on harvesters to avoid query overload

* Changed retry of tasks to only once after 10min and added cronjob to rerun last two days of errors every night

* Possibility to only rerun errors between timestamps on CLI

* Added link between error_monitoring, event and harvester_event to avoid having double creation of errors

* Fixed inconsistency in import

* Added check on github link on event parsing

* Allow null values of LicenseUrl on event json parsing

* Lowered acknowledge time to 30min

* Refactored to avoid circular references

* Removed unused imports

* Refactored to avoid circular references when checking github events

* Another move to avoid circular references

* Fixed bug pointing to wrong rerun function

* Small fixes on the github validation

* Changed error reporting to only take errors that have not successfully been rerun

* Make sure that the error message aren't larger than slack limits

* Added unique constraint on groupm2m subgroup since it's never intended to have multiple of them

Co-authored-by: mattiasw <[email protected]>

* Update AUTHORS.rst

Added new authors

* Update Dockerfile

Adding Ignore pipfile to only use Pipfile.lock for docker install

* Added ignore pipfile to dockerfile

* Added flask version to Pipfile

* Added specific Invenio versions to ensure compatibility

Co-authored-by: mattiaswangblad <[email protected]>
Co-authored-by: Mattias Wångblad <[email protected]>
Co-authored-by: mattiasw <[email protected]>
  • Loading branch information
4 people authored Mar 4, 2022
1 parent 7fa422a commit afcf095
Show file tree
Hide file tree
Showing 61 changed files with 4,175 additions and 992 deletions.
3 changes: 3 additions & 0 deletions .gitignore
Original file line number Diff line number Diff line change
Expand Up @@ -60,3 +60,6 @@ target/

# Vim swapfiles
.*.sw?

# VScode Docs
.vscode/
3 changes: 3 additions & 0 deletions AUTHORS.rst
Original file line number Diff line number Diff line change
Expand Up @@ -11,3 +11,6 @@ Authors
- Chiara Bigarella
- Krzysztof Nowak
- Thomas P. Robitaille
- Mattias Wångblad
- Mubdi Rahman

8 changes: 4 additions & 4 deletions Dockerfile
Original file line number Diff line number Diff line change
Expand Up @@ -4,16 +4,16 @@
#
# Asclepias Broker is free software; you can redistribute it and/or modify it
# under the terms of the MIT License; see LICENSE file for more details.
FROM inveniosoftware/centos7-python:3.6
FROM python:3.8

RUN mkdir /app
WORKDIR /app

COPY Pipfile Pipfile
COPY Pipfile.lock Pipfile.lock
RUN pipenv install --deploy --system

RUN pip install pipenv
RUN pipenv install --deploy --system --ignore-pipfile
ADD . /app
RUN pip install .
RUN pipenv install . --ignore-pipfile

CMD ["asclepias-broker", "shell"]
32 changes: 20 additions & 12 deletions Pipfile
Original file line number Diff line number Diff line change
Expand Up @@ -4,23 +4,31 @@ verify_ssl = true
name = "pypi"

[packages]
celery = "<4.3"
celery = "==5.1.2"
idutils = "*"
invenio = {version = ">=3.1.0,<3.2.0",extras = ["elasticsearch6", "postgresql"]}
invenio-accounts = ">=1.1.1,<1.2.0"
invenio-indexer = ">=1.0.1,<1.1.0"
invenio-logging = ">=1.1.0,<1.2.0"
invenio-oauth2server = ">=1.0.3,<1.1.0"
invenio-queues = "==1.0.0a1"
invenio-records-rest = ">=1.4.0,<1.5.0"
invenio-rest = ">=1.0.0,<1.1.0"
invenio = {version = "==3.4.1",extras = ["elasticsearch7", "postgresql"]}
invenio-app = "==1.3.3"
invenio-base = "==1.2.5"
invenio-celery = "==1.2.3"
invenio-accounts = "==1.4.9"
invenio-indexer = ">=1.2.1"
invenio-logging = "==1.3.0"
invenio-oauth2server = "==1.3.4"
invenio-queues = "==1.0.0a3"
invenio-records-rest = "==1.9.0"
invenio-search = "==1.4.2"
invenio-rest = ">=1.2.3"
invenio-theme = "==1.3.10"
flask = "==2.0.2"
jsonschema = "*"
marshmallow = ">=2.0.0,<3.0.0"
marshmallow = ">=3.0.0,<4.0.0"
raven = {version = "*",extras = ["flask"]}
requests = "*"
uwsgi = "*"
uwsgi = ">= 2.0.19"
uwsgi-tools = "*"
uwsgitop = "*"
ipython = "*"
slackclient = "*"

[dev-packages]
check-manifest = "*"
Expand All @@ -43,7 +51,7 @@ sphinxcontrib-httpdomain = "*"
sphinxcontrib-httpexample = "*"

[requires]
python_version = "3.6"
python_version = "3.8"

[scripts]
test = "python setup.py test"
Expand Down
2,323 changes: 1,582 additions & 741 deletions Pipfile.lock

Large diffs are not rendered by default.

81 changes: 77 additions & 4 deletions asclepias_broker/config.py
Original file line number Diff line number Diff line change
Expand Up @@ -21,9 +21,10 @@
from invenio_records_rest.facets import range_filter, terms_filter
from invenio_records_rest.utils import deny_all
from invenio_search.api import RecordsSearch
from celery.schedules import crontab

from .search.query import enum_term_filter, nested_range_filter, \
nested_terms_filter
nested_terms_filter, simple_query_string_filter


def _parse_env_bool(var_name, default=None):
Expand Down Expand Up @@ -125,8 +126,20 @@ def _parse_env_bool(var_name, default=None):
'schedule': timedelta(minutes=60),
},
'reindex': {
'task': 'asclepias_broker.tasks.reindex_all_relationships',
'schedule': timedelta(hours=24)
'task': 'asclepias_broker.search.tasks.reindex_all_relationships',
'schedule': crontab(hour=23, minute=0)
},
'notify': {
'task': 'asclepias_broker.monitoring.tasks.sendMonitoringReport',
'schedule': crontab(hour=0, minute=0, day_of_week=0)
},
'rerun_harvester_errors': {
'task': 'asclepias_broker.monitoring.tasks.rerun_harvest_errors',
'schedule': crontab(hour=22, minute=0)
},
'rerun_event_errors': {
'task': 'asclepias_broker.monitoring.tasks.rerun_event_errors',
'schedule': crontab(hour=21, minute=0)
},
}

Expand Down Expand Up @@ -167,6 +180,36 @@ def _parse_env_bool(var_name, default=None):
max_result_window=10000,
error_handlers=dict(),
),
meta=dict(
pid_type='meta',
pid_minter='relid',
pid_fetcher='relid',
search_class=RecordsSearch,
indexer_class=None,
search_index='relationships',
search_type=None,
search_factory_imp='asclepias_broker.search.query.meta_search_factory',
# Only the List GET view is available
create_permission_factory_imp=deny_all,
delete_permission_factory_imp=deny_all,
update_permission_factory_imp=deny_all,
read_permission_factory_imp=deny_all,
links_factory_imp=lambda p, **_: None,
record_serializers={
'application/json': ('invenio_records_rest.serializers'
':json_v1_response'),
},
# TODO: Implement marshmallow serializers
search_serializers={
'application/json': ('invenio_records_rest.serializers'
':json_v1_search'),
},
list_route='/metadata',
item_route='/metadata/<pid(meta):pid_value>',
default_media_type='application/json',
max_result_window=10000,
error_handlers=dict(),
),
)
RECORDS_REST_FACETS = dict(
relationships=dict(
Expand Down Expand Up @@ -205,12 +248,41 @@ def _parse_env_bool(var_name, default=None):
'isRelatedTo': 'IsRelatedTo'
}
),
keyword=simple_query_string_filter('Source.Keywords_all'),
journal=nested_terms_filter('Source.Publisher.Name','Source.Publisher'),
),
post_filters=dict(
type=terms_filter('Source.Type.Name'),
publication_year=range_filter(
'Source.PublicationDate', format='yyyy', end_date_math='/y'),
'Source.PublicationDate', format='yyyy', start_date_math='/y', end_date_math='/y'),
)
),

# The topHits agg can't be addded here due to limitations in elasticsearch_dsl aggs function so that is added in query.py
metadata=dict(
aggs=dict(
NumberOfTargets=dict(
cardinality=dict(field='Target.ID')
),
publication_year=dict(
date_histogram=dict(
field='Source.PublicationDate',
interval='year',
format='yyyy',
),
),
),
filters=dict(
group_by=enum_term_filter(
label='group_by',
field='Grouping',
choices={'identity': 'identity', 'version': 'version'}
),
keyword=simple_query_string_filter('Source.Keywords_all'),
journal=nested_terms_filter('Source.Publisher.Name','Source.Publisher'),
publication_year=range_filter(
'Source.PublicationDate', format='yyyy', start_date_math='/y', end_date_math='/y'),
),
)
)
# TODO: See if this actually works
Expand All @@ -228,6 +300,7 @@ def _parse_env_bool(var_name, default=None):
}
}
RATELIMIT_STORAGE_URL = f'{REDIS_BASE_URL}/3'
RATELIMIT_AUTHENTICATED_USER = '20000 per hour;500 per minute'

APP_DEFAULT_SECURE_HEADERS['force_https'] = True
APP_DEFAULT_SECURE_HEADERS['session_cookie_secure'] = True
Expand Down
15 changes: 14 additions & 1 deletion asclepias_broker/events/api.py
Original file line number Diff line number Diff line change
Expand Up @@ -48,7 +48,7 @@ def validate_payload(cls, event):
for payload in event:
errors = RelationshipSchema(check_existing=True).validate(payload)
if errors:
raise MarshmallowValidationError(errors)
raise MarshmallowValidationError(str(errors) + "payload" + str(payload))

@classmethod
def handle_event(cls, event: dict, no_index: bool = False,
Expand All @@ -69,3 +69,16 @@ def handle_event(cls, event: dict, no_index: bool = False,
else:
task.apply_async()
return event_obj

@classmethod
def rerun_event(cls, event: Event, no_index: bool, eager:bool = False):
event_uuid = str(event.id)
idx_enabled = current_app.config['ASCLEPIAS_SEARCH_INDEXING_ENABLED'] \
and (not no_index)
task = process_event.s(
event_uuid=event_uuid, indexing_enabled=idx_enabled)
if eager:
task.apply(throw=True)
else:
task.apply_async()
return event
48 changes: 48 additions & 0 deletions asclepias_broker/events/cli.py
Original file line number Diff line number Diff line change
Expand Up @@ -8,14 +8,18 @@
"""Events CLI."""

from __future__ import absolute_import, print_function
import datetime

import json

import click
from flask.cli import with_appcontext
from flask import current_app

from ..utils import find_ext
from .api import EventAPI
from ..graph.tasks import process_event
from .models import Event, EventStatus


@click.group()
Expand All @@ -41,3 +45,47 @@ def load(jsondir_or_file: str, no_index: bool = False, eager: bool = False):
EventAPI.handle_event(data, no_index=no_index, eager=eager)
except ValueError:
pass

@events.command('rerun')
@click.option('-i','--id', default=None)
@click.option('-a', '--all', default=False, is_flag=True)
@click.option('-e', '--errors', default=False, is_flag=True)
@click.option('-p', '--processing', default=False, is_flag=True)
@click.option('--no-index', default=False, is_flag=True)
@click.option('--eager', default=False, is_flag=True)
@with_appcontext
def rerun(id: str = None, all: bool = False, errors: bool = True, processing: bool = False, no_index: bool = False, eager: bool = False):
"""Rerun failed or stuck events."""
if id:
rerun_id(id, no_index, eager)
return
if all:
errors = True
processing = True
if processing:
rerun_processing(no_index, eager)
rerun_new(no_index, eager)
if errors:
rerun_errors(no_index, eager)

def rerun_id(id:str, no_index: bool, eager:bool = False):
event = Event.get(id)
if event:
EventAPI.rerun_event(event, no_index=no_index, eager=eager)

def rerun_processing(no_index: bool, eager:bool = False):
yesterday = datetime.datetime.now() - datetime.timedelta(days = 1)
resp = Event.query.filter(Event.status == EventStatus.Processing, Event.created < str(yesterday)).all()
for event in resp:
EventAPI.rerun_event(event, no_index=no_index, eager=eager)

def rerun_new(no_index: bool, eager:bool = False):
yesterday = datetime.datetime.now() - datetime.timedelta(days = 1)
resp = Event.query.filter(Event.status == EventStatus.New, Event.created < str(yesterday)).all()
for event in resp:
EventAPI.rerun_event(event, no_index=no_index, eager=eager)

def rerun_errors(no_index: bool, eager:bool = False):
resp = Event.query.filter(Event.status == EventStatus.Error).all()
for event in resp:
EventAPI.rerun_event(event, no_index=no_index, eager=eager)
11 changes: 10 additions & 1 deletion asclepias_broker/events/models.py
Original file line number Diff line number Diff line change
Expand Up @@ -10,12 +10,14 @@
import enum
import uuid
from typing import Union
import datetime

from invenio_accounts.models import User
from invenio_db import db
from sqlalchemy.schema import PrimaryKeyConstraint
from sqlalchemy_utils.models import Timestamp
from sqlalchemy_utils.types import JSONType, UUIDType
from sqlalchemy import func

from ..core.models import Identifier, Relationship

Expand Down Expand Up @@ -53,7 +55,14 @@ class Event(db.Model, Timestamp):
def get(cls, id: str = None, **kwargs):
"""Get the event from the database."""
return cls.query.filter_by(id=id).one_or_none()


@classmethod
def getStatsFromLastWeek(cls):
"""Gets the stats from the last 7 days"""
last_week = datetime.datetime.now() - datetime.timedelta(days = 7)
resp = db.session.query(cls.status, func.count('*')).filter(cls.updated > str(last_week)).group_by(cls.status).all()
return resp

def __repr__(self):
"""String representation of the event."""
return f"<{self.id}: {self.created}>"
Expand Down
1 change: 1 addition & 0 deletions asclepias_broker/graph/models.py
Original file line number Diff line number Diff line change
Expand Up @@ -163,6 +163,7 @@ class GroupM2M(db.Model, Timestamp):
__tablename__ = 'groupm2m'
__table_args__ = (
PrimaryKeyConstraint('group_id', 'subgroup_id', name='pk_groupm2m'),
UniqueConstraint('subgroup_id', name='uq_groupm2m_subgroup_id'),
)
group_id = db.Column(
UUIDType,
Expand Down
Loading

0 comments on commit afcf095

Please sign in to comment.