Hadrosaur makes it easy to track the completion status, errors, and logs of large numbers of resources (files, metadata, analytics, database imports, etc.).
You simply define your resource as a decorated Python function that can create files and save metadata using an identifier in a certain namespace. Later on, you can quickly fetch the status and results of previously computed resources.
This library uses a combination of LevelDB and the file system to track the state of your tasks.
pip install hadrosaur
Import the lib and initialize a project using a base directory. Files, metadata, and logs will all get stored under this directory.
from hadrosaur import Project
proj = Project('./base_directory')
Define a collection using a decorator around a function. This function's job is to generate a single resource for the collection given a unique ID and some arguments.
The collection should have a unique name, and its function must take these params:
- ident: an identifier (unique across the collection) for each computed resource
- args: a dictionary of optional arguments
- ctx: a Context object which holds some extra data you may find useful during computation:
  - ctx.subdir: the path of a directory in which you can store files for this resource
  - ctx.logger: a special Python logging instance that will write to a rotating log file stored in the resource directory, with some nice default formatting
import time

@proj.resource('collection_name')
def compute_resource(ident, args, ctx):
    ctx.logger.info("Starting up")
    # Run some things...
    # Maybe save stuff into ctx.subdir...
    time.sleep(1)
    # Return any JSON-serializable data for the resource, such as metadata, run results, filepaths, etc.
    return {'ts': time.time()}
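To make the idea of a per-resource rotating log more concrete, here is a minimal standard-library sketch of how such a logger could be built. It is not Hadrosaur's actual implementation: the function name `make_resource_logger`, the file name `run.log`, and the size limits are all assumptions chosen for illustration.

```python
import logging
import os
import tempfile
from logging.handlers import RotatingFileHandler


def make_resource_logger(name, resource_dir, max_bytes=1_000_000, backups=2):
    """Build a logger that writes to a rotating file inside resource_dir."""
    os.makedirs(resource_dir, exist_ok=True)
    log_path = os.path.join(resource_dir, "run.log")  # hypothetical file name
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep resource logs out of the root logger
    if not logger.handlers:  # avoid stacking duplicate handlers on repeated calls
        handler = RotatingFileHandler(log_path, maxBytes=max_bytes, backupCount=backups)
        handler.setFormatter(logging.Formatter(
            "%(asctime)s %(levelname)s %(message)s (%(filename)s:%(lineno)d)",
            datefmt="%Y-%m-%d %H:%M:%S",
        ))
        logger.addHandler(handler)
    return logger, log_path


# Demo: log one line into a temporary directory and read it back.
demo_dir = tempfile.mkdtemp()
demo_logger, demo_log_path = make_resource_logger("demo_collection.demo_ident", demo_dir)
demo_logger.info("Starting up")
for handler in demo_logger.handlers:
    handler.flush()
with open(demo_log_path) as f:
    demo_log_text = f.read()
```

With Hadrosaur itself you do not need any of this; ctx.logger is handed to your function ready to use.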
Use the proj.fetch(collection_name, ident) method to compute and cache resources in a collection.
Keyword arguments:
- args: an optional dict of extra arguments for the resource compute function
- recompute: force the resource to be re-computed, even if it has already been computed
What happens when you fetch a resource:
- If the resource has not yet been computed, the collection's compute function will be run.
- If the resource was already computed in the past, then the saved results will get returned instantly (unless recompute=True has been set in the keyword arguments).
- If an error is thrown in the function, logs will be saved and the status will be updated.
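The three cases above can be sketched with a small in-memory stand-in. This is only an illustration of the compute-or-return-cached pattern, not Hadrosaur's code: `fetch_sketch` and `_store` are hypothetical names, state lives in a plain dict rather than LevelDB and the file system, and the compute function here takes only `(ident, args)` without a ctx.

```python
import time
import traceback

_store = {}  # (collection, ident) -> record; in-memory stand-in for persistent state


def fetch_sketch(func, collection, ident, args=None, recompute=False):
    """Return a cached record, or run func and record its outcome."""
    key = (collection, ident)
    cached = _store.get(key)
    # Case 2: already computed, so return the saved record unless forced.
    if cached is not None and cached["status"] == "completed" and not recompute:
        return cached
    # Case 1: not computed yet (or recompute requested), so run the function.
    record = {
        "status": "pending",
        "result": None,
        "error": None,
        "start_time": int(time.time() * 1000),
        "end_time": None,
    }
    _store[key] = record
    try:
        record["result"] = func(ident, args or {})
        record["status"] = "completed"
    except Exception:
        # Case 3: keep the traceback and mark the record as failed.
        record["error"] = traceback.format_exc()
        record["status"] = "error"
    record["end_time"] = int(time.time() * 1000)
    return record
```

Calling `fetch_sketch` twice with the same collection and identifier runs the function once; passing `recompute=True` runs it again.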
>> proj.fetch('collection_name', 'uniq_ident123', optional_args)
<Resource>
The resource object has the following properties:
- resource.result: any JSON-serializable data returned by the resource's compute function
- resource.start_time: the Unix epoch time (in milliseconds) at which the resource started being computed
- resource.end_time: the Unix epoch time (in milliseconds) at which the resource finished computing (or failed)
- resource.status: whether the resource has been computed already ("completed"), is currently being computed ("pending"), has not yet been fetched at all ("unavailable"), or threw a Python error while running the function ("error")
- resource.paths: a dictionary of all the filesystem paths associated with your resource, with the following keys:
  - 'base': the base directory that holds all data for the resource
  - 'error': path to the Python stack trace of any error that occurred while running the resource's function
  - 'log': path to a line-by-line log file produced by the resource's logger (ctx.logger)
  - 'status': path to the current status ("unavailable", "completed", "pending", "error")
  - 'result': path to a JSON file of serializable data returned by the resource's function
  - 'storage': directory path of additional files written by the resource's function (ctx.subdir)
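Because start_time and end_time are both given in milliseconds, the run time of a resource is a simple subtraction. The sketch below uses a `SimpleNamespace` as a stand-in for a real resource object; the helper name `duration_seconds` is hypothetical, and the idea that a timestamp may be `None` while a resource is still pending is an assumption.

```python
from types import SimpleNamespace


def duration_seconds(resource):
    """Return elapsed compute time in seconds, or None if timing is incomplete."""
    if resource.start_time is None or resource.end_time is None:
        return None  # assumed: a timestamp may be missing before completion
    return (resource.end_time - resource.start_time) / 1000.0


# Stand-in object with the same attribute names as described above.
example = SimpleNamespace(
    start_time=1580919335000,
    end_time=1580919336500,
    status="completed",
)
example_duration = duration_seconds(example)
```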
To see status counts for a whole collection, use proj.stats('collection_name'):
> proj.stats('collection_name')
{
    'counts': {
        'total': 100,
        'pending': 75,
        'completed': 20,
        'error': 5,
        'unavailable': 0
    }
}
Use proj.stats() without an argument to fetch the stats for all collections.
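As a small illustration of working with the counts dictionary shown above for a single collection, the following sketch turns the raw counts into percentages. The helper name `summarize_counts` is hypothetical; the input shape is copied from the example output.

```python
def summarize_counts(stats):
    """Convert per-status counts into percentages of the collection total."""
    counts = stats["counts"]
    total = counts["total"]
    if total == 0:
        return {}  # nothing fetched yet, so no percentages to report
    return {
        status: round(100.0 * n / total, 1)
        for status, n in counts.items()
        if status != "total"
    }


# Same shape as the proj.stats('collection_name') example output.
example_stats = {
    "counts": {
        "total": 100,
        "pending": 75,
        "completed": 20,
        "error": 5,
        "unavailable": 0,
    }
}
example_summary = summarize_counts(example_stats)
```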
To get a list of resource IDs for a given status, use proj.fetch_by_status:
> proj.fetch_by_status('collection_name', 'pending')
['1', '2', '3'..]
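One way to combine fetch_by_status with the recompute keyword is to retry everything that previously failed. The sketch below relies only on the calls described in this README; the helper name `retry_failed` is hypothetical, and it assumes the compute function does not need extra args (pass them through `args=` if yours does).

```python
def retry_failed(proj, collection_name):
    """Re-run every resource currently in the "error" state; return the IDs retried."""
    failed_ids = list(proj.fetch_by_status(collection_name, 'error'))
    for ident in failed_ids:
        # recompute=True forces the compute function to run again.
        proj.fetch(collection_name, ident, recompute=True)
    return failed_ids
```

Usage would look like `retry_failed(proj, 'collection_name')`, followed by proj.stats('collection_name') to check the new counts.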
Use proj.status('collection_name', 'resource_id') to see the status of a particular resource.
> proj.status('collection_name', 'resource_id')
"completed"
If an exception was raised during the execution of the function used to compute a resource, use proj.fetch_error to see the error.
> proj.fetch_error('collection_name', 'resource_id')
"""Traceback (most recent call last):
  File "/home/j/code/hadrosaur/hadrosaur/main.py", line 211, in fetch
    result = func(ident, args, ctx)
  File "/home/j/code/hadrosaur/test/test_general.py", line 26, in throw_something
    raise RuntimeError('This is an error!')
RuntimeError: This is an error!"""
To see the run log (produced by ctx.logger during function execution), use proj.fetch_log:
> proj.fetch_log('collection_name', 'resource_id')
"""
2020-02-05 16:15:35 INFO output here (test_general.py:25)
2020-02-05 16:15:35 INFO more output here (test_general.py:25)
"""