overview/overview-convert-framework

Base image for Overview converters.

How a Converter Works

A converter's job is to turn files of one type into files of another type. It does this in a loop, receiving each job from an internal Overview HTTP server.

This base image provides portable executables that communicate with Overview. They make up a framework: they'll call your converter program, which you can write in any language.

Your converter will have a Dockerfile that looks like this:

FROM overview/overview-convert-framework AS framework
# multi-stage build

FROM alpine:3.7 AS build
... (build your executables, including `do-convert-single-file`)

FROM alpine:3.7 AS production
# Add ca-certificates to let container download from S3 https:// URLs
RUN apk add --update --no-cache ca-certificates
WORKDIR /app
# The framework provides the main executable
COPY --from=framework /app/run /app/run
# Your `do-convert` code can choose from a few different input and output
# formats. The framework provides many `/app/convert` implementations: pick
# the one that matches your `do-convert`.
COPY --from=framework /app/convert-single-file /app/convert
COPY --from=build /app/do-convert-single-file /app/do-convert-single-file

/app/run

This framework runs on a loop:

  1. Download a task from Overview as JSON.
  2. Open a stream to download the body of the input file.
  3. Stream the body to /app/convert MIME-BOUNDARY JSON and pipe the results to Overview.

/app/run handles all communication with Overview. In particular:

  • /app/run polls for tasks at POLL_URL. Overview's administrator must set POLL_URL for your container.
  • /app/run will retry if there is a connection error.
  • /app/run will never crash.
  • TODO /app/run will poll Overview to check if the task is canceled. It will notify /app/convert with SIGINT if the task is canceled.
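
Step 3 of the loop can be sketched in Python. This is a hypothetical model, not the code of /app/run: it buffers the whole file instead of streaming, it skips polling and retries, and the names convert_task and new_boundary are made up for illustration. How the real /app/run chooses MIME-BOUNDARY is not stated here; random hexadecimal is an assumption.

```python
import json
import secrets
import subprocess


def new_boundary():
    """Random hex boundary (assumption): 60 hex characters."""
    return secrets.token_hex(30)


def convert_task(task, input_bytes, convert_cmd=("/app/convert",)):
    """Run the converter on one task; return (boundary, multipart output)."""
    boundary = new_boundary()
    # Invocation shape from the loop description: /app/convert MIME-BOUNDARY JSON
    proc = subprocess.run(
        list(convert_cmd) + [boundary, json.dumps(task)],
        input=input_bytes,        # the input file body goes to stdin
        stdout=subprocess.PIPE,   # the multipart result comes back on stdout
        check=False,
    )
    return boundary, proc.stdout
```

The returned bytes are what would be piped back to Overview. A streaming implementation would connect the download directly to the child's stdin rather than holding the file in memory.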

/app/convert -- a.k.a., /app/convert-*

/app/convert is a program we provide, under a few different names. That is, when you create your program you'll choose one of the following implementations to copy into /app/convert in your image.

From /app/run's point of view, /app/convert will read the input stream and JSON command-line argument and produce a multipart/form-data output stream with MIME boundary MIME-BOUNDARY (in C lingo, argv[1]). /app/convert will never crash, and it will always output a data stream that Overview can handle.

Your code is invoked by /app/convert, following one of these strategies:

/app/convert-single-file

This version of /app/convert will:

  1. Write standard input to input.blob in a temporary directory and verify it's the correct size
  2. Run /app/do-convert-single-file JSON (your code) in the temporary directory
  3. Translate the stdout from your code into progress events or an error event
  4. When your code exits with status 0 and no error message, pipe output.json and output.blob (plus output-thumbnail.jpg, output-thumbnail.png and output.txt, if they exist), followed by a done event
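
Step 3 can be pictured as a small line classifier. The sketch below is a hypothetical Python model, not the framework's actual code. It uses the progress formats this section lists for do-convert-single-file (p1/2, b102/412, 0.324, and anything else is an error). The exact grammar the real program accepts and the tuple shapes returned here are assumptions.

```python
import re

# One pattern per progress format; a line matching none of them is an error.
_PAGES = re.compile(r"p(\d+)/(\d+)")
_BYTES = re.compile(r"b(\d+)/(\d+)")
_FRACTION = re.compile(r"(?:0(?:\.\d+)?|1(?:\.0+)?)")


def classify_line(line):
    """Return a tuple describing one line of converter stdout."""
    line = line.rstrip("\n")
    match = _PAGES.fullmatch(line)
    if match:
        return ("pages", int(match.group(1)), int(match.group(2)))
    match = _BYTES.fullmatch(line)
    if match:
        return ("bytes", int(match.group(1)), int(match.group(2)))
    if _FRACTION.fullmatch(line):
        return ("fraction", float(line))
    # Anything else is treated as an error message from the converter.
    return ("error", line)
```

The practical consequence for your own program: never print debugging text to stdout, because any line that is not a progress message is reported as an error.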

Special cases:

  • Cancelation: if /app/run sends SIGINT, /app/convert-single-file forwards SIGINT to your program. Your program should kill and wait for any child processes, then exit. Its standard output and standard error will be ignored.
  • Error: if /app/do-convert-single-file exits with a non-zero status, /app/convert-single-file pipes an error event.
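
If your program starts child processes, the cancelation rule means it needs a SIGINT handler. A minimal Python sketch, assuming a single child process; run_child is a hypothetical helper, and the exit status used after cancelation is arbitrary because the framework ignores the program's output at that point.

```python
import signal
import subprocess
import sys


def run_child(cmd):
    """Run cmd and return its exit status; on SIGINT, kill it and exit."""
    child = subprocess.Popen(cmd)

    def on_sigint(signum, frame):
        child.kill()   # stop the child immediately
        child.wait()   # reap it so no zombie process is left behind
        sys.exit(1)    # arbitrary status: output is ignored after cancelation

    previous = signal.signal(signal.SIGINT, on_sigint)
    try:
        return child.wait()
    finally:
        # Restore whatever handler was installed before.
        signal.signal(signal.SIGINT, previous)
```

signal.signal must be called from the main thread, so call run_child from your program's main flow. A program with several children would track all of them and kill each one in the handler.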

You must provide /app/do-convert-single-file. The framework will invoke /app/do-convert-single-file JSON. Your program can read input.blob in the current working directory. Your program must:

  1. Write progress messages to stdout, newline-delimited, that look like:
    • p1/2 -- "finished processing page 1 of 2"
    • b102/412 -- "finished processing byte 102 of 412"
    • 0.324 -- "finished processing 32.4% of input"
    • anything else at all -- "ERROR: [the line of text]"
  2. Write output.json, output.blob, and optionally output-thumbnail.jpg, output-thumbnail.png and/or output.txt.
  3. Exit with status code 0. Any other exit code is an error in your code.
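
The three requirements can be met by a very small program. Here is a toy sketch in Python that "converts" a file by uppercasing it. The function name convert, the filename key and the contents of output.json are hypothetical; this README does not define the JSON schemas. The output file names follow the list above, while the testing section refers to 0.json and 0.blob, so check which names your framework version expects.

```python
import json
import os


def convert(task_json, workdir="."):
    """Toy converter: uppercase input.blob and write the required outputs."""
    task = json.loads(task_json)  # the JSON argument passed by the framework
    with open(os.path.join(workdir, "input.blob"), "rb") as f:
        data = f.read()

    # Required outputs (file names as listed above; see the caveat in the text).
    with open(os.path.join(workdir, "output.blob"), "wb") as f:
        f.write(data.upper())
    with open(os.path.join(workdir, "output.json"), "w") as f:
        # Hypothetical metadata: the real schema is defined by Overview.
        json.dump({"filename": task.get("filename", "converted.txt")}, f)

    # Progress goes to stdout, one message per line, in the b<done>/<total> form.
    print("b%d/%d" % (len(data), len(data)), flush=True)
    return 0  # exit status 0 means success
```

To use this as the executable, the script would end with a call such as sys.exit(convert(sys.argv[1])), run from the temporary directory the framework provides.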

Testing: /app/test-convert-single-file

You can test /app/do-convert-single-file by creating a Docker image with the special framework program, /app/test-convert-single-file. This is designed to integrate with automated build environments like Docker Hub.

Your Docker build stage doesn't need a CMD. It should include:

  • /app/test-convert-single-file -- and you should RUN [ "/app/test-convert-single-file" ]
  • /app/do-convert-single-file and everything it depends on -- /app/test-convert-single-file will invoke it once per test
  • /app/test/test-*: one directory per test, e.g. /app/test/test-with-ocr. Each test directory should contain:
    • input.blob
    • input.json -- the JSON passed to do-convert-single-file
    • stdout -- expected standard output from do-convert-single-file
    • 0.blob -- expected 0.blob output
    • 0.json -- expected 0.json output
    • 0.txt (optional) -- expected 0.txt output
    • 0-thumbnail.{png,jpg} (optional) -- expected output

test-convert-single-file will run do-convert-single-file in a separate directory per test. It will output in TAP format and exit with status code 1 if any test fails.
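
Test directories are easy to get slightly wrong by hand, so a small generator can help. The following Python sketch writes the required files from the list above. write_test_case is a hypothetical helper that is not part of the framework, and it assumes nothing about how the test runner compares 0.json (byte-for-byte or by parsed value), so you may need to adjust the JSON formatting.

```python
import json
from pathlib import Path


def write_test_case(root, name, input_blob, input_json, stdout, blob, out_json):
    """Create root/test-<name>/ containing the required test files."""
    case = Path(root) / ("test-" + name)
    case.mkdir(parents=True, exist_ok=True)
    (case / "input.blob").write_bytes(input_blob)
    (case / "input.json").write_text(json.dumps(input_json))
    # Expected standard output, written as bytes to keep newlines exact.
    (case / "stdout").write_bytes(stdout.encode("utf-8"))
    (case / "0.blob").write_bytes(blob)
    (case / "0.json").write_text(json.dumps(out_json))
    return case
```

For example, write_test_case("test", "upper", b"abc", {}, "b3/3\n", b"ABC", {}) creates test/test-upper, which your Dockerfile can then copy to /app/test/test-upper. Optional files such as 0.txt can be added to the directory afterwards.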

Copying failed-test files from the test suite

The test output is designed to help you correct your tests. For instance, here is example output from a test that fails because do-convert-single-file wrote 0-thumbnail.jpg but the test directory does not contain an expected 0-thumbnail.jpg:

Step 12/13 : RUN [ "/app/test-convert-single-file" ]
 ---> Running in f65521f3a30c
1..3
Tesseract Open Source OCR Engine v3.04.01 with Leptonica
not ok 1 - test-jpg-ocr
    do-convert-single-file wrote /tmp/test-do-convert-single-file912093989/0-thumbnail.jpg, but we expected it not to exist
...

Upon seeing this error, you can docker cp f65521f3a30c:/tmp/test-do-convert-single-file912093989/0-thumbnail.jpg . to inspect the file in question (and perhaps make it the expected one).

Testing PDF conversion

PDF output is a common case. We use QPDF for file comparison, to ease debugging. If you are testing PDF output, your Dockerfile must install QPDF -- e.g., apk --no-cache add qpdf -- before the RUN [ "/app/test-convert-single-file" ] step.

/app/convert-stream-to-mime-multipart

This version of /app/convert will:

  1. Create an empty temporary directory
  2. Run /app/do-convert-stream-to-mime-multipart MIME-BOUNDARY JSON (your code) within the temporary directory
  3. Stream the input file from Overview to your program's stdin and pipe your program's stdout to Overview

Special cases:

  • Cancelation: if /app/run sends SIGINT, /app/convert-stream-to-mime-multipart forwards SIGINT to your program. Your program should kill and wait for any child processes, then exit. Its standard output and standard error will be ignored.
  • Error: if your program exits with a non-zero status, /app/convert-stream-to-mime-multipart pipes an error event.
  • Buggy code: /app/convert-stream-to-mime-multipart emits an error event if your program does not produce an error or done event or does not end with --MIME-BOUNDARY--.
  • Temporary files: if your program emits temporary files to its current working directory, they will be deleted.

You must provide /app/do-convert-stream-to-mime-multipart. The framework will invoke it with MIME-BOUNDARY and JSON as arguments. MIME-BOUNDARY will match the regex [a-fA-F0-9]{1,60}. Your program reads the input file from standard input; its working directory starts out empty.

Your program must write valid multipart/form-data output to stdout. For instance:

--MIME-BOUNDARY\r\n
Content-Disposition: form-data; name="0.json"\r\n
\r\n
{JSON for first output file}\r\n
--MIME-BOUNDARY\r\n
Content-Disposition: form-data; name="0.blob"\r\n
\r\n
Blob for first output file\r\n
--MIME-BOUNDARY\r\n
Content-Disposition: form-data; name="progress"\r\n
\r\n
{"pages":{"nProcessed":1,"nTotal":3}}\r\n
--MIME-BOUNDARY\r\n
Content-Disposition: form-data; name="done"\r\n
\r\n
--MIME-BOUNDARY--

Rules:

  • Your output must end with a done or error element. A done element should be empty; an error element must include an error message.
  • Your output must be in order: 0.json, 0.blob, (optionally 0.png, 0.jpg and/or 0.txt), 1.json, 1.blob, ..., done.
  • You should output an accurate progress report before each N.json to help Overview's progress bar behave well.
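
The example output and rules above translate into a few lines of Python. This is a sketch, not a reference implementation: the helper names are hypothetical, and the done element and closing delimiter are written byte-for-byte as in the example above (whether Overview's parser also accepts an extra CRLF there is not something this README states).

```python
import json


def _head(boundary, name):
    """Delimiter line, Content-Disposition header and the blank line after it."""
    text = '--%s\r\nContent-Disposition: form-data; name="%s"\r\n\r\n'
    return (text % (boundary, name)).encode("ascii")


def write_part(out, boundary, name, body):
    """Write one element; body is bytes and is followed by CRLF."""
    out.write(_head(boundary, name) + body + b"\r\n")


def write_done(out, boundary):
    """Write the empty done element and the closing --MIME-BOUNDARY--."""
    out.write(_head(boundary, "done") + ("--%s--" % boundary).encode("ascii"))


def write_single_output(out, boundary, metadata, blob):
    """Emit progress, 0.json, 0.blob and done, in the required order."""
    progress = {"pages": {"nProcessed": 1, "nTotal": 1}}
    write_part(out, boundary, "progress", json.dumps(progress).encode("utf-8"))
    write_part(out, boundary, "0.json", json.dumps(metadata).encode("utf-8"))
    write_part(out, boundary, "0.blob", blob)
    write_done(out, boundary)
```

In a real program, out would be sys.stdout.buffer and boundary would be the first command-line argument. Writing bytes rather than text avoids newline translation corrupting the \r\n sequences or the blob.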

Roll your own

Rolling your own version of /app/convert is even more lightweight than using /app/convert-stream-to-mime-multipart. Beware, though:

  • Your own version of /app/convert must always output messages to Overview: especially a done or error event. Without those events, Overview will never finish processing the file: it will retry indefinitely.
  • Your own version of /app/convert must always exit successfully. The trickiest case, in our experience, is handling "out of memory." If your /app/convert does not exit successfully, Overview will retry indefinitely and the file will never be processed.
  • Your own version of /app/convert should output helpful error messages, so you can debug it easily.
  • Your own version of /app/convert should exit quickly after receiving SIGINT, because Overview will ignore all further output.
  • Your own version of /app/convert must ensure temporary files created during one invocation aren't read by the next invocation: that would leak users' documents to other users.
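
To give a feel for what these rules demand, here is a hypothetical Python skeleton for a hand-rolled /app/convert. safe_convert and its work callback are illustrative names only. The sketch covers the "always emit done or error", "always exit successfully" and "fresh temporary directory" rules; it does not cover a process killed by the kernel for running out of memory, which no in-process handler can catch.

```python
import tempfile
import traceback


def safe_convert(boundary, work, out):
    """Run work(tmpdir, out); always finish with a done or error element."""
    delimiter = b"--" + boundary.encode("ascii")
    head = b'\r\nContent-Disposition: form-data; name="%s"\r\n\r\n'
    try:
        # A fresh directory per invocation, deleted on exit, so one task's
        # documents are never visible to the next task.
        with tempfile.TemporaryDirectory() as tmpdir:
            work(tmpdir, out)
        out.write(delimiter + head % b"done")
    except Exception:
        # Report the traceback as the error message to ease debugging.
        message = traceback.format_exc().encode("utf-8")
        out.write(delimiter + head % b"error" + message + b"\r\n")
    out.write(delimiter + b"--")
    return 0  # always report success to /app/run
```

KeyboardInterrupt, which Python raises on SIGINT by default, is deliberately not caught, so the program ends promptly on cancelation. One remaining gap: if work fails halfway through writing an element, the error element lands inside that element's body, so a robust version would buffer each element before writing it.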

/app/convert-stream-to-mime-multipart is small and fast, and it solves these problems for you. You probably want it.

To Maintain This Repository

Coding

./dev will start a development loop that runs tests. Restart it if you edit Dockerfile.

Testing

docker build . will run all tests.

Tests are in ./test/*/suite.bats. They're run in bats, an ideal framework for testing programs that pipe data around.

Releasing

./release MAJOR.MINOR.PATCH will push to GitHub. Docker Hub will build the images for mass consumption.

License

This software is Copyright 2011-2018 Jonathan Stray and Copyright 2019-2020 Overview Computing Inc., and distributed under the terms of the GNU Affero General Public License. See the LICENSE file for details.