[Docs] Offline batch inference guide (static assignment) #4144

Draft: wants to merge 12 commits into base: master
Conversation

Michaelvll (Collaborator):

An initial guide for offline batch inference with static assignment.

Future TODOs:

  • Batch inference with dynamic work pulling
  • Batch inference with autoscaling (possibly based on SkyServe)
  • An interface for batch inference

Tested (run the relevant ones):

  • Code formatting: bash format.sh
  • Any manual or new tests for this PR (please specify below)
  • All smoke tests: pytest tests/test_smoke.py
  • Relevant individual smoke tests: pytest tests/test_smoke.py::test_fill_in_the_name
  • Backward compatibility tests: conda deactivate; bash -i tests/backward_compatibility_tests.sh

concretevitamin (Member) left a comment:

Quick look. P0 is to discuss and clarify the concepts.


.. code-block::

metadata.txt
concretevitamin (Member):

Would be nice to show the first few lines of the metadata file.
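
As a purely hypothetical illustration (the PR does not show the file's actual contents), a metadata file for this kind of pipeline might list one input chunk per line:

.. code-block:: text

   data/chunk_0000.jsonl
   data/chunk_0001.jsonl
   data/chunk_0002.jsonl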


.. code-block:: bash

NUM_PARALLEL_GROUPS=16
concretevitamin (Member):

"groups" is surprising/unfamiliar; do you mean "workers"?

.. _offline-batch-inference:

Large-Scale Offline Batch Inference
===================================
concretevitamin (Member):

Overall may need a figure to illustrate all the concepts and how they relate: chunks / work items (is one item = one JSONL file in this case?) / groups / workers, etc.

At least a bullet list at the top to explain these. I'm currently confused by the above.


llm = LLM(model='meta-llama/Meta-Llama-3.1-8B-Instruct', tensor_parallel_size=1)

def batch_inference(llm: LLM, data_path: str):
concretevitamin (Member):

Rename? It seems to work on one work item / one batch only, and there's no 'batching'?
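
For reference, a minimal sketch of what such a function might look like in full (the JSONL field names and the output path are assumptions; only the signature appears in the excerpt above):

.. code-block:: python

   import json

   from vllm import LLM, SamplingParams

   llm = LLM(model='meta-llama/Meta-Llama-3.1-8B-Instruct',
             tensor_parallel_size=1)

   def batch_inference(llm: LLM, data_path: str):
       # Read one chunk: a JSONL file with one prompt per line.
       with open(data_path) as f:
           prompts = [json.loads(line)['prompt'] for line in f]

       # vLLM batches all prompts within a single generate() call.
       outputs = llm.generate(prompts, SamplingParams(max_tokens=512))

       # Write one JSON record per prompt next to the input chunk.
       with open(data_path + '.out', 'w') as f:
           for out in outputs:
               record = {'prompt': out.prompt, 'output': out.outputs[0].text}
               f.write(json.dumps(record) + '\n')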

Scale Out to Multiple Nodes
---------------------------

To scale out the inference to multiple machines, we can group the data chunks into multiple pieces so that each machine can process one piece.
concretevitamin (Member):

Reduce concepts: yet another one here, 'piece'!
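
To make the scale-out step concrete, here is a minimal sketch of static per-node assignment using SkyPilot's SKYPILOT_NODE_RANK and SKYPILOT_NUM_NODES environment variables (the chunk layout is an assumption, and batch_inference refers to the function sketched above):

.. code-block:: python

   import glob
   import os

   # SkyPilot sets these on every node of a multi-node task.
   node_rank = int(os.environ['SKYPILOT_NODE_RANK'])
   num_nodes = int(os.environ['SKYPILOT_NUM_NODES'])

   # Static assignment: node i processes every num_nodes-th chunk.
   chunks = sorted(glob.glob('data/*.jsonl'))
   for chunk in chunks[node_rank::num_nodes]:
       batch_inference(llm, chunk)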
