
[BUG]: KernelMemory - Simultaneous execution of AskDocument & ImportDocument #918

Open
aropb opened this issue Sep 3, 2024 · 18 comments

@aropb

aropb commented Sep 3, 2024

Description

System.ObjectDisposedException: Cannot access a disposed object.
Object name: 'Cannot use this SafeLLamaContextHandle - it has been disposed'.
at LLama.Native.SafeLLamaContextHandle.ThrowIfDisposed()
at LLama.Native.SafeLLamaContextHandle.GetLogitsIth(Int32 i)
at LLama.StatelessExecutor.InferAsync(String prompt, IInferenceParams inferenceParams, CancellationToken cancellationToken)+MoveNext()
at LLama.StatelessExecutor.InferAsync(String prompt, IInferenceParams inferenceParams, CancellationToken cancellationToken)+System.Threading.Tasks.Sources.IValueTaskSource<System.Boolean>.GetResult()
at Microsoft.KernelMemory.Handlers.SummarizationHandler.SummarizeAsync(String content, IContext context)
at Microsoft.KernelMemory.Handlers.SummarizationHandler.SummarizeAsync(String content, IContext context)
at Microsoft.KernelMemory.Handlers.SummarizationHandler.InvokeAsync(DataPipeline pipeline, CancellationToken cancellationToken)
at Microsoft.KernelMemory.Pipeline.InProcessPipelineOrchestrator.RunPipelineAsync(DataPipeline pipeline, CancellationToken cancellationToken)
at Microsoft.KernelMemory.Pipeline.BaseOrchestrator.ImportDocumentAsync(String index, DocumentUploadRequest uploadRequest, IContext context, CancellationToken cancellationToken)

Reproduction Steps

Run KernelMemory's AskDocument and ImportDocument at the same time.

Environment & Configuration

  • Operating system: Windows 11
  • .NET runtime version: 8.0.8
  • LLamaSharp version: 0.16.0
  • CUDA version (if you are using cuda backend): 12
  • CPU & GPU device: CPU or GPU, Intel Core Ultra 9 & RTX 3090

Known Workarounds

@martindevans
Member

I haven't worked much on the kernel memory side of things, but from that call stack I can see it's trying to use an executor after it has been disposed inside the SummarizeAsync method.

I would guess that you're not meant to make a second call while a previous async call is still ongoing.

Right now I don't think anyone is working much on the KM integration, so if you'd like to dig in and PR a fix (e.g. a check that throws if you try to make a second call, or actually a true fix to allow simultaneous calls) I'll be happy to review it :)
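For illustration, the "check that throws" option could be sketched like this. `InferenceGuard` and its `Releaser` are hypothetical helpers invented for this example, not existing LLamaSharp types:

```csharp
using System;
using System.Threading;

// Hypothetical guard: fail fast if a second inference call starts while a
// previous one is still in flight on the same executor.
public sealed class InferenceGuard
{
    private int _busy; // 0 = idle, 1 = a call is in progress

    public IDisposable Enter()
    {
        // Atomically claim the executor; throw if another call already holds it.
        if (Interlocked.CompareExchange(ref _busy, 1, 0) != 0)
            throw new InvalidOperationException(
                "A previous inference call is still running on this executor.");
        return new Releaser(this);
    }

    private sealed class Releaser : IDisposable
    {
        private readonly InferenceGuard _owner;
        public Releaser(InferenceGuard owner) => _owner = owner;
        public void Dispose() => Interlocked.Exchange(ref _owner._busy, 0);
    }
}
```

An executor could call `using var _ = guard.Enter();` at the top of its inference method, turning a silent use-after-dispose into an immediate, descriptive exception.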

@aropb
Author

aropb commented Sep 4, 2024

I see a lot of issues with KM and LLamaSharp. RAG is a very popular technology; it would be very good to get this integration into shape.

Of all the problems in 0.16.0, I am currently most concerned about the 2-fold drop in GPU performance :(

@aropb
Author

aropb commented Sep 4, 2024

I haven't worked much on the kernel memory side of things, but from that call stack I can see it's trying to use an executor after it has been disposed inside the SummarizeAsync method.

KM doesn't do anything with the context at all: it neither creates nor deletes it.

KM ---->
await foreach (string token in textGenerator.GenerateTextAsync(filledPrompt, new TextGenerationOptions()).ConfigureAwait(false))

LLama ---->

Please explain: why is the context created in the constructor and immediately disposed, and then re-created on every call to InferAsync?

public StatelessExecutor(LLamaWeights weights, IContextParams @params, ILogger? logger = null)
{
    // ...
    Context = _weights.CreateContext(_params, logger);
    Context.Dispose();
}


public async IAsyncEnumerable<string> InferAsync(string prompt, IInferenceParams? inferenceParams = null, 
    [EnumeratorCancellation] CancellationToken cancellationToken = default(CancellationToken))
{
    if (!Context.NativeHandle.IsClosed)
    {
        Context.Dispose();
    }

    using LLamaContext context = _weights.CreateContext(_params, _logger);
    Context = context;

    // ...
}

@martindevans
Member

martindevans commented Sep 4, 2024

Please explain why the context is created in the constructor and immediately deleted

I can't remember why that was done, I think it was added ages ago as a hacky fix to work around an issue. I've been planning to overhaul the executors soon and hopefully that will get removed.

then re-created each time in InferAsync?

The StatelessExecutor creates a context for a request, uses it for that request, and then destroys it. That ensures it is completely stateless (nothing can survive between requests if the entire context is destroyed). Again, I think this could be done better now and it'll hopefully be replaced in an overhaul soon.

As to the bug, I don't think StatelessExecutor is safe to use concurrently (you'd probably end up with 2 requests trying to use the same context, and then when one finishes it will dispose that context, which kills the other ongoing work).
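The other direction, allowing simultaneous callers by queueing them, can be sketched with a SemaphoreSlim. This is an illustration only: `SerializedExecutor` is a made-up wrapper, and `RunOnceAsync` stands in for the real create-context / infer / dispose-context cycle:

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical wrapper: callers queue behind a one-slot semaphore, so only one
// request at a time can touch the underlying context.
public sealed class SerializedExecutor
{
    private readonly SemaphoreSlim _gate = new SemaphoreSlim(1, 1);

    public async Task<string> InferAsync(string prompt, CancellationToken ct = default)
    {
        await _gate.WaitAsync(ct); // wait for any in-flight call to finish
        try
        {
            return await RunOnceAsync(prompt, ct);
        }
        finally
        {
            _gate.Release(); // the context is now free for the next caller
        }
    }

    // Placeholder for the real inference call.
    private static Task<string> RunOnceAsync(string prompt, CancellationToken ct)
        => Task.FromResult("echo: " + prompt);
}
```

With this shape, a second caller waits instead of racing the first one to the shared context.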

@aropb
Author

aropb commented Sep 4, 2024

The solution to the problem has been found.

But no: even with this solution, some garbage still gets into the response.

In LLamaSharpTextGenerator.GenerateTextAsync, the executor needs to be created on every call:

    public IAsyncEnumerable<string> GenerateTextAsync(string prompt, TextGenerationOptions options, CancellationToken cancellationToken = default)
    {
        var executor = new StatelessExecutor(_weights, _modelParams);
        return executor.InferAsync(prompt, _inferenceParams, cancellationToken);
    }

For example:
https://github.com/microsoft/kernel-memory/blob/9b63662ed706e42ca4eb7c137ca0de60460e163e/extensions/LlamaSharp/LlamaSharp/LlamaSharpTextGenerator.cs

    public IAsyncEnumerable<string> GenerateTextAsync(string prompt, TextGenerationOptions options, CancellationToken cancellationToken = default)
    {
        var executor = new InteractiveExecutor(_context);
        return executor.InferAsync(prompt, _inferenceParams, cancellationToken);
    }

This option gives an error:
llama_decode failed: 'NoKvSlot'

:(

@martindevans
Member

llama_decode failed: 'NoKvSlot' just means you don't have enough context space to process your whole request. Try testing with a small request. I think you're on to something with the executor.

@aropb
Author

aropb commented Sep 4, 2024

Please explain: what does 'context space' mean?

@martindevans
Member

When a context is created, you choose how many tokens can fit into it (by setting ContextSize when the context is created). Every single token needs to fit into that space, so if you create a context of e.g. 1000 tokens, you cannot have more than 1000 tokens processed by the model in total.
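The budget rule above is simple arithmetic: prompt tokens plus tokens still to be generated must fit within the size chosen at context creation. `ContextBudget` is a hypothetical helper written for this example, not LLamaSharp API:

```csharp
using System;

// Hypothetical helper illustrating the context-space rule: with a context of
// e.g. 1000 tokens, prompt tokens + generated tokens must total at most 1000.
public static class ContextBudget
{
    public static bool Fits(int promptTokens, int maxNewTokens, uint contextSize)
        => promptTokens + maxNewTokens <= contextSize;
}
```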

@aropb
Author

aropb commented Sep 5, 2024

I set ContextSize = 8192

@aropb
Author

aropb commented Sep 5, 2024

The problem is that a single Context is created in the constructor, and several InteractiveExecutors then share it. That's the problem. In general, I see large gaps in the Executor design.

@aropb
Author

aropb commented Sep 5, 2024

Unfortunately, in its current form, none of this works with KM in multithreaded scenarios: when several users work with the application at the same time, or while ImportDocument is running.

@aropb
Author

aropb commented Sep 9, 2024

I've tried everything, it seems.
Even if I use two different StatelessExecutor objects (I use Services.AddTransient() for text generation), garbage still gets into the responses when the calls run at the same time. I don't understand why this is happening. I need help.


My code:

public sealed class MyTextGenerator : ITextGenerator, IDisposable
{
    private readonly LLamaWeights _weights; 
    private readonly ModelParams _modelParams;
    private readonly InferenceParams _inferenceParams;
    private LLamaContext _context;

    public int MaxTokenTotal { get; }

    public MyTextGenerator(LLamaWeights weights, ModelParams modelParams, InferenceParams inferenceParams)
    {
        MaxTokenTotal = (int)modelParams.ContextSize;
        _inferenceParams = inferenceParams;
        _weights = weights;
        _modelParams = modelParams;
    }

    public void Dispose()
    {
        _context?.Dispose();
    }

    private LLamaContext GetContext()
    {
        // To save GPU memory, the context is created lazily when first needed,
        // rather than in the constructor. A single context is shared.

        _context ??= _weights.CreateContext(_modelParams);
        return _context;
    }

    public IAsyncEnumerable<string> GenerateTextAsync(string prompt, TextGenerationOptions options, CancellationToken cancellationToken = default)
    {
        // StatelessExecutor creates its own context internally, so two contexts
        // can exist at once; that doubles GPU memory use, which is not optimal!

        var executor = new StatelessExecutor(_weights, _modelParams);
        return executor.InferAsync(prompt, _inferenceParams, cancellationToken);
    }

    public int CountTokens(string text) => GetContext().Tokenize(text: text, addBos: true, special: true).Length;

    public IReadOnlyList<string> GetTokens(string text)
    {
        var context = GetContext();
        LLamaToken[] numericTokens = context.Tokenize(text: text, addBos: true, special: true);
        var decoder = new StreamingTokenDecoder(context);

        return numericTokens
            .Select(x => { decoder.Add(x); return decoder.Read(); })
            .ToList();
    }
}

// Another implementation option GenerateTextAsync

    public IAsyncEnumerable<string> GenerateTextAsync(string prompt, TextGenerationOptions options, CancellationToken cancellationToken = default)
    {
        var executor = new InteractiveExecutor(GetContext());
        return executor.InferAsync(prompt, _inferenceParams, cancellationToken);
    }

Even on the second consecutive call:

llama_get_logits_ith: invalid logits id 0, reason: batch.logits[0] != true
llama_decode failed: 'NoKvSlot'

This error can be corrected:

    private LLamaContext GetContext()
    {
        _context ??= _weights.CreateContext(_modelParams);
        _context.NativeHandle.KvCacheClear();
        return _context;
    }

But the simultaneous call still doesn't work:

CUDA error: operation not permitted when stream is capturing
CUDA error: operation failed due to a previous error during capture

I don't know what else to try; surely I shouldn't have to load a separate model for each call.

@zsogitbe
Contributor

I am not surprised that your code does not work. You need to reimplement it completely. It makes no sense to work with the context in several functions while using a StatelessExecutor! StatelessExecutor is a very good option for KernelMemory, but you must not rely on the context: it is created during inference and then destroyed. This is also best for multithreaded applications, because the GPU memory is released earlier.

Furthermore, multithreading with a GPU is more difficult in LLamaSharp than it seems at first sight, because GPU memory is always allocated on the first GPU first. If no space is left there after a second request fills GPU memory, the first request cannot allocate memory for its context and you will get NoKvSlot errors (not enough GPU memory for the context). The best implementation currently depends on how many GPUs you have. If you have only one small GPU, you must ensure there is only one inference call on the GPU at a time; its memory must be released before the next call starts. If you have several GPUs, you can place inference calls on different GPUs. If you have one large GPU, you might place, for example, two inference calls on it at once. In short, you must build some relatively complex GPU-scheduling logic into your code.

We will not be able to help you with this, but if you understand the code, it should be relatively straightforward. First review some existing examples, then build a few of your own with different options.
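The "only N inference calls at once" policy described above could be sketched as follows. `GpuSlotLimiter` is a hypothetical helper, and `maxConcurrent` is an assumption you would tune to your GPU memory (1 for a single small GPU, perhaps 2 for a large one):

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical limiter: cap the number of simultaneous inference calls so the
// GPU never has to hold more contexts than its memory allows.
public sealed class GpuSlotLimiter
{
    private readonly SemaphoreSlim _slots;

    public GpuSlotLimiter(int maxConcurrent)
        => _slots = new SemaphoreSlim(maxConcurrent, maxConcurrent);

    public async Task<T> RunAsync<T>(Func<CancellationToken, Task<T>> inference,
                                     CancellationToken ct = default)
    {
        await _slots.WaitAsync(ct);   // wait until a GPU slot is free
        try
        {
            return await inference(ct);
        }
        finally
        {
            _slots.Release();         // slot (and its GPU memory) is free again
        }
    }
}
```

Every call that creates a context would go through `RunAsync`, so context creation and disposal can never overlap beyond the configured limit.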

@aropb
Author

aropb commented Sep 10, 2024

This code is written based on:
https://github.com/SciSharp/LLamaSharp/blob/master/LLama.KernelMemory/LlamaSharpTextGenerator.cs

I started rewriting this code because it doesn't work when called from different threads at the same time.

public sealed class LlamaSharpTextGenerator
    : ITextGenerator, IDisposable
{
    private readonly StatelessExecutor _executor;

    private readonly LLamaWeights _weights;
    private readonly bool _ownsWeights;

    private readonly LLamaContext _context;
    private readonly bool _ownsContext;

    private readonly InferenceParams? _defaultInferenceParams;

    public int MaxTokenTotal { get; }

    /// <summary>
    /// Initializes a new instance of the <see cref="LlamaSharpTextGenerator"/> class from reused weights, context and executor.
    /// If executor is not specified, then a StatelessExecutor will be created with `context.Params`. So far only `StatelessExecutor` is expected.
    /// </summary>
    /// <param name="weights">A LLamaWeights object.</param>
    /// <param name="context">A LLamaContext object.</param>
    /// <param name="executor">An executor. Currently only StatelessExecutor is expected.</param>
    /// <param name="inferenceParams">Inference parameters to use by default</param>
    public LlamaSharpTextGenerator(LLamaWeights weights, LLamaContext context, StatelessExecutor? executor = null, InferenceParams? inferenceParams = null)
    {
        _weights = weights;
        _context = context;
        _executor = executor ?? new StatelessExecutor(_weights, _context.Params);
        _defaultInferenceParams = inferenceParams;
        MaxTokenTotal = (int)_context.ContextSize;
    }

    /// <inheritdoc/>
    public void Dispose()
    {
        if (_ownsWeights)
        {
            _weights.Dispose();
        }
        if (_ownsContext)
        {
            _context.Dispose();
        }
    }

    /// <inheritdoc/>
    public IAsyncEnumerable<string> GenerateTextAsync(string prompt, TextGenerationOptions options, CancellationToken cancellationToken = default)
    {
        return _executor.InferAsync(prompt, OptionsToParams(options, _defaultInferenceParams), cancellationToken: cancellationToken);
    }

    private static InferenceParams OptionsToParams(TextGenerationOptions options, InferenceParams? defaultParams)
    {
        if (defaultParams != null)
        {
            return defaultParams with
            {
                AntiPrompts = defaultParams.AntiPrompts.Concat(options.StopSequences).ToList().AsReadOnly(),
                Temperature = (float)options.Temperature,
                MaxTokens = options.MaxTokens ?? defaultParams.MaxTokens,
                FrequencyPenalty = (float)options.FrequencyPenalty,
                PresencePenalty =  (float)options.PresencePenalty,
                TopP = (float)options.NucleusSampling
            };
        }
        else
        {
            return new InferenceParams
            {
                AntiPrompts = options.StopSequences.ToList().AsReadOnly(),
                Temperature = (float)options.Temperature,
                MaxTokens = options.MaxTokens ?? 1024,
                FrequencyPenalty = (float)options.FrequencyPenalty,
                PresencePenalty = (float)options.PresencePenalty,
                TopP = (float)options.NucleusSampling,
            };
        }
    }

    /// <inheritdoc/>
    public int CountTokens(string text) => _context.Tokenize(text, special: true).Length;

    /// <summary>
    /// Get the list of tokens for the input text
    /// </summary>
    /// <param name="text">Input string to be tokenized</param>
    /// <returns>Read-only list of tokens for the input test</returns>
    /// <remarks>
    /// It throws if text is null. The result includes the empty stop token because addBos is left true, to be consistent with the CountTokens implementation.</remarks>
    /// <see cref="CountTokens(string)"/>
    public IReadOnlyList<string> GetTokens(string text)
    {
        /* see relevant unit tests for important implementation notes regarding unicode */
        var numericTokens = _context.Tokenize(text, special: true);
        var decoder = new StreamingTokenDecoder(_context);
        return numericTokens
            .Select(x => { decoder.Add(x); return decoder.Read(); })
            .ToList();
    }
}

@zsogitbe
Contributor

Again, it is more difficult than you think. The LLamaSharp library does not provide every possible option; sometimes you need to extend the library for your needs, and sometimes you should suggest an improvement and post a PR.

This is one of the extensions you will need in order to use the StatelessExecutor in LlamaSharpTextGenerator.

public static IKernelMemoryBuilder WithLLamaSharpDefaults(this IKernelMemoryBuilder builder, LLamaSharpConfig config, LLamaWeights? weights=null)
{
	var parameters = new ModelParams(config.ModelPath)
	{
		ContextSize = config.ContextSize ?? 2048,
		Seed = config.Seed ?? 0,
		GpuLayerCount = config.GpuLayerCount ?? 20,
		Embeddings = true,
		MainGpu = config.MainGpu,
		SplitMode = config.SplitMode,
	};

	if (weights == null)
	{
		weights = LLamaWeights.LoadFromFile(parameters);
	}

	builder.WithLLamaSharpTextEmbeddingGeneration(new LLamaSharpTextEmbeddingGenerator(config, weights));
	builder.WithLLamaSharpTextGeneration(new LlamaSharpTextGenerator(config, weights));
	return builder;
}

Here, we do not pass the context to LlamaSharpTextGenerator, because it will use the StatelessExecutor!

You will also need to remove the context from LlamaSharpTextGenerator and create a special version of it that uses StatelessExecutor. It makes no sense to take the context as a constructor parameter when the StatelessExecutor is used (delete all mentions of the context). You could then also submit a PR to correct any illogical code in the base library.

@aropb
Author

aropb commented Sep 10, 2024

I've already rewritten all that, of course. Believe me, I have delved very deeply into the code :)

@aropb
Author

aropb commented Sep 10, 2024

I rewrote the code:

public sealed class MyTextGenerator : ITextGenerator, IDisposable
{
    private readonly ModelParams _modelParams;
    private readonly InferenceParams _inferenceParams;
    private readonly LLamaWeights _weights;

    public int MaxTokenTotal { get; }

    public MyTextGenerator(LLamaWeights weights, ModelParams modelParams, InferenceParams inferenceParams)
    {
        MaxTokenTotal = (int)modelParams.ContextSize;
        _inferenceParams = inferenceParams;
        _weights = weights;
        _modelParams = modelParams;
    }

    public void Dispose()
    {
    }

    public IAsyncEnumerable<string> GenerateTextAsync(string prompt, TextGenerationOptions options, CancellationToken cancellationToken = default)
    {
        var executor = new StatelessExecutor(_weights, _modelParams);
        return executor.InferAsync(prompt, _inferenceParams, cancellationToken);
    }

    public int CountTokens(string text)
    {
        using var context = _weights.CreateContext(_modelParams);
        return context.Tokenize(text: text, addBos: true, special: true).Length;
    }

    public IReadOnlyList<string> GetTokens(string text)
    {
        using var context = _weights.CreateContext(_modelParams);

        LLamaToken[] numericTokens = context.Tokenize(text: text, addBos: true, special: true);
        var decoder = new StreamingTokenDecoder(context);

        return numericTokens
            .Select(x => { decoder.Add(x); return decoder.Read(); })
            .ToList();
    }
}

Using Services.AddTransient:

    public static IKernelMemoryBuilder WithTextGeneration(this IKernelMemoryBuilder builder, LLamaWeights weights, ModelParams modelParams, InferenceParams inferenceParams)
    {
        builder.Services.AddTransient<ITextGenerator, MyTextGenerator>(serviceProvider => new MyTextGenerator(weights, modelParams, inferenceParams));
        return builder;
    }

When calling from different threads at the same time, such errors:

CUDA error: operation not permitted when stream is capturing
SafeLLamaContextHandle.llama_new_context_with_model => SafeLLamaContextHandle.llama_new_context_with_model
CUDA error: operation failed due to a previous error during capture
SafeLLamaContextHandle.llama_decode => SafeLLamaContextHandle.llama_decode
current device: 0, in function ggml_backend_cuda_buffer_clear at D:\a\LLamaSharp\LLamaSharp\ggml\src\ggml-cuda.cu:535
SafeLLamaContextHandle.llama_new_context_with_model => SafeLLamaContextHandle.llama_new_context_with_model
Error: current device: 0, in function ggml_backend_cuda_graph_compute at D:\a\LLamaSharp\LLamaSharp\ggml\src\ggml-cuda.cu:2632
SafeLLamaContextHandle.llama_decode => SafeLLamaContextHandle.llama_decode
cudaStreamEndCapture(cuda_ctx->stream(), &cuda_ctx->cuda_graph->graph)
SafeLLamaContextHandle.llama_decode => SafeLLamaContextHandle.llama_decode
cudaDeviceSynchronize()
SafeLLamaContextHandle.llama_new_context_with_model => SafeLLamaContextHandle.llama_new_context_with_model

@aropb
Author

aropb commented Sep 10, 2024

You will also need to remove the context from LlamaSharpTextGenerator and create a special version of it that uses StatelessExecutor. It makes no sense to take the context as a constructor parameter when the StatelessExecutor is used (delete all mentions of the context). You could then also submit a PR to correct any illogical code in the base library.

It wasn't in my code!
