
Enabling new OpenAI models for MAGE #1587

Conversation

jankovidakovic
Contributor

Enabling new OpenAI models in the MAGE Code Agent

This PR enables MAGE to be used with all non-legacy OpenAI chat models.

Models

The list of models is taken from the OpenAI website, available here: https://platform.openai.com/docs/models. The list is likely to change (e.g. models get deprecated, new models are added), so the list of models that MAGE supports will probably need further updates in the future.

One additional idea would be to dynamically generate the list of supported models by querying the API. Although we already do something (kinda) similar when checking whether the GPT-4 model is available, this suggestion seems more complicated to implement and it's unclear if it's actually needed at this time.
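
For reference, here is a rough sketch of what the extended Model enum in src/Wasp/AI/OpenAI/ChatGPT.hs might look like (the constructor names and the exact set of models shown here are illustrative, not necessarily the ones in this PR):

data Model
  = GPT_4
  | GPT_4_1106_Preview
  | GPT_3_5_Turbo
  | GPT_3_5_Turbo_0613
  deriving (Eq, Show, Enum, Bounded)

-- Maps each enum value to the model name string that the OpenAI API expects.
modelOpenAiId :: Model -> String
modelOpenAiId GPT_4 = "gpt-4"
modelOpenAiId GPT_4_1106_Preview = "gpt-4-1106-preview"
modelOpenAiId GPT_3_5_Turbo = "gpt-3.5-turbo"
modelOpenAiId GPT_3_5_Turbo_0613 = "gpt-3.5-turbo-0613"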

Plan generation vs the fixing step

Note that the model configuration is still enabled only for the later steps of ChatGPT prompting (fixing the generated Wasp/JS files). The initial step (plan generation) is fixed to use GPT-4. This makes sense (for now) as this is the most powerful model. In the future, though, we might want to expose this to the CLI configuration as well.

@jankovidakovic
Contributor Author

BTW, I've thought about writing some unit tests for the feature, but I was unsure how to best do this.

I noticed that there aren't really any unit tests for most of MAGE's functionality. Furthermore, the kind of test I'm going for seems kinda complicated to write, since it would require the OpenAI API key to be set during testing and would have to send HTTP requests.

Also, I've empirically verified (through inspecting the stdout) that the new models are being used when the API is queried.

Considering all this, does it make more sense to skip writing unit tests for something like this?

@Martinsos
Member

Right, so there are multiple steps:

  1. Generation of Plan.
  2. Generation of everything else (+ fixing steps).

For (1) we use GPT_4 and that is fixed.
For (2) we use whatever they tell us to use, with 3.5 being the default.

You expanded our codebase with knowledge about more GPT models, which is great.
This allows users to pick any of these when specifying which model they want to use for (2).

One more thing we want to do is to start using the newest GPT-4 for step (1) by default. I believe you can set this up in Plan.hs, and maybe some more changes are needed, not sure -> the logic we have somewhere that falls back from GPT-4 to 3.5 if 4 is not available might also need updating, I am not sure without checking it more thoroughly.

Actually, I imagined this task to consist of two things:

  1. Update our default GPT-4 model to be the newest one (4 turbo, or whatever its name is).
  2. Allow them to specify any OpenAI model when specifying it via CLI (even if it is not on our list of models). If this is too complex, then we can just enrich that list with models as you did, so they have a choice of newer models.

What you did is a step in the right direction, as it somewhat satisfies (2). What would be great now is to also do (1), and maybe make (2) even more flexible as I described.

@jankovidakovic
Contributor Author

jankovidakovic commented Dec 2, 2023

When you say the "newest GPT-4 model", I assume you're referring to gpt-4-1106-preview.

From my experience with using the OpenAI API for various other things, the newest GPT-4 is:

  • cheaper (3 times cheaper for input tokens, 2 times cheaper for output tokens)
  • allows for much longer input (128k is the limit for input+output tokens, but output tokens are capped at 4096 tokens)
  • faster (generates output tokens significantly faster)
  • generally worse than gpt-4

To clarify the last point, in general gpt-4-1106-preview gives lower quality outputs than gpt-4.
The new model has a harder time attending to the specific information provided in the input, and respecting the fine-grained output structure.

This is not detrimental when you're summarizing books of 100k tokens, where the output doesn't have a predefined structure.
However, we are generating structured plans, and Wasp and JS code. For this reason, my intuition is that it would make little sense for us to pivot to using the newest model, simply because it will probably give worse results.

To verify this, I'll run some qualitative tests with using gpt-4-1106-preview for generating the plan.

  • I will generate TodoApp and MyPlants multiple times
  • I will report some statistics about errors occurring in the output
  • I will summarize the price difference between using the older and newer GPT-4 model

Does that sound okay?

@jankovidakovic
Contributor Author

Regarding your second point, I looked more into the option of dynamically fetching (via HTTP) the list of currently available models, and then offering only those models to the user.

The problem, however, is that there's currently no way to only get a list of exclusively chat-completion models from OpenAI.

The endpoint that lists the models (https://api.openai.com/v1/models) returns all available OpenAI models (including image models, legacy GPT models, embedding and voice models, etc.).

We only use chat-completion models. We would therefore have to resort to some heuristic for filtering this list of models (e.g. some regex matching). Any update to the list of available models would also necessitate an update to this heuristic.
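
Just to illustrate, such a heuristic might look something like this (a rough sketch only; it assumes we've already decoded the model ids from the /v1/models response into a list of strings):

import Data.List (isInfixOf, isPrefixOf)

-- Keep only the model ids that look like chat-completion models.
isLikelyChatModel :: String -> Bool
isLikelyChatModel modelId =
  any (`isPrefixOf` modelId) ["gpt-4", "gpt-3.5-turbo"]
    && not ("instruct" `isInfixOf` modelId) -- e.g. gpt-3.5-turbo-instruct is not a chat model

filterChatModels :: [String] -> [String]
filterChatModels = filter isLikelyChatModel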

When compared to explicitly keeping a list of chat-completion models, the dynamic approach would only give us an over-engineered way to offer some obscure legacy GPT model to the 0.01% of users who actually have a valid reason to try it. It's also more difficult to implement and to maintain.

This is quite an opinionated comment, but I'm open to further discussion about it. Let me know what you think!

@Martinsos
Member

Martinsos commented Dec 6, 2023

GPT4 vs GPT4-1106

Ok @jankovidakovic makes sense -> don't spend too much time on the financial comparison, that is quite clear to me, but do some tests to see if there is any performance loss. If there is a significant impact, we will not adopt the new one for now. If there is not, we will adopt it.

Allow users to dynamically choose the GPT model

Sorry, I should have explained it better: I didn't mean that we dynamically fetch the list of chat models from OpenAI (I agree we shouldn't bother with that for now). What I meant is that the user can provide whatever string as the name of the chat model, and we will try to use it. That might sound weird, because how could they provide it via the web app? But I wasn't referring to the web app, I was referring to usage via the CLI, where they can actually specify the name of the model as a string. However, right now we check if that string is a model we know of, and reject it if we don't know about it. We could instead allow any model.

This is how they can specify the model via CLI currently:

wasp new-ai:disk MyAwesomeApp "Description of my awesome app." "{ \"defaultGptModel\": \"gpt-4\" }"

@Martinsos Martinsos closed this Dec 6, 2023
@Martinsos
Member

Whoops I misclicked and closed the PR, reopening it.

@Martinsos Martinsos reopened this Dec 6, 2023
@jankovidakovic
Contributor Author

Let me just clarify what you mean here.

Currently, the user of our CLI can only specify a model string that has a valid representation in our Model enum (defined in src/Wasp/AI/OpenAI/ChatGPT.hs). If the user specifies an invalid string, the application fails (at the MAGE level) with an Invalid GPT model error.

Instead of this, the request for change here is to allow the user to specify any string as the name of the OpenAI model?

So, for example, if the user writes {"defaultGptModel": "obviouslynotavalidmodelname"}, we want the wasp-ai app to fail at the OpenAI API level (where the API probably responds with 400 Bad Request or something similar)?
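
Roughly, the difference between the two behaviors would be something like this (just a sketch to make sure we're on the same page, not the actual MAGE code; the function names are made up, and I'm assuming a Model enum with a modelOpenAiId :: Model -> String mapping as sketched in the PR description):

import Data.List (find)

-- Current behavior (sketch): only accept strings that map to our Model enum.
parseKnownModel :: String -> Either String Model
parseKnownModel s =
  case find ((== s) . modelOpenAiId) [minBound .. maxBound] of
    Just model -> Right model
    Nothing -> Left ("Invalid GPT model: " ++ s)

-- Proposed behavior (sketch): accept any string and let the OpenAI API reject unknown models.
parseAnyModel :: String -> Either String String
parseAnyModel = Right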

@Martinsos
Member

> Let me just clarify what you mean here.
>
> Currently, the user of our CLI can only specify a model string that has a valid representation in our Model enum (defined in src/Wasp/AI/OpenAI/ChatGPT.hs). If the user specifies an invalid string, the application fails (at the MAGE level) with an Invalid GPT model error.
>
> Instead of this, the request for change here is to allow the user to specify any string as the name of the OpenAI model?
>
> So, for example, if the user writes {"defaultGptModel": "obviouslynotavalidmodelname"}, we want the wasp-ai app to fail at the OpenAI API level (where the API probably responds with 400 Bad Request or something similar)?

Yup, that would be it. We had users asking for this, and I think it makes sense. It gives them some freedom in using the models they want to use.

Although, can it really help them? I mean, we know which model works the best, right? So why would they pick some other model? It makes sense only in the case when a new OpenAI model comes out and we don't yet have it in our list of models (our enum). Or if they have their own fine-tuned model up there on OpenAI, but I don't think anybody will have one for Wasp.

Ok, so you know what, let's prioritize and let's do the following:

  1. Test which GPT4 is now the best for us and use that one for planning.
  2. What you did with offering additional models is enough for now, I think. We don't need to offer them complete freedom here. We can do that in the future if needed.

@jankovidakovic jankovidakovic force-pushed the feature/mage-add-new-models branch from 07fd15c to 2559d30 on December 11, 2023, 17:19
@jankovidakovic
Contributor Author

I think that the users probably asked for this because they wanted to use the newer OpenAI chat models.

Once this PR is merged, all of these models will be enabled. Therefore, I agree with your point -- there is no need to support other strings as models for now. The additional note regarding fine-tuned OpenAI models is interesting, though; we might want to look into that in the future.

@jankovidakovic
Contributor Author

jankovidakovic commented Dec 11, 2023

In the meantime, I've added the option to specify the GPT model used for the "Plan" part.

The model can be specified in the "planGptModel" input field in the CLI-supplied JSON. By default, GPT-4 is used (as before).
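
For example (assuming the same JSON format as for defaultGptModel; the model names here are just examples):

wasp new-ai:disk MyAwesomeApp "Description of my awesome app." "{ \"planGptModel\": \"gpt-4-1106-preview\", \"defaultGptModel\": \"gpt-4\" }"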

I don't necessarily think that we should expose this to the web app, but at least it enables us (the developers) to change the model versions freely without needing to run cabal build every time.
Still, is there some place in the documentation to add this?


@Martinsos Martinsos left a comment


Nice, we are very close, but I left some more comments!

One more thought: there is a piece of our code where we check if users have GPT-4 available, and if not, we fall back to using GPT-3.5 for planning, if I am correct. That made sense when GPT-4 was just coming out, but now everybody has access, so it doesn't make much sense anymore. Therefore, I would suggest we also remove that logic; it is redundant. You can find it in src/Wasp/AI/CodeAgent.hs. We can do this as a separate PR, or as part of this PR if you want, whichever you like better.

Member


I found two things confusing regarding naming:

  1. planGptModel vs defaultGptModel sounds a bit confusing; it is not clear what default means -> what is this one used for? plan is pretty clear, but default is a bit ambiguous.
  2. In the functions below that give us defaultChatGptParams and defaultChatGptParamsForPlan, the word default is starting to lose its meaning, and it was a bit hard for me to figure out the relationship between these and defaultGptModel, and so on.

We had some of these problems already before, but now they are getting emphasized with these new changes, and I think we should look to resolve them.

The fact is, we have two GPT models: one for generating the plan, another for generating/fixing code. We want each one to be specifiable via the CLI. We also want to have defaults for each of them in case they are not specified.

Name ideas for the GPT model that produces plan: planningGptModel, planGptModel. Both seem quite ok.

Name ideas for the GPT model that does everything else:

  • defaultGptModel -> bad, confusing.
  • baseGptModel -> not so bad, not super clear either, but general enough.
  • generationGptModel -> hm, confusing I think; what is "generation"?
  • gptModel -> ha, maybe not even so bad, at least it doesn't push you in the wrong direction, but then maybe baseGptModel is a bit better.
  • codegenGptModel -> hmm, not so bad. Although the planning one also does some codegen. But this is pretty precise, I like that. Maybe too precise? What about fixing? I guess that falls under codegen, so we shouldn't need a separate name for fixing.

I think going for planGptModel and baseGptModel could work well, with codegenGptModel being an alternative, but I think baseGptModel wins by a small bit.

Now that we got rid of default in those names, we can have two variables, defaultPlanGptModel = GPT_4 and defaultBaseGptModel = GPT_3_5_Turbo, and those will make the code clearer.

Then, we can have functions:

baseChatGptParams :: NewProjectDetails -> ChatGPTParams
baseChatGptParams = undefined -- uses the model specified in project details, or defaultGptModel if none was given
  where
    defaultGptModel = GPT_3_5_Turbo

planChatGptParams :: NewProjectDetails -> ChatGPTParams
planChatGptParams = undefined -- same as above, but defaults to GPT_4
  where
    defaultGptModel = GPT_4

fixingChatGptParams :: ChatGPTParams -> ChatGPTParams
fixingChatGptParams = undefined -- tweaks the given params for the fixing step (see below)

I did a bit of a trick here -> I think we can just get by with this one function for fixing, fixingChatGptParams, that takes some other ChatGPTParams and alters them by subtracting 0.2 from their temperature.
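
Something like this is what I have in mind (just a sketch; I'm assuming here that ChatGPTParams stores the temperature in a Maybe field called _temperature, the actual field name might differ):

fixingChatGptParams :: ChatGPTParams -> ChatGPTParams
fixingChatGptParams params =
  -- Reuse the given params, but lower the temperature a bit so fixing is more conservative.
  params {_temperature = subtract 0.2 <$> _temperature params}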

Member


Oh, and let's set defaultGptModel for the base case to GPT_3_5_turbo_0613, right? That way we make sure we don't experience a drop in performance once they make gpt-3.5 point to a newer version.

Member


Maybe we should also pin down the version for GPT-4?

Contributor Author


How about plannerGptModel and coderGptModel? Since we can presume that we have two "separate" GPT-based agents -- one that plans, and one that codes (and fixes code)? If not, I'm gonna go with planGptModel and baseGptModel.

(Review thread on waspc/src/Wasp/AI/OpenAI/ChatGPT.hs, resolved.)
@Martinsos Martinsos deleted the branch wasp-lang:wasp-ai December 11, 2023 19:31
@Martinsos Martinsos closed this Dec 11, 2023