Evaluate prompt quality

Ensuring the quality and reliability of your prompt is critical when implementing Prompt API.

To evaluate your prompt quality, you should develop a comprehensive set of inputs and expected outputs for your use case.

To assess if your prompt meets your quality bar with each Gemini Nano model version, we recommend the following workflow:

Run your evaluation dataset and record the outputs.
Evaluate the results manually or use LLM-as-a-judge.
1. If the evaluation does not meet your quality bar, iterate on your prompt. For example, ask a more powerful LLM such as Gemini Pro to improve the prompt based on the desired output versus the actual output.

Prompt engineering boosts task performance, and iterating on your prompts is key. We recommend at least 3-5 iterations on the above steps. Note that this approach has limits, as optimizations will eventually provide diminishing returns.

Alternatively, to improve prompts quickly at scale, you can use the data-driven optimizer, which can target on-device models such as gemma-3n-e4b-it.

Safety

To ensure Gemini Nano returns safe results for users, multiple layers of protection are implemented to limit harmful or unintended results:

Native model safety: All Gemini models, including Gemini Nano, are trained to be safety-aware out of the box. This means safety considerations are built into the core of the model, not just added as an afterthought.
Safety filters on input and output: Both the input prompt and results generated by the Gemini Nano runtime are evaluated against our safety filters before providing the results to the app. This helps prevent unsafe content from slipping through, without any loss in quality.

However, since each app has its own criteria for what is considered safe content for users, you should assess the safety risks for your app's specific use case and test accordingly.

Additional resources

How good is your AI? Gen AI evaluation at every stage, explained - A blog post that describes how to use the Gen AI evaluation service.
Gen AI evaluation service overview - Documentation that describes how to assess generative AI models to support tasks like model comparisons, prompt improvement, and fine-tuning.
Run a computation-based evaluation pipeline - Documentation for how to evaluate model performance.

Evaluate prompt quality Stay organized with collections Save and categorize content based on your preferences.

Safety

Additional resources

Evaluate prompt quality