Evaluate prompt quality

Ensuring the quality and reliability of your prompt is critical when implementing Prompt API.

To evaluate your prompt quality, you should develop a comprehensive set of inputs and expected outputs for your use case.

To assess if your prompt meets your quality bar with each Gemini Nano model version, we recommend the following workflow:

  1. Run your evaluation dataset and record the outputs.
  2. Evaluate the results manually or use LLM-as-a-judge.
    1. If the evaluation does not meet your quality bar, iterate on your prompt. For example, ask a more powerful LLM such as Gemini Pro to improve the prompt based on the desired output versus the actual output.

Prompt engineering boosts task performance, and iterating on your prompts is key. We recommend at least 3-5 iterations on the above steps. Note that this approach has limits, as optimizations will eventually provide diminishing returns.

Safety

To ensure Gemini Nano returns safe results for users, multiple layers of protection are implemented to limit harmful or unintended results:

  • Native model safety: All Gemini models, including Gemini Nano, are trained to be safety-aware out of the box. This means safety considerations are built into the core of the model, not just added as an afterthought.
  • Safety filters on input and output: Both the input prompt and results generated by the Gemini Nano runtime are evaluated against our safety filters before providing the results to the app. This helps prevent unsafe content from slipping through, without any loss in quality.

However, since each app has its own criteria for what is considered safe content for users, you should assess the safety risks for your app's specific use case and test accordingly.

Additional resources