Report on experiments

  • Treatment campaigns are considered new campaigns and do not copy metrics from control campaigns.

  • Control and treatment campaigns accrue and retain their metrics separately throughout an experiment, even after promotion or graduation.

  • After promotion, changes from the treatment campaign are copied to the control campaign, but metrics remain associated with their original campaigns.

  • Experiment campaigns and base campaigns can be differentiated in search queries using campaign.experiment_type.

There are two main ways to report on experiments:

  • Direct experiment reporting: Query the experiment resource for metrics. This option provides metrics for control and treatment arms in a single response, along with statistical comparison data such as uplift and p-values. This is the only way to report on intra-campaign experiments.
  • Campaign reporting: Query the campaign resource for metrics, using campaign.experiment_type to distinguish between base and experiment campaigns. This option is only available for experiments that use separate control and treatment campaigns, such as system-managed experiments.

This guide focuses primarily on direct experiment reporting, which is compatible with all experiment types that support reporting.

Direct experiment reporting

You can query the experiment resource directly to retrieve performance metrics and statistical comparisons between your control and treatment arms.

Metrics and statistical significance

For core metrics such as clicks, impressions, cost, conversions, and conversion value, the experiment resource provides both treatment metrics (for example, metrics.clicks) and control metrics (for example, metrics.control_clicks) in the same row.

It also provides fields to help you evaluate the statistical significance of any difference between the arms:

  • metrics.*_p_value: The probability that the observed results would occur if the experiment had no actual effect on the metric. A lower p-value indicates higher statistical significance.
  • metrics.*_point_estimate: The estimated percentage lift (positive or negative) in the given metric for the treatment arm compared to the control arm; the quantity being estimated is (treatment / control - 1). Together with margin_of_error, it defines a confidence interval, at a prescribed confidence level, for that difference, with the point estimate at its center.
  • metrics.*_margin_of_error: The radius of the confidence interval, which is centered at point_estimate. It is calculated for a prescribed confidence level, which depends on the experiment type.
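
As a sketch of how these three fields fit together, the following illustrative Python interprets a point estimate and margin of error as a confidence interval and applies a p-value threshold. The function name and the numeric values are invented for illustration, not API output:

```python
def summarize_metric(p_value: float, point_estimate: float,
                     margin_of_error: float, alpha: float = 0.05) -> str:
    """Interprets a metric's stat fields as a confidence interval.

    The interval is centered at point_estimate with radius margin_of_error;
    the quantity estimated is (treatment / control - 1).
    """
    lower = point_estimate - margin_of_error
    upper = point_estimate + margin_of_error
    if p_value > alpha:
        return f"inconclusive (p={p_value:.2f})"
    if lower > 0:
        return f"significant lift: [{lower:+.1%}, {upper:+.1%}]"
    if upper < 0:
        return f"significant decline: [{lower:+.1%}, {upper:+.1%}]"
    return f"significant but interval spans zero: [{lower:+.1%}, {upper:+.1%}]"

# Hypothetical values for illustration only.
print(summarize_metric(p_value=0.03, point_estimate=0.12, margin_of_error=0.05))
# → significant lift: [+7.0%, +17.0%]
```
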

The following core metrics are supported on the experiment resource, each with a treatment value, a control value, and the stat fields listed previously:

  • clicks
  • impressions
  • cost_micros
  • conversions
  • cost_per_conversion
  • conversion_value
  • conversion_value_per_cost

For conversions specifically, the following metric fields are also available:

  • metrics.conversions_absolute_change_p_value: The p-value for the null hypothesis that the experiment has no effect on conversions absolute change. Ranges from 0 to 1.
  • metrics.conversions_absolute_change_point_estimate: The point estimate when estimating the experiment's effect on conversions absolute change.
  • metrics.conversions_absolute_change_margin_of_error: The margin of error when estimating the experiment's effect on conversions absolute change.
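
To make the distinction concrete, here is a small illustrative calculation (invented numbers) contrasting the relative quantity estimated by the generic stat fields, (treatment / control - 1), with the absolute change in conversions, (treatment - control):

```python
control_conversions = 200.0
treatment_conversions = 230.0

# Relative lift: the quantity behind the generic metrics.*_point_estimate fields.
relative_lift = treatment_conversions / control_conversions - 1

# Absolute change: the quantity behind the
# metrics.conversions_absolute_change_* fields.
absolute_change = treatment_conversions - control_conversions

print(f"relative lift: {relative_lift:+.1%}")                  # → relative lift: +15.0%
print(f"absolute change: {absolute_change:+.1f} conversions")  # → absolute change: +30.0 conversions
```
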

For assistance constructing valid queries to the experiment resource, use the Google Ads Query Builder tool.

Example query

The following GAQL query retrieves key metrics for an experiment:

SELECT
  experiment.experiment_id,
  experiment.name,
  experiment.type,
  metrics.clicks,
  metrics.control_clicks,
  metrics.clicks_point_estimate,
  metrics.clicks_margin_of_error,
  metrics.clicks_p_value,
  metrics.conversions,
  metrics.control_conversions,
  metrics.conversions_absolute_change_point_estimate,
  metrics.conversions_absolute_change_margin_of_error,
  metrics.conversions_absolute_change_p_value
FROM experiment
WHERE experiment.experiment_id = EXPERIMENT_ID
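
One way to run a query like this from Python is sketched below. It assumes the Google Ads API Python client library; build_experiment_query and fetch_experiment_rows are illustrative helper names, not part of the API:

```python
def build_experiment_query(experiment_id: int) -> str:
    """Builds a GAQL query for the given experiment ID (illustrative helper)."""
    return f"""
        SELECT
          experiment.experiment_id,
          experiment.name,
          metrics.clicks,
          metrics.control_clicks,
          metrics.clicks_p_value,
          metrics.conversions_absolute_change_p_value
        FROM experiment
        WHERE experiment.experiment_id = {experiment_id}"""


def fetch_experiment_rows(client, customer_id: str, experiment_id: int):
    """Streams result rows for the experiment via GoogleAdsService.

    `client` is an initialized GoogleAdsClient instance.
    """
    ga_service = client.get_service("GoogleAdsService")
    stream = ga_service.search_stream(
        customer_id=customer_id, query=build_experiment_query(experiment_id)
    )
    for batch in stream:
        for row in batch.results:
            yield row
```
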

Interpret results

You can use the p-value, point estimate, and margin of error fields to determine whether your experiment has yielded statistically significant results. For example, if conversions_absolute_change_p_value is below your chosen threshold (for example, 0.05 for 95% confidence) and conversions_absolute_change_point_estimate - conversions_absolute_change_margin_of_error is greater than zero, the treatment arm is performing significantly better than the control arm in terms of conversions.

Here is a Python snippet demonstrating how to evaluate results based on p-value and lift estimates:

# Assumes the helper functions promote_experiment, graduate_experiment, and
# end_experiment are defined elsewhere, and that GoogleAdsRow is imported from
# the versioned google.ads.googleads services types module.
from google.ads.googleads.client import GoogleAdsClient

# Significance threshold; 0.05 corresponds to 95% confidence.
P_VALUE_THRESHOLD = 0.05


def evaluate_experiment(
    client: GoogleAdsClient, customer_id: str, row: GoogleAdsRow
) -> None:
    """Evaluates the performance of the treatment experiment arm.

    Args:
        client: an initialized GoogleAdsClient instance.
        customer_id: a client customer ID.
        row: a GoogleAdsRow containing the experiment arm and metrics.
    """
    metrics = row.metrics
    experiment_resource_name = row.experiment.resource_name

    # 1. Evaluate conversion success as a primary success signal.
    # - Point Estimate: Represents the estimated average lift or difference in conversions.
    # - Margin of Error: Outlines the confidence interval bounds. Note that the margin_of_error provided by the API is calculated for a preset confidence level which is set based on the experiment type.
    # - Lower Bound: (Point Estimate - Margin of Error). If this value is above 0,
    #   we have statistical significance that performance has improved.
    conv_p_value = metrics.conversions_absolute_change_p_value
    conv_lift = metrics.conversions_absolute_change_point_estimate
    conv_error = metrics.conversions_absolute_change_margin_of_error
    conv_lower_bound = conv_lift - conv_error

    if conv_p_value <= P_VALUE_THRESHOLD:
        if conv_lower_bound > 0:
            print(
                "Significant Success: Conversions increased. Even at the lower"
                f" bound, the lift is {conv_lower_bound:.2f}. Promoting"
                " changes."
            )
            promote_experiment(client, customer_id, experiment_resource_name)
            return
        elif (conv_lift + conv_error) < 0:
            print(
                "Significant Decline: Even the upper bound"
                f" ({conv_lift + conv_error:.2f}) is below zero. Ending"
                " experiment."
            )
            end_experiment(client, customer_id, experiment_resource_name)
            return

    # 2. Evaluate click volume as a secondary signal.
    # This is helpful as an early indicator or for lower-volume accounts.
    click_p_value = metrics.clicks_p_value
    click_lift = metrics.clicks_point_estimate
    click_error = metrics.clicks_margin_of_error
    click_lower_bound = click_lift - click_error

    if click_p_value <= P_VALUE_THRESHOLD and click_lower_bound > 0:
        # We have a directional winner: high confidence in more traffic,
        # but not enough data to confirm conversion impact yet.
        print(
            f"Click volume is significantly up (+{click_lift*100:.1f}%). "
            "Graduating treatment for further manual analysis."
        )

        # Graduate if it's a separate campaign test.
        # This keeps the high-volume treatment running independently.
        # Intra-campaign experiments (like ADOPT_BROAD_MATCH_KEYWORDS and
        # ADOPT_AI_MAX) run directly within the base campaign, meaning there is only
        # a single campaign involved and no separate treatment campaign to graduate.
        # Therefore, graduation is not supported for intra-campaign experiments.
        experiment_type_name = row.experiment.type_.name
        if experiment_type_name not in (
            "ADOPT_BROAD_MATCH_KEYWORDS",
            "ADOPT_AI_MAX",
        ):
            graduate_experiment(client, customer_id, experiment_resource_name)
        else:
            print(
                "Intra-campaign trial detected: Graduation is not supported"
                " because there is only one campaign. Continuing to run to"
                " gather more conversion data."
            )
    else:
        # Both conversions and clicks are noisy.
        print(
            "Inconclusive: No significant lift in Conversions"
            f" (p={conv_p_value:.2f}) or Clicks (p={click_p_value:.2f})."
            f" Current estimated lift: {conv_lift:.2f} +/- {conv_error:.2f}."
            " Continue running."
        )
      

Benefits over campaign reporting

Direct experiment reporting offers several advantages over querying campaign reports separately:

  1. Centralized metrics: Retrieve metrics for control and treatment in a single row.
  2. Statistical confidence data: Provides calculated p-values, point estimates, and margins of error.
  3. Efficiency: Removes the need to manually join or compare results from multiple reports.
  4. Intra-campaign support: It is the only way to compare control versus treatment for intra-campaign experiments, where traffic is split within a single campaign.

Campaign reporting

For experiments that create separate treatment campaigns (for example, SEARCH_CUSTOM), you can query the campaign resource and use campaign.experiment_type to identify BASE (control) and EXPERIMENT (treatment) campaigns. This approach is useful if you need to segment metrics at a more granular level (for example, by ad group or keyword) or view campaign metadata not available on the experiment resource. However, it requires you to perform performance comparisons and statistical calculations manually.

You cannot use campaign-level reporting to compare arms for intra-campaign experiments, as the traffic split happens internally within a single campaign. Querying campaign for an intra-campaign experiment only returns aggregated totals.
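
For example, a campaign-level query along these lines separates the arms using campaign.experiment_type (the metric selection here is illustrative; any metrics supported on the campaign resource can be used):

```
SELECT
  campaign.id,
  campaign.name,
  campaign.experiment_type,
  metrics.clicks,
  metrics.conversions
FROM campaign
WHERE campaign.experiment_type IN ('BASE', 'EXPERIMENT')
```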

Best practices

  • Select an appropriate confidence level: A lower confidence level (that is, a higher p-value threshold) can provide directional guidance faster, especially with lower budgets or conversion volumes. 95% confidence (p-value <= 0.05) is the conventional standard and may be better for more accurate results over a longer timeframe.
  • Run experiments for long enough: Run experiments for at least 4 weeks to account for weekly performance cycles, conversion delays, and learning periods.
  • Give time for ramp-up: For campaigns using automated bidding or testing new features, disregard the first 1-2 weeks of data to give time for bidding models and traffic levels to recalibrate to the split.
  • Use 50/50 splits: A 50/50 traffic split is generally the fastest way to achieve statistically significant results.
  • Schedule in advance: Set your experiment start date 3-7 days in the future to give time for ad review and approval processes.
  • One experiment at a time: You can only run one experiment per campaign at any given time.