Page Summary
- Treatment campaigns are considered new campaigns and do not copy metrics from control campaigns.
- Control and treatment campaigns accrue and retain their metrics separately throughout an experiment, even after promotion or graduation.
- After promotion, changes from the treatment campaign are copied to the control campaign, but metrics remain associated with their original campaigns.
- Experiment campaigns and base campaigns can be differentiated in search queries using campaign.experiment_type.
There are two main ways to report on experiments:
- Direct experiment reporting: Query the experiment resource for metrics. This option provides metrics for the control and treatment arms in a single response, along with statistical comparison data such as uplift and p-values. This is the only way to report on intra-campaign experiments.
- Campaign reporting: Query the campaign resource for metrics, using campaign.experiment_type to distinguish between base and experiment campaigns. This option is only available for experiments that use separate control and treatment campaigns, such as system-managed experiments.
This guide focuses primarily on direct experiment reporting, which is compatible with all experiment types that support reporting.
Direct experiment reporting
You can query the experiment resource directly to retrieve performance metrics
and statistical comparisons between your control and treatment arms.
Metrics and statistical significance
For core metrics such as clicks, impressions, cost, conversions, and conversion
value, the experiment resource provides both treatment metrics (for example,
metrics.clicks) and control metrics (for example, metrics.control_clicks) in
the same row.
It also provides fields to help you evaluate the statistical significance of any difference between the arms:
- metrics.*_p_value: The probability that the observed results would occur if the experiment had no actual effect on the metric. A lower p-value indicates higher statistical significance.
- metrics.*_point_estimate: The estimated percentage lift (positive or negative) in the given metric for the treatment arm compared to the control arm. Together with margin_of_error, they describe a confidence interval with a prescribed confidence level for the difference being estimated. The quantity being estimated is (treatment / control - 1). The point estimate is the center of the confidence interval.
- metrics.*_margin_of_error: The radius of the confidence interval, which is centered at point_estimate. It is calculated for a prescribed confidence level, which depends on the experiment type.
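To illustrate how these three fields work together, the following sketch uses illustrative values (an 8% estimated lift in clicks with a 3% margin of error) to derive the confidence interval bounds. The variable names are placeholders, not API fields:

# Illustrative values read from a report row:
# metrics.clicks_point_estimate and metrics.clicks_margin_of_error.
clicks_point_estimate = 0.08   # estimated (treatment / control - 1)
clicks_margin_of_error = 0.03  # radius of the confidence interval

lower_bound = clicks_point_estimate - clicks_margin_of_error  # 0.05
upper_bound = clicks_point_estimate + clicks_margin_of_error  # 0.11

# The interval [+5%, +11%] excludes zero, so at the prescribed confidence
# level the treatment arm received more clicks than the control arm.
print(f"Estimated clicks lift: {clicks_point_estimate:+.0%} "
      f"({lower_bound:+.0%} to {upper_bound:+.0%})")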
The following core metric fields are supported on the experiment resource,
including a treatment group value, a control group value, and the stat fields
listed previously:
- clicks
- impressions
- cost_micros
- conversions
- cost_per_conversion
- conversion_value
- conversion_value_per_cost
For conversions, specifically, the following metrics fields are also available:
- metrics.conversions_absolute_change_p_value: The p-value for the null hypothesis that the experiment has no effect on conversions absolute change. Ranges from 0 to 1.
- metrics.conversions_absolute_change_point_estimate: The point estimate when estimating the experiment's effect on conversions absolute change.
- metrics.conversions_absolute_change_margin_of_error: The margin of error when estimating the experiment's effect on conversions absolute change.
For assistance constructing valid queries to the experiment resource, use the
Google Ads Query Builder tool.
Example query
The following GAQL query retrieves key metrics for an experiment:
SELECT
experiment.experiment_id,
experiment.name,
experiment.type,
metrics.clicks,
metrics.control_clicks,
metrics.clicks_point_estimate,
metrics.clicks_margin_of_error,
metrics.clicks_p_value,
metrics.conversions,
metrics.control_conversions,
metrics.conversions_absolute_change_point_estimate,
metrics.conversions_absolute_change_margin_of_error,
metrics.conversions_absolute_change_p_value
FROM experiment
WHERE experiment.experiment_id = EXPERIMENT_ID
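If you use the Python client library, a minimal sketch along these lines could issue the query and print a few of the returned fields. The fetch_experiment_metrics helper name and the customer and experiment IDs in the usage comment are placeholders for illustration:

from google.ads.googleads.client import GoogleAdsClient

def fetch_experiment_metrics(client, customer_id, experiment_id):
    """Fetches treatment and control metrics for a single experiment."""
    googleads_service = client.get_service("GoogleAdsService")
    query = f"""
        SELECT
          experiment.name,
          metrics.clicks,
          metrics.control_clicks,
          metrics.clicks_point_estimate,
          metrics.clicks_margin_of_error,
          metrics.clicks_p_value
        FROM experiment
        WHERE experiment.experiment_id = {experiment_id}"""
    for row in googleads_service.search(customer_id=customer_id, query=query):
        print(
            f"{row.experiment.name}: treatment clicks={row.metrics.clicks},"
            f" control clicks={row.metrics.control_clicks},"
            f" lift={row.metrics.clicks_point_estimate:+.2%}"
            f" (p={row.metrics.clicks_p_value:.3f})"
        )

# Example usage (assumes a google-ads.yaml configuration is in place):
# client = GoogleAdsClient.load_from_storage()
# fetch_experiment_metrics(client, "1234567890", 9876543210)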
Interpret results
You can use the p-value, point estimate, and margin of error fields to determine
if your experiment has yielded statistically significant results. For example,
if conversions_absolute_change_p_value is below your chosen threshold (for
example,
0.05 for 95% confidence) and conversions_absolute_change_point_estimate -
conversions_absolute_change_margin_of_error is greater than zero, it indicates
that the treatment arm is performing significantly better than the control arm
in terms of conversions.
Here is a Python snippet demonstrating how to evaluate results based on p-value and lift estimates:
def evaluate_experiment(
    client: GoogleAdsClient, customer_id: str, row: GoogleAdsRow
) -> None:
    """Evaluates the performance of the treatment experiment arm.

    Args:
        client: an initialized GoogleAdsClient instance.
        customer_id: a client customer ID.
        row: a GoogleAdsRow containing the experiment arm and metrics.
    """
    metrics = row.metrics
    experiment_resource_name = row.experiment.resource_name

    # 1. Evaluate conversion success as a primary success signal.
    #    - Point Estimate: Represents the estimated average lift or difference
    #      in conversions.
    #    - Margin of Error: Outlines the confidence interval bounds. Note that
    #      the margin_of_error provided by the API is calculated for a preset
    #      confidence level which is set based on the experiment type.
    #    - Lower Bound: (Point Estimate - Margin of Error). If this value is
    #      above 0, we have statistical significance that performance has
    #      improved.
    conv_p_value = metrics.conversions_absolute_change_p_value
    conv_lift = metrics.conversions_absolute_change_point_estimate
    conv_error = metrics.conversions_absolute_change_margin_of_error
    conv_lower_bound = conv_lift - conv_error

    if conv_p_value <= P_VALUE_THRESHOLD:
        if conv_lower_bound > 0:
            print(
                "Significant Success: Conversions increased. Even at the lower"
                f" bound, the lift is {conv_lower_bound:.2f}. Promoting"
                " changes."
            )
            promote_experiment(client, customer_id, experiment_resource_name)
            return
        elif (conv_lift + conv_error) < 0:
            print(
                "Significant Decline: Even the upper bound"
                f" ({conv_lift + conv_error:.2f}) is below zero. Ending"
                " experiment."
            )
            end_experiment(client, customer_id, experiment_resource_name)
            return

    # 2. Evaluate click volume as a secondary signal.
    # This is helpful as an early indicator or for lower-volume accounts.
    click_p_value = metrics.clicks_p_value
    click_lift = metrics.clicks_point_estimate
    click_error = metrics.clicks_margin_of_error
    click_lower_bound = click_lift - click_error

    if click_p_value <= P_VALUE_THRESHOLD and click_lower_bound > 0:
        # We have a directional winner: high confidence in more traffic,
        # but not enough data to confirm conversion impact yet.
        print(
            f"Click volume is significantly up (+{click_lift*100:.1f}%). "
            "Graduating treatment for further manual analysis."
        )
        # Graduate if it's a separate campaign test.
        # This keeps the high-volume treatment running independently.
        # Intra-campaign experiments (like ADOPT_BROAD_MATCH_KEYWORDS and
        # ADOPT_AI_MAX) run directly within the base campaign, meaning there
        # is only a single campaign involved and no separate treatment
        # campaign to graduate. Therefore, graduation is not supported for
        # intra-campaign experiments.
        experiment_type_name = row.experiment.type_.name
        if (
            experiment_type_name != "ADOPT_BROAD_MATCH_KEYWORDS"
            and experiment_type_name != "ADOPT_AI_MAX"
        ):
            graduate_experiment(client, customer_id, experiment_resource_name)
        else:
            print(
                "Intra-campaign trial detected: Graduation is not supported"
                " because there is only one campaign. Continuing to run to"
                " gather more conversion data."
            )
    else:
        # Both conversions and clicks are noisy.
        print(
            "Inconclusive: No significant lift in Conversions"
            f" (p={conv_p_value:.2f}) or Clicks (p={click_p_value:.2f})."
            f" Current estimated lift: {conv_lift:.2f} +/- {conv_error:.2f}."
            " Continue running."
        )
Benefits over campaign reporting
Direct experiment reporting offers several advantages over querying campaign reports separately:
- Centralized metrics: Retrieve metrics for control and treatment in a single row.
- Statistical confidence data: Provides calculated p-values, point estimates, and margins of error.
- Efficiency: Removes the need to manually join or compare results from multiple reports.
- Intra-campaign support: It is the only way to compare control versus treatment for intra-campaign experiments, where traffic is split within a single campaign.
Campaign reporting
For experiments that create separate treatment campaigns (for example,
SEARCH_CUSTOM), you can query the campaign resource and use
campaign.experiment_type to identify BASE (control) and EXPERIMENT
(treatment) campaigns. This approach is useful if you need to segment metrics at
a more granular level (for example, by ad group or keyword) or view campaign
metadata not available on the experiment resource. However, it requires you to
perform the comparisons and statistical calculations yourself.
You cannot use campaign-level reporting to compare arms for intra-campaign
experiments, as the traffic split happens internally within a single campaign.
Querying campaign for an intra-campaign experiment only returns aggregated
totals.
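For example, a query along the following lines (the metric selection and date range are illustrative) returns base and experiment campaigns side by side, which you can then compare yourself:

SELECT
  campaign.id,
  campaign.name,
  campaign.experiment_type,
  metrics.clicks,
  metrics.conversions
FROM campaign
WHERE campaign.experiment_type IN ('BASE', 'EXPERIMENT')
  AND segments.date DURING LAST_30_DAYS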
Best practices
- Select an appropriate confidence level: Accepting a lower confidence level (a higher p-value threshold) can provide directional guidance faster, especially with lower budgets or conversion volumes. A 95% confidence level (p-value <= 0.05) is the conventional standard and may be better for more accurate results over a longer timeframe.
- Run experiments for long enough: Run experiments for at least 4 weeks to account for weekly performance cycles, conversion delays, and learning periods.
- Give time for ramp-up: For campaigns using automated bidding or testing new features, disregard the first 1-2 weeks of data to give time for bidding models and traffic levels to recalibrate to the split.
- Use 50/50 splits: A 50/50 traffic split is generally the fastest way to achieve statistically significant results.
- Schedule in advance: Set your experiment start date 3-7 days in the future to give time for ad review and approval processes.
- You can only run one experiment per campaign at any given time.