After you collect your data, perform an exploratory data analysis (EDA) to find and address any data quality issues. This is a critical step in the marketing mix modeling (MMM) process because it lets you assess the data to confirm that it accurately represents the marketing efforts, customer responses, and other relevant metrics. By correcting issues discovered through the EDA process, you can improve the reliability of the model output.
The basic process for performing an EDA is:
- Run a data review to identify any missing or incomplete data.
- Fix missing values in your raw input files.
- Evaluate the accuracy of the data.
- Correct any anomalies, outliers, or inaccuracies in the data.
- Check the correlation between your KPI, media, and control variables.
There are many ways to approach EDA, so Meridian doesn't provide visualizations for this process. We recommend finding the right balance for your needs between a thorough, granular analysis that gives greater confidence and a quick check of high-level data that gives less detailed insight.
Consider these guidelines as you produce your own visualizations to assist with your EDA:
Checking data completeness: Check for missing values in the data. You can create charts that show the percentage of data completeness for each variable (channel), then investigate any variables that show as incomplete.
To further refine your EDA, you can create visualizations that show the number of observations by year, month, week, and weekday. Look for unexpectedly lower observations for any time period.
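As a sketch of the completeness checks above, the following pandas snippet computes the percentage of non-missing observations per variable and the number of complete observations per month. All column names and values here are hypothetical:

```python
import numpy as np
import pandas as pd

# Hypothetical weekly media data with some gaps (names and values illustrative).
df = pd.DataFrame({
    "week": pd.date_range("2023-01-01", periods=8, freq="W"),
    "tv_spend": [100, 120, np.nan, 90, 110, np.nan, 95, 105],
    "search_spend": [50, 55, 60, 52, 58, 61, 57, 54],
})

# Percentage of non-missing observations per variable.
completeness = df.drop(columns="week").notna().mean() * 100
print(completeness)  # tv_spend: 75.0, search_spend: 100.0

# Complete observations per month; unexpectedly low counts flag gaps.
counts_by_month = df.dropna().groupby(df["week"].dt.month).size()
```

Plotting `completeness` as a bar chart gives the per-variable completeness view described above.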
Checking data accuracy: Ensure that data is accurate and free from anomalies or outliers that could skew results. Visualizations to check accuracy can include comparing each channel's share of media spend and examining a channel's trend over time to identify anything unusual. You can compare these visualizations against the media plan or work with the marketing team to help identify whether the data is accurate and granular enough.
Checking channel size: Look at each channel's share of spend. Channels with a very small share of spend might be difficult to estimate, so you might want to combine them with other channels.
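A minimal sketch of the channel-size check, assuming hypothetical channel names and spend totals (the 1% cutoff is an illustrative choice, not a Meridian rule):

```python
import pandas as pd

# Hypothetical total spend per channel (names and values illustrative).
spend = pd.Series({"tv": 500_000, "search": 300_000, "radio": 5_000, "podcast": 2_000})

share = spend / spend.sum()
# Channels below ~1% of total spend may be hard to estimate on their own;
# consider combining them with similar channels.
small = share[share < 0.01]
print(small.index.tolist())  # ['radio', 'podcast']
```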
Checking variability of channels' media execution: Channels with low variability in media execution (impressions, clicks, and so on) might be difficult to estimate. Consider using a custom prior if you have relevant information for one.
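One simple way to quantify execution variability is the coefficient of variation (standard deviation divided by the mean). This is an illustrative heuristic, not a Meridian API; the channel values are made up:

```python
import numpy as np

# Hypothetical weekly impressions for two channels.
steady = np.array([100, 101, 99, 100, 100.0])   # almost flat flighting
bursty = np.array([0, 200, 0, 150, 50.0])       # high variability

def coef_of_variation(x):
    # std / mean: a scale-free measure of how much execution varies.
    return x.std() / x.mean()

# A very low CV means little variation for the model to learn from;
# such channels are candidates for a custom prior.
print(coef_of_variation(steady))  # ~0.006
print(coef_of_variation(bursty))  # ~1.0
```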
Checking correlation between variables: Though correlation between KPI, media, and control variables is not required, creating visualizations to check for correlation can be helpful in the following use cases:
Measuring the correlation between media and control variables to see if there is any unexpected relationship. This can help you decide whether to keep or remove any media or control variable.
Identifying multicollinearity. When two or more variables in the media and control variables are highly correlated with each other, they create multicollinearity, which can cause regression models to have difficulty calculating the impact of the collinear variables. By identifying multicollinearity in your data review, you can decide which variables to include or exclude from your model.
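As a sketch, a pairwise-correlation screen like the one described above can be run with pandas. The variable names, the synthetic data, and the 0.95 flagging threshold are all illustrative assumptions:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 104  # two years of weekly data

# Hypothetical variables: two media channels and a control.
tv = rng.gamma(2.0, 50.0, n)
search = tv * 0.5 + rng.normal(0, 1, n)   # nearly collinear with tv
promo = rng.integers(0, 2, n).astype(float)

df = pd.DataFrame({"tv": tv, "search": search, "promo": promo})
corr = df.corr()

# Flag any pair whose absolute correlation exceeds a high threshold.
pairs = [(a, b) for a in corr for b in corr
         if a < b and abs(corr.loc[a, b]) > 0.95]
print(pairs)  # [('search', 'tv')]
```

A heatmap of `corr` is a common way to visualize the same information.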
After you have confidence that your data is accurate and complete, you can load the data using a supported format, and then create your model.
Automated Data Checks
Meridian features automated data checks that are meant to capture extreme data issues that would lead to non-convergence or untrustworthy model results. These checks run when sample_posterior is called or when the Meridian object is initialized. If any critical issue is found in the data, posterior sampling won't execute. Instead, an error is printed detailing the critical issue and the actions that fix it. These data checks save time and improve model trustworthiness by alerting you to critical problems before full posterior sampling. All automated data checks are performed on the automatically scaled data used to fit the model. For more information on Meridian's scaling of data, see Input Data.
The following critical checks are automatically performed on your dataset:
Pairwise Correlation
Pearson pairwise correlation is computed between all scaled treatment units (including scaled reach $\times$ frequency for RF and ORF channels) and scaled control variables.
For a geo model, pairwise correlation is first computed across all geos and times. That is, for any two variables $\mathbf{X}_1$ and $\mathbf{X}_2$, $Corr(\mathbf{X}_1, \mathbf{X}_2)$ is calculated, where
\[ \begin{align*} \mathbf{X}_1 &= ( x_{g_1, t_1, 1}, x_{g_1, t_2, 1}, \cdots, x_{g_2, t_1, 1}, x_{g_2, t_2, 1}, \cdots ) \\ \mathbf{X}_2 &= ( x_{g_1, t_1, 2}, x_{g_1, t_2, 2}, \cdots, x_{g_2, t_1, 2}, x_{g_2, t_2, 2}, \cdots ). \end{align*} \]
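Concretely, the geo-model vectorization above can be reproduced with NumPy. The toy geo-by-time values are illustrative, and the second variable is a perfect linear transform of the first so that the check trips:

```python
import numpy as np

# Hypothetical scaled data: 3 geos x 4 time periods for two variables.
x1 = np.array([[1.0, 2.0, 3.0, 4.0],
               [2.0, 1.0, 4.0, 3.0],
               [0.0, 1.0, 1.0, 2.0]])
x2 = 2.0 * x1 + 1.0          # perfectly correlated copy of x1

# Flatten across geos and times, matching the vectorization above.
v1, v2 = x1.ravel(), x2.ravel()
r = np.corrcoef(v1, v2)[0, 1]
print(r)  # ~1.0, which would trip the |r| > 0.999 check
```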
An ERROR is triggered if a pair of variables has nearly perfect correlation, that is, the absolute value of their pairwise correlation exceeds the default threshold of 0.999 across all geos and times:

f'Some variables have perfect pairwise correlation across all times and geos. For each pair of perfectly-correlated variables, please remove one of the variables from the model.\nPairs with perfect correlation: {var_pairs}'

In this case, for each pair of variables listed in {var_pairs} in the error message, remove one of the redundant variables from InputData and rerun sample_posterior.

For a national model, an ERROR is triggered if the absolute value of the pairwise correlation between a pair of variables is larger than 0.999 across all times. Again, remove one of the redundant variables mentioned in the error message from the model:

f'Some variables have perfect pairwise correlation across all times. For each pair of perfectly-correlated variables, please remove one of the variables from the model.\nPairs with perfect correlation: {var_pairs}'
Multicollinearity
To assess multicollinearity, variance inflation factor (VIF) is computed for all scaled treatment units (including scaled reach $\times$ frequency for RF and ORF channels) and scaled control variables. A VIF estimates the extent to which the variance of an explanatory variable is inflated due to collinearity with other variables in the model. A VIF of 1 indicates no collinearity, while higher values suggest increasing levels of multicollinearity. High multicollinearity can increase the width of the coefficients' credible intervals, causing their posterior inference to be less reliable.
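The following NumPy sketch shows one standard way to compute VIF (regress each variable on the others and take 1 / (1 − R²)); it is not Meridian's implementation, and the synthetic data is constructed so that one variable is a near-linear combination of the others:

```python
import numpy as np

def vif(X):
    """VIF per column of X: regress it on the others, VIF = 1 / (1 - R^2)."""
    n, p = X.shape
    out = []
    for j in range(p):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        # Add an intercept column, then least-squares fit.
        A = np.column_stack([np.ones(n), others])
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ beta
        r2 = 1.0 - resid.var() / y.var()
        out.append(1.0 / (1.0 - r2))
    return np.array(out)

rng = np.random.default_rng(1)
a = rng.normal(size=100)
b = rng.normal(size=100)
c = a + b + rng.normal(scale=1e-3, size=100)  # near-linear combination
X = np.column_stack([a, b, c])
print(vif(X))  # all three VIFs far exceed the 1000 threshold
```

Because `c` is (almost) `a + b`, each variable is nearly recoverable from the other two, so every column's VIF explodes.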
For a geo model, VIF is first computed for each variable across all geos and times. An ERROR is triggered if any variable can be expressed nearly perfectly as a linear combination of other variables (VIF exceeds the default threshold of 1000):

f'Some variables have extreme multicollinearity (VIF > 1000) across all times and geos. To address multicollinearity, please drop any variable that is a linear combination of other variables. Otherwise, consider combining variables.\nVariables with extreme VIF: {high_vif_vars}'

In this case, either drop any redundant variable listed in {high_vif_vars} in the error message that could be a linear combination of other variables, or combine these variables.

For a national model, VIF is computed for each variable across all times. An ERROR is triggered if the VIF of a variable exceeds the default threshold of 1000. Again, either drop or combine the redundant variables mentioned in the error message:

f'Some variables have extreme multicollinearity (with VIF > 1000) across all times. To address multicollinearity, please drop any variable that is a linear combination of other variables. Otherwise, consider combining variables.\nVariables with extreme VIF: {high_vif_vars}'
Standard Deviation of KPI
This check computes the standard deviation of the scaled KPI across all geos and times for a geo model, or across all times for a national model. An ERROR is triggered when the scaled KPI is almost completely constant, indicated by a standard deviation of less than 1e-4. This means there is no signal in the response variable. You should check for data input errors, or reconsider the feasibility of statistical modeling with this dataset:

f'{kpi} is constant across all geos and times, indicating no signal in the data. Please fix this data error.'

Standard Deviation of Explanatory Variables
This check assesses the standard deviation of scaled controls and scaled treatments (including scaled reach for RF and ORF channels). Since the Meridian model has a time main effect $\mu_t$ (and a geo main effect $\tau_g$ for geo-level data), the variation of these scaled variables is assessed separately along the time dimension and the geo dimension (if applicable), for the following reasons.
Variation across geo
The standard deviation of the scaled variables along the geo dimension is assessed only for geo-level datasets, since the national model has only one geo.

An ERROR occurs when you have set knots = n_times and you have a variable that doesn't vary across geo (for example, a national-level variable that exists in a geo-level dataset). When knots = n_times, each time period gets its own parameter. A national-level variable varies only across time, not across geo. Therefore, the national-level variable is perfectly collinear with time and is redundant in a model that has a parameter for each time period. To resolve the redundancy, you can either drop the national-level variable or set knots < n_times; which you choose depends on your interpretation goals:

f'The following {data_name} variables do not vary across geos, making a model with n_knots=n_time unidentifiable. This can lead to poor model convergence. Since these variables only vary across time and not across geo, they are collinear with time and redundant in a model with a parameter for each time period. To address this, you can either: (1) decrease the number of knots (n_knots < n_time), or (2) drop the listed variables that do not vary across geos.'

Variation across time
The standard deviation of the scaled variables along the time dimension is assessed for both geo-level and national-level datasets.
For a geo model, an ERROR occurs when you have a variable that doesn't vary across time, which is perfectly collinear with the geo main effect $\tau_g$. Because this redundant variable leads to poor model convergence, you should drop the variable that does not vary across time:

f'The following {data_name} variables do not vary across time making a model with geo main effects unidentifiable. This can lead to poor model convergence. Since these variables only vary across geo and not across time, they are collinear with geo and redundant in a model with geo main effects. To address this, drop the listed variables that do not vary across time.'

For a national model, a variable that doesn't vary across time is a constant term that brings no signal and hurts model convergence. You should drop this constant variable from the model:
f'The following {data_name} variables do not vary across time, which is equivalent to no signal at all in a national model. This can lead to poor model convergence. To address this, drop the listed variables that do not vary across time.'