Filling In Missing Values From Date Requests

Nick Mihailovski, Google Analytics API Team – October 2009

This article discusses how to detect and backfill missing time series values in data returned from the Google Analytics Data Export API.


Before You Begin

The article assumes you know how the Google Analytics Data Export API works. The sample code is in Java but you can use the concepts in your language of choice. The code for this article is provided as open source and can be downloaded from project hosting.

After reading this article, you will learn:

  • How the Google Analytics Data Export API treats date dimensions.
  • How to structure your queries to group results and detect missing dates.
  • How to fill in the missing values using Java.

Introduction

Comparing data across a period of time provides context. For example, stating a website generated $1 million in revenue doesn’t mean much. But stating a website increased revenue by 10x quarter over quarter or year over year is indeed impressive. With the Google Analytics API, it’s easy to plot data over time by using the ga:date, ga:day and ga:month dimensions.

If your query only uses a date dimension, if any days in the date range collected zero data, the Google Analytics API will backfill dates and 0 values for metrics.

ga:datega:sessions
2010-03-01101
2010-03-020
2010-03-0369

However it becomes tricky if you query date along with other dimensions. If one of the dates has no data, the API will NOT return an entry for that date. It will just skip to the next available date that contains data.

ga:keywordga:datega:sessions
chair2010-03-0155
chair2010-03-0348

Ideally analysts would want to have the missing dates for a particular keyword filled in like the first example above

This article describe some best practices for pragmatically backfilling data.

Background

Let's first look at why this problem exists. There are 2 reasons.

  1. Google Analytics only processes data that is collected. If nobody came to a site on a particular day, then there is no data to process, so no data is returned.
  2. It's very difficult to determine how many additional dimensions and what values should be used for dates that have no data.

So instead of trying to define one process to rule them all, the Google Analytics API leaves the exercise of filling in data for queries that have multiple dimensions up to the developer. Lucky You :)

Program Overview

Here are the steps to backfill the data in the chart above.

  1. Modify the query to ensure dimensions are sorted opportunistically.
  2. Determine the expected dates from the date range.
  3. Iterate and backfill any missing dates.
  4. Fill in any remaining missing values.

Modify The Query

To backfill dates, we need to make sure the data returned from the API is in a format that makes it easy to detect when a date is missing. Here’s an example query to retrieve both ga:keyword and ga:date for the first 5 days in March:

DataQuery dataQuery = new DataQuery(new URL(BASE_URL));
dataQuery.setIds(TABLE_ID);
dataQuery.setStartDate("2010-03-01");
dataQuery.setEndDate("2010-03-05");
dataQuery.setDimensions("ga:keyword,ga:date");
dataQuery.setMetrics("ga:entrances");

Once the query is sent to the API, the results will contain a list of DataEntry objects. Each entry object represents a row of data and includes names and values for dimensions/metrics. Since no sort parameter was used, the results are returned in an arbitrary order.

ga:keywordga:datega:entrances
chair2010-03-0414
chair2010-03-0123
table2010-03-0418
table2010-03-0224
chair2010-03-0313

To make it easy to identify which dates are missing, we need to first group all dimensions together. This can be done by setting the sort parameter of the query to the dimensions used in the original query.

dataQuery.setSort("ga:keyword,ga:date");

Adding the sort parameter will make the API return the results in the desired order.

ga:keywordga:datega:entrances
chair2010-03-0123
chair2010-03-0313
chair2010-03-0414
table2010-03-0224
table2010-03-0418

The second step is to make sure that for every dimension, all dates are returned in ascending order. While the Google Analytics API provides a number of date dimensions, only ga:date can be sorted accurately across date boundaries (i.e., days, months, years). So if you want to backfill dates, make sure your query uses the ga:date dimension in both the dimensions and sort query parameters.

Once the sorted query is executed, all of the same landing pages will be returned next to each other and the dates will be in sequential order. The list of dates for a single landing page can be thought of as a time series and because they are in order, it is much easier to identify missing dates.

Determine Expected Dates

To detect missing dates, we need to compare the actual dates returned from the API to the expected dates in every time series. We can figure out what is expected by:

  1. Determining the expected start date from the API query.
  2. Counting the number expected days in the query date range.

Both values can be used together to determine each expected date by incrementing the start date by 1 for each day in the date range.

Determining the Expected Start Date

We can use the start-date query parameter as the expected start date of the series. Because the date format returned in the API response yyyyMMdd is different than the format of the query parameter yyyy-MM-dd, we need to first convert the date format before we can use it.

The setExpectedStartDate method converts the formats of the dates.

  private static SimpleDateFormat queryDateFormat = new SimpleDateFormat("yyyy-MM-dd");
  private static SimpleDateFormat resultDateFormat = new SimpleDateFormat("yyyyMMdd");

  public void setExpectedStartDate(String startDate) {
    try {
      calendar.setTime(queryDateFormat.parse(startDate));
      expectedStartDate = resultDateFormat.format(calendar.getTime());
    } catch (ParseException e) {
      handleException(e);
    }
  }

Counting the Number of Expected Days

To get the number of days in the date range, the program parses the start and end dates into Java Date objects. Then uses a Calendar object to figure out the time between both dates. One day is added to the difference in dates to make the count inclusive.

  private static final long millisInDay = 24 * 60 * 60 * 1000;

  public void setNumberOfDays(DataQuery dataQuery) {
    long startDay = 0;
    long endDay = 0;

    try {
      calendar.setTime(queryDateFormat.parse(dataQuery.getStartDate()));
      startDay = calendar.getTimeInMillis() / millisInDay;

      calendar.setTime(queryDateFormat.parse(dataQuery.getEndDate()));
      endDay = calendar.getTimeInMillis() / millisInDay;
    } catch (ParseException e) {
      handleException(e);
    }

    numberOfDays = (int) (endDay - startDay + 1);
  }

Now we have all the data we need to figure out which dates are missing.

Identify Each Time Series in the Results

Once the query is executed, the program goes through each DataEntry object in the API response. Because the query was initially sorted, the response will have a partial time series for each keyword. So we need to find the start of each time series, then go through each date and fill in missing data not returned by the API.

This program uses the dimensionValue and tmpDimensionValue variables to detect the start of each series.

Here is the entire code to handle the response. Filling in missing data is discussed below.

public void printBackfilledResults(DataFeed dataFeed) {
  String expectedDate = "";
  String dimensionValue = "";
  List<Integer> row = null;

  for (DataEntry entry : dataFeed.getEntries()) {
    String tmpDimValue = entry.getDimensions().get(0).getValue();

    // Detect beginning of a series.
    if (!tmpDimValue.equals(dimensionValue)) {
      if (row != null) {
        forwardFillRow(row);
        printRow(dimensionValue, row);
      }

      // Create a new row.
      row = new ArrayList<Integer>(numberOfDays);
      dimensionValue = tmpDimValue;
      expectedDate = expectedStartDate;
    }

    // Backfill row.
    String foundDate = entry.getDimension("ga:date").getValue();
    if (!foundDate.equals(expectedDate)) {
      backFillRow(expectedDate, foundDate, row);
    }

    // Handle the data.
    Metric metric = entry.getMetrics().get(0);
    row.add(new Integer(metric.getValue()));
    expectedDate = getNextDate(foundDate);
  }

  // Handle the last row.
  if (row != null) {
    forwardFillRow(row);
    printRow(dimensionValue, row);
  }
}

Backfill Any Missing Dates

For each entry in a series, the program stores the metric values (entrances) in an ArrayList called row. When a new time series is detected, a new row is created and the expected date is set to the expected start date.

Then for each entry, the program checks to see if the date value in the entry equals the expected date. If they are equal, the metric in the entry is added to the row. Otherwise, the program has detected missing dates which need to be backfilled.

The backfillRow method handles backfilling data. It accepts as parameters the expected and found dates as well as the current row. It then determines the number of days between the two dates (non-inclusive) and adds that many 0s to the row.

  public void backFillRow(String startDate, String endDate, List<Integer> row) {
    long d1 = 0;
    long d2 = 0;

    try {
      calendar.setTime(resultDateFormat.parse(startDate));
      d1 = calendar.getTimeInMillis() / millisInDay;

      calendar.setTime(resultDateFormat.parse(endDate));
      d2 = calendar.getTimeInMillis() / millisInDay;

    } catch (ParseException e) {
      handleException(e);
    }

    long differenceInDays = d2 - d1;
    if (differenceInDays > 0) {
      for (int i = 0; i < differenceInDays; i++) {
        row.add(0);
      }
    }
  }

When the method is done, the row has been backfilled with data and the current data can be added. The expected date is then incremented to one day after the found date using the getNextDate method.

public String getNextDate(String initialDate) {
  try {
    calendar.setTime(resultDateFormat.parse(initialDate));
    calendar.add(Calendar.DATE, 1);
    return resultDateFormat.format(calendar.getTime());

  } catch (ParseException e) {
    handleException(e);
  }
  return "";
}

Fill In Any Remaining Values

Once the series data has been processed into a row, we have to check that there are no more missing dates at the end of the series.

The forwardFillRow method simply calculates the difference between the number of days in the original query to the current size of the row, and adds that many 0s to the end of the row.

public void forwardFillRow(List<Integer> row) {
  int remainingElements = numberOfDays - row.size();
  if (remainingElements > 0) {
    for (int i = 0; i < remainingElements; i++) {
      row.add(0);
    }
  }
}

At this point, the program has filled in any missing values in the time series. Now that we have all the data, the program prints the dimension and metric values as a comma separated list.

Conclusion

Using this sample, you can easily backfill data on dates not returned by the API. As mentioned above, this solution can be adapted to any programming language. Developers can even adapt these techniques and apply them to handle multiple dimensions and multiple metrics. Now it's even easier than ever to start doing advanced analysis on times series returned by the Google Analytics API.