Introduction to data transformation

This document describes the different ways you can transform data in your BigQuery tables.

Methods of transforming data

You can transform data in BigQuery in the following ways:

  • Use data manipulation language (DML) to transform data in your BigQuery tables.
  • Use Dataform to develop, test, control versions, and schedule SQL workflows in BigQuery.

You can also review the change history of a BigQuery table to examine the transformations made to a table in a specified time range.

Transform data with DML

You can use data manipulation language (DML) to transform data in your BigQuery tables. DML statements are GoogleSQL queries that manipulate existing table data to add or delete rows, modify data in existing rows, or merge data with values from another table. DML transformations are also supported in partitioned tables.

You can run multiple DML statements concurrently, where BigQuery queues several DML statements that transform your data one after the other. BigQuery manages how concurrent DML statements are run, based upon the transformation type.

Transform data with Dataform

Dataform lets you manage data transformation in the extract, load, and transform (ELT) process for data integration. After extracting raw data from source systems and loading it into BigQuery, you can use Dataform to transform it into an organized, tested, and documented suite of tables. While in DML you take an imperative approach by telling BigQuery how exactly to transform your data, in Dataform you write declarative statements where Dataform then determines the transformation needed to achieve that state.

In Dataform, you can develop, test, and version control SQL workflows for data transformation from data source declarations to output tables, views, or materialized views. You can develop SQL workflows with Dataform core or pure JavaScript. Dataform core is an open source meta-language that extends SQL with SQLX and JavaScript. You can use Dataform core to manage dependencies, set up automated data quality testing, and document table or column descriptions within the code.

Dataform stores your SQL workflow code in repositories and uses Git to track file changes. Development workspaces in Dataform let you work on the contents of the repository without affecting the work of others who are working in the same repository. You can connect Dataform repositories to third-party Git providers, including Azure DevOps Services, BitBucket, GitHub, and GitLab.

You can run or schedule SQL workflows with Dataform release configurations and workflow configurations. Alternatively, you can schedule executions either with Cloud Composer, or with Workflows and Cloud Scheduler. During execution, Dataform executes SQL queries in BigQuery in order of object dependencies in your SQL workflow. After execution, you can use your defined tables and views for analysis in BigQuery.

To learn more about creating data transformation SQL workflows in Dataform, see Dataform overview and Overview of Dataform features.

What's next