Google Cloud Skills Boost — Insights from Data with BigQuery

Jackson Chen
5 min readJan 20, 2024

--

Introduction to SQL for BigQuery and Cloud SQL

In this lab, we will learn how to:

  • Load databases and tables into BigQuery.
  • Execute simple queries on tables to pull meaningful data from datasets.
  • Export a subset of data into a CSV file and store that file in a new Cloud Storage bucket.
  • Create a new Cloud SQL instance and load your exported CSV file as a new table.

Prerequisites

Weather Data in BigQuery

In this lab, we’ll:

  • Carry out interactive queries on the BigQuery console.
  • Combine and run analytics on multiple datasets.
  • Use the CORR function to determine positive, negative, or no correlation between datasets.

Analyzing Natality Data Using BigQuery and Vertex AI

In this lab, we analyze a large (137 million rows) natality dataset using BigQuery and Vertex AI.

The most essential keywords are SELECT and FROM:

  • Use SELECT to specify what fields you want to pull from your dataset.
  • Use FROM to specify what table or tables you want to pull our data from.

The WHERE keyword is another SQL command that filters tables for specific column values.

SELECT USER FROM example_table WHERE SHIPPED='YES'

A keyword that specifies the fields (e.g. column values) that you want to pull from your dataset.

SELECT

Specifies what table or tables to pull our data from.

FROM

Allows us to filter tables for specific column values.

WHERE

A fully-managed petabyte-scale data warehouse that runs on the Google Cloud.

BigQuery

Projects contain datasets, and datasets contain tables.

True

With BigQuery, you can access datasets shared publicly from other Google Cloud projects.

True

The GROUP BY keyword will aggregate result-set rows that share common criteria (e.g. a column value) and will return all of the unique entries found for such criteria.

The COUNT() function will return the number of rows that share the same criteria (e.g. column value).

SQL also has an AS keyword, which creates an alias of a table or column.

The ORDER BY keyword sorts the returned data from a query in ascending or descending order based on a specified criteria or column value.

Aggregates rows that share common criteria (e.g. a column value) and will return all of the unique entries found for such criteria.

GROUP BY

A SQL function will count and return the number of rows that share common criteria.

COUNT

Creates an alias of a table or column.

AS

Sorts the returned data from a query in ascending or descending order based on a specified criteria or column value.

ORDER BY

Here are a couple more SQL keywords that help us with data management. The first is the DELETE keyword.

We can also insert values into tables with the INSERT INTO keyword.

The last SQL keyword that you’ll learn about is UNION. This keyword combines the output of two or more SELECT queries into a result-set.

BigQuery: Qwik Start — Console

In this lab, we’ll do

  • Query a public dataset
  • Create a new dataset
  • Load data into a new table
  • Query a custom table

BigQuery is a fully-managed enterprise data warehouse that enables super-fast SQL queries.

True

BigQuery: Qwik Start — Command Line

Same as the last lab, but here we use command line.

You can access BigQuery using:

BigQuery REST API, Command line tool, Web UI

Which CLI tool is used to interact with BigQuery service?

bq

Exploring Your Ecommerce Dataset with SQL in Google BigQuery

In this lab, we use BigQuery to:

  • Access an ecommerce dataset
  • Look at the dataset metadata
  • Remove duplicate entries
  • Write and execute queries

Which UI tab will show you the data types?

Schema

How many rows are in the dataset?

Over 21 million

How many records have duplicates in all_sessions_raw?

615

Which part of the previous query deduplicates the records?

GROUP BY

How many distinct product names were returned in total?

633

The product with the most views got the most orders.

True

What is the difference between orders and quantity_product_ordered?

order is the number of orders, quantity_product_ordered is the number of items ordered

What product has the highest avg_per_order?

YouTube Bottle Infuser

Troubleshooting Common SQL Errors with BigQuery

This lab steps you through the logic of troubleshooting queries.

#standardSQL
SELECT FROM `data-to-inghts.ecommerce.rev_transactions` LIMIT 1000

What’s wrong with the previous query to view 1000 items?

  • We have not specified any columns in the SELECT
  • There is a typo in the dataset name
#standardSQL
SELECT * FROM [data-to-insights:ecommerce.rev_transactions] LIMIT 1000

What’s wrong with the new previous query to view 1000 items?

We are using legacy SQL

#standardSQL
SELECT FROM `data-to-insights.ecommerce.rev_transactions`

What is wrong with the previous query?

Still no columns defined in SELECT

#standardSQL
SELECT
fullVisitorId
FROM `data-to-insights.ecommerce.rev_transactions`

What is wrong with the previous query?

  • Without aggregations, limits, or sorting, this query is not insightful
  • The page title is missing from the columns in SELECT
#standardSQL
SELECT fullVisitorId hits_page_pageTitle
FROM `data-to-insights.ecommerce.rev_transactions` LIMIT 1000

How many columns will the previous query return?

1, a column named hits_page_pageTitle

#standardSQL
SELECT
COUNT(fullVisitorId) AS visitor_count
, hits_page_pageTitle
FROM `data-to-insights.ecommerce.rev_transactions`

What is wrong with the previous query?

  • The COUNT() function does not de-deduplicate the same fullVisitorId
  • It is missing a GROUP BY clause

Which city had the most distinct visitors? Ignore the value: ‘not available in this demo dataset’

Mountain View

#standardSQL
SELECT
geoNetwork_city,
SUM(totals_transactions) AS total_products_ordered,
COUNT( DISTINCT fullVisitorId) AS distinct_visitors,
SUM(totals_transactions) / COUNT( DISTINCT fullVisitorId) AS avg_products_ordered
FROM
`data-to-insights.ecommerce.rev_transactions`
WHERE avg_products_ordered > 20
GROUP BY geoNetwork_city
ORDER BY avg_products_ordered DESC

What is wrong with the previous query?

  • You cannot filter on aliased fields within the `WHERE` clause
  • You cannot filter aggregated fields in the `WHERE` clause (use `HAVING` instead)
#standardSQL
SELECT hits_product_v2ProductName, hits_product_v2ProductCategory
FROM `data-to-insights.ecommerce.rev_transactions`
GROUP BY 1,2

What is wrong with the previous query?

  • No aggregate functions are used
  • Large GROUP BYs really hurt performance (consider filtering first and/or using aggregation functions)
#standardSQL
SELECT
COUNT(hits_product_v2ProductName) as number_of_products,
hits_product_v2ProductCategory
FROM `data-to-insights.ecommerce.rev_transactions`
WHERE hits_product_v2ProductName IS NOT NULL
GROUP BY hits_product_v2ProductCategory
ORDER BY number_of_products DESC

What is wrong with the previous query which lists products?

The COUNT() function is not the distinct number of products in each category

Which category has the most distinct number of products offered?

(not set)

Explore and Create Reports with Looker Studio

This lab focuses on how to create new reports and explore your ecommerce dataset visually for insights.

--

--

Jackson Chen
Jackson Chen

No responses yet