Google Cloud Skills Boost — Insights from Data with BigQuery
Introduction to SQL for BigQuery and Cloud SQL
In this lab, we will learn how to:
- Load databases and tables into BigQuery.
- Execute simple queries on tables to pull meaningful data from datasets.
- Export a subset of data into a CSV file and store that file in a new Cloud Storage bucket.
- Create a new Cloud SQL instance and load your exported CSV file as a new table.
Prerequisites
Weather Data in BigQuery
In this lab, we’ll:
- Carry out interactive queries on the BigQuery console.
- Combine and run analytics on multiple datasets.
- Use the
CORR
function to determine positive, negative, or no correlation between datasets.
Analyzing Natality Data Using BigQuery and Vertex AI
In this lab, we analyze a large (137 million rows) natality dataset using BigQuery and Vertex AI.
The most essential keywords are SELECT
and FROM
:
- Use
SELECT
to specify what fields you want to pull from your dataset. - Use
FROM
to specify what table or tables you want to pull our data from.
The WHERE
keyword is another SQL command that filters tables for specific column values.
SELECT USER FROM example_table WHERE SHIPPED='YES'
A keyword that specifies the fields (e.g. column values) that you want to pull from your dataset.
SELECT
Specifies what table or tables to pull our data from.
FROM
Allows us to filter tables for specific column values.
WHERE
A fully-managed petabyte-scale data warehouse that runs on the Google Cloud.
BigQuery
Projects contain datasets, and datasets contain tables.
True
With BigQuery, you can access datasets shared publicly from other Google Cloud projects.
True
The GROUP BY
keyword will aggregate result-set rows that share common criteria (e.g. a column value) and will return all of the unique entries found for such criteria.
The COUNT()
function will return the number of rows that share the same criteria (e.g. column value).
SQL also has an AS
keyword, which creates an alias of a table or column.
The ORDER BY
keyword sorts the returned data from a query in ascending or descending order based on a specified criteria or column value.
Aggregates rows that share common criteria (e.g. a column value) and will return all of the unique entries found for such criteria.
GROUP BY
A SQL function will count and return the number of rows that share common criteria.
COUNT
Creates an alias of a table or column.
AS
Sorts the returned data from a query in ascending or descending order based on a specified criteria or column value.
ORDER BY
Here are a couple more SQL keywords that help us with data management. The first is the DELETE
keyword.
We can also insert values into tables with the INSERT INTO
keyword.
The last SQL keyword that you’ll learn about is UNION
. This keyword combines the output of two or more SELECT
queries into a result-set.
BigQuery: Qwik Start — Console
In this lab, we’ll do
- Query a public dataset
- Create a new dataset
- Load data into a new table
- Query a custom table
BigQuery is a fully-managed enterprise data warehouse that enables super-fast SQL queries.
True
BigQuery: Qwik Start — Command Line
Same as the last lab, but here we use command line.
You can access BigQuery using:
BigQuery REST API, Command line tool, Web UI
Which CLI tool is used to interact with BigQuery service?
bq
Exploring Your Ecommerce Dataset with SQL in Google BigQuery
In this lab, we use BigQuery to:
- Access an ecommerce dataset
- Look at the dataset metadata
- Remove duplicate entries
- Write and execute queries
Which UI tab will show you the data types?
Schema
How many rows are in the dataset?
Over 21 million
How many records have duplicates in all_sessions_raw?
615
Which part of the previous query deduplicates the records?
GROUP BY
How many distinct product names were returned in total?
633
The product with the most views got the most orders.
True
What is the difference between orders and quantity_product_ordered?
order is the number of orders, quantity_product_ordered is the number of items ordered
What product has the highest avg_per_order?
YouTube Bottle Infuser
Troubleshooting Common SQL Errors with BigQuery
This lab steps you through the logic of troubleshooting queries.
#standardSQL
SELECT FROM `data-to-inghts.ecommerce.rev_transactions` LIMIT 1000
What’s wrong with the previous query to view 1000 items?
- We have not specified any columns in the SELECT
- There is a typo in the dataset name
#standardSQL
SELECT * FROM [data-to-insights:ecommerce.rev_transactions] LIMIT 1000
What’s wrong with the new previous query to view 1000 items?
We are using legacy SQL
#standardSQL
SELECT FROM `data-to-insights.ecommerce.rev_transactions`
What is wrong with the previous query?
Still no columns defined in SELECT
#standardSQL
SELECT
fullVisitorId
FROM `data-to-insights.ecommerce.rev_transactions`
What is wrong with the previous query?
- Without aggregations, limits, or sorting, this query is not insightful
- The page title is missing from the columns in SELECT
#standardSQL
SELECT fullVisitorId hits_page_pageTitle
FROM `data-to-insights.ecommerce.rev_transactions` LIMIT 1000
How many columns will the previous query return?
1, a column named hits_page_pageTitle
#standardSQL
SELECT
COUNT(fullVisitorId) AS visitor_count
, hits_page_pageTitle
FROM `data-to-insights.ecommerce.rev_transactions`
What is wrong with the previous query?
- The COUNT() function does not de-deduplicate the same fullVisitorId
- It is missing a GROUP BY clause
Which city had the most distinct visitors? Ignore the value: ‘not available in this demo dataset’
Mountain View
#standardSQL
SELECT
geoNetwork_city,
SUM(totals_transactions) AS total_products_ordered,
COUNT( DISTINCT fullVisitorId) AS distinct_visitors,
SUM(totals_transactions) / COUNT( DISTINCT fullVisitorId) AS avg_products_ordered
FROM
`data-to-insights.ecommerce.rev_transactions`
WHERE avg_products_ordered > 20
GROUP BY geoNetwork_city
ORDER BY avg_products_ordered DESC
What is wrong with the previous query?
- You cannot filter on aliased fields within the `WHERE` clause
- You cannot filter aggregated fields in the `WHERE` clause (use `HAVING` instead)
#standardSQL
SELECT hits_product_v2ProductName, hits_product_v2ProductCategory
FROM `data-to-insights.ecommerce.rev_transactions`
GROUP BY 1,2
What is wrong with the previous query?
- No aggregate functions are used
- Large GROUP BYs really hurt performance (consider filtering first and/or using aggregation functions)
#standardSQL
SELECT
COUNT(hits_product_v2ProductName) as number_of_products,
hits_product_v2ProductCategory
FROM `data-to-insights.ecommerce.rev_transactions`
WHERE hits_product_v2ProductName IS NOT NULL
GROUP BY hits_product_v2ProductCategory
ORDER BY number_of_products DESC
What is wrong with the previous query which lists products?
The COUNT() function is not the distinct number of products in each category
Which category has the most distinct number of products offered?
(not set)
Explore and Create Reports with Looker Studio
This lab focuses on how to create new reports and explore your ecommerce dataset visually for insights.