In today’s information-driven world, data analytics is an integral part of every business decision. And because analytics is such an important factor in business, the cost of analytics tools and technologies matters just as much, to ensure a high return on investment and minimal waste of resources.
BigQuery is one of the leading data analytics technologies for organizations of all sizes. BigQuery not only helps analyze data but also helps organizations with real-time decision-making, reporting, and future predictions.
Architecture
BigQuery is a completely serverless enterprise data warehouse. In BigQuery, storage and compute are decoupled so that both can scale independently on demand.
This architecture offers flexibility to customers: they don’t need to keep compute resources up and running all the time, and they no longer need to worry about system engineering or database operations while using BigQuery.
BigQuery has distributed, replicated storage along with a high-availability cluster for computation. You don’t need to provision VM instances to use BigQuery; it automatically allocates computing resources as needed.
Table types
Standard
Structured data and well-defined schema
Clones
Writable copies of a standard table.
Lightweight, as BigQuery stores only the delta between the clone and its base table.
Snapshots
Point-in-time copies of the table.
BigQuery only stores the delta between a table snapshot and its base table.
Materialized views
Precomputed views that periodically cache the results of the view query.
External
Only the table metadata is kept in BigQuery storage.
The table definition points to an external data store, such as Cloud Storage.
Dataset
BigQuery organizes tables and other resources into logical containers called datasets.
Key features
Managed
You don’t need to provision storage resources or reserve units of storage.
Pay for the storage you use.
Durable
99.999999999% (11 9’s) annual durability
Encrypted
Data is automatically encrypted before being written to disk.
Customer-managed encryption keys are also supported.
Efficient
Efficient encoding format optimized for analytic workloads.
Compressed
Proprietary columnar compression, automatic data sorting, clustering, and compaction.
Pricing Models
Analysis Pricing
Cost of processing queries (SQL queries, user-defined functions, DML, DDL, BigQuery ML)
| On-Demand | Standard | Enterprise | Enterprise Plus |
|---|---|---|---|
| Pay per byte | Pay per slot hour | Pay per slot hour | Pay per slot hour |
| $5 per TB | $40 for 100 slots per hour | $40 for 100 slots per hour (1- to 3-year commitment discount) | $40 for 100 slots per hour (1- to 3-year commitment discount) |
| 2,000 concurrent slots (shared among all queries of the project) | Can put a cap on spending. | Standard + fixed-cost setup | Enterprise + multi-region redundancy and higher compliance. |
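To make the table concrete, here is a small back-of-the-envelope comparison. This is only a sketch using the list prices above ($5 per TB on demand, $40 per 100 slots per hour); real bills depend on autoscaling, commitments, and region.

```python
# Rough comparison of on-demand vs. slot-hour pricing, using the
# list prices from the table above (assumed, not an official calculator).

ON_DEMAND_PER_TB = 5.00        # USD per TB scanned (on-demand)
SLOT_HOUR_PER_100 = 40.00      # USD per 100 slots per hour (editions)

def on_demand_cost(tb_scanned: float) -> float:
    """Cost of scanning `tb_scanned` terabytes on demand."""
    return tb_scanned * ON_DEMAND_PER_TB

def editions_cost(slots: int, hours: float) -> float:
    """Cost of reserving `slots` slots for `hours` hours."""
    return (slots / 100) * SLOT_HOUR_PER_100 * hours

# A workload scanning 100 TB per month on demand:
print(on_demand_cost(100))        # 500.0
# The same month served by a 100-slot reservation running 10 hours:
print(editions_cost(100, 10))     # 400.0
```

The break-even point depends on how many bytes your queries scan versus how many slot hours they actually consume, which is why heavy, sustained workloads often favor reservations.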
Storage Pricing
Cost to store the data you load.
| | Active (modified in the last 90 days) | Long-term (not modified in the last 90 consecutive days) |
|---|---|---|
| Logical (uncompressed) | Starting at $0.02 per GB | Starting at $0.01 per GB |
| Physical (compressed) | Starting at $0.04 per GB | Starting at $0.02 per GB |
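The logical/physical choice can be sanity-checked with simple arithmetic. A minimal sketch, assuming the active-storage rates from the table above: since the physical rate is twice the logical rate, physical billing wins whenever your data compresses better than roughly 2:1.

```python
# Compare logical vs. physical storage billing using the active-storage
# rates from the table above (assumed: $0.02/GB logical, $0.04/GB physical).

LOGICAL_PER_GB = 0.02   # billed on uncompressed bytes
PHYSICAL_PER_GB = 0.04  # billed on compressed (actual) bytes

def monthly_storage_cost(logical_gb: float, compression_ratio: float,
                         model: str) -> float:
    """Monthly cost for `logical_gb` of data under the given billing model."""
    if model == "logical":
        return logical_gb * LOGICAL_PER_GB
    physical_gb = logical_gb / compression_ratio
    return physical_gb * PHYSICAL_PER_GB

# 1,000 GB of data that compresses 5:1:
print(monthly_storage_cost(1000, 5, "logical"))   # 20.0
print(monthly_storage_cost(1000, 5, "physical"))  # 8.0
```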
| Data Ingestion | Data Extraction |
|---|---|
| Batch loading table data from Cloud Storage or local files is free when using a shared slot pool. | Batch exporting table data to Cloud Storage is free when using a shared slot pool. |
| Streaming inserts are charged per row successfully inserted; individual rows are calculated using a 1 KB minimum ($0.01 per 200 MB). | Streaming reads use the Storage Read API, starting at $1.10 per TB. |
| BigQuery Storage Write API: $0.025 per GB; the first 2 TB per month are free. | |
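The 1 KB row minimum matters for narrow rows: many small rows are billed as if each were 1 KB. A rough sketch of the streaming-insert charge, assuming binary megabytes for the 200 MB unit (the exact unit convention is an assumption here):

```python
# Streaming-insert cost sketch: each row is billed on its size with a
# 1 KB minimum, at $0.01 per 200 MB ingested (per the table above).
# Assumption: 1 KB = 1024 bytes and 200 MB = 200 * 1024 * 1024 bytes.

PRICE_PER_200_MB = 0.01
MIN_ROW_BYTES = 1024

def streaming_insert_cost(row_sizes_bytes) -> float:
    """Cost of streaming the given rows, applying the 1 KB per-row minimum."""
    billed_bytes = sum(max(size, MIN_ROW_BYTES) for size in row_sizes_bytes)
    return billed_bytes / (200 * 1024 * 1024) * PRICE_PER_200_MB

# One million 500-byte rows are billed as 1 KB each:
cost = streaming_insert_cost([500] * 1_000_000)
print(round(cost, 4))  # 0.0488
```

Note how the minimum roughly doubles the bill for 500-byte rows compared with their raw size.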
Shared Slot Pool
- By default, you are not charged for batch loading from Cloud Storage or local files, as it uses shared pools of slots.
- There is no guarantee of the availability of the shared pool or its throughput.
- For large data, the job may wait until slots become available.
- If the target BigQuery dataset and the Cloud Storage bucket are co-located, network egress during loading is not charged.
Obtain dedicated capacity
- If shared slots are not available or your data is large, you can obtain dedicated capacity by assigning jobs to an editions reservation.
- This is not free, however, and you will lose access to the free pool as well.
Modes
- Single batch operation.
- Streaming data one record at a time or in small batches.
Shared Slot Pool
- By default, you are not charged for batch exporting, as it uses a shared pool of slots.
- There is no guarantee of the availability of the shared pool or its throughput.
Obtain dedicated capacity
- If shared slots are not available or your data is large, you can obtain dedicated capacity by assigning jobs to an editions reservation.
- This is not free, however, and you will lose access to the free pool as well.
Storage Read API
- Charged for the number of bytes read, with data size calculated from the size of each column’s data type.
- Charged for any data read in a read session, even if a ReadRows call fails. If a ReadRows call is canceled, you will be charged for the data read before cancellation.
- On-demand pricing includes 300 TB per month for each billing account.
- Exclusions
- Bytes scanned from temporary tables are free and do not count toward 300 TB.
- Associated egress cost is not included.
- Important: To lower the cost,
- Use partitioned and clustered tables.
- Reduce the data read with a WHERE clause that prunes partitions.
What is free of cost in BigQuery?
- Cached queries.
- Batch loading or export of data.
- Automatic re-clustering.
- Deleting tables, views, partitions, and datasets.
- Queries that result in an error.
- Metadata operations.
*The BigQuery free tier offers 10 GB of storage and 1 TB of query processing per month.
Billing Models
- Storage billing models:
  - Logical
    - Charged based on logical (uncompressed) bytes stored.
    - The default billing model.
  - Physical
    - Charged based on actual (compressed) bytes stored.
    - If your data compresses well, physical storage can save a good amount of storage and associated cost.
- How is data stored?
  - Data is stored in a compressed, columnar format.
  - When you run a query, the query engine distributes the work in parallel across multiple workers.
  - Workers scan the relevant tables in storage, process the query, and then gather the results.
  - BigQuery executes queries completely in memory, using a petabit network to move data to the worker nodes extremely fast.
- Compute billing models:
  - On-Demand
    - Charged for the number of bytes processed.
    - The first 1 TB each month is free.
  - Editions reservation
    - Charged for the number of slot_sec (one slot for one second) used by the query.
    - A slot is a unit of measure for BigQuery compute power.
    - E.g., a query using 100 slots for 10 seconds will accrue 1,000 slot_sec.
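The slot_sec arithmetic is easy to sketch. Assuming the $40 per 100 slot-hours figure from the pricing table above, a slot_sec figure can be converted to an approximate dollar cost:

```python
# slot_sec accounting sketch: a query's compute usage is slots x seconds,
# converted to dollars via the (assumed) $40 per 100 slot-hours rate.

SLOT_HOUR_PER_100 = 40.00  # USD per 100 slots per hour (from the table above)

def slot_sec(slots: int, seconds: float) -> float:
    """Compute usage of a query: slots multiplied by seconds."""
    return slots * seconds

def editions_query_cost(slot_seconds: float) -> float:
    """Approximate cost of `slot_seconds` at the assumed slot-hour rate."""
    per_slot_hour = SLOT_HOUR_PER_100 / 100      # USD for 1 slot for 1 hour
    return slot_seconds * per_slot_hour / 3600   # 3600 slot_sec per slot-hour

# The example above: 100 slots for 10 seconds.
usage = slot_sec(100, 10)
print(usage)                              # 1000
print(round(editions_query_cost(usage), 4))
```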
Decision flow for the consumption model.
Set a maximum number of bytes that can be processed by a single user on a given billing project.
When the limit is exceeded, the user gets a ‘quota exceeded’ error.
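The mechanism can be illustrated with a toy accounting class. This is purely a sketch of the behavior; in practice BigQuery enforces custom quotas server-side, and the class and names below are hypothetical:

```python
# Toy model of a per-user byte cap, mimicking BigQuery's custom quota
# behavior: once a user's processed bytes would pass the limit, the
# query fails with a 'quota exceeded' error. (Illustrative only.)

class QuotaExceeded(Exception):
    pass

class UserQuota:
    def __init__(self, max_bytes: int):
        self.max_bytes = max_bytes
        self.used = 0

    def charge(self, query_bytes: int) -> None:
        """Record a query's processed bytes, or fail if over the cap."""
        if self.used + query_bytes > self.max_bytes:
            raise QuotaExceeded("quota exceeded")
        self.used += query_bytes

quota = UserQuota(max_bytes=10 * 1024**4)   # cap one user at 10 TiB
quota.charge(4 * 1024**4)                   # first query: fine
try:
    quota.charge(7 * 1024**4)               # 4 + 7 TiB would exceed the cap
except QuotaExceeded as e:
    print(e)                                # quota exceeded
```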
Best Practices
Ingestion and extraction
- Use the free shared slot pool whenever possible; choose dedicated capacity only for large data.
- Avro and Parquet file formats provide better performance on load.
- Compressed files take longer to load into BigQuery. To optimize load performance, uncompress your data first.
- Avro is the exception: compressed Avro files load faster than uncompressed ones.
- Use native connectors to read/write data to BigQuery for better performance rather than building custom integrations.
Storage Best Practices
- Use long-term storage (50% cheaper)
- Use time travel wisely
  - Recover from mistakes like accidental modification or deletion.
- Set a default table expiration for transient datasets
- Set expirations for partitions and tables
- Use snapshots for longer backups
  - Time travel works for the past 7 days only.
  - Snapshots of a particular point in time can be stored for as long as you want.
  - Snapshots minimize storage cost, as BigQuery stores only the bytes that differ between a snapshot and its base table.
  - Important: There is no initial storage cost for a snapshot, but if the base table changes and the original data also exists in a snapshot, you will be charged for storing the changed or deleted data.
- Use clones for modifiable copies of production data
  - A lightweight copy of the table.
  - Independent of the base table; any changes made to the base will not be reflected in the clone.
  - The cost of a table clone is the changed data plus new data.
- Archive data into a new BigQuery table
- Move ‘cold data’ to Google Cloud Storage
Workload Management
- Use multiple billing projects
- Mixing and switching pricing models
- Know how many slots you need
- Use separate reservations for compute-intensive work
- Use baseline slots for critical reservations
- Use commitments for sustained usage
- Take advantage of slots sharing
- A dynamic approach to workload management
Compute cost optimization
- Follow SQL Best practices
- Clustering
- Partitioning
- Select only the columns you need, and curate filtering, ordering, and sharding
- Denormalize if needed
- Choose the right functions and pay attention to JavaScript user-defined functions
- Choose the right data types
- Optimize join and common table expressions
- Look for anti-patterns
- Use BI Engine to reduce the computing cost
- Use Materialized views to reduce costs
Keep a close watch on the cost
- Budget alerts
- BigQuery reports
- BigQuery admin resource charts
- Looker studio dashboard
Conclusion
BigQuery’s serverless architecture and flexible pricing models make it possible to get strong analytics performance while keeping costs under control, provided you pick the right pricing model, storage strategy, and query practices for your workloads.
References:
- https://cloud.google.com/blog/products/data-analytics/introducing-new-bigquery-pricing-editions
- https://cloud.google.com/blog/products/data-analytics/new-blog-series-bigquery-explained-overview