In today’s information-driven world, data analytics is an integral part of every business decision. And because analytics is such an important factor in business, the cost of analytics tools and technologies matters just as much, to ensure a high return on investment and minimal waste of resources.
BigQuery is one of the leading data analytics technologies for organizations of all sizes. BigQuery not only helps analyze data but also helps organizations with real-time decision-making, reporting, and future predictions.
Architecture
BigQuery is a completely serverless enterprise data warehouse. In BigQuery, storage and compute are decoupled so that both can scale independently on demand.
This architecture offers flexibility to customers: they don’t need to keep compute resources up and running all the time, and they no longer need to worry about system engineering or database operations while using BigQuery.
BigQuery has distributed, replicated storage along with a high-availability cluster for computation. You don’t need to provision VM instances to use BigQuery; it automatically allocates computing resources as needed.
Table types
Standard
Structured data and well-defined schema
Clones
Writable copies of a standard table.
Lightweight, as BigQuery stores only the delta between the clone and its base table.
Snapshots
Point-in-time copies of the table.
BigQuery only stores the delta between a table snapshot and its base table.
Materialized views
Precomputed views that periodically cache the results of the view query.
External
Only the table metadata is kept in BigQuery storage.
The table definition points to an external data store, such as Cloud Storage.
Dataset
BigQuery organizes tables and other resources into logical containers called datasets.
Key features
Managed
You don’t need to provision storage resources or reserve units of storage.
Pay for the storage you use.
Durable
99.999999999% (11 9’s) annual durability
Encrypted
Data is automatically encrypted before being written to disk.
Customer-managed encryption keys are also supported.
Efficient
Efficient encoding format optimized for analytic workloads.
Compressed
Proprietary columnar compression, automatic data sorting, clustering, and compaction.
Pricing Models
Analysis Pricing
Cost of processing queries (SQL queries, user-defined functions, DML, DDL, BigQuery ML)
| On-Demand | Standard | Enterprise | Enterprise Plus |
|---|---|---|---|
| Pay per byte | Pay per slot hour | Pay per slot hour | Pay per slot hour |
| $5 per TB | $40 for 100 slots per hour | $40 for 100 slots per hour (1- to 3-year commitment discount) | $40 for 100 slots per hour (1- to 3-year commitment discount) |
| 2,000 concurrent slots (shared among all queries of the project) | Can put a cap on spending. | Standard + fixed-cost setup | Enterprise + multi-region redundancy and higher compliance. |
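To make the table concrete, here is a small back-of-the-envelope comparison. This is only a sketch using the list prices above ($5 per TB on demand, $40 per 100 slots per hour); real bills depend on autoscaling, commitments, and region.

```python
# Rough comparison of on-demand vs. slot-hour pricing, using the
# list prices from the table above (assumed, not an official calculator).

ON_DEMAND_PER_TB = 5.00        # USD per TB scanned (on-demand)
SLOT_HOUR_PER_100 = 40.00      # USD per 100 slots per hour (editions)

def on_demand_cost(tb_scanned: float) -> float:
    """Cost of scanning `tb_scanned` terabytes on demand."""
    return tb_scanned * ON_DEMAND_PER_TB

def editions_cost(slots: int, hours: float) -> float:
    """Cost of reserving `slots` slots for `hours` hours."""
    return (slots / 100) * SLOT_HOUR_PER_100 * hours

# A workload scanning 100 TB per month on demand:
print(on_demand_cost(100))        # 500.0
# The same month served by a 100-slot reservation running 10 hours:
print(editions_cost(100, 10))     # 400.0
```

The break-even point depends on how many bytes your queries scan versus how many slot hours they actually consume, which is why heavy, sustained workloads often favor reservations.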
Storage Pricing
Cost to store the data you load.
| | Active (modified in the last 90 days) | Long-term (not modified in the last 90 consecutive days) |
|---|---|---|
| Logical (uncompressed) | Starting at $0.02 per GB | Starting at $0.01 per GB |
| Physical (compressed) | Starting at $0.04 per GB | Starting at $0.02 per GB |
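The logical/physical choice can be sanity-checked with simple arithmetic. A minimal sketch, assuming the active-storage rates from the table above: since the physical rate is twice the logical rate, physical billing wins whenever your data compresses better than roughly 2:1.

```python
# Compare logical vs. physical storage billing using the active-storage
# rates from the table above (assumed: $0.02/GB logical, $0.04/GB physical).

LOGICAL_PER_GB = 0.02   # billed on uncompressed bytes
PHYSICAL_PER_GB = 0.04  # billed on compressed (actual) bytes

def monthly_storage_cost(logical_gb: float, compression_ratio: float,
                         model: str) -> float:
    """Monthly cost for `logical_gb` of data under the given billing model."""
    if model == "logical":
        return logical_gb * LOGICAL_PER_GB
    physical_gb = logical_gb / compression_ratio
    return physical_gb * PHYSICAL_PER_GB

# 1,000 GB of data that compresses 5:1:
print(monthly_storage_cost(1000, 5, "logical"))   # 20.0
print(monthly_storage_cost(1000, 5, "physical"))  # 8.0
```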
| Data Ingestion | Data Extraction |
|---|---|
| Batch loading table data from Cloud Storage or local files is free when using a shared slot pool. | Batch exporting table data to Cloud Storage is free when using a shared slot pool. |
| Streaming inserts are charged per row successfully inserted; individual rows are calculated using a 1 KB minimum ($0.01 per 200 MB). | Streaming reads use the Storage Read API, starting at $1.10 per TB. |
| BigQuery Storage Write API: $0.025 per GB; the first 2 TB per month are free. | |
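The 1 KB row minimum matters for narrow rows: many small rows are billed as if each were 1 KB. A rough sketch of the streaming-insert charge, assuming binary megabytes for the 200 MB unit (the exact unit convention is an assumption here):

```python
# Streaming-insert cost sketch: each row is billed on its size with a
# 1 KB minimum, at $0.01 per 200 MB ingested (per the table above).
# Assumption: 1 KB = 1024 bytes and 200 MB = 200 * 1024 * 1024 bytes.

PRICE_PER_200_MB = 0.01
MIN_ROW_BYTES = 1024

def streaming_insert_cost(row_sizes_bytes) -> float:
    """Cost of streaming the given rows, applying the 1 KB per-row minimum."""
    billed_bytes = sum(max(size, MIN_ROW_BYTES) for size in row_sizes_bytes)
    return billed_bytes / (200 * 1024 * 1024) * PRICE_PER_200_MB

# One million 500-byte rows are billed as 1 KB each:
cost = streaming_insert_cost([500] * 1_000_000)
print(round(cost, 4))  # 0.0488
```

Note how the minimum roughly doubles the bill for 500-byte rows compared with their raw size.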
Shared Slot Pool
- By default, you are not charged for batch loading from Cloud Storage or local files, as it uses shared pools of slots.
- There is no guarantee of the availability of the shared pool or its throughput.
- For large data, the job may wait until slots become available.
- If the target BigQuery dataset and the Cloud Storage bucket are co-located, network egress during loading is not charged.
Obtain dedicated capacity
- If shared slots are not available or your data is large, you can obtain dedicated capacity by assigning jobs to an editions reservation.
- This is not free, however, and you will lose access to the free pool as well.
Modes
- Single batch operation.
- Streaming data one record at a time or in small batches.
Shared Slot Pool
- By default, you are not charged for batch exporting, as it uses a shared pool of slots.
- There is no guarantee of the availability of the shared pool or its throughput.
Obtain dedicated capacity
- If shared slots are not available or your data is large, you can obtain dedicated capacity by assigning jobs to an editions reservation.
- This is not free, however, and you will lose access to the free pool as well.
Storage Read API
- Charged for the number of bytes read, with data size calculated from the size of each column’s data type.
- Charged for any data read in a read session, even if a ReadRows call fails. If a ReadRows call is canceled, you will be charged for the data read before cancellation.
- On-demand pricing includes 300 TB per month for each billing account.
- Exclusions
- Bytes scanned from temporary tables are free and do not count toward 300 TB.
- Associated egress cost is not included.
- Important: To lower the cost,
- Use partitioned and clustered tables.
- Reduce the data read with a WHERE clause that prunes partitions.
What is free of cost in BigQuery?
- Cached queries.
- Batch loading or export of data.
- Automatic re-clustering.
- Deleting tables, views, partitions, and datasets.
- Queries that result in an error.
- Metadata operations.
*The BigQuery free tier offers 10 GB of storage and 1 TB of query processing per month.
Billing Models
- Storage billing models:
  - Logical
    - Charged based on logical (uncompressed) bytes stored.
    - The default billing model.
  - Physical
    - Charged based on actual (compressed) bytes stored.
    - If your data compresses well, physical storage can save a good amount of storage and associated cost.
- How is data stored?
  - Data is stored in a compressed, columnar format.
  - When you run a query, the query engine distributes the work in parallel across multiple workers.
  - Workers scan the relevant tables in storage, process the query, and then gather the results.
  - BigQuery executes queries completely in memory, using a petabit network to move data to the worker nodes extremely fast.
- Compute billing models:
  - On-Demand
    - Charged for the number of bytes processed.
    - The first 1 TB each month is free.
  - Editions reservation
    - Charged for the number of slot_sec (one slot for one second) used by the query.
    - A slot is a unit of measure for BigQuery compute power.
    - E.g., a query using 100 slots for 10 seconds will accrue 1,000 slot_sec.
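The slot_sec arithmetic is easy to sketch. Assuming the $40 per 100 slot-hours figure from the pricing table above, a slot_sec figure can be converted to an approximate dollar cost:

```python
# slot_sec accounting sketch: a query's compute usage is slots x seconds,
# converted to dollars via the (assumed) $40 per 100 slot-hours rate.

SLOT_HOUR_PER_100 = 40.00  # USD per 100 slots per hour (from the table above)

def slot_sec(slots: int, seconds: float) -> float:
    """Compute usage of a query: slots multiplied by seconds."""
    return slots * seconds

def editions_query_cost(slot_seconds: float) -> float:
    """Approximate cost of `slot_seconds` at the assumed slot-hour rate."""
    per_slot_hour = SLOT_HOUR_PER_100 / 100      # USD for 1 slot for 1 hour
    return slot_seconds * per_slot_hour / 3600   # 3600 slot_sec per slot-hour

# The example above: 100 slots for 10 seconds.
usage = slot_sec(100, 10)
print(usage)                              # 1000
print(round(editions_query_cost(usage), 4))
```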
Decision flow for the consumption model.
Set a maximum number of bytes that can be processed by a single user on a given billing project.
When the limit is exceeded, the user gets a ‘quota exceeded’ error.
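The mechanism can be illustrated with a toy accounting class. This is purely a sketch of the behavior; in practice BigQuery enforces custom quotas server-side, and the class and names below are hypothetical:

```python
# Toy model of a per-user byte cap, mimicking BigQuery's custom quota
# behavior: once a user's processed bytes would pass the limit, the
# query fails with a 'quota exceeded' error. (Illustrative only.)

class QuotaExceeded(Exception):
    pass

class UserQuota:
    def __init__(self, max_bytes: int):
        self.max_bytes = max_bytes
        self.used = 0

    def charge(self, query_bytes: int) -> None:
        """Record a query's processed bytes, or fail if over the cap."""
        if self.used + query_bytes > self.max_bytes:
            raise QuotaExceeded("quota exceeded")
        self.used += query_bytes

quota = UserQuota(max_bytes=10 * 1024**4)   # cap one user at 10 TiB
quota.charge(4 * 1024**4)                   # first query: fine
try:
    quota.charge(7 * 1024**4)               # 4 + 7 TiB would exceed the cap
except QuotaExceeded as e:
    print(e)                                # quota exceeded
```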
Best Practices
Ingestion and extraction
- Use the free shared slot pool whenever possible; choose dedicated capacity only for large data.
- Avro and Parquet file formats provide better performance on load.
- Compressed files take longer to load into BigQuery. To optimize load performance, uncompress your data first.
- Avro is the exception: compressed Avro files load faster than uncompressed ones.
- Use native connectors to read/write data to BigQuery for better performance rather than building custom integrations.
Storage Best Practices
- Use long-term storage (50% cheaper)
- Use time travel wisely
  - Recover from mistakes like accidental modification or deletion.
- Set a default table expiration for transient datasets
- Set expirations for partitions and tables
- Use snapshots for longer backups
  - Time travel works for the past 7 days only.
  - Snapshots of a particular point in time can be stored for as long as you want.
  - Snapshots minimize storage cost, as BigQuery stores only the bytes that differ between a snapshot and its base table.
  - Important: There is no initial storage cost for a snapshot, but if the base table changes and the original data also exists in a snapshot, you will be charged for storing the changed or deleted data.
- Use clones for modifiable copies of production data
  - A lightweight copy of the table.
  - Independent of the base table; any changes made to the base will not be reflected in the clone.
  - The cost of a table clone is the changed data plus new data.
- Archive data into a new BigQuery table
- Move ‘cold data’ to Google Cloud Storage
Workload Management
- Use multiple billing projects
- Mixing and switching pricing models
- Know how many slots you need
- Use separate reservations for compute-intensive work
- Use baseline slots for critical reservations
- Use commitments for sustained usage
- Take advantage of slots sharing
- A dynamic approach to workload management
Compute cost optimization
- Follow SQL Best practices
- Clustering
- Partitioning
- Select only the columns you need, and curate filtering, ordering, and sharding
- Denormalize if needed
- Choose the right functions and pay attention to JavaScript user-defined functions
- Choose the right data types
- Optimize join and common table expressions
- Look for anti-patterns
- Use BI Engine to reduce the computing cost
- Use Materialized views to reduce costs
Keep a close watch on the cost
- Budget alerts
- BigQuery reports
- BigQuery admin resource charts
- Looker studio dashboard
Conclusion
BigQuery’s serverless architecture and flexible pricing models make it possible to get strong analytics performance while keeping costs under control, provided you pick the right pricing model, storage strategy, and query practices for your workloads.
References:
- https://cloud.google.com/blog/products/data-analytics/introducing-new-bigquery-pricing-editions
- https://cloud.google.com/blog/products/data-analytics/new-blog-series-bigquery-explained-overview