
BigQuery Pricing and Cost Optimization

In today’s data-driven world, analytics is an integral part of every business decision. And because analytics is such an important factor in business, the cost of analytics tools and technologies is equally important to ensure a high return on investment and minimal waste of resources.

BigQuery is one of the leading data analytics technologies for organizations of all sizes. It not only helps analyze data but also helps organizations with real-time decision making, reporting, and future predictions.

Architecture

BigQuery is a completely serverless enterprise data warehouse. In BigQuery, storage and compute are decoupled so that each can scale independently on demand.

This architecture offers customers a lot of flexibility: they don’t need to keep compute resources up and running all the time, and they no longer need to worry about system engineering and database operations while using BigQuery.

BigQuery uses distributed, replicated storage along with a high-availability cluster for computation. You don’t need to provision VM instances to use BigQuery; it allocates computing resources automatically as needed.
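
To make the serverless model concrete, here is a minimal sketch using the google-cloud-bigquery Python client against a public dataset. Nothing is provisioned up front; BigQuery allocates compute (slots) when the query runs. The project ID is hypothetical:

```python
# pip install google-cloud-bigquery
from google.cloud import bigquery

# No VM or cluster provisioning: the client just submits the query and
# BigQuery allocates slots on demand.
client = bigquery.Client(project="my-project")  # hypothetical project ID

query = """
    SELECT name, SUM(number) AS total
    FROM `bigquery-public-data.usa_names.usa_1910_2013`
    GROUP BY name
    ORDER BY total DESC
    LIMIT 10
"""

for row in client.query(query).result():  # blocks until the job finishes
    print(row["name"], row["total"])
```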

  • Table types

    • Standard

      • Structured data and well-defined schema

    • Clones

      • Writable copies of a standard table.

      • Lightweight, as BigQuery stores only the delta between the clone and its base table (see the DDL sketch after this list).

    • Snapshots

      • Point-in-time copies of the table.

      • BigQuery only stores the delta between a table snapshot and its base table.

    • Materialized views

      • Precomputed views that periodically cache the results of the view query.

    • External

      • Only the table metadata is kept in BigQuery storage.

      • The table definition points to an external data store, such as Cloud Storage.

  • Dataset

    • BigQuery organizes tables and other resources into logical containers called datasets.

  • Key features

    • Managed

      • You don’t need to provision storage resources or reserve units of storage.

      • Pay for the storage you use.

    • Durable

      • 99.999999999% (11 9’s) annual durability

    • Encrypted

      • Data is automatically encrypted before being written to disk.

      • Customer-managed encryption keys are also possible.

    • Efficient

      • Efficient encoding format, optimized for analytical workloads.

    • Compressed

      • Proprietary columnar compression, automatic data sorting, clustering, and compaction.
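
As a quick illustration of clones and snapshots from the table-type list above, here is a hedged sketch of the CREATE TABLE ... CLONE and CREATE SNAPSHOT TABLE DDL run through the Python client. The project, dataset, and table names are hypothetical:

```python
from google.cloud import bigquery

client = bigquery.Client()

# Clone: a writable, lightweight copy. Only the delta vs. the base table
# is stored (and billed).
client.query("""
    CREATE TABLE `my-project.sales.orders_clone`
    CLONE `my-project.sales.orders`
""").result()

# Snapshot: a read-only, point-in-time copy; here it expires automatically
# after 30 days.
client.query("""
    CREATE SNAPSHOT TABLE `my-project.sales.orders_snap`
    CLONE `my-project.sales.orders`
    OPTIONS (
      expiration_timestamp = TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
    )
""").result()
```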

Pricing Models

Analysis Pricing

Cost of processing queries (SQL, user-defined functions, DML, DDL, BQML).

  • On-Demand
    • Pay per byte processed: $5 per TB.
    • Up to 2,000 concurrent slots, shared among all queries of the project.
  • Standard
    • Pay per slot hour: $40 per 100 slots per hour.
    • Lets you put a cap on spending.
  • Enterprise
    • Pay per slot hour: $40 per 100 slots per hour, with 1 to 3 year commitment discounts.
    • Everything in Standard, plus a fixed-cost setup.
  • Enterprise Plus
    • Pay per slot hour: $40 per 100 slots per hour, with 1 to 3 year commitment discounts.
    • Everything in Enterprise, plus multi-region redundancy and higher compliance.
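
Under on-demand pricing you can estimate a query’s cost before running it with a dry run, which reports the bytes that would be processed without executing anything. A minimal sketch, using the $5/TB on-demand rate quoted above:

```python
from google.cloud import bigquery

client = bigquery.Client()

# A dry run is free: it validates the query and returns bytes processed.
job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
job = client.query(
    "SELECT name, number FROM `bigquery-public-data.usa_names.usa_1910_2013`",
    job_config=job_config,
)

tb = job.total_bytes_processed / 1024**4
print(f"Would process {job.total_bytes_processed:,} bytes "
      f"~= ${tb * 5:.4f} on-demand")
```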

Storage Pricing

Cost to store the data you load.


Active storage is data modified in the last 90 days; long-term storage is data that has not been modified for 90 consecutive days.

  • Logical (uncompressed): starting at $0.02 per GB active, $0.01 per GB long-term.
  • Physical (compressed): starting at $0.04 per GB active, $0.02 per GB long-term.

Data Ingestion and Extraction Pricing
For analytics, data needs to be ingested into the data platform and may need to be extracted, too.

Data Ingestion

  • Batch loading is free when using the shared slot pool.
  • Streaming inserts: $0.01 per 200 MB, charged for rows successfully inserted, with each row counted as at least 1 KB.
  • BigQuery Storage Write API: $0.025 per GB; the first 2 TB per month are free.

Data Extraction

  • Batch exporting table data to Cloud Storage is free when using the shared slot pool.
  • Streaming reads use the Storage Read API: starting at $1.10 per TB.

Ingestion Pricing

Shared Slot Pool

  • By default, you are not charged for batch loading from Cloud Storage or local files, as it uses the shared pool of slots (see the load sketch after this list).
  • There is no guarantee of shared-pool availability or throughput.
  • For large loads, the job may wait until slots become available.
  • If the target BigQuery dataset and the Cloud Storage bucket are co-located, there is no network egress charge for loading.
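
A sketch of a batch load from Cloud Storage that uses the free shared slot pool. The bucket, dataset, and table names are hypothetical; Parquet is used since self-describing columnar formats load well, as noted under best practices below:

```python
from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig(source_format=bigquery.SourceFormat.PARQUET)

# Batch loads draw on the free shared slot pool by default: no charge,
# but no throughput guarantee either.
load_job = client.load_table_from_uri(
    "gs://my-bucket/exports/orders-*.parquet",  # hypothetical bucket
    "my-project.sales.orders",                  # hypothetical table
    job_config=job_config,
)
load_job.result()  # waits; may queue until shared slots are available

print(client.get_table("my-project.sales.orders").num_rows, "rows loaded")
```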

Obtain Dedicated Capacity

  • If shared slots are unavailable or your data is large, you can obtain dedicated capacity by assigning jobs to an editions reservation.
  • This is not free, however, and you also lose access to the free shared pool.

Modes

  • Single batch operation.
  • Streaming data one record at a time or in small batches.

  • Batch loading
    • Free using the shared slot pool.
    • For guaranteed capacity, choose editions reservations.
  • Storage Write API
    • $0.025 per GB; the first 2 TB per month are free.
  • Streaming inserts
    • Charged for rows successfully inserted at $0.01 per 200 MB, with 1 KB as the minimum row size (see the sketch below).
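
For comparison with batch loading, a sketch of legacy streaming inserts via insert_rows_json, billed at $0.01 per 200 MB with a 1 KB per-row minimum. The table and payload are hypothetical, and the table must already exist with a matching schema; for new pipelines the Storage Write API is generally the cheaper option:

```python
from google.cloud import bigquery

client = bigquery.Client()

rows = [
    {"order_id": "o-1001", "amount": 19.99},  # each row billed as >= 1 KB
    {"order_id": "o-1002", "amount": 5.49},
]

# Only rows successfully inserted are charged; failures come back as errors.
errors = client.insert_rows_json("my-project.sales.orders", rows)
if errors:
    print("Insert errors:", errors)
```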


Extraction Pricing

Shared Slot Pool

  • By default, you are not charged for batch exporting, as it uses the shared pool of slots (see the export sketch after this list).
  • There is no guarantee of shared-pool availability or throughput.
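
A sketch of a free batch export to Cloud Storage using extract_table (source table and destination bucket are hypothetical):

```python
from google.cloud import bigquery

client = bigquery.Client()

# Batch exports also use the free shared slot pool by default.
extract_job = client.extract_table(
    "my-project.sales.orders",              # hypothetical source table
    "gs://my-bucket/backups/orders-*.csv",  # hypothetical destination
)
extract_job.result()  # waits for the export to complete
```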

Obtain Dedicated Capacity

  • If shared slots are unavailable or your data is large, you can obtain dedicated capacity by assigning jobs to an editions reservation.
  • This is not free, however, and you also lose access to the free shared pool.

Storage Read API

  • Charged by the number of bytes read, calculated from the data size, which is determined by the size of each column’s data type.
  • Charged for any data read in a read session, even if a ReadRows call fails. If a ReadRows call is canceled, you are charged for the data read before cancellation.
  • On-demand pricing, with 300 TB per month free for each billing account.
  • Exclusions
    • Bytes scanned from temporary tables are free and do not count toward the 300 TB.
    • Associated egress costs are not included.
  • Important: to lower the cost,
    1. Use partitioned and clustered tables.
    2. Reduce the data read with a WHERE clause that prunes partitions (see the sketch below).
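
A sketch of both tips together: a partitioned and clustered table, and a query whose WHERE clause prunes partitions so that far less data is scanned. The schema and names are hypothetical:

```python
from google.cloud import bigquery

client = bigquery.Client()

# Partition by day and cluster by a frequently filtered column.
client.query("""
    CREATE TABLE `my-project.sales.events`
    (event_ts TIMESTAMP, customer_id STRING, amount NUMERIC)
    PARTITION BY DATE(event_ts)
    CLUSTER BY customer_id
""").result()

# The filter on the partitioning column lets BigQuery skip whole partitions,
# so only one week of data is read instead of the full table.
client.query("""
    SELECT customer_id, SUM(amount) AS total
    FROM `my-project.sales.events`
    WHERE DATE(event_ts) BETWEEN '2024-01-01' AND '2024-01-07'
    GROUP BY customer_id
""").result()
```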

What is free of cost in BigQuery?

  • Cached queries (see the sketch below).
  • Batch loading and exporting of data.
  • Automatic re-clustering.
  • Deleting tables, views, partitions, and datasets.
  • Queries that result in an error.
  • Metadata operations.

*The BigQuery free tier offers 10 GB of storage and 1 TB of query processing per month.
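
Since cached results are free, it can be worth checking whether a repeated query was served from cache. A sketch using the job’s cache_hit flag:

```python
from google.cloud import bigquery

client = bigquery.Client()
sql = "SELECT COUNT(*) FROM `bigquery-public-data.usa_names.usa_1910_2013`"

first = client.query(sql)
first.result()

second = client.query(sql)  # identical query; result caching is on by default
second.result()

print("second run served from cache:", second.cache_hit)  # typically True
```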

Billing Models

Storage
  • Billing Models
    • Logical
      • Charged based on logical bytes stored.
      • Default billing model.
    • Physical
      • Charged based on actual bytes stored.
      • If your data compresses well, the physical billing model can save you a significant amount of storage and associated cost.
  • How is data stored?
    • Data is stored in a compressed, columnar format.
    • When you run a query, the query engine distributes the work in parallel across multiple workers.
    • Workers scan the relevant tables in storage, process the query, and then gather the results.
    • BigQuery executes queries completely in memory, using a petabit network to move data to the worker nodes extremely fast.

Compute

  • Billing Models:
    1. On-Demand
      • Charged for the number of bytes processed.
      • The first 1 TB per month is free.
    2. Editions Reservation
      • Charged for the number of slot_sec (one slot for one second) used by the query.
      • A slot is a unit of measure for BigQuery compute power.
      • For example, a query using 100 slots for 10 seconds accrues 1,000 slot_sec (see the sketch below).

You can mix and match these models to suit different needs.
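
A back-of-the-envelope sketch comparing the two models for a single hypothetical query, using the list prices quoted earlier ($5 per TB on-demand, $40 per 100 slots per hour for editions). Real workloads and sustained usage will change the math:

```python
# Hypothetical query: scans 1 TB and uses 100 slots for 10 seconds.
bytes_scanned_tb = 1.0
slot_sec = 100 * 10  # 1,000 slot_sec

on_demand_cost = bytes_scanned_tb * 5.00          # $5 per TB scanned
slot_hour_rate = 40 / 100                          # $0.40 per slot hour
editions_cost = slot_sec / 3600 * slot_hour_rate   # billed per slot_sec

print(f"on-demand: ${on_demand_cost:.2f}")  # $5.00
print(f"editions:  ${editions_cost:.4f}")   # ~$0.11 for this one query
```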

[Figure: decision flow for choosing a consumption model]

Custom Quotas

Set a maximum number of bytes that can be processed by a single user in a given billing project.

When the limit is exceeded, the user gets a ‘quota exceeded’ error.
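
Custom quotas themselves are configured at the project level (for example, in the Cloud console), but a related per-query guard in the client library is maximum_bytes_billed, which fails a query before it can bill more than the cap. A sketch with a hypothetical table:

```python
from google.cloud import bigquery

client = bigquery.Client()

# Fail fast if the query would bill more than ~10 GB.
job_config = bigquery.QueryJobConfig(maximum_bytes_billed=10 * 1024**3)

try:
    client.query(
        "SELECT * FROM `my-project.sales.events`",  # hypothetical table
        job_config=job_config,
    ).result()
except Exception as exc:  # surfaces a 'bytes billed limit exceeded' error
    print("Query blocked by byte cap:", exc)
```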

Best Practices

Ingestion and Extraction

  • Use the free shared slot pool whenever possible; choose dedicated capacity only for large data.
  • The Avro and Parquet file formats provide better load performance.
  • Compressed files generally take longer to load into BigQuery. To optimize load performance, uncompress your data first.
  • Avro is the exception: compressed Avro files load faster than uncompressed ones.
  • Use native connectors to read and write BigQuery data rather than building custom integrations; they perform better (see the sketch after this list).
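
As an example of a native connector, the Python client can hand results to pandas via to_dataframe(), which streams through the Storage Read API when the bigquery-storage extra is installed. A sketch against a public dataset:

```python
# pip install "google-cloud-bigquery[pandas,bqstorage]"
from google.cloud import bigquery

client = bigquery.Client()

df = client.query("""
    SELECT name, SUM(number) AS total
    FROM `bigquery-public-data.usa_names.usa_1910_2013`
    GROUP BY name
""").result().to_dataframe()  # uses the Storage Read API when available

print(df.head())
```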

Storage Best Practices

  • Use long-term storage (50% cheaper).
  • Use time travel wisely (see the sketch after this list).
    • Recover from mistakes like accidental modification or deletion.
    • Set a default table expiration for transient datasets.
    • Set expirations for partitions and tables.
  • Use snapshots for longer backups.
    • Time travel only covers the past 7 days.
    • Snapshots of a particular point in time can be kept for as long as you want.
    • Storage cost is minimized, as BigQuery stores only the bytes that differ between a snapshot and its base table.
    • Important: a snapshot has no initial storage cost, but when data in the base table changes or is deleted while it still exists in a snapshot, you are charged for storing that changed or deleted data.
  • Use clones for modifiable copies of production data.
    • A lightweight copy of the table.
    • Independent of the base table: changes made to the base are not reflected in the clone.
    • You are charged for the clone’s changed data plus new data.
  • Archive data into a new BigQuery table.
  • Move ‘cold data’ to Google Cloud Storage.
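
A sketch of two of these practices: querying a table as it was at an earlier point in time, and setting a table expiration so transient data cleans itself up. Names and intervals are hypothetical:

```python
import datetime
from google.cloud import bigquery

client = bigquery.Client()

# Time travel: read the table as it was one hour ago (up to 7 days back).
client.query("""
    SELECT *
    FROM `my-project.sales.orders`
    FOR SYSTEM_TIME AS OF TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR)
""").result()

# Expiration: let a transient staging table delete itself after 3 days.
table = client.get_table("my-project.sales.orders_staging")
table.expires = (datetime.datetime.now(datetime.timezone.utc)
                 + datetime.timedelta(days=3))
client.update_table(table, ["expires"])
```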

Workload Management

  • Use multiple billing projects.
  • Mix and switch pricing models.
  • Know how many slots you need (see the query sketch after this list).
  • Use separate reservations for compute-intensive work.
    • Use baseline slots for critical reservations.
    • Use commitments for sustained usage.
  • Take advantage of slot sharing.
  • Take a dynamic approach to workload management.
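
To size reservations (“know how many slots you need”), you can look at actual slot consumption in the INFORMATION_SCHEMA jobs views. A sketch of average slot usage over the last week; the region-us qualifier is an assumption and should match your dataset location:

```python
from google.cloud import bigquery

client = bigquery.Client()

# Average slots = total slot-milliseconds / elapsed milliseconds in the window.
usage = client.query("""
    SELECT
      SUM(total_slot_ms) / (1000 * 60 * 60 * 24 * 7) AS avg_slots_last_7_days
    FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
    WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
""").result()

for row in usage:
    print(f"average slots over the window: {row.avg_slots_last_7_days:.1f}")
```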

Compute cost optimization

  • Follow SQL best practices.
    • Clustering.
    • Partitioning.
    • Select only the columns you need, and curate filtering, ordering, and sharding.
    • Denormalize if needed.
    • Choose the right functions, and pay attention to JavaScript user-defined functions.
    • Choose the right data types.
    • Optimize joins and common table expressions.
    • Look for anti-patterns.
  • Use BI Engine to reduce compute cost.
  • Use materialized views to reduce cost (see the sketch after this list).
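
A sketch of a materialized view, which precomputes and incrementally maintains an aggregate so that queries against it (or queries BigQuery can rewrite to use it) scan far fewer bytes. The names are hypothetical:

```python
from google.cloud import bigquery

client = bigquery.Client()

# Precompute a daily revenue aggregate over the base table.
client.query("""
    CREATE MATERIALIZED VIEW `my-project.sales.daily_revenue` AS
    SELECT DATE(event_ts) AS day, SUM(amount) AS revenue
    FROM `my-project.sales.events`
    GROUP BY day
""").result()

# Queries hitting the precomputed aggregate read far fewer bytes than
# re-aggregating the base table every time.
client.query(
    "SELECT * FROM `my-project.sales.daily_revenue` ORDER BY day DESC LIMIT 7"
).result()
```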

Keep a close watch on the cost

  • Budget alerts.
  • BigQuery reports (see the query sketch after this list).
  • BigQuery admin resource charts.
  • Looker Studio dashboards.
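
Beyond the built-in reports, the same INFORMATION_SCHEMA views can power a simple cost watch, e.g. estimated on-demand spend per user over the last 30 days. A sketch; the $5/TB rate and the region-us qualifier are assumptions carried over from earlier in the post:

```python
from google.cloud import bigquery

client = bigquery.Client()

spend = client.query("""
    SELECT
      user_email,
      SUM(total_bytes_billed) / POW(1024, 4) * 5 AS est_on_demand_usd
    FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
    WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
      AND job_type = 'QUERY'
    GROUP BY user_email
    ORDER BY est_on_demand_usd DESC
""").result()

for row in spend:
    print(row.user_email, f"${row.est_on_demand_usd:.2f}")
```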

Conclusion

BigQuery offers the flexibility to choose the options best suited to your storage and computation requirements. Be deliberate about picking the right match, so that you end up with neither starved workloads nor wasted resources.

References:
  • https://cloud.google.com/blog/products/data-analytics/introducing-new-bigquery-pricing-editions
  • https://cloud.google.com/blog/products/data-analytics/new-blog-series-bigquery-explained-overview
