Extreme Programming

This post is about my basic understanding of Extreme Programming, XP for short. I came to know about XP after joining Thoughtworks. The team I joined proudly upholds the XP practices and ensures they are followed as much as possible. With that backdrop, I have been trying to connect the dots between different aspects of XP, including pair programming, quick feedback, shorter iterations, test-driven development, and so on.

All new joiners at Thoughtworks are given the following book so that they can explore XP in more detail and relate to it when they join an XP team.


After reading this book, I realized multiple things about XP that I want to share with you. I'm still exploring it, but I decided to share whatever I understood from this interesting book.

Extreme Programming is a guideline for effective collaboration and assured quality. It's a methodology for improving not just as an individual but, mainly, as a team.

Extreme Programming is "extreme" because it takes the effectiveness of principles, practices, and values to the extreme. It asks you to do your best and then deal with the consequences; that's extreme. Programmers have to get past "I know better than everyone else, and all I need is to be left alone to be the greatest."

Further in the post, I will use the abbreviation XP for Extreme Programming.

XP is based on the 4Ps:
  • Philosophy of communication, courage, respect, feedback, and simplicity.
  • Principles for translating values into practices.
  • Practices that express the values concretely.
  • People who share the values and practices.
XP is about
  • Estimating your work accurately
  • Creating a rapid feedback loop
  • Delivering quality the first time
XP says: Stay aware. Adapt. Change.

XP includes three major components: Values, Principles, and Practices.

We will talk about each in detail.

Values 

Values are the roots of the things we like and don't like in any situation.

XP embraces five values:
  1. Communication
    • Effective communication between people involved in the project whether internal or external. 
    • Choosing the right medium for communication. Creating lengthy unmanageable documents for everything is not effective all the time, quick team meetings and one on one discussions may suffice in many cases.
    • Ensuring inclusivity and respect while communicating for better outcomes.
  2. Simplicity
    • Simplicity needs to be planned while complexity is accidental. 
    • Many engineers fall into the rabbit hole of making things complex simply to show off or to appear expert. When it comes to software, keeping things simple rewards the team and organization very well in the long run.
    • Thinking in a direction to avoid wasted complexity as much as possible.
  3. Feedback
    • Keep sharing and receiving feedback.
    • Feedback about your day-to-day work, involvement in the team, and overall effectiveness as a team member.
    • Giving genuine feedback is equally important, be honest and respectful.
  4. Courage
    • Choose to do the right thing even when you feel afraid.
    • The team should create an environment to ensure everyone feels safe to speak on any aspect of work during meetings.
  5. Respect
    • Everyone on the team is important for the success of the project.
    • I am an important member, and so are you. Always remember that.
    • Ensure respect in difficult situations by focusing on the problem rather than the person.


Principles 

The principles are a bridge between practices and values. 

XP embraces principles such as:
  • Humanity 
    • Create an environment where team members feel safe to speak up.
    • Each member feels included.
    • When it comes to accomplishment, focus on the team's accomplishments rather than individual ones.
  • Economics 
    • Software development is even more valuable when it actually earns money.
    • Focus on quality, but keeping an eye on generating value economically helps the team and organization grow.
  • Mutual Benefit 
    • Values and practices must benefit the team both now and in the future.
    • The benefits of XP should not be limited to the team but should be extended to the customer.
  • Self-Similarity
    • Avoid reinventing the wheel; see whether your existing setup or structure can be replicated for some other use.
    • The more it gets replicated, the more feedback you will get for improving it in the long run.
  • Improvement 
    • Perfection is the enemy of good.
    • XP focuses on delivering important things in smaller iterations. 
    • Rather than delivering everything perfectly after months or years, XP recommends delivering iteratively for better feedback and less overhead.
  • Diversity
    • Work together as a team on a given problem, ensuring the opinions and thoughts of all team members are taken into account.
    • The team should comprise all required skillsets.
  • Reflection 
    • After every iteration, reflect and find what went well and what can be improved.
    • Appreciate good work and discuss areas of improvement.
  • Flow 
    • Create a flow of deploying smaller increments of value frequently with the best quality.
  • Opportunity 
    • Problems are opportunities; defects or failures can be treated as opportunities to find gaps and correct them.
    • Ensure the root cause of the problem is fixed. Fix the defect and make sure it never resurfaces.
  • Redundancy 
    • Do not remove redundancy that serves a valid purpose. 
    • Some of the practices may sound redundant but keeping them running would still help. For example, pair programming and continuous integration give early feedback and help to avoid silly errors, which don't reduce the importance of any of the practices individually.
  • Failure 
    • When you don't know what to do, risk failure.
    • Say you are stuck choosing between two possible implementation approaches and can't figure out which one is more suitable. Go ahead and try both.
    • Have the courage to try and fail rather than play safe.
  • Quality 
    • You won't be able to figure out the best possible way every time.
    • Do whatever best you can to deliver quality. 
    • Keep looking for better and embrace change.
  • Baby Steps
    • Shorter cycles are like baby steps, which help to quickly correct and get early feedback.
    • Small steps ensure minimum overhead and faster rollback if needed.
  • Accepted Responsibility
    • XP focuses on responsibility being accepted rather than enforced on any team member.

Practices

Things you objectively do day to day. Practices are situation-dependent: you may need to adopt new practices based on the situation. Values, on the other hand, do not have to change the way practices do.

XP introduces the following primary practices:
  • Sit Together 
    • The more face time the more humane and productive the project.
    • Sitting together embraces belonging and effective collaboration.
  • Whole Team 
    • Include people with all the skills and perspectives necessary.
  • Informative Workspace 
    • Your workspace should reflect your work.
    • Someone walking into your team's space should get clarity about what's going on by looking at cards on the wall.
  • Energized work 
    • Work only as many hours as you can be genuinely productive.
    • Do not work more to complete more; work effectively to accomplish what is necessary for the given day.
    • More work doesn't mean great work.
  • Pair Programming 
    • Collaborate in pairs for effective efforts and better quality.
    • Be open-minded while pairing, listen more. 
    • Learn to get the opinions and perspectives of your pair attentively. 
  • Stories 
    • Divide major releases into smaller releases.
    • Divide each release into smaller stories.
    • If required divide stories into smaller tasks.
    • Ensure you pick the most important items after discussing them with the customer.
    • Early estimation of stories helps plan the iteration well.
  • Weekly cycle 
    • Start with writing automated system tests and work during the week to make them pass.
  • Quarterly cycle 
    • Identify bottlenecks and find a theme (or themes) for the quarter.
    • Reflect and plan with the big picture in mind.
    • This activity is mostly driven by the project manager.
  • Slack 
    • A few met commitments go a long way toward rebuilding relationships.
    • Deliver what you can with assured quality and value rather than delivering any number of things with compromised quality and bugs.
  • Ten-Minute Build 
    • The automated build must finish within 10 minutes for rapid feedback.
  • Continuous integration 
    • Continuously integrating your changes with automated builds.
  • Test First Programming (TDD) 
    • Write system-level tests first for better quality and implementation.
    • Your tests should depict the story and requirements. Making them pass should ensure completion.
  • Incremental Design 
    • Invest in the design of the system every day, and keep improving.
    • Design decisions are subjective and depend on the nature of the project as well.
These are the primary practices. There are also corollary practices that may or may not be applicable in all cases but still add a lot of value to your software development. We may talk about them in a separate blog post.
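As a minimal illustration of the test-first rhythm described above, here is a sketch that uses a plain runtime check instead of a test framework; the Calculator class and its names are my own assumptions, not from the post:

```java
// Step 1: write the "test" first; it describes the requirement and fails
//         until the implementation exists.
// Step 2: write just enough code to make it pass.
class Calculator {
    int add(int a, int b) {
        return a + b; // the minimal implementation that makes the check pass
    }
}

public class TestFirstSketch {
    public static void main(String[] args) {
        Calculator calc = new Calculator();
        // The requirement, expressed as an executable check:
        if (calc.add(2, 3) != 5) {
            throw new AssertionError("2 + 3 should be 5");
        }
        System.out.println("test passed");
    }
}
```

In real projects this check would live in a test framework such as JUnit, but the rhythm is the same: the failing test comes first, the code follows.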

An XP iteration may look like the following:
  • Identify and estimate high-priority items to be delivered in this iteration.
  • Convert the requirements into stories
  • Effectively estimate the stories
  • Start by writing failing tests that are expected to pass by the end of the iteration.
  • Work in pairs to make the failing tests pass.
  • Continuously integrate and get feedback
  • Reflect after each iteration
This is just the beginning of XP; the more your team practices it, the more effective and fruitful it will be in the long run.

Remember in XP,

"Make a change, observe the effects; then digest the change, turning it into a solid habit."

Deep vs Shallow Copy

In programming languages like Java, everything revolves around objects. The lifecycle of an object, from creation to garbage collection, is something that keeps happening continuously in any real-world application.

There are different ways to create an object of a class. However, object creation with the new keyword is the most straightforward one.

Car mercedes = new Mercedes(); // assuming Mercedes extends Car

While we keep creating objects on demand, in some real-world scenarios we may need to create a new object that is a copy of an existing object. In that case, the new object should hold the same state as the existing object.

You may ask: what kind of scenarios require copying an existing object?

Well, it completely depends on the software you are developing, but any application where you see a copy option, whether it is copying a table row, copying a form, etc., is a good candidate for an object-copying mechanism.

There are two approaches you can use to copy an object:

  • Shallow copy
  • Deep copy

There are different methods to implement either copying approach:

  • Using clone method
  • Copy constructor
  • Static factory method

There are also certain third-party libraries that provide methods to copy objects.

For example, Apache Commons BeanUtils#copyProperties(Object destination, Object source)

Which option to choose largely depends on your requirements.
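For completeness, here is a sketch of the clone-method and static-factory approaches from the list above; the Car class here is illustrative (a single model field), not the one used later in the post:

```java
// Demo of two copying approaches: clone() and a static factory method.
public class CopyApproachesSketch {
    public static void main(String[] args) {
        Car a = new Car("A4");
        Car viaClone = a.clone();
        Car viaFactory = Car.copyOf(a);
        System.out.println(viaClone != a && viaClone.getModel().equals("A4"));     // true
        System.out.println(viaFactory != a && viaFactory.getModel().equals("A4")); // true
    }
}

class Car implements Cloneable {
    private final String model;

    Car(String model) { this.model = model; }

    String getModel() { return model; }

    // Approach 1: clone() performs a field-by-field (shallow) copy.
    @Override
    public Car clone() {
        try {
            return (Car) super.clone();
        } catch (CloneNotSupportedException e) {
            throw new AssertionError(e); // unreachable: Car implements Cloneable
        }
    }

    // Approach 2: a static factory method that returns a copy.
    static Car copyOf(Car other) {
        return new Car(other.model);
    }
}
```

Note that super.clone() copies fields one by one, so for classes with object references it yields a shallow copy unless you copy those references yourself.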

Shallow copy

All the fields of the existing object are copied to the new object.

Consider the following diagram: while copying the Car object, the Company object reference is reused in the copy, which means a shallow copy only copies values (primitive fields and object references).

Basically, it doesn't create copies of the objects referenced inside the object we want to copy; that is why it is called a shallow copy.

[Diagram: shallow copy - the copied Car reuses the original Company reference]
Following is a shallow copy example using the copy constructor approach:

class Car {

    private String model;

    private Company company;

    public Car(String model, Company company) {
        this.model = model;
        this.company = company;
    }

    // Copy constructor: a shallow copy, the Company reference is shared
    public Car(Car carToCopyFrom) {
        this(carToCopyFrom.model, carToCopyFrom.company);
    }

}
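To see the consequence of a shallow copy in action, here is a self-contained sketch; the Company class (a simple mutable holder with a name) is my assumption, since it isn't shown in the post:

```java
// Demonstrates that a shallow copy shares the referenced Company object.
public class ShallowCopyDemo {
    public static void main(String[] args) {
        Car original = new Car("C-Class", new Company("Mercedes-Benz"));
        Car copy = new Car(original);

        // Both cars point at the SAME Company instance...
        System.out.println(original.getCompany() == copy.getCompany()); // true

        // ...so mutating it through one car is visible through the other.
        original.getCompany().setName("Daimler");
        System.out.println(copy.getCompany().getName()); // Daimler
    }
}

// Assumed helper class: a simple mutable holder with a name.
class Company {
    private String name;
    Company(String name) { this.name = name; }
    String getName() { return name; }
    void setName(String name) { this.name = name; }
}

class Car {
    private String model;
    private Company company;

    Car(String model, Company company) {
        this.model = model;
        this.company = company;
    }

    // Shallow copy: the Company reference is reused, not duplicated.
    Car(Car carToCopyFrom) {
        this(carToCopyFrom.model, carToCopyFrom.company);
    }

    Company getCompany() { return company; }
}
```

This sharing is fine when the referenced object is immutable, but for mutable state it can lead to surprising action-at-a-distance bugs.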

Deep copy

All the fields, along with the objects referenced by the existing object, are copied to the new object.

Consider the following diagram: the Company object is also copied, and a reference to this new object is used in the copied Car. Note that all referenced objects at any level (direct or indirect) are copied and referenced in the copy.

[Diagram: deep copy - the copied Car references its own new Company object]
Following is a deep copy example using the copy constructor approach:

class Car {

    private String model;

    private Company company;

    public Car(String model, Company company) {
        this.model = model;
        this.company = company;
    }

    // Copy constructor: a deep copy, a new Company is created for the copy
    public Car(Car carToCopyFrom) {
        this(carToCopyFrom.model, new Company(carToCopyFrom.company.getName()));
    }
}
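And the mirror-image sketch for a deep copy, again assuming a simple Company class with a name, shows the two objects are now fully independent:

```java
// Demonstrates that a deep copy creates its own Company object.
public class DeepCopyDemo {
    public static void main(String[] args) {
        Car original = new Car("C-Class", new Company("Mercedes-Benz"));
        Car copy = new Car(original);

        // The copy has its OWN Company instance...
        System.out.println(original.getCompany() == copy.getCompany()); // false

        // ...so mutating the original's Company does not affect the copy.
        original.getCompany().setName("Daimler");
        System.out.println(copy.getCompany().getName()); // Mercedes-Benz
    }
}

// Assumed helper class: a simple mutable holder with a name.
class Company {
    private String name;
    Company(String name) { this.name = name; }
    String getName() { return name; }
    void setName(String name) { this.name = name; }
}

class Car {
    private String model;
    private Company company;

    Car(String model, Company company) {
        this.model = model;
        this.company = company;
    }

    // Deep copy: a brand-new Company is created for the copy.
    Car(Car carToCopyFrom) {
        this(carToCopyFrom.model, new Company(carToCopyFrom.company.getName()));
    }

    Company getCompany() { return company; }
}
```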

BigQuery Pricing and Cost Optimization

In today’s world of information, data analytics is an integral part of every business decision. While analytics is such an important factor in business, the cost of analytics tools and technologies is equally important to ensure a high return on investment and minimum waste of resources.

BigQuery is one of the leading data analytics technologies for organizations of all sizes. BigQuery not only helps analyze data but also helps organizations with real-time decisions, reporting, and future predictions.

Architecture

BigQuery is a completely serverless enterprise data warehouse. In BigQuery, storage and compute are decoupled so that both can scale independently on demand.

Such an architecture offers flexibility to customers: they don't need to keep compute resources up and running all the time, and they no longer need to worry about system engineering and database operations while using BigQuery.

BigQuery has distributed and replicated storage along with a high-availability cluster for computation. We don't need to provision VM instances to use BigQuery; it automatically allocates computing resources as needed.

  • Table types

    • Standard

      • Structured data and well-defined schema

    • Clones

      • Writable copies of a standard table.

      • Lightweight, as BigQuery stores only the delta between the clone and its base table.

    • Snapshots

      • Point-in-time copies of the table.

      • BigQuery only stores the delta between a table snapshot and its base table.

    • Materialized views

      • Precomputed views that periodically cache the results of the view query.

    • External

      • Only the table metadata is kept in BigQuery storage.

      • The table definition points to an external data store, such as Cloud Storage.

  • Dataset

    • BigQuery organizes tables and other resources into logical containers called datasets.

  • Key features

    • Managed

      • You don’t need to provision storage resources or reserve units of storage.

      • Pay for the storage you use.

    • Durable

      • 99.999999999% (11 9's) annual durability

    • Encrypted

      • Automatically encrypt before writing it to disk

      • Custom encryption is also possible.

    • Efficient

      • An efficient encoding format optimized for analytic workloads.

    • Compressed

      • Proprietary columnar compression, automatic data sorting, clustering, and compaction.

Pricing Models

Analysis Pricing

The cost of processing queries (SQL, user-defined functions, DML, DDL, BQML).

  • On-Demand: pay per byte processed. $5 per TB. 2,000 concurrent slots, shared among all queries of the project.
  • Standard: pay per slot hour. $40 for 100 slots per hour. You can put a cap on spending.
  • Enterprise: pay per slot hour. $40 for 100 slots per hour (1- to 3-year commitment discounts). Standard, plus fixed-cost setup.
  • Enterprise Plus: pay per slot hour. $40 for 100 slots per hour (1- to 3-year commitment discounts). Enterprise, plus multi-region redundancy and higher compliance.
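To make the on-demand arithmetic concrete, here is a tiny sketch; the rate constant simply mirrors the $5-per-TB figure listed above, and the class and method names are my own:

```java
public class OnDemandCostSketch {
    // Assumption: the on-demand rate of $5 per TB scanned, as listed above.
    static final double DOLLARS_PER_TB = 5.0;

    // Cost of a single query, given the terabytes it scans.
    static double queryCostDollars(double tbScanned) {
        return tbScanned * DOLLARS_PER_TB;
    }

    public static void main(String[] args) {
        // A query scanning 2.5 TB costs 2.5 * $5 = $12.50 on demand.
        System.out.println(queryCostDollars(2.5)); // prints 12.5
    }
}
```

The key point: on-demand cost scales with bytes scanned, not with how long the query runs, which is why pruning the data a query reads is the main cost lever.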

Storage Pricing

Cost to store the data you load.


  • Logical (uncompressed): starting at $0.02 per GB for active storage (modified in the last 90 days); starting at $0.01 per GB for long-term storage (not modified in the last 90 consecutive days).
  • Physical (compressed): starting at $0.04 per GB for active storage; starting at $0.02 per GB for long-term storage.

Data Ingestion and Extraction Pricing
For analytics, data needs to be ingested into the data platform and may need to be extracted, too.
  • Data ingestion
    • Batch loading is free when using a shared slot pool.
    • Streaming inserts are charged per row successfully inserted, with 1 KB as the minimum row size ($0.01 per 200 MB).
    • The BigQuery Storage Write API costs $0.025 per GB; the first 2 TB per month are free.
  • Data extraction
    • Batch export of table data to Cloud Storage is free when using a shared slot pool.
    • Streaming reads use the Storage Read API, starting at $1.10 per TB.

Ingestion Pricing

Shared Slot Pool

  • By default, you are not charged for batch loading from Cloud Storage or local files using shared pools of slots.
  • No guarantee of the availability of the shared pool or the throughput.
  • For large data, the job may wait as slots become available.
  • If the target BigQuery dataset and Cloud Storage bucket are co-located, network egress while loading is not charged.

Obtain dedicated Capacity

  • If shared slots are not available or your data is large, you have the option to obtain dedicated capacity by assigning jobs to an editions reservation.
  • However, this is not free, and you lose access to the free pool as well.

Modes

  • Single batch operation.
  • Streaming data one record at a time or in small batches.
  • Batch loading: free using the shared slot pool. For guaranteed capacity, choose editions reservations.
  • Storage Write API: $0.025 per GB; the first 2 TB per month are free.
  • Streaming inserts: $0.01 per 200 MB; charged per row inserted, with 1 KB as the minimum row size.


Extraction Pricing

Shared Slot Pool

  • By default, you are not charged for batch exporting, as it uses a shared pool of slots.
  • No guarantee of the availability of the shared pool or the throughput.

Obtain dedicated Capacity

  • If shared slots are not available or your data is large, you have the option to obtain dedicated capacity by assigning jobs to an editions reservation.
  • However, this is not free, and you lose access to the free pool as well.

Storage Read API

  • Charged for the number of bytes read, calculated from the data size based on each column's data type.
  • Charged for any data read in a read session, even if a ReadRows call fails. If a ReadRows call is canceled, you are charged for the data read before the cancellation.
  • On-demand pricing with 300 TB per month for each billing account.
  • Exclusions
    • Bytes scanned from temporary tables are free and do not count toward 300 TB.
    • Associated egress cost is not included.
  • Important: To lower the cost,
    1. Use partitioned and clustered tables.
    2. Reduce the data read with a WHERE clause that prunes partitions.

What is free of cost in BigQuery?

  • Cached queries.
  • Batch loading or export of data.
  • Automatic re-clustering.
  • Delete (table, views, partitions, and datasets)
  • Queries that result in an error.
  • Metadata operations.

*BigQuery free tier offers 10 GB of storage and 1 TB of query processing per month.

Billing Models

Storage
  • Billing Models
    • Logical
      • Charged based on logical bytes stored.
      • Default billing model.
    • Physical
      • Charged based on actual bytes stored.
      • If your data compresses well, physical storage can save you a good amount of storage and associated cost.
  • How is data stored?
    • Data is stored in a compressed, columnar format.
    • When you run a query, the query engine distributes the work in parallel across multiple workers.
    • Workers scan the relevant tables in storage, process the query, and then gather the results.
    • BigQuery executes queries completely in memory, using a petabit network to move data to the worker nodes extremely fast.
Compute

  • Billing Models:
    1. On-Demand
      • Charged for the number of bytes processed.
      • The first 1 TB is free every month.
    2. Editions Reservation
      • Charged for the number of slot_sec (one slot for one second) used by the query.
      • A slot is a unit of measure for BigQuery compute power.
      • E.g., a query using 100 slots for 10 seconds accrues 1000 slot_sec.

You can mix and match these models to suit different needs.
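The slot_sec accrual described above is a simple product, sketched below; the class and method names are illustrative:

```java
public class SlotSecSketch {
    // slot_sec accrued by a query = slots used * seconds of use.
    static long slotSec(long slots, long seconds) {
        return slots * seconds;
    }

    public static void main(String[] args) {
        // The example from the text: 100 slots for 10 seconds -> 1000 slot_sec.
        System.out.println(slotSec(100, 10)); // prints 1000
    }
}
```

Unlike on-demand pricing, this unit depends on compute time and parallelism rather than on bytes scanned.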

Decision flow for the consumption model.

Custom Quotas

Set a maximum number of bytes that can be processed by a single user on a given billing project.

When the limit is passed, the user gets a 'quota exceeded' error.

Best Practices

Ingestion and extraction

  • Use the free shared slot pool whenever possible; choose dedicated capacity only for large data.
  • The Avro and Parquet file formats provide better load performance.
  • Compressed files take longer to load in BigQuery. To optimize load performance, uncompress your data first.
  • In the case of Avro, however, compressed files load faster than uncompressed ones.
  • Use native connectors to read/write data to BigQuery rather than building custom integrations, for better performance.

Storage Best Practices

  • Use long-term storage (50% cheaper).
  • Use time travel wisely.
    • Recover from mistakes like accidental modification or deletion.
    • Set a default table expiration for transient datasets.
    • Set expirations for partitions and tables.
  • Use snapshots for longer backups.
    • Time travel works for the past 7 days only.
    • Snapshots of a particular time can be stored for as long as you want.
    • They minimize storage cost, as BigQuery stores only the bytes that differ between a snapshot and its base table.
    • Important: there is no initial storage cost for a snapshot, but if the base table changes and the data also exists in a snapshot, you are charged for storage of the changed or deleted data.
  • Use clones for modifiable copies of production data.
    • A lightweight copy of the table.
    • Independent of the base table; any changes made to the base will not be reflected in the clone.
    • The cost for a table clone is the changed data plus new data.
  • Archive data into a new BigQuery table.
  • Move 'cold data' to Google Cloud Storage.

Workload Management

  • Use multiple billing projects.
  • Mix and switch pricing models.
  • Know how many slots you need.
  • Use separate reservations for compute-intensive work.
    • Use baseline slots for critical reservations.
    • Use commitments for sustained usage.
  • Take advantage of slot sharing.
  • Take a dynamic approach to workload management.

Compute cost optimization

  • Follow SQL best practices.
    • Clustering.
    • Partitioning.
    • Select only the columns you need, and curate filtering, ordering, and sharding.
    • Denormalize if needed.
    • Choose the right functions, and pay attention to JavaScript user-defined functions.
    • Choose the right data types.
    • Optimize joins and common table expressions.
    • Look for anti-patterns.
  • Use BI Engine to reduce compute cost.
  • Use materialized views to reduce cost.

Keep a close watch on the cost

  • Budget alerts
  • BigQuery reports
  • BigQuery admin resource charts
  • Looker Studio dashboards

Conclusion

BigQuery offers the flexibility to choose the most suitable options for your storage and computation requirements. Be conscious about opting for the right match, so that you don't end up starving for resources or wasting them.

