Fayez Abu Awad

September 14, 2022

Cloud Cost Optimization: Definition and Strategies

There are numerous advantages to migrating to the cloud. Enterprise-grade infrastructure and services are available to all businesses, not just large corporations with large IT budgets. However, users of cloud providers such as AWS, Google Cloud, and Azure will at some point need to understand how to optimize cloud costs. Not every small business stays small, and thinking big early encourages planning, yields useful insights, and can lead to big wins.

What is the definition of Cloud Cost Optimization?

Cloud cost optimization is about finding ways to run applications, perform work, or provide value to the business in the cloud at the lowest possible cost, and about using cloud providers as cost-effectively as possible. Optimization as a practice ranges from simple business management to complex scientific and engineering fields such as operations research, decision science and analytics, modeling, and forecasting.

Why is Cost Optimization Important?

Every organization attempting to achieve some goal, profit or otherwise, must reduce overhead, or the cost of goods and services produced.

Consider a corporate data center ecosystem and a web app "stack" made up of a web frontend, an application layer, and a database backend. Every application component and communication channel must be sized to meet a high-demand event, such as payday or Black Friday. The web stack could consist of 20 web servers behind a load balancer, 20 application servers behind a second load balancer, and a database cluster. This infrastructure could be replicated in a geographically separate data center in an active-active or active-standby configuration.

By quickly replicating the existing on-premises application to the cloud without any redesign for cloud computing capabilities, one ends up paying for servers that spend the majority or all of their time idle. A data center web app shares communication infrastructure with other apps, such as leased lines, routers, switches, load balancers, and firewalls, and this shared infrastructure is ultimately sized to handle the highest traffic expected, with some upper limit of available bandwidth and latency constrained by cost.

In a data center, storage and backup systems, power and cooling, and physical space for all IT are likewise planned around future user and workload growth. A simple "lift and shift" to the cloud is therefore more expensive than necessary, and once the dust settles, the benefits of cloud cost optimization become clear, because minimizing overhead is a fundamental business process.

A new company may begin by designing better cloud systems with the pay-as-you-go cost model in mind, but changes, entropy, and a lack of cost awareness will create opportunities for cost optimization.

Early Days of Cloud Cost Optimization

As hundreds of data center-filling applications were moved to the cloud, our software engineers realized that infrastructure as code could be managed automatically. AWS's Trusted Advisor tool provides cost-cutting recommendations. Although simple scripts can be written to manage cloud resources as needed, managed automation allowed us to report and act on cloud resources at scale. Unused resource cleanup was always a priority, and there was some speculation about why cost controls were required and how they could be implemented. Because most development environments use resources only during the day, we automated the ability to shut down resources in bulk at the end of the business day and restart them the following morning.

By reporting resource properties, usage, and CloudWatch metrics, we can search for and reduce or eliminate inefficiency and waste. After reviewing the pricing models, we discovered that small savings realized across hundreds or thousands of cloud resources could add up to thousands or millions of dollars over the course of a year.
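To make the "small savings add up" point concrete, here is a rough back-of-the-envelope sketch of the off-hours shutdown described above. The 12-hour weekday schedule, the fleet size, and the hourly rate are all assumptions chosen for illustration, not figures from our program.

```python
HOURS_PER_WEEK = 7 * 24  # 168

def weekly_running_fraction(hours_per_day: float, days_per_week: int) -> float:
    """Fraction of the week a resource runs under a business-hours schedule."""
    return (hours_per_day * days_per_week) / HOURS_PER_WEEK

def annual_savings(hourly_rate: float, resource_count: int,
                   hours_per_day: float = 12, days_per_week: int = 5) -> float:
    """Dollars saved per year versus running every resource 24x7."""
    off_fraction = 1 - weekly_running_fraction(hours_per_day, days_per_week)
    hours_per_year = 24 * 365
    return hourly_rate * hours_per_year * off_fraction * resource_count

# Hypothetical fleet: 500 dev instances at an assumed $0.10 per hour.
print(f"{1 - weekly_running_fraction(12, 5):.1%} of hours avoided")
print(f"${annual_savings(0.10, 500):,.0f} saved per year")
```

Under these assumptions, roughly two thirds of the billable hours disappear, which is why a schedule that looks trivial per instance becomes significant across a fleet.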

Strategies for Reducing Cloud Costs

Here are some ideas for creating a cloud cost management strategy or program. To achieve maximum and long-term results, these strategies should be considered on an ongoing basis.

What should I do?

Design for the cloud
Refinement of operations
Reservations for capacity and volume discounts

Cloud Cost Management: How to Implement It

Management, organization, communication, and education
Plan and prepare to track expenditures by cost center.
Review of Billing and Pricing
Software and Automation Requirements

Design for the Cloud

Create more cost-effective systems to replace existing ones. The cloud native concept is to take advantage of any cost advantage that can be gained by leveraging cloud-specific capabilities. Auto-scaling is an example of this. It is unheard of for a traditional load-balanced server pool to be charged only for the servers that are active. Every server purchased for the pool is paid for in advance and on an ongoing basis; this includes server hardware as well as data center space, power, and connectivity. Being billed only for the servers that are actively running in the pool is a significant cloud native advantage. Cloud auto-scaling ensures that the capacity paid for is not significantly greater than the capacity used.
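The difference between a pool sized for the peak and an auto-scaled pool can be illustrated with a small sketch. The load profile, the per-server capacity, and the minimum pool size below are hypothetical, and the function names are ours, not part of any cloud provider's API.

```python
import math

def servers_needed(requests_per_sec: float, capacity_per_server: float,
                   min_servers: int = 2) -> int:
    """Smallest pool that covers the load, never below a safety minimum."""
    return max(min_servers, math.ceil(requests_per_sec / capacity_per_server))

def server_hours(hourly_load: list[float], capacity_per_server: float) -> tuple[int, int]:
    """Return (fixed_pool_hours, autoscaled_hours) for a series of hourly loads."""
    per_hour = [servers_needed(load, capacity_per_server) for load in hourly_load]
    fixed = max(per_hour) * len(hourly_load)   # pool sized for the peak, always on
    scaled = sum(per_hour)                     # pay only for what each hour needs
    return fixed, scaled

# Hypothetical day: quiet overnight, busy midday peak (requests per second).
load = [50] * 8 + [400] * 4 + [2000] * 4 + [400] * 8
fixed, scaled = server_hours(load, capacity_per_server=100)
print(fixed, scaled)  # 480 144
```

In this made-up day the fixed pool is billed for 480 server-hours while the scaled pool is billed for 144, because the peak lasts only four hours.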

The AWS Well-Architected Tool makes recommendations based on cloud architectural best practices. AWS also offers numerous architectural examples in the form of whitepapers and documentation, as well as experts who can provide thoughtful system design advice. Leveraging cloud provider resources like these is an excellent strategy for optimizing cloud native design and lowering cloud costs.

Cloud native design is an excellent way to reduce costs, but it requires knowledge and experience. Existing cloud infrastructure designs, like open-source software, provide guidance. Rather than being radically innovative and unique, most cloud infrastructure designs are variations on existing designs.

In general, seek clarification on functional vs. non-functional requirements; performance is not always the top priority! Cloud-based DevOps enables faster delivery and innovation, but not necessarily cost savings. Engineers have cost control in the cloud. Optimizing solely for cost compromises performance and/or quality, and lowest cost is rarely the primary goal for a new product or service. Cloud native design is a component of designing and evaluating designs from a cost standpoint.

Operational Refinement

The other approach is to maximize the cost efficiency of existing systems without making design changes, for example by cleaning up unused resources and rightsizing. Rightsizing analysis compares resource usage to capacity to determine whether you are overpaying for unused capacity or capabilities. This is typically a service-specific study in which one or more specific qualities of the service are considered. For example, one might ask why every EC2 instance in the organization has a 50 GB boot volume when only a small percentage of them need that much space. Cleanups and rightsizing at the enterprise level should use automatic remediation whenever possible. Reporting is extremely valuable because it gives these no-regrets cost-saving opportunities high visibility.
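A rightsizing report of the kind described above can start as a simple usage-versus-capacity comparison. The sketch below assumes the provisioned and used sizes have already been collected from monitoring; the `Volume` record and the 50% threshold are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Volume:
    volume_id: str
    provisioned_gb: float
    used_gb: float

def oversized(volumes: list[Volume], max_utilization: float = 0.5) -> list[Volume]:
    """Volumes whose used space is below the utilization threshold."""
    return [v for v in volumes if v.used_gb / v.provisioned_gb < max_utilization]

# Made-up inventory of 50 GB boot volumes.
fleet = [
    Volume("vol-a", 50, 8),
    Volume("vol-b", 50, 41),
    Volume("vol-c", 50, 12),
]
for v in oversized(fleet):
    print(v.volume_id, f"{v.used_gb / v.provisioned_gb:.0%} used")
```

The same pattern applies to other services: pick the property being paid for, pick the metric that shows how much of it is used, and report the gap.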

Volume Discounts and Capacity Reservation

Reserved instances and on-demand capacity reservation feel similar to returning to the corporate datacenter, where capacity forecasting and budgeting are required, but the key difference is that the forecasting is primarily for "baseline" utilization, as the goal is to pay upfront for capacity that will definitely be used. Service providers provide significant discounts for capacity paid for in advance, resulting in even better cost optimization. Furthermore, cloud service providers may offer volume discounts to larger customers, so knowing your options is important.
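The reasoning behind reserving only the baseline can be checked with a little arithmetic. In this sketch the demand profile and both hourly rates are invented; in particular, the 40% discount implied by the reserved rate is an assumption, not a quoted price.

```python
def blended_cost(hourly_demand: list[int], reserved: int,
                 on_demand_rate: float, reserved_rate: float) -> float:
    """Cost of paying for `reserved` instances every hour plus on-demand overflow."""
    reserved_cost = reserved * reserved_rate * len(hourly_demand)
    overflow_cost = sum(max(0, d - reserved) for d in hourly_demand) * on_demand_rate
    return reserved_cost + overflow_cost

# Hypothetical demand: baseline of 10 instances with a short peak of 30.
demand = [10] * 20 + [30] * 4
for reserved in (0, 10, 30):
    cost = blended_cost(demand, reserved, on_demand_rate=1.0, reserved_rate=0.6)
    print(f"reserve {reserved:>2}: {cost:.0f}")
```

With these numbers, reserving the baseline of 10 is cheapest. Reserving nothing forgoes the discount, and reserving for the peak pays for capacity that sits idle most of the day.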

Management, organization, communication, and education

Ultimately, people decide whether or not to optimize cloud costs. Create a program to review, monitor, and control cloud costs, empowering technical, financial, and managerial team members across all business lines to collaborate, share accountability, and champion the cause. A cost-cutting program, for example, could establish a consultancy that holds seminars and trains the engineering community on critical topics. For example, in our program, engineering teams were brought into a training and working session for one or more days, giving them the opportunity to learn about and implement a variety of optimizations, from simple to complex. A training course was also created to raise cloud cost awareness throughout the organization, and it was even integrated into the newcomer onboarding process.

Review of Billing and Pricing

Fortunately, cloud provider billing provides specific details about what is being paid for. The road map to cost savings is a high-level breakdown or itemization of costs. The most money will most likely be spent on compute, storage, and value-added managed services like RDS. Prioritize the highest-spending services for a thorough examination. AWS EC2 (compute), for example, is frequently the highest-spend category on the bill. Prioritize cost reduction for the teams that spend the most. Perhaps the savings realized by the highest spenders exceed the budget of the lowest spenders. Understanding the pricing of everything the cloud vendor offers in great detail is extremely beneficial because it allows for better decisions about what should be purchased or avoided. In AWS, for example, decreasing one instance size within the same class of instance, such as m5.2xlarge to m5.xlarge, reduces the rate by 50%.
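Prioritizing by spend is a group-and-sort over billing line items. The sketch below assumes a simplified, hypothetical line-item shape with `service`, `team`, and `cost` fields; real billing exports carry many more columns.

```python
from collections import defaultdict

def spend_by(line_items: list[dict], key: str) -> list[tuple[str, float]]:
    """Total cost per value of `key`, highest spend first."""
    totals: dict[str, float] = defaultdict(float)
    for item in line_items:
        totals[item[key]] += item["cost"]
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical line items for illustration only.
bill = [
    {"service": "EC2", "team": "payments", "cost": 4200.0},
    {"service": "RDS", "team": "payments", "cost": 1800.0},
    {"service": "S3", "team": "search", "cost": 650.0},
    {"service": "EC2", "team": "search", "cost": 2100.0},
]
print(spend_by(bill, "service"))
print(spend_by(bill, "team"))
```

The same function answers both questions in the paragraph above: which services to examine first, and which teams to work with first.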

Plan and prepare to track expenditures by cost center

Teams that are individually responsible for their own cloud budgeting and spending require a method to track it. Each cost center could have its own AWS account, making reporting easier. When a single account has multiple application teams, each with its own budget, there must be a way to tie the costs to the teams in charge. In this case, it is critical to impose a standardized method for determining ownership of cloud resources. When resource naming standards fail, additional properties in the form of resource tags or labels are usually required. Consider resource tags to be the cloud's equivalent of barcode labels. Resource tags are arbitrary key:value pairs that can be added to cloud resources to describe them. For example, a tag such as "Department" or "Cost Center" can be used to describe the owner of any cloud resource. If you have the opportunity, mandate a tagging standard that applies to everything because retroactive tagging is difficult. If a problem component in a production environment lacks ownership, difficult decisions must be made. The minimum viable product for a tagging standard is cost centers aligned to how granular the cost reporting must be, which can range from the individual user to the department.

A configuration management database is common in large enterprises, and a tag identifying that a cloud resource belongs to an app that has been permanently shut down is very useful. Metadata about cloud resources can also be used to determine the intensity of cost-cutting efforts, with the most critical resources receiving the most leeway for underutilization and idleness. Finally, an "Owner Contact" tag for each resource is useful when cost centers are large and a resource discussion is required.
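A tagging standard is only useful if it is checked. As a minimal sketch, assuming resources have already been listed into plain dictionaries, a compliance report could look like this. The tag keys `CostCenter` and `OwnerContact` are a hypothetical standard based on the examples above.

```python
REQUIRED_TAGS = ("CostCenter", "OwnerContact")  # hypothetical tagging standard

def missing_tags(resource: dict) -> list[str]:
    """Required tag keys that are absent or empty on a resource."""
    tags = resource.get("tags", {})
    return [key for key in REQUIRED_TAGS if not tags.get(key)]

# Made-up inventory; IDs and values are placeholders.
resources = [
    {"id": "i-001", "tags": {"CostCenter": "1234", "OwnerContact": "team-a@example.com"}},
    {"id": "i-002", "tags": {"CostCenter": "1234"}},
    {"id": "vol-003", "tags": {}},
]
untagged = {r["id"]: missing_tags(r) for r in resources if missing_tags(r)}
print(untagged)
```

Running a report like this from day one is far cheaper than retroactive tagging, since each gap can be sent to whoever just created the resource.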

Software and Automation Requirements

Financial

Understanding trends and progress with cloud cost optimization requires graphing cloud spending over time by expense type and cost center. Data and graphs made available to everyone in the organization on a daily, monthly, and yearly basis promote healthy transparency and competition.

Monitoring

For effective optimization, monitoring metrics for all aspects of cloud resource utilization vs. capacity are required. Capital One created a tool for right-sizing EC2 instances based on history for the "four corners of utilization metrics": CPU, memory, disk, and network. The emphasis is on recent history, but utilization peaks are captured and considered when recommending an instance type. Alerting for unusual cost spikes or thresholds exceeded is extremely useful.
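Alerting on unusual cost spikes does not need to be sophisticated to be useful. The sketch below compares each day's cost with a trailing average; the 7-day window and the 1.5x threshold are arbitrary assumptions, and this is not the Capital One tool mentioned above.

```python
from statistics import mean

def cost_spikes(daily_costs: list[float], window: int = 7,
                threshold: float = 1.5) -> list[int]:
    """Indexes of days whose cost exceeds threshold x the trailing-window mean."""
    spikes = []
    for day in range(window, len(daily_costs)):
        baseline = mean(daily_costs[day - window:day])
        if daily_costs[day] > threshold * baseline:
            spikes.append(day)
    return spikes

# Hypothetical daily spend in dollars; day 7 is the anomaly.
costs = [100, 102, 98, 101, 99, 100, 103, 240, 104]
print(cost_spikes(costs))  # [7]
```

A trailing mean is easy to explain to budget owners, although it will lag behind legitimate step changes in spending such as a new product launch.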

Administration, reporting, and cleanup

Make certain that everything in your cloud is being used and not going unnoticed or forgotten! Examining the overall cloud fleet composition yields insights for capacity reservations as well as potential opportunities for cost reduction. For example, knowing that more than half of your EC2 fleet is m4.4xlarge suggests a cost savings of up to 75%, depending on the reservation term and payment option, if those instances are reserved for the coming year. It also raises the question, "Is the m4.4xlarge being used because it is the correct size for the workload, or is some widely copied infrastructure template spitting out m4.4xlarge because it only uses one size?" Because the EMR service has minimum instance size restrictions, some larger instance types will be required in some cases.
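Fleet composition is a simple count once the instance inventory has been exported. The sketch below assumes the inventory is a plain list of instance type names; the fleet itself is invented.

```python
from collections import Counter

def fleet_share(instance_types: list[str]) -> list[tuple[str, float]]:
    """Instance types with their share of the fleet, most common first."""
    counts = Counter(instance_types)
    total = len(instance_types)
    return [(itype, n / total) for itype, n in counts.most_common()]

# Hypothetical ten-instance fleet.
fleet = ["m4.4xlarge"] * 6 + ["m5.xlarge"] * 3 + ["t3.medium"]
for itype, share in fleet_share(fleet):
    print(f"{itype}: {share:.0%}")
```

A single type dominating the list is the cue to look at both reservation coverage and whether a shared template is choosing the size by default.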

Automation allows for the implementation of cost-saving "levers," or changes that have no impact on infrastructure design. The examples below are specific to AWS, but the principles should apply to other cloud providers as well.

Off-hour Shutdown for ASG, EC2, RDS
ASG Dimmer - reduce ASG size during non-working hours in nonprod
EC2 Fleet Upgrade - push to latest generation instances. Promote cost-efficient platforms like AMD, ARM
S3 - Educate teams to understand their specific S3 data access requirements, and to design bucket lifecycles to expire objects or move them to less-expensive access tiers
Idle resources - Study a resource and decide what properties, events, and metrics constitute an unused resource. Create automation to find and remove idle resources.
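As one illustration of such a lever, the selection logic for an off-hour shutdown might look like the sketch below. The 07:00 to 19:00 working window, the `env` and `state` fields, and the resource records are assumptions; the actual stop call to the cloud provider is deliberately left out.

```python
from datetime import datetime

def is_off_hours(now: datetime, start_hour: int = 7, end_hour: int = 19) -> bool:
    """True on weekends and outside start_hour..end_hour on weekdays."""
    if now.weekday() >= 5:          # Saturday = 5, Sunday = 6
        return True
    return not (start_hour <= now.hour < end_hour)

def to_stop(resources: list[dict], now: datetime) -> list[str]:
    """IDs of running non-production resources that should be stopped now."""
    if not is_off_hours(now):
        return []
    return [r["id"] for r in resources
            if r["env"] != "prod" and r["state"] == "running"]

# Hypothetical inventory records.
resources = [
    {"id": "i-dev-1", "env": "dev", "state": "running"},
    {"id": "i-prod-1", "env": "prod", "state": "running"},
    {"id": "i-qa-1", "env": "qa", "state": "stopped"},
]
print(to_stop(resources, datetime(2022, 9, 14, 22, 0)))  # Wednesday evening
print(to_stop(resources, datetime(2022, 9, 14, 10, 0)))  # Wednesday morning
```

Keeping the decision separate from the action makes the lever easy to dry-run and report on before anything is actually shut down.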

Recognize that two types of automation are beneficial:

Events trigger actions. AWS Lambdas can execute a process in response to an API event
Batch operations. Scheduled or ad hoc.

Cloud architects, engineers, and developers who are well-trained, if not experienced, can have a direct impact on costs. A company that is transitioning to cloud technology cannot assume that the learn-as-you-go approach for its employees will result in robust and cost-effective solutions. Cost awareness and cloud native design principles are introduced in cloud provider training. Also, keep an eye out for new information from cloud providers. New services, whitepapers, and best practices for cloud cost management are constantly changing the landscape. Professional certifications, such as AWS Solutions Architect, are a wise investment for the entire team. The "Cloud Native" philosophy entails applying a broad and deep understanding of cloud provider service and resource products, specifically what functionality is provided and how cloud spending is optimized. Managed service offerings, such as RDS, outperform self-managed solutions by lowering complexity, toil, and thus labor costs.

Cloud Cost Optimization

When it comes to cloud cost optimization, there are many levers to pull and a lot of data to consider, so keep in mind that it is a series of processes that must be managed over time.

Observing the cloud and its associated spending for an organization allows one to get a sense of the situation and prioritize actions. Remove any unused or forgotten resources. Conduct rightsizing reviews to determine whether the appropriate amount of capacity is being paid for. Examine the tradeoffs between spending, performance, reliability, redundancy, and spare capacity when it comes to basic services like compute and storage. Consult with cloud vendors for recommendations, capacity reservations, and volume discounts.

Cost efficiency should be considered when designing and engineering cloud applications using the cloud-native philosophy. Bring together and use the perspectives and strengths of management, finance, analytics, and engineering to achieve a common goal of cost efficiency.
