Slash Your Cloud Data Costs: 11 Proven SQL Optimization Techniques

Fri Oct 17 2025
Technology: SQL
Topic: Analytics Engineering, Data Platforms
Want to reduce your cloud data warehouse costs by 20–40%? This guide walks you through 11 practical SQL query optimizations that apply across platforms like Google BigQuery, Snowflake, Amazon Redshift, Databricks SQL Warehouse and Microsoft Fabric. Whether you're managing dashboards, analytics workloads, or batch pipelines, these tips will help you cut costs fast without rewriting your entire stack.

Cloud data platforms like Microsoft Fabric, Google BigQuery, AWS Redshift, and Snowflake have transformed how companies analyze and manage data. With the promise of scalability and pay-as-you-go pricing, they have enabled businesses to move fast and scale insights globally.

But this flexibility has a catch: you pay for what you use. And often, you’re using (and paying for) far more than you realize.

Every SQL query consumes platform resources (primarily CPU, memory, and storage I/O), and your cloud data platform bills you for that usage. The more data a query scans or processes, the higher the cost, so poorly optimized queries can quietly inflate your bill:

  • Studies show 30% of cloud data warehouse spend is wasted due to inefficient queries.[1]
  • A single poorly written query that runs multiple times a day can cost hundreds or even thousands of euros per month.[2]

👉 Optimizing SQL queries is one of the fastest and most effective levers to reduce costs without compromising insights.

High-Impact, Cloud-Agnostic SQL Tactics for Immediate Cost Reduction

Here are practical techniques to make your SQL cheaper and faster — listed roughly from highest to lowest potential cost savings.

To validate the order of impact, a benchmark was performed on the NYC Taxi Trip Records dataset. Data from July 2024 to December 2024 was combined into a single dataset of roughly 20 million rows and 19 columns, and a series of queries were executed to compare costs before and after optimization. The observed reductions in compute usage and query runtime were then used to rank the techniques below, providing a realistic view of potential cost savings in practice.

1. Select Only the Data You Need

  • The mistake: SELECT * scans every column, even if you only need two.

  • The fix: Always specify columns explicitly, which also makes your query easier to read for anyone reviewing it.

  • Example:
SQL
-- Inefficient: still scans nearly every column in the table
SELECT * EXCEPT(fare_amount) FROM NYC_Taxi_Dataset;

-- Optimized: scans only the three columns that are needed
SELECT VendorID, trip_distance, payment_type
FROM NYC_Taxi_Dataset;
  • Cost saving: 60–90% for wide tables with dozens of columns.

2. Filter Early and Efficiently

  • Apply filters (WHERE) as close to the source as possible.

  • Always filter directly on partitioned fields (like DATE, INT, BOOL) instead of applying functions to them. This way, the database can skip irrelevant partitions (partition pruning), making queries faster and cheaper.

  • Example:
SQL
-- Slower: applying a function means the database must check
-- every row, so it cannot skip partitions efficiently
WHERE YEAR(partition_date) = 2025

-- Faster: allows partition pruning, database scans only 2025
-- partitions
WHERE partition_date BETWEEN '2025-01-01' AND '2025-12-31'
  • Cost saving: 30–70% depending on how much data is filtered out.


3. Optimize Joins and Join Order

  • Join only what’s necessary: aggregate and/or filter data before joining.

  • Prefer INNER JOIN over OUTER JOIN when possible.

  • Use integer keys (INT64) instead of strings — integers are smaller and quicker to compare, which makes queries faster and cheaper.

  • Avoid CROSS JOINs, which can explode the number of rows in the result.

  • Example:
SQL
-- Less efficient
SELECT c.name, SUM(n.amount) AS total_amount
FROM NYC_Taxi_Dataset n
LEFT OUTER JOIN Customers c ON n.customer_id = c.customer_id
WHERE n.tpep_pickup_datetime >= '2024-01-01'
GROUP BY c.name;

-- Optimized
WITH filtered_and_aggregated_data AS (
    SELECT customer_id, SUM(amount) AS total_amount
    FROM NYC_Taxi_Dataset
    WHERE tpep_pickup_datetime >= '2024-01-01'
    GROUP BY customer_id
)

SELECT c.name, f.total_amount
FROM filtered_and_aggregated_data f
INNER JOIN Customers c
    ON f.customer_id = c.customer_id;
  • Cost saving: 20–50% depending on dataset size.

4. Avoid Row-by-Row Processing (UDFs, Loops, Cursors)

  • Scalar user-defined functions (UDFs) often block parallelism and force row-by-row execution.

  • Replace them with set-based operations (e.g. a single UPDATE or INSERT) or window functions.

  • Example: Replace a per-row scalar UDF with one set-based UPDATE:
SQL
-- Less efficient: Scalar UDF executes per row
CREATE OR ALTER FUNCTION dbo.AddAirportFee(@fee DECIMAL(10,2),
    @passengers INT)
RETURNS DECIMAL(10,2)
AS
BEGIN
    IF @passengers = 2
        SET @fee = @fee + 3;
    RETURN @fee;
END;
GO

UPDATE NYC_Taxi_Dataset
SET Airport_fee = dbo.AddAirportFee(Airport_fee,
    passenger_count);

-- Optimized: updates all matching rows in one set-based statement
UPDATE NYC_Taxi_Dataset
SET Airport_fee = Airport_fee + 3
WHERE passenger_count = 2;
  • Cost saving: 20–50% for heavy analytical queries.

5. Table Design: Partitioning and Clustering

  • Partition tables on fields that split the data into evenly sized segments, such as a date column.
    → This ensures each partition contains a manageable amount of data, optimizing query performance and cost. That being said, partitioning is most beneficial when tables are larger than ~100 GB — for smaller datasets, the benefits are minimal and may not justify the extra metadata overhead.

  • Cluster tables on common filter fields (shop, product_group, brand).
    → When queries use these clustered fields as filters, BigQuery scans less data, significantly improving performance — similar to a person opening only the correctly labeled cabinet. However, clustering is most effective (and cost-efficient) for large tables, typically over 1 TB. For smaller tables, the clustering overhead might outweigh the speed gains.

  • Keep table size manageable by deleting data older than 2 years during each load.
    → This reduces storage costs and keeps queries fast and responsive over time.

  • Example: using partitioning and clustering
SQL
CREATE TABLE IF NOT EXISTS NYC_Taxi_Dataset_clustered_partitioned
-- Partition the table by the 'partition_date' column (a DATE).
-- 👍 This splits the data into smaller, date-based segments, so
-- each query scans only the relevant partitions, which reduces
-- cost and improves performance.
PARTITION BY partition_date
-- Cluster the data within each partition by 'tpep_pickup_datetime'.
-- 👍 Queries that filter or sort on this column become much
-- faster because related rows are stored together.
CLUSTER BY tpep_pickup_datetime
AS
SELECT *
FROM NYC_Taxi_Dataset;
  • Cost saving: 30–45% by combining partitioning and clustering/indexing.[1][2]

6. Optimize Counting and Existence Checks

  • COUNT(*) is slower than EXISTS when you only need to know if data exists.

  • Example:
SQL
-- Slower: counts every matching row
SELECT COUNT(*)
FROM NYC_Taxi_Dataset
WHERE VendorID = 2;

-- Faster: stops scanning as soon as one matching row is found
SELECT CASE WHEN EXISTS (
    SELECT 1
    FROM NYC_Taxi_Dataset
    WHERE VendorID = 2 )
THEN CAST(1 AS BIT)
ELSE CAST(0 AS BIT)
END AS vendor_has_trips;
  • Use APPROX_COUNT_DISTINCT() instead of COUNT(DISTINCT …) when exact precision isn’t needed (see the example below).

  • Cost saving: 10–30%.
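A minimal sketch of the approximate count, using the pickup-zone column PULocationID from the same dataset (APPROX_COUNT_DISTINCT is the BigQuery/Snowflake spelling; other platforms have equivalents such as Redshift’s APPROXIMATE COUNT(DISTINCT …)):
SQL
-- Exact, but more expensive on large tables
SELECT COUNT(DISTINCT PULocationID)
FROM NYC_Taxi_Dataset;

-- Approximate, typically much cheaper; fine for dashboards and
-- trend analysis where a small error margin is acceptable
SELECT APPROX_COUNT_DISTINCT(PULocationID)
FROM NYC_Taxi_Dataset;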

7. Use Appropriate Data Types and Keys

  • Use the smallest possible data type:

    • INT instead of BIGINT if values fit.

    • DATE instead of DATETIME if you don’t need time granularity.

  • Join on numeric surrogate keys (e.g. ID INT64) — a surrogate key is an artificially created unique identifier with no business meaning — instead of long text fields (see the sketch after this list).

  • Cost saving: 10–20% (especially in join-heavy queries).
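For a minimal sketch of both points in BigQuery-style SQL (the trips_slim and vendors tables and their columns are illustrative, not part of the benchmark dataset):
SQL
-- Compact types plus an integer surrogate key (illustrative schema)
CREATE TABLE IF NOT EXISTS trips_slim (
    trip_id       INT64,    -- surrogate key: no business meaning, cheap to join on
    vendor_id     INT64,    -- integer key instead of a long vendor-name string
    pickup_date   DATE,     -- DATE instead of DATETIME when time of day isn't needed
    trip_distance FLOAT64,
    payment_type  INT64
);

-- Joining on the integer surrogate key compares far fewer bytes
-- per row than joining on a text field
SELECT v.vendor_name, COUNT(*) AS trips
FROM trips_slim t
INNER JOIN vendors v ON t.vendor_id = v.vendor_id
GROUP BY v.vendor_name;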

8. Use CTEs Carefully

  • On many engines, Common Table Expressions (CTEs) are recomputed each time they’re referenced.

  • If a CTE is referenced multiple times, store its result in a temporary table instead (see the sketch below).

  • Cost saving: 5–15% (depends on query complexity).
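A minimal sketch, using BigQuery-style scripting for the temporary table (other platforms use CREATE TEMPORARY TABLE or an equivalent):
SQL
-- Less efficient: the CTE is recomputed for each reference
WITH daily_totals AS (
    SELECT DATE(tpep_pickup_datetime) AS trip_date,
           SUM(total_amount) AS revenue
    FROM NYC_Taxi_Dataset
    GROUP BY trip_date
)
SELECT * FROM daily_totals WHERE revenue > 100000
UNION ALL
SELECT * FROM daily_totals WHERE revenue < 10000;

-- Optimized: compute the aggregation once, then reuse the result
CREATE TEMP TABLE daily_totals AS
SELECT DATE(tpep_pickup_datetime) AS trip_date,
       SUM(total_amount) AS revenue
FROM NYC_Taxi_Dataset
GROUP BY trip_date;

SELECT * FROM daily_totals WHERE revenue > 100000
UNION ALL
SELECT * FROM daily_totals WHERE revenue < 10000;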

 

9. Reduce Ordering and Sorting Costs

  • Don’t use ORDER BY when writing tables. Use it only in the final query stage if needed for visualization.

  • Sorting large datasets is expensive and often unnecessary. UI/preview sorting in tools like BigQuery, Databricks, or MS Fabric SQL endpoints is effectively free for the handful of rows shown, but sorting the entire dataset on the server costs compute (see the sketch below).
  • Cost saving: 5–15%.
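A minimal sketch of the difference (the target table names are illustrative):
SQL
-- Less efficient: sorts the full dataset while writing the table
CREATE TABLE trips_ordered AS
SELECT VendorID, tpep_pickup_datetime, total_amount
FROM NYC_Taxi_Dataset
ORDER BY tpep_pickup_datetime;

-- Optimized: write the table without ORDER BY, and sort only the
-- small final result that a dashboard or report actually shows
CREATE TABLE trips_unordered AS
SELECT VendorID, tpep_pickup_datetime, total_amount
FROM NYC_Taxi_Dataset;

SELECT *
FROM trips_unordered
ORDER BY tpep_pickup_datetime DESC
LIMIT 100;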

10. Leverage Indexes and Statistics

  • Keep table statistics up to date so the query optimizer can pick faster, cheaper plans. Update after big data changes — some systems do this automatically; others need commands like ANALYZE, COMPUTE STATISTICS, or VACUUM.

  • Example:
SQL
-- Update stats
UPDATE STATISTICS Orders;
  • Use columnstore indexes for analytic workloads on engines that support them (e.g. SQL Server).

  • Example:
SQL
-- Create a clustered columnstore index on the Orders table.
-- Columnstore storage compresses the data and speeds up the
-- scans and aggregations typical of analytic queries.
CREATE CLUSTERED COLUMNSTORE INDEX IX_Orders_Columnstore
ON Orders;

11. Query Modularity

  • Break one massive query into smaller steps (see the sketch below).

  • Benefits: easier debugging, better optimizer decisions, and often reduced cost.

  • Cost saving: Harder to quantify, but modular queries can avoid runaway costs when a step fails mid-execution.
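A minimal sketch of splitting one large query into two smaller steps (BigQuery-style temporary table; adjust the syntax for your platform):
SQL
-- Step 1: materialize a filtered, slimmed-down intermediate result
CREATE TEMP TABLE trips_2024 AS
SELECT VendorID,
       DATE(tpep_pickup_datetime) AS trip_date,
       total_amount
FROM NYC_Taxi_Dataset
WHERE tpep_pickup_datetime >= '2024-01-01';

-- Step 2: aggregate from the much smaller intermediate table.
-- If this step fails, only the cheap aggregation needs to be rerun.
SELECT VendorID, trip_date, SUM(total_amount) AS revenue
FROM trips_2024
GROUP BY VendorID, trip_date;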

Conclusion

SQL query optimization isn’t just a technical best practice — it’s a direct cost-control strategy. By combining general SQL hygiene with cloud-native features, organizations can:

  • Cut cloud warehouse bills by 20–40%.

  • Improve dashboard performance and user experience.

  • Reduce wasted compute and improve sustainability.

💡 Want to know how we can help your company reduce cloud operation costs?
Schedule a call with us — let’s work together to make your queries leaner, faster, and cheaper.

Written by Ally Mashaura, Analytics Engineer at Xomnia
