May 19, 2022

Run ANALYZE. Run ANALYZE. Run ANALYZE.

We’ve said it before, and we’ll say it again: one of the first things you should do, if not the first, when troubleshooting a slow-running PostgreSQL query is to run ANALYZE. A recent interaction with an OtterTune user is a good case in point.

About PostgreSQL ANALYZE

Accurate statistics about the data your queries work with are essential. The PostgreSQL ANALYZE command collects statistics about specific table columns, entire tables, or entire databases, and stores the results in the pg_statistic system catalog. The PostgreSQL query planner then uses these statistics to choose the most efficient execution plans for queries.

By default, ANALYZE examines every table in the current database. If you supply a table name as a parameter, ANALYZE examines only that table. You can also give ANALYZE a list of column names, in which case it collects only the statistics for those columns.
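
For example, you can collect statistics for the whole database, a single table, or specific columns. A quick sketch, using the table and column names that appear later in this post:

-- Collect statistics for every table in the current database
ANALYZE;

-- Collect statistics for a single table
ANALYZE mrt_contributions;

-- Collect statistics for specific columns only
ANALYZE mrt_contributions (project_id);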

OtterTune user with slow-running load-transform jobs

Recently, a PostgreSQL user contacted us via the OtterTune Community Slack channel. This OtterTune user had an ETL workload running scripts that bulk-loaded data, and then ran queries to pre-aggregate values and store them. They had been experiencing slow performance problems for months.

They were running Amazon RDS PostgreSQL. The load-and-transform jobs ran daily, processed roughly 4 million rows of data, and took about one hour to run on average. OtterTune tuning improved the data normalization time by 30%, but that was only part of the overall job.

In response to the user’s help request, I dug into the issue. The first thing I did was ask the user to send us the EXPLAIN ANALYZE output for their queries.
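
If you want to follow along, EXPLAIN ANALYZE executes the query and reports both the planner’s estimated row counts and the actual ones. A minimal sketch against the tables from this case (the query itself is illustrative):

EXPLAIN ANALYZE
SELECT count(*)
FROM mrt_contributions
JOIN mrt_projects USING (project_id);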

The output showed that the database was not using an index on the table and was instead doing a sequential scan over roughly 3.5 million rows, which is why the queries were running so slowly. Digging further, we noticed that these query plans were very different from the plans the customer got when running the same queries manually.

When the customer ran the query manually, the plan used a hash join:

Hash Cond: (mrt_contributions.project_id = mrt_projects.project_id)

Our guess was that the user had not run ANALYZE after the bulk load, so the statistics were stale and the optimizer chose not to use an index. We asked the user to run ANALYZE after loading the data in the dbt script, and we uncovered the key issue in this line of the plan:

Seq Scan on mrt_contributions (cost=0.00..242575.00 rows=1 width=4) 
(actual time=0.006..1955.159 rows=3474601 loops=1)

We immediately recognized the problem: the planner’s estimate was 1 row, but the actual number of rows was 3,474,601. The optimizer chose a sequential scan in the query plan because it thought there was only one row.
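
The fix is to refresh statistics as soon as the bulk load finishes, so the planner sees the real row counts. A minimal sketch, assuming a plain SQL load step rather than the user’s dbt setup (the COPY source path is illustrative):

-- Bulk-load the data (file path is illustrative)
COPY mrt_contributions FROM '/data/contributions.csv' WITH (FORMAT csv, HEADER true);

-- Refresh planner statistics immediately after the load
ANALYZE mrt_contributions;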

From 52 minutes to 34 seconds

The customer quickly learned the importance of running ANALYZE and is now creating a script so ANALYZE runs automatically at regular intervals. We know from experience that running ANALYZE quickly resolves, or at least points the way to resolving, many slow-query issues. As the customer put it: “Proof is in the pudding. Calling ANALYZE after building the models dropped the query times dramatically.”
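
One way to automate this on Amazon RDS PostgreSQL, assuming the pg_cron extension is available (the job name and schedule below are illustrative):

-- Run ANALYZE on the bulk-loaded table every night at 03:00 UTC
SELECT cron.schedule('nightly-analyze', '0 3 * * *', 'ANALYZE mrt_contributions');

PostgreSQL’s autovacuum daemon also triggers an automatic ANALYZE once enough rows have changed, but an explicit ANALYZE right after a large bulk load keeps the statistics fresh without waiting for that threshold.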

Dramatically, indeed. The job run time went from 52 minutes to 34 seconds.

[Chart: ELT job duration with vs. without PostgreSQL ANALYZE]