DjangoCon 2011 – Secrets of PostgreSQL Performance

After a slightly dry but informative talk from Russell Keith-Magee on the current state of the Django Software Foundation (which I forgot to live-blog), I’m back with the next talk of the night, from Frank Wiles.

PostgreSQL is effectively the default RDBMS for Django. Learn the dark arts of optimizing this powerful database to be blazingly fast on your own hardware or in the cloud.

You can view the slides at http://media.revsys.com/talks/djangocon/2011/secrets-of-postgresql-performance.pdf

Updates below:

19.52

Frank has finished talking now. An extremely well-thought-out talk, with a good combination of technical tips and general insight.

19.48

Break large tables into smaller tables by date or another partitioning scheme that fits your data.
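
A minimal sketch (not from the talk) of date-based partitioning via table inheritance, the usual approach in PostgreSQL at the time; table and column names are hypothetical:

-- parent table holds no rows itself
CREATE TABLE events (id serial, created date NOT NULL, payload text);

-- one child per month, constrained to its date range
CREATE TABLE events_2011_09 (
    CHECK (created >= DATE '2011-09-01' AND created < DATE '2011-10-01')
) INHERITS (events);

-- (inserts are normally routed to the right child by a trigger or rule)

-- with constraint exclusion on, this scans only the matching children
SET constraint_exclusion = partition;
SELECT count(*) FROM events WHERE created >= DATE '2011-09-15';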

19.47

Using triggers can be quicker than signals. Consider them if you do a lot of repetitive per-row work.
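
A minimal sketch (not from the talk) of the idea: a trigger that stamps updated_at inside the database, work a Django pre_save signal might otherwise do in Python. Table and column names are hypothetical.

-- runs in the database, no round trip through the application
CREATE FUNCTION touch_updated_at() RETURNS trigger AS $$
BEGIN
    NEW.updated_at := now();
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER tickets_touch
    BEFORE UPDATE ON tickets
    FOR EACH ROW EXECUTE PROCEDURE touch_updated_at();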

19.47

Statistics

default_statistics_target = 10

ALTER TABLE <table> ALTER COLUMN <column> SET STATISTICS <integer>;
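
For example, with a hypothetical table and column (run ANALYZE afterwards so the new target takes effect):

-- sample the status column more thoroughly for better row estimates
ALTER TABLE tickets ALTER COLUMN status SET STATISTICS 100;
ANALYZE tickets;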

19.46

Performance Tools:

  • Django Debug Toolbar – Easily see # of queries and diagnose problems
  • Slow query log (log_min_duration_statement = 1000, i.e. log anything slower than one second)
  • pgfouine – a log analyzer that shows your most frequent and most time-consuming queries
  • EXPLAIN (see the sketch after this list)
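
A quick sketch of reading a plan with EXPLAIN; the output below is abbreviated and hypothetical:

-- EXPLAIN ANALYZE actually runs the query and reports real timings
EXPLAIN ANALYZE SELECT * FROM tickets WHERE open = 't';
--  Seq Scan on tickets  (cost=0.00..458.00 rows=120 width=64)
--    (actual time=0.014..9.864 rows=132 loops=1)
-- a Seq Scan over a big table usually means a missing or unused index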

19.44

Partial Indexes – index only the rows you actually query, so the index stays small:

CREATE INDEX open_tickets_idx ON tickets (open) WHERE open = 't';

19.43

Multicolumn Indexes can help if you’re constantly searching on the same combination of columns.
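
For instance (hypothetical names), one index covering a common two-column search; column order matters, so lead with the column you also filter on by itself:

CREATE INDEX tickets_status_created_idx ON tickets (status, created);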

19.41

Instead of using count(), you can keep a simple key/value table that stores the counts you need, updated via triggers in PostgreSQL or via save signals in Django.
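
A minimal sketch of the trigger variant (not from the talk; names are hypothetical):

CREATE TABLE row_counts (table_name text PRIMARY KEY, n bigint NOT NULL);

-- seed the counter once from the real count
INSERT INTO row_counts VALUES ('tickets', (SELECT count(*) FROM tickets));

CREATE FUNCTION bump_ticket_count() RETURNS trigger AS $$
BEGIN
    IF TG_OP = 'INSERT' THEN
        UPDATE row_counts SET n = n + 1 WHERE table_name = 'tickets';
    ELSE  -- DELETE
        UPDATE row_counts SET n = n - 1 WHERE table_name = 'tickets';
    END IF;
    RETURN NULL;  -- return value is ignored for AFTER triggers
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER tickets_count
    AFTER INSERT OR DELETE ON tickets
    FOR EACH ROW EXECUTE PROCEDURE bump_ticket_count();

Reading the count is now a single-row primary-key lookup instead of a full scan; the trade-off is that every insert and delete contends on that one counter row.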

19.40

Common Django-specific problems:

  • Not using select_related() – fetch related objects in the same query instead of one query each
  • Queries in loops (STAY AWAY!)
  • Improper indexing – index the right things; don’t index columns you select infrequently
  • count() is slow – PostgreSQL keeps statistics on its tables that you can use instead (see the sketch after this list)
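
A sketch of the statistics trick from that last bullet: reltuples in pg_class is the planner’s row estimate, usually close enough when an approximate count will do.

SELECT reltuples::bigint AS approx_rows
FROM pg_class
WHERE relname = 'tickets';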

19.37

Mount your filesystem with noatime, so reads don’t generate access-time writes.

19.36

“Your choice of filesystem matters. Don’t use a journaled filesystem for your WAL. Use XFS for maximum performance”

19.34

When working “In the Cloud”

  • Remember: You’re sharing disks and you don’t even know how!
  • You can get the best of both worlds with Rackspace and their Cloud Connect product
  • Using EBS volumes with software RAID is the best (but somewhat scary) option on AWS

19.33

“If you can afford SSDs, you can ignore the rest of this talk”

19.32

Use Tablespaces:

  • Allow you to place specific tables and indexes on different disks/volumes
  • Consider putting archive or legacy data onto slower drives/volumes (see the sketch after this list)
  • Separate your data and indexes onto different volumes (if possible)
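
A minimal sketch (path and names hypothetical; the directory must already exist and be writable by the postgres user):

-- a tablespace on a slower, cheaper volume
CREATE TABLESPACE slow_disk LOCATION '/mnt/archive/pgdata';

-- move a legacy table onto it
ALTER TABLE legacy_orders SET TABLESPACE slow_disk;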

19.31

Use pgbouncer to pool connections

19.30

Hardware Considerations:

  • Have as much RAM as possible (best bang for your buck)
  • Disks, have lots of disks (faster disks make a huge difference; RAID 1+0 is good, RAID 5 is bad; separate the WAL onto its own disk)
  • Only after the previous two should you worry about CPU speed – you’re unlikely to be CPU-bound

19.28

Other parameters

wal_buffers – set to 16MB and forget it

checkpoint_segments – Increase to at least 10

maintenance_work_mem – 50MB for every GB of RAM (used for maintenance work such as VACUUM and CREATE INDEX)

synchronous_commit – turn it off if you can tolerate a small risk of data loss: commits are flushed to the WAL in batches, so a crash can lose roughly the last 600ms of committed transactions

19.25

The “Big 3” Tuning Parameters

shared_buffers – Set to 25% of available RAM and move up/down 5% to find the sweet spot

effective_cache_size – Planning hint that tells PG how much RAM it can expect for OS disk cache

work_mem – the per-process amount of memory used for ORDER BY and other sort operations; every connection can use this much at once, so keep it modest
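
Since it is per-process, work_mem can also be raised for just one session when a single query needs a big sort, rather than globally (hypothetical table):

-- give only this session more sort memory
SET work_mem = '64MB';
SELECT * FROM tickets ORDER BY created DESC;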

19.24

“Do Smart Things”

  • Cache and Cache some more
  • Watch your query counts (Django 1.3 has assertNumQueries – use it!)

19.23

“Don’t do dumb things”.

Don’t…

  • Assume PostgreSQL is like MySQL
  • Make your database server do double duty
  • Create disk contention
  • Retrieve more data than you need

19.21

“Measure Everything Before and After you try it”

19.21

And we’re off. Frank Wiles takes the stage.