tldr; I'm looking for ways to monitor server performance to alert when individual, small, fast queries get a bad plan that will in aggregate hurt the performance of the business.
Over the past week we've suddenly had a few stored procedures grab bad query plans, and drag the server to slow downs. It hasn't been the same query, but it has involved the same tables for one group (which I imagined was a statistics problem), and when we FREEPROCCACHE'd the server another query got a bad plan, and has picked it up at random again.
Being the guy who is held responsible for the server, I'm being asked to do things like ensure no query will ever get a bad plan again, or that I react to it before the users notice the performance degradation.
I have monitoring tools, I'm monitoring long running queries – but this is a case of many queries that run inside 10ms changing to taking 5s, and that's not immediately noticeable, it barely even blips the performance monitoring CPU and memory-wise (apart from more frequent buffer cache dumps perhaps).
For any individual known query I can set up a baseline and an agent job that alerts me, or FREEPROCCACHEs its own plan when it gets a bad plan – but what is my best way to determine this for a busy server?
Should I be baselining run stats for all queries and checking the query stats for all? Can BlitzCache help me out? Will this sort of checking be too intrusive on the performance of the server?