Outside the Big SAN Box: Identifying Storage and SAN Latency in SQL Server (Dear SQL DBA)

Note: This is a “listen-able” 36 minute video. You can also listen to this as a podcast – learn how at littlekendra.com/dearsqldba.

Here’s today’s question: Dear SQL DBA,

What do you say to a SAN admin when you think that the billion dollar SAN *may* be the bottleneck and you just want to look into it? What are the technical things I need to say to make them believe there might be something to my questions?

Sincerely,

Outside the Big SAN Box

This question is near and dear to my heart

I’ve been called in a lot as a consultant with the question, “Is the storage slowing us down? If so, what do we do?” This is really mystifying to a lot of people just because they don’t know where to look, and SQL Server gives out so many metrics.

The good news is that this doesn’t require a whole lot of rocket science. You just need to know where to look.

Your Mission: Show evidence of when disk latency impacts performance, and how much it hurts

Do your homework collecting data from the SQL Server and (sometimes, maybe) Windows.

If it’s not “emergency slow”, look for potential workarounds, such as adding memory to reduce the amount of reads — that’s always cheaper than speeding up storage. Take care of due diligence looking at index fixes, too.

When those won’t do it, perform an analysis of exactly where faster storage would help you the most, document what it will help, and ask for help speeding up the workload.

Make all of your notes and data available to the SAN admin, but write up a short TLDR summary.

Political tips

I like that you’re concerned about this already! You know this is a sensitive topic, and that things can go wrong when you bring it up.

Talk about “disk latency” that you’re seeing.

Some people say things like, “The SAN is slow.” That’s like the SAN admin saying, “The SQL Server is dumb.”  Things tend to go badly after you call someone’s baby ugly (whether or not the baby is actually ugly).

In reality, the problem could be part of the N in SAN: the network. Saying “the SAN is slow” is a really general statement, and that’s part of why it’s not helpful. I’ve had slow storage incidents that were solved by replacing a single cable.

“Emergency Slow” – Look in the SQL Server Error Log for “15 second” warnings

Filter the SQL Server log looking for messages with “longer than 15 seconds” in them.

The full message is like this:

SQL Server has encountered [#] occurrence(s) of I/O requests taking longer than 15 seconds to complete on file [blah blah blah] in database [blah].
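If you don’t want to eyeball the log by hand, you can search it from T-SQL with xp_readerrorlog. A minimal sketch (xp_readerrorlog is undocumented but widely used; the parameters below follow its conventional usage):

```sql
/* Search the current SQL Server error log (log 0, log type 1 = SQL log)
   for the 15-second I/O warning text. */
EXEC sys.xp_readerrorlog
    0,                          -- log file number: 0 = current log
    1,                          -- log type: 1 = SQL Server error log
    N'longer than 15 seconds';  -- search string to filter on
```

Check the older numbered logs too (1, 2, …) if the log has cycled recently.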

This message means SQL Server sent off an I/O request and waited 15 seconds (1…. 2…. 3…. ) before getting anything back.

The # of occurrences is important. Was the SQL Server even doing a lot?

This message will only occur once every 5 minutes, so if you see the messages repeat, the latency may have been continuous.

If you see these messages, this is severe latency. Stop and send the times to the storage admins right away. Note that the SQL Server had severe storage latency at these times, and ask if any scheduled maintenance was occurring.

If there was no maintenance, ask for help starting a ticket up to investigate what’s causing this. This is a severe latency problem. For the storage with the problem, start by asking:

  • How many servers / computers share that storage
  • How active was it at that time on the SAN side

Gather read/write latency metrics from SQL Server’s “virtual file stats” dynamic management view

sys.dm_io_virtual_file_stats is your friend here!

You want to know about latency when SQL Server reads from storage. This isn’t all reads, because when it can read from memory alone, it avoids the trip to storage.

Look at how much is read and written in samples taken when performance is poor.

  • Volume of reads and writes by file (I look at MB or GB because those units make sense to me)
  • Latency – how much are you waiting?

Lots of reads but very low latency is evidence that you’re using the storage, but things are going swimmingly. And that may be the case– be open to that!  The beauty of this DMV is that it can tell you if the problem is elsewhere. (If so, start here.)
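As a starting point, here’s a sketch of a cumulative query against the DMV, joined to sys.master_files for file names. The columns are from the documented DMV; the MB conversion and the ordering are just one way to slice it.

```sql
/* Cumulative reads/writes and average latency per database file,
   since the last SQL Server restart. */
SELECT DB_NAME(vfs.database_id) AS database_name,
       mf.physical_name,
       vfs.num_of_reads,
       vfs.num_of_writes,
       vfs.num_of_bytes_read / 1048576 AS mb_read,
       vfs.num_of_bytes_written / 1048576 AS mb_written,
       CASE WHEN vfs.num_of_reads = 0 THEN 0
            ELSE vfs.io_stall_read_ms * 1.0 / vfs.num_of_reads
       END AS avg_read_latency_ms,
       CASE WHEN vfs.num_of_writes = 0 THEN 0
            ELSE vfs.io_stall_write_ms * 1.0 / vfs.num_of_writes
       END AS avg_write_latency_ms
FROM sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
JOIN sys.master_files AS mf
  ON vfs.database_id = mf.database_id
 AND vfs.file_id = mf.file_id
ORDER BY avg_read_latency_ms DESC;
```

Remember these numbers are cumulative since startup, so treat them as a rough first look, not the final word.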

This DMV reports on all physical reads, including IO done by:

  • Backups
  • DBCC CHECKDB
  • Index maintenance

That means that the data since startup is diluted: it includes maintenance windows as well as periods when not much was going on. It’s a data point, but what you really want to know is what the data looks like in five-minute periods when performance was poor and maintenance wasn’t running (unless your problem is slow maintenance).

There are a bunch of free scripts to sample this, if you don’t feel like writing your own.
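If you’d rather roll your own, a minimal sampling sketch looks like this: snapshot the cumulative counters, wait, snapshot again, and diff. Run it during a window when performance is poor.

```sql
/* Snapshot cumulative file stats, wait five minutes, then diff
   to get latency for just that window. */
IF OBJECT_ID('tempdb..#io_snap') IS NOT NULL DROP TABLE #io_snap;

SELECT database_id, file_id,
       num_of_reads, io_stall_read_ms,
       num_of_writes, io_stall_write_ms
INTO #io_snap
FROM sys.dm_io_virtual_file_stats(NULL, NULL);

WAITFOR DELAY '00:05:00';  -- sample window: 5 minutes

SELECT DB_NAME(v.database_id) AS database_name,
       v.file_id,
       v.num_of_reads  - s.num_of_reads  AS reads_in_window,
       v.num_of_writes - s.num_of_writes AS writes_in_window,
       CASE WHEN v.num_of_reads - s.num_of_reads = 0 THEN 0
            ELSE (v.io_stall_read_ms - s.io_stall_read_ms) * 1.0
                 / (v.num_of_reads - s.num_of_reads)
       END AS avg_read_latency_ms,
       CASE WHEN v.num_of_writes - s.num_of_writes = 0 THEN 0
            ELSE (v.io_stall_write_ms - s.io_stall_write_ms) * 1.0
                 / (v.num_of_writes - s.num_of_writes)
       END AS avg_write_latency_ms
FROM sys.dm_io_virtual_file_stats(NULL, NULL) AS v
JOIN #io_snap AS s
  ON v.database_id = s.database_id
 AND v.file_id = s.file_id;
```

Throw out files with near-zero reads and writes in the window before drawing conclusions.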

Analysis: Which files have latency, and is it acceptable?

Typically, throw out samples where very few reads and writes are done. A small number of slow reads against an otherwise idle file is usually not an issue.

What’s impacted the most?

  • Reads from user database data files?
  • Writes to user database log files?
  • Reads and writes in tempdb?
  • Something else?

The latency numbers tell you how severely the latency is impacting you

Some latency is acceptable. From Microsoft’s “Analyzing I/O Characteristics and Sizing Storage Systems for SQL Server Database Applications”:

  • Data files: <= 20 ms (OLTP), <= 30 ms (DW)
  • Log files: <= 5 ms

These are pretty aggressive targets.

Your storage is not “on fire” if you see periodic 100 ms write latency to your data files.

One point on writes to data files: SQL Server writes to memory and to a write-ahead transaction log. Seeing 100 ms write latency on a data file does not mean a user waits 100 ms when inserting one row.

However, seeing a 100ms read latency from a data file may mean that a user is waiting on that.

Things to think about:

  • Adding memory frequently reduces read latency from storage for data files
    • Data file reads don’t have to be blazing fast when there’s sufficient memory to make reads from storage fairly rare and the workload is largely reads
  • In many environments, 100 ms is tolerable as read latency. It’s all about performance standards.

Heavy write latency to transaction logs can usually only be reduced by speeding up storage or (sometimes) by changing application patterns, if you’ve got a lot of tiny commits.

  • Both of these options are typically somewhat expensive, unfortunately

Helpful: Cover your (developer’s) butt with a sanity check on indexes

Arguably, storage shouldn’t be slowing you down, whether or not you have good indexes

However, it’s polite to do a sanity check, because indexing can dramatically reduce the number of physical reads you do (sometimes)

Note: you shouldn’t have those “15 second” latency warnings in the SQL Server error log, no matter what your indexes are.

If you’ve never done an indexing health check, asking a developer to do a quick assessment is fine too; you don’t have to do it yourself.

Sometimes helpful politically: Gather disk latency counters from Windows

DBAs are usually suspicious of the SAN, and SAN administrators are usually suspicious of the SQL Server.

If this dynamic is a problem in your environment, Windows is kind of a neutral middle ground that helps make the SQL Server info more convincing / less threatening. It will “back up” the virtual file stats data if you need that.

Latency counters on the PhysicalDisk Object:

  • PhysicalDisk – Avg. Disk sec/Read
  • PhysicalDisk – Avg. Disk sec/Write
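One lightweight way to capture these without opening Perfmon is typeperf from a Windows command prompt. The intervals here are just examples (sample every 15 seconds, 40 samples = 10 minutes):

```
typeperf "\PhysicalDisk(*)\Avg. Disk sec/Read" "\PhysicalDisk(*)\Avg. Disk sec/Write" -si 15 -sc 40
```

Add `-o disk_latency.csv -f CSV` if you want the samples in a file you can share with the SAN admin.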

Recap!

  • Talk about measured storage latency and its impact on the SQL Server. Sticking to the data points helps keep people from getting defensive / taking it personally.
  • Look for “emergency slow” first, and triage those as a critical issue
  • Use samples from sys.dm_io_virtual_file_stats to identify if storage latency impacts you
  • Analyze it to identify which databases are impacted, and reads vs writes
  • Consider whether adding memory is a cheaper way to reduce the latency
  • Remember that backups and maintenance do physical IO and will show in virtual file stats
  • Do an indexing sanity check if possible as part of due diligence
  • Collecting Windows physical disk latency counters may help “bridge the gap” for some SAN admins

What if my SAN admin doesn’t know what to do?

  • Your storage vendor would love to help (for a cost) if your SAN admins aren’t sure how to troubleshoot the latency
  • If the SAN is a recent implementation and what you’re seeing doesn’t live up to what was advertised, you may be able to get some of their time based on that difference (but your mileage will vary)


Comments

Kendra,

Are there any agent alerts that can be set up that correspond to the “15 second” warning?

Thanks!


    Great question!

    The easiest way I can think of is to look for Msg 833 in the Windows Event Log. Many people have system-wide monitoring tools like SCOM that can easily be configured to look for a specific error number in the Windows log. Here’s the documentation on the error number: https://support.microsoft.com/en-us/kb/2137408

    You should also be able to create an alert for error number 833 in the SQL Server agent (sql server event alert, base the alert on error number instead of severity). I haven’t done it this way myself though, so now I’m scratching my head wondering if I can reproduce a 15 second error on my VM to test it! I do see from a quick search online that some others are reporting it does work successfully.
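For reference, a sketch of what that agent alert could look like using sp_add_alert and sp_add_notification. The alert and operator names here are made up, and this assumes an operator and Database Mail are already configured:

```sql
/* Sketch: SQL Agent alert on error 833, the 15-second I/O warning.
   Error 833 is written to the Windows event log, which is what
   agent alerts watch. */
EXEC msdb.dbo.sp_add_alert
    @name = N'IO longer than 15 seconds (error 833)',
    @message_id = 833,
    @severity = 0,                       -- must be 0 when @message_id is set
    @include_event_description_in = 1;   -- 1 = include message text in email

EXEC msdb.dbo.sp_add_notification
    @alert_name = N'IO longer than 15 seconds (error 833)',
    @operator_name = N'DBA Team',        -- hypothetical operator name
    @notification_method = 1;            -- 1 = email
```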

Kevin Fries
June 16, 2016 3:51 pm

Perfect timing. I just scanned through some logs earlier this week and found a 15 second warning on something we’ve outsourced to a vendor for the application but the DB is still within our datacenter. I was asked to look at another (non-)issue but I’ll be circling back on this one.

Argenis Fernandez
July 22, 2016 9:49 am

“These are pretty aggressive targets”

Yeah, not so much nowadays. IMO, everyone should be shooting for < 2ms access to the transaction log and < 5ms for the data files. But, super important: bandwidth. Lots of people don't talk about total throughput. Shameless plug: please listen to my recently recorded interview here: http://sqldatapartners.com/2016/07/13/episode-53-storage-options/


    Haha, say the man who works for a flash storage vendor 😉

    Honestly, I agree with you for environments where performance is a big deal.

    For environments where performance just needs to be basically OK and there’s relatively light load, and people are investigating very BAD performance, I just don’t want people to see a 100ms average read latency and think that’s the cause of incredibly bad application performance.

    Also, your shameless plug is totally welcome, I’m going to check that out.

      Argenis Fernandez
      July 22, 2016 10:05 am

      The problem is that high latency against a volume can be very much OK when doing throughput-intensive things like backup/restore or read-ahead scans, because there’s a lot of IO queuing. Folks need to watch out for those red herrings 🙂


I’m a little confused when you said small reads causing a 15s wait is usually not a problem.

In my workplace we see it the other way around – large reads and writes triggering that message are likely the integrity checks, backups, and indexing – they smash the storage and so of course it’s going to be slow.

But a 15s wait when not much is happening means the SAN is probably being killed elsewhere and everyone will be feeling it.

What do you think?


    I think it’s a wording issue. I always consider the 15 second latency message to be a serious problem.

    The number of occurrences sampled just gives a suggestion of what the impact was on that server at the time sampled. If it’s one IO request in tempdb, it is possible nobody even noticed, and very unlikely this instance was the cause.

    I would still consider it a significant problem and want to find root cause, as impact if it recurs can be far worse.

    There’s also the fact that the number of IOs impacted is only counted every five minutes, so the impact could have gotten worse in the 4.9999 minutes after that sample.

    Definitely didn’t mean to trivialize it, just explain interpretation a bit.

