PerfMonitor Users Guide

PerfMonitor is a tool for quickly and easily collecting ETW performance data and generating useful reports.   It is based on the Event Tracing for Windows (ETW) feature of the operating system, which can collect information machine wide on a variety of useful events (see Advanced Data Collection for what data is already built into the OS and CLR).   It is the powerful technology that the Windows performance group uses almost exclusively to track and understand the performance of Windows, and the basis for their XPerf tool.   The data collected by PerfMonitor is the same data that XPerf uses, so PerfMonitor can be used to simplify data collection and simple reporting, and XPerf can be used for more advanced analysis on the same data.    PerfMonitor also has the ability to decode symbolic information associated with .NET Runtime code, making PerfMonitor more valuable for .NET investigations.   The source code for PerfMonitor, as well as the TraceEvent data parsing library on which it is based, is available at bcl.codeplex.com, so user-defined data analysis is also relatively easy. 

See also FAQ, TroubleShooting, ETW Background, Understanding the Analysis Report

Getting PerfMonitor

PerfMonitor was designed to be easy to deploy and use.   To deploy PerfMonitor simply copy PerfMonitor.exe to the computer you wish to use it on.    No additional files or installation step is needed.   You can get the latest copy of PerfMonitor.exe from the PerfMonitor download page.   Simply click on the PerfMonitor.bin.zip file to download it,  click through the dialog boxes, and then copy the exe to your hard drive.   You are then ready to go!

Quick Start

Open an Administrator command window (Start->All Programs ->Accessories -> Command Prompt -> Right Click -> Run as Administrator), change your directory to where you placed the perfMonitor.exe and then type the command

perfMonitor  runAnalyze perfMonitor CpuTest

This command tells perfMonitor to start machine-wide ETW data collection, run the command 'perfMonitor CpuTest' (a small CPU-bound test built into PerfMonitor itself), stop collection, and then generate an analysis report focused on that process.

You can then browse the resulting report.   The source code for CpuTest is useful when analyzing this data.  See Understanding the Analysis Report for more on the report.


PerfMonitor Step By Step: Collecting ETW Data

The 'runAnalyze' command used in the quick start was designed to make a common set of operations easy.  However it is just one possible 'path' from data collection to analysis.  PerfMonitor allows these steps to be done individually, which allows for more advanced scenarios.     The first step in most profiling scenarios is simply to turn data collection on and off.  This is what the 'start' and 'stop' operations do.
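For example, a minimal profiling session might look like the following sketch (it assumes perfMonitor.exe is on your path; the file name argument is optional, as described below):

```shell
perfMonitor start myScenario.etl
rem ... exercise the scenario you want to measure ...
perfMonitor stop
```

Everything that happened on the machine between the two commands ends up in myScenario.etl (plus its .kernel.etl shadow file).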

The start command turns on (machine wide) logging, and the stop command turns logging off.    Only one session can be on simultaneously.  If you try to start a session when one is already started, it closes the existing session and opens a new one.   The effect of turning on a session is immediate and takes effect on live processes as well as new processes that are created.  A file name can be specified (the suggested extension is .ETL), but it will default to 'perfMonitorOutput.etl' if not provided.  ETW allows a huge array of events to be turned on, which can be a daunting task, so perfMonitor simplifies this by choosing a good set of default events that are useful in a broad variety of performance investigations but are not too expensive to collect.   This includes OS kernel operations associated with processes, threads, DLLs, page faults, disk I/O, and CPU sampling, as well as CLR events associated with garbage collections, the CLR thread pool, lock contention, exceptions, and compilation of managed code to native code.    While PerfMonitor chooses a good set of default events to collect, through various options it gives you full control over what events are collected; see Advanced Data Collection for more. 

Once logging has been started, it will continue until it is explicitly stopped.   Thus scripts that use 'start' should be very robust in their error handling to ensure that the 'stop' command is executed in all cases (including the user hitting Ctrl-C).   For this reason you should try to use the 'run' or 'collect' commands instead of 'start' and 'stop' whenever possible. 

Due to a limitation in the ETW implementation, kernel events cannot be collected into the same ETL file as all other events.   PerfMonitor works around this by making two separate files (e.g. *.etl and *.kernel.etl).   Thus perfMonitor typically generates two files.  Whenever PerfMonitor is manipulating an ETL file called X.etl, it looks for any 'shadow' file like X.*.etl and combines the files into one when doing its processing.   Thus in many scenarios you can ignore the fact that PerfMonitor generates multiple files.  You can also merge these files into a single file (see the perfMonitor merge command below).

The Run command

The combination of starting ETW logging, running a command, and then stopping is so common that the 'run' command was created to do precisely this.   Thus the command

perfMonitor run Foo.exe

is equivalent to

perfMonitor start
Foo.exe
perfMonitor stop

Note that data collection is still machine wide, so you will get information about the entire machine for the time that Foo.exe was running.   One important value of the 'run' command is that it tries quite hard to ensure that the 'stop' command is issued under any error condition or if the user interrupts the command (Ctrl-C).  

The Collect command

When you wish to profile a process which is already running or can't be started 'by hand', you can't use the 'run' command.   While you could use 'start' and 'stop', we wish to discourage that (it is too easy to leave ETW logging on).   Instead there is a 'collect' command, which turns on logging and waits for the user to type 'S' to stop collection.   Thus

perfMonitor collect

will start logging and wait for the user to type 'S' and the Enter key.   You can abort the collection with Ctrl-C.  

When data is collected with 'collect' (or when perfMonitor 'stop' is used), by default PerfMonitor does a 'ClrRundown'.   A 'ClrRundown' makes a request to every process running the .NET Runtime to dump symbolic information so that stack traces involving managed code can be decoded.   This is only needed if the process of interest is managed AND it has not exited.   This is why it is not necessary to do a rundown when you use the 'run' command (since the process already did a rundown when it exited). 

Rundown does take time and file space, so if you know that the process of interest is either not a .NET program or will have died before collection has ended you can specify the /noRundown flag to 'collect' to avoid this overhead. 
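For example, to collect without the rundown (a sketch; it assumes the qualifier is given alongside the command, as with PerfMonitor's other qualifiers):

```shell
perfMonitor collect /noRundown
```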

Doing analysis on a different machine (or using XPerf): Merging

As already mentioned, due to limitations in ETW, data from PerfMonitor typically consists of two files: an .ETL file and a .Kernel.ETL file.   In addition, these files do not contain all the information necessary to decode the events on another machine (the missing information includes the exact versions of the executable files, as well as the mappings of hard drives to drive letters).  Thus the raw information is not well suited to being transferred to another machine for analysis.     The perfMonitor 'merge' command is designed to fix this.   By typing

perfMonitor merge fileName.etl

PerfMonitor will combine fileName.ETL (defaults to perfMonitor.etl) and fileName.kernel.ETL into just fileName.ETL (deleting the originals if successful).  Moreover it will include all information needed to decode the ETL file on another machine.   Thus after merging, the single file can be transferred to another machine for off-line analysis.  It works equally well on the machine where the data was collected (it just takes an extra step to create). 

The ETL files created by PerfMonitor are identical to files that would be created by the Windows XPerf tool.   Thus a useful scenario is to use PerfMonitor to collect information on a remote machine (because it is one EXE that is easy to deploy anywhere), and then use XPerf to analyze the resulting data on another machine.  

Long running scenarios: Circular Buffering

Typically ETW will generate 2-4 Meg of data for every second of collection (although this can vary widely depending on the events being logged, and what applications are doing during the log).    It is also true that log files larger than 500 Meg or so are quite unwieldy.   Thus it is best if data is collected for less than 100 seconds.   This can be a problem if you are trying to understand a performance problem that is intermittent.    This is where the /Circular:sizeMB option is useful.   By specifying /Circular:100 you cap the file size at 100 Meg (probably enough for 25-50 seconds of trace), so logging can be left on indefinitely.   When the problem occurs, as long as you can stop the logging within about 25 seconds, you should have data on the performance anomaly. 
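For example (a sketch of the qualifier described above):

```shell
perfMonitor collect /Circular:100
```

When the interesting behavior has occurred, stop collection as usual; the file will contain roughly the last 25-50 seconds of events.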


PerfMonitor Step By Step: Analyzing ETW Data

Once you have ETL files created by PerfMonitor, you can run a variety of analyses on them by using the 'analyze' command

perfMonitor analyze /process:nameOrPID fileName.etl

This will perform a set of useful performance evaluations on the data in fileName.ETL (defaults to perfMonitor.etl) for the process designated by nameOrPID.   NameOrPID is either the process name (defined to be the name of the EXE without path or extension) or the (decimal) process ID.    The 'analyze' command will work if you don't specify a process to focus on, but the result is typically not what you want.   You can discover which processes were started in a given ETL file by asking PerfMonitor for its process list; the resulting list of processes that started while data was being collected can then be used to determine the correct /process: qualifier to use. 

As a convenience, the command

perfMonitor runAnalyze Foo.exe

will perform the run command (start data collection, run the command, stop), and then do an 'analyze' command, using the process name from the command as the focus process.   This was the command used in the Quick Start, and is quite common.

Understanding the Analysis Report

The report that is generated by the 'analyze' command is meant to be self-explanatory; however, a certain amount of performance background is valuable in interpreting the data, and that is what is presented here.   This section assumes some familiarity with ETW; please see ETW Background if you are new to ETW. 

The main report has three sections

  1. CPU - A breakdown of CPU used.  If a large amount of the clock time is CPU time, this is the place to start.  
  2. Garbage Collection (GC) - Shows the total memory used, as well as how much of this is used by the garbage collected heap.   
  3. Just In Time (JIT) compilation - Shows the costs associated with compiling IL to native code.   This mostly happens on startup, and thus is only interesting for startup scenarios.   If there is too much JIT compilation going on, NGen Tool can be used to avoid the issue. 

See also Understanding CPU Performance Data, Understanding GC Performance Data, Understanding Just In Time Compiler Data, FAQ, TroubleShooting, ETW Background, Home.

Understanding CPU Performance Data

The data shown by default in the PerfMonitor stack viewer are stack traces taken every millisecond on each processor on the system.   Every millisecond, on every CPU, whatever process is running on that CPU is stopped, and the operating system 'walks the stack' associated with the running code, collecting the return address of every method on the stack.   Stack walking may not be perfect.   It is possible that the OS can't find the next frame (leading to broken stacks) or that an optimizing compiler has removed a method call (see missing frames), which can make analysis more difficult.   However for the most part the scheme works well, and has low overhead (typically less than 5% slowdown), so monitoring can be done on 'production' systems. 

How many samples do you need?

Because the samples are taken every millisecond per processor, each sample represents 1 millisecond of CPU time.   However, exactly where the sample is taken is effectively 'random', and so it is really 'unfair' to 'charge' the full millisecond to the routine that happened to be running at the time the sample was taken.   While this is true, it is also true that as more samples are taken this 'unfairness' decreases as the square root of the number of samples.   If a method has just 1 or 2 samples, it could be just random chance that they happened in that particular method, but methods with 10 samples are likely to have truly used between 7 and 13 samples (30% error), and routines with 100 samples are likely to be within 90 and 110 (10% error).    For 'typical' analysis this means you want at least 1000 and preferably more like 5000 samples (there are diminishing returns after 10K).   By collecting a few thousand samples you ensure that even moderately 'warm' methods will have at least 10 samples and 'hot' methods will have hundreds, which keeps the error acceptably small.   Because PerfMonitor does not allow you to vary the sampling frequency, this means that you need to run the scenario for at least several seconds (for CPU bound tasks), and 10-20 seconds for less CPU bound activities.  
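This square-root relationship is easy to check with a quick calculation (nothing PerfMonitor-specific here; it just evaluates the relative error 100/sqrt(n) for a few sample counts):

```shell
# relative sampling error is roughly 1/sqrt(n), shown here as a percentage
awk 'BEGIN { split("10 100 1000 10000", n);
  for (i = 1; i <= 4; i++) printf "%5d samples: +/- %.0f%%\n", n[i], 100/sqrt(n[i]) }'
```

The 30% and 10% figures quoted above, and the roughly 3% noise floor at a few thousand samples, all fall out of this one formula.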

Even with 1000s of samples, there is still 'noise' that is typically at least 3%.   This error gets larger as the methods being investigated have fewer samples.   This makes it problematic to use sample based profiling to compare two traces to track down small regressions (say 3%).   The noise is likely to be at least as large as the 'signal' (the diff) you are trying to track down.   Increasing the number of samples will help; however, you should always keep the sampling error in mind when comparing small differences between two traces. 

Is your problem even CPU?

Before doing a detailed CPU analysis, it is important to ensure that optimizing CPU will have an effect.   If your program is disk or network bound, then optimizing CPU will have little effect.  This is why the top level 'rollups' of CPU consumption are important to look at first.   If the total CPU samples add up to 400 msec, then that is an upper bound on the effect CPU optimization can have.  You should spend your time only on costs that are large enough to justify optimization. 

Exclusive and Inclusive Metrics

Because a stack trace is collected for each sample, every node has both an exclusive metric (the number of samples that were collected in that particular method) and an inclusive metric (the number of samples that were collected in that method or any method that it called).     Exclusive times are useful in the bottom-up view to quickly determine the 'hot' methods that are the most important to optimize.    Inclusive times are useful in the top-down and caller-callee views to understand how time is spent in methods that are not in the bottom-up view. 

Grouping of OS methods

All programs call into operating system functions to do 'primitive' operations (e.g. read from a file, update the screen ...).   However, these functions typically have a large number of private internal methods that implement the functionality.   This detail is typically not useful to an application developer trying to optimize HIS code, and only serves to make the report hard to read.   PerfMonitor avoids this problem by grouping all functions that are part of the operating system.   A method is defined to be part of the operating system if it is implemented in a DLL that is under the c:\windows directory.   Once a call is made into an OS function, any further calls are grouped into the existing frame until a method outside the OS is called.     This feature can be turned off with the /noOSGrouping qualifier if desired. 

This grouping greatly simplifies the CPU reports PerfMonitor generates.   It does, however, mean that the meaning of 'exclusive' time for OS functions is different from user functions.  For OS functions, exclusive time means the sample might have been in the function itself, but it might also have been in any OS function that function called. 

Getting Stacks for Unmanaged Methods

By default PerfMonitor collects all the information needed to map the information collected at runtime (return addresses) into names of methods.   However, unmanaged code relies on information stored in program database files (PDBs) to provide this mapping.   By default PerfMonitor does not try to map return addresses of unmanaged code to symbolic names.  Typically PerfMonitor can determine the module of the return address but needs the PDB to resolve the function name, resulting in a name like module!?.  If even the module cannot be determined, the address is displayed as ?!?. 

PerfMonitor can be made to look up return addresses in unmanaged code using the /SymbolsForDlls:dllList qualifier, where dllList is a comma separated list of DLL names (without path or extension).   For this to work the _NT_SYMBOL_PATH variable must be set so that the necessary PDB files can be found.   See Symbol Resolution for complete details.   You may also use the /NoOSGrouping qualifier if you are interested in the detailed stacks associated with DLLs under c:\windows.
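Putting this together, a collection run that resolves unmanaged symbols for a couple of specific DLLs might look like the following sketch (the DLL list is just an example; _NT_SYMBOL_PATH is assumed to point at the Microsoft symbol server with a local cache):

```shell
set _NT_SYMBOL_PATH=SRV*c:\symcache*http://msdl.microsoft.com/download/symbols
perfMonitor collect /SymbolsForDlls:ntdll,kernel32 /NoOSGrouping
```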

Top-down and Bottom-up Analysis

Performance investigations can either be 'top-down' (starting with the Main program and how the time spent there is divided among the methods it calls) or 'bottom-up' (starting with the 'leaf' methods where samples were actually taken and looking for methods that used a lot of time).   Both techniques are useful; however, 'bottom-up' is usually a better way to start, because methods at the bottom tend to be simpler and thus easier to understand, and it is easier to have intuition about how much CPU they should be using.   Top-down analysis will tell you how the 'large' components of your program use CPU.   This can be useful for finding large structural problems (e.g. a phase of your program taking much longer than expected).

Understanding GC Performance Data

By default PerfMonitor causes the Runtime to log an event at the beginning and end of each .NET Garbage Collection, as well as every time 100K of objects are allocated.   As with all events the precise time is logged, so the amount of time spent in the GC can be known.    Most applications spend less than 10% of the total CPU time in the GC itself.   If your application is over this percentage, it usually means that your allocation pattern is such that you are causing many expensive Gen 2 GCs to occur.   

If the GC heap is a large percentage of the total memory used, then it is worth optimizing the GC heap using a profiler like ClrProfiler.   See Memory Usage Auditing For .NET Applications for more on memory optimization. 

During SOME GCs the application itself has to be suspended so the GC can update object references.   Thus the application will pause when a GC happens.   If these pause times are larger than 100ms or so, they can impact the user experience.   The GC statistics track the maximum and average pause time to allow you to confirm that this bad GC behavior is not happening.   

Understanding Just In Time Compiler Performance Data

PerfMonitor tracks detailed information about which methods were Just In Time (JIT) compiled.   This data is mostly useful for optimizing startup (because that is when most methods get JIT compiled).  If large numbers of methods are being compiled it can noticeably affect startup time.  This report tells you exactly how much time is being spent this way (in fact, exactly which methods were compiled and exactly when).   If JIT time is high, the NGen Tool can be used to precompile the code, and thus eliminate most of this overhead at startup. 


ETW Background

Event Tracing for Windows (ETW) is a general purpose, efficient, scalable event logging service built into the Windows operating system.   Effectively all Windows components, as well as user defined components, can define logging 'Providers' that can be turned on and off selectively at any time to provide interesting information about what that component is doing.   Each event is automatically tagged with the provider it came from, the exact (to 100ns) time it happened, and the process and thread that generated it, as well as event specific information.   Events can also be optionally tagged with the exact stack trace.   Much of the power of ETW comes from the fact that the operating system providers come built in with a variety of useful events, including process and thread creation, 1 msec CPU sampling (with stack trace), context switches, disk I/O (including file name), DLL loading, virtual memory allocation, and network I/O.   On top of these, other Windows components such as the .NET Runtime add their own providers that add component-specific information such as garbage collection events and Just In Time compilation (see Advanced Data Collection for more).   The result is a very detailed view of the system as a whole, which is sufficient to diagnose most performance issues. 

The data collection is machine wide (you automatically get events from every process in the system), which, among other things, means that data collection logistics are simpler (you don't need to know which process is important).   The overhead associated with collection varies depending on the kind of events logged and what the application does.    The default events turned on by PerfMonitor will slow the application down by less than 10% (more typically less than 5%).     In addition, the Windows development group uses ETW logging extensively, so these logging paths are well tested (you are not running in an unusual mode).     These characteristics make it reasonable to leave event collection on 'in production' to 'catch' unusual performance events.   The '/Circular' option is particularly valuable here.

The size of the data files again depends on the events that were turned on and what the application is doing.  For the default events perfMonitor will typically generate between 2 and 4 Meg of log per second.   Thus data collection times of less than 100 seconds are reasonable.   If your scenario is longer than that, you should consider using a circular buffer to limit the amount of data collected.  


Advanced Data Collection

PerfMonitor data collection is based on Event Tracing for Windows (ETW).   This is a general facility for logging information in a low overhead way.  It is used extensively throughout the Windows OS, and in particular is used by both the Windows OS Kernel and the .NET CLR Runtime.     By default PerfMonitor picks a default set of these events that have high value for the kinds of analysis PerfMonitor can visualize.   However, PerfMonitor can also be used as simply a data collector, at which point it can be useful to turn on other events.   This is what the /KernelEvents:, /ClrEvents: and /Provider: qualifiers do.
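For example, the following sketch collects a custom mix of events (the comma-separated event-name syntax matches the /kernelEvents:Process,Image example later in this guide; the particular mix shown is just an illustration):

```shell
perfMonitor collect /KernelEvents:Default,ContextSwitch /ClrEvents:GC,Jit
```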

All ETW events log the following information:

  1. The time (to 100ns resolution) when the event happened
  2. The provider that logged the event (e.g. the Kernel, CLR, or some user provider).
  3. The event number (which indicates how to decode the payload).
  4. The process and thread associated with the event (for some events there is no useful process or thread ID, but most have one).

Kernel Events

By far, the ETW events built into the Windows Kernel are the most fundamental and useful.   Almost any data collection will want to turn on at least some of these.     See Kernel ETW Events for more information on particular events.    Here we will only provide an overview.   It is useful to break kernel events into three groups:

  1. Default (so useful you should have them on all the time).
  2. High Volume (useful, but you should only turn them on when you need them).
  3. Advanced (typically you don't turn these on unless you are writing a device driver). 
The Default Kernel Group

The default group is the group that PerfMonitor turns on by default.   The most verbose of these events is the 'Profile' event that triggers a stack trace every millisecond for each CPU on the machine (so you know what your CPU is doing).    Thus on a 4 processor machine you will get 4000 samples (with stack traces) every second of trace time.   This can add up.   Assume you will get at least 1 Meg of file size per second of trace.   If you need to run very long traces (100s of seconds), you should strongly consider using the circular buffer mode to keep the logs under control. 

  1. Default = DiskIO | DiskFileIO | ImageLoad | MemoryHardFaults | NetworkTCPIP | Process | ProcessCounters | Profile | Thread
  2. DiskIO - Fires every time a physical disk read COMPLETES; indicates the size and how long the operation took.  No stack trace.
  3. DiskFileIO - Logs the mapping between OS file object handles and the name of the file.  Without this many kernel events are not useful because you can't relate the operation to a meaningful name.    You almost always want this event.  No stack trace.
  4. ImageLoad - Fires when a DLL or EXE is loaded into memory for execution (LoadLibraryEx is called).  Needed if you want to map memory addresses back to symbolic names.  Logs a stack trace. 
  5. MemoryHardFaults - Fires when the OS had to cause a physical disk read in response to mapping virtual memory.   Logs a stack trace.
  6. NetworkTCPIP - Fires when TCP or UDP packets are sent or received.   Logs the two end points and the size.  No stack trace.
  7. Process - Fires when a process is created or destroyed.  Indicates the command line (on start) or exit code (on end).  Logs a stack trace.
  8. ProcessCounters - Logs process memory statistics before a process dies or the trace ends.   No stack trace.
  9. Profile - Fires every 1 msec per processor; indicates where the instruction pointer currently is and takes a stack trace.
  10. Thread - Fires every time a thread is created or destroyed.   Logs a stack trace. 
The High Volume Group

The following Kernel events are not on by default because they can be relatively verbose or are for more specialized performance investigations. 

  1. Verbose = Default | ContextSwitch | DiskIOInit | Dispatcher | FileIO | FileIOInit | MemoryPageFaults | Registry | VirtualAlloc
  2. ContextSwitch - Fires each time the OS stops running one thread and switches to another.  It indicates the thread losing the processor and the thread getting it.  This event can fire > 10K times a second depending on the scenario, but can be VERY useful for determining why some process is waiting.  Logs a stack trace.
  3. DiskIOInit - Fires each time a Disk I/O operation begins (where DiskIO fires when it ends).  Unlike DiskIO, this logs a stack trace. 
  4. Dispatcher - Fires when a thread goes from waiting to ready (note that the thread may not actually run if there is no CPU available).  This can also fire > 10K / sec, but is very useful in understanding why waits are happening. 
  5. FileIO - Fires when a file operation completes, even if the operation does not cause a disk read (because it was in the file system cache).  Does not log a stack trace. 
  6. FileIOInit - Fires when a file operation starts.  Unlike FileIO, this will log a stack trace. 
  7. MemoryPageFaults - Fires when a virtual memory page is made accessible (backed by physical memory).   This fires not only when the page needed to be fetched from disk, but also if it was already in the file system cache, or only needed to be zeroed.    Logs a stack trace.
  8. Registry - Fires when a registry operation occurs.   Logs a stack trace.
  9. VirtualAlloc - Fires when a virtual memory allocation or free operation occurs.  All memory in a process either was mapped or was allocated through VirtualAlloc operations.
The Advanced Group

The final set of kernel events is typically useful for people writing device drivers or trying to understand why hardware or low level OS software is misbehaving. 

  1. OS = AdvancedLocalProcedureCalls | DeferedProcedureCalls | Driver | Interrupt
  2. AdvancedLocalProcedureCalls - Logged when an OS local procedure call is made.
  3. DeferedProcedureCalls - Logged when an OS deferred procedure call is made.
  4. SplitIO - Logged when a disk I/O had to be split into pieces.
  5. Driver - Logged when various hardware driver events occur.
  6. Interrupt - Logged when a hardware interrupt occurs.

CLR Events

In addition to the kernel events, if you are running .NET Runtime code you are likely to want the CLR ETW events turned on as well.    PerfMonitor turns a number of these on by default.   See CLR ETW Events for more information on these events.

  1. Default = GC | Binder | Loader | Jit | NGen | StopEnumeration | Security | AppDomainResourceManagement | Exception | Threading | Contention
  2. GC - Fires when GC starts and stops
  3. Binder - TODO
  4. Loader - Fires when assemblies are loaded or unloaded.
  5. Jit - Fires when methods are Just in Time (JIT) compiled.
  6. NGen - Fires when operations associated with precompiled NGen images happen.
  7. Security - Fires on various security checks
  8. AppDomainResourceManagement - Fires when certain appdomain resource management events occur.
  9. Contention - Fires when managed locks cause a thread to sleep.
  10. Exception - Fires when a managed exception happens.
  11. Threading - Fires on various System.Threading.ThreadPool operations
  12. StopEnumeration - Dumps symbolic information as early as possible (not recommended).
  13. StartEnumeration - Dumps symbolic information as late as possible (typically at process stop).  This is the default.
  14. JitTracing - Verbose information on Just in time compilation (why things were inlined ...)
  15. Interop - Verbose information on the generation of native interop code. 
  16. Stack - Turn on stack traces for various CLR events. 
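For example, to collect only a subset of the CLR events above (a sketch; the comma-separated list syntax follows the qualifier examples elsewhere in this guide):

```shell
perfMonitor collect /ClrEvents:GC,Exception,Contention
```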

Collecting data from other ETW Providers too

The Kernel and CLR ETW events are by far the most important because the data they collect applies to most programs.   This is why PerfMonitor turns these providers on by default, and has special qualifiers (/kernelEvents and /CLREvents) to make specifying particular events easy.   However, sometimes it is useful to also have events that tell you 'higher level' information.   There are literally hundreds of ETW providers built into Windows already, and it is also possible for you to instrument your own code with ETW events.  PerfMonitor can control these other providers too.    This is what the /providers: qualifier is for.

To turn an ETW provider on, you need three pieces of information:

  1. The GUID that was assigned to the provider that identifies it uniquely to the system.  
  2. A 64 bit integer bitset that represents which sets of events should be logged.
  3. A small integer that represents the verbosity desired (1 = critical, 5 = verbose)

The /providers qualifier is a comma separated list of these specifications, where each component above is separated by a ':'.   For example, the Silverlight provider happens to be assigned the GUID AA087E0E-0B35-4e28-8F3A-440C3F51EEF1.  Thus the specification

/provider:AA087E0E-0B35-4e28-8F3A-440C3F51EEF1:0xFFFFFFF:5

starts data collection with all the normal CLR and Kernel providers, but also turns on the Silverlight provider, passing it the bitset 0xFFFFFFF and a verbose level (5).   This is a relatively advanced scenario, as you need to know what each bit in the bitset means, and how to decode the events once they are collected.   

You can find the list of providers available in the system by issuing the following 'logman' Windows command

logman query providers

However, this list is very large and overwhelming.   It is typically more interesting to find out the set of providers that are available for your particular application.  You can do this with the command

logman query providers -pid PID_OF_PROCESS

where PID_OF_PROCESS is the process ID of your application.   This restricts the providers to just those that are currently active for that process.   This is a far smaller list, and much more likely to be relevant to you.   Once you have an interesting provider, you can learn what keywords and levels it has by issuing

logman query providers PROVIDER_NAME

This will give details on what keywords and levels can be given.   You now have the GUID, keyword, and level information that you can then pass to the /providers: qualifier of PerfMonitor.

Finding and using EventSource providers

In version 4.5 of the .NET Runtime, the System.Diagnostics.Tracing.EventSource class was added, which makes it very easy to create your own ETW provider and generate your own events.   PerfMonitor has support for turning these providers on and off very easily.   The first piece of this support is the ability to discover what EventSources a given application (or DLL) contains.  For example
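A sketch of such a command ('listSources' is the command name as best recalled; treat the exact spelling as an assumption and consult the tool's built-in usage help):

```
perfMonitor listSources MyApp.exe
```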

Will list all the EventSources in 'MyApp.exe' (or any DLL that is statically reachable from 'MyApp.exe').  For example the output might be

Event sources in eventProvider.exe.

EventSource Name                 EventSource Guid
-------------------------------------------------------------------------
MinimalEventSource            : 7370a335-9b9a-5ba0-63c2-8ce045c35503
MyCompany-MyApp               : 74cf8ddb-febf-5c1c-67c4-c721006bace0
AnotherEventSource            : a217696f-5896-584d-31d5-8c90e8d3483e

The /dumpManifests qualifier can be used to specify a directory into which to dump all the EventSource manifests found.  For example
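A sketch of such a command, assuming the same 'listSources' command form and '.' as the target directory (the exact qualifier syntax is an assumption):

```
perfMonitor /dumpManifests:. listSources MyApp.exe
```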

Will dump three .manifest.xml files into '.' (the current directory), one for each of the event sources.  These give detailed information about the keywords available in the EventSource as well as all the events and their payload schemas. 

Once you have determined which EventSource you are interested in, perfMonitor supports the syntax @filename#EventSourceName to specify EventSources in the /providers: qualifier.   For example you can turn on the 'MinimalEventSource' by specifying
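A sketch of such a specification (the /providers: value uses the @filename#EventSourceName syntax just described; the overall command shape is an assumption):

```
perfMonitor /providers:@MyApp.exe#MinimalEventSource run MyApp.exe
```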

Which runs 'MyApp.exe' with the MinimalEventSource turned on.   If you don't specify the # syntax, PerfMonitor assumes you want all EventSources turned on.  Thus
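A sketch of the unqualified form (command shape is an assumption):

```
perfMonitor /providers:@MyApp.exe run MyApp.exe
```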

Turns on all three EventSources in MyApp.exe and runs the application.   When using the 'run' command, you can even exclude the file name (it is assumed to be the exe that is run).  Thus
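With the file name elided, the specification might be written as follows (this exact abbreviation is an assumption):

```
perfMonitor /providers:@ run MyApp.exe
```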

Is equivalent to the previous example.    The command above will turn on the Kernel and CLR events as well as the EventSources.  You would need to specify the /clrEvents:None and /kernelEvents:Process,Image qualifiers to suppress all but the least verbose events.   This is common enough that PerfMonitor has a special command called 'monitorRun' that does just this.  Thus
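A sketch of the monitorRun form ('monitorRun' is named in the text; the argument shape is an assumption):

```
perfMonitor monitorRun MyApp.exe
```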

Is really shorthand for
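A sketch of the longhand form, composed from the qualifiers mentioned above (the exact combination is an assumption):

```
perfMonitor /kernelEvents:Process,Image /clrEvents:None /providers:@ run MyApp.exe
```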

Once you have created the ETL file, you can print out the formatted text from EventSources using the printSources command
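A sketch of such a command ('printSources' and the default PerfMonitorOutput.etl file name come from the surrounding text):

```
perfMonitor printSources PerfMonitorOutput.etl
```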

Will pretty-print the events in the PerfMonitorOutput.etl file to standard output.   Use the /outputFile=FILE qualifier to send the output elsewhere.   Monitoring and then printing the EventSource output as text is common enough that there is a shortcut 'monitorPrint' command
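A sketch of the shortcut ('monitorPrint' is named in the text; the argument shape is an assumption):

```
perfMonitor monitorPrint MyApp.exe
```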

Will monitor the app (turn on all its eventSources), and then print the resulting output.    If XML output is desired instead you can use the 'monitorDump' shortcut.

Finally, perfMonitor also supports real-time 'listening' to EventSources.   For example
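A sketch of such a command (the 'listen' command name is an assumption):

```
perfMonitor listen MyApp.exe
```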

Will listen to all the EventSources in 'MyApp' and print their events in real time.   Control-C ends the session. 

Symbol Resolution

At collection time, when a CPU sample or a stack trace is taken, it is represented by an address in memory.    This memory address needs to be converted to symbolic form to be useful for analysis.   This happens in two steps. 

  1.  First determine if the code belongs to a particular DLL (module) or not. 
  2. Given the DLL, look up detailed symbolic information

If the first step fails (uncommon), then the address is given the symbolic name ?!? (unknown module and method).   However if the second step fails (more common), then at least the module is known, and the address is given the symbolic name module!?.

?!? Methods

Code that does not belong to any DLL must have been dynamically generated.   If this code was generated by the .NET Runtime by compiling a .NET method, it should have been decoded by PerfMonitor.   However if you specified /NoRundown or the log file is otherwise incomplete, it is possible that the information necessary to decode the address has been lost.    More commonly, however, there are a number of 'anonymous' helper methods that are generated by the runtime, and since these have no name, there is not much to do except leave them as ?!?.    These helpers are typically uninteresting (they don't have much exclusive time), and can be ignored during analysis.  They typically occur at the boundary of managed and unmanaged code. 

module!? Methods

Code that was not generated at runtime is always part of the body of a DLL, and thus the DLL name can always be determined.   Precompiled managed code lives in NGEN images, which have '.ni' in their name, and the information to decode it should be in the ETL file PerfMonitor collected.    If you see unknown function names in modules that have .ni in their names, it implies that something went wrong with CLR rundown (see ?!? methods).  For unmanaged code (DLLs that do not have .ni in their names) the addresses need to be looked up in the symbolic information associated with that DLL.   This symbolic information is stored in program database (PDB) files, and resolving it for a large trace can be fairly expensive (10s of seconds or more).   Because of this PerfMonitor by default only fully decodes 'high value' DLLs, including

  1. Important OS DLLs (ntdll.dll, kernel32.dll, ntoskrnl.dll, ntkrnlpa.dll)
  2. Important .NET Runtime DLLs (mscorwks.dll, clr.dll, corclr.dll)

This is a good set of defaults for most managed code development.   However if your performance problem involves unmanaged code, you will want other DLLs decoded.   Currently the only way to do this is by specifying /SymbolsForDlls:dll1,dll2 ...  when the data is analyzed, where the DLL names do NOT include a file name extension or path.  (TODO: Improve the experience.)    The first time you use the /SymbolsForDlls qualifier you will also want to use /PrimeSymbols, as explained below.

Symbol servers and Symbol Priming

When a trace is opened for analysis, PerfMonitor does the symbolic lookup for unmanaged DLLs.    Click the 'log' button in the lower right corner to see detailed diagnostic information about symbol lookup.    An important part of this lookup is finding the program database (PDB) files.    Like most diagnostic tools, PerfMonitor uses the _NT_SYMBOL_PATH environment variable as a semicolon-separated list of places to look for PDB files.   If you have unmanaged executables that you built, this environment variable should include paths to the PDBs that your compiler generated.  

The _NT_SYMBOL_PATH variable can use the special syntax SRV*cacheDir*symbolServer, where cacheDir is a directory on the local machine and symbolServer is a network or web path.    This directs PerfMonitor to look up the PDB on the remote symbol server, copy the PDB to cacheDir, and then decode the symbols.    If the PDB is archived in a symbol server, looking up the PDB is easy (you point at the symbol server), but it can also be VERY slow (10s of seconds or more, because of network latencies).    The most important use of symbol servers is to get the PDBs for the Windows operating system DLLs. 
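For example, a typical setting that combines a local build output directory (C:\MyApp\bin is a hypothetical path) with the Microsoft public symbol server:

```
set _NT_SYMBOL_PATH=C:\MyApp\bin;SRV*%TEMP%\symbols*http://msdl.microsoft.com/download/symbols
```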

Because symbol servers can be so slow, and often you don't need the extra symbol information anyway (e.g. if you are doing a pure managed investigation), by default PerfMonitor does NOT look up symbols in NON-LOCAL locations (it will look in all machine-local locations, including symbol server cache locations).    However this does NOT work well the FIRST time.  The FIRST time you must use the /PrimeSymbols option when launching PerfMonitor (TODO: improve experience).   This tells PerfMonitor to take the time to look up the PDBs on symbol servers (and prime the local symbol server cache).   After doing this once, you should not have to do it again (until OS DLLs get updated). 

Default Symbol Path

By far, the most common unmanaged DLLs of interest are the DLLs that Microsoft ships as part of the operating system.    Thus if you don't specify a _NT_SYMBOL_PATH PerfMonitor uses the following 'standard' one
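Based on the description that follows, that default path is:

```
SRV*%TEMP%\symbols*http://msdl.microsoft.com/download/symbols
```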

This says to look up PDBs at the standard Microsoft symbol server http://msdl.microsoft.com/download/symbols and cache them locally in %TEMP%\symbols.   Thus by default you can always find the PDBs for standard Microsoft DLLs. 

Summary

Thus typically all you need to get good symbols is

  1. Run PerfMonitor /PrimeSymbols once when doing your first analysis.  By default this will fetch the PDBs for the important OS and CLR DLLs from the default Microsoft symbol server.

  2. Afterward just run PerfMonitor normally.   You will get the cached symbols and lookup will be (reasonably) fast (because it finds them locally on your computer).

If you are investigating performance problems of unmanaged DLLs or EXEs that did not come from Microsoft (e.g. you built them yourself), you have to set _NT_SYMBOL_PATH to include the location of these PDBs before launching PerfMonitor. 


TroubleShooting

Here are some tips for troubleshooting common issues.

TroubleShooting Symbols: (X!?  or ?!? in the display)

Because of the expense of looking up symbols for unmanaged DLLs, PerfMonitor requires a small amount of work by the user to decode them.   First, PerfMonitor by default only decodes a handful of important operating system DLLs; you must use /SymbolsForDlls:dll1,dll2 to specify more.   Second, even for these important DLLs PerfMonitor NEVER tries to find symbols except on the local machine unless the /PrimeSymbols option is given.     Thus this option must be given the FIRST time you use PerfMonitor and wish to have symbols.   Finally, if the DLL was not shipped by Microsoft, the default symbol search path is almost certain NOT to work; you need to tell PerfMonitor where to find symbols by setting the _NT_SYMBOL_PATH variable. 

Thus if this is the first time you have used PerfMonitor, the symbols are important to your analysis, and the DLLs were created by Microsoft, then you can
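A sketch of such a command (runAnalyze and /PrimeSymbols are named in this guide; the exact combination is an assumption, and YourApp.exe is a placeholder):

```
perfMonitor /PrimeSymbols runAnalyze YourApp.exe
```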

If that does not work, look carefully at the *.convertlog.txt file that is generated, and see Symbol Resolution for more complete information.

'BROKEN' Stack Frame in Trace.  

When a sample is taken, the ETW system attempts to take a stack trace.    For a variety of reasons this can fail before a complete stack is taken.    PerfMonitor uses the heuristic that all stacks should end in a frame in a particular OS DLL (ntdll), which is responsible for creating threads.   If a stack does not end there, PerfMonitor assumes that it is broken, and injects a pseudo-node called 'BROKEN' between the thread and the part of the stack that was fetched (at the very least it will have the address where the sample was taken).    Thus BROKEN stacks should always be direct children of some frame representing an OS thread.  

When the number of BROKEN stacks is small (say < 3% of total samples), they can simply be ignored.  This is the common case.   However the more broken stacks there are, the less useful a 'top-down' analysis (using the CallTree view) is, because some non-trivial fraction of the samples are not being placed in their proper place, giving you skewed results near the top of the stack.    A 'bottom-up' analysis (where you look first at the methods where samples occurred) is not affected by broken stacks (however, as that analysis moves 'up the stack', it can be affected).

Broken stacks occur for the following reasons

  1. In 32 bit processes, ETW relies on the compiler to mark the stack by emitting an 'EBP frame'.  When the compiler fails to do this and instead uses the EBP register for other purposes, it breaks the stack.   This should not happen for operating system code or for .NET Runtime code, but may occur for 3rd party code.
  2. In a 64 bit process, ETW relies on a different mechanism to walk the stack, in which the compiler generates 'unwind information'.    Currently this ETW mechanism does not work properly for dynamically generated code (as generated by the .NET runtime JIT compiler).  This causes stacks to be broken at the first JIT compiled method on the stack (you see the JIT compiled method, but no callers of that method).    This issue will be fixed in the future, but exists today. 
  3. Asynchronous activities.   Stack crawling is a 'best effort' service.   If the sample is taken at a time when it would be impossible to do logging safely, the OS simply skips it.   For example, if during stack crawling in the kernel the stack page is found to be swapped out to disk, stack crawling is simply aborted. 

Working around 64 bit stack breaks:

If you are profiling a 64 bit process there is a pretty good chance that you are being affected by scenario (2) above.    There are three work-arounds for broken stacks in that instance

  1. NGEN the application.   The failures occur at JIT compiled code.  If you NGEN the application, JIT compilation will not be necessary and the broken stacks will disappear.   To NGEN your application simply type 

    C:\Windows\Microsoft.NET\Framework\v2.0.50727\NGen install YourApp.exe

    You will have to repeat this every time your application is recompiled.
  2. Switch to 32 bit.   If your code is pure managed code, then it can run as either a 32 or a 64 bit process.  By switching to a 32 bit process, you avoid the problem.   This does not work if you took dependencies on native code that only exists for 64 bit.    You can convert your application to run 32 bit by using the CorFlags utility that comes as part of the .NET SDK.   It also comes as part of Visual Studio (open the VS command prompt).   To switch, simply type CorFlags /32bit+ YourApp.exe. You will have to repeat this every time your application is recompiled.
  3. Perform only a bottom-up analysis.   Even with many broken stacks, there is a lot of information in the profile, and a 'bottom-up' analysis is possible. 

Missing frames on stacks (Stack says A calls C, when in the source A calls B which calls C)

Missing stack frames are different from a broken stack because frames in the 'middle' of the stack are missing.   Typically only one or maybe two methods are missing.   There are three basic reasons for missing frames.

  1. Inlining.   If A calls B calls C, if B is very small it is not unusual for the compiler to have simply 'inlined' the body of B into the body of A.   In this case obviously B does not appear because in a very real sense B does not exist at the native code level.
  2. Tail-calling.   If the last thing method B does before returning is to call C, the compiler can do another optimization.   Instead of calling C and then returning to A, B can simply jump to C.    When C returns it will simply return to A directly.    From a profiler's point of view, when the CPU is executing C, B has been removed from the stack and thus does not show up in the trace.   Note also that B does not need to be small for this optimization to be beneficial.  The only requirement is that calling C is the last thing that B does.  
  3. EBP frame optimization.  In 32 bit processes (64 bit processes don't use EBP frames), the profiler relies on the compiler to 'mark' the call by emitting code at the beginning of the method called the EBP frame.    If the compiler does not set up a frame at all and uses the EBP register for its own use, the result is a broken stack.   However even when the compiler is aware of the need to generate EBP frames, there is overhead in doing so (2 instructions at the beginning and end of the method).   For small methods (too big to inline, but still small), the compiler can opt to simply omit the generation of the frame (but leave EBP untouched).   This results in a missing frame.   It should be noted that the EBP frame a method sets up marks the CALLER, not itself.   Thus if method B seems to be missing, it is not because B omitted its EBP frame but because method C did.    Thus this kind of frame omission happens when method C is small, not when B is small. 

While missing frames can be confusing and thus slow down analysis, they rarely truly block it.   Missing frames are the price paid for profiling unmodified code in a very low overhead way.    


Frequently Asked Questions (FAQ)


Appendix

Source Code for the CPUTest Example

/// This class is simply an example of a CPU bound computation.   It does it recursively to 
/// make the example more interesting.  The main entry point is the RecSpin routine.  
class Spinner
{
    public static int aStatic = 0;

    // Spin for 'timeSec' seconds.   We do only 1 second in this
    // method, doing the rest in the helper.   
    public static void RecSpin(int timeSec)
    {
        if (timeSec <= 0)
            return;
        --timeSec;
        SpinForASecond();
        RecSpinHelper(timeSec);
    }

    // RecSpinHelper is a clone of RecSpin.   It is repeated 
    // to simulate mutual recursion (more interesting example)
    static void RecSpinHelper(int timeSec)
    {
        if (timeSec <= 0)
            return;
        --timeSec;
        SpinForASecond();
        RecSpin(timeSec);
    }

    // SpinForASecond repeatedly calls DateTime.Now for
    // 1 second.  It also does some work of its own in this
    // method so we get some exclusive time to look at.  
    static void SpinForASecond()
    {
        DateTime start = DateTime.Now;
        for (; ; )
        {
            if ((DateTime.Now - start).TotalSeconds > 1)
                break;

            // Do some work in this routine as well.   
            for (int i = 0; i < 10; i++)
                aStatic += i;
        }
    }
}