McPAT: Multicore Power, Area, and Timing
Current version 0.8Beta
===============================

McPAT is an architectural modeling tool for chip multiprocessors (CMPs).
The main focus of McPAT is accurate power and area
modeling, with a target clock rate used as a design constraint.
McPAT performs an automatic, extensive search to find optimal designs
that satisfy the target clock frequency.

For complete documentation of McPAT, please refer to the McPAT 1.0
technical report and the following paper,
"McPAT: An Integrated Power, Area, and Timing Modeling
 Framework for Multicore and Manycore Architectures",
which appeared in MICRO 2009. Please cite the paper if you use
McPAT in your work. The bibtex entry is provided below for your convenience.

 @inproceedings{mcpat:micro,
  author = {Sheng Li and Jung Ho Ahn and Richard D. Strong and Jay B. Brockman and Dean M. Tullsen and Norman P. Jouppi},
  title =  "{McPAT: An Integrated Power, Area, and Timing Modeling Framework for Multicore and Manycore Architectures}",
  booktitle = {MICRO 42: Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture},
  year = {2009},
  pages = {469--480},
 }

McPAT is currently in its beta release.

List of features of beta release
===============================
The following features are supported by the tool.

* Power, area, and timing models for CMPs with:
    In-order cores, both single- and multithreaded
    OOO (out-of-order) cores, both single- and multithreaded
    Shared/coherent caches with directory hardware,
      including directory cache, shadowed tag directory,
      and static bank-mapped tag directory
    Network-on-Chip
    On-chip memory controllers

* Internal models are based on real modern processors:
    In-order models are based on the Sun Niagara family
    OOO models are based on Intel P6 for reservation-station-based
    OOO cores, and on Intel Netburst and
    Alpha 21264 for physical-register-file-based OOO cores.

* Leakage power modeling considers both sub-threshold leakage
  and gate leakage power. The impact of operating temperature
  on both kinds of leakage power is considered. Longer-channel devices,
  which can reduce leakage significantly with a modest performance
  penalty, are also modeled.

* McPAT supports an automatic, extensive search to find optimal designs
  that satisfy the target clock frequency. The timing constraints
  include both throughput and latency.

* Interconnect models with different delay, power, and area
  properties, as well as both aggressive and conservative
  interconnect projections for wire technologies.

* All process-specific values used by McPAT are obtained
  from ITRS. McPAT currently supports the 90nm, 65nm, 45nm,
  32nm, and 22nm technology nodes. At the 32nm and 22nm nodes, SOI
  and DG devices are used. After 45nm, Hi-K metal gates are used.

How to use the tool?
====================

McPAT takes input parameters from an XML-based interface
and then computes area and peak power.
Please note that the peak power is the absolute worst-case power,
which could be even higher than TDP.

1. Steps to run McPAT:
   -> define the target processor using inorder.xml or OOO.xml
   -> run the "mcpat" binary:
      ./mcpat -infile <*.xml>  -print_level < level of detailed output>
      ./mcpat -h (or mcpat --help) will show the quick help message.

   Rather than being hardwired to certain simulators, McPAT
   uses an XML-based interface to enable easy integration
   with various performance simulators. Our collaborator,
   Richard Strong, at the University of California, San Diego,
   designed an experimental parser for the M5 simulator, aimed at
   streamlining the integration of McPAT and M5. Please check the M5
   repository for the latest version of the parser.

2. Optimize:
   McPAT will try its best to satisfy the target clock rate.
   When it cannot find a valid solution, it issues warnings
   while still returning the solution closest to the timing
   constraints and calculating power based on it. The optimization
   leads to larger power/area numbers for higher target clock
   rates. McPAT also provides the option "-opt_for_clk" to turn
   this strict optimization for the timing constraint on
   ("-opt_for_clk 1") and off. When it is off, McPAT always optimizes
   components for ED^2P without worrying about meeting the
   target clock frequency. Turning it off reduces the computation time,
   which suits situations where the target clock rate
   is conservative.

3. The output:
   McPAT outputs results in a hierarchical manner. Increasing
   the "-print_level" will show detailed results inside each
   component. For each component, major parts are shown, and the associated
   pipeline registers/control logic are added into the total area/power of each
   component. In general, McPAT does not model the area/overhead of the pad
   frame used in a processor die.

4. How to use the XML interface for McPAT
   4.1 Set up the parameters
       Parameters of target designs need to be set in the *.xml file for
       entries tagged as "param". McPAT has very detailed parameter settings.
       Please remove a structure parameter from the file if you want
       to use its default value. Otherwise, the parameters in the XML file
       will override the default values.

   4.2 Pass the statistics
       There are two options to get the correct stats: a) the performance
       simulator captures all the stats in detail and passes them to McPAT;
       b) the performance simulator captures only partial stats and passes
       them to McPAT, and McPAT reasons about the complete stats using
       the partial information and the configuration. Therefore, there is
       some overlap among the stats.

   4.3 Interface XML file structures (PLEASE READ!)
       The XML is hierarchical from the processor level to the micro-architecture
       level. McPAT supports both heterogeneous and homogeneous manycore processors.

       1). For a heterogeneous processor setup, each component (core, NoC, cache,
       etc.) must have its own instantiations (core0, core1, ..., coreN).
       Each instantiation will have different parameters as well as its own stats.
       Thus, the XML file must have multiple "instantiations" of each type of
       heterogeneous component, and the corresponding hetero flags must be set
       in the XML file. The stats in the XML should then be the stats of "a" instantiation
       (e.g. "a" core). The reported runtime dynamic power is that of a single instantiation
       (e.g. "a" core). Since the stats for each instantiation (e.g. "a" core) may be different,
       we will see a whole list of instantiations (e.g. cores) with different dynamic power,
       and the total power is just the sum of them.

       2). For homogeneous processors, the same method as for heterogeneous processors can
       also be used by treating all homogeneous instantiations as heterogeneous.
       However, a preferred approach is to use a single representative for all
       the same components (e.g. core0 to represent all cores) and set the
       processor to have homogeneous components (e.g. <param name="homogeneous_cores"
       value="1"/>). Thus, the XML file only has one instantiation to represent
       all others with the same architectural parameters. The corresponding homo
       flags must be set in the XML file. The stats in the XML should then be
       the aggregated stats of all instantiations (e.g. aggregated stats
       of all cores). In the final results, McPAT will only report a single
       instantiation of each type of component, and the reported runtime dynamic power
       is the sum over all instantiations of the same type. This approach runs faster
       and uses much less memory.

5. Guide for integrating McPAT into performance simulators and bypassing the XML interface
   The detailed work flow of McPAT has two phases: the initialization phase and
   the computation phase. Specifically, in order to start the initialization phase, a
   user specifies static configurations, including parameters at all three levels,
   namely, the architectural, circuit, and technology levels. During the initialization
   phase, McPAT generates the internal chip representation using the configurations
   set by the user.
   The computation phase of McPAT is called by McPAT or the performance simulator
   during simulation to generate runtime power numbers. Before calling McPAT to
   compute runtime power numbers, the performance simulator needs to pass the
   statistics, namely, the activity factors of each individual component, to McPAT
   via the XML interface.
   The initialization phase is very time-consuming, since it repeats many
   times until valid configurations are found or the possible configurations are
   exhausted.
   To reduce the overhead, a user can let the simulator call McPAT
   directly for the computation phase and only call the initialization phase once at the
   beginning of the simulation. In this case, the XML interface file is bypassed;
   please refer to processor.cc to see how the two phases are called.

6. Sample input files:
   This package provides sample XML files for validating target processors. Please find the
   enclosed Niagara1.xml (for the Sun Niagara1 processor), Niagara2.xml (for the Sun Niagara2
   processor), Alpha21364.xml (for the Alpha21364 processor), and Xeon.xml (for the Intel
   Xeon Tulsa processor).

   Special instructions for using Xeon.xml:
   McPAT uses ITRS device types including HP, LSTP, and LOP. Although most
   designs follow ITRS projections, there are designs with special technologies.
   For example, the 65nm Xeon Tulsa processor uses 1.25V rather than 1.1V
   for the core voltage domain, which changes the threshold voltage,
   leakage current density, saturation current, etc., in addition to the
   supply voltage. We use MASTAR to match the special technology used in the Xeon
   core domain. Therefore, in order to generate accurate results for the Xeon
   Tulsa cores, users need to run "make TAR=mcpatXeonCore" and use the generated
   special executable. The L3 cache and buses must be computed using standard
   ITRS technology.


====================
McPAT is in its beginning stage. We are still improving
the tool and refining the code. Please come back to its website
for newer versions. If you have any comments,
questions, or suggestions, please write to us.

Version history and roadmap

McPAT Alpha:      released Sep. 2009  Experimental release
McPAT Beta (0.6): released Nov. 2009  New code base and technology base
McPAT Beta (0.7): released May 2010   Added various new models,
                  including long-channel devices and a buses model, together
                  with bug fixes and extensive code optimization to reduce
                  memory usage.
McPAT Beta (0.8): released Aug. 2010  Added various new models,
                  including on-chip 10Gb ethernet units, PCIe, and flash controllers.
Next major release:
McPAT 1.0:        including advanced power-saving states

Future releases may include the modeling of embedded low-power
processors as well as vector processors and GPGPUs.


Sheng Li
sheng.li@hp.com
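
Appendix: scripting the command line (illustrative sketch)
The command line described under "Steps to run McPAT" can be scripted when
many XML files need to be evaluated. The following is a minimal Python sketch,
not part of the McPAT package. The helper names build_mcpat_command and
run_mcpat are hypothetical. It uses only the flags named in this README
(-infile, -print_level, -opt_for_clk) and assumes that "-opt_for_clk 0" turns
the timing optimization off, that the binary sits in the current directory,
and that the report is written to standard output. The print level of 2 is an
arbitrary example value.

```python
import subprocess
from typing import List, Optional


def build_mcpat_command(
    xml_path: str,
    print_level: int,
    opt_for_clk: Optional[bool] = None,
    binary: str = "./mcpat",
) -> List[str]:
    """Assemble the argument list for one McPAT run.

    Only flags named in the README are used. Passing "0" to -opt_for_clk
    to switch the optimization off is an assumption.
    """
    cmd = [binary, "-infile", xml_path, "-print_level", str(print_level)]
    if opt_for_clk is not None:
        # Leave the flag out entirely to keep McPAT's default behaviour.
        cmd += ["-opt_for_clk", "1" if opt_for_clk else "0"]
    return cmd


def run_mcpat(cmd: List[str]) -> str:
    """Run McPAT and return whatever it printed to standard output."""
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout


if __name__ == "__main__":
    # Niagara1.xml is one of the sample inputs listed under "Sample input files".
    print(" ".join(build_mcpat_command("Niagara1.xml", 2, opt_for_clk=True)))
```

Building the argument list separately from running it keeps the command easy
to inspect or log before a long batch of runs is started.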