#
11329:82bb3ee706b3 |
|
06-Feb-2016 |
Alexandru Dutu <alexandru.dutu@amd.com> |
x86: revamp cmpxchg8b/cmpxchg16b implementation
The previous implementation did a pair of nested RMW operations, which isn't compatible with the way that locked RMW operations are implemented in the cache models. It was convenient though in that it didn't require any new micro-ops, and supported cmpxchg16b using 64-bit memory ops. It also worked in AtomicSimpleCPU where atomicity was guaranteed by the core and not by the memory system. It did not work with timing CPU models though.
This new implementation defines new 'split' load and store micro-ops which allow a single memory operation to use a pair of registers as the source or destination, then uses a single ldsplit/stsplit RMW pair to implement cmpxchg. This patch requires support for 128-bit memory accesses in the ISA (added via a separate patch) to support cmpxchg16b.
|
#
11328:9512d2e25f14 |
|
06-Feb-2016 |
Steve Reinhardt <steve.reinhardt@amd.com> |
arch, x86: add support for arrays as memory operands
Although the cache models support wider accesses, the ISA descriptions assume that (for the most part) memory operands are integer types, which makes it difficult to define instructions that do memory accesses larger than 64 bits.
This patch adds some generic support for memory operands that are arrays of uint64_t, and specifically a 'u2qw' operand type for x86 that is an array of 2 uint64_ts (128 bits). This support is unused at this point, but will be needed shortly for cmpxchg16b. Ideally the 128-bit SSE memory accesses will also be rewritten to use this support.
Support for 128-bit accesses could also have been added using the gcc __int128_t extension, which would have been less disruptive. However, although clang also supports __int128_t, it's still non-standard. Also, more importantly, this approach creates a path to defining 256- and 512-byte operands as well, which will be useful for eventual AVX support.
|
#
9921:ee049bfce978 |
|
15-Oct-2013 |
Yasuko Eckert <yasuko.eckert@amd.com> |
arch/x86: add support for explicit CC register file
Convert condition code registers from being specialized ("pseudo") integer registers to using the recently added CC register class.
Nilay Vaish also contributed to this patch.
|
#
9582:0632d2d1575c |
|
11-Mar-2013 |
Nilay Vaish <nilay@cs.wisc.edu> |
x86: implement some of the x87 instructions This patch implements ftan, fprem, fyl2x, fld* floating-point instructions.
|
#
9471:4193ed60eed7 |
|
15-Jan-2013 |
Nilay Vaish <nilay@cs.wisc.edu> |
x86: implements emms instruction
|
#
9470:68f7e0bcf4aa |
|
15-Jan-2013 |
Nilay Vaish <nilay@cs.wisc.edu> |
x86: implement fabs, fchs instructions
|
#
9212:dc386ccc1db9 |
|
11-Sep-2012 |
Nilay Vaish <nilay@cs.wisc.edu> |
X86: make use of register predication The patch introduces two predicates for condition code registers -- one tests if a register needs to be read, the other tests whether a register needs to be written to. These predicates are evaluated twice -- during construction of the microop and during its execution. Register reads and writes are elided depending on how the predicates evaluate.
|
#
9211:46c3a74952ec |
|
11-Sep-2012 |
Nilay Vaish <nilay@cs.wisc.edu> |
x86: Add a separate register for D flag bit The D flag bit is part of the cc flag bit register currently. But since it is not being used any where in the implementation, it creates an unnecessary dependency. Hence, it is being moved to a separate register.
|
#
9010:7891b96e1526 |
|
22-May-2012 |
Nilay Vaish <nilay@cs.wisc.edu> |
X86: Split Condition Code register This patch moves the ECF and EZF bits to individual registers (ecfBit and ezfBit) and the CF and OF bits to cfofFlag registers. This is being done so as to lower the read after write dependencies on the the condition code register. Ultimately we will have the following registers [ZAPS], [OF], [CF], [ECF], [EZF] and [DF]. Note that this is only one part of the solution for lowering the dependencies. The other part will check whether or not the condition code register needs to be actually read. This would be done through a separate patch.
|
#
8500:5bae9eee9482 |
|
14-Aug-2011 |
Gabe Black <gblack@eecs.umich.edu> |
X86: Use IsSquashAfter if an instruction could affect fetch translation.
Control register operands are set up so that writing to them is serialize after, serialize before, and non-speculative. These are probably overboard, but they should usually be safe. Unfortunately there are times when even these aren't enough. If an instruction modifies state that affects fetch, later serialized instructions which come after it might have already gone through fetch and decode by the time it commits. These instructions may have been translated incorrectly or interpretted incorrectly and need to be destroyed. This change modifies instructions which will or may have this behavior so that they use the IsSquashAfter flag when necessary.
|
#
8449:4be49ad47c74 |
|
05-Jul-2011 |
Gabe Black <gblack@eecs.umich.edu> |
ISA parser: Define operand types with a ctype directly.
|
#
7789:f455790bcd47 |
|
08-Dec-2010 |
Gabe Black <gblack@eecs.umich.edu> |
X86: Take advantage of new PCState syntax.
|
#
7720:65d338a8dba4 |
|
31-Oct-2010 |
Gabe Black <gblack@eecs.umich.edu> |
ISA,CPU,etc: Create an ISA defined PC type that abstracts out ISA behaviors.
This change is a low level and pervasive reorganization of how PCs are managed in M5. Back when Alpha was the only ISA, there were only 2 PCs to worry about, the PC and the NPC, and the lsb of the PC signaled whether or not you were in PAL mode. As other ISAs were added, we had to add an NNPC, micro PC and next micropc, x86 and ARM introduced variable length instruction sets, and ARM started to keep track of mode bits in the PC. Each CPU model handled PCs in its own custom way that needed to be updated individually to handle the new dimensions of variability, or, in the case of ARMs mode-bit-in-the-pc hack, the complexity could be hidden in the ISA at the ISA implementation's expense. Areas like the branch predictor hadn't been updated to handle branch delay slots or micropcs, and it turns out that had introduced a significant (10s of percent) performance bug in SPARC and to a lesser extend MIPS. Rather than perpetuate the problem by reworking O3 again to handle the PC features needed by x86, this change was introduced to rework PC handling in a more modular, transparent, and hopefully efficient way.
PC type:
Rather than having the superset of all possible elements of PC state declared in each of the CPU models, each ISA defines its own PCState type which has exactly the elements it needs. A cross product of canned PCState classes are defined in the new "generic" ISA directory for ISAs with/without delay slots and microcode. These are either typedef-ed or subclassed by each ISA. To read or write this structure through a *Context, you use the new pcState() accessor which reads or writes depending on whether it has an argument. If you just want the address of the current or next instruction or the current micro PC, you can get those through read-only accessors on either the PCState type or the *Contexts. These are instAddr(), nextInstAddr(), and microPC(). Note the move away from readPC. That name is ambiguous since it's not clear whether or not it should be the actual address to fetch from, or if it should have extra bits in it like the PAL mode bit. Each class is free to define its own functions to get at whatever values it needs however it needs to to be used in ISA specific code. Eventually Alpha's PAL mode bit could be moved out of the PC and into a separate field like ARM.
These types can be reset to a particular pc (where npc = pc + sizeof(MachInst), nnpc = npc + sizeof(MachInst), upc = 0, nupc = 1 as appropriate), printed, serialized, and compared. There is a branching() function which encapsulates code in the CPU models that checked if an instruction branched or not. Exactly what that means in the context of branch delay slots which can skip an instruction when not taken is ambiguous, and ideally this function and its uses can be eliminated. PCStates also generally know how to advance themselves in various ways depending on if they point at an instruction, a microop, or the last microop of a macroop. More on that later.
Ideally, accessing all the PCs at once when setting them will improve performance of M5 even though more data needs to be moved around. This is because often all the PCs need to be manipulated together, and by getting them all at once you avoid multiple function calls. Also, the PCs of a particular thread will have spatial locality in the cache. Previously they were grouped by element in arrays which spread out accesses.
Advancing the PC:
The PCs were previously managed entirely by the CPU which had to know about PC semantics, try to figure out which dimension to increment the PC in, what to set NPC/NNPC, etc. These decisions are best left to the ISA in conjunction with the PC type itself. Because most of the information about how to increment the PC (mainly what type of instruction it refers to) is contained in the instruction object, a new advancePC virtual function was added to the StaticInst class. Subclasses provide an implementation that moves around the right element of the PC with a minimal amount of decision making. In ISAs like Alpha, the instructions always simply assign NPC to PC without having to worry about micropcs, nnpcs, etc. The added cost of a virtual function call should be outweighed by not having to figure out as much about what to do with the PCs and mucking around with the extra elements.
One drawback of making the StaticInsts advance the PC is that you have to actually have one to advance the PC. This would, superficially, seem to require decoding an instruction before fetch could advance. This is, as far as I can tell, realistic. fetch would advance through memory addresses, not PCs, perhaps predicting new memory addresses using existing ones. More sophisticated decisions about control flow would be made later on, after the instruction was decoded, and handed back to fetch. If branching needs to happen, some amount of decoding needs to happen to see that it's a branch, what the target is, etc. This could get a little more complicated if that gets done by the predecoder, but I'm choosing to ignore that for now.
Variable length instructions:
To handle variable length instructions in x86 and ARM, the predecoder now takes in the current PC by reference to the getExtMachInst function. It can modify the PC however it needs to (by setting NPC to be the PC + instruction length, for instance). This could be improved since the CPU doesn't know if the PC was modified and always has to write it back.
ISA parser:
To support the new API, all PC related operand types were removed from the parser and replaced with a PCState type. There are two warts on this implementation. First, as with all the other operand types, the PCState still has to have a valid operand type even though it doesn't use it. Second, using syntax like PCS.npc(target) doesn't work for two reasons, this looks like the syntax for operand type overriding, and the parser can't figure out if you're reading or writing. Instructions that use the PCS operand (which I've consistently called it) need to first read it into a local variable, manipulate it, and then write it back out.
Return address stack:
The return address stack needed a little extra help because, in the presence of branch delay slots, it has to merge together elements of the return PC and the call PC. To handle that, a buildRetPC utility function was added. There are basically only two versions in all the ISAs, but it didn't seem short enough to put into the generic ISA directory. Also, the branch predictor code in O3 and InOrder were adjusted so that they always store the PC of the actual call instruction in the RAS, not the next PC. If the call instruction is a microop, the next PC refers to the next microop in the same macroop which is probably not desirable. The buildRetPC function advances the PC intelligently to the next macroop (in an ISA specific way) so that that case works.
Change in stats:
There were no change in stats except in MIPS and SPARC in the O3 model. MIPS runs in about 9% fewer ticks. SPARC runs with 30%-50% fewer ticks, which could likely be improved further by setting call/return instruction flags and taking advantage of the RAS.
TODO:
Add != operators to the PCState classes, defined trivially to be !(a==b). Smooth out places where PCs are split apart, passed around, and put back together later. I think this might happen in SPARC's fault code. Add ISA specific constructors that allow setting PC elements without calling a bunch of accessors. Try to eliminate the need for the branching() function. Factor out Alpha's PAL mode pc bit into a separate flag field, and eliminate places where it's blindly masked out or tested in the PC.
|
#
7087:fb8d5786ff30 |
|
24-May-2010 |
Nathan Binkert <nate@binkert.org> |
copyright: Change HP copyright on x86 code to be more friendly
|
#
6479:b9ab1b56391b |
|
07-Aug-2009 |
Gabe Black <gblack@eecs.umich.edu> |
X86: Implement shift right/left double microops. This is my best guess as far as what these should do. Other existing microops use implicit registers, mul1s and mul1u for instance, so this should be ok. The microop that loads the implicit DoubleBits register would fall into one of the microop slots for moving to/from special registers.
|
#
6360:c3058964d06f |
|
17-Jul-2009 |
Gabe Black <gblack@eecs.umich.edu> |
X86: Tame the wilds of def operands.
|
#
5926:c182698e1ab3 |
|
25-Feb-2009 |
Gabe Black <gblack@eecs.umich.edu> |
X86: Add microops for reading/writing debug registers.
|
#
5839:4cc05b7f2a97 |
|
01-Feb-2009 |
Gabe Black <gblack@eecs.umich.edu> |
X86: Fix some incorrect register widths.
|
#
5789:46c548dbe620 |
|
07-Jan-2009 |
Gabe Black <gblack@eecs.umich.edu> |
X86: Hook in the M5 pseudo insts.
|
#
5682:6f1cab082ba7 |
|
13-Oct-2008 |
Gabe Black <gblack@eecs.umich.edu> |
X86: Add wrval/rdval microops for reading significant miscregs.
|
#
5659:f4b9c344d1ca |
|
12-Oct-2008 |
Gabe Black <gblack@eecs.umich.edu> |
X86: Implement CPUID with a magical function instead of microcode.
|
#
5429:52dbcf7f7328 |
|
12-Jun-2008 |
Gabe Black <gblack@eecs.umich.edu> |
X86: Keep handy values like the operating mode in one register.
|
#
5428:5a27fea50fee |
|
12-Jun-2008 |
Gabe Black <gblack@eecs.umich.edu> |
X86: Change what the microop chks does. Instead of computing the segment descriptor address, this now checks if a selector value/descriptor are legal for a particular purpose.
|
#
5426:0bdcc60ccc45 |
|
12-Jun-2008 |
Gabe Black <gblack@eecs.umich.edu> |
X86: Add microops and supporting code to manipulate the whole rflags register.
|
#
5409:0343cd06df4f |
|
12-Jun-2008 |
Gabe Black <gblack@eecs.umich.edu> |
X86: Add in some support for the tsc register.
|
#
5294:7222bdaed33b |
|
02-Dec-2007 |
Gabe Black <gblack@eecs.umich.edu> |
X86: Reorganize segmentation and implement segment selector movs.
|
#
5291:5d38610cff05 |
|
02-Dec-2007 |
Gabe Black <gblack@eecs.umich.edu> |
X86: Implement the lgdt instruction.
|
#
5290:7dc3e8ee0a22 |
|
02-Dec-2007 |
Gabe Black <gblack@eecs.umich.edu> |
X86: Implement wrbase and wrlimit for loading pseudo descriptors.
|
#
5289:ca5390e654b8 |
|
02-Dec-2007 |
Gabe Black <gblack@eecs.umich.edu> |
X86: Separate the effective seg base and the "hidden" seg base.
|
#
5246:21f29e99e021 |
|
13-Nov-2007 |
Gabe Black <gblack@eecs.umich.edu> |
X86: Make microcode use presegmentation RIPs and the rest of m5 use post segmentation RIPS.
|
#
5241:a6602acdd046 |
|
12-Nov-2007 |
Gabe Black <gblack@eecs.umich.edu> |
X86: Implement the wrcr microop which writes a control register, and some control register work.
|
#
5083:49559a8060e8 |
|
19-Sep-2007 |
Gabe Black <gblack@eecs.umich.edu> |
X86: Move the fp microops to their own file with their own base classes in C++ and python.
|
#
5082:82dd253231c8 |
|
19-Sep-2007 |
Gabe Black <gblack@eecs.umich.edu> |
X86: Put in the foundation for x87 stack based fp registers.
|
#
5075:4ae876c5037d |
|
13-Sep-2007 |
Gabe Black <gblack@eecs.umich.edu> |
X86: Total overhaul of the division instructions and microops.
|
#
5063:8eb72b1bd3c6 |
|
06-Sep-2007 |
Gabe Black <gblack@eecs.umich.edu> |
X86: Rework the multiplication microops so that they work like they would in the patent.
|
#
5026:46dd8d55f6c9 |
|
29-Aug-2007 |
Gabe Black <gblack@eecs.umich.edu> |
X86: Add operands to handle floating point registers.
|
#
5025:5c264911b7a9 |
|
29-Aug-2007 |
Gabe Black <gblack@eecs.umich.edu> |
X86: Flesh out register indexing constants.
|
#
4950:f5f19784acf1 |
|
07-Aug-2007 |
Gabe Black <gblack@eecs.umich.edu> |
X86: Make a microcode branch microop. Also some touch up for ruflag.
|
#
4863:b6dacc9a39ff |
|
04-Aug-2007 |
Gabe Black <gblack@eecs.umich.edu> |
X86: Start implementing segmentation support. Make instructions observe segment prefixes, default segment rules, segment base addresses. Also fix some microcode and add sib and riprel "keywords" to the x86 specialization of the microassembler.
|
#
4816:13391cf96e9c |
|
30-Jul-2007 |
Gabe Black <gblack@eecs.umich.edu> |
X86: Take into account the regular registers and the microcode registers when decided whether or not to fold.
|
#
4805:cc9a5798e4d1 |
|
30-Jul-2007 |
Gabe Black <gblack@eecs.umich.edu> |
Make the register indices use the appropriate "fold" bit.
|
#
4712:79b4c64296ce |
|
19-Jul-2007 |
Gabe Black <gblack@eecs.umich.edu> |
x86 fixes Make the emulation environment consider the rex prefix. Implement and hook in forms of j, jmp, cmp, syscall, movzx Added a format for an instruction to carry a call to the SE mode syscalls system Made memory instructions which refer to the rip do so directly Made the operand size overridable in the microassembly Made the "ext" field of register operations 16 bits to hold a sparse encoding of flags to set or conditions to predicate on Added an explicit "rax" operand for the syscall format Implemented syscall returns.
|
#
4687:db7ca06d6e6a |
|
17-Jul-2007 |
Gabe Black <gblack@eecs.umich.edu> |
Add in operand which holds the condition code bits of the flag register.
|
#
4587:2c9a2534a489 |
|
19-Jun-2007 |
Gabe Black <gblack@eecs.umich.edu> |
Get rid of the immediate and displacement components of the EmulEnv struct and use them directly out of the instruction. The extra copies are conceptually realistic but are just innefficient as implemented. Also don't use the zeroeth microcode register for general storage since it's now the zero register, and implement a load and a store microops.
|
#
4519:f8da6b45573f |
|
04-Jun-2007 |
Gabe Black <gblack@eecs.umich.edu> |
Reworking x86's microcode system. This is a work in progress, and X86 doesn't compile.
src/arch/x86/isa/decoder/one_byte_opcodes.isa: src/arch/x86/isa/macroop.isa: src/arch/x86/isa/main.isa: src/arch/x86/isa/microasm.isa: src/arch/x86/isa/microops/base.isa: src/arch/x86/isa/microops/microops.isa: src/arch/x86/isa/operands.isa: src/arch/x86/isa/microops/regop.isa: src/arch/x86/isa/microops/specop.isa: Reworking x86's microcode system
|
#
4338:24d31b35bcf9 |
|
04-Apr-2007 |
Gabe Black <gblack@eecs.umich.edu> |
The process of going from an instruction definition to an instruction to be returned by the decoder has been fleshed out more. The following steps describe how an instruction implementation becomes a StaticInst.
1. Microops are created. These are StaticInsts use templates to provide a basic form of polymorphism without having to make the microassembler smarter. 2. An instruction class is created which has a "templated" microcode program as it's docstring. The template parameters are refernced with ^ following by a number. 3. An instruction in the decoder references an instruction template using it's mnemonic. The parameters to it's format end up replacing the placeholders. These parameters describe a source for an operand which could be memory, a register, or an immediate. It it's a register, the register index is used. If it's memory, eventually a load/store will be pre/postpended to the instruction template and it's destination register will be used in place of the ^. If it's an immediate, the immediate is used. Some operand types, specifically those that come from the ModRM byte, need to be decoded further into memory vs. register versions. This is accomplished by making the decode_block text for these instructions another case statement based off ModRM. 4. Once all of the template parameters have been handled, the instruction goes throw the microcode assembler which resolves labels and creates a list of python op objects. If an operand is a register, it uses a % prefix, an immediate uses $, and a label uses @. If the operand is just letters, numbers, and underscores, it can appear immediately after the prefix. If it's not, it can be encolsed in non nested {}s. 5. If there is a single "op" object (which corresponds to a single microop) the decoder is set up to return it directly. If not, a macroop wrapper is created around it.
In the future, I'm considering seperating the operand type specialization from the template substitution step. A problem this introduces is that either the template arguments need to be kept around for the specialization step, or they need to be re-extracted. Re-extraction might be the way to go so that the operand formats can be coded directly into the micro assembler template without having to pass them in as parameters. I don't know if that's actually useful, though.
src/arch/x86/isa/decoder/one_byte_opcodes.isa: src/arch/x86/isa/microasm.isa: src/arch/x86/isa/microops/microops.isa: src/arch/x86/isa/operands.isa: src/arch/x86/isa/microops/base.isa: Implemented polymorphic microops and changed around the microcode assembler syntax.
|
#
4298:a92aab35e34e |
|
29-Mar-2007 |
Gabe Black <gblack@eecs.umich.edu> |
Add code to generate register and immediate based integer op microop classes.
|
#
4279:acc38276ca1d |
|
21-Mar-2007 |
Gabe Black <gblack@eecs.umich.edu> |
Add a junk operand. With no operands, the parser breaks.
|
#
4158:a3fb9e29c6ce |
|
05-Mar-2007 |
Gabe Black <gblack@eecs.umich.edu> |
Stub decoder. This is probably even farther from finished than it looks...
|