The keywords you should probably look up are CISC, RISC and superscalar architecture.
CISC
In a CISC architecture (x86, 68000, VAX), one instruction is powerful, but it takes multiple cycles to process.
In older architectures the number of cycles was fixed; nowadays the number of cycles per instruction usually depends on various factors (cache hit/miss, branch prediction, etc.). There are tables to look that stuff up. Often there are also facilities to actually measure how many cycles a certain instruction takes under certain circumstances (see performance counters).
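If you just want a quick measurement, on Linux the perf tool reads those counters; for example (the program name here is made up), the following prints cycle and instruction counts, from which the average cycles per instruction follows:

    perf stat ./myprogram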
If you are interested in the details for Intel, the Intel 64 and IA-32 Architectures Optimization Reference Manual is a very good read.
RISC
In a RISC architecture (ARM, PowerPC, SPARC), each instruction is very simple and usually takes only a few cycles (often just one).
Superscalar
But regardless of CISC or RISC, there is also the superscalar architecture.
The CPU is not processing one instruction after another but is working on many instructions simultaneously, very much like an assembly line.
The consequence is: if you simply look up the cycles for every instruction of your program and then add them all up, you will end up with a number that is way too high. Suppose you have a single-core RISC CPU. The time to process a single instruction can never be less than the time of one cycle, but the overall throughput may well be several instructions per cycle.
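To picture that, here is a toy two-wide pipeline with stages fetch (F), decode (D), and execute (E); the stage names and width are illustrative, not any specific CPU. Every instruction has a latency of 3 cycles, yet two instructions complete per cycle once the pipeline is full:

cycle:     1   2   3   4   5
instr 1:   F   D   E
instr 2:   F   D   E
instr 3:       F   D   E
instr 4:       F   D   E
instr 5:           F   D   E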
"I want to know how this one cycle single carry the instruction bits."
The wording of this question is a little hard to understand. Do you mean to ask, "How does the CPU receive instructions (1001) with a single clock line?"
It doesn't. A clock signal always looks like this (4 cycles shown):
+--+  +--+  +--+  +--+
|  |  |  |  |  |  |  |
+  +--+  +--+  +--+  +--
It's a metronome. It doesn't carry any information other than timing; it keeps all parts of the CPU working at the same speed. There are lots of connections carrying signals inside a CPU. Signals take time to change (0 -> 1 or 1 -> 0); some change faster, some slower. Changes take place between rising edges (or falling edges, depending on circuit design). The CPU does the "next step" of the computation at every rising (or falling) edge; e.g. fetch, decode, and execute could take 3 cycles. It works this way because the rising (or falling) edges are when signals should have stabilized.
The CPU fetches instructions through other lines, like buses. Typically, the address of the next instruction is placed on the address bus, the instruction is then placed on the data bus, the CPU reads it from the data bus, decodes, executes it. The clock line is for transmitting timing information only, not "data" information.
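As a very rough sketch of that sequence (in C, with made-up names; real buses are hardware lines, not variables), each step happens on a successive clock edge:

    #include <stdint.h>
    #include <stdio.h>

    static uint16_t address_bus;
    static uint8_t  data_bus;
    static uint8_t  memory[65536] = { 0x09 };  /* pretend 0x09 is our instruction */

    /* One instruction fetch, as it might unfold over successive clock edges. */
    static uint8_t fetch(uint16_t pc)
    {
        address_bus = pc;                /* edge 1: CPU drives the address bus  */
        data_bus = memory[address_bus];  /* edge 2: memory drives the data bus  */
        return data_bus;                 /* edge 3: CPU latches the instruction */
    }

    int main(void)
    {
        printf("fetched instruction: 0x%02X\n", fetch(0));
        return 0;
    }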
The 2nd diagram you drew is what 1001 would look like if you were to transmit it serially, but that's a different topic.
Best Answer
The best source would be straight from the people who designed the extensions: Intel. The definitive references are the Intel® 64 and IA-32 Architectures Software Developer's Manuals; I would recommend that you download the combined Volumes 1 through 3C (first download link on that page). You may want to look at Vol. 1, Ch. 12 (Programming with SSE3, SSSE3, SSE4 and AESNI); to look up specific instructions, see Vol. 2, Ch. 3-4 (Appendix B is also helpful).

The instructions are only used if a program you're running actually uses them (i.e. the machine code corresponding to the various SSE4 instructions actually gets executed). To find out which instructions a program uses, you need a disassembler.
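For example, on Linux you could disassemble a binary and search it for a specific SSE4 instruction, such as pmulld (an SSE4.1 packed multiply; the program name here is made up):

    objdump -d ./myprogram | grep pmulld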
You may want to have a look at my answer to the question, "How does a CPU 'know' what commands and instructions actually mean?". When you write assembly code by hand, you pass the "human readable" assembly to an assembler to make an executable; the assembler turns the instructions into the actual 0s and 1s the processor executes.
Since your computer is Turing complete, it can compute any arbitrary mathematical function in software if it lacks the dedicated hardware to do so. Obviously, doing intense parallel or matrix mathematics in hardware is much faster than in software (which requires many loops of instructions), so the software path means a slow-down for the end user. Depending on how the program was created, it may require a particular instruction (e.g. one from the SSE4 set), although given that it's possible to do the same thing in software (and thus stay usable on more processors), this practice is rare.
As an example of the above, you may recall when processors first came out with the MMX instruction set extension. Let's say we want to add two 8-element, signed 8-bit vectors together (so each vector is 64 bits, equal to a single MMX register), or in other words, A + B = C. This could be done with a single MMX instruction called paddsb. For brevity, let's say our vectors are held at memory locations A, B, and C as well. Our equivalent assembly code would be along the lines of the sketch below.
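A minimal reconstruction of such a snippet (assuming NASM-style syntax, with A, B, and C as 8-byte data labels):

    movq   mm0, [A]    ; load 8 packed signed bytes from A into MMX register mm0
    paddsb mm0, [B]    ; packed signed-byte add with saturation: mm0 += B
    movq   [C], mm0    ; store the 8 result bytes to C
    emms               ; reset MMX state so later x87 FPU code works correctly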
However, this operation could also easily be done in software. For example, a plain C loop like the sketch below performs the equivalent operation (since a char is 8 bits wide). You can probably guess how the assembly code of such a loop would look; it's clear that it would contain significantly more instructions (as we now need a loop to handle adding the vectors), and thus we would need to perform that many more fetches. This is similar to how the word length of a processor affects a computer's performance (the purpose of MMX/SSEx is to provide both larger registers and the ability to perform the same instruction on multiple pieces of data).
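A minimal sketch of that software version (the function name is my own; note that unlike paddsb, the plain + wraps on overflow instead of saturating):

    /* Add two 8-element signed-byte vectors, element by element. */
    void add_vectors(const signed char *a, const signed char *b, signed char *c)
    {
        for (int i = 0; i < 8; i++)
            c[i] = a[i] + b[i];   /* one scalar add per element */
    }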