The keywords you should probably look up are CISC, RISC and superscalar architecture.
CISC
In a CISC architecture (x86, 68000, VAX), one instruction is powerful, but it takes multiple cycles to process.
In older architectures the number of cycles was fixed; nowadays the number of cycles per instruction usually depends on various factors (cache hit/miss, branch prediction, etc.). There are tables to look that stuff up. Often there are also facilities to actually measure how many cycles a certain instruction takes under certain circumstances (see performance counters).
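If you just want a quick measurement, on Linux the perf tool reads those counters; for example (the program name here is made up), the following prints cycle and instruction counts, from which the average cycles per instruction follows:

    perf stat ./myprogram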
If you are interested in the details for Intel, the Intel 64 and IA-32 Architectures Optimization Reference Manual is a very good read.
RISC
In a RISC architecture (ARM, PowerPC, SPARC), each instruction is very simple and usually takes only a few cycles (often just one).
Superscalar
But regardless of CISC or RISC, there is also the superscalar architecture.
The CPU is not processing one instruction after another but is working on many instructions simultaneously, very much like an assembly line.
The consequence is: if you simply look up the cycles for every instruction of your program and then add them all up, you will end up with a number that is way too high. Suppose you have a single-core RISC CPU. The time to process a single instruction can never be less than the time of one cycle, but the overall throughput may well be several instructions per cycle.
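To picture that, here is a toy two-wide pipeline with stages fetch (F), decode (D), and execute (E); the stage names and width are illustrative, not any specific CPU. Every instruction has a latency of 3 cycles, yet two instructions complete per cycle once the pipeline is full:

cycle:     1   2   3   4   5
instr 1:   F   D   E
instr 2:   F   D   E
instr 3:       F   D   E
instr 4:       F   D   E
instr 5:           F   D   E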
"I want to know how this one cycle single carry the instruction bits."
The wording of this question is a little hard to understand. Do you mean to ask, "How does the CPU receive instructions (1001) with a single clock line?"
It doesn't. A clock signal always looks like this (4 cycles shown):
+--+  +--+  +--+  +--+
|  |  |  |  |  |  |  |
+  +--+  +--+  +--+  +--
It's a metronome. It doesn't carry any information other than timing; it keeps all parts of the CPU working at the same speed. There are lots of connections carrying signals inside a CPU. Signals take time to change (0 -> 1 or 1 -> 0); some change faster, some slower. Changes take place between rising edges (or falling edges, depending on circuit design). The CPU does the "next step" of the computation at every rising (or falling) edge; e.g. fetch, decode, and execute could take 3 cycles. It works this way because the rising (or falling) edges are when signals should have stabilized.
The CPU fetches instructions through other lines, like buses. Typically, the address of the next instruction is placed on the address bus, the instruction is then placed on the data bus, the CPU reads it from the data bus, decodes, executes it. The clock line is for transmitting timing information only, not "data" information.
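As a very rough sketch of that sequence (in C, with made-up names; real buses are hardware lines, not variables), each step happens on a successive clock edge:

    #include <stdint.h>
    #include <stdio.h>

    static uint16_t address_bus;
    static uint8_t  data_bus;
    static uint8_t  memory[65536] = { 0x09 };  /* pretend 0x09 is our instruction */

    /* One instruction fetch, as it might unfold over successive clock edges. */
    static uint8_t fetch(uint16_t pc)
    {
        address_bus = pc;                /* edge 1: CPU drives the address bus  */
        data_bus = memory[address_bus];  /* edge 2: memory drives the data bus  */
        return data_bus;                 /* edge 3: CPU latches the instruction */
    }

    int main(void)
    {
        printf("fetched instruction: 0x%02X\n", fetch(0));
        return 0;
    }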
The 2nd diagram you drew is what 1001 would look like if you were to transmit it serially, but that's a different topic.
Best Answer
The best source would be straight from the people who designed the extensions: Intel. The definitive references are the Intel® 64 and IA-32 Architectures Software Developer's Manuals; I would recommend that you download the combined Volumes 1 through 3C (first download link on that page). You may want to look at Vol. 1, Ch. 12 (Programming with SSE3, SSSE3, SSE4 and AESNI); to look up specific instructions, see Vol. 2, Ch. 3-4 (Appendix B is also helpful).

The instructions are only used if a program you're running actually uses them (i.e. the machine code corresponding to the various SSE4 instructions actually gets executed). To find out which instructions a program uses, you need a disassembler.
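For example, on Linux you could disassemble a binary and search it for a specific SSE4 instruction, such as pmulld (an SSE4.1 packed multiply; the program name here is made up):

    objdump -d ./myprogram | grep pmulld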
You may want to have a look at my answer to the question, "How does a CPU 'know' what commands and instructions actually mean?". When you write assembly code by hand, you pass the "human readable" assembly to an assembler to make an executable; the assembler turns the instructions into the actual 0s and 1s the processor executes.
Since your computer is Turing complete, it can compute any arbitrary mathematical function in software if it lacks the dedicated hardware to do so. Obviously, doing intense parallel or matrix mathematics in hardware is much faster than in software (which requires many loops of instructions), so the software path means a slow-down for the end user. Depending on how the program was created, it may require a particular instruction (e.g. one from the SSE4 set), although given that it's possible to do the same thing in software (and thus stay usable on more processors), this practice is rare.
As an example of the above, you may recall when processors first came out with the MMX instruction set extension. Let's say we want to add two 8-element, signed 8-bit vectors together (so each vector is 64 bits, equal to a single MMX register), or in other words, A + B = C. This could be done with a single MMX instruction called paddsb. For brevity, let's say our vectors are held at memory locations A, B, and C as well. Our equivalent assembly code would be along the lines of the sketch below.
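A minimal reconstruction of such a snippet (assuming NASM-style syntax, with A, B, and C as 8-byte data labels):

    movq   mm0, [A]    ; load 8 packed signed bytes from A into MMX register mm0
    paddsb mm0, [B]    ; packed signed-byte add with saturation: mm0 += B
    movq   [C], mm0    ; store the 8 result bytes to C
    emms               ; reset MMX state so later x87 FPU code works correctly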
However, this operation could also easily be done in software. For example, a plain C loop like the sketch below performs the equivalent operation (since a char is 8 bits wide). You can probably guess how the assembly code of such a loop would look; it's clear that it would contain significantly more instructions (as we now need a loop to handle adding the vectors), and thus we would need to perform that many more fetches. This is similar to how the word length of a processor affects a computer's performance (the purpose of MMX/SSEx is to provide both larger registers and the ability to perform the same instruction on multiple pieces of data).
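A minimal sketch of that software version (the function name is my own; note that unlike paddsb, the plain + wraps on overflow instead of saturating):

    /* Add two 8-element signed-byte vectors, element by element. */
    void add_vectors(const signed char *a, const signed char *b, signed char *c)
    {
        for (int i = 0; i < 8; i++)
            c[i] = a[i] + b[i];   /* one scalar add per element */
    }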