
Talk:Branch predictor

From Wikipedia, the free encyclopedia

comment


It's not true that all pipelined processors have to do branch prediction -- the processor could just wait on the results of the conditional statement before resuming execution. I know of at least one modern microcontroller that does this. The second paragraph should be changed to say that branch prediction is just the best way to handle conditionals --Ryan Stone 18:08, 17 Nov 2004 (UTC)

Which microcontroller? Iain McClatchie 18:43, 17 Nov 2004 (UTC)

all kinds of branches, or just conditional branches?


My understanding is that MIPS has a branch delay slot for *every* branch, even unconditional branches.

Yes, that's true. That's because irrespective of whether the branch is conditional or not, fetching the instruction at the target address takes some time. Another way the problem could be solved is to watch out for unconditional jump instructions right at the IF (instrn. fetch) stage instead of waiting for the ID (decode) stage & start fetching at the target address, but this increases the time taken by the IF stage & thus increases the time taken per instruction.

Are these "93.5%" numbers *just* for conditional branches, or do they also include unconditional branches ?

For unconditional branches, there's no use of predicting whether the branch is taken or not. They're always taken! Hence as far as the branch prediction logic goes, these should be treated as normal instructions.
A minor point about "unconditional branch" in MIPS. The `B <target>' "pseudo"-instruction is actually converted by the assembler to `BEQ $0, $0, <target>'. Since register 0 is equal to register 0, the branch is "unconditional". A MIPS processor could just naïvely treat it as a conditional branch and do branch prediction, or it could decode the instruction specially & avoid branch prediction. Note that the "unconditional jump" instruction `J <target>' is a real unconditional instruction and not assembler-synthesised. BTW the difference between B and J is that for B (used for nearby branches) the target is specified as a relative offset from the current PC whereas for J (used for "long" jumps) the absolute address of the target is specified. -- Paddu 07:42, 20 Feb 2005 (UTC)
For unconditional branches, at least on x86 systems, the target still must be predicted. Things like a "switch" statement in a high level language always branch, but it is not clear to where.
In what way does the third paragraph of the article not address this point? Iain McClatchie 03:48, 15 October 2005 (UTC)[reply]
The article addresses it fine, but the "these should be treated as normal instructions" comment above is wrong. Branch prediction doesn't have to be done after the instruction is decoded; it can be done on the address alone, in which case you must predict whether the branch is taken despite the fact that it is unconditional.

some comments


I think it isn't made clear early enough in the section that the counters are indexed by the PC.

It might be good to move the local/global/combined predictor sections into a subsection of their own for 2-level predictors.

Good references on 2-level branch predictors:
A Comparison of Dynamic Branch Predictors that use Two Levels of Branch History, by Yeh and Patt, 1993 (I've found it online as p257-yeh.pdf)
Alternative Implementations of Two-Level Adaptive Branch Prediction, by Yeh and Patt, 1992 (p451-yeh.pdf)

A good (the original?) reference on global-history predictors: Improving the Accuracy of Dynamic Branch Prediction Using Branch Correlation, by Pan, So, and Rahmeh, 1992 (p76-pan.pdf)

A (the?) reference on gskew: Trading Conflict and Capacity Aliasing in Conditional Branch Predictors, by Michaud, Seznec and Uhlig, 1997 (p292-michaud.pdf)

I think a little too much importance is given to the comparison of the "speed" of the various predictor types (the article already mentions the need for overriding predictors, which arise because all interesting predictors are too slow).

It might be worth explaining more why branch prediction is so important. --CTho 02:20, 23 December 2005 (UTC)[reply]
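(On the indexed-by-the-PC point above, for readers landing here: the basic bimodal arrangement can be sketched in a few lines of C. The table size, the modulo hash and the 2-bit counters here are illustrative assumptions of mine, not a description of any particular CPU.)

```c
#include <stdint.h>

#define TABLE_SIZE 1024                /* illustrative size, not from any real CPU */
static uint8_t counters[TABLE_SIZE];   /* 2-bit saturating counters, values 0..3 */

/* The branch's own address (PC) selects which counter is consulted. */
static int predict_taken(uint32_t pc)
{
    return counters[pc % TABLE_SIZE] >= 2;
}

/* After the branch resolves, the same PC selects the counter to train. */
static void update(uint32_t pc, int taken)
{
    uint8_t *c = &counters[pc % TABLE_SIZE];
    if (taken && *c < 3)
        (*c)++;
    if (!taken && *c > 0)
        (*c)--;
}
```

Two branches at different addresses train different counters, which is the whole point of indexing by the PC.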


Do we need to mention a little bit about perceptron predictors?

question


The article is not clear. "They allow processors to fetch and execute instructions without waiting for a branch to be resolved." Does the processor then discard the executed instructions if the branch is not taken? Is this because the logic is a lot slower than execution? — Preceding unsigned comment added by 24.7.86.143 (talkcontribs) 05:47, 12 October 2006 (UTC)

Older processors fetched along the predicted path, but didn't execute. Many now do speculative execution where, yes, the results are discarded when that path isn't taken. Gah4 (talk) 20:41, 18 April 2018 (UTC)[reply]

Comments in relation to modern high performance MPUs


Great article, it's quite comprehensive. One area that receives less attention than it should is specialized branch predictors. For instance, many of the advances in Intel's newer microarchitectures are adding more and more prediction mechanisms targeted at particular situations:

Indirect branch predictor - also to be added in AMD's new Barcelona core; a mechanism to predict the target address of indirect branches, i.e. it picks from the BHT the most likely target address for a given branch
Loop detector - a mechanism to predict when a loop will be exited

Dkanter 19:38, 13 December 2006 (UTC)[reply]

Bimodal branch prediction using a single bit predictor


Actually one can do better than the article says with a single bit! This isn't something that can go in the main article, but you might be amused by this. I wrote about jump prediction in 1979, when saving a bit actually meant something. Changing the bit only with a certain probability improves the percentage of correct predictions. Changing the bit with a very low probability, for instance, changes the chance of a correct prediction for a repeat loop of size 2 from 0% to 50%, and for a repeat loop of 3 from 33% to 55%. -Dmcq 21:31, 4 April 2007 (UTC)[reply]

Just noticed the business about the Burroughs B4900 storing bits back into the instruction. I rejected that because the instructions of the ICL 2900, the machine I was interested in, were variable length and it was possible for a program with low security to execute code meant for a higher level - so it could potentially gain control of the machine by altering code if the second half of an instruction looked like a branch. It'd be interesting to know if the B4900 was susceptible to that sort of attack! Dmcq 22:18, 9 July 2007 (UTC)[reply]

There are some Burroughs machines where all the security is in software. The system will only execute object programs generated by the compiler, and the compiler checks that you don't do things that you aren't supposed to do. I don't remember which machines, though. Gah4 (talk) 06:51, 27 June 2016 (UTC)[reply]
The Burroughs large systems did that; the B4900 was one of the Burroughs Medium Systems, however, so it might not have supported that.
However, if the processor stores prediction data back into the instruction in hardware/microcode, and the hardware/microcode is known to only modify the branch prediction bits, and if the prediction bits are only hints that affect performance but don't change the branch target, that feature would be safe. Guy Harris (talk) 07:17, 27 June 2016 (UTC)[reply]

Probably did?


The first paragraph of the section "History" says that a CPU probably did use a branch predictor. This sounds like speculation or even wild guessing.--Henke37 21:09, 28 June 2007 (UTC)[reply]

http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=63652 would probably have the answer if anyone has access to it... --CTho 02:40, 29 June 2007 (UTC)[reply]
Yes, based on that article, the Vax 9000 had a branch predictor. "The Branch Prediction unit can handle two outstanding conditional branches but must stall the XBAR [instruction fetch crossbar] when it encounters a third." If somebody else wants to edit the main page and give that citation, go ahead. Daedae 20:27, 2 August 2007 (UTC)[reply]

Trivial prediction section heading


The article describes a form of static branch prediction as "Trivial prediction." I don't think this is a commonly used term; in fact I have never seen it used outside of this Wikipedia entry. I'm going to change that heading to Static Branch prediction, which I believe is the more commonly used term. Also, the existing section on static branch prediction is describing a specific type of static branch prediction, that is Backwards Taken/Forwards Not-taken, so I think that heading should also be changed. Wikiuser1239 (talk) 03:56, 11 December 2008 (UTC)[reply]

Some IBM systems do static prediction on BXH and BXLE (branch index high, and branch index less or equal). The instructions are commonly used at the beginning and end of for loops, such that BXH predicts not to branch, and BXLE predicts to branch. There is no requirement that they be used that way, though. Gah4 (talk) 20:47, 18 April 2018 (UTC)[reply]

what about GCC __builtin_expect( cond, exp ) ??


In C, there is a syntax to help with branch prediction. The command looks like:

if( __builtin_expect( x == y, true ) ) { foo(); }

Can someone please explain how this C keyword relates to CPU branch prediction? —Preceding unsigned comment added by 165.125.160.16 (talk) 07:02, 3 May 2009 (UTC)[reply]

It's not all that strongly related to CPU branch prediction. I don't know exactly what GCC does now, but the sort of things I've come across for this type of facility are:
It can move the less frequently used code out of line so it isn't included in a loop cache.
It can avoid loading variables outside of the code that are only used within the infrequent sections.
For small functions that don't call anything outside but might call something infrequently inside one can avoid setting up a stack frame unless it is needed.
As to CPU branch prediction it could set a static branch prediction or use knowledge of the default assumptions for how branches are dealt with. For instance if the code is moved out of line it might be moved to the end instead of before the beginning if branches forward are assumed not taken but branches back are assumed taken.
Anyway there's lots of good things it can do. Knowing the actual frequency from profiling can improve the decisions. Dmcq (talk) 11:58, 3 May 2009 (UTC)[reply]

Article rewritten, Oct. 2009


I have rewritten the article almost completely because it looked like a collection of random facts. Hopefully, this has solved many of the earlier issues on this talk page.

I have tried to improve the text to make it easier to understand for non-experts.

I have added an explanation of the 2-level predictor which was completely missing, even though most predictors are based on this principle today.

The history section still needs to be revised and updated. It may be a good idea to move the descriptions of older processors and obsolete prediction methods into the history section.

The word branch is ambiguous: It may mean (1) each of two alternative sections of code that an algorithm can choose between, or (2) the instruction that does the choosing. To avoid confusion, I have chosen to use the word branch only in the first meaning, and use conditional jump for the second meaning. Other Wikipedia articles make no such distinction and use branch in both meanings.

My drawings may need improvement.

Afog (talk) 17:59, 3 October 2009 (UTC)[reply]

Thanks, looks good and I like the diagrams. I've moved this to the end of the talk page as that's where new comments normally go on talk pages except for any special tags at the top. Dmcq (talk) 08:59, 4 October 2009 (UTC)[reply]

B4900 had a cache


It did, really! I participated in the design of the B4900 from concept through debug. I did not design the cache and do not remember the specs, but I know there was one. Anyway, I deleted "cacheless" from the description of the B4900. By the way, we referred to changing the opcode as "code modification", which was something that other programs written for Medium Systems did. I don't remember those details either, but it was a design issue. Scottmunroe (talk) 02:12, 4 March 2010 (UTC)[reply]

B4900 Cacheless?


I woke up in the middle of the night thinking, "It wasn't really a cache. It was just a fetch buffer." It was only for code, not data. I suppose the function that would differentiate between a cache and a buffer would be whether or not the logic recognized that the required code was already in the buffer and did not read main memory if it was. I don't remember. So, since I am not certain, I will put "cacheless" back in. Scottmunroe (talk) 23:52, 12 March 2010 (UTC)[reply]

You might be interested in #Bimodal branch prediction using a single bit predictor for a security-related reason for rejecting updating the instructions. Also, the ICL 2900 range could have read-only constants in the executable code, and jumping to those could change the constants if one implemented something like the B4900. On the B4900, was more privileged or shared code only executable by going via some protected vector, so this couldn't be done, or were they not worried by the possibility? Dmcq (talk) 08:07, 13 March 2010 (UTC)[reply]

The IBM 360/85 called it buffer instead of cache, but the idea was there. If it is associative, I would call it a cache. The 360/91 loop buffer stores sequential instruction words, but isn't associative, so probably not a cache. Gah4 (talk) 07:04, 27 June 2016 (UTC)[reply]

Article needs work


This page presents the details of branch prediction methods without first giving the reader a better idea of what branch prediction itself is. There is no discussion of how it relates to the concept of a hazard and specifically to branch hazards. For the methods that are detailed, the order does not make any sense either. The basic concepts need to be presented first, like say the branch history table / branch prediction buffer, and then we can present how they are used in real chips later on. These basic branch prediction concepts can be illustrated with scenarios on simpler idealized processors, like the one in the opening picture or maybe a little more complicated. Red Bulls Fan (talk) 01:09, 22 October 2010 (UTC)[reply]

Critics?


I'm coming over from http://stackoverflow.com/questions/11227809/why-is-processing-a-sorted-array-faster-than-an-unsorted-array where I read that prediction is sometimes very harmful, especially for unsorted arrays. So aren't there any specialists who understand this topic and see that branch prediction is an attempt to predict something that often can't be predicted, and therefore must be considered harmful? I won't say that prediction is always harmful, but in some cases it is, and therefore CPU designers should start to think more critically about what others implemented. Right now our processors do predictions no matter how much it costs, and this needs to change in the future. This angers me a bit, because it tells me that a lot of people aren't doing their work correctly. Actually, it's a shame. 178.197.235.31 (talk) 10:29, 24 May 2013 (UTC)[reply]

From what I understand, this isn't a case of prediction being harmful, i.e. that branch prediction slows things down. It is more that, for unpredictable input, branch prediction cannot do a very good job (for obvious reasons) and therefore things slow down. What the author of the question is seeing is not slowdown because of branch prediction, but the absence of branch prediction speedup. Also, most branches can be predicted reasonably well; in particular, the type of code used in performance-sensitive areas (such as tight loops) usually has simple patterns and can therefore be predicted efficiently.--TowerOfBricks (talk) 13:30, 15 July 2013 (UTC)[reply]

A note that the reference Championship Branch Prediction (13) is a dead link (or at least the server is not responding at the moment). As I do not know exactly what information the original page contained, I don't want to replace it directly. However this page http://www.jilp.org/jwac-2/ looks like a good reference on the subject. — Preceding unsigned comment added by TowerOfBricks (talkcontribs) 13:34, 15 July 2013 (UTC)[reply]


Hello fellow Wikipedians,

I have just added archive links to one external link on Branch predictor. Please take a moment to review my edit. If necessary, add {{cbignore}} after the link to keep me from modifying it. Alternatively, you can add {{nobots|deny=InternetArchiveBot}} to keep me off the page altogether. I made the following changes:

When you have finished reviewing my changes, please set the checked parameter below to true or failed to let others know (documentation at {{Sourcecheck}}).

checkY An editor has reviewed this edit and fixed any errors that were found.

  • If you have discovered URLs which were erroneously considered dead by the bot, you can report them with this tool.
  • If you found an error with any archives or the URLs themselves, you can fix them with this tool.

Cheers.—cyberbot IITalk to my owner:Online 01:50, 19 March 2016 (UTC)[reply]

I found a non-archived version. Guy Harris (talk) 02:59, 19 March 2016 (UTC)[reply]

Hello fellow Wikipedians,

I have just modified one external link on Branch predictor. Please take a moment to review my edit. If you have any questions, or need the bot to ignore the links, or the page altogether, please visit this simple FaQ for additional information. I made the following changes:

When you have finished reviewing my changes, you may follow the instructions on the template below to fix any issues with the URLs.

checkY An editor has reviewed this edit and fixed any errors that were found.

  • If you have discovered URLs which were erroneously considered dead by the bot, you can report them with this tool.
  • If you found an error with any archives or the URLs themselves, you can fix them with this tool.

Cheers.—InternetArchiveBot (Report bug) 16:28, 24 July 2017 (UTC)[reply]

Worked, but I replaced it with a reference to a later CBP competition. Guy Harris (talk) 17:39, 24 July 2017 (UTC)[reply]

branch misprediction


There is supposed to be discussion of branch misprediction but I don't see it. Seems to be, it should be a redirect here. Gah4 (talk) 20:51, 18 April 2018 (UTC)[reply]

I redirected rather than merged, as all of the ideas on branch misprediction were already included on this page, and there were also no references on it. Klbrain (talk) 09:47, 1 May 2019 (UTC)[reply]

How to be sure if it is not parallelization?


If there were the circuits of, say, one hundred Z80 CPUs in one chip, and a microcode unit wrapped around them which analyzed the fetched instructions for a possibility of parallelization, then distributed them equally to the single CPUs and outwardly emulated another CPU, how could you distinguish the weird behaviour, for example when code was self-modifying, from branch prediction?

If by "which analyzed the fetched instructions for a possibility of parallelization and then distributed them equally to the single CPUs" you mean that, for conditional branches, it sends "what we do if the branch is taken" to one CPU and "what we do if the branch isn't taken" to another CPU, that's not branch predictions, that's speculative execution in its eager execution form. Guy Harris (talk) 09:06, 22 November 2019 (UTC)[reply]

Overriding branch predictor examples


Hi Folks, I made the following edit which was rejected for COI (thanks MrOllie for flagging).

I would suggest retaining the edit and possibly altering the citation.

Intel's Silvermont (microarchitecture) implements a two-level overriding branch predictor. The first level predictor appears to be a next line predictor that controls instruction fetching. The second level predictor controls which instructions are decoded and can override the first predictor[1].

The reference is to my own site (www.realworldtech.com), which is pretty widely cited in Wiki articles. Other references include Agner Fog's work (https://www.agner.org/optimize/microarchitecture.pdf), which cites me.

I think it's important to have as an example since Silvermont's overriding predictor actually uses two different types of predictors (as well as different storage for the histories used to predict).

So - do you think the citation is necessary in this situation? If not, maybe we drop to avoid the COI problem? Alternatively, maybe folks are comfortable with it?

TheKanter (talk) 19:56, 17 October 2021 (UTC)[reply]

References

  1. ^ Kanter, David. "Silvermont, Intel's Low Power Architecture". Retrieved 2021-10-17.

Presetting the Predictor


Are there any CPUs that actually allow the programmer to preset the branch predictor? I have read about optimizations that use execution profile information to set branch prediction, so there must be some sort of external preset possible. Is there a CPU that lets this be done computationally? A virtual machine has an execution loop that has as many jump targets as the VM has instructions. Each of these targets ends with a jump to another target which is a direct result of the bytecode that is being interpreted. But the sequence of jumps is known. If the CPU had an instruction that allowed the next jump's target to be set by the code, the VM could execute without ever needing to "predict" a jump. If such a CPU is available, it should be discussed here. 50.206.176.154 (talk) 18:47, 5 April 2022 (UTC)[reply]

Some IBM z/Architecture microprocessors presumably do so, given the "branch prediction preload" instructions described on pages 7-43 through 7-46 of z/Architecture Principles of Operation, Twelfth Edition; those instructions have a register operand, with the register containing the predicted target address, and a memory operand, pointing to the branch instruction, as well as a third 4-bit immediate operand describing the type of branch instruction (length, indication of whether it's a call/return/other branch/EXECUTE instruction (CISC architecture dating back to the 1960's, of course it has an execute instruction :–)).
I don't know of any others. It would be interesting to see whether anybody's studied the benefits of explicit indirect branch prediction and to see why no other major instruction set seems to provide this (including, as far as I know, IBM's own Power ISA). The main places I'd see it being used would be interpreters, as per your note, and method calls if the method implementation differs from instance to instance. Guy Harris (talk) 19:47, 5 April 2022 (UTC)[reply]

Fake?


Branch prediction looks like bullshit. Program code will not be sent to a code fragment (to some RAM address) if there is no need to check whether the branch will be taken or not. And if there is code with conditional statements, then all branches need to be checked anyway (all conditions must be checked). For example, suppose you are resizing a window (in Windows 10 Explorer) in the horizontal direction. Then the length of the bar at the top of the window needs to be checked, and for this you roughly need to check each time whether the mouse pointer has a bigger value on the horizontal axis than the window bar's pixel position on the horizontal axis (otherwise bar pixels would be painted over the desktop to the right, if you were expanding the window to the right with the mouse). It would be smarter to check the horizontal position of the pixels to be drawn not by comparing each pixel position of the bar with the mouse pointer position, but by checking (comparing), say, after every 10 or 100 pixel colors have been written to their positions. If, say, on the last check 10 pixel positions had a smaller horizontal value than the mouse pointer, but after another 10 pixels are written the last pixel has a bigger number than the mouse pointer on the horizontal axis, then you need to go back from that point and check (compare) each of those last 10 pixel positions. After the pixels are drawn correctly, the branch can be taken to the next line. Either way, code will not be sent to an address where it is 100% known that the branch will not be taken... If there is a branch, it can always be taken or not taken, and you (or the CPU) can never know which it will be, from my understanding. The most you can do is execute instructions as if the branch were taken. And if there are many execution units or pipelines in the CPU, then they can do this for all instructions without any branch prediction techniques. And it seems those roughly 10 pipelines in a CPU make the CPU architecture much more complex, but the speedup can be less than 5x due to the increased execution complexity...
The Chinese could make a very energy-efficient CPU without 5-15 pipelines (with only one pipeline), and it would probably use far fewer transistors and would be only 3-7 times slower. If those CPU pipelines are not marketing and/or a misunderstanding... Paraboloid (talk) 11:04, 16 December 2024 (UTC)[reply]

https://en.wikipedia.org/wiki/Instruction_pipelining
(The later "Prescott" and "Cedar Mill" NetBurst cores from Intel, used in the last Pentium 4 models and their Pentium D and Xeon derivatives, have a long 31-stage pipeline.) - As I read about this, many piplines are energy hungry and in future intel get away from so many piplines, because couldn't reach higher frequencies of CPUs due to thermal and energy problems.
https://en.wikipedia.org/wiki/Pentium_(original)
(Like the Intel i486, the Pentium is instruction set compatible with the 32-bit i386. It uses a very similar microarchitecture to the i486, but was extended enough to implement a dual integer pipeline design, as well as a more advanced floating-point unit (FPU) that was noted to be ten times faster than its predecessor.) - The Pentium was the first processor to use pipelining.
https://en.wikipedia.org/wiki/P6_(microarchitecture)
(Superpipelining, which increased from Pentium's 5-stage pipeline to 14 of the Pentium Pro and early model of the Pentium III (Coppermine), and eventually morphed into less than 10-stage pipeline of the Pentium M for embedded and mobile market due to energy inefficiency and higher voltage issues that encountered in the predecessor, and then again lengthening the 10- to 12-stage pipeline back to the Core 2 due to facing difficulty increasing clock speed while improving fabrication process can somehow negate some negative impact of higher power consumption on the deeper pipeline design.) - So now the Pentium has a 5-stage pipeline, not 2 pipelines anymore?
The P6 microarchitecture began with the Pentium Pro and Pentium II, and returned with the Intel Core architecture. — Preceding unsigned comment added by Paraboloid (talkcontribs) 12:31, 16 December 2024 (UTC)[reply]
If there is a branch, it can always be taken or not taken, and you (or the CPU) can never know which it will be, from my understanding. The most you can do is execute instructions as if the branch were taken. Branch prediction doesn't purport to perfectly predict the target of a conditional branch; as the article indicates, either the programmer or compiler provides hints in branch instructions (static branch prediction), or the hardware makes a prediction based on past behavior. And, as the article also indicates, those aren't 100% certain predictions, and the hardware has to recover from mispredictions. So you can do more than always predict "branch taken", but any prediction, whether it's a simple static "branch taken" or something more complicated, runs the risk of mispredicting. (If it could be proven whether the branch will be taken, perhaps the compiler, or programmer, should do that proof and either remove the branch or replace it with an unconditional branch.)
Whether it's worth doing any branch prediction is another matter. Daniel J. Bernstein has slides from a talk where he questions whether "a well-designed insn set with well-designed software [can] remove the speed incentive for branch prediction?" Here's "A Comparative Analysis of Branch Prediction Schemes", a paper from the 1990s studying branch predictor behavior; it cites some other papers.
(As for what the Chinese could do, here's a 2021 paper from researchers at the Nanjing University of Aeronautics and Astronautics about a branch predictor design.)
As for pipelines:
There's the number of pipelines and there's the number of stages in a pipeline, two separate numbers. Intel's "Architecture of the Pentium microprocessor" IEEE Micro paper shows, on page 13 of that issue, a 5-stage pipeline that, after the first two stages, has two separate integer pipelines; it's not as if it has two independent pipelines - the instruction fetch and the first stage of the instruction decode are done in the first two stages, with only one pipeline, and then the control words produced by the first stage of the instruction decode are processed either by the U pipeline or the V pipeline. So:
So now the Pentium has a 5-stage pipeline, not 2 pipelines anymore? It has a 5-stage pipeline in which the last 3 stages are duplicated and can process two instructions in parallel.
The Pentium was the first processor to use pipelining. No, some earlier x86 processors, including, for example, the 486, were pipelined as well. The original Pentium was the first superscalar x86 processor, with a pipeline in which the last three stages of the integer pipeline were duplicated, so that two instructions can simultaneously have their "decode control word and generate memory address", "access data cache or calculate ALU result", and "write result" stages execute. Guy Harris (talk) 21:08, 16 December 2024 (UTC)[reply]
I read more about this branch prediction, and it looks like pseudoscience, or like philosophy. Philosophy is also studied in universities. It is based on the principle "better to learn something, where part of it can be wrong, than to learn nothing at all", because Intel won't share the real CPU working principles and architectures.
Here https://www.zilog.com/docs/z80/um0080.pdf, on page 146 (in Acrobat Reader) or page 132 as printed, is the instruction LDIR.
Description of LDIR.
This 2-byte instruction transfers a byte of data from the memory location addressed by the contents of the HL register pair to the memory location addressed by the DE register pair. Both these register pairs are incremented and the Byte Counter (BC) Register pair is decremented. If decrementing allows the BC to go to 0, the instruction is terminated. If BC is not 0, the program counter is decremented by two and the instruction is repeated. Interrupts are recognized and two refresh cycles are executed after each data transfer. When the BC is set to 0 prior to instruction execution, the instruction loops through 64 KB.
Register H is 8 bits and register L is 8 bits; together the HL register pair is 16 bits. The contents of the HL register pair can point to a memory location (address). If a register pair points to a memory (RAM) address, the operand is written in brackets, like this: (HL). From the address pointed to by the HL register pair, one byte (8 bits) of data is taken, or 8 bits of data from some 8-bit register or input device can be written to the (HL) address...
In some Z80 CPU instructions there is a byte counter. The byte counter can be 8 bits, in which case it is register B. Or the byte counter can be 16 bits, as in this LDIR instruction; then the byte counter consists of two registers, B and C. The BC register pair holds 16 bits as the byte counter.
What does this LDIR instruction do? It loads 8 bits (1 byte) from the 16-bit RAM address in the HL register pair to the RAM address pointed to by the 16-bit DE register pair. Then 1 is added to DE (DE is incremented by 1), and 1 is added to HL. So if DE had the value 1000 and HL the value 2000 (2^16 = 65536 possible addresses, or 64 KB of addressable RAM), then DE now has the value 1001 and HL the value 2001.
Next, the BC register pair is decremented (BC-1). If, say, BC had the value 100, then after incrementing DE and HL, BC has the value 99.
If BC is not 0, the program counter is decremented by two and the instruction is repeated.
If decrementing allows the BC to go to 0, the instruction is terminated.
The LDIR instruction needs 5 machine cycles (4, 4, 3, 5, 5) when it repeats. One machine cycle consists of a few clock cycles: the first machine cycle consists of 4 clock cycles, the second also of 4, the third of 3, the fourth of 5 and the fifth also of 5. That is 4+4+3+5+5 = 21 total CPU clock cycles for the LDIR instruction. For a 4 MHz Z80 CPU, the time taken for this LDIR iteration is 21 * 1/4000000 = 0.00000525 s = 5.25 microseconds = 5250 ns.
On the final iteration (BC = 0) it needs only 4 machine cycles (4, 4, 3, 5), or 4+4+3+5 = 16 clock cycles. A 4 MHz CPU does 4,000,000 (4 million) clock cycles in one second, so the delay in this case is 4 microseconds.
So for some instructions there already is a byte counter, if say you need to do a loop with 100 variables. What branch can you predict in this case? It seems that all those predictions and mispredictions make the architecture more complex and slow, because it still needs to be calculated whether the prediction will be taken or not.
From the Branch predictor article:
Two-level adaptive predictor
"Assume, for example, that a conditional jump is taken every third time. The branch sequence is 001001001... In this case, entry number 00 in the pattern history table will go to state "strongly taken", indicating that after two zeroes comes a one. Entry number 01 will go to state "strongly not taken", indicating that after 01 comes a zero. The same is the case with entry number 10, while entry number 11 is never used because there are never two consecutive ones.
The general rule for a two-level adaptive predictor with an n-bit history is that it can predict any repetitive sequence with any period if all n-bit sub-sequences are different.[8]
The advantage of the two-level adaptive predictor is that it can quickly learn to predict an arbitrary repetitive pattern. This method was invented by T.-Y. Yeh and Yale Patt at the University of Michigan.[13] Since the initial publication in 1991, this method has become very popular. Variants of this prediction method are used in most modern microprocessors."
So now, based on the sequence 001001001, the branch is not taken the first and second time and is taken the third time, if the same instruction is found (what if there is another instruction?). If "0", don't take the branch, and if "1", take it, which could be done with a simpler transistor logic scheme for sending or not sending to the branch. What if such repeating patterns are very rare? And what if, after many steps of the program, you find out that a branch was taken wrongly? Then you need to go back and check all branches from the beginning? Or maybe checking needs to be done in parallel to find out whether those seemingly correct predictions are right or wrong. If a prediction is wrong, you need a very complex apparatus to go back to the location where the branch was taken wrongly. And if you need to check the branch prediction after each branch, then it is simply static branch prediction, as you say. This means doing two instructions in parallel - one instruction as if the branch were taken and another as if it were not. And this could cause lags when trying to write or get something from RAM, because two instructions could refer to RAM at once. Only with a [double] cache could it work... So much more complexity for being SOMETIMES (because not all instructions do branches) faster by 1 instruction. And this complex scheme can be even slower. — Preceding unsigned comment added by Paraboloid (talkcontribs) 10:42, 17 December 2024 (UTC)[reply]
So now need based on sequence 001001001 don't take branch first and second time and to take branch third time if found the same instruction (what if there is another instruction?). If "0" don't take branch and if "1", take branch, which could be sending or not sending to branch with simpler transistors logic scheme. What if such reapeating patterns very rare? And what if after many steps of program you find out that branch was taken wrong. Then need go back and check all branches from begining? Or maybe there in parallel need to do checking if those from first look correct predictions are right or wrong. If prediction wrong need very complex aparat to go back to location where branch was taken wrongly. And if need to check branch prediction after each branch, then it is simply Static branch prediction like you say. This means you doing two instructions in parallel - one instruction like if branch taken and another instruction like if branch is not taken. And this could cause lags when trying to write or get something from RAM, because can refer to RAM with two instructions at once. Only with [double] cache it could work... So much more complexity for being SOMETIMES (because not all instructions doing branches) faster by 1 instruction. And this complex scheme can be even slower. — Preceding unsigned comment added by Paraboloid (talkcontribs) 10:42, 17 December 2024 (UTC)[reply]