Welcome, Guest: Register On Nairaland / LOGIN! / Trending / Recent / New
Stats: 3,152,828 members, 7,817,420 topics. Date: Saturday, 04 May 2024 at 11:55 AM

Optimizations: Some Sketchy Rants - Programming - Nairaland

Nairaland Forum / Science/Technology / Programming / Optimizations: Some Sketchy Rants (1383 Views)

(2) (3) (4)

(1) (Reply) (Go Down)

Optimizations: Some Sketchy Rants by WhiZTiM(m): 3:30am On May 08, 2015
This post is a bit advanced but... I've tried to make it as simple as possible.
I will break each heading down in completely different post sometime when I am free.

If Compilers were transforming codes we write to machine code in verbatim (or simple rules), Software development would have been dramatically hard. I mean serious pain in the butt!

There are several optimizations employed by the Compiler... but we are only going to look at a few that can dramatically affect performance.
In my own world, the factors that determine performance are (in order of importance) Algorithm, Data Structure, Data Locality and Optimization

But, I assume the first three are known, so we will look at what typically happens between our Software and Hardware that makes our code faster.

Some happen only at Compiler level or CPU level, some are interleaved
There are many of them.
1. Constant folding
2. Constant propagation
3. Register allocation
4. Vectorization
5. Branch prediction
6. Caching
7. Out of Order Execution
8. Loop Unrolling
9. Compile time evaluation
9. ...etc...

If you value performance, you will know to depths, your Compiler(or Interpreter), Operating System, CPU and operating environments.


* Cache Hierarchy.
The presence of several levels of cache are apparent and important factor for several speed improvements from 0.5x to 1000x plus.... (it can be more).... (0.5x, reduced? Yes! Slower! typically presents itself as false-sharing in concurrency)
Between the RAM and the CPU execution units are typically several levels of Cache. At CPU level, we have L1, L2, L3 cache.
The CPU register is the closest, fastest and the only place bytes are operated upon. The next closest is the L1 cache, it's typically small. My CPU: L1=128KB per core. L2=512KB shared, L3=3MB shared.

The CPU doesn't fetch single data from RAM, it typically fetches Data in block to fill up what is known as Cache lines.
That is why you know that arrays are faster than lists when it comes to iteration and modification.


* Speculative Execution.... Branch Predictors
technically, this isn't the right heading, but its a language lawyer thing.
For conditional constructs in code, the compiler translates that to a Branch. using jump statements and Jump tables. For every Conditional in a code, we are in for Branch Prediction
Branch prediction is the CPU's way of speculating whether the condition will be satisfied to execute the body or skip it
There are several ways Branch Predictors could be implemented at Circuit level. But its also easy to implement a model of it in Software.


* Single threaded programs have some minute level of parallelism... Out of Order Execution
The ISO C++ standard permits every C++ compiler to alter the 'form' of any code provided the resultant program produces the exact empirical output expected by the user.
CPU Operations can be piplened. This is when the CPU can do more than 1 thing in one instruction cycle.
For example, the Intel CPU can perform Addition and Subtraction at the same time... and you have this code..

int a = 1;
int b = 2;
int c = 3;
int d = 4;

and you wrote this:
int x = d - a;
int y = c - b;

Instead of the CPU to solve for x in the first execution cycle, then y in the second execution cycle, it can simple solve them both in one cycle.
That alone is 2x improvement!
There is an interesting caveat here... The CPU also provides some opportunities to reorder code. This feature is known as Out of Order Execution
OOE can happen anytime several execution units within the CPU aren't synchronized(doing the same thing at the same time), and opportunities still exists for pipelining.

* Loop Unrolling.
THis has to do with Induction analysis...

summ = 0; for x in range(1, 500): summ += x;
Here, the Python interpreter can throw those statements away and replace it with Arithmetic Progression summation.
Same with C++, etc

There are several advanced and very fast methods of analysis in use today. (cost of a few microseconds to milliseconds extra per optimization). There are more that are not currently in use because of the cost of time and memory at compile time.

Example:
int x = 2;
for(int i=0; i<50: i += 2){ x += i; }

Will certainly be transformed into this:
int x = 50; ...by any C++ compiler with even the poorest optimization.

* Compile Time Execution.
No need to say much here... but this is one of C++'s most renowned strengths. As a matter of principle, every performant C++ compiler will attempt to evaluate every code at compile time and replace with shorter expressions. This is thoroughly limited by data available and operation requirements;
This is why most C++ programmers use templates or constexpr statements.


* Vectorization.
Ever heard of processor features such as MMX, SSE, SSE2, SSE4, AVX etc?
Yea... these are features in CPU that makes vectorization possible.
SSE stands for Streaming SMID Extensions. ....SMID means Single Instruction Multiple Data.
AVX means Advanced Vector Instructions

The point is that, all CPU execution Units can only perform one operation at time(execution cycle).
Note that different operations may have different CPU cycles. Addition may take as few as 5 cycles. Division typically takes 20+ cycles... x86 and x64.

A CPU or GPU may have several execution units. (My GPU has 16).
Vectorization is the transformation of repeated operations.
Consider:
auto res = some_computed_array();
for (int x = 0; x < 20; x++)
res[x] = res[x - 1] + res[x];

Normally, there is no room for Out of Order Execution here.... but there is room for Loop Unrolling and vectorization....
instead of:
register1 = register2 + register3;

it can say something like
register_xmm0 = register_xmm3... + register_xmm6;

Meaning, a single instruction was streamed over multiple data. SSE to SSE4.1 offers some good set of instructions.
AVX is the future...
Currently, 4th generation Intel CPUs implements AVX2 ...a 256bit wide band;
Current technology has support for even better. AVX512 is expected to start rolling out this year.

Intel and nVidia promises more than 3x performance by 2017 for almost the same price of today

5 Likes 1 Share

Re: Optimizations: Some Sketchy Rants by Celebrimbor(m): 9:27pm On May 09, 2015
quite informative. please keep the lecture coming.

2 Likes

Re: Optimizations: Some Sketchy Rants by AAinEqGuinea: 2:50am On May 10, 2015
Nice op!

"16 gpu registers"? shocked stock gpu?

I've joined the movement.

I'm trekking right now for optimized MIMD implementations. I'm not stopping until single program paradigms obtains widespread market implementation. Every home with a cluster of ps3's... merely to calculate stuff.

1 Like

Re: Optimizations: Some Sketchy Rants by Javanian: 11:45am On May 10, 2015
Great post! Though i expected to see other optimiztion techniques like loop jamming and 3AC.

1 Like

Re: Optimizations: Some Sketchy Rants by WhiZTiM(m): 3:19pm On May 10, 2015
AAinEqGuinea:
Nice op!

"16 gpu registers"? shocked stock gpu?

I've joined the movement.

I'm trekking right now for optimized MIMD implementations. I'm not stopping until single program paradigms obtains widespread market implementation. Every home with a cluster of ps3's... merely to calculate stuff.

...lol. I never said "registers", I said, "Execution Units". I have no idea of how many registers my GPU contains...

MIMD is surely a good stuff... though there are theoretical constraints on why it isn't a popular hype;
MIMD is more applicable to CPUs, not GPUs.
GPUs are more of SIMD ...and they are crazy fast at it.

Trekking? ...haha? why?


@Javanian. THanks. But why does Nairaland's spam bot almost always blocks me, even before today? sad ?
I just tried editing my original post, but was banned by antispam bot. I want to modify some parts.

2 Likes

Re: Optimizations: Some Sketchy Rants by WhiZTiM(m): 3:30pm On May 10, 2015
@Javanian
I can't know all of them in the world, however, I do know most of the popular ones and I am constantly keeping myself updated with people's papers and journals. Some optimizations are not so popular, and some have different terminologies. Nonetheless here is a continuation of my rants... with respect to your expectations...

I presume by '3AC', you mean Three Address Code (TAC). ..if it's not so, then this post is irrelevant to '3AC'
If that is so, then yes, that could be requirement for some optimizations.... (I can think of constant folding)

Consider Register allocation....
Register allocation simply means the optimal combination of variables you can have in the CPU registers at a time.
(More on this later)

int a = 4;
int b = 2;
int c = 6 - a;
int d = a * 5 - b + 2;

A TAC translation can yield

a = 4;
b = 2;
c = 6 <subtract> a;
k1 = a <multiply> 5;
k2 = k1 <subtract> b;
d = k2 + 2;

Here, the compiler can lookup whether the operation <multiply> or <subtract> can be done at compile time, if yes, it looks up the operands... if the values are available, it solves them right there...
And replaces: a with 4, b with 2, c with 2, and d with 20;

Another interesting thing here is that constant propagation takes place...
That is, everywhere the compiler sees any of the variable within lexical scope, it replaces it with the real numbers in place.
Note: After such optimization above; For C++, The ISO C++ standard permits the compiler not to reserve memory for a, b, c, and d provided it is never used again or reassigned...

1 Like

Re: Optimizations: Some Sketchy Rants by WhiZTiM(m): 3:39pm On May 10, 2015
To me, loop Jamming is a form of Loop unrolling...
Well, Loop jamming has to do with Induction analysis as I mentioned earlier in loop unrolling.

* More rants on Loop Unrolling ....and Loop Jamming(Fusion)

That, certainly, almost takes place in every matured language.
Three cases I can think of now is:
1. Nested Loops
2. Union(able) Loops
3. Looping over adjacent memory

(^^Note the above are my classification, I am not an authority (yet) on this subject matter, so I classified the use-cases for understanding; but I try to be as accurate as possible. Mistakes are mine!)

1. Nested Loops:
Consider:

int z =1;
for(int x=0; x< 10; x++){
for(int y = 0; i < 10; y++)
z += x+y;
}

Apparently, the inner loop always adds 0 ... 9;
And is executed 9 times; hence it can be factored out as:
(0 ... 9) * 10;

Change it to AP summation:
45 * 10 = 450;


So we can do away with the inner loop.
Since, the inner loop executed 10 times we must multiply the preceding variables by 10 also.
hence:

int z =1;
for(int x=0; x< 10; x++){
z += (x*10) + 450;
}

//We just jammed the above^^


If we continue with our analysis, its easy to discern that we can compute the value of z at compile time... Should be 900 I guess.
From there, constant propagation takes place! smiley

2. Union(able) Loops:
Can't think of an example now, but it has to do with merging a nested loop to one;
Consider an outer Loop of M times, and an inner loop of N times.... hence inner loops does M * N times in total
Jamming can take place where the inner loop has a defined sequence of operation that can be reduced to M times;
kind-of like the first explanation. difference is that the first one has a dependency in the inner loop. A Union(able) loop has no dependency.

3. Looping over adjacent memory:
Consider:

int a=1, b=2;
for(int x=0; x< 10; x++){
a += x;
}

for(int x=0; x< 10; x++){
b += x;
}


if certain invariants are ok, the above will be surely translated to:

for(int x=0; x< 10; x++){
a += x;
b += x;
}


...obviously, these are very simple examples... in the real world, it is more complex than this..
That is why we have generalizations....

2 Likes

Re: Optimizations: Some Sketchy Rants by WhiZTiM(m): 3:51pm On May 10, 2015
* REGISTER ALLOCATION

Consider Register allocation....
Register allocation simply means the optimal combination of variables you can have in the CPU registers at a time.

Consider a CPU with 4 integer registers only. And a program like this:
int a = 4;
int b = 2;
int c = 6 - a;
int d = a * 5 - b + 2;

A TAC translation can yield:
NB: Take "TAC (Three Address Code)" to be reducing code to a representation of at most 3 operands and one operator

a = 4;
b = 2;
c = 6 <subtract> a;
k1 = a <multiply> 5;
k2 = k1 <subtract> b;
d = k2 + 2;

...now, with only 4 registers (r1, r2, r3, r4), the Compiler can assign the variables

r1 = a;
r2 = b;
r3 = 6 <subtract> r1;
r4 = r1 <multiply> 5;
write_back r1 to L1 cache {Hence, we can reuse r1)
r1 = r4 <subtract> r2;
write_back r3 to L1 cache {Hence, we can reuse r3)
r3 = r1 <add> 2;
flush_registers;

do some more computation with registers... till you finish.
Now can you think of the optimal way to assign the variables to registers?
Yes!... Its a Graph Colouring problem! ...(Ever heard of Graph Colouring?)

* An Addendum ^^ on Vectorization
Where Vectorization comes into play is for example:
consider the subtractions, if we could fit a union of its dependencies in registers...
we could apply a single subtraction once for all; ...(more like a hardware circuitry loop)

Did you know: This is the primary reasons why GPUs are faster than Super-scaler CPUs?
CPUs are by far more complex than GPUs due extra circuitry or feature for Memory Protection, Faulting, Address Translation, Paging, Interrupts, and so many more important stuff.

GPUs on the other hand doesn't do that... GPUs instead uses up those circuit space to add more simple ALU's or execution units.
Hence, for a complex flow of a sequential algorithm, CPU's are much faster.

For simple things that can be paralleled, GPUs are orders of magnitude faster.
Consider a fading effect where you may write want to reduce the 'alpha' value of each pixel...
For such operation, a GPU will do it much faster then a CPU... For example the nVidia GTX880 SLi will do it ~1000x faster than most CPU's of today's average computers

3 Likes

Re: Optimizations: Some Sketchy Rants by AAinEqGuinea: 3:53pm On May 10, 2015
WhiZTiM:


...lol. I never said "registers", I said, "Execution Units". I have no idea of how many registers my GPU contains...

MIMD is surely a good stuff... though there are theoretical constraints on why it isn't a popular hype;
MIMD is more applicable to CPUs, not GPUs.
GPUs are more of SIMD ...and they are crazy fast at it.

Trekking? ...haha? why?


@Javanian. THanks. But why does Nairaland's spam bot almost always blocks me, even before today? sad ?
I just tried editing my original post, but was banned by antispam bot. I want to modify some parts.

My reference to your gpu registers or e.u. wasn't in reference to my MIMD proposition smiley Just wanted to inquire about the your gpu and then reference MIMD, which I concur with gpu's exception throughput

But pleased do Go on, good job, and thanks for this contribution

1 Like

Re: Optimizations: Some Sketchy Rants by Javanian: 7:33pm On May 11, 2015
Great post again! Just to add, Three address code also makes it easier to generate more abstract code, not bothering too much about things like register allocations, aids machine independent optimization and spots compile time error very easily. Forgive me if i am sounding finicky, I took a course on Compiler Design in school last semester so i just can't help it.

1 Like

Re: Optimizations: Some Sketchy Rants by Javanian: 7:36pm On May 11, 2015
Sorry, for the Spam bot ban, I can't really say why it kept banning you (probably need some optimizations grin ) but i would try and monitor the thread to ensure it doesn't happen again.

1 Like

Re: Optimizations: Some Sketchy Rants by kudaisi(m): 1:02am On May 14, 2015
Good post bro, this is why I always view any thread with your name in It. But correct me if I am wrong, to avoid ambiguities shouldn't the title be compiler optimization rather than just optimization ?

Additionally, I'm looking forward to a post on other levels of optmization for the purpose we newbies who are the majority of nairaland programming section. May be you should start from higher level optimizations such as Design before easing us in to the lower ones so that a greater portion of audience can benefit and relate with it.

Attempting to do design level optimization in middle of a project in a lot of cases can infer re-coding the entire project or even go as far as changing the programming language of choice.... A newbie would want to know why and what design level optimization entails.

Then I have two questions. A lot of peeps will open this post and be like 'what is he talking about?' and 'how does he even know this kind of thing'.

In view of the a move mentioned, how many years of coding did it take you before you started worrying about compiler level optimization ?

Secondly, please share a scenario where I will find myself having to worry about compiler optimization (apart from the array example, I learnt that when I started understanding the importance of algorithm and data structure level optimizations). From my experience, the only time I've had to consider this sort of optimization is when I'm trying to figure out which compiler to use for a certain based on how optimized they are for the task herein.

4 Likes 1 Share

Re: Optimizations: Some Sketchy Rants by abdul01(m): 5:20am On May 14, 2015
Kudaisi my guy, see as u just took out all (abi most of) d things that I've been bothering on (as regards this thread).

I wish to meet u someday sha, cos I like ur sense of humility

1 Like

Re: Optimizations: Some Sketchy Rants by WhiZTiM(m): 11:02pm On May 14, 2015
Javanian:
Great post again! Just to add, Three address code also makes it easier to generate more abstract code, not bothering too much about things like register allocations, aids machine independent optimization and spots compile time error very easily. Forgive me if i am sounding finicky, I took a course on Compiler Design in school last semester so i just can't help it.
...haha... its cool oh.... Yep, True! Machine independent optimizations. Thanks for that addition
Javanian:
Sorry, for the Spam bot ban, I can't really say why it kept banning you (probably need some optimizations grin ) but i would try and monitor the thread to ensure it doesn't happen again.
.
..lolz. Thanks. Well appreciated.

1 Like

Re: Optimizations: Some Sketchy Rants by harryobas: 11:36pm On May 14, 2015
@WhiZTiM

how do you actually measure performance? In my opinion, and especially in large systems the software architecture plays a central role in achieving high overall performance since performance bottlenecks are not caused by the actual computation related to domain functionality but due to context switches, synchronization points, dynamic memory management, communication overheads etc.

1 Like

Re: Optimizations: Some Sketchy Rants by WhiZTiM(m): 12:01am On May 15, 2015
kudaisi:
Good post bro, this is why I always view any thread with your name in It.
Wow. Thanks for that! :-)

But correct me if I am wrong, to avoid ambiguities shouldn't the title be compiler optimization rather than just optimization ?
You are absolutely right. Thanks for that. Noted!

Additionally, I'm looking forward to a post on other levels of optmization for the purpose we newbies who are the majority of nairaland programming section. May be you should start from higher level optimizations such as Design before easing us in to the lower ones so that a greater portion of audience can benefit and relate with it.
Firstly, you will have to forgive me on haziness of this post, it wasn't really organized. I was writing(ranting) what was on my head at that time.

Yes, there are 'higher' levels of optimization. But I think the higher levels of optimization simply entails being very transparent to the compiler/interpreter. This is because, modern day compilers and interpreters have so much rules and data baked into them that they can quickly recognize hotspots for optimization. I will talk about that later

Attempting to do design level optimization in middle of a project in a lot of cases can infer re-coding the entire project or even go as far as changing the programming language of choice.... A newbie would want to know why and what design level optimization entails.
First of all, "Premature optimization is the root of all evil". You should always benchmark your program before 'Optimizing'. When running performance tests, you will often be surprised of the places you'll find hotspots.

In practice, you wouldn't start churning out 'optimized' code on first release. However, for your software project, you must choose Software Design pattern(s) that favors localization of your "moving parts". By moving parts, I mean your program's logic, algorithm, models, etc. Localization in this context means the ability to change the implementation of a part of your program without reworking your API.

Secondly, regarding choice of Programming Language, for a company, it typically depends on the decision of the Software Architect and/or the CTO. For indie development. The Choice of programming Language is mostly subjective or contract-dictated. There is more to this... maybe later


Then I have two questions. A lot of peeps will open this post and be like 'what is he talking about?' and 'how does he even know this kind of thing'.
what is he talking about?: He is talking about how being friends with villagers will make your life in that village better. We are talking about knowing the details of some of their traditions and rites.... This is analogous to Compiler optimizations... :-)

How does he even know this kind of thing?: He is a High Performance Computing enthusiast. He has been experimenting with several language facilities for over half a decade. He is an ACM member. He occasionally reads the ISO C++ Standard and is an avid follower of the ISO C++ Committee. He loves squeezing last drop performance from the metal.

In view of the a move mentioned, how many years of coding did it take you before you started worrying about compiler level optimization ?
Half a decade. :-)


Secondly, please share a scenario where I will find myself having to worry about compiler optimization (apart from the array example, I learnt that when I started understanding the importance of algorithm and data structure level optimizations). From my experience, the only time I've had to consider this sort of optimization is when I'm trying to figure out which compiler to use for a certain based on how optimized they are for the task herein.

Imagine I gave you something to model on Computer for example, Fluid flow, Radiation, Bacteria Culture, etc. Assuming, there are no known algorithms to speed up computation time, you are to implement it and run millions of data through it. If you implement it correctly but non-challantly, it may cost 300USD (for example) per day to run it on some Processing farm.... If it takes 2 weeks to run to completion, imagine the savings you can make if you can reduce the CPU usage and increase the Programs throughput to extents cost now falls down to 200USD per day.

I am feeling sleepy now, please I will comeback later with a better example.

3 Likes

Re: Optimizations: Some Sketchy Rants by WhiZTiM(m): 12:18am On May 15, 2015
harryobas:
@WhiZTiM

how do you actually measure performance? In my opinion, and especially in large systems the software architecture plays a central role in achieving high overall performance since performance bottlenecks are not caused by the actual computation related to domain functionality but due to context switches, synchronization points, dynamic memory management, communication overheads etc.

Well, I have never done performance review for any large software. I will inquire from experts and senior folks who may have done such and get back to us on that. ...hopefully next week.

However, before I do so, I can speculate:
You are right, measuring performance of large scale software is more of an architectural thing. The first thing to consider is your baselines.

Baselines are worst performance situations that are deemed acceptable by you.
So, you have to breakdown your application in hierarchy and measure the throughput of each system.
I am feeling sleepy now, and have a busy day ahead... will edit the post later in the weekend please.

1 Like

Re: Optimizations: Some Sketchy Rants by kudaisi(m): 12:58am On May 15, 2015
WhiZTiM:

Well said, all doubts cleared. More grease to your elbow.

1 Like

(1) (Reply)

A Web Developer/programmer And Web Designer Needed In Enugu Urgently / Register For 21 Free Online Courses And Certifications! / What Is The Value Of Software Engineers In Nigeria?

(Go Up)

Sections: politics (1) business autos (1) jobs (1) career education (1) romance computers phones travel sports fashion health
religion celebs tv-movies music-radio literature webmasters programming techmarket

Links: (1) (2) (3) (4) (5) (6) (7) (8) (9) (10)

Nairaland - Copyright © 2005 - 2024 Oluwaseun Osewa. All rights reserved. See How To Advertise. 82
Disclaimer: Every Nairaland member is solely responsible for anything that he/she posts or uploads on Nairaland.