WhiZTiM's Posts
Nairaland Forum › WhiZTiM's Profile › WhiZTiM's Posts
1 2 3 4 5 6 7 8 9 10 11 12 (of 18 pages)
Honestly, if Rivers state was manipulated, what of Kano? Even if so, but why haven't we seen Rivers people protesting? NB: I am asking without a clear bias. freedom96: omenka: voxpopp:
|
Fayose! ... I kinda like his doggedness ... I remember seeing this funny image somewhere... ...whatever your views are... its kinda hilarious ...
|
Haha! .... Interesting! |
End time things.... ...#ETT ... Recall that period here on Nairaland?... Well, we must keep eternity in view. What we have is now. We don't own tomorrow (no matter our plans) Think of succession! If you are married, does your spouse knows your estate details? Including ATM pins, Next of kin... ...etc. If you are unmarried, who should benefit from your years of hard work when you die?? We don't think about these, but we should at least once a year.... ... Because your beneficiaries will change over time |
Chai! |
Thanks seun, thanks markessien for this. .. Can you tell us a bit about your time? I mean... 1. how was your typical day like before hotels.ng? 2. How was your typical day like during the early days of hotels.ng? 3. How is your typical day like now? Regards, Timothy |
Haha! |
O boy!!! |
Haha! ...funny |
nairation:Ummm... Mr. Nairaland police... Please a sense of humour... okay? .... #smh |
kudaisi:The 2x2 grid question.....
runs in 0.0000041seconds.... couldn't help it :-) |
olyjosh:...lol. Yes sir.... Cool!... I respect FUTMinna people. at least some of their Computer Sciences geeks. I just saved your number, I will message you on WhatsApp. ------------------------ nnasino:Mehn... Nice solution! |
Kai! |
harryobas:Well, I have never done performance review for any large software. I will inquire from experts and senior folks who may have done such and get back to us on that. ...hopefully next week. However, before I do so, I can speculate: You are right, measuring performance of large scale software is more of an architectural thing. The first thing to consider is your baselines. Baselines are worst performance situations that are deemed acceptable by you. So, you have to breakdown your application in hierarchy and measure the throughput of each system. I am feeling sleepy now, and have a busy day ahead... will edit the post later in the weekend please. |
kudaisi:Wow. Thanks for that! :-) But correct me if I am wrong, to avoid ambiguities shouldn't the title be compiler optimization rather than just optimization ?You are absolutely right. Thanks for that. Noted! Additionally, I'm looking forward to a post on other levels of optmization for the purpose we newbies who are the majority of nairaland programming section. May be you should start from higher level optimizations such as Design before easing us in to the lower ones so that a greater portion of audience can benefit and relate with it.Firstly, you will have to forgive me on haziness of this post, it wasn't really organized. I was writing(ranting) what was on my head at that time. Yes, there are 'higher' levels of optimization. But I think the higher levels of optimization simply entails being very transparent to the compiler/interpreter. This is because, modern day compilers and interpreters have so much rules and data baked into them that they can quickly recognize hotspots for optimization. I will talk about that later Attempting to do design level optimization in middle of a project in a lot of cases can infer re-coding the entire project or even go as far as changing the programming language of choice.... A newbie would want to know why and what design level optimization entails.First of all, "Premature optimization is the root of all evil". You should always benchmark your program before 'Optimizing'. When running performance tests, you will often be surprised of the places you'll find hotspots. In practice, you wouldn't start churning out 'optimized' code on first release. However, for your software project, you must choose Software Design pattern(s) that favors localization of your "moving parts". By moving parts, I mean your program's logic, algorithm, models, etc. Localization in this context means the ability to change the implementation of a part of your program without reworking your API. Secondly, regarding choice of Programming Language, for a company, it typically depends on the decision of the Software Architect and/or the CTO. For indie development. The Choice of programming Language is mostly subjective or contract-dictated. There is more to this... maybe later Then I have two questions. A lot of peeps will open this post and be like 'what is he talking about?' and 'how does he even know this kind of thing'.what is he talking about?: He is talking about how being friends with villagers will make your life in that village better. We are talking about knowing the details of some of their traditions and rites.... This is analogous to Compiler optimizations... :-) How does he even know this kind of thing?: He is a High Performance Computing enthusiast. He has been experimenting with several language facilities for over half a decade. He is an ACM member. He occasionally reads the ISO C++ Standard and is an avid follower of the ISO C++ Committee. He loves squeezing last drop performance from the metal. In view of the a move mentioned, how many years of coding did it take you before you started worrying about compiler level optimization ?Half a decade. :-) Secondly, please share a scenario where I will find myself having to worry about compiler optimization (apart from the array example, I learnt that when I started understanding the importance of algorithm and data structure level optimizations). From my experience, the only time I've had to consider this sort of optimization is when I'm trying to figure out which compiler to use for a certain based on how optimized they are for the task herein.Imagine I gave you something to model on Computer for example, Fluid flow, Radiation, Bacteria Culture, etc. Assuming, there are no known algorithms to speed up computation time, you are to implement it and run millions of data through it. If you implement it correctly but non-challantly, it may cost 300USD (for example) per day to run it on some Processing farm.... If it takes 2 weeks to run to completion, imagine the savings you can make if you can reduce the CPU usage and increase the Programs throughput to extents cost now falls down to 200USD per day. I am feeling sleepy now, please I will comeback later with a better example. |
Javanian:...haha... its cool oh.... Yep, True! Machine independent optimizations. Thanks for that addition Javanian:. ..lolz. Thanks. Well appreciated. |
Somebody will soon come and blame Fayose... |
Good news.... Though, I just wish we stop blaming Abuja and Aso Villa for every problem in the country... Hit 'like' for this cat family...
|
Take it or leave it... Atiku is sound 'Nigerian' Businessman. Keep insulting him, He is making it.. He is a large employer of labour. Despite their flaws, Appreciate the people ahead of you... So that you may reach there someday! ..... topsyking:Yes! And it will continue to be like that! As far as the poor man keeps believing that it is his right to have money; and that the rich always cut corners; and keeps condemning the rich.... |
* REGISTER ALLOCATION Consider Register allocation.... Register allocation simply means the optimal combination of variables you can have in the CPU registers at a time. Consider a CPU with 4 integer registers only. And a program like this: int a = 4; int b = 2; int c = 6 - a; int d = a * 5 - b + 2; A TAC translation can yield: NB: Take "TAC (Three Address Code)" to be reducing code to a representation of at most 3 operands and one operator a = 4; b = 2; c = 6 <subtract> a; k1 = a <multiply> 5; k2 = k1 <subtract> b; d = k2 + 2; ...now, with only 4 registers (r1, r2, r3, r4), the Compiler can assign the variables r1 = a; r2 = b; r3 = 6 <subtract> r1; r4 = r1 <multiply> 5; write_back r1 to L1 cache {Hence, we can reuse r1) r1 = r4 <subtract> r2; write_back r3 to L1 cache {Hence, we can reuse r3) r3 = r1 <add> 2; flush_registers; do some more computation with registers... till you finish. Now can you think of the optimal way to assign the variables to registers? Yes!... Its a Graph Colouring problem! ...(Ever heard of Graph Colouring?) * An Addendum ^^ on Vectorization Where Vectorization comes into play is for example: consider the subtractions, if we could fit a union of its dependencies in registers... we could apply a single subtraction once for all; ...(more like a hardware circuitry loop) Did you know: This is the primary reasons why GPUs are faster than Super-scaler CPUs? CPUs are by far more complex than GPUs due extra circuitry or feature for Memory Protection, Faulting, Address Translation, Paging, Interrupts, and so many more important stuff. GPUs on the other hand doesn't do that... GPUs instead uses up those circuit space to add more simple ALU's or execution units. Hence, for a complex flow of a sequential algorithm, CPU's are much faster. For simple things that can be paralleled, GPUs are orders of magnitude faster. Consider a fading effect where you may write want to reduce the 'alpha' value of each pixel... For such operation, a GPU will do it much faster then a CPU... For example the nVidia GTX880 SLi will do it ~1000x faster than most CPU's of today's average computers |
To me, loop Jamming is a form of Loop unrolling... Well, Loop jamming has to do with Induction analysis as I mentioned earlier in loop unrolling. * More rants on Loop Unrolling ....and Loop Jamming(Fusion) That, certainly, almost takes place in every matured language. Three cases I can think of now is: 1. Nested Loops 2. Union(able) Loops 3. Looping over adjacent memory (^^Note the above are my classification, I am not an authority (yet) on this subject matter, so I classified the use-cases for understanding; but I try to be as accurate as possible. Mistakes are mine!) 1. Nested Loops: Consider:
Apparently, the inner loop always adds 0 ... 9; And is executed 9 times; hence it can be factored out as: (0 ... 9) * 10; Change it to AP summation: 45 * 10 = 450; So we can do away with the inner loop. Since, the inner loop executed 10 times we must multiply the preceding variables by 10 also. hence:
If we continue with our analysis, its easy to discern that we can compute the value of z at compile time... Should be 900 I guess. From there, constant propagation takes place! ![]() 2. Union(able) Loops: Can't think of an example now, but it has to do with merging a nested loop to one; Consider an outer Loop of M times, and an inner loop of N times.... hence inner loops does M * N times in total Jamming can take place where the inner loop has a defined sequence of operation that can be reduced to M times; kind-of like the first explanation. difference is that the first one has a dependency in the inner loop. A Union(able) loop has no dependency. 3. Looping over adjacent memory: Consider:
if certain invariants are ok, the above will be surely translated to:
...obviously, these are very simple examples... in the real world, it is more complex than this.. That is why we have generalizations.... |
@Javanian I can't know all of them in the world, however, I do know most of the popular ones and I am constantly keeping myself updated with people's papers and journals. Some optimizations are not so popular, and some have different terminologies. Nonetheless here is a continuation of my rants... with respect to your expectations... I presume by '3AC', you mean Three Address Code (TAC). ..if it's not so, then this post is irrelevant to '3AC' If that is so, then yes, that could be requirement for some optimizations.... (I can think of constant folding) Consider Register allocation.... Register allocation simply means the optimal combination of variables you can have in the CPU registers at a time. int a = 4; int b = 2; int c = 6 - a; int d = a * 5 - b + 2; A TAC translation can yield a = 4; b = 2; c = 6 <subtract> a; k1 = a <multiply> 5; k2 = k1 <subtract> b; d = k2 + 2; Here, the compiler can lookup whether the operation <multiply> or <subtract> can be done at compile time, if yes, it looks up the operands... if the values are available, it solves them right there... And replaces: a with 4, b with 2, c with 2, and d with 20; Another interesting thing here is that constant propagation takes place... That is, everywhere the compiler sees any of the variable within lexical scope, it replaces it with the real numbers in place. Note: After such optimization above; For C++, The ISO C++ standard permits the compiler not to reserve memory for a, b, c, and d provided it is never used again or reassigned... |
AAinEqGuinea:...lol. I never said "registers", I said, "Execution Units". I have no idea of how many registers my GPU contains... MIMD is surely a good stuff... though there are theoretical constraints on why it isn't a popular hype; MIMD is more applicable to CPUs, not GPUs. GPUs are more of SIMD ...and they are crazy fast at it. Trekking? ...haha? why? @Javanian. THanks. But why does Nairaland's spam bot almost always blocks me, even before today? ![]() ?I just tried editing my original post, but was banned by antispam bot. I want to modify some parts. |
This post is a bit advanced but... I've tried to make it as simple as possible. I will break each heading down in completely different post sometime when I am free. If Compilers were transforming codes we write to machine code in verbatim (or simple rules), Software development would have been dramatically hard. I mean serious pain in the butt! There are several optimizations employed by the Compiler... but we are only going to look at a few that can dramatically affect performance. In my own world, the factors that determine performance are (in order of importance) Algorithm, Data Structure, Data Locality and Optimization But, I assume the first three are known, so we will look at what typically happens between our Software and Hardware that makes our code faster. Some happen only at Compiler level or CPU level, some are interleaved There are many of them. 1. Constant folding 2. Constant propagation 3. Register allocation 4. Vectorization 5. Branch prediction 6. Caching 7. Out of Order Execution 8. Loop Unrolling 9. Compile time evaluation 9. ...etc... If you value performance, you will know to depths, your Compiler(or Interpreter), Operating System, CPU and operating environments. * Cache Hierarchy. The presence of several levels of cache are apparent and important factor for several speed improvements from 0.5x to 1000x plus.... (it can be more).... (0.5x, reduced? Yes! Slower! typically presents itself as false-sharing in concurrency) Between the RAM and the CPU execution units are typically several levels of Cache. At CPU level, we have L1, L2, L3 cache. The CPU register is the closest, fastest and the only place bytes are operated upon. The next closest is the L1 cache, it's typically small. My CPU: L1=128KB per core. L2=512KB shared, L3=3MB shared. The CPU doesn't fetch single data from RAM, it typically fetches Data in block to fill up what is known as Cache lines. That is why you know that arrays are faster than lists when it comes to iteration and modification. * Speculative Execution.... Branch Predictors technically, this isn't the right heading, but its a language lawyer thing. For conditional constructs in code, the compiler translates that to a Branch. using jump statements and Jump tables. For every Conditional in a code, we are in for Branch Prediction Branch prediction is the CPU's way of speculating whether the condition will be satisfied to execute the body or skip it There are several ways Branch Predictors could be implemented at Circuit level. But its also easy to implement a model of it in Software. * Single threaded programs have some minute level of parallelism... Out of Order Execution The ISO C++ standard permits every C++ compiler to alter the 'form' of any code provided the resultant program produces the exact empirical output expected by the user. CPU Operations can be piplened. This is when the CPU can do more than 1 thing in one instruction cycle. For example, the Intel CPU can perform Addition and Subtraction at the same time... and you have this code.. int a = 1; int b = 2; int c = 3; int d = 4; and you wrote this: int x = d - a; int y = c - b; Instead of the CPU to solve for x in the first execution cycle, then y in the second execution cycle, it can simple solve them both in one cycle. That alone is 2x improvement! There is an interesting caveat here... The CPU also provides some opportunities to reorder code. This feature is known as Out of Order Execution OOE can happen anytime several execution units within the CPU aren't synchronized(doing the same thing at the same time), and opportunities still exists for pipelining. * Loop Unrolling. THis has to do with Induction analysis... summ = 0; for x in range(1, 500): summ += x; Here, the Python interpreter can throw those statements away and replace it with Arithmetic Progression summation. Same with C++, etc There are several advanced and very fast methods of analysis in use today. (cost of a few microseconds to milliseconds extra per optimization). There are more that are not currently in use because of the cost of time and memory at compile time. Example: int x = 2; for(int i=0; i<50: i += 2){ x += i; } Will certainly be transformed into this: int x = 50; ...by any C++ compiler with even the poorest optimization. * Compile Time Execution. No need to say much here... but this is one of C++'s most renowned strengths. As a matter of principle, every performant C++ compiler will attempt to evaluate every code at compile time and replace with shorter expressions. This is thoroughly limited by data available and operation requirements; This is why most C++ programmers use templates or constexpr statements. * Vectorization. Ever heard of processor features such as Yea... these are features in CPU that makes vectorization possible. SSE stands for Streaming SMID Extensions. ....SMID means Single Instruction Multiple Data. AVX means Advanced Vector Instructions The point is that, all CPU execution Units can only perform one operation at time(execution cycle). Note that different operations may have different CPU cycles. Addition may take as few as 5 cycles. Division typically takes 20+ cycles... x86 and x64. A CPU or GPU may have several execution units. (My GPU has 16). Vectorization is the transformation of repeated operations. Consider: auto res = some_computed_array(); for (int x = 0; x < 20; x++) res[x] = res[x - 1] + res[x]; Normally, there is no room for Out of Order Execution here.... but there is room for Loop Unrolling and vectorization.... instead of: register1 = register2 + register3; it can say something like register_xmm0 = register_xmm3... + register_xmm6; Meaning, a single instruction was streamed over multiple data. SSE to SSE4.1 offers some good set of instructions. AVX is the future... Currently, 4th generation Intel CPUs implements AVX2 ...a 256bit wide band; Current technology has support for even better. AVX512 is expected to start rolling out this year. Intel and nVidia promises more than 3x performance by 2017 for almost the same price of today |
Will be retiring now..... ....My engagements have drastically increased Thanks Kudasi for these simple questions... ...I learned some things... ...I am a two time participant in one of the Regionals of the Most prestigious Programming contest in the World, ACM-ICPC. Participated in the ACM-ICPC South African Regionals back in 2012 and 2013. Needless to say. - In 2012, my team (ATBU) came 2nd in Nigeria but 40th in Africa. Covenant University's C++ team came 1st in Nigeria but 38th in Africa. http://icpc.baylor.edu/regionals/finder/south-africa-2012/standings - In 2013, my team (ATBU) came 1st in Nigeria but 5th in Africa. Covenant University's C++ team came 2nd in Nigeria but 18th in Africa. We would have landed top 2 if I and my partner didn't screw up some things (We correctly attempted 6 questions, but successfully submitted 3)... Since then, I haven't done much of Competitive programming. http://icpc.baylor.edu/regionals/finder/south-africa-2013/standings We were honored and rewarded ...I will find time to write on that someday here on Nairaland... olyjosh. Nice work. Keep it up bro! lordZOUGA. ...err... I see you. ... |
kudaisi:Ans: Starting number: 837799 with length: 524 Runtime 0.01sec; (with memorization) 30x speedup... ~10,000 more memory used But interestingly, I just began tinkering on this problem, and I thought, "Let me do something stupid once again, bruteforce"... so I implemented it in several Languages... C++ Compilers excel pretty well in optimization. Except for C++, All the other languages ran for unacceptable timings. Never do this out there!
Ans: Starting number: 837799 with length: 524 Runtime 0.3sec; |
kudaisi:Ans: 76576500 is the first triangle number with over 500 divisors, 576 divisors to be exact Suboptimal bruteforce. . .. Could significantly improve it by sieving and/or making informed speculations
|
andreT:Now that is a perfect cubic runtime.... Don't do that!. .....at least in Python.... :-) ....(this is gonna run for minutes) PS: Don't take me too serious.... Ehmm... there should be known methods or algorithm that generates pythagorean triples... Google that.... ...implement it and woolala... it will run under a second. |
kudaisi:Here, a non-optimal solution... but it works http://ideone.com/qxpoTH http://ideone.com/upxEf2 Answer: 200, 375, 425; Product: 31875000 Runtime: 0.1 second Memory: 3.0MB Side Note: I initially implemented it in Python, but it ran for 2 minutes... :-) So, I was pretty sure that other than machine code, the C++ compiler will emit vectorized instructions.... and some other optimizations, loop unrolling... blah nlah blah.... |

...#ETT ... Recall that period here on Nairaland?


...