Everyone can code: they can write “some” code and get a task done. Some people calculate time complexities and try to reduce the run time as well (which is good). But that is not enough..!! You also have to think about the platform your code runs on. Your program may have a linear run time complexity (theoretically) and still perform poorly. If your intention is just to get the work done, regardless of performance, please stop reading this article right now.!
However, you will probably need to consider performance at some point. So just keep on reading.
If you’re not familiar with Multi-Threaded Programming, I strongly suggest you read up on it, since it can speed up your program significantly. For example, if you have a Core i7 processor and you still write single-threaded programs, you are wasting your resources. With efficiently written code you may get more than a 400% speedup.
Even if you write serial programs (not serial communication programs), the following tips may give you a boost.
- Select the best (or a better) algorithm
- For large data sets, using a low-complexity algorithm (one with a lower operation count) will be highly beneficial.
- Use efficient libraries
- This will save you lots of time and effort.
- BLAS (Basic Linear Algebra Subprograms) (www.netlib.org).
- LAPACK is a good-quality public domain library.
- But don’t use libraries for the tiniest tasks: don’t call `pow()` to get x^2, just use `x*x`.
- Optimal data locality
- Learn how the memory hierarchy actually works so that you can use them efficiently. (We will talk about these later in this article)
- Use compiler optimizations
- e.g. `cc -O3 demo.c`
- Using a medium optimization level (`-O2` on many machines) typically leads to a speedup by a factor of 2 to 3 without a significant increase in compilation time. However, `-O4` or higher may apply unwanted optimizations and may even lead your program to crash.
- Use a better computing environment
- Use a more powerful system with good systems software, to match your application.
Compiler Optimizations
The compiler already does some level of optimization to improve the performance of your program, but it is always better to have some idea of what it does. Follow these best practices whenever you can, so that you reduce compile time, increase readability, and make sure your code is already optimized.
The compiler optimizes code in units called “basic blocks”: stretches of code that contain no branching.
Common sub-expression elimination
```
a = x + y + p
b = x + y + q
```

This will be automatically replaced by

```
t = x + y
a = t + p
b = t + q
```
However, it is always good practice to apply this optimization yourself when writing code.
Strength Reduction
Replace a high-cost arithmetic operation with an equivalent, but faster, operation (or operations).
For instance, `2*x` will be replaced by `x + x`, and `x^2` will be replaced by `x*x`.
Loop Invariant Code Motion
```
for (int i = 0; i < count; i++) {
    data[i] = y * z * data[i];
}
```

This is replaced by:

```
t = y * z;
for (int i = 0; i < count; i++) {
    data[i] = t * data[i];
}
```

since the value of `y * z` is invariant throughout the loop.
Constant Value Propagation
Whenever the compiler sees that a variable holds a constant value, it treats it as a literal and replaces all its occurrences with that value. Moreover, when an expression contains several constant values, they are replaced by the pre-calculated result.
```
x = 2;
y = 4 * x * z;
```

will be replaced by

```
y = 8 * z;
```
Dependency Elimination
A dependency occurs when two (or more) statements involve the same variables. The compiler tries to eliminate these, but it is better to do it manually when possible.
There are 3 types of dependencies.
True Dependency
```
t1 = t2 - t3
t5 = t4 - t1
```
See the following code sample.
```
x = 1 * 2 * y;
s = s + x;
z = 2 * 3 * r;
t = t + z;
```
Each pair of statements here is truly dependent and cannot run simultaneously, since the second statement of a pair cannot execute until the first one finishes.
This can be re-arranged as:
```
x = 1 * 2 * y;
z = 2 * 3 * r;
s = s + x;
t = t + z;
```
Anti-Dependency
```
t1 = t2 - t3
t2 = t4 - t6
```
Here, the second statement overwrites t2, which the first statement reads (t1 does not truly depend on the new t2). Hence, we can store the result of the second statement in a separate variable (say t5) to eliminate the dependency.
Output Dependency
```
t3 = t4 + t5
...
t3 = t6 - t7
```
In this case, both statements write to t3, so the compiler stores one of the results in a separate, independent variable.
Data Locality
When you’re writing a program, it is always better to have an idea of how memory (in the memory hierarchy) is used. The main reason is that even though the processor is fast, that speed is useless if it has to wait for data: the bottleneck is then the data transfer, not the processing.
Moving down the memory hierarchy (registers → cache → RAM → disk), access becomes drastically slower. We MUST know this when writing programs. Do not access files frequently, do all (or most of) the file work at once, and keep the most-needed data in RAM if necessary.
Ideally, when a piece of data is accessed, it is copied to RAM and then to the cache. The copy in the cache is then used (a cache hit) for further references. If the requested data is no longer in the cache (a cache miss, because it was replaced by other data), it has to be fetched from RAM again, which takes a lot of time. Hence, once we load data from RAM, we should make maximum use of it before moving on to other data.
Note :
Cache line: the smallest unit of data that can be transferred to or from memory. It holds the contents of a contiguous block of main memory. Cache lines are usually between 32 and 128 bytes.
Cache Associativity
This specifies how the cache slots are mapped with the main memory locations.
In Direct-Mapped Caching, each memory location is fixed to one cache slot. Since the same cache slot is mapped to many memory locations, a cache line has a high chance of being overwritten.
In Set-Associative Caching, the cache is divided into sets of N slots each, and a memory location is mapped to a set instead of a single cache slot. The memory can then be loaded into any slot within its allocated set.
In Fully-Associative Caching, a memory location can be loaded into any cache slot freely.
The key idea of data locality is to minimize the cache misses, by trying to keep a piece of data inside the cache as long as possible.
- Spatial Locality
- Access variables that are defined near each other consecutively in the program.
- Temporal Locality
- Access the same variable consecutively as much as possible.
There are a few ways to improve the data locality.
Fusion
When the same data is accessed in different parts of the code, bring those parts closer together if possible. This is very effective when the data is comparatively large, since accessing it in distant parts of the code might cause cache replacements in between.
Pre-fetching
Load data beforehand. When a piece of data is needed in a later part of the program, it is better to load it in advance along with other data (e.g. in file reading operations).
Loop Interchanging
Reorder loops so that the access pattern in the loop matches the pattern in which the data is stored in memory. This is mostly relevant when accessing array elements.
```
for (i = 0; i < 4; i++) {
    for (j = 0; j < 4; j++) {
        a[j][i] = ...
    }
}
```
If we assume that a is an int[][] array, each iteration of the above code is forced to read a data element 4*sizeof(int) = 16 bytes away from the previous one (since C/C++ is row-major). Hence, there is a higher chance of cache lines being replaced, which leads to lots of cache misses.
If we reorganize the same loop as follows (note that i and j are interchanged now):
```
for (i = 0; i < 4; i++) {
    for (j = 0; j < 4; j++) {
        a[i][j] = ...
    }
}
```
Here, each iteration directly accesses the next int value, which is beneficial for spatial locality.
Blocking
If the amount of data being handled at a time exceeds what the cache can hold, it is better to split the work into smaller blocks. This has to be done very carefully, without affecting the correctness of the code.
Standard matrix multiplication:
```
for (i = 0; i < N; i++)
    for (j = 0; j < N; j++) {
        c[i][j] = 0.0;
        for (k = 0; k < N; k++)
            c[i][j] = c[i][j] + a[i][k] * b[k][j]; /* makes C row by row */
    }
```
After blocking:
```
/* assumes c[][] is zero-initialized beforehand */
for (p = 0; p < NB; p++)
    for (q = 0; q < NB; q++)
        for (r = 0; r < NB; r++)
            for (i = p * NEIB; i < p * NEIB + NEIB; i++)
                for (j = q * NEIB; j < q * NEIB + NEIB; j++)
                    for (k = r * NEIB; k < r * NEIB + NEIB; k++)
                        c[i][j] = c[i][j] + a[i][k] * b[k][j]; /* makes C, block-row by block-row */
```
The key idea used here is that not all of the data is required to compute one element of the output matrix. For instance, when multiplying two 3×3 matrices A and B, calculating C[0][0] uses only the first row of A and the first column of B.
These are only a very few ways of optimizing the code to obtain higher performance.
However, please note that the following things, which you should still do in your code, have nothing to do with performance.
- Use meaningful variable names
- Because the compiler is going to discard whatever name you’re using and use its own.
- Use comments to describe what you have done
- Of course these are NOT even considered when parsing the source code.
- Indent your code
- Of course, white spaces (spaces and tabs) are thrown away when parsing.
And these points do have to be considered, since they affect performance.
- Use breakpoints instead of sout/cout/printf
- IO operations are going to drag your performance down into hell.
- Access files only when needed.
- Do not write to files inside loops. Do it at a single shot.
- Clean up your garbage
- If you create a heap object, clean it! (for C/C++ people)
- Use threads when needed
- No matter how good your design is, if you don’t utilize the CPU, your program will slow down once in a while. So use threads to parallelize the work. But don’t mess with the UI thread (don’t do heavy tasks there).
Hope these tips help you in some way.
Cheers…!!