I'm testing two nearly identical programs with a tiny difference in one of the for loops. The first uses three loops that iterate the indexes in the order y, z, x, while the second iterates them in the order x, z, y.
My question is: why the difference in user time and wall-clock time? Is it because of the memory access pattern in one program versus the other?
test_1.c:

#include <stdlib.h>

#define N 1000

/* Matrix definition */
long long int A[N][N], B[N][N], R[N][N];

int main()
{
    int x, y, z;

    /* Matrix initialization */
    for (y = 0; y < N; y++)
        for (x = 0; x < N; x++)
        {
            A[y][x] = x;
            B[y][x] = y;
            R[y][x] = 0;
        }

    /* Matrix multiplication */
    for (y = 0; y < N; y++)
        for (z = 0; z < N; z++)
            for (x = 0; x < N; x++)
            {
                R[y][x] += A[y][z] * B[z][x];
            }

    exit(0);
}
The difference in the second program (test_2.c) is in the last set of for loops:

    for (x = 0; x < N; x++)
        for (z = 0; z < N; z++)
            for (y = 0; y < N; y++)
            {
                R[y][x] += A[y][z] * B[z][x];
            }
If I run /usr/bin/time -v ./test_1 I get the following stats:
Command being timed: "./test_1"
User time (seconds): 5.19
System time (seconds): 0.01
Percent of CPU this job got: 99%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:05.22
While /usr/bin/time -v ./test_2 gives the following stats:
Command being timed: "./test_2"
User time (seconds): 7.75
System time (seconds): 0.00
Percent of CPU this job got: 99%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:07.76
Basically, you're accessing the memory in a different pattern. C stores a 2D array in row-major order, so your first approach, whose inner loop walks along a row, is much more friendly to the memory cache: you're accessing a lot of data in the same area, then moving on to the next piece of memory, etc.
If you want a real world analogy, imagine you're delivering leaflets to 10 different roads (A-J), each of which has house numbers 1-10. You could deliver A1, A2, A3...A10, B1, B2, B3...B10 etc... or you could deliver A1, B1, C1...J1, A2, B2, C2... etc. Clearly, the first way is going to be more efficient. It's just like that in computer memory - it's more efficient to access memory "near" memory you've recently accessed than it is to hop around.
See more on this question at Stack Overflow