Project Stage 3 - Optimization

     Finally the time has come for me to implement my plan for optimization and see if my theories hold true. As detailed in my plan, I first looked at compiler flags to see if there could be any significant gains while still maintaining full functionality. My first step was to read through the Makefile to understand where I needed to make my changes and test the timings. This proved to be somewhat of a challenge, as the Makefile for ssdeep is very extensive, with a variety of conditional checks and error handling. Having read through the first chunk, I quickly realized that I was not equipped to fully understand everything the Makefile was doing. I then opened the Makefile in less and searched for "flags", which led me to a promising section:

CFLAGS = -g -O2
CPP = gcc -E
CPPFLAGS = -I/usr/local/include
CXX = g++
CXXCPP = g++ -E
CXXDEPMODE = depmode=gcc3
CXXFLAGS = -g -O2

     My first test was simply to kick the optimization up to -O3 to see if I would gain any time and whether it would break anything in the overall program. I rebuilt the program under -O3 optimization and tested it under the same conditions as my benchmarks. The results were underwhelming: a mere 0.5% decrease in time, at an average of 20.764s. Running diff on both outputs showed that the increased optimization did not affect the overall functionality of the program, so I can safely say that -O3 slightly optimizes the ssdeep code.

     I was not satisfied with this small gain, though, so I decided to investigate a few other flags that might push the speed even further. The first was the -Ofast flag, which pushes optimization beyond -O3 but at the potential expense of accuracy, so I kept a very close eye on the output from these tests. I ran another 15 tests, which averaged out to a 20.481s run time, about a 2% overall decrease in time, which is getting better. To check the integrity of the output I first ran diff on both files, which declared them perfectly matching; just for my own peace of mind I also ran a few quick wc commands to ensure they had matching counts, and confirmed their sizes were the same. All of this indicated to me that -Ofast had not compromised ssdeep's accuracy at all.

     Finally, on the compiler side, I attempted to push a little further using profile-guided optimization, which lets gcc profile a build and execution of the program and then rebuild it using the profile to guide further optimizations. Step one of the process is building with the -fprofile-generate flag, so that running the program generates a profile. That done, I ran ssdeep through my usual test data without timing it, in case timing affected the profile without my knowledge. Step two was simply to re-compile, replacing the -fprofile-generate flag with -fprofile-use, which would hopefully beef up the optimization and shave off some more time. I compiled again, ran through my 15 tests, and to my great surprise I was now at 23.828s on average, with no accuracy impact. This was a real head-scratcher for me, as I had not considered that it might significantly slow me down. I dropped back down to -O3 to see if it was just a strange interaction between -Ofast and the profile, yet I found the same issue with -O3 as well. From what I have read, one of the big benefits of profile-guided optimization is on highly branched code, where the profile lets the compiler recognize hot functions and structure the build around them. Since ssdeep is fairly flat in its implementation, the profile added bulk to the code without the upside one might normally expect.

     With my main three compiler changes tested, my conclusion on the compiler side is that ssdeep can safely be compiled with -Ofast to gain a 2% overall decrease in run time on Aarchie without sacrificing accuracy. Since this seemed to be the best I was going to get out of the compiler flags, I also tested this setup on Yaagi to make sure my changes didn't break anything there. I found that it worked just as well on Yaagi, maintaining its accuracy with an overall decrease in run time of 3%. I will include my full -Ofast benchmarks below for those interested; as with my previous post, all my outliers were on the top end and were explained by other users starting a program during my test. However, 2% did not seem meaty enough for me, so I decided to continue on with my second plan for optimization.

     My second approach proved to be far, far more complicated than I had originally anticipated. My thought was to take this loop:

for (i = self->bhstart; i < self->bhend; ++i)
  {
    self->bh[i].h = sum_hash(c, self->bh[i].h);
    self->bh[i].halfh = sum_hash(c, self->bh[i].halfh);
  }

     And vectorize it using the TBL instruction to replace sum_hash, which simply returns a value from the sum_table. My assumption was that if I used the TBL instruction to do a simultaneous look-up of eight 8-bit unsigned characters, I could significantly speed up this particular loop. Unfortunately, I did not examine the table look-up closely when I first decided to choose this as my path. After starting to rewrite the loop, I quickly realized that it was in fact a two-dimensional table, using the unsigned char h as the first index and c as the second. This threw a big monkey wrench into my code. What I had thought was going to be a simple alignment and instruction call turned into a beast of a problem for my current skill set.

     I decided to press on with it, though, as this seemed to be where most of the time was spent in the code, so a major gain here could be significant. I spent many hours poring over the ARMv8 instruction manuals and sifting through assembly tutorials until I found that 2D arrays in assembler are really 1D arrays laid out by row or by column, so you have to manage the offset for each row or each column depending on the layout, and managing this for a 64x64 array was going to be difficult. Upon learning this, my initial thought was to fetch the row I wanted in C and then perform the vectorization from there, eliminating my two-dimensional problem. After trying that out, I realized the flaw in my logic: I had no way of guaranteeing that two consecutive look-ups would hit the same row, which meant I could only look up one value at a time, fully defeating the purpose of vectorizing in the first place.

-Ofast benchmarks:

Aarchie:

1.   20.444277145 seconds time elapsed
2.   20.453308310 seconds time elapsed
3.   20.488060016 seconds time elapsed
4.   20.456172209 seconds time elapsed
5.   20.472601960 seconds time elapsed
6.   20.447276332 seconds time elapsed
7.   20.460654555 seconds time elapsed
8.   20.469467618 seconds time elapsed
9.   20.471569602 seconds time elapsed
10. 20.488927501 seconds time elapsed
11. 20.468304492 seconds time elapsed
12. 20.481516280 seconds time elapsed
13. 20.486256404 seconds time elapsed
14. 20.471053165 seconds time elapsed
15. 20.463490788 seconds time elapsed

Average: 20.4678 seconds time elapsed

Old Average: 20.879700191 seconds time elapsed

Decrease: 2%

Yaagi:

1.   7.827208084 seconds time elapsed
2.   7.810797700 seconds time elapsed
3.   7.799247921 seconds time elapsed
4.   7.869076013 seconds time elapsed
5.   7.847016071 seconds time elapsed
6.   7.818802489 seconds time elapsed
7.   7.831928547 seconds time elapsed
8.   7.797149357 seconds time elapsed
9.   7.783070721 seconds time elapsed
10. 7.789773376 seconds time elapsed
11. 7.808798807 seconds time elapsed
12. 7.807145351 seconds time elapsed
13. 7.832143760 seconds time elapsed
14. 7.830165557 seconds time elapsed
15. 7.810145385 seconds time elapsed

Average: 7.817 seconds time elapsed

Old Average: 8.0551890022 seconds time elapsed

Decrease: 3%
