Project Stage 1: Planning

     For the first stage of my final project I had to select a hash, checksum or compression package to try to optimize. The package also had to be written in a language that compiles down to machine code, such as C, C++ or Assembler. Each student needed to select a different package to tackle. To keep as many possibilities for optimization open as possible, I decided to look for packages available on both AArchie and Xerxes.

     I began my search by running dnf searches for "hash", "checksum" and "compression", which gave me a ton of results. After sifting through these I first thought to focus on zpaq, but discovered its archiving and journaling processes were pretty complex and difficult to read through, so I moved on. Next I was going to work with SpookyHash, but after building and profiling it I discovered one of my classmates had beaten me to the punch.
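
     For reference, the searches looked something like this (results will vary by system and enabled repositories):

        # search package names and summaries for likely candidates
        dnf search hash
        dnf search checksum
        dnf search compression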

     Thus I landed on ssdeep, a program that takes input files and creates a context-triggered piecewise hash, or fuzzy hash, for each. Unlike programs like md5 that can only recognize perfect matches, it can also compare signatures and return a degree of matching between them. It can also recursively compute fuzzy hashes through directories and compare hash values against a list of known hashes. With all this functionality I thought there would be a few opportunities to optimize it, either through code tweaks or build configurations, as these are the two methods I am most comfortable with.
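
     As a rough sketch of that functionality, using flags from the ssdeep documentation (the file and directory names here are just placeholders):

        # print the fuzzy hash of a single file
        ssdeep sample.txt

        # recursively hash everything under a directory and save the signatures
        ssdeep -r testdata/ > known_hashes.txt

        # compare files against the saved list and report the degree of matching
        ssdeep -m known_hashes.txt -r testdata/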

    I investigated the community next and found some good signs and some bad. The good: the last update to the project's GitHub repository was only four months ago, so it is not a completely dead project; there is a fair amount of documentation available to help me if and when I get stuck in the code; and there are no signs of active attempts to optimize the code, which means any work I do won't be rendered pointless by another community member later on. The bad: there have been no signs of activity in those four months, so my attempts to reach out to the community for help, or to have my work included, might not garner a response. Weighing these pros and cons, I decided ssdeep was still worth pursuing as the focus of my project, so I loaded the tarball onto Xerxes, then unpacked, configured and built it following ssdeep's documentation. It all went smoothly and quickly, which was promising: the build isn't overly complex and has few dependencies, which should make it easier to test and isolate functionality.
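
     The build itself was the standard autotools sequence, roughly as follows (version number shown only for illustration):

        tar xzf ssdeep-2.14.1.tar.gz
        cd ssdeep-2.14.1
        ./configure
        make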

    I then ran ssdeep on a few files from the source code to ensure it was functioning, and it worked perfectly. The final step was to profile it using perf to identify hot spots that could be targets for optimization. To do so I needed a fair number of files to hash, each of decent size, so I created 200 text files. Recording a run with perf record --call-graph dwarf and then viewing it with perf report allowed me to break down the call graph and see that 88% of the operation time is spent in fuzzy_update. That signals to me that I should look there for potential code improvements.
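
     The profiling workflow looked roughly like this; the loop is just one way of generating a data set of that shape:

        # create 200 text files of roughly 1 MB each to use as hashing input
        mkdir testdata
        for i in $(seq 1 200); do
            base64 /dev/urandom | head -c 1000000 > testdata/file$i.txt
        done

        # record a profile with full call-graph information, then inspect it
        perf record --call-graph dwarf ./ssdeep -r testdata/
        perf report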

The Plan:

     For establishing benchmarks I will be using the school-provided servers, specifically AArchie and Xerxes. This means my results could vary considerably depending on the current server load from my classmates and others. To combat this I will run my benchmark test repeatedly at various points of the day, discard any extreme outliers, and average the rest. All tests will be run against the same data set to ensure the integrity of the tests.
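
     Conveniently, perf can handle some of the repetition itself: perf stat can run a workload multiple times and report the mean and standard deviation, along the lines of:

        # run the same hashing workload 10 times; perf prints mean and variance
        perf stat --repeat 10 ./ssdeep -r testdata/ > /dev/null

     I will still need to repeat this at different times of day to account for fluctuating server load.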

     Once I have my benchmarks I will start by trying some build changes in the makefile. The provided makefile is fairly complex and fleshed out, so I do not expect much impact from the build side, but a closer inspection will not take too long and might reveal some simple changes that could speed things up. I will keep stability as the priority while investigating build options, as not all options are safe or stable: the -O3 flag, for instance, enables GCC's most aggressive optimizations and may speed up one section of code while breaking the correctness of the program as a whole.
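
     Since ssdeep uses an autotools build, trying alternative flags shouldn't require editing the makefile by hand; they can be passed at configure time, for example (flags shown purely as an illustration):

        # rebuild with a different optimization level to compare against the baseline
        ./configure CFLAGS="-O3"
        make clean && make

     Any build produced this way will have to pass the same correctness checks as the baseline before its timings mean anything.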

     Failing to identify any improvements in the build process, I will then attempt some code improvements. I will target the fuzzy_update function, as it consumes the vast majority of the operation time. The main tool I will attempt to leverage is vectorization, with inline assembler and algorithm analysis as secondary options should my investigations warrant them.
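
     As a first step, I can ask GCC to report which loops it managed to vectorize and which it missed, and why; something along these lines, assuming fuzzy_update lives in fuzzy.c (include paths omitted):

        # -fopt-info-vec lists vectorized loops; -fopt-info-vec-missed explains failures
        gcc -O3 -fopt-info-vec -fopt-info-vec-missed -c fuzzy.c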

     I will record my optimizations by profiling my edited code with perf, just as I did for my benchmarks, and comparing the results. This will let me break down how much time or memory, or how many cycles, my changes save; for my purposes, time will be the key metric. The second prong of analyzing my changes is ensuring that all original functionality still works as intended, or rather that I don't break anything in my attempts to improve it. To do so I will store the output of all ssdeep modes run against my test data and compare results before and after my changes to ensure consistency.
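
     A sketch of that consistency check, where ssdeep-baseline stands in for a saved copy of the unmodified binary:

        # capture output from the original build and from the patched build
        ./ssdeep-baseline -r testdata/ | sort > baseline.txt
        ./ssdeep -r testdata/ | sort > patched.txt

        # any difference means my changes altered the program's behaviour
        diff baseline.txt patched.txt && echo "outputs match"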
