Generalized Algorithm Detection or Signatures
From Ben's Writing
Contents |
To some random reader
Don't asume that this has any merit, I wrote it late at night after a binge of coding and already sleep deprived. It may be utter garbage. That said, if it works, and no one else has done it, then it all mine, baby.
Question
Can we find a way to extract the types of algorithms used by a piece of code, or binary data?
Motivation
Finding a way to detect algorithm types might help tune large scale software applications. For instance, it would be convenient to scan a large project, determine the algorithms is using and then decide if new ones could solve the problem more effectively than the existing ones. This could all be done without ever looking at the code of the application.
There are usually standard ways of solving problems in Computer Science. If for example algorithm A is the most efficient way to solve problem P, and some piece of code is using algorithm B, it should be modified to use A. How we define most efficient here is relative to the domain in which the application is used. For instance, the meaning of efficiency with regards to a scheduling algorithm is dependant on the environment it is used in. Sometimes, as in the case of an air traffic control system, fast turnaround is vital, where as for movie presentations the turn-around can be slower but produce a more highly optimized output (and not just the first solution it finds).
Algorithm Types
Algorithms that use a similar problem-solving approach can be grouped together. Here we present a basic classification scheme to organize these algorithm types under. This scheme is neither exhaustive nor necessarily disjoint, but it will serve our purposes.
(Oversimplified) Basic Algorithm Classifications:
- Loop
- Simple Recursive
- Backtracking
- Divide and conquer
- Dynamic programming
- Greedy
- Branch and Bound
- Brute force
- Randomized
Some Problems
The first problem is that a simple Loop in a program would be indistinguishable from a Simple Recursive algorithm, unless the Simple Recursive algorithm had parameters or some other means of identifying it. The Stack could then be used to differentiate Loops from Simple Recursive algorithms.
Another problem, is that some of the other algorithm types have an underling Simple Recursive structure, so there must be some further criteria by which we can disambiguate a Simple Recursive algorithm from, say, a Divide and Conquer algorithm. Locality or the distance between two function invocations may help in this respect. Alternatively, we might need a completely different type of categorization, if the above proves to be ambiguous.
Approaches
- To establish the basic patterns we are looking for we could graph the locations of program counter (PC) at runtime. The graph would consist of a node for every executable instruction and a directed edge between any two nodes that are adjacent—with regards to the flow of execution—at runtime.
- Another route may be to look at patterns in long running stack traces. Looking at how and when particular functions are invoked, especially with regards to locality or distance between similar function invocations.
Loop
It's trivial to find unconditional loops, bounded loops however, are a little more difficult. But not by much. First lets take a simple look at a loop in x86 assembler.
_loop:
jmp short _loop
Sure it's nothing fancy, but no matter what we stuff between those two lines, this loop will cycle forever, unless there is another jump instruction inserted.