abstract
-
Many industries depend on legacy software systems that are important to their operations but difficult to maintain because the systems are poorly understood, typically as a result of inadequate or inconsistent documentation. Although redesigning such a system can be costly, some organizations plan to reverse engineer the software specification documents from the code to ease the burden of that effort. This thesis presents an incremental and modular approach to building a process and tools that extract the semantics of legacy assembly code.
Our techniques combine static analysis and symbolic interpretation to reverse engineer the semantics of legacy software. We examine the case of IBM-1800 programs in detail. From an abstract model of the operational semantics of the IBM-1800, we simultaneously obtain an emulator and a symbolic analysis process. Augmented with control flow information, the symbolic analysis provides complete semantics for the code sequences of interest. We can also generate Data Flow Graphs that depict the flow of data in those code segments. The process of extracting semantic information from the assembler code is automated, requiring only a little human intervention at the initial step.
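The idea of obtaining both an emulator and a symbolic analysis from one operational model can be illustrated with a minimal Haskell sketch. It is not the implementation developed in this thesis: the three-instruction accumulator machine, the `Value` class, and all names below are hypothetical, chosen only to show how one `step` function can be reused over a concrete and a symbolic value domain.

```haskell
module Main where

import qualified Data.Map as M

-- Hypothetical accumulator machine; NOT the real IBM-1800 instruction set.
data Instr
  = Load Int   -- acc := mem[addr]
  | Add Int    -- acc := acc + mem[addr]
  | Store Int  -- mem[addr] := acc
  deriving Show

-- Abstract value domain: the semantics below only needs addition.
class Value v where
  plus :: v -> v -> v

-- Concrete domain: machine integers, which yields an emulator.
instance Value Int where
  plus = (+)

-- Symbolic domain: expressions over the initial memory contents.
data Expr
  = Lit Int        -- known constant
  | Init Int       -- unknown initial content of a memory cell
  | Expr :+: Expr  -- symbolic sum
  deriving (Eq, Show)

instance Value Expr where
  plus = (:+:)

-- Machine state: the accumulator and the memory cells written so far.
data State v = State { acc :: v, mem :: M.Map Int v }

-- One operational-semantics step, shared by both interpretations.
-- 'initial' supplies the value of cells that have not been written yet.
step :: Value v => (Int -> v) -> State v -> Instr -> State v
step initial (State a m) instr = case instr of
  Load addr  -> State (fetch addr) m
  Add addr   -> State (a `plus` fetch addr) m
  Store addr -> State a (M.insert addr a m)
  where
    fetch addr = M.findWithDefault (initial addr) addr m

-- Run a straight-line instruction sequence from an initial accumulator.
run :: Value v => (Int -> v) -> v -> [Instr] -> State v
run initial a0 = foldl (step initial) (State a0 M.empty)

program :: [Instr]
program = [Load 10, Add 11, Store 12]

main :: IO ()
main = do
  -- Emulation: cell 10 holds 2, every other cell holds 3.
  let concrete = run (\a -> if a == 10 then 2 else 3) (0 :: Int) program
  -- Symbolic analysis: every cell starts as an unknown 'Init' value.
      symbolic = run Init (Lit 0) program
  print (M.lookup 12 (mem concrete))  -- Just 5
  print (M.lookup 12 (mem symbolic))
```

Under these assumptions the concrete run computes a number for cell 12, while the symbolic run records that cell 12 holds the sum of the initial contents of cells 10 and 11. Control flow, which the sketch omits, is what the analysis must add to cover code sequences that are not straight-line.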
We use Haskell as our implementation language; its key features help us create modular and well-structured software. The literate programming style used in this thesis improves the readability and consistency of the implementation's documentation.
The process and associated tools created in this thesis are used in a large reverse engineering project whose goal is to extract requirements specifications from legacy assembly code. This project is funded jointly by Ontario Power Generation (OPG) and Communications and Information Technology Ontario (CITO).