SEMA Evolution: Redefining Malware Analysis Toolchain Architecture

I'm excited to share the remarkable work of my student, Manon O., on enhancing the SEMA (Symbolic Execution for Malware Analysis) toolchain, which was the focus of her Master's thesis at École polytechnique de Louvain. This toolchain is an open-source project designed to detect and classify malware using symbolic execution and machine learning.

- https://github.com/csvl/SEMA-ToolChain
- https://hub.docker.com/r/manonoreins/sema-scdg
- https://hub.docker.com/r/manonoreins/sema-classifier

Here are the key improvements and results:

* Overview

Malware is evolving rapidly, and traditional analysis tools struggle to keep up. SEMA addresses this by using symbolic execution to generate System Call Dependency Graphs (SCDGs), which serve as signatures for malware classification.
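To make the idea of an SCDG concrete, here is a minimal sketch in plain Python. It assumes one common notion of dependency: a later call depends on an earlier one when it takes that call's return value as an argument. The `SysCall` class, the `build_scdg` function, and the sample trace are hypothetical illustrations, not SEMA's code, and SEMA's actual graph construction may use different or additional rules.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SysCall:
    """One observed call: its name, argument values, and return value."""
    name: str
    args: tuple
    ret: object = None


def build_scdg(trace):
    """Return edges (i, j) where call j takes the return value of earlier call i."""
    edges = set()
    for j, later in enumerate(trace):
        for i in range(j):
            earlier = trace[i]
            if earlier.ret is not None and earlier.ret in later.args:
                edges.add((i, j))
    return edges


# Hypothetical trace: a file handle flows from CreateFileA to the two later calls.
trace = [
    SysCall("CreateFileA", ("C:\\tmp\\a.txt",), ret=0x44),
    SysCall("WriteFile", (0x44, "payload"), ret=1),
    SysCall("CloseHandle", (0x44,), ret=1),
]
edges = build_scdg(trace)
for src, dst in sorted(edges):
    print(f"{trace[src].name} -> {trace[dst].name}")
```

In this toy trace, both `WriteFile` and `CloseHandle` depend on `CreateFileA` through the shared handle. The resulting edge set is the kind of structure a classifier can then compare across samples.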

* Key Enhancements

    A. Architectural Overhaul:
        - Transitioned from a monolithic structure to a microservices architecture, enhancing maintainability and scalability.
        - Implemented three distinct microservices: SemaSCDG for symbolic execution, SemaClassifier for machine learning-based classification, and SemaWebApp for a user-friendly web interface.

    B. Code Quality and Usability:
        - Introduced configuration files to simplify parameter management and enhance flexibility.
        - Improved documentation and added detailed README files to assist users.
        - Refactored large classes and methods for better readability and maintainability.
        - Reduced code duplication and removed dead code.
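As an illustration of the configuration-file approach, the sketch below reads analysis parameters from INI-style text with Python's standard `configparser` module. The file format, section names, and keys are assumptions made for this example; the toolchain's real configuration files may be organized differently.

```python
import configparser

# Hypothetical settings; section and key names are illustrative only.
EXAMPLE_CONFIG = """
[explorer]
method = BFS
timeout = 600
max_states = 50

[output]
format = json
"""


def load_settings(text):
    """Parse INI text into a plain dict with typed values."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    return {
        "method": parser.get("explorer", "method"),
        "timeout": parser.getint("explorer", "timeout"),
        "max_states": parser.getint("explorer", "max_states"),
        "format": parser.get("output", "format", fallback="json"),
    }


settings = load_settings(EXAMPLE_CONFIG)
print(settings)
```

Keeping parameters in a file like this means an experiment can be changed or reproduced by editing or sharing one file rather than a long command line.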

    C. Performance Improvements:
        - Used PyPy3 and its just-in-time compiler to improve execution speed.
        - Conducted memory analysis to address and fix memory leaks.
        - Optimized state management within the angr framework to improve performance.
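The post does not detail which state-management changes were made. One general technique for keeping symbolic execution tractable is to cap the number of states stepped at once and park the rest until there is room. The sketch below shows that idea in plain Python. It does not use angr's API, and the `explore` function, the `max_active` parameter, and the stash names are hypothetical.

```python
from collections import deque


def explore(initial, successors, max_active=4, max_steps=100):
    """Explore a state tree while stepping at most `max_active` states per round.

    States beyond the cap wait in `paused` and are resumed when room frees up.
    Returns the finished states (those with no successors) and the largest
    number of states that were active in any round.
    """
    active = deque([initial])
    paused = deque()
    finished = []
    peak_active = 0

    for _ in range(max_steps):
        if not active:
            break
        peak_active = max(peak_active, len(active))
        next_states = []
        for state in active:
            children = successors(state)
            if children:
                next_states.extend(children)
            else:
                finished.append(state)
        # Older paused states are resumed before newly created ones.
        pool = list(paused) + next_states
        active = deque(pool[:max_active])
        paused = deque(pool[max_active:])
    return finished, peak_active


def toy_successors(state):
    """Toy program: every state branches in two until three branches are taken."""
    if len(state) < 3:
        return [state + (0,), state + (1,)]
    return []


finished, peak = explore((), toy_successors, max_active=4)
print(len(finished), peak)
```

The toy program has eight paths, and all of them are still reached, but no more than four states are ever stepped in a single round. Bounding the live set in this way trades some exploration latency for a lower peak memory footprint.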

* Results

    - Maintainability: The new microservices architecture and refactored codebase significantly improve the toolchain's maintainability, making it easier to update and extend.
    - Performance: Execution speed increased and memory usage decreased, enabling faster and longer experiments.
    - Usability: Enhanced documentation, streamlined parameter management, and a cleaner codebase make the toolchain more accessible to users and developers.

* Conclusion

These enhancements make SEMA a robust tool for tackling the evolving landscape of malware threats. The toolchain is now better prepared for future developments and more accessible for broader use.

The updated version of SEMA is available on GitHub: https://github.com/csvl/SEMA-ToolChain

Manon and I welcome feedback and contributions from the community!

Feel free to ask questions; I'd be glad to discuss the project and its impact on malware analysis further.

Looking forward to your thoughts!