- Conventional techniques in digital forensics and data carving rely on simple heuristics. Recently, machine learning has been applied to the field, achieving state-of-the-art results; this research, however, has focused on general file-type support, and no work has addressed identifying the algorithm used to compress a file fragment. A dataset with a sole focus on compression algorithms was developed and released, using GovDocs1 as a base. Using NIST's Statistical Test Suite, it was found that several compression algorithms produce output that is seemingly random, resulting in a difficult classification problem. A convolutional neural network trained on the dataset achieved an accuracy of 41%, highlighting the difficulty of the problem. Fragments created by the tools compress, lz4, and bzip2 were accurately classified, whilst others, such as zip, rar, and gzip, yielded near-random guesses. Future work could focus on developing a purpose-built convolutional neural network or exploring long short-term memory networks.
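The near-randomness that makes this classification hard can be illustrated with a quick Shannon-entropy check on the output of a few compressors. This is a minimal sketch using only the Python standard library, so zlib (the DEFLATE implementation underlying gzip and zip) and bz2 stand in for the tools named above, and lz4/rar are omitted; the sample input is an arbitrary placeholder, not the GovDocs1 data.

```python
import bz2
import math
import zlib
from collections import Counter

def byte_entropy(data: bytes) -> float:
    """Shannon entropy in bits per byte (8.0 means uniformly random bytes)."""
    counts = Counter(data)
    n = len(data)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# Repetitive placeholder text standing in for a file fragment.
sample = b"The quick brown fox jumps over the lazy dog. " * 200

print(f"raw    entropy = {byte_entropy(sample):.2f} bits/byte")
for name, compress in [("zlib", zlib.compress), ("bzip2", bz2.compress)]:
    out = compress(sample)
    print(f"{name:6s} entropy = {byte_entropy(out):.2f} bits/byte")
```

Compressed output scores much closer to the 8 bits/byte of random data than the raw text does, which is consistent with the NIST test-suite finding: the byte histogram alone carries little signal, so a classifier must pick up on subtler structural cues.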