
Commit ce18a14

Fix typo, clarify confusion matrix sample
1 parent 0a986d5

File tree

1 file changed: +3, -3 lines

paper/main.tex

Lines changed: 3 additions & 3 deletions
@@ -33,7 +33,7 @@
 \maketitle

 \begin{abstract}
-Conventional techniques in the domain of digital forensics and data carving focus on simple heuristics. Recently, machine learning has been applied to the subject, achieving state-of-the-art results. The research, however, has focused on general file support. No research has been done on the identification of the algorithm used to compress file fragments. A dataset with a sole focus on compression algorithms was developed and released, using GovDocs1 as a base. Using NIST's Statistical Test Suite, it was found that several compression algorithm produce output that are seemingly random - resulting in a difficult classification problem. By training a convolutional neural network on the dataset, an accuracy of 41\% was achieved, highlighting the difficult nature of this problem. The files created by the tools compress, lz4 and bzip2 were accurately classified whilst others such as zip, rar and gzip produced random guesses. Future work could focus on developing a new purpose-built convolutional neural network, or exploring long short-term memory networks.
+Conventional techniques in the domain of digital forensics and data carving focus on simple heuristics. Recently, machine learning has been applied to the subject, achieving state-of-the-art results. The research, however, has focused on general file support. No research has been done on the identification of the algorithms used to compress file fragments. A dataset with a sole focus on compression algorithms was developed and released, using GovDocs1 as a base. Using NIST's Statistical Test Suite, it was found that several compression algorithms produce output that is seemingly random, resulting in a difficult classification problem. By training a convolutional neural network on the dataset, an accuracy of 41\% was achieved, highlighting the difficult nature of this problem. The files created by the tools compress, lz4 and bzip2 were accurately classified, whilst others such as zip, rar and gzip produced random guesses. Future work could focus on developing a new purpose-built convolutional neural network, or exploring long short-term memory networks.
 \end{abstract}

 \section{Introduction}
@@ -255,7 +255,7 @@ \subsection{Model Training and Evaluation}

 The model's state at the fifth epoch (its best performance in terms of validation accuracy) was saved for further analysis.

-Once trained, the model achieved 41\% accuracy on the validation set. A confusion matrix of the trained model evaluated on 200 000 evenly distributed samples can be seen in figure \ref{fig:confusion-matrix}.
+Once trained, the model achieved 41\% accuracy on the validation set. A confusion matrix of the trained model evaluated on 200 000 evenly distributed samples not in the training set can be seen in figure \ref{fig:confusion-matrix}.

 \begin{figure}
 \centering
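The confusion matrix the clarified sentence refers to is straightforward to build from held-out predictions. A minimal sketch with made-up labels and predictions (the tool names come from the abstract; the data here is illustrative, not the paper's 200 000-sample evaluation):

```python
# A minimal sketch of a confusion matrix over compressor labels.
# y_true/y_pred below are illustrative stand-ins, not the paper's data.
LABELS = ["compress", "lz4", "bzip2", "zip", "rar", "gzip"]

def confusion_matrix(y_true, y_pred, labels):
    """Rows index the true label, columns the predicted label."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix

y_true = ["compress", "lz4", "bzip2", "zip", "zip", "gzip"]
y_pred = ["compress", "lz4", "bzip2", "rar", "zip", "rar"]

m = confusion_matrix(y_true, y_pred, LABELS)
accuracy = sum(m[i][i] for i in range(len(LABELS))) / len(y_true)
print(accuracy)  # 4 of 6 correct here
```

Evaluating on evenly distributed samples, as the commit's wording emphasizes, makes the diagonal of this matrix directly comparable across classes.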
@@ -329,7 +329,7 @@ \subsubsection{Conclusion Validity}
 \newpage
 \section{Conclusion}

-Compressed file fragments produced by some tool are far from random, whilst some produce data virtually indistinguishable from pseudo-random and random data. The tools compress and lz4 perform far worse than the other evaluated tools in terms of the NIST Statistical Test Suite.
+Compressed file fragments produced by some tools are far from random, whilst others produce data virtually indistinguishable from pseudo-random and random data. The tools compress and lz4 perform far worse than the other evaluated tools in terms of the NIST Statistical Test Suite.

 A CNN model may be used and trained on compressed file samples to produce classification with an accuracy of 41\%. The used model was accurately able to identify files compressed using compress, lzip and bzip2.
