Using Machine Learning for Text File Format Identification

EasyChair Preprint 4698

12 pages • Date: December 3, 2020

Santhilata Kuppili Venkata, Paul Young and Alex Green

Abstract

File format identification is a necessary step for effective digital preservation of records. It allows appropriate actions to be taken for the curation of, and access to, different file types. While binary files contain header information (metadata) about the file type that can aid identification, text files have none, so methods applied to binary file format identification are ineffective for text files. Most text formats can be opened as plain text; however, file type information is often needed to understand a file's full use and context. When huge volumes of files need to be checked, automated methods are necessary for text file format identification. A project was initiated at The National Archives to identify file types from the contents of text files using computational intelligence methods. A machine-learning-based methodology was tested and implemented using test data. The prototype, developed as a proof of concept, achieved reasonably good accuracy in detecting five file formats.
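
To make the idea of content-based identification concrete, the following is a minimal sketch of a supervised classifier for text formats. It is not the authors' method: the abstract does not specify features or a model, so this example assumes a nearest-centroid classifier over character frequencies. The feature characters, the labels `csv`, `json` and `xml`, and the tiny training set are all hypothetical choices made for illustration.

```python
from collections import defaultdict
from math import dist

# Characters whose relative frequency hints at a format.
# This feature set is an assumption for illustration only.
FEATURE_CHARS = ",\t{}<>:\""


def features(text):
    """Return the relative frequency of each feature character in the text."""
    total = max(len(text), 1)  # avoid division by zero on empty input
    return [text.count(ch) / total for ch in FEATURE_CHARS]


def train(samples):
    """Average the feature vectors per label to get one centroid per format."""
    grouped = defaultdict(list)
    for text, label in samples:
        grouped[label].append(features(text))
    return {
        label: [sum(col) / len(vecs) for col in zip(*vecs)]
        for label, vecs in grouped.items()
    }


def predict(centroids, text):
    """Return the label whose centroid is closest to the text's features."""
    vec = features(text)
    return min(centroids, key=lambda label: dist(vec, centroids[label]))


# Tiny hand-made training set (hypothetical; not the project's data).
TRAINING = [
    ("name,age,city\nann,34,leeds\nbob,29,york\n", "csv"),
    ("id,score\n1,0.5\n2,0.75\n", "csv"),
    ('{"name": "ann", "age": 34}', "json"),
    ('{"items": [{"id": 1}, {"id": 2}]}', "json"),
    ("<people><person>ann</person></people>", "xml"),
    ('<root><item id="1">a</item></root>', "xml"),
]

centroids = train(TRAINING)
print(predict(centroids, "a,b,c\n1,2,3\n"))
```

A realistic system would need far more training data, richer features (for example line structure or token patterns), and a proper evaluation on held-out files; this sketch only shows the general shape of learning format labels from file contents.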

Keyphrases: Text file formats, digital preservation, supervised learning

Links:

https://easychair.org/publications/preprint/PLSj

BibTeX entry

BibTeX does not have the right entry for preprints. This is a hack for producing the correct reference:

@booklet{EasyChair:4698,
  author    = {Santhilata Kuppili Venkata and Paul Young and Alex Green},
  title     = {Using Machine Learning for Text File Format Identification},
  howpublished = {EasyChair Preprint 4698},
  year      = {EasyChair, 2020}}
