Using Machine Learning for Text File Format Identification

EasyChair Preprint no. 4698

12 pages. Date: December 3, 2020

Abstract

File format identification is a necessary step for effective digital preservation of records, because it determines which curation and access actions are appropriate for each file type. Binary files contain header information (metadata) about the file type that can aid identification; text files have none, so methods used for binary file format identification are ineffective for them. Most text formats can be opened as plain text; however, file type information is often needed to understand a file's full use and context. When huge volumes of files need to be checked, automated methods for text file format identification are necessary. A project was initiated at The National Archives to identify file types from the contents of text files using computational intelligence methods. A machine-learning-based methodology was implemented and evaluated on test data. The prototype, developed as a proof of concept, achieved reasonably good accuracy in detecting five file formats.

Keyphrases: digital preservation, supervised learning, text file formats

BibTeX entry
BibTeX does not have the right entry for preprints. This is a hack for producing the correct reference:
@Booklet{EasyChair:4698,
  author = {Santhilata Kuppili Venkata and Paul Young and Alex Green},
  title = {Using Machine Learning for Text File Format Identification},
  howpublished = {EasyChair Preprint no. 4698},
  year = {EasyChair, 2020}}