Understanding file extensions and file formats
Did you know that a file extension has little to do with the file format? A file named example.jpg is not necessarily a JPEG. It could be a Word document or even a PDF. It is simple to create a file and change its extension to whatever you like: after creating example.docx, I could rename it to example.jpg. Renaming does not convert the file to a JPEG. The contents are untouched, so the format is still a Word document.
However, if you try to open example.jpg in an image viewer, it will not load as a JPEG, because it is still a Word document.
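To see this for yourself, here is a minimal sketch that writes a few bytes to a file, renames it with a different extension, and checks that the contents are unchanged. The helper name rename_keeps_bytes and the sample bytes are made up for illustration; only the leading "PK\x03\x04" mirrors how a real .docx file begins.

```python
import os
import tempfile


def rename_keeps_bytes(data, old_name, new_name):
    """Write data to old_name, rename it to new_name, and report whether the bytes are unchanged."""
    with tempfile.TemporaryDirectory() as folder:
        old_path = os.path.join(folder, old_name)
        new_path = os.path.join(folder, new_name)
        with open(old_path, "wb") as handle:
            handle.write(data)
        os.rename(old_path, new_path)  # only the name changes, not the contents
        with open(new_path, "rb") as handle:
            return handle.read() == data


# "PK\x03\x04" is the signature a .docx file starts with; the rest is filler
print(rename_keeps_bytes(b"PK\x03\x04 pretend document", "example.docx", "example.jpg"))  # True
```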
So why does this matter if we cannot open the file?
Changing extensions is an obfuscation method. Let’s say we suspect that Hanah’s computer holds a sensitive document, nothingtoseehere.docx. Since she knows we plan to scan her computer for Word documents, she changes the sensitive document’s extension to .jpg, giving nothingtoseehere.jpg. Now, when we scan her machine for Word files by extension, we do not find the sensitive document.
How exactly do I look at the file format?
A file’s signature, also known as a magic number, tells you what format a file is in. Most binary file formats begin with a characteristic signature, although some related formats share one. To find it, read the file in hexadecimal and look at the first few bytes.
Hexdump is a Linux tool that displays a file’s contents in hexadecimal.
There are pre-existing tools that check a file’s format, but we will be going over how to do it ourselves in Python.
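As a first taste of what "looking at the beginning bytes" means, here is a small sketch that prints the start of some data as hex pairs, much as a hex viewer would. The function name first_bytes_as_hex and its count parameter are hypothetical and are not part of the tutorial's code.

```python
def first_bytes_as_hex(data, count=8):
    """Return the first `count` bytes of `data` as space-separated uppercase hex pairs."""
    return " ".join("{:02X}".format(byte) for byte in data[:count])


# The eight bytes that every PNG file starts with
png_header = b"\x89PNG\r\n\x1a\n"
print(first_bytes_as_hex(png_header))  # 89 50 4E 47 0D 0A 1A 0A
```

The first four pairs, 89 50 4E 47, are the PNG signature we will store in our dictionary later.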
Getting Started
If you want to jump right into the code on how to check for a file’s format, view it here.
Tutorial
Our program will be using the following imports:
import os
from os import path
from argparse import ArgumentParser
We need a place to store all of our file signatures of interest. There are multiple ways to go about this, but for this example, we will be using a dictionary:
signatureAndExtension = dict()
Let’s start adding known file signatures to our dictionary. You can find common ones at SANS or at https://filesignatures.net/.
Since there are multiple signatures given for JPEG, the corresponding values need to be in a list.
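signatureAndExtension = {
    "jpeg": ["FF D8 FF E1", "FF D8 FF E0", "FF D8 FF FE"],
    "png": "89 50 4E 47",
    "pdf": "25 50 44 46",
    "word": "D0 CF 11 E0",  # legacy Office files share this signature, see "xls"
    "docx": "50 4B 03 04",  # a .docx is a ZIP container, so it shares the "zip" signature
    "xls": "D0 CF 11 E0",
    "zip": "50 4B 03 04"
}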
Let’s switch gears and start working on our main script. Since the purpose of the program is to check a file’s format, we need to ask the user for the file. Before attempting to read it, it is good practice to check that the path exists and that it points to a file rather than a directory.
if __name__ == "__main__":
    # obtaining file
    fileName = input("Enter the file you would like to check: ")
    validPath = path.exists(fileName)  # check path
    validFile = path.isfile(fileName)  # check if it is a file and not a directory
    # checking if file given exists
    if validPath and validFile:
        print("*FILE FOUND*")
    else:
        print("*ERROR: FILE NOT FOUND*")
        print("\tDoes path exist?", validPath)
        print("\tIs it a file?", validFile)
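Our imports include ArgumentParser, although the main above reads the file name with input(). How the full code uses it is not shown here, so the following is only a guess at one possibility: taking the file name as a command-line argument instead. The function name build_parser and the argument name fileName are assumptions made for this sketch.

```python
from argparse import ArgumentParser


def build_parser():
    """Build a parser that accepts one positional argument: the file to check (assumed design)."""
    parser = ArgumentParser(description="Report a file's format based on its signature.")
    parser.add_argument("fileName", help="path of the file to check")
    return parser


# Passing an explicit list keeps this sketch runnable without a real command line
args = build_parser().parse_args(["nothingtoseehere.jpg"])
print(args.fileName)  # nothingtoseehere.jpg
```

In a real script you would call parse_args() with no arguments so that it reads from the command line.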
Once we have the file, we need to convert it to hex and obtain the signature. Returning fileHexNumbers[:11] accomplishes this: the signatures in our dictionary are four bytes long, and four two-digit hex pairs separated by three spaces take up 11 characters.
def convertFileToHex(fileName):
    """
    This function converts the start of a given file into hexadecimal (hex) to find the file's signature.
    :param fileName: Name of the file to be converted into hex
    :return: The file signature
    """
    # reading the first 32 bytes; the with block closes the file afterwards
    with open(fileName, "rb") as file:
        fileBytes = file.read(32)
    # converting bytes to hex pairs separated by spaces
    fileHexNumbers = " ".join(['{:02X}'.format(byte) for byte in fileBytes])
    # four bytes written as "XX XX XX XX" take up 11 characters
    return fileHexNumbers[:11]
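If the [:11] slice feels like a magic number, this sketch shows where it comes from and offers an alternative that slices the bytes before formatting them. It assumes Python 3.8 or newer, where bytes.hex() accepts a separator. The name signature_from_bytes is hypothetical.

```python
def signature_from_bytes(data, byte_count=4):
    """Format the first byte_count bytes of data as uppercase hex pairs separated by spaces."""
    # bytes.hex(sep) needs Python 3.8+
    return data[:byte_count].hex(" ").upper()


header = b"%PDF-1.7\n"

# The tutorial's approach: format everything, then keep the first 11 characters
joined = " ".join("{:02X}".format(byte) for byte in header)
print(joined[:11])                   # 25 50 44 46

# The alternative: keep the first four bytes, then format them
print(signature_from_bytes(header))  # 25 50 44 46

# Four pairs of two characters plus three spaces
print(len("25 50 44 46"))            # 11
```

Both approaches produce the same string. Slicing the bytes first makes the "four bytes" intent a little more visible.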
Let’s call this function in our main. Within the if statement, if validPath and validFile:, call convertFileToHex and store the result in a variable: fileSignature = convertFileToHex(fileName).
Since we have the signature, we need to reference our dictionary, signatureAndExtension, to check what the file format is.
We need to create another function that can perform this check. To accomplish this, search the items in the dictionary and see if any of the stored signatures matches ours. However, since one of the values in our dictionary is a list, we need to account for iterating over that list.
def find_file_format(fileSignature):
    """
    This function checks if the given signature exists in the dictionary. If it does, it will return the
    corresponding key, which is the file format.
    :param fileSignature: The signature to look up, e.g. "25 50 44 46"
    :return: a key or False if no key was found
    """
    # Begin iterating over the dict to see if any of its values matches the given fileSignature
    for key, val in signatureAndExtension.items():
        # since some of the values in the dict are lists, I need to check the type to properly search
        if isinstance(val, list):
            # check if any of the items in the list match the signature
            if fileSignature in val:
                return key  # if so return the match
        # checking if the signature matches a string in the dict
        elif fileSignature == val:
            return key  # if so return the match
    return False  # if nothing was found then return 'False' -- file format needs to be added to dict
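One thing to keep in mind: find_file_format returns the first key that matches, and our dictionary stores the same signature under more than one key ("word" and "xls", "docx" and "zip"). A plain ZIP archive would therefore be reported as "docx". The sketch below is a hypothetical variant, not part of the tutorial's code, that returns every matching key so the ambiguity is visible.

```python
# The tutorial's dictionary, repeated here so the sketch runs on its own
signatureAndExtension = {
    "jpeg": ["FF D8 FF E1", "FF D8 FF E0", "FF D8 FF FE"],
    "png": "89 50 4E 47",
    "pdf": "25 50 44 46",
    "word": "D0 CF 11 E0",
    "docx": "50 4B 03 04",
    "xls": "D0 CF 11 E0",
    "zip": "50 4B 03 04",
}


def find_all_formats(fileSignature):
    """Hypothetical variant of find_file_format that returns a list of every matching key."""
    matches = []
    for key, val in signatureAndExtension.items():
        # Wrap a lone string in a list so both shapes are searched the same way
        candidates = val if isinstance(val, list) else [val]
        if fileSignature in candidates:
            matches.append(key)
    return matches


print(find_all_formats("50 4B 03 04"))  # ['docx', 'zip']
print(find_all_formats("FF D8 FF E0"))  # ['jpeg']
print(find_all_formats("00 00 00 00"))  # []
```

An empty list plays the role that False plays in find_file_format.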
After creating the function, call find_file_format in main and pass in the variable fileSignature, such as fileFormat = find_file_format(fileSignature).
If fileFormat is False, then it was not found in our dictionary. Otherwise, we can output what the file format is.
This is what your main should look like once you have made all the function calls and added the code for the output:
if __name__ == "__main__":
    # obtaining file
    fileName = input("Enter the file you would like to check: ")
    validPath = path.exists(fileName)  # check path
    validFile = path.isfile(fileName)  # check if it is a file and not a directory
    # checking if file given exists
    if validPath and validFile:
        print("*FILE FOUND*")
        fileSignature = convertFileToHex(fileName)
        # NOTE - To see the signature in hex: print(fileSignature)
        fileFormat = find_file_format(fileSignature)
        if not fileFormat:
            print("Does not match")
        else:
            print("The format of the file is:", fileFormat)
    else:
        print("*ERROR: FILE NOT FOUND*")
        print("\tDoes path exist?", validPath)
        print("\tIs it a file?", validFile)
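To try the idea against the scenario from earlier, here is a condensed, self-contained sketch that creates a disguised file in a temporary folder and identifies it by its signature. The detect_format helper, the two-entry signatures dictionary, and the pretend file contents are all stand-ins invented for this demonstration.

```python
import os
import tempfile

# A two-entry subset of the tutorial's dictionary, enough for this demonstration
signatures = {"pdf": "25 50 44 46", "png": "89 50 4E 47"}


def detect_format(file_path):
    """Read the first four bytes of file_path and look them up in `signatures`."""
    with open(file_path, "rb") as handle:
        header = handle.read(4)
    signature = " ".join("{:02X}".format(byte) for byte in header)
    for file_format, known_signature in signatures.items():
        if signature == known_signature:
            return file_format
    return False  # signature is not in the dictionary


with tempfile.TemporaryDirectory() as folder:
    # The name says JPEG, but the contents start with the PDF signature "%PDF"
    disguised = os.path.join(folder, "nothingtoseehere.jpg")
    with open(disguised, "wb") as handle:
        handle.write(b"%PDF-1.4 pretend contents")
    print(detect_format(disguised))  # pdf
```

The extension never enters the decision, which is exactly why renaming the file does not fool the check.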
Great! You have successfully written a Python program that finds a file’s format!
Full code can be found here.