PDF file carving with Python

File carving is a useful technique when the file system is not recognized. In some cases, the imaged device contains only a large chunk of raw data rather than something structured, such as a FAT16 USB drive. Uncommon file systems may also be unsupported by traditional forensics tools. In those situations, file carving is a good option for retrieving the data. However, you need to know the header and footer of the file type(s) you wish to carve. A great file signature table can be found here.

As an example, this article will focus on carving out PDF documents. The file header is 25 50 44 46 (ASCII %PDF) and the footer is 25 25 45 4f 46 (ASCII %%EOF). The interesting data is located between the header and the footer, so a regular expression (regex) is used to capture everything between these two values. However, be aware that some PDF documents require a different footer than the one demonstrated in this article.
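To see what these signatures stand for, the hex values can be decoded back to ASCII. The following is a minimal sketch; the constant names are chosen for illustration and are not part of the script at the end of the article.

```python
import binascii

# Signatures quoted in the article, written as hex strings.
PDF_HEADER_HEX = "25504446"
PDF_FOOTER_HEX = "2525454f46"

# Decode each signature to reveal the ASCII marker it represents.
header_bytes = binascii.unhexlify(PDF_HEADER_HEX)
footer_bytes = binascii.unhexlify(PDF_FOOTER_HEX)

print(header_bytes)  # b'%PDF'
print(footer_bytes)  # b'%%EOF'
```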

A USB from the previous article is used as a demonstration. The image above shows that xxd detects two PDF headers.
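The same check can be approximated in Python by searching the raw image for the header bytes. This is a rough sketch using a small synthetic byte string in place of the real USB image; find_offsets and sample_image are hypothetical names used only for this example.

```python
from typing import List

def find_offsets(data: bytes, signature: bytes) -> List[int]:
    """Return every byte offset at which signature occurs in data."""
    offsets = []
    position = data.find(signature)
    while position != -1:
        offsets.append(position)
        # Continue searching just past the current hit.
        position = data.find(signature, position + 1)
    return offsets

# Synthetic stand-in for an image: padding, a fake PDF, padding, a second fake PDF.
sample_image = (
    b"\x00" * 16
    + b"%PDF-1.4 first %%EOF"
    + b"\xff" * 8
    + b"%PDF-1.7 second %%EOF"
)

header_offsets = find_offsets(sample_image, b"%PDF")
print(header_offsets)  # [16, 44]
```

For a real image, sample_image would be replaced by the bytes read from the .img file in binary mode.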

The challenge is to extract only the relevant data. The image above illustrates the structure of the USB. Regex can be used to pick out the relevant hexadecimal values and skip the irrelevant data. The pattern 25504446(.*?)2525454f46 matches the header, the footer, and, non-greedily, the data in between; the parentheses capture only that in-between data. Python's re.findall returns a list with one captured string per match, so the two PDFs on this USB produce a list of two entries. Adding the header and footer back onto each entry gives the complete content to write out as a PDF document. For practising regex and ensuring that the syntax is correct, try Regex 101.
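The behaviour of findall with this pattern can be tried on a small synthetic input. The sketch below assumes a made-up byte string (sample_image) containing two fake PDFs surrounded by junk; it is not the USB image from the article.

```python
import binascii
import re

# Two fake "PDFs" surrounded by unrelated bytes.
sample_image = (
    b"junk"
    + b"%PDF-1.4 one %%EOF"
    + b"more junk"
    + b"%PDF-1.4 two %%EOF"
    + b"tail"
)

# hexlify returns lowercase hex as bytes; decode it to a str for the regex.
hexdump = binascii.hexlify(sample_image).decode()

# One capture group, so findall returns a list of the captured strings.
matches = re.findall("25504446(.*?)2525454f46", hexdump)
print(len(matches))  # 2

# Convert each captured hex string back to bytes to inspect it.
bodies = [binascii.unhexlify(match) for match in matches]
print(bodies)  # [b'-1.4 one ', b'-1.4 two ']
```

Note that each entry holds only the data between the signatures, which is why the header and footer have to be added back before writing the file.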

The code snippet above demonstrates how regex is used to extract the relevant data before assembling them together into two separate PDF documents. The full Python code is found at the end of the article.
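One caveat with searching a hex string is that a pattern can, in principle, match at an odd character position, which does not correspond to a byte boundary. A possible variation, shown here as a sketch rather than as the method used in this article, applies the same idea directly to the raw bytes. The names PDF_PATTERN, carve_pdfs, and sample_image are hypothetical.

```python
import re
from typing import List

# Bytes pattern: header, shortest possible body, footer.
# re.DOTALL lets "." match newline bytes, which are common in binary data.
PDF_PATTERN = re.compile(rb"%PDF.*?%%EOF", re.DOTALL)

def carve_pdfs(data: bytes) -> List[bytes]:
    """Return every header-to-footer span found in data, signatures included."""
    return PDF_PATTERN.findall(data)

# Synthetic stand-in for an image file.
sample_image = (
    b"\x00\x01"
    + b"%PDF-1.4\nbody one\n%%EOF"
    + b"\xff\n\xfe"
    + b"%PDF-1.4\nbody two\n%%EOF"
)

carved = carve_pdfs(sample_image)
print(len(carved))  # 2
```

Because the pattern has no capture group, each result already contains the header and footer and can be written to disk as-is.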

Executing the script carved out the two PDF documents and ignored the irrelevant data. Both documents are fully functional and open normally in a PDF viewer.
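Before opening carved files in a viewer, a quick signature check can flag obviously broken output. The helper below, looks_like_pdf, is a hypothetical name for a minimal sketch; it only confirms the header and footer markers and does not validate the PDF structure itself.

```python
def looks_like_pdf(data: bytes) -> bool:
    """Return True if data begins with %PDF and ends with %%EOF."""
    # Tolerate trailing newlines after %%EOF, since many PDF writers add one.
    trimmed = data.rstrip(b"\r\n")
    return data.startswith(b"%PDF") and trimmed.endswith(b"%%EOF")

# Synthetic examples standing in for carved output.
carved_ok = b"%PDF-1.4\nexample body\n%%EOF"
carved_truncated = b"%PDF-1.4\nexample body"

print(looks_like_pdf(carved_ok))         # True
print(looks_like_pdf(carved_truncated))  # False
```

To check a file on disk, its contents would first be read in binary mode and then passed to the function.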

Conclusion

File carving is a powerful method for retrieving specific files when the file system is not recognized. An investigator may face uncommon file types or file systems, or large chunks of unstructured data. In those cases, knowing the file header and footer is what makes it possible to carve out the data. If you would like to look up an equivalent, powerful open-source tool, try Scalpel.

Python code

"""
About: A Python script which carves out PDFs from an .img file
Version: Python 3.8.3
Author: Erik D @ Toxicsolutions.net
"""

import binascii, re, os, sys

# Used to create a fancy coloured output
class colour:
	GREEN = "\033[92m"
	YELLOW = "\033[93m"
	RED = "\033[91m"
	END = "\033[0m"

def main():

	# Checks if the user has selected the targeted .img file
	try:
		img_path = sys.argv[1]
	except IndexError:
		print(colour.RED + "[!]" + colour.END, "No .img file provided")
		print(colour.YELLOW + "[?]" + colour.END, "Usage: python3 script.py /path/to/file.img")
		sys.exit(1)

	# Opens and reads the .img file
	try:
		with open(img_path, "rb") as file:
			picture = file.read()
	except FileNotFoundError:
		print(colour.RED + "[!]" + colour.END, "File was not found")
		sys.exit(1)

	print(colour.YELLOW + "[-]" + colour.END, "Dumping .img file to hex")

	# Converts the raw bytes of the image into a hexadecimal string
	hexdump = binascii.hexlify(picture)

	# PDF file header and file footer signatures
	file_header = "25504446"
	file_footer = "2525454f46"

	# Finds the values between the file header and file footer
	# This includes all PDFs located inside the .img file
	content = re.findall(file_header + "(.*?)" + file_footer, hexdump.decode())

	# If any PDF headers and footers are detected: proceed. Else: abort
	if content:
		print(colour.GREEN + "[+]" + colour.END, "Found", len(content), "PDF documents")
		# Resolves the user's home directory in a platform-independent way
		home_directory = os.path.expanduser("~")

		for counter, body in enumerate(content, start=1):

			# Adds the header, content, and footer together to create the entire file
			full_pdf_hex = file_header + body + file_footer

			# Converts the hex string back into raw bytes
			pdf_bytes = binascii.a2b_hex(full_pdf_hex)

			# Builds the output path: <home directory>/<number>.pdf
			output_file_path = os.path.join(home_directory, str(counter) + ".pdf")

			# Opens and writes the output as a PDF document
			try:
				with open(output_file_path, "wb") as output_file:
					output_file.write(pdf_bytes)
				print(colour.GREEN + "[+]" + colour.END, "Carved and saved at", output_file_path)
			except OSError:
				print(colour.RED + "[!]" + colour.END, "Error writing file as", output_file_path)
	else:
		print(colour.YELLOW + "[!]" + colour.END, "No PDFs detected")
	
if __name__ == "__main__":
	main()