Parsing Binary File Formats : Forbidden Arts
What it is, why, when and how?
The computing/programming field is full of small tasks that fall under what I call forbidden arts. "Forbidden" is a bad word, "dark" is maybe better: unless you're put into a situation that requires it, you have less than a 1% probability of ever needing to do one of them.
My first encounter with this problem was a year and a half ago, while building a project. Using an open-source solution is often possible and usually best, but then you need to do the task for 10,000,000 files and Python isn't really made for that. After searching for a while for an explanatory guide I came out empty-handed and mostly found breadcrumbs on how to do it.
In this post I'll try to do it live, i.e. by parsing a very complex file format using Golang and the specification. First, so it can serve as a guideline for whoever comes by later, and more importantly to remind myself of the darkness lurking behind these shiny devices :)
What is a file format actually ?
We all know that computers deal with bits, so everything is actually a sequence of bytes somewhere. Interpretation is where things start to become fuzzy: when your media player opens a file, it first verifies that it knows its format (try opening a doc file with VLC). A file format is kind of like an index that makes it possible to go to the important parts and jump around. Essentially, a file format is just a way to organize those bytes in a standard, deterministic way. Determinism is why computers actually work: you don't see your program giving you random results at each execution, although it certainly happens.
Suppose you want to write an MP3 player: you'll need a way to read MP3 files as raw bytes and then manipulate them so they can be transformed into sound. The latter part is for the hardware; the first part is what we're going to focus on.
The file format thus organizes the data so that it's recognized as what it is (image, sound, executable, text…).
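To make the "verifies that it knows its format" step concrete, here is a minimal sketch in Go of how a program might recognize a file from its first bytes. The names `signature`, `signatures` and `detectFormat` are made up for this illustration, and the table holds only a handful of well-known signatures; it is not how any particular player or tool does it.

```go
package main

import (
	"bytes"
	"fmt"
)

// signature pairs a format name with the bytes expected at offset 0.
// (Hypothetical type, introduced only for this sketch.)
type signature struct {
	name  string
	magic []byte
}

// A few well-known signatures; a real tool would carry a far longer table.
var signatures = []signature{
	{"PNG image", []byte{0x89, 'P', 'N', 'G', 0x0D, 0x0A, 0x1A, 0x0A}},
	{"PDF document", []byte("%PDF-")},
	{"ZIP archive", []byte{'P', 'K', 0x03, 0x04}},
	{"DOS/PE executable", []byte("MZ")},
}

// detectFormat returns the name of the first signature that prefixes data,
// or "unknown" when none matches.
func detectFormat(data []byte) string {
	for _, s := range signatures {
		if bytes.HasPrefix(data, s.magic) {
			return s.name
		}
	}
	return "unknown"
}

func main() {
	fmt.Println(detectFormat([]byte("%PDF-1.7\n")))          // PDF document
	fmt.Println(detectFormat([]byte{'M', 'Z', 0x90, 0x00})) // DOS/PE executable
	fmt.Println(detectFormat([]byte("hello")))               // unknown
}
```

On a real file you would read the first few bytes (for instance with os.Open and io.ReadFull) and hand them to the same function; the file name and its extension never enter the picture.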
Obscenities in computing
File formats are sometimes protected and undocumented, meaning you have no official document, such as an RFC, that tells you what lives at byte 6 or 41857. Other formats are open and documented by a specification, meaning there's an exact document that walks you through each byte and what it represents. For example, an MP3 file encodes in its frame header (4 bytes) information about version, layer, bit rate, frequency… Similarly, a PDF file encodes page numbers, fonts, titles…
In this example I chose a complex file format because I am very familiar with it. I'll also focus on parsing the header only.
Our target is the venerable PE (Portable Executable) file format, Windows' proprietary file format for .exe files. Extensions are just naming conventions; it's the actual magic that tells you what type a file is.
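As a rough illustration of what "walking through each byte" looks like, the sketch below unpacks a few fields from a 4-byte MPEG audio frame header in Go. The type `frameHeader` and the function `parseFrameHeader` are names invented for this example, the bit rate and sample rate tables cover only MPEG-1 Layer III, and the sample bytes are hand-picked. Keep in mind that a real MP3 file often starts with an ID3 tag, so the first frame is not necessarily at offset 0.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// frameHeader holds a few of the fields packed into the 32-bit header.
// (Hypothetical type, introduced only for this sketch.)
type frameHeader struct {
	Version         string
	Layer           string
	BitrateIndex    uint8 // raw 4-bit index
	SampleRateIndex uint8 // raw 2-bit index
	BitrateKbps     int   // 0 when not decoded
	SampleRateHz    int   // 0 when not decoded
}

var versions = [4]string{"MPEG 2.5", "reserved", "MPEG 2", "MPEG 1"}
var layers = [4]string{"reserved", "Layer III", "Layer II", "Layer I"}

// Lookup tables for MPEG-1 Layer III only; other version/layer pairs use
// different tables. 0 marks the "free" and invalid entries.
var mpeg1L3Bitrates = [16]int{0, 32, 40, 48, 56, 64, 80, 96, 112, 128, 160, 192, 224, 256, 320, 0}
var mpeg1SampleRates = [4]int{44100, 48000, 32000, 0}

// parseFrameHeader decodes the first 4 bytes of b as a frame header.
func parseFrameHeader(b []byte) (frameHeader, error) {
	if len(b) < 4 {
		return frameHeader{}, errors.New("need at least 4 bytes")
	}
	h := binary.BigEndian.Uint32(b)
	// The top 11 bits are the frame sync and must all be set.
	if h>>21 != 0x7FF {
		return frameHeader{}, errors.New("frame sync not found")
	}
	fh := frameHeader{
		Version:         versions[(h>>19)&0x3],
		Layer:           layers[(h>>17)&0x3],
		BitrateIndex:    uint8((h >> 12) & 0xF),
		SampleRateIndex: uint8((h >> 10) & 0x3),
	}
	if fh.Version == "MPEG 1" && fh.Layer == "Layer III" {
		fh.BitrateKbps = mpeg1L3Bitrates[fh.BitrateIndex]
		fh.SampleRateHz = mpeg1SampleRates[fh.SampleRateIndex]
	}
	return fh, nil
}

func main() {
	h, err := parseFrameHeader([]byte{0xFF, 0xFB, 0x90, 0x00})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%s %s, %d kbps, %d Hz\n", h.Version, h.Layer, h.BitrateKbps, h.SampleRateHz)
	// MPEG 1 Layer III, 128 kbps, 44100 Hz
}
```

The whole trick is shifting and masking: the specification says which bits hold which field, and the code just mirrors that layout.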