ZIPs 1/n
What’s a ZIP?
ZIPs are shit period
According to Wikipedia : ZIP is an archive file format that supports lossless data compression. A ZIP file may contain one or more files or directories that may have been compressed.
Official specification of ZIP format by PKWARE : APPNOTE.TXT
ZIPs! They are everywhere. Not only .zip but .xlsx, .docx, .pptx, ODF files, .jar, .apk, Python .whl and .egg files, Flash files, some .dat files of games, .phar, et cetera. SMH!
This list is never-ending tbh.
So, let’s start by learning about the ZIP file format! The format dates from 1989 and it’s not like all the other “normal” file types. I’d say that it’s a pointer file type (WHY? Just keep on reading!). ZIP parsing normally starts from the end of the file, and that is the trivial way of parsing ZIPs. We have at least 3 kinds of headers in a ZIP file (You find it weird? Well IKR? :P) and every header starts with the letters PK followed by 2 more bytes, which together form a 4-byte signature.
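To see where the "PK" comes from, here is a small Python sketch. The signature values are the ones quoted in the structs in this post; the helper name `signature_bytes` is just something made up for the illustration. The signatures are stored little-endian, so the low bytes 0x50 0x4B ("PK") come first on disk.

```python
import struct

# Signature values as quoted in this post, keyed by header type.
SIGNATURES = {
    "local file header": 0x04034B50,
    "central directory header": 0x02014B50,
    "end of central directory": 0x06054B50,
}

def signature_bytes(value):
    """Return the 4 bytes a little-endian 32-bit signature occupies on disk."""
    return struct.pack("<I", value)

for name, value in SIGNATURES.items():
    # Every line printed starts with b'PK' followed by two more bytes.
    print(f"{name:26} {signature_bytes(value)!r}")
```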
Basic structure is somewhat like this :
[local file header 1]
[encryption header 1]
[file data 1]
[data descriptor 1]
.
.
.
[local file header n]
[central directory header 1]
.
.
.
[central directory header n]
[end of central directory record]
The story of parsing a ZIP using the trivial method starts from the bottom or end of the file. The parser will first parse the EoCDH (End of Central Directory Header). The EoCDH looks like this :
struct EndOfCentralDirectory
{
// signature for EoCDH is : 0x06054B50
WORD DiskNumber ;
WORD CentralDirectoryStartDisk ;
WORD NumEntriesThisDisk ; // number of central directory entries on this disk
WORD NumEntries ; // total number of central directory entries
DWORD CentralDirectorySize ;
DWORD CentralDirectoryOffset ; // with respect to the starting disk number
WORD ZipCommentLength ;
char ZipComment[ZipCommentLength] ; //variable size
} ;
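Here is a rough Python sketch of that first step: find the EoCDH at the end of the file and unpack its fixed 22 bytes. The names `parse_eocd` and `demo_archive` are made up for this example, and the backwards search with `rfind` is a simplification: it assumes the signature bytes do not also appear inside the ZIP comment.

```python
import io
import struct
import zipfile

EOCD_SIGNATURE = b"PK\x05\x06"   # 0x06054B50 stored little-endian
EOCD_FORMAT = "<4sHHHHIIH"       # signature + the fixed fields, 22 bytes

def parse_eocd(data):
    """Locate the last EoCDH signature and unpack its fixed fields."""
    pos = data.rfind(EOCD_SIGNATURE)
    if pos < 0:
        raise ValueError("no end of central directory record found")
    (_sig, _disk, _cd_disk, _entries_here, entries_total,
     cd_size, cd_offset, comment_len) = struct.unpack_from(EOCD_FORMAT, data, pos)
    start = pos + struct.calcsize(EOCD_FORMAT)
    return {
        "eocd_offset": pos,
        "entries": entries_total,
        "cd_size": cd_size,
        "cd_offset": cd_offset,
        "comment": data[start:start + comment_len],
    }

def demo_archive():
    """Build a tiny two-file archive in memory to parse."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("a.txt", "hello")
        zf.writestr("b.txt", "world")
        zf.comment = b"demo comment"
    return buf.getvalue()

print(parse_eocd(demo_archive()))
```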
You see it stores the offset for the second header called the Central Directory Header (CDH). CDH looks something like this :
struct CentralDirectoryFileHeader
{
// signature for the CDH is : 0x02014b50
WORD VersionMadeBy ;
WORD VersionNeededToExtract ;
WORD GeneralPurposeBitFlag ;
WORD CompressionMethod ;
WORD LastModFileTime ;
WORD LastModFileDate ;
DWORD Crc32 ;
DWORD CompressedSize ;
DWORD UncompressedSize ;
WORD FileNameLength ;
WORD ExtraFieldLength ;
WORD FileCommentLength ;
WORD DiskNumberStart ;
WORD InternalFileAttributes ;
DWORD ExternalFileAttributes ;
DWORD RelativeOffsetOfLocalHeader ;
char FileName[FileNameLength] ;
char ExtraField[ExtraFieldLength] ; //variable length
char FileComment[FileCommentLength] ; // variable length
} ;
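Walking the central directory could look something like the sketch below. It is only an illustration: `central_directory` is a made-up helper, file names are decoded as CP437 for simplicity, and ZIP64 and multi-disk archives are ignored.

```python
import io
import struct
import zipfile

EOCD_FORMAT = "<4sHHHHIIH"   # 22 bytes
CDH_FORMAT = "<4s6H3I5H2I"   # signature + fixed CDH fields, 46 bytes

def central_directory(data):
    """Return one dict per CDH entry, following the offset stored in the EoCDH."""
    eocd = data.rfind(b"PK\x05\x06")
    if eocd < 0:
        raise ValueError("no end of central directory record found")
    eocd_fields = struct.unpack_from(EOCD_FORMAT, data, eocd)
    total, pos = eocd_fields[4], eocd_fields[6]   # NumEntries, CentralDirectoryOffset
    entries = []
    for _ in range(total):
        f = struct.unpack_from(CDH_FORMAT, data, pos)
        if f[0] != b"PK\x01\x02":
            raise ValueError(f"bad CDH signature at offset {pos:#x}")
        name_len, extra_len, comment_len = f[10], f[11], f[12]
        entries.append({
            "name": data[pos + 46:pos + 46 + name_len].decode("cp437"),
            "method": f[4],
            "compressed_size": f[8],
            "uncompressed_size": f[9],
            "lfh_offset": f[16],   # RelativeOffsetOfLocalHeader
        })
        # The next entry starts after the three variable-length fields.
        pos += 46 + name_len + extra_len + comment_len
    return entries

buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("a.txt", "hello")
    zf.writestr("b.txt", "world!")
for entry in central_directory(buf.getvalue()):
    print(entry)
```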
Now you can see that there’s an offset for the local file header or LFH. Basically our compressed or stored data sits right after this header (the struct below shows it as the last member). It looks like this :
struct LocalFileHeader
{
// signature for the LFH is : 0x04034b50
WORD VersionNeededToExtract ;
WORD GeneralPurposeBitFlag ;
WORD CompressionMethod ;
WORD LastModFileTime ;
WORD LastModFileDate ;
DWORD Crc32 ;
DWORD CompressedSize ;
DWORD UncompressedSize ;
WORD FileNameLength ;
WORD ExtraFieldLength ;
char FileName[FileNameLength] ;
char ExtraField[ExtraFieldLength] ;
char FileData[CompressedSize] ;
} ;
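A minimal sketch of reading one LFH and pulling the file data out of it, assuming the sizes are present in the header (general purpose bit 3 not set) and the entry is either stored or deflated. `read_local_entry` is a hypothetical helper name, not something from the spec.

```python
import io
import struct
import zipfile
import zlib

LFH_FORMAT = "<4s5H3I2H"   # signature + fixed LFH fields, 30 bytes

def read_local_entry(data, offset=0):
    """Parse the LFH at `offset` and return (file name, decompressed data)."""
    (sig, _version, flags, method, _time, _date, crc,
     csize, _usize, name_len, extra_len) = struct.unpack_from(LFH_FORMAT, data, offset)
    if sig != b"PK\x03\x04":
        raise ValueError("not a local file header")
    if flags & 0x08:
        # Sizes and CRC are stored in a data descriptor after the data instead.
        raise NotImplementedError("data descriptor entries are not handled here")
    name_start = offset + 30
    data_start = name_start + name_len + extra_len
    raw = data[data_start:data_start + csize]
    if method == 0:            # stored
        payload = raw
    elif method == 8:          # deflate, a raw stream without zlib header
        payload = zlib.decompress(raw, -15)
    else:
        raise NotImplementedError(f"compression method {method}")
    if zlib.crc32(payload) != crc:
        raise ValueError("CRC-32 mismatch")
    return data[name_start:name_start + name_len].decode("cp437"), payload

buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("hello.txt", "hello hello hello hello")
print(read_local_entry(buf.getvalue()))
```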
Hush! Such a weird structure. This is how it looks when you view a ZIP in a hex editor :
Notice that here the file name (marked green) in LFH (marked blue from 0x00 to 0x29) is 0x41.txt and the file data (marked black) is “test”. The CDH is from 0x2A to 0x5F and the EoCDH is from 0x60 to 0x75.
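If you want to reproduce those offsets yourself, here is a sketch that builds the same kind of archive (one stored file named 0x41.txt containing "test") with Python's zipfile module and computes where each part begins and ends. The helper `layout` is made up for this post, and the exact bytes assume zipfile adds no extra fields for such a tiny in-memory archive.

```python
import io
import struct
import zipfile

def layout(data):
    """Return the (first, last) byte offsets of each part of a one-file ZIP."""
    csize, = struct.unpack_from("<I", data, 18)             # CompressedSize in the LFH
    name_len, extra_len = struct.unpack_from("<HH", data, 26)
    lfh_end = 30 + name_len + extra_len + csize             # header + name + extra + data
    eocd = data.rfind(b"PK\x05\x06")
    cd_size, cd_offset = struct.unpack_from("<II", data, eocd + 12)
    return {
        "LFH + data": (0, lfh_end - 1),
        "CDH": (cd_offset, cd_offset + cd_size - 1),
        "EoCDH": (eocd, len(data) - 1),
    }

buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_STORED) as zf:
    zf.writestr("0x41.txt", "test")

for part, (first, last) in layout(buf.getvalue()).items():
    print(f"{part:10} 0x{first:02X} to 0x{last:02X}")
```

The arithmetic behind it: the LFH is 30 fixed bytes plus the 8-byte name, then 4 bytes of data (0x2A bytes in total), the CDH is 46 fixed bytes plus the name (0x36 bytes), and the EoCDH with an empty comment is 22 bytes.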
If you look closely, you can see that in the CDH you can define external attributes. Now according to APPNOTE : The mapping of the external attributes is host-system dependent (see 'version made by'). For MS-DOS, the low order byte is the MS-DOS directory attribute byte. If input came from standard input, this field is set to zero.
The host system is stored in the upper byte of 'version made by', and the current mappings for it are :
0 - MS-DOS and OS/2 (FAT / VFAT / FAT32 file systems)
1 - Amiga
2 - OpenVMS
3 - UNIX
4 - VM/CMS
5 - Atari ST
6 - OS/2 H.P.F.S.
7 - Macintosh
8 - Z-System
9 - CP/M
10 - Windows NTFS
11 - MVS (OS/390 - Z/OS)
12 - VSE
13 - Acorn Risc
14 - VFAT
15 - alternate MVS
16 - BeOS
17 - Tandem
18 - OS/400
19 - OS X (Darwin)
20 thru 255 - unused
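To make the "host-system dependent" part concrete, here is a small sketch using Python's zipfile module, which exposes the host byte as `ZipInfo.create_system` and the attributes as `ZipInfo.external_attr`. The `describe` helper and the shortened `HOSTS` table are made up for the example. Note that treating the upper 16 bits as the Unix file mode is a widespread convention for host 3 rather than something quoted above from APPNOTE.

```python
import io
import stat
import zipfile

# A few entries from the mapping table above.
HOSTS = {0: "MS-DOS and OS/2", 3: "UNIX", 10: "Windows NTFS", 19: "OS X (Darwin)"}

def describe(info):
    """Interpret external attributes according to the 'version made by' host."""
    host = HOSTS.get(info.create_system, f"host {info.create_system}")
    if info.create_system == 3:
        # Assumption: Unix tools keep the st_mode value in the upper 16 bits.
        return f"{host}: {stat.filemode(info.external_attr >> 16)}"
    # For MS-DOS style hosts the low byte is the DOS directory attribute byte.
    return f"{host}: DOS attribute byte 0x{info.external_attr & 0xFF:02X}"

entry = zipfile.ZipInfo("run.sh")
entry.create_system = 3                  # UNIX
entry.external_attr = 0o100755 << 16     # regular file, rwxr-xr-x

buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr(entry, "#!/bin/sh\necho hi\n")

with zipfile.ZipFile(buf) as zf:
    print(describe(zf.getinfo("run.sh")))
```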
* But, how am I supposed to correctly parse a ZIP?
Parsing a ZIP looks simple, but it’s not that simple a task! The trivial way of parsing a ZIP goes from the bottom to the top of the file, looking for all the headers. The parser first finds the EoCDH, reads the offset of the CDH from it, and then follows each CDH entry to its LFH.
Note that when you only list a ZIP file, the name stored in the CDH is used. But when you unzip a ZIP file, the name stored inside the LFH is used.
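The two names really are stored separately, which a quick experiment can show. The sketch below builds an archive, then overwrites only the LFH copy of the name, so the CDH and LFH disagree. The helper names are made up, and which name a given tool ends up using is up to that tool.

```python
import io
import struct
import zipfile

def build_mismatched_zip():
    """Create a one-file ZIP whose LFH name differs from its CDH name."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("cdh_name.txt", "test")
    data = buf.getvalue()
    # The first occurrence of the name is the LFH copy (at offset 30);
    # the CDH copy further down is left untouched.
    return data.replace(b"cdh_name.txt", b"lfh_name.txt", 1)

def name_in_cdh(data):
    """What a listing sees: zipfile reads names from the central directory."""
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        return zf.namelist()[0]

def name_in_lfh(data, offset=0):
    """What sits in the local file header at `offset`."""
    name_len, = struct.unpack_from("<H", data, offset + 26)
    return data[offset + 30:offset + 30 + name_len].decode("cp437")

data = build_mismatched_zip()
print("CDH says:", name_in_cdh(data))
print("LFH says:", name_in_lfh(data))
```

Python's own zipfile module, for instance, compares the two names when it opens a member and raises BadZipFile if they differ, so not every parser picks one name silently.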
But yeah, there are other ways of parsing a ZIP too! There is Top-Bottom Parsing: we know that the juice is in the LFH only, so why not just parse the local file headers and unzip them? Let’s call parsers that do this Stream Parsers. And then we have aggressive parsing.
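A stream parser in this sense can be sketched in a few lines of Python: start at offset 0 and keep reading local file headers until something else shows up, never looking at the CDH or EoCDH. This is only an illustration under some assumptions: `stream_parse` is a made-up name, only stored and deflated entries are handled, and entries that keep their sizes in a data descriptor are rejected.

```python
import io
import struct
import zipfile
import zlib

LFH_FORMAT = "<4s5H3I2H"   # signature + fixed LFH fields, 30 bytes

def stream_parse(data):
    """Yield (name, file data) for consecutive local file headers from the top."""
    pos = 0
    while data[pos:pos + 4] == b"PK\x03\x04":
        (_sig, _version, flags, method, _time, _date, _crc,
         csize, _usize, name_len, extra_len) = struct.unpack_from(LFH_FORMAT, data, pos)
        if flags & 0x08:
            raise NotImplementedError("sizes live in a data descriptor")
        name_start = pos + 30
        data_start = name_start + name_len + extra_len
        raw = data[data_start:data_start + csize]
        if method == 0:        # stored
            payload = raw
        elif method == 8:      # raw deflate stream
            payload = zlib.decompress(raw, -15)
        else:
            raise NotImplementedError(f"compression method {method}")
        yield data[name_start:name_start + name_len].decode("cp437"), payload
        pos = data_start + csize   # the next header follows the file data

buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("a.txt", "hello")
    zf.writestr("b.txt", "world" * 10, compress_type=zipfile.ZIP_DEFLATED)
data = buf.getvalue()

# Cut off the central directory and EoCDH: the stream parser does not need them.
cd_offset, = struct.unpack_from("<I", data, data.rfind(b"PK\x05\x06") + 16)
for name, payload in stream_parse(data[:cd_offset]):
    print(name, payload)
```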
Many of you