What’s a ZIP?

ZIPs are shit period

According to Wikipedia: ZIP is an archive file format that supports lossless data compression. A ZIP file may contain one or more files or directories that may have been compressed.

The official specification of the ZIP format, by PKWARE: APPNOTE.TXT

ZIPs! They are everywhere. Not only .zip but also .xlsx, .docx, .pptx, ODF files, .jar, .apk, Python .whl and .egg files, Flash files, the .dat files of some games, .phar, et cetera. SMH!

This list is never ending tbh.

So, let’s start by learning about the ZIP file format! The format is from 1989 and it’s not like all the other “normal” file types. I’d say that it’s a pointer file type (WHY? Just keep on reading!). ZIP parsing starts from the end of the file, and that’s the trivial way of parsing ZIPs. We have at least 3 headers in a ZIP file (You find it weird? Well IKR? :P) and every header starts with the letters PK followed by 2 more bytes.
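To make the "PK plus 2 bytes" thing concrete, here is a tiny sketch that turns the three signature values from APPNOTE into the bytes you would actually see on disk. The helper name `signature_bytes` is mine, not something from the spec.

```python
import struct

# Signature values as listed in APPNOTE.TXT; on disk they are little-endian.
SIGNATURES = {
    "local file header": 0x04034B50,
    "central directory header": 0x02014B50,
    "end of central directory record": 0x06054B50,
}

def signature_bytes(value: int) -> bytes:
    """Return the 4 bytes a 32-bit signature occupies in the file."""
    return struct.pack("<I", value)

for name, value in SIGNATURES.items():
    print(f"{name}: {signature_bytes(value)!r}")
```

Because the values are stored little-endian, 0x4B50 comes out first, and that is exactly the ASCII pair "PK".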

The basic structure looks somewhat like this:

      [local file header 1]
      [encryption header 1]
      [file data 1]
      [data descriptor 1]
      . 
      .
      .
      [local file header n]

      [central directory header 1]
      .
      .
      .
      [central directory header n]

      [end of central directory record]

The story of parsing a ZIP using the trivial method starts from the bottom, or end, of the file. The parser will first parse the EoCDH (End of Central Directory Header). The EoCDH looks like this:

struct EndOfCentralDirectory
{
    // signature for EoCDH is : 0x06054B50
    WORD  DiskNumber ;
    WORD  CentralDirectoryStartDisk ;
    WORD  NumEntriesThisDisk ; // number of CDH entries on this disk
    WORD  NumEntries ; // total number of CDH entries
    DWORD CentralDirectorySize ;
    DWORD CentralDirectoryOffset ; // with respect to the starting disk number
    WORD  ZipCommentLength ;
    char  ZipComment[ZipCommentLength] ; // variable size
} ;
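Here is a rough sketch of reading that record with Python's `struct` module. The field names mirror the struct above, with the third WORD named `NumEntriesThisDisk` as APPNOTE defines it. `parse_eocd` and the extra `Position` key are just my own helpers, and the backwards search is naive: a signature lookalike inside the ZIP comment would fool it.

```python
import io
import struct
import zipfile

# Signature plus the fixed-size fields of the record (22 bytes in total).
EOCD_FORMAT = "<IHHHHIIH"
EOCD_SIZE = struct.calcsize(EOCD_FORMAT)
EOCD_FIELDS = (
    "Signature", "DiskNumber", "CentralDirectoryStartDisk",
    "NumEntriesThisDisk", "NumEntries", "CentralDirectorySize",
    "CentralDirectoryOffset", "ZipCommentLength",
)

def parse_eocd(data: bytes) -> dict:
    """Find the last EoCDH signature in the file and decode the record."""
    pos = data.rfind(b"PK\x05\x06")
    if pos < 0 or pos + EOCD_SIZE > len(data):
        raise ValueError("no end of central directory record found")
    record = dict(zip(EOCD_FIELDS, struct.unpack_from(EOCD_FORMAT, data, pos)))
    start = pos + EOCD_SIZE
    record["ZipComment"] = data[start:start + record["ZipCommentLength"]]
    record["Position"] = pos  # where the record starts (not a ZIP field)
    return record

# Build a one-file archive in memory to have something to parse.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("0x41.txt", "test")

eocd = parse_eocd(buf.getvalue())
print(eocd)
```

For a plain single-disk archive like this one, `CentralDirectoryOffset + CentralDirectorySize` should land exactly on the EoCDH itself.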

As you can see, it stores the offset of the second header, called the Central Directory Header (CDH). The CDH looks something like this:

struct CentralDirectoryFileHeader
{
    // signature for the CDH is : 0x02014b50
    WORD VersionMadeBy ;
    WORD VersionNeededToExtract ;
    WORD GeneralPurposeBitFlag ;
    WORD CompressionMethod ;
    WORD LastModFileTime ;
    WORD LastModFileDate ;
    DWORD Crc32 ;
    DWORD CompressedSize ;
    DWORD UncompressedSize ;
    WORD  FileNameLength ;
    WORD  ExtraFieldLength ;
    WORD  FileCommentLength ;
    WORD  DiskNumberStart ;
    WORD  InternalFileAttributes ;
    DWORD ExternalFileAttributes ;
    DWORD RelativeOffsetOfLocalHeader ;
    char  FileName[FileNameLength] ;
    char  ExtraField[ExtraFieldLength] ; //variable length
    char  FileComment[FileCommentLength] ; // variable length
} ;
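The same trick works for the CDH. This is only a sketch: `parse_cdh` is a made-up helper, it assumes a single-disk, non-ZIP64 archive, and it grabs the central directory offset straight from byte 16 of the EoCDH.

```python
import io
import struct
import zipfile

# Signature plus the fixed-size fields of the struct above (46 bytes).
CDH_FORMAT = "<IHHHHHHIIIHHHHHII"
CDH_SIZE = struct.calcsize(CDH_FORMAT)
CDH_FIELDS = (
    "Signature", "VersionMadeBy", "VersionNeededToExtract",
    "GeneralPurposeBitFlag", "CompressionMethod", "LastModFileTime",
    "LastModFileDate", "Crc32", "CompressedSize", "UncompressedSize",
    "FileNameLength", "ExtraFieldLength", "FileCommentLength",
    "DiskNumberStart", "InternalFileAttributes", "ExternalFileAttributes",
    "RelativeOffsetOfLocalHeader",
)

def parse_cdh(data: bytes, offset: int):
    """Decode one CDH at `offset`; return (record, offset of the next record)."""
    record = dict(zip(CDH_FIELDS, struct.unpack_from(CDH_FORMAT, data, offset)))
    if record["Signature"] != 0x02014B50:
        raise ValueError("not a central directory header")
    pos = offset + CDH_SIZE
    # The three variable-length fields follow the fixed part, in this order.
    for key, length_key in (("FileName", "FileNameLength"),
                            ("ExtraField", "ExtraFieldLength"),
                            ("FileComment", "FileCommentLength")):
        record[key] = data[pos:pos + record[length_key]]
        pos += record[length_key]
    return record, pos

buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("0x41.txt", "test")
data = buf.getvalue()

eocd_pos = data.rfind(b"PK\x05\x06")
(cd_offset,) = struct.unpack_from("<I", data, eocd_pos + 16)
cdh, next_offset = parse_cdh(data, cd_offset)
print(cdh["FileName"], hex(cdh["RelativeOffsetOfLocalHeader"]))
```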

Now you can see that there’s an offset to the local file header, or LFH. Basically, our compressed or stored data sits right after this header (LFH). It looks like this:

struct LocalFileHeader
{
    // signature for the LFH is : 0x04034b50 
    WORD VersionNeededToExtract ;
    WORD GeneralPurposeBitFlag ;
    WORD CompressionMethod ;
    WORD LastModFileTime ;
    WORD LastModFileDate ;
    DWORD Crc32 ;
    DWORD CompressedSize ;
    DWORD UncompressedSize ;
    WORD  FileNameLength ;
    WORD  ExtraFieldLength ;
    char  FileName[FileNameLength] ;
    char  ExtraField[ExtraFieldLength] ;
    char  FileData[CompressedSize] ; // the data itself, right after the header fields
} ;
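And one more sketch for the LFH, this time pulling out the file data as well. `parse_lfh` is a hypothetical helper. It trusts `CompressedSize` from the header, which breaks when bit 3 of the general purpose flag is set (the sizes then live in a data descriptor after the data).

```python
import io
import struct
import zipfile

# Signature plus the fixed-size fields of the struct above (30 bytes).
LFH_FORMAT = "<IHHHHHIIIHH"
LFH_SIZE = struct.calcsize(LFH_FORMAT)
LFH_FIELDS = (
    "Signature", "VersionNeededToExtract", "GeneralPurposeBitFlag",
    "CompressionMethod", "LastModFileTime", "LastModFileDate", "Crc32",
    "CompressedSize", "UncompressedSize", "FileNameLength", "ExtraFieldLength",
)

def parse_lfh(data: bytes, offset: int) -> dict:
    """Decode the LFH at `offset` together with the raw file data behind it."""
    record = dict(zip(LFH_FIELDS, struct.unpack_from(LFH_FORMAT, data, offset)))
    if record["Signature"] != 0x04034B50:
        raise ValueError("not a local file header")
    pos = offset + LFH_SIZE
    record["FileName"] = data[pos:pos + record["FileNameLength"]]
    pos += record["FileNameLength"]
    record["ExtraField"] = data[pos:pos + record["ExtraFieldLength"]]
    pos += record["ExtraFieldLength"]
    # Still compressed if CompressionMethod is not 0 (stored).
    record["FileData"] = data[pos:pos + record["CompressedSize"]]
    return record

buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_STORED) as zf:
    zf.writestr("0x41.txt", "test")

lfh = parse_lfh(buf.getvalue(), 0)
print(lfh["FileName"], lfh["FileData"])
```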

Hush! Such a weird structure. This is how it looks when you view a ZIP in a hex editor :

Notice that here the file name (marked green) in LFH (marked blue from 0x00 to 0x29) is 0x41.txt and the file data (marked black) is “test”. The CDH is from 0x2A to 0x5F and the EoCDH is from 0x60 to 0x75.
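If you want to poke at the same bytes yourself, the sketch below builds an equivalent archive with Python's `zipfile` (one stored file named 0x41.txt containing "test") and works out where each record sits. `record_ranges` is my own helper. It assumes `zipfile` writes no extra fields or comments, which as far as I can tell holds for a tiny archive like this.

```python
import io
import struct
import zipfile

buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_STORED) as zf:
    zf.writestr("0x41.txt", "test")
data = buf.getvalue()

def record_ranges(data: bytes) -> dict:
    """Return the (first, last) byte offsets of each part of a one-file ZIP."""
    eocd = data.rfind(b"PK\x05\x06")
    # The central directory offset sits 16 bytes into the EoCDH.
    (cd_offset,) = struct.unpack_from("<I", data, eocd + 16)
    return {
        "LFH + file data": (0, cd_offset - 1),
        "CDH": (cd_offset, eocd - 1),
        "EoCDH": (eocd, len(data) - 1),
    }

ranges = record_ranges(data)
for label, (first, last) in ranges.items():
    print(f"{label}: 0x{first:02X} to 0x{last:02X}")
```

The sizes add up nicely: 30 bytes of LFH plus 8 for the name plus 4 for the data, then 46 plus 8 for the CDH, then 22 for the EoCDH.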

If you look closely, you can see that the CDH lets you define external attributes. According to APPNOTE: "The mapping of the external attributes is host-system dependent (see 'version made by'). For MS-DOS, the low order byte is the MS-DOS directory attribute byte. If input came from standard input, this field is set to zero." The host system is given by the upper byte of 'version made by', and the current mappings are:

0 - MS-DOS and OS/2 (FAT / VFAT / FAT32 file systems)
1 - Amiga                     2 - OpenVMS
3 - UNIX                      4 - VM/CMS
5 - Atari ST                  6 - OS/2 H.P.F.S.
7 - Macintosh                 8 - Z-System
9 - CP/M                      10 - Windows NTFS
11 - MVS (OS/390 - Z/OS)      12 - VSE
13 - Acorn Risc               14 - VFAT
15 - alternate MVS            16 - BeOS
17 - Tandem                   18 - OS/400
19 - OS X (Darwin)            20 thru 255 - unused
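A small sketch of how you might decode these two fields. The upper byte of 'version made by' is the host system and the lower byte is the spec version (value / 10 is the major, value mod 10 the minor), both per APPNOTE. Treating the high 16 bits of the external attributes as a Unix mode is a widely used convention rather than something APPNOTE spells out, so take `unix_mode` as an assumption. All three function names are mine.

```python
import stat

# Host system codes from the table above.
HOST_SYSTEMS = {
    0: "MS-DOS and OS/2 (FAT / VFAT / FAT32 file systems)",
    1: "Amiga", 2: "OpenVMS", 3: "UNIX", 4: "VM/CMS", 5: "Atari ST",
    6: "OS/2 H.P.F.S.", 7: "Macintosh", 8: "Z-System", 9: "CP/M",
    10: "Windows NTFS", 11: "MVS (OS/390 - Z/OS)", 12: "VSE",
    13: "Acorn Risc", 14: "VFAT", 15: "alternate MVS", 16: "BeOS",
    17: "Tandem", 18: "OS/400", 19: "OS X (Darwin)",
}

def host_system(version_made_by: int) -> str:
    """Upper byte of 'version made by': the system the attributes belong to."""
    return HOST_SYSTEMS.get(version_made_by >> 8, "unused")

def spec_version(version_made_by: int) -> tuple:
    """Lower byte of 'version made by' as (major, minor)."""
    return divmod(version_made_by & 0xFF, 10)

def unix_mode(external_attributes: int) -> int:
    """High 16 bits of the external attributes (assumed Unix st_mode)."""
    return external_attributes >> 16

print(host_system(0x031E), spec_version(0x031E))
print(stat.filemode(unix_mode(0o100644 << 16)))
```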

* But, how am I supposed to correctly parse a ZIP?

Parsing a ZIP looks simple, but it’s not that simple a task! The trivial way of parsing a ZIP is to go from the bottom to the top of the file, looking for all the headers. The parser first finds the EoCDH, from there it follows the offset to the CDH, and then it finds the LFH.
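Put together, the bottom-to-top walk could look roughly like the sketch below. `extract_bottom_up` is a hypothetical helper, not how any particular unzip tool does it. It only handles stored and deflated entries and ignores ZIP64, encryption and multi-disk archives.

```python
import io
import struct
import zipfile
import zlib

def extract_bottom_up(data: bytes) -> dict:
    """EoCDH -> every CDH -> its LFH -> file data, returned as {name: content}."""
    eocd = data.rfind(b"PK\x05\x06")
    if eocd < 0:
        raise ValueError("no end of central directory record found")
    # Total entry count, central directory size and offset from the EoCDH.
    n_entries, _cd_size, cd_offset = struct.unpack_from("<HII", data, eocd + 10)

    files = {}
    pos = cd_offset
    for _ in range(n_entries):
        if data[pos:pos + 4] != b"PK\x01\x02":
            raise ValueError("bad central directory header")
        (method,) = struct.unpack_from("<H", data, pos + 10)
        (csize,) = struct.unpack_from("<I", data, pos + 20)
        name_len, extra_len, comment_len = struct.unpack_from("<HHH", data, pos + 28)
        (lfh_offset,) = struct.unpack_from("<I", data, pos + 42)
        name = data[pos + 46:pos + 46 + name_len].decode("utf-8", "replace")

        # Follow the pointer to the LFH; only its two lengths are needed
        # to skip over the header and reach the data.
        if data[lfh_offset:lfh_offset + 4] != b"PK\x03\x04":
            raise ValueError("bad local file header")
        lfh_name_len, lfh_extra_len = struct.unpack_from("<HH", data, lfh_offset + 26)
        start = lfh_offset + 30 + lfh_name_len + lfh_extra_len
        raw = data[start:start + csize]

        if method == 0:        # stored
            files[name] = raw
        elif method == 8:      # deflate, raw stream without zlib header
            files[name] = zlib.decompress(raw, -15)
        else:
            raise ValueError(f"unsupported compression method {method}")
        pos += 46 + name_len + extra_len + comment_len
    return files

buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("0x41.txt", "test")
    zf.writestr("b.txt", b"hello" * 100, compress_type=zipfile.ZIP_DEFLATED)

extracted = extract_bottom_up(buf.getvalue())
print({name: len(content) for name, content in extracted.items()})
```

Note that the name and the compressed size come from the CDH here; the LFH is only used to find where the data starts.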

Note that when you only list a ZIP file, the name stored in the CDH is used. But when you unzip it, the name stored inside the LFH is used (at least by parsers that trust the LFH; tools differ on this).
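You can watch the two names drift apart with a quick experiment: build an archive, then overwrite the file name in the LFH only. This is a sketch of what one particular parser, Python's `zipfile`, does with it. It lists the CDH name and, as far as I know, refuses to extract when the two names disagree. Other tools behave differently.

```python
import io
import zipfile

buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("0x41.txt", "test")
original = buf.getvalue()

# The LFH copy of the name comes first in the file, so replacing only the
# first occurrence leaves the CDH copy untouched.
patched = original.replace(b"0x41.txt", b"0x42.txt", 1)
lfh_name = patched[30:30 + len(b"0x42.txt")]  # the name follows the 30-byte LFH

with zipfile.ZipFile(io.BytesIO(patched)) as zf:
    listed = zf.namelist()  # taken from the central directory
    try:
        zf.read(listed[0])
        outcome = "extracted"
    except zipfile.BadZipFile as exc:
        outcome = f"rejected: {exc}"

print("listed:", listed)
print("LFH name:", lfh_name)
print("extraction:", outcome)
```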

But yeah, there are other ways of parsing a ZIP too! There is top-to-bottom parsing, and since we know that the juice is in the LFHs only, why not just parse the local file headers and unzip them? Let’s call those stream parsers. And then we have aggressive parsing.
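A stream parser in its most naive form might look like the sketch below: start at offset 0 and keep reading LFHs back to back, never touching the CDH or the EoCDH. `stream_parse` is a made-up name. The sketch gives up on entries that use a data descriptor (flag bit 3), because the LFH then carries no usable sizes.

```python
import io
import struct
import zipfile
import zlib

def stream_parse(data: bytes) -> list:
    """Read consecutive LFHs from the top and return [(name, content), ...]."""
    entries = []
    pos = 0
    while data[pos:pos + 4] == b"PK\x03\x04":
        flags, method = struct.unpack_from("<HH", data, pos + 6)
        (csize,) = struct.unpack_from("<I", data, pos + 18)
        name_len, extra_len = struct.unpack_from("<HH", data, pos + 26)
        if flags & 0x08:
            raise ValueError("sizes are in a data descriptor; not handled here")
        name = data[pos + 30:pos + 30 + name_len].decode("utf-8", "replace")
        start = pos + 30 + name_len + extra_len
        raw = data[start:start + csize]
        if method == 0:        # stored
            content = raw
        elif method == 8:      # deflate, raw stream without zlib header
            content = zlib.decompress(raw, -15)
        else:
            raise ValueError(f"unsupported compression method {method}")
        entries.append((name, content))
        pos = start + csize    # the next LFH (or the first CDH) starts here
    return entries

buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("0x41.txt", "test")
    zf.writestr("b.txt", b"hello" * 100, compress_type=zipfile.ZIP_DEFLATED)

entries = stream_parse(buf.getvalue())
print([(name, len(content)) for name, content in entries])
```

The loop simply stops at the first thing that is not an LFH, which in a well-formed archive is the first CDH. Whatever names the central directory holds never get a say.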

Many of you