Thursday, February 09, 2023

Same Map, Less Megs: Optimizing USGS Map File Size, Part 2

In my last post I convinced myself that a standard USGS topo map contains a heap of image data which, if removed, would yield a smaller map file with no loss of functionality. So now it's time to make that happen.

My first attempt was to follow this recipe, which suggests using PyMuPDF to redact every image in the document. While functionally promising, the result was a miss. First, the redaction process takes a significant amount of time, given that there are over 170 images in a single map. More importantly, the redacted images are replaced with text that leaks outside the 'Images' map layer. The result was a map covered in redaction text, which, as you can imagine, was useless.

Looking at the PyMuPDF docs I found Page.delete_image. Apparently, I had been overthinking this. It looked like the image removal process was going to be as simple as:

for page_num in range(len(doc)):
    page = doc[page_num]
    for img in doc.get_page_images(page_num):
        xref = img[0]
        page.delete_image(xref)

That is, for each page of the document, loop through every image on that page. For each of these images, call delete_image. Alas, when I tried this, delete_image triggered an error message:

File "/Users/ben/Library/Python/3.9/lib/python/site-packages/fitz/", line 255, in replace_image
    if not doc.is_image(xref):
AttributeError: 'Document' object has no attribute 'is_image'

Looking at the source code, the error message is right: Document doesn't have an is_image method on it. This looks like a bug in PyMuPDF.

Fortunately, what was broken about delete_image was a pre-check I didn't need. The code that does the work of removing the image appears to be functional, so I grabbed it and used it directly.

Deleting an image is now accomplished with this code:

# a 1x1 transparent pixmap used as the replacement image
pix = fitz.Pixmap(fitz.csGRAY, fitz.IRect(0, 0, 1, 1), 1)
pix.clear_with()

for page_num in range(len(doc)):
    page = doc[page_num]
    for img in doc.get_page_images(page_num):
        xref = img[0]
        # point the image's xref at the tiny pixmap instead
        new_xref = page.insert_image(page.rect, pixmap=pix)
        doc.xref_copy(new_xref, xref)
        # wipe the drawing command that insert_image just appended
        last_contents_xref = page.get_contents()[-1]
        doc.update_stream(last_contents_xref, b" ")

doc.save("map.no_images.pdf", deflate=True, garbage=3)

The deflate=True and garbage=3 arguments to doc.save() ensure that the space previously occupied by the images is actually reclaimed.

Given my newfound knowledge, I enhanced pdfimages to support -a remove, which removes all images in a PDF.
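pdfimages is my own script, so its interface isn't documented anywhere else; a hypothetical skeleton of the option handling looks like this (only -a remove, -i, and -o come from the actual invocation; the other action names are illustrative):

```python
import argparse

def build_parser():
    # hypothetical interface: -a chooses the action, -i/-o name the files
    p = argparse.ArgumentParser(prog="pdfimages")
    p.add_argument("-a", "--action", choices=["list", "extract", "remove"], default="list")
    p.add_argument("-i", "--input", required=True)
    p.add_argument("-o", "--output")
    return p

args = build_parser().parse_args(["-a", "remove", "-i", "in.pdf", "-o", "out.pdf"])
```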

Here's my script in action:

# 4 freshly downloaded USGS Topo Maps
$ ls -lh *.pdf
-rw-------@ 1 ben  staff    53M Sep 23 00:20 VA_Bon_Air_20220920_TM_geo.pdf
-rw-------  1 ben  staff    56M Sep 17 00:17 VA_Chesterfield_20220908_TM_geo.pdf
-rw-------  1 ben  staff    48M Sep 23 00:21 VA_Drewrys_Bluff_20220920_TM_geo.pdf
-rw-------  1 ben  staff    48M Feb  8 08:05 VA_Drewrys_Bluff_20220920_TM_geo.with_images.pdf
-rw-------  1 ben  staff    51M Sep 23 00:22 VA_Richmond_20220920_TM_geo.pdf

# Remove their images
$ for pdf in *.pdf; do \
    pdfimages -a remove -i $pdf -o compressed/$pdf ; \
  done

# And we're smaller! From 50meg to 6meg. Not bad.
$ ls -lh compressed/
total 69488
-rw-------  1 ben  staff   6.7M Feb  9 07:47 VA_Bon_Air_20220920_TM_geo.pdf
-rw-------  1 ben  staff   8.0M Feb  9 07:47 VA_Chesterfield_20220908_TM_geo.pdf
-rw-------  1 ben  staff   6.4M Feb  9 07:47 VA_Drewrys_Bluff_20220920_TM_geo.pdf
-rw-------  1 ben  staff   6.4M Feb  9 07:47 VA_Drewrys_Bluff_20220920_TM_geo.with_images.pdf
-rw-------  1 ben  staff   6.3M Feb  9 07:47 VA_Richmond_20220920_TM_geo.pdf

# Are the PDF layers still intact? They are
$ python3 ~/dt/i2x/src/trunk/tools/bash/bin/pdflayers -l compressed/VA_Richmond_20220920_TM_geo.pdf
on:231:Map Collar
on:232:Map Elements
on:233:Map Frame
on:235:Federal Administrated Lands
on:236:National Park Service
on:237:National Cemetery
on:238:Jurisdictional Boundaries
on:239:County or Equivalent
on:240:State or Territory
off:243:Shaded Relief
on:251:Road Features
on:252:Road Names and Shields
on:254:Geographic Names
on:255:Projection and Grids

My script shrinks a USGS PDF from 50-ish megs to 7-ish. That means I can now store the 1,697 map files for Virginia in 11.8 gigs of disk space, instead of 84.8 gigs. That's quite an improvement for a script that was relatively easy to write and fast to execute.

The question remains: does the modified PDF remain a valid GeoPDF? Will Avenza Maps still treat it as a location-aware document? I loaded one of my newly compressed maps into Avenza to confirm:

Success! As you can see, Avenza is able to detect the coordinates on the map, as well as measure distances and bearings. The image-less maps are more compact, and completely functional.

You'll notice in the above screenshot that there are no street names printed on the map. That's by design: I turned off the layer that displays this information to verify that OCGs (the PDF layer mechanism) are still being respected. They are.

Time to start filling up Micro SD cards with collections of maps.
