Wednesday, November 08, 2023

Finding a Photo's Who and When | Extracting and Scrubbing Exif Data

To simplify analyzing my DSLR vs camera phone pictures, I'd like a reliable way of pulling two key pieces of metadata from a photo: the name of the capturing device and the timestamp of when it was taken. A typical JPG stores this information in exif data, so it shouldn't be hard to access. Consider these four sample photos:

ImageMagick's identify command extracts exif entries with ease. Here's what I see when I run it against my sample photos and look for interesting keywords:

$ for f in *.jpg *.JPG ; do echo $f ; identify -verbose $f | grep exif: | egrep -ie '(date|canon|samsung|eos)' ; done
20230709_050311.jpg
    exif:DateTime: 2023:07:09 05:03:11
    exif:DateTimeDigitized: 2023:07:09 05:03:11
    exif:DateTimeOriginal: 2023:07:09 05:03:11
    exif:Make: samsung
20230709_055335.jpg
    exif:DateTime: 2023:07:09 05:53:35
    exif:DateTimeDigitized: 2023:07:09 05:53:35
    exif:DateTimeOriginal: 2023:07:09 05:53:35
    exif:Make: samsung
IMG_0029.JPG
    exif:DateTime: 2023:07:09 09:00:00
    exif:DateTimeDigitized: 2023:07:09 09:00:00
    exif:DateTimeOriginal: 2023:07:09 09:00:00
    exif:Make: Canon
    exif:Model: Canon EOS Rebel T6s
IMG_0045.JPG
    exif:DateTime: 2023:07:09 09:02:59
    exif:DateTimeDigitized: 2023:07:09 09:02:59
    exif:DateTimeOriginal: 2023:07:09 09:02:59
    exif:Make: Canon
    exif:Model: Canon EOS Rebel T6s

Extracting a Device Name

It looks like exif:Make or exif:Model is going to be the best source for the camera name. Focusing on these fields, I see:

$ for f in *.jpg *.JPG ; do echo $f ; identify -verbose $f | grep exif: | egrep -ie 'exif:(Make|Model):' ; done
20230709_050311.jpg
    exif:Make: samsung
    exif:Model: SM-S908U1
20230709_055335.jpg
    exif:Make: samsung
    exif:Model: SM-S908U1
IMG_0029.JPG
    exif:Make: Canon
    exif:Model: Canon EOS Rebel T6s
IMG_0045.JPG
    exif:Make: Canon
    exif:Model: Canon EOS Rebel T6s

SM-S908U1 doesn't mean anything to me, and Canon EOS Rebel T6s is a bit verbose. But mapping these values to easy-to-work-with names requires only a bit of trivial shell scripting:

# $model holds the exif:Model value (the patterns below match models, not makes)
case "$model" in
  SM-S908U1) echo "s22" ;;
  *T6s*) echo "t6s" ;;
  *) echo "unknown" ;;
esac
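As a minimal sketch, the mapping above could be wrapped in a function that takes the model string as its argument. The function name friendly_device is my own invention for illustration, and the commented-out line assumes ImageMagick's %[EXIF:Model] format escape is available in your version of identify:

```shell
# friendly_device MODEL
# Hypothetical helper: map a raw exif:Model value to a short device name.
friendly_device() {
  case "$1" in
    SM-S908U1) echo "s22" ;;      # Samsung Galaxy S22
    *T6s*)     echo "t6s" ;;      # Canon EOS Rebel T6s
    *)         echo "unknown" ;;  # anything not mapped yet
  esac
}

# Possible usage against a real file (assumes ImageMagick is installed):
#   friendly_device "$(identify -format '%[EXIF:Model]' IMG_0029.JPG)"
```

Keeping the lookup separate from the identify call means new cameras only require a new case pattern.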

Extracting a Timestamp

On the surface, it looks like DateTime contains exactly the timestamp I'm looking for. It uses :'s instead of -'s to separate the date fields, but that's trivial to fix:

$ for f in *.jpg *.JPG ; do echo $f ; identify -verbose $f | grep exif:DateTime: | sed -r 's/([0-9]{4}):([0-9]{2}):([0-9]{2}) ([0-9]{2}):([0-9]{2}):([0-9]{2})/\1-\2-\3 \4:\5:\6/' ; done
20230709_050311.jpg
    exif:DateTime: 2023-07-09 05:03:11
20230709_055335.jpg
    exif:DateTime: 2023-07-09 05:53:35
IMG_0029.JPG
    exif:DateTime: 2023-07-09 09:00:00
IMG_0045.JPG
    exif:DateTime: 2023-07-09 09:02:59
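The same rewrite can be sketched as a small function that only touches the two colons in the date portion. The name exif_to_iso is hypothetical, and sed -E is assumed to be supported (it is in GNU and BSD sed):

```shell
# exif_to_iso "YYYY:MM:DD HH:MM:SS"
# Hypothetical helper: turn the exif date separators into hyphens,
# leaving the time portion untouched.
exif_to_iso() {
  printf '%s\n' "$1" | sed -E 's/^([0-9]{4}):([0-9]{2}):([0-9]{2})/\1-\2-\3/'
}
```

For example, exif_to_iso "2023:07:09 05:03:11" prints 2023-07-09 05:03:11.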

While these are all seemingly valid timestamps, there's a problem: these photos were all taken in the early morning of July 9th. Why do some have the timestamp of 5am and some 9am? Something's not right.

According to the Google Photos UI, they were all taken between 5am and 6am:

The photos from my Galaxy S22 have exif timestamps that match Google's UI. But the DSLR pics are totally off. What gives?

First, I spent time closely analyzing the photo metadata from the DSLR's pics. I could see no timestamp that matched what the Photos UI reported.

$ identify -verbose IMG_0029.JPG |egrep -ie '(date|time|stamp)'
    date:create: 2023-11-08T06:59:29-05:00
    date:modify: 2023-11-08T03:58:16-05:00
    exif:DateTime: 2023:07:09 09:00:00
    exif:DateTimeDigitized: 2023:07:09 09:00:00
    exif:DateTimeOriginal: 2023:07:09 09:00:00
    exif:ExposureTime: 1/125
    exif:SubSecTime: 41
    exif:SubSecTimeDigitized: 41
    exif:SubSecTimeOriginal: 41
  User time: 0.210u
  Elapsed time: 0:01.329

Then I cursed out Google: how could they show me a correct timestamp on the web, but give me back a random timestamp in the image itself?

And then I remembered that I shot these photos with my camera's internal clock set incorrectly. I used a clever feature offered by Google Photos to shift the image capture time by a relative amount:

Google's almost certainly returning the original exif timestamp to me, not the one that I shifted on Google Photos. While this isn't the behavior I want, it is reasonable behavior. To deal with this, I've implemented my own time shifting logic. My approach is to describe in a text file the original and corrected timestamp for a given photo. Any other photos taken that day will be corrected by the same offset. For example, I can describe my DSLR's timestamp correction for July 9th, 2023 as:

device|date|exif_timestamp|google_photos_timestamp
t6s|2023-07-09|09:00|05:46

With this approach, I can issue a correction once and have it apply to the hundreds of photos taken on a given day.
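Here is a rough sketch of how that offset logic might work. It is not the actual photoassist code: the function name corrected_timestamp and its argument order are assumptions of mine, and it relies on GNU date for the -d option and @epoch syntax. All arithmetic is done in UTC so the local timezone and daylight saving don't skew the offset.

```shell
# corrected_timestamp DEVICE "YYYY-MM-DD HH:MM:SS" CORRECTIONS_FILE
# Hypothetical helper: find the correction line matching the device and
# day, compute (corrected - exif) in seconds, and apply it to the timestamp.
corrected_timestamp() {
  local device stamp file day d dt exif fixed from to now
  device="$1"
  stamp="$2"
  file="$3"
  day="${stamp%% *}"   # date portion, e.g. 2023-07-09

  while IFS='|' read -r d dt exif fixed; do
    if [ "$d" = "$device" ] && [ "$dt" = "$day" ]; then
      from=$(date -u -d "$dt $exif" +%s)    # what the camera recorded
      to=$(date -u -d "$dt $fixed" +%s)     # what it should have been
      now=$(date -u -d "$stamp" +%s)
      date -u -d "@$((now + to - from))" '+%Y-%m-%d %H:%M:%S'
      return 0
    fi
  done < "$file"

  # No correction on record for this device and day: pass through unchanged.
  printf '%s\n' "$stamp"
}
```

With the two-line corrections file shown above, corrected_timestamp t6s "2023-07-09 09:02:59" corrections.txt prints 2023-07-09 05:48:59, while a timestamp from the s22 passes through untouched. The header line is harmless because its first field never matches a real device name.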

I've packaged up the friendly device name mapping and timestamp correction logic into a shell script named photoassist. Using this script, I can now access scrubbed metadata.

$ for f in *.jpg *.JPG ; do d=$(photoassist -a device -i $f) ; t=$(photoassist -a timestamp -i $f); echo "file=$f, device=$d, timestamp=$t" ; done
file=20230709_050311.jpg, device=s22, timestamp=2023-07-09 05:03:11
file=20230709_055335.jpg, device=s22, timestamp=2023-07-09 05:53:35
file=IMG_0029.JPG, device=t6s, timestamp=2023-07-09 05:46:00
file=IMG_0045.JPG, device=t6s, timestamp=2023-07-09 05:48:59

These timestamps now agree with Google Photos and the device names are far easier to recognize and work with.

Next up, I want to use this data to annotate and organize a day's worth of photos so I can clearly see what my DSLR is bringing to the table.
