Using Tags in GThumb

I guess most people have their own way of organizing large numbers of digital images efficiently. Most of them probably use databases for organization, even though there are standardized ways to put metadata into the image files themselves. This blog post details my approach, which manipulates only the image files rather than putting a database "on top" or relying on other abstractions external to the files.

The GThumb application from the GNOME Project is my preferred tool for organizing digital images. Until I wanted to work only on subsets of the original images, the standard functionality was good enough for my very amateurish needs. In previous years we used to manually copy individual files into separate directories for different kinds of further processing. Such a procedure leaves a lot of duplicate files behind and does not record the selection in the original files themselves. After cleaning up the leftovers from those years, I set out to establish a better procedure this time. The aim was an easy process to select pictures relevant to different groups of people, all the while recording the selection in the original image files. The tagged files are then collected with a short script into individual directories for further processing, e.g. for creating a picture book out of them.

Keywords In Exif Data

All digital images from cameras or mobile phones already carry a great deal of metadata in the Exif section of the files. This includes fields for the creation date of the picture, the camera make, camera settings, etc. There is also a Keywords field, as can be seen with the exiftool command line tool:

dzu@krikkit:/tmp/exif$ exiftool -G -D -keywords DSC02657.JPG
[IPTC]             25 Keywords                        : Favorit
dzu@krikkit:/tmp/exif$

The "-G" option adds the group prefix for the attribute name and the "-D" adds the ID of the field from the spec. As I know now, there are many such groups in the standard and as we can check, the IPTC Group defines the ID 25 as Keywords with the data type string[0,64]+, i.e. a list. This allows us to add many such keywords to a single file. For our use case I would like to see keywords represent the target audience group.

The configuration file .config/gthumb/tags.xml simply contains a list of strings that GThumb will present in the tag menu:

<?xml version="1.0" encoding="UTF-8"?>
<tags version="1.0">
  <tag value="Bildschirmfotos"/>
  <tag value="Familie"/>
  <tag value="Favorit"/>
  <tag value="Geburtstag"/>
  <tag value="Party"/>
  <tag value="Radfahren"/>
  <tag value="Spiele"/>
  <tag value="Temporär"/>
  <tag value="Urlaub"/>
  <tag value="Wichtig"/>
  <tag value="Wissenschaftlich"/>
</tags>

With this setup, we can use GThumb's tag menu to add one or more of these tags to a file.

Reading Keywords From Files

Now that we have the tags in the files, we can use our tools to process them. To scan all files and extract the keywords, we can again use exiftool, this time with its JSON output format. The -if '$keywords' construct of exiftool limits the output to files that actually carry keywords. Leaving this option out would create an entry for every file in the JSON output, including empty ones where no keywords were found, but I prefer to see only the relevant entries.

Here is an example with a made-up directory hierarchy /tmp/pictures. The subdirectory 2023 contains 9 pictures from 2023, two of which are tagged. One carries only a single keyword, the other carries two. Note that in the single-keyword case, the value of Keywords is a plain string and not an array of strings. I would have liked to see arrays in all cases, and it seems that -strict should work this way, but it does not work for me as advertised. Any help would be appreciated; for now we need to remember this quirk for the script in the next section.

dzu@krikkit:/tmp/pictures$ exiftool -r -if '$keywords' -keywords -json 2023 > 2023-keywords.json
    1 directories scanned
    7 files failed condition
    2 image files read
dzu@krikkit:/tmp/pictures$ jq . 2023-keywords.json 
[
  {
    "SourceFile": "2023/IMG_20231029_160718.jpg",
    "Keywords": [
      "Radfahren",
      "Urlaub"
    ]
  },
  {
    "SourceFile": "2023/IMG_20231029_154527.jpg",
    "Keywords": "Radfahren"
  }
]
dzu@krikkit:/tmp/pictures$ 
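Until the string-vs-array inconsistency can be solved at the source, a small jq filter can normalize the cached file after the fact so that Keywords is always an array. This is just a sketch of the idea:

jq 'map(.Keywords |= if type == "string" then [.] else . end)' 2023-keywords.json

With a normalized file, downstream scripts would not need the string-or-array special case anymore.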

Mass Editing Keywords

Remember, the JSON file we created in the previous step is not really a database, but only a cached version of the data that allows quicker processing. If something changes in the folder, we need to recreate the file with exiftool, but until then it is much faster to work on the cached version.
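A quick way to check whether the cache is still current is to ask find for image files that are newer than the JSON file, for example:

find 2023 -type f -newer 2023-keywords.json

If this prints anything, the cache is outdated. The script in the last section below performs essentially this check automatically.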

Here is an example where we decided to rename the keyword Radfahren to Radeln with plain sed:

dzu@krikkit:/tmp/pictures$ sed 's/"Radfahren"/"Radeln"/g' < 2023-keywords.json > new.json
dzu@krikkit:/tmp/pictures$ jq . new.json 
[
  {
    "SourceFile": "2023/IMG_20231029_160718.jpg",
    "Keywords": [
      "Radeln",
      "Urlaub"
    ]
  },
  {
    "SourceFile": "2023/IMG_20231029_154527.jpg",
    "Keywords": "Radeln"
  }
]
dzu@krikkit:/tmp/pictures$ 
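The blanket sed works here because the keyword only ever appears as a quoted value, but a structure-aware rewrite with jq is safer, as it touches only the Keywords field and can never mangle file names. Here is a sketch that handles both the string and the array case:

jq 'map(.Keywords |= if type == "string"
        then (if . == "Radfahren" then "Radeln" else . end)
        else map(if . == "Radfahren" then "Radeln" else . end)
        end)' 2023-keywords.json > new.json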

Flushing The JSON Database To The Files

This did not change the tags in the original files, however. To do that, we can use exiftool again, but this time in writing mode:

dzu@krikkit:/tmp/pictures$ exiftool -json=new.json 2023
No SourceFile '2023/IMG_20231029_162416.jpg' in imported JSON database
(full path: '/tmp/pictures/2023/IMG_20231029_162416.jpg')
No SourceFile '2023/IMG_20231029_153602.jpg' in imported JSON database
(full path: '/tmp/pictures/2023/IMG_20231029_153602.jpg')
No SourceFile '2023/IMG_20231029_155812.jpg' in imported JSON database
(full path: '/tmp/pictures/2023/IMG_20231029_155812.jpg')
No SourceFile '2023/IMG_20231029_162414.jpg' in imported JSON database
(full path: '/tmp/pictures/2023/IMG_20231029_162414.jpg')
No SourceFile '2023/IMG_20231029_162422.jpg' in imported JSON database
(full path: '/tmp/pictures/2023/IMG_20231029_162422.jpg')
No SourceFile '2023/IMG_20231029_155814.jpg' in imported JSON database
(full path: '/tmp/pictures/2023/IMG_20231029_155814.jpg')
No SourceFile '2023/IMG_20231029_162413.jpg' in imported JSON database
(full path: '/tmp/pictures/2023/IMG_20231029_162413.jpg')
    1 directories scanned
    2 image files updated
dzu@krikkit:/tmp/pictures$

Note that exiftool found only two entries in new.json and complained about all the other files it encountered. As we intentionally created the JSON file without empty entries, this is to be expected. We can now recreate our JSON database and see that the contents of the original picture files have indeed changed:

dzu@krikkit:/tmp/pictures$ exiftool -r -if '$keywords' -keywords -json 2023 > 2023-keywords.json
    1 directories scanned
    7 files failed condition
    2 image files read
dzu@krikkit:/tmp/pictures$ jq . 2023-keywords.json 
[
  {
    "SourceFile": "2023/IMG_20231029_160718.jpg",
    "Keywords": [
      "Radeln",
      "Urlaub"
    ]
  },
  {
    "SourceFile": "2023/IMG_20231029_154527.jpg",
    "Keywords": "Radeln"
  }
]
dzu@krikkit:/tmp/pictures$ 
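One detail worth knowing: when exiftool writes to files, it keeps the unmodified version of each file as a backup with an _original suffix. Once the result has been verified, the backups can be cleaned up, or the backup step can be skipped in the first place:

exiftool -delete_original 2023
exiftool -overwrite_original -json=new.json 2023

The first command removes the backup files after the fact; the -overwrite_original option writes the files in place without creating backups at all.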

Summary Of Data Structures

This setup keeps the tags in the images themselves, so they will not be lost if images are moved, renamed or copied to other locations. Our <dir>-keywords.json file is only a cached representation of a directory hierarchy and needs to be recreated whenever the images have been manipulated in any way, be it with GThumb, the command line or any other means.

Collecting Images For Tags Into Directories

Keeping this in mind, we can now implement a small script to collect all files for each keyword into its own directory. As we use JSON for our data structures, it is convenient to use a scripting language with explicit support for JSON data. I will use the elvish shell, which I have come to like a lot lately. It allows very concise scripts that are immune to the "special characters in file names" and "unexpected error conditions in pipelines" problems of traditional shells. elvish is nice enough to warrant its own blog post, but today you can get a glimpse of it from a real use case. Option parsing, command line help and colored output are all included in this pretty terse implementation.

Without further ado, here is the script:

#!/usr/bin/elvish
#
# populate-tag-dirs.elv: Populate directories with links to tagged
# files.
#
# Extracting the tags into a JSON data structure is done with exiftool
# and so it needs to be installed for this script to run.  The data
# structure is cached to "<dir>-keywords.json" to speedup further
# processing.  There is an automatic check if the cache is invalid
# because there were changes to files more recent than the cache.
#
# Sample call:
# populate-tag-dirs.elv -v 2022

use flag
use str
use path
use re

fn usage {
    echo "usage: "(path:base (src)[name])" [-h] [-n] <dir>" >&2
    echo "       Populates category directories according to database." >&2
    echo "       The database needs to be in '<dir>-keywords.json' and the" >&2
    echo "       directories will be named '<dir>-<keyword>'" >&2
    exit 1
}

# Command argument parsing
var specs = [
    [&short=h &long=help]
    [&short=v &long=verbose]
    [&short=n &long=dry-run]
]

var flags args = (flag:parse-getopt $args $specs)
set flags = (each {|f| put [ $f[spec][long] $f[arg] ] } $flags | make-map)

if (or (has-key $flags help) (== (count $args) 0) (> (count $args) 1)) {
    usage
}

var verbose = (has-key $flags verbose)
var dry-run = (has-key $flags dry-run)
var path = $args[0]
var dbfile = $path""-keywords.json

# UI functions
fn error {|@msg|
    echo (path:base (src)[name]): (styled error: red) $@msg >&2
}

fn info {|@msg|
    echo (styled info: yellow) $@msg >&2
}

fn verbose {|@msg|
    if $verbose {
        echo (styled info: yellow) $@msg >&2
    }
}

fn update_db {|path|
    info Creating database for $path - this may take a while
    exiftool -r -if '$keywords' -keywords -json $path > $path""-keywords.json
}

# Check argument
if ?(test ! -d $path) {
    error $path missing or is not a directory
    exit 1
}

# Check for exiftool (-ver only prints the version and always succeeds)
if (not ?(exiftool -ver >/dev/null 2>&1)) {
    error exiftool not installed
    exit 1
}

# If the DB is missing, simply create it
if ?(test ! -f $dbfile) {
    error $dbfile "not found"
    update_db $path
} else {
    # Check if DB needs update
    # cut keeps file names containing spaces intact
    var recent_file = (find $path -type f -printf '%T@ %p\n' | sort -nr | head -1 | cut -d' ' -f2-)
    if ?(test $recent_file -nt $dbfile) {
        info $dbfile is outdated
        update_db $path
    } else {
        verbose $dbfile is still current, no rescan needed
    }
}

from-json < $dbfile | each {|entry|
    # The entries can be a string or an array of strings. Normalize to
    # an array to map over
    if (eq (kind-of $entry[Keywords]) "string") {
        put [ $entry[Keywords] ]
    } else {
        put $entry[Keywords]
    } | each {|kw|
        var dir = $path"-"$kw
        if $dry-run {
            echo "ln "$entry[SourceFile]" "$dir
            continue
        }

        if ?(test ! -d $dir) {
            info Creating directory $dir
            mkdir $dir
        }
        var target = $dir/(path:base $entry[SourceFile])
        if ?(test ! -f $target) {
            verbose Linking $entry[SourceFile] to $dir
            ln $entry[SourceFile] $target
        } else {
            verbose File $entry[SourceFile] already linked to $dir
        }
    } (one)
} (one)

This is how the script works in practice:

dzu@krikkit:/tmp/pictures$ populate-tag-dirs.elv 2023
info: Creating directory 2023-Radeln
info: Creating directory 2023-Urlaub
dzu@krikkit:/tmp/pictures$ ls -l
total 20
drwxr-xr-x 2 dzu dzu 4096 23. Nov 17:12 2023/
-rw-r--r-- 1 dzu dzu  164 23. Nov 17:15 2023-keywords.json
drwxr-xr-x 2 dzu dzu 4096 23. Nov 17:15 2023-Radeln/
drwxr-xr-x 2 dzu dzu 4096 23. Nov 17:15 2023-Urlaub/
-rw-r--r-- 1 dzu dzu  164 23. Nov 17:11 new.json
dzu@krikkit:/tmp/pictures$ 

Note that the files are not copied, but hard linked into the tag directories. This way the copies do not take up any additional space, as they point to the same storage location as the original images, and we can freely create and "rm -r" the tag directories. At the same time it is much easier to work with the selected files as members of a real directory, because we can apply all the standard Unix tools to the set. Keep in mind that hard links only work within a single filesystem, so the tag directories must live on the same filesystem as the originals.
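To convince yourself that no extra space is used, you can compare inode numbers; hard-linked files share a single inode. For example, with one of the files from the run above:

ls -li 2023/IMG_20231029_154527.jpg 2023-Radeln/IMG_20231029_154527.jpg

The first column, the inode number, is identical for both directory entries.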

Summary

I like the Unix ecosystem so much because it allows me to adapt the tools to my workflows instead of the other way round. As I always strive to keep things where they belong, the idea of storing metadata in the images themselves rather than in an external database is very natural to me, but may seem completely freakish to others. Equipped with powerful tools like exiftool, sed, jq and elvish, it is easy to build an efficient workflow on top of it.
