Image Digital Collections: validation, validation, validation!


This week, I received a task to validate TIFF images with a total size of 190 GB (shh... nobody really knew how many images were on those two black drives before the validation process). The library uploaded one copy to ContentDM and saved two local copies in the library.

We are about to start another round of digitization, so it is imperative to double-check the condition of all the images living on those drives in advance.

The article "TIFF format validation: easy-peasy?" compared various tools on the market. Here is the list of tools the author tested, and following the author's recommendation, I chose JHOVE. It is insanely easy to install and to run from a batch script on Windows machines:

No. | Validation tool | Version | How to use | Remark
1 | JHOVE | 1.14.6 | GUI and Java library |
2 | ImageMagick | 7.0.3 | Command-line, batch script | Help for the batch script via Twitter from David Underdown and the ImageMagick people
3 | ExifTool | 10.37 | Command-line, batch script | Help for the batch script from Mario from the German nestor format identification group
4 | DPF Manager | 3.1 | GUI |
5 | checkit_tiff | 0.2.0 | Runs on Linux only yet | Andreas, checkit_tiff developer from the SLUB Dresden, has run the test suite for me
6 | LibTIFF | 4.0.7 | Runs on Linux only | Heinz from the German nestor format identification group has run the test suite for me

I downloaded and installed JHOVE on a Windows machine. By the way, if your library uses Sierra, you do not need to worry about the Java requirement, since that was already taken care of when Sierra was installed.

Besides JHOVE, I wrote two scripts to help with the workload.

The first is a batch file, myjhove.bat, containing three lines of commands, which I created in the JHOVE directory. These three lines are simple but super powerful. They identify all TIFF images in a drive or path you specify, including those living in subfolders, sub-subfolders, and so on. (As a newbie librarian, I was amazed by librarians' obsession with creating an abyss out of folders. The experience is like opening Russian nesting dolls. Luckily, with a simple script, the tiniest dolls were all teleported at once.) The script then runs JHOVE TIFF validation against each of them and saves the output into a single text log file.
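For readers who would rather not write a batch file, the same idea can be sketched in Python. This is not the content of myjhove.bat; it is a rough equivalent under some assumptions: that the JHOVE launcher is reachable as jhove.bat, that the TIFF module is selected with "-m TIFF-hul", and that the output handler is left at JHOVE's default. The function names find_tiffs, build_jhove_command, and validate_all are made up for illustration.

```python
import os
import subprocess

# File extensions treated as TIFF (compared case-insensitively).
TIFF_EXTENSIONS = (".tif", ".tiff")


def find_tiffs(root):
    """Yield every TIFF path under root, descending into all subfolders."""
    for folder, _subfolders, files in os.walk(root):
        for name in sorted(files):
            if name.lower().endswith(TIFF_EXTENSIONS):
                yield os.path.join(folder, name)


def build_jhove_command(tiff_path, jhove="jhove.bat"):
    """Build the command line for one file.

    Assumption: the launcher name/path and the "-m TIFF-hul" module
    switch match your JHOVE installation.
    """
    return [jhove, "-m", "TIFF-hul", tiff_path]


def validate_all(root, log_path, jhove="jhove.bat"):
    """Run JHOVE on each TIFF and append all output to one log file."""
    count = 0
    with open(log_path, "a", encoding="utf-8") as log:
        for path in find_tiffs(root):
            subprocess.run(
                build_jhove_command(path, jhove),
                stdout=log,
                stderr=subprocess.STDOUT,
                check=False,  # keep going even if JHOVE reports a failure
            )
            count += 1
    return count
```

Calling validate_all with a drive letter and a log file name (both hypothetical, for example "E:\\" and "jhove_log.txt") would walk the whole drive and return the number of files it handed to JHOVE.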

After the .bat file had run for several hours, I had a 30 MB text log file to deal with.

Of course, nobody has the time to read through the whole log. I needed a robot to read it first and report the results and statistics back. So here comes the second script, written in Python, which converts a JHOVE log file into a well-structured CSV file. It also counts the total number of TIFF images validated, counts the total number of errors if there are any, and saves the errors in a separate CSV file.
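A minimal sketch of what such a log reader could look like is below. It is not my exact script, and it assumes the log uses JHOVE's plain-text layout, where each file's report starts with a "RepresentationInformation:" line followed by "Status:" and, for problem files, one or more "ErrorMessage:" lines. The function names and CSV column names are made up for illustration.

```python
import csv

# Status string assumed to mark a healthy file in JHOVE's text output.
VALID_STATUS = "Well-Formed and valid"


def parse_jhove_log(lines):
    """Turn JHOVE text-log lines into one dict per validated file."""
    records = []
    current = None
    for raw in lines:
        # Split on ": " rather than ":" so Windows paths like C:\... survive.
        key, sep, value = raw.strip().partition(": ")
        if not sep:
            continue
        if key == "RepresentationInformation":
            current = {"file": value, "status": "", "errors": []}
            records.append(current)
        elif current is None:
            continue  # header lines before the first file report
        elif key == "Status":
            current["status"] = value
        elif key == "ErrorMessage":
            current["errors"].append(value)
    return records


def summarize(records):
    """Return (total files checked, list of records that are not valid)."""
    failed = [r for r in records if r["status"] != VALID_STATUS]
    return len(records), failed


def write_csv(records, handle):
    """Write one row per file to an open text file handle."""
    writer = csv.writer(handle)
    writer.writerow(["file", "status", "errors"])
    for r in records:
        writer.writerow([r["file"], r["status"], "; ".join(r["errors"])])
```

In use, the log would be opened and passed to parse_jhove_log, then write_csv would be called twice: once with all records for the full report, and once with the failed list from summarize for the error-only file.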

Then I got the report back from my buddy computer:

"25,153 TIFF images are checked, and no error found! All of them are sound and healthy TIFFs. "

So far so good. But in the long run, if we want to keep the project sustainable, we should probably run checks like this one on the files regularly. Unfortunately, there is no permanent solution yet. We save what we can.
