Ubuntu – How does Shotwell detect duplicates

photo-management, shotwell

I couldn't find any answers to this with Google. When importing the same folder of pictures twice, Shotwell skips the duplicate photos. But how does it detect duplicates? If I import two different folders of pictures, some of which happen to have the same name, will Shotwell assume they are duplicates? Or does it also factor in the file size, making false duplicates unlikely? Or does it hash the pictures, making false duplicates all but impossible?

Best Answer

I just tested this, and I believe it is more advanced than simply comparing names. In fact, it doesn't seem to be based on the name at all.

So I just created the following:

TestDir
  │
  ├─ blue.png #Blue A4 page
  ├─ blue2.png #Blue A4 page
  ├─ green.png #Blue A4 page
  ├─ red.png #Yellow A4 page
  ├─ yellow.png #Yellow A4 page
  ├─ yellow2.png #Yellow A4 page
  └ TestDir2
       │
       ├─ yellow.png #Blue A4 page
       └─ yellow.png #Blue A4 page

I imported the folder TestDir (which imports from any subdirectories too). This was the notice:

6 duplicate photos were not imported:
/home/tim/Pictures/Test/red.png
/home/tim/Pictures/Test/yellow2.png
/home/tim/Pictures/Test/green.png
/home/tim/Pictures/Test/blue2.png
(and 2 more)

2 photos successfully imported.

The two it imported were blue.png and yellow.png. This is because they were created first (it keeps the oldest file when there are duplicates).

This was confirmed by the next test:

TestDir
  │
  ├─ blue.png #Blue A4 page
  ├─ blue2.png #Blue A4 page
  ├─ green.png #Blue A4 page (sorry OCD people)
  ├─ red.png #Yellow A4 page (sorry OCD people)
  ├─ yellow.png #Yellow A4 page
  ├─ yellow2.png #Yellow A4 page
  ├─ pink2.png #Pink A4 page 
  ├─ pink.png #Pink A4 page
  └ TestDir2
       │
       ├─ yellow.png #Blue A4 page
       └─ yellow.png #Blue A4 page

For this test I added pink2.png and pink.png. pink2.png was created first, then pink.png.

8 duplicate photos were not imported:
/home/tim/Pictures/Test/red.png
/home/tim/Pictures/Test/yellow2.png
/home/tim/Pictures/Test/green.png
/home/tim/Pictures/Test/blue2.png
(and 4 more)

3 photos successfully imported.

The successfully imported ones were blue.png, yellow.png and pink2.png.

Because of that I assume it uses a hashing algorithm.

It is accurate enough that changing the colour of just 1 pixel from green to yellow on an A4 page meant the file was no longer detected as a duplicate. Pretty accurate then!
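That sensitivity is what you would expect from a content hash. Here is a minimal Python sketch of the idea, not Shotwell's own code: the byte strings are made-up stand-ins for image data, and MD5 is used purely as an example of a hash function.

```python
import hashlib

# Hypothetical stand-in for image data: 1000 "green" RGB pixels.
page = bytes([0, 255, 0]) * 1000

# Same data with only the first pixel recoloured to "yellow".
edited = bytearray(page)
edited[0:3] = bytes([255, 255, 0])
edited = bytes(edited)

# An identical copy hashes to the same digest ...
same = hashlib.md5(page).hexdigest() == hashlib.md5(bytes(page)).hexdigest()
# ... while the one-pixel edit produces a different digest.
changed = hashlib.md5(page).hexdigest() != hashlib.md5(edited).hexdigest()

print(same, changed)
```

A real PNG is compressed, but the principle is the same: any change to the bytes of the file changes the digest, so the edited picture no longer matches.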

In fact, I just found this post here:

Once this bug is fixed, Shotwell will use (exclusively) a full MD5 hash of the file to perform duplicate checking, which is a pretty darn accurate way to find duplicates.

In fact, in the source code, at line 732, is this (kudos to Jeremie Miserez):

imported_full_md5_table.has_key(prepared_file.full_md5)) {

Sounds like it uses an MD5 hash!
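To illustrate what a lookup like that amounts to, here is a rough Python sketch of duplicate detection keyed on a full-file MD5. It is not Shotwell's implementation (Shotwell is written in Vala). The names `full_md5` and `split_duplicates` are made up for this example, and visiting files oldest-first by modification time is an assumption chosen to mirror the behaviour observed in the tests above.

```python
import hashlib
import os


def full_md5(path, chunk_size=1 << 20):
    """Return the MD5 hex digest of the whole file at `path`."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in chunks so large photos are not loaded into memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def split_duplicates(root):
    """Walk `root` (including subdirectories) and return (imported, duplicates).

    Assumption: files are visited oldest-first by modification time,
    so the oldest copy of each picture is the one that gets "imported".
    """
    paths = []
    for folder, _subdirs, names in os.walk(root):
        paths.extend(os.path.join(folder, name) for name in names)
    paths.sort(key=lambda p: (os.path.getmtime(p), p))

    seen = {}  # MD5 digest -> first path seen with that content
    imported, duplicates = [], []
    for path in paths:
        key = full_md5(path)
        if key in seen:
            duplicates.append(path)  # same bytes as an earlier file
        else:
            seen[key] = path
            imported.append(path)
    return imported, duplicates
```

Calling `split_duplicates("TestDir")` on a folder like the ones above would return one path per distinct picture in the first list and every other copy in the second, regardless of file names.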

(Shapes for directory tree from here.)

