Repetitions in file names and class labels #15

Open
swghosh opened this issue Aug 20, 2019 · 1 comment

swghosh commented Aug 20, 2019

The CSV file used to download the dataset contains repeated URL values (possibly intentional, since a single picture may contain several faces), and a few celebrity names share the same class label.

The following two names refer to different celebrities, yet they have the same class index:

  • Kanchan - nm0437156
  • Ilias_Kanchan - nm0437156
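
A minimal sketch for finding every class index that is shared by more than one name. It assumes the CSV has a `name` column next to `index` (only `index` and `image` are used in the reproduction code in this issue, so `name` is a guess), and the helper names and the in-memory sample rows are made up for illustration.

```python
import csv
import io
from collections import defaultdict


def names_per_index(rows, index_col="index", name_col="name"):
    """Map each class index to the set of distinct names that use it."""
    names = defaultdict(set)
    for row in rows:
        names[row[index_col]].add(row[name_col])
    return names


def conflicting_indices(rows, index_col="index", name_col="name"):
    """Return only the class indices shared by more than one name."""
    return {
        idx: sorted(found)
        for idx, found in names_per_index(rows, index_col, name_col).items()
        if len(found) > 1
    }


# Tiny in-memory sample. The image file names and the third row are
# hypothetical; only the two Kanchan rows mirror the case reported here.
SAMPLE_CSV = """name,index,image
Kanchan,nm0437156,a.jpg
Ilias_Kanchan,nm0437156,b.jpg
Some_Name,nm0000001,c.jpg
"""

sample_conflicts = conflicting_indices(csv.DictReader(io.StringIO(SAMPLE_CSV)))
print(sample_conflicts)  # {'nm0437156': ['Ilias_Kanchan', 'Kanchan']}
```

To run it on the real file, pass `csv.DictReader(open('IMDb-Face.csv', newline=''))` to `conflicting_indices`, adjusting `name_col` if the header is different.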

Apart from that, some entries in the dataset are exact repetitions of other entries: the class index, filename, and URL are all identical. (This assumes that the format {class_index}_{filename.jpg} should identify a unique entry.)

Hope this helps!
Alternatively, please let me know if I am mistaken and these repetitions are intentional.

Sample code to reproduce the problem.

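import csv

# Build a '{class_index}_{filename}' key for every row of the CSV.
with open('IMDb-Face.csv', 'r', newline='') as csv_file:
    spreadsheet = csv.DictReader(csv_file)
    entries = ['%s_%s' % (entry['index'], entry['image']) for entry in spreadsheet]

print(len(entries), 'entries were found.')
# A set keeps one copy of each key, so a smaller count means duplicates exist.
unique_entries = set(entries)
print(len(unique_entries), 'unique entries were found.')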
Output:

1662888 entries were found.
1632927 unique entries were found.
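
The counts above only show how many keys repeat, not which ones. A rough sketch for listing the repeated keys with `collections.Counter` follows. It uses the same `index` and `image` columns as the reproduction code; the function names and the `limit` parameter are hypothetical.

```python
import csv
from collections import Counter


def duplicated_keys(rows, index_col="index", image_col="image"):
    """Count '{index}_{image}' keys and keep those seen more than once."""
    counts = Counter("%s_%s" % (row[index_col], row[image_col]) for row in rows)
    return {key: n for key, n in counts.items() if n > 1}


def report_duplicates(csv_path, limit=10):
    """Print the most frequently repeated keys found in the CSV at csv_path."""
    with open(csv_path, "r", newline="") as csv_file:
        repeated = duplicated_keys(csv.DictReader(csv_file))
    print(len(repeated), "keys appear more than once.")
    # Show the keys with the highest repeat counts first.
    for key, n in sorted(repeated.items(), key=lambda kv: -kv[1])[:limit]:
        print(n, key)
    return repeated
```

Calling `report_duplicates('IMDb-Face.csv')` should print the number of repeated keys followed by the most repeated ones, which makes it easier to check whether the repeats also share the same URL.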
@Apich238

I downloaded the dataset, and it looks like the "Kanchan" class is garbage or an error, while "Ilias_Kanchan" is the real class.
