
revise-tool's Introduction

REVISE: REvealing VIsual biaSEs

A tool that automatically detects possible forms of bias in a visual dataset along the axes of object-based, attribute-based, and geography-based patterns, and suggests next steps for mitigation.

(Demo video: REVISE_demo.mp4)

In the sample_summary_pdfs folder there are examples of the kinds of auto-generated summaries our tool outputs along each axis for a dataset. These samples are annotated in orange with notes on how to interpret them.

Setup:

  • Create the conda environment with
conda env create -f environments/[environment].yml
  • Download the models with
bash download.sh
  • Note: we use Amazon Rekognition's proprietary facial detection tool in our analyses, which does incur a charge, and it will need to be set up by each user (instructions are on Amazon's site). There are many free facial detection tools available as well, and you can change which one is used in attribute_based.py. One such free facial detection tool, through cv2, is already implemented; to use it instead, simply change the FACE_DETECT variable in attribute_based.py from 0 to 1. A sketch of that style of detector is below.
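
For reference, a cv2-based face detector is typically a Haar cascade along these lines (an illustrative sketch, not the exact code in attribute_based.py):

import cv2

# Haar-cascade face detection, the classic free detector bundled with opencv-python;
# cv2.data.haarcascades points at the cascade XML files that ship with the package.
cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + 'haarcascade_frontalface_default.xml')

def detect_faces(image_path):
    """Return a list of (x, y, w, h) face bounding boxes for one image."""
    gray = cv2.cvtColor(cv2.imread(image_path), cv2.COLOR_BGR2GRAY)
    return cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)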

Steps to perform analysis:

Note that all scripts are expected to be run from the home directory.

(0.5 optional) To experiment with the tool on the COCO dataset for Object-Based and Attribute-Based metrics (using gender annotations) without having to run all the measurements on a dataset first, follow these steps and then skip to Step 3:

  • Download the pickle files from here, and place them in a folder in the tool directory called results/coco_example
  • Download the 2014 COCO dataset as well as gender annotations, and place them in customizable filepaths specified in the code here.
    • Those lacking the necessary storage space for the images of the COCO dataset can still try much of the functionality by heading to section 1.1 (Initial Setup) in each analysis notebook and changing the dataset class from "CoCoDataset" to "CoCoDatasetNoImages"

(1) Make a dataloader structured like the 'Template Dataset' in datasets.py (and add it to main_measure.py as well), filling it in with the dataset you would like to analyze. A sketch of the expected shape follows the command below. Test that you have properly implemented the dataset by running:

python3 tester_script.py NewDataset
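
A minimal sketch of the expected shape, based on the from_path usage quoted elsewhere on this page (the class name, file layout, and bounding-box values here are hypothetical; defer to the Template Dataset in datasets.py for the exact conventions):

from torch.utils.data import Dataset
from PIL import Image

class NewDataset(Dataset):
    """Hypothetical skeleton mirroring the Template Dataset in datasets.py."""

    def __init__(self):
        self.img_folder = 'Data/NewDataset/'  # assumed layout; point at your images
        self.image_ids = ['example.jpg']      # fill in from your annotation files

    def __len__(self):
        return len(self.image_ids)

    def from_path(self, file_path):
        image = Image.open(self.img_folder + file_path).convert('RGB')
        # One dict per object: a normalized bounding box (see datasets.py for the
        # exact box convention) plus an integer class label.
        image_anns = [{'bbox': [0.1, 0.2, 0.4, 0.9], 'label': 0}]
        gender_info = []    # optional attribute annotations, when they exist
        country = None      # optional geography annotation
        scene_group = None  # optional scene annotation
        anns = [image_anns, gender_info, [country],
                self.img_folder + file_path, scene_group]
        return image, anns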

(2) Run main_measure.py to make a pass through the data and collect the metrics for analysis. For example, to compute the measurements att_siz, att_cnt, att_dis, att_clu, obj_scn, and att_scn (details in the section below) on COCO and save the results in coco_example:

python3 main_measure.py --measurements 'att_siz' 'att_cnt' 'att_dis' 'att_clu' 'obj_scn' 'att_scn' --dataset 'coco' --folder 'coco_example'

(2.5 optional) To do some of the processing ahead of time so that interacting with the notebooks is faster, run the following for the Attribute notebook (att_clu):

python3 measurements/prerun_analyzeattr.py --dataset 'coco' --folder 'coco_example'

and for the Geography notebook (geo_tag and geo_lng) run

python3 measurements/prerun_analyzegeo.py --dataset 'yfcc' --folder 'yfcc_example'

(3) Still in the home directory, open the Jupyter notebook from within the analysis_notebooks folder corresponding to the axis of bias you would like to explore: object, attribute, or geography. Further instructions on how to run the notebooks are at the top of each one.

Measurements

Measurements that can be run, along with the file and name of the function they are associated with:

Object-Based

(Note: obj_cnt, obj_siz, and obj_ppl all run the same underlying function, so in main_measure.py it is only necessary to run one of them to get all three measurements.)

obj_cnt: Counts the number of times each instance, instance co-occurrence, and supercategory occurs.

obj_siz: Counts the size and distance from center at the supercategory level.

obj_ppl: Counts how often each supercategory is represented with or without people.

obj_scn: Counts overall scenes, scene-supercategory cooccurrences, scene-instance cooccurrences, and gets features per scene per supercategory.

Attribute-Based

att_siz: Gets the size of the person and distance from center, as well as whether a face is detected. Performs pairwise comparisons to find the largest/furthest person instances.

att_cnt: Counts how often each attribute occurs with an instance and instance pair. Performs pairwise comparisons to test significance of count differences.

att_dis: Calculates the distance each attribute is from each object. Runs OvR (One-vs-Rest) analysis to find the attribute that is furthest from/closest to an object.

att_clu: Gets scene-level and cropped object-level features per object class for each attribute. Runs OvR analysis to find the most linearly separable attribute.

att_scn: Counts the types of scenes each attribute occurs with.

(Note: To analyze an attribute along an ordinal axis, define the boolean "self.ordinal" and the array "self.axis" in the dataset class, as in the sketch below.)
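
For example, in the dataset class (the field names come from the note above; the attribute values shown are hypothetical, so use whatever ordering fits your annotations):

def __init__(self):
    ...
    self.ordinal = True                       # the attribute has a natural ordering
    self.axis = ['child', 'adult', 'senior']  # hypothetical attribute values, in order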

Geography-Based

Note: Geography-based analyses require a mapping from images to locations. Two formats of geography annotations are supported: string-formatted locations (e.g., 'Manhattan') and GPS labels (latitude-longitude coordinate pairs). Accordingly, the user should specify in their dataset class the geography_info_type to be one of the following (see the sketch after this list):

  • 'GPS_LABEL': datasets with mappings from image to GPS coordinates
  • 'STRING_FORMATTED_LABEL': datasets with mappings from image to string-formatted labels
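
In the dataset class this is a single field, set to whichever format matches your annotations:

# In __init__ of your dataset class:
self.geography_info_type = 'GPS_LABEL'               # image -> (latitude, longitude)
# or
self.geography_info_type = 'STRING_FORMATTED_LABEL'  # image -> 'Manhattan'-style strings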

geo_ctr: Counts the number of images from each region.

geo_tag: Counts the number of tags from each region, and extracts AlexNet features pretrained on ImageNet for each tag, grouped by subregion.

geo_lng: Counts the languages that make up the image tags, and whether or not they are local to the country the image is from. Also extracts image-level features to compare whether locals and tourists portray a country differently.
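
geo_tag and geo_lng rely on ImageNet-pretrained features; a minimal torchvision sketch of that general style of extraction (the preprocessing constants are the standard ImageNet ones, and the exact layers the tool uses may differ):

import torch
from torchvision import models, transforms
from PIL import Image

# ImageNet-pretrained AlexNet, used as a fixed feature extractor.
model = models.alexnet(pretrained=True).eval()

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

def image_features(path):
    """Return a 4096-d feature from the penultimate fully connected layer."""
    x = preprocess(Image.open(path).convert('RGB')).unsqueeze(0)
    with torch.no_grad():
        feats = model.avgpool(model.features(x)).flatten(1)
        for layer in list(model.classifier)[:-1]:  # drop the final 1000-way layer
            feats = layer(feats)
    return feats.squeeze(0)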

Potential Environment Issues

  • If FileNotFoundError: [Errno 2] No such file or directory: appears when importing basemap at epsgf = open(os.path.join(pyproj_datadir,'epsg')), change the PROJ_LIB variable as suggested here. In a Jupyter notebook, this may involve setting it in a cell like
import os
os.environ['PROJ_LIB'] = '/new/folder/location/of/epsg'

If the epsg file is still not found, it can be downloaded manually from here, with the path location set as mentioned.

  • For macOS, use environments/environment_mac.yml, and if there are errors, try running the following commands first:
conda config --set allow_conda_downgrades true
conda install conda=4.6.14
  • Use environments/environment.yml for non-Mac machines.
  • If there are compatibility errors, try deleting line 9 of environments/environment.yml, which contains _libgcc_mutex=0.1=main.

Glossary

  • Supercategory: a higher-order category for image labels. e.g., "couch" and "table" both map to the supercategory of "furniture"

Paper and Citation

REVISE: A Tool for Measuring and Mitigating Bias in Visual Datasets. If you find this tool useful, please cite one or both of the following publications.

Original ECCV 2020 publication

@inproceedings{revisetool_eccv,
  author = {Angelina Wang and Arvind Narayanan and Olga Russakovsky},
  title = {{REVISE}: A Tool for Measuring and Mitigating Bias in Visual Datasets},
  year = {2020},
  booktitle = {European Conference on Computer Vision (ECCV)},
}

Extended IJCV 2022 publication

@article{revisetool_extended,
  author = {Angelina Wang and Alexander Liu and Ryan Zhang and Anat Kleiman and Leslie Kim and Dora Zhao and Iroha Shirai and Arvind Narayanan and Olga Russakovsky},
  title = {{REVISE}: A Tool for Measuring and Mitigating Bias in Visual Datasets},
  year = {2022},
  journal = {International Journal of Computer Vision (IJCV)},
}

Funding

This work is partially supported by the National Science Foundation under Grant No. 1763642 and No. 1704444.

revise-tool's People

Contributors

aliu22, anatk2020, angelina-wang, dorazhao99, is-1


revise-tool's Issues

resnet18_places365.pth.tar has no attribute 'classifier'

Hello everyone,

Has anyone else encountered the problem that, when running measurement 4 on a custom dataset, the error message AttributeError: 'ResNet' object has no attribute 'classifier' is generated (in the file gender_based.py)? When printing the model there is also no attribute classifier. How can I resolve this error; am I maybe using the wrong .pth.tar?

This is the output of model:

ResNet(
  (conv1): Conv2d(3, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False)
  (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (relu): ReLU(inplace=True)
  (maxpool): MaxPool2d(kernel_size=3, stride=2, padding=1, dilation=1, ceil_mode=False)
  (layer1): Sequential(
    (0): BasicBlock(
      (conv1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
    (1): BasicBlock(
      (conv1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
  )
  (layer2): Sequential(
    (0): BasicBlock(
      (conv1): Conv2d(64, 128, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (downsample): Sequential(
        (0): Conv2d(64, 128, kernel_size=(1, 1), stride=(2, 2), bias=False)
        (1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      )
    )
    (1): BasicBlock(
      (conv1): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
  )
  (layer3): Sequential(
    (0): BasicBlock(
      (conv1): Conv2d(128, 256, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (downsample): Sequential(
        (0): Conv2d(128, 256, kernel_size=(1, 1), stride=(2, 2), bias=False)
        (1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      )
    )
    (1): BasicBlock(
      (conv1): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
  )
  (layer4): Sequential(
    (0): BasicBlock(
      (conv1): Conv2d(256, 512, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (downsample): Sequential(
        (0): Conv2d(256, 512, kernel_size=(1, 1), stride=(2, 2), bias=False)
        (1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      )
    )
    (1): BasicBlock(
      (conv1): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
  )
  (avgpool): AdaptiveAvgPool2d(output_size=(1, 1))
  (fc): Linear(in_features=512, out_features=365, bias=True)
)

Thank you very much and best regards

Question about Hierarchy Plot

Hi Angelina,

Just a quick question: how should the hierarchy visualization be interpreted?

Thank you very much in advance! :)

All the best,
Nastassia

Measurement 4: old download link for the cifar_resnet110.th?

Hello,

I think the download bash script should be updated.
Using the current script caused the error: _pickle.UnpicklingError: invalid load key, '<'.

I resolved the error by manually downloading from this link rather than the one in download.sh.

Just as a tip for future users. :)

Thank you very much for providing this tool.

Best regards,
Nastassia

Difference between person_center and ann_center?

Hi,

I have a short question regarding the bounding boxes. In measurement 3 (M3) you use bounding boxes from gender and from anns. However, I don't quite understand the difference between those.
In my understanding, gender is passed the person bounding box. But which bounding box is passed to anns?

Thank you very much in advance.

All the best,
Nastassia

Not able to download coco datasets

I am getting the following error while loading the COCO dataset:
FileNotFoundError: [Errno 2] No such file or directory: 'Data/Coco/2014data/annotations/instances_train2014.json'
Can anyone please provide a link to download the above file?

Thanks in advance.

Running on emotion recognition dataset with multiple people in one image

Hello,
I have a short question about using your custom dataset.
I have followed the instructions about adapting TemplateDataset to your specific dataset.
The dataset I am working on contains emotion labels for each respective person in the image - multiple people can be portrayed in one image, thus multiple emotion labels per image exist.
How should this case be treated?
So far, I tried this:

      
def from_path(self, file_path):

    ......

    annotation_mat_file_name = annotation_path + image_id.replace('.jpg', '.mat')
    mat_file = loadmat(annotation_mat_file_name)

    people = mat_file['people']

    gender_info = []
    image_anns = []
    for p in range(0, len(people) - 1):
        person_bbox = bbox = people['person'][p]['bbx']
        gender = people['person'][p]['gender']
        new_ann = {'bbox': person_bbox, 'label': people['person'][p]['cats']}
        image_anns.append(new_ann)
        gender_info.append([gender, self.biggest_bbox])
    country = None

    scene_group = self.scene_mapping[(self.img_folder + file_path)]  # optional
    anns = [image_anns, gender_info, [country], (self.img_folder + file_path), scene_group]

    return image, anns

I saved the labeling for each image in a .mat file. people is a struct with one entry per person, each containing the person's bounding box, gender, labels, and other fields.
The problem is the following: the lists inside 1.pkl (and, my guess is, also for the other measurements) are empty. Presumably this is because the annotations I pass in from_path are incorrect.

Do you have any suggestions on how to proceed? Should I save one .mat file per annotated person and thus treat each person and its corresponding bounding box as a separate image?

I will be very grateful for your input on this matter, and I am looking forward to your reply.

Best regards,
Nastassia

Gender Labeling

Hi,
it's me again.

Just as a quick tip for other users/you. :)
In datasets.py you stated in the TemplateDataset in the method from_path:

gender = None # optional, we have used 0 for female and 1 for male when these labels exist

So, I followed the instruction and have set:
if gender == "Female":
    binary_gender = 0
if gender == "Male":
    binary_gender = 1

I calculated the number of male and female persons depicted in the images beforehand, so I know how many of them there are. Unfortunately, the numbers for male and female persons are exactly switched in the first measurement (M0), which leads me to believe that you used the labeling

0: Male
1: Female

instead.

No critique, just a friendly tip for you/your future users. Since running measurement 1 took a bit more than two hours for me and I now have to rerun it, I want to spare your users the same. :)

All the best to you and best regards,
Nastassia

Dataset Reference for COCO

Hi,

First of all, thanks a lot for releasing this wonderful tool. Kudos.
I am struggling to run the tool properly on the COCO dataset, as this seems to be a missing link:

gender_data = pickle.load(open('Data/Coco/2014data/bias_splits/train.data', 'rb'))

Also, I am not able to understand what is happening in the next line:

self.gender_info = {int(chunk['img'][15:27]): chunk['annotation'][0] for chunk in gender_data}

Can you please guide me to where I can download this pickle file in the format accepted by the code?
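
As a side note on the second line: the slice looks like it extracts the numeric image ID from a standard COCO 2014 filename; a commented guess, assuming entries of the form 'COCO_train2014_<12-digit id>.jpg':

# 'COCO_train2014_' is 15 characters long and the zero-padded image ID is 12 digits,
# so [15:27] slices out the ID (an assumption about the 'img' field's format):
fname = 'COCO_train2014_000000123456.jpg'  # hypothetical value of chunk['img']
image_id = int(fname[15:27])               # -> 123456
# The comprehension then maps each image ID to its first gender annotation.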

M1 distance calculation

Hi Angelina,

sorry, it's me again!
I have a short question regarding M1. You compute the distance from the person to the image center as:
distance = np.linalg.norm(person_center - img_center)
However, you set:
img_center = np.array([.5, .5])

Maybe I am wrong, but did you perhaps forget to multiply by the image size itself, since person_center is given in pixel values?
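
A minimal sketch of what I mean, with hypothetical values; normalizing person_center first would put it in the same [0, 1] coordinate frame as img_center:

import numpy as np

img_w, img_h = 640, 480                      # hypothetical image size
person_center_px = np.array([320.0, 120.0])  # hypothetical person center in pixels

# Dividing by the image size puts the person center in the same normalized
# frame as img_center = [.5, .5]; mixing pixel and normalized frames would
# inflate the distances.
person_center = person_center_px / np.array([img_w, img_h])
img_center = np.array([.5, .5])
distance = np.linalg.norm(person_center - img_center)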

Please correct me if I am wrong, I was just wondering. :)

Thanks so much,
Nastassia

Object Analysis for Coco Dataset (M7): "ValueError: zero-dimensional arrays cannot be concatenated"

Hi,

When trying to run the Object Analysis notebook on the COCO dataset, I am getting the following error in the "Analyses" section of "(M7) Metric: Size and Distance from Center of Supercategories":

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-38-6ffb36b669b2> in <module>
      1 if first_pass:
----> 2     object_size(cat_to_ent[0][1], None)
      3 ui = HBox(all_things)
      4 out = widgets.interactive_output(object_size, {'object_class': object_class_widget, 'sizes': sizes_widget})
      5 display(ui, out)

<ipython-input-37-46aecec2feb8> in object_size(object_class, sizes)
    144     these_instances = np.concatenate(instances_per, axis=0)
    145     scenes_per = np.array([dataset.from_path(filepath)[1][4] for filepath in filepaths])
--> 146     these_scenes = np.concatenate(scenes_per, axis=0)
    147     num, counts = np.unique(these_instances, return_counts=True)
    148     num = np.array([categories.index(nu) for nu in num])

<__array_function__ internals> in concatenate(*args, **kwargs)

ValueError: zero-dimensional arrays cannot be concatenated

<Figure size 432x288 with 0 Axes>

The issue is that the variable scenes_per is simply a list of Nones:

[None None None None None None None None None None None None None None
 None None None None None None None None None None None None None None
 None None None None None None None None None None None]

Printing dataset.from_path(filepath)[1] for one filepath in filepaths results in:

[[{'bbox': [0.23233333333333334, 0.999, 0.22865625, 1.0], 'label': 1}, {'bbox': [0.28827083333333337, 0.6980625, 0.6325000000000001, 0.88925], 'label': 88}, {'bbox': [0.003, 0.23970833333333333, 0.9039375, 0.987078125], 'label': 65}], [1, [0.23233333333333334, 0.999, 0.22865625, 1.0]], [0], '/n/fs/visualai-scr/Data/Coco/2014data/train2014/COCO_train2014_000000176176.jpg', None]

I verified that the image at the given path exists. Can anyone help me with this issue? Thanks in advance!

Best regards,
Tobias

AttributeError while running python main_measure.py --measurements 'obj_scn' 'obj_siz' --dataset 'cityscapes' --folder '\train'

Traceback (most recent call last):
  File "D:\Bias\Image\Bias_in_image\revise-tool\main_measure.py", line 90, in <module>
    main()
  File "D:\Bias\Image\Bias_in_image\revise-tool\main_measure.py", line 86, in main
    index_to_measurement[meas](dataloader, args)
  File "D:\Bias\Image\Bias_in_image\revise-tool\measurements\object_based.py", line 176, in obj_scn
    group_mapping = dataloader.dataset.group_mapping
  File "C:\Users\nipon.chanda\.conda\envs\bias_env\lib\site-packages\torch\utils\data\dataset.py", line 83, in __getattr__
    raise AttributeError
AttributeError
