greenelab / iscb-diversity-manuscript



Analysis of ISCB Fellows and Keynotes Reveals Disparities

Home Page: https://greenelab.github.io/iscb-diversity-manuscript/

License: Other

Shell 13.72% HTML 86.28%

iscb bioinformatics diversity manuscript manubot

iscb-diversity-manuscript's Introduction

Analysis of ISCB honorees and keynotes reveals disparities

Citation

This manuscript is now published at:

Analysis of scientific society honors reveals disparities
Trang T Le, Daniel S Himmelstein, Ariel A Hippen, Matthew R Gazzara, Casey S Greene
Cell Systems (2021-08) https://doi.org/gmhq49
DOI: 10.1016/j.cels.2021.07.007

Manuscript description

Professional societies and the conferences that they manage provide an important venue for the dissemination of scientific knowledge. Being invited to deliver a keynote at an international society meeting or named a fellow of such a society is a major recognition. We sought to understand the extent to which such recognitions reflected the composition of their corresponding field. We collected keynote speaker invitations for the international meetings held by the International Society for Computational Biology as well as the names of Fellows. We compared these individuals with last and corresponding author contributions in the society’s partner journals. We used multiple methods to estimate the gender and nationality of authors and the recipients of these honors. Individuals from certain ancestries and countries appear to be under-recognized among honorees.

Manubot

Manubot is a system for writing scholarly manuscripts via GitHub. Manubot automates citations and references, versions manuscripts using git, and enables collaborative writing via GitHub. An overview manuscript presents the benefits of collaborative writing with Manubot and its unique features. The rootstock repository is a general-purpose template for creating new Manubot instances, as detailed in SETUP.md. See USAGE.md for documentation on how to write a manuscript.

Please open an issue for questions related to Manubot usage, bug reports, or general inquiries.

Repository directories & files

The directories are as follows:

content contains the manuscript source, which includes markdown files as well as inputs for citations and references. See USAGE.md for more information.
output contains the outputs (generated files) from Manubot including the resulting manuscripts. You should not edit these files manually, because they will get overwritten.
webpage is a directory meant to be rendered as a static webpage for viewing the HTML manuscript.
build contains commands and tools for building the manuscript.
ci contains files necessary for deployment via continuous integration.

Local execution

The easiest way to run Manubot is to use continuous integration to rebuild the manuscript when the content changes. If you want to build a Manubot manuscript locally, install the conda environment as described in the build directory. Then, you can build the manuscript on POSIX systems by running the following commands from the root directory of this repository.

# Activate the manubot conda environment (assumes conda version >= 4.4)
conda activate manubot

# Build the manuscript, saving outputs to the output directory
bash build/build.sh

# At this point, the HTML & PDF outputs will have been created. The remaining
# commands are for serving the webpage to view the HTML manuscript locally.
# This is required to view local images in the HTML output.

# Configure the webpage directory
manubot webpage

# You can now open the manuscript webpage/index.html in a web browser.
# Alternatively, open a local webserver at http://localhost:8000/ with the
# following commands.
cd webpage
python -m http.server

Sometimes it's helpful to monitor the content directory and automatically rebuild the manuscript when a change is detected. The following command, while running, will trigger both the build.sh script and manubot webpage command upon content changes:

bash build/autobuild.sh

Continuous Integration

Whenever a pull request is opened, CI (continuous integration) will test whether the changes break the build process to generate a formatted manuscript. The build process aims to detect common errors, such as invalid citations. If your pull request build fails, see the CI logs for the cause of failure and revise your pull request accordingly.

When a commit to the master branch occurs (for example, when a pull request is merged), CI builds the manuscript and writes the results to the gh-pages and output branches.

The gh-pages branch uses GitHub Pages to host the following URLs:

HTML manuscript at https://greenelab.github.io/iscb-diversity-manuscript/
PDF manuscript at https://greenelab.github.io/iscb-diversity-manuscript/manuscript.pdf

For continuous integration configuration details, see .github/workflows/manubot.yaml if using GitHub Actions or .travis.yml if using Travis CI.

License

Except when noted otherwise, the entirety of this repository is licensed under a CC BY 4.0 License (LICENSE.md), which allows reuse with attribution. Please attribute by linking to https://github.com/greenelab/iscb-diversity-manuscript.

Since CC BY is not ideal for code and data, certain repository components are also released under the CC0 1.0 public domain dedication (LICENSE-CC0.md). All files matched by the following glob patterns are dual licensed under CC BY 4.0 and CC0 1.0:

*.sh
*.py
*.yml / *.yaml
*.json
*.bib
*.tsv
.gitignore

All other files are only available under CC BY 4.0, including:

*.md
*.html
*.pdf
*.docx
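
The dual-licensing rule above can be checked mechanically. The following is a minimal sketch, assuming the glob patterns are matched case-sensitively against a file's base name; the function name `licenses_for` and the example paths are hypothetical and are not part of this repository.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath

# Glob patterns listed above as dual licensed (CC BY 4.0 and CC0 1.0).
# The "*.yml / *.yaml" entry is split into two patterns here.
DUAL_LICENSED_PATTERNS = (
    "*.sh", "*.py", "*.yml", "*.yaml", "*.json", "*.bib", "*.tsv", ".gitignore",
)


def licenses_for(path: str) -> list[str]:
    """Return the licenses that apply to a repository path.

    Assumption: patterns match the base name only, case-sensitively.
    """
    name = PurePosixPath(path).name
    if any(fnmatchcase(name, pattern) for pattern in DUAL_LICENSED_PATTERNS):
        return ["CC BY 4.0", "CC0 1.0"]
    # Everything else is available under CC BY 4.0 only.
    return ["CC BY 4.0"]


if __name__ == "__main__":
    # Hypothetical example paths, for illustration only.
    for example in ("build/build.sh", "content/metadata.yaml", "README.md"):
        print(example, "->", ", ".join(licenses_for(example)))
```

This sketch does not handle the "except when noted otherwise" case; files with their own license notice would need to be treated separately.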

Please open an issue for any question related to licensing.

iscb-diversity-manuscript's People


mrgazzara annagreene jperkel ajlee21 dhimmel

iscb-diversity-manuscript's Issues

R1: first/corresponding author on field-specific journals is not meaningful

A second problem with this approach is the use of the first/corresponding author of papers as background. Honors are given to scientists with significant contributions, where significant can be highly quoted, or original, or correspond to a long trajectory. Aspects, that might not be necessarily correspond to publications in the mentioned three journals. Indeed, many keynote invitations are to authors in related fields not necessarily published in bioinformatics journals. The logic of the comparison is unclear to say the least.

Update beta coefficients

R1: conclusions are obvious

Additionally, the two main conclusions of the paper are so obvious that is difficult to understand the need of the paper. First, conferences and societies in the field are doing an effort to maintain a healthy gender balance, even if clearly far from perfect the interpretation of the results seems to show a positive trend. Instead of analysing the results he authors go into a very long discussion of the causes and consequences, a discussion that is potentially more appropriate for an opinion paper than for a scientific paper in the conference. The second aspect, is the geographical bias. Yes, it is obvious that most of the speakers are white, the important question to understand the origin of the bias is if there are other more influential authors in the literature that the ones that have been selected in these conferences . The paper does not provide the necessary data to assess is this is the case. To assume that the ideal situation will be to have a number of invitations/honors proportional to the number of papers by region does not make any sense from a scientific or from the conference organisation point of view.

R4: NamePrism groupings problematic

Figure 4 and the category selection for the analysis are highly problematic. I understand that the categories were selected according to NamePrism, but it is ultimately the responsibility of the authors to justify choices. What is compared is a continent (Europe, Africa), a religion (Muslim), a country (Israel), a part of the continent (East Asian, South Asian), a group of different races or ancestries that spans continents (Hispanic), and then there are Celtic English folk. This analysis needs a substantial revision to the broader and more parallel categories, even compared to previous Figures in the manuscript. The genetics community has been successful discussing continental and subcontinental populations though the problem here is to infer those from names that reflect many things. Singling out Israel as overrepresented sends a tricky message when listed right next to Muslim as underrepresented. It surprised me to see this insensitivity in a study that is meant to assess where we are as a community and to promote sensitivities.

R4: abstract mentions only white overrepresentation

The abstract lists only one conclusion in the last sentence: white scientists are overrepresented and non-white scientists are underrepresented. This strikes me as cherry picking of findings. The paper, if I understood correctly, also found that female scientists are not underrepresented. Given the current climate, wouldn't it also be a major conclusion that the honorees are reasonably distributed gender-wise? Instead, the conclusion is "but the proportion has not reached parity". This suddenly changes the focus from ISCB practices relative to the field to ISCB practices relative to the entire society, which is a different background distribution and one where ISCB has little influence on.

Add curation data to Wikidata too?

Hello, congrats on the work!

Wikidata has a good system for this kind of metascientific information (via https://scholia.toolforge.org/), would you be interested to put your curation there too?

It could lead to some interesting SPARQL queries and take advantage of the other Linked Data on Wikidata.

If you have a published table somewhere (e.g. on Zenodo with a DOI) containing:

the name of the speaker
the conference and year of the keynote
the source for the information (e.g. the conference website)

I'd gladly make the reconciliation to the Wikidata IDs.

R5: are the authors unknown to wru known in distribution?

Of 34,050 corresponding authors, over 25% (8,770) had a last name for which wru did not provide predictions. Do the author know if these were random or non-random in distribution.
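
As a quick arithmetic check of the counts quoted in this comment (a sketch only, not part of the manuscript's analysis; the variable names are illustrative), the share of corresponding authors without a wru prediction can be computed directly:

```python
# Counts quoted in the reviewer comment above.
total_authors = 34_050
without_prediction = 8_770

# Fraction of corresponding authors whose last name received no prediction.
share_missing = without_prediction / total_authors
print(f"{share_missing:.1%} of corresponding authors lack a wru prediction")
```

The result is roughly 25.8%, consistent with the reviewer's "over 25%". Whether the missing names are random in distribution cannot be determined from these two counts alone.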

R1: not sufficient for a publication

In summary, at the scientific level the problem of normalizing names in terms of gender and origin is interesting and even if this paper makes some progress it is still not sufficient for a publication. As an opinion paper, the topic is obviously relevant and the conclusions known (gender balance is difficult and other biases are even more challenging) but the selection of the background (published papers in three bioinformatics journals) is misleading respect to the goal, i.e. selection of speakers and awardees, that is based on other criteria.

ISMB Decision Letter (5 Reviews)

Our manuscript received 5 reviews. This is an unusual number of reviews. As a member of the ISMB Program Committee I participate in reviews, and the papers that I have been assigned generally get three reviews.

Dear Casey S.,

Thank you for your submission to the ISMB/ECCB 2020 Proceedings Track. We regret to inform you that we are unable to accept your paper for publication. There were many strong submissions this year and only a small fraction could be accepted.

A total of 329 papers were submitted. Of these, 64 manuscripts have been conditionally accepted, giving the final acceptance rate of 19.4%. Each paper was reviewed by several members of the Program Committee, overseen by the Senior Program Committee. Papers were individually reviewed and subsequently discussed by their reviewers to encourage a robust debate and possible resolution of discrepancies between reviewers. Based on the reviews and the outcomes of these discussions, papers were nominated for acceptance into the program. A discussion of the Area Chairs, moderated by the Proceedings Chairs, brought us to a final selection of papers for continued consideration.

We hope you will be able to participate at ISMB 2020 in Montreal, Canada, July 12-16. You may wish to consider submitting to the Abstract Track which includes talks and posters. Submission details are available at:

https://www.iscb.org/ismb2020-submit/abstracts.

We understand that the current public health situation is leaving summer travel plans uncertain at this time, but we are considering virtual options for those who may not be able to attend in person.

Conference registration is open at:

https://www.iscb.org/ismb2020-registration

Sincerely,

Nadia El-Mabrouk and Donna Slonim
ISMB 2020 Proceedings Chairs

SUBMISSION: 189
TITLE: Analysis of ISCB honorees and keynotes reveals disparities

----------------------- REVIEW 1 ---------------------
SUBMISSION: 189
TITLE: Analysis of ISCB honorees and keynotes reveals disparities
AUTHORS: Trang T. Le, Daniel S. Himmelstein, Ariel A. Hippen Anderson, Matthew R. Gazzara and Casey S. Greene

----------- Overall evaluation -----------
SCORE: -2 (reject)
----- TEXT:
The paper describes an analysis of origin honorees and keynote speakers in the main bioinformatics conferences compared with the publication in three related journals using a number of methods to assign the names of the main authors to a gender and geographical origin (honors as the term used is debatable in this context)

At the technical level the paper describes, first; a significant effort to recover the names of speakers and awardees from the conferences and second; a number of methods to assign names to gender and geographical origin.

#38 There have been quite a lot of discussion on the methodological weakness of this paper following its release on bioRxiv. The Race and Ethnicity chapter was particularly controversial, for example mixing it with religious terms was clearly an important error. Therefore, even if interesting, the technology proposed in the paper has still to mature to produce reliable results.
#39 Furthermore, classifying names might reveal an origin but not a nationality and certainly not where the work was carried out. All these consideration seems to scape the analysis (i.e. considering the scientists with a greek name - that is easy to recognise as such- and got some of the ISCB award recently represent a minority when they have developed almost all their profesional time in the USA is very unclear).

#40 It is also unclear why this is the right strategy for the identification of the origin of the awardees/speakers. Direct searches for the names in bibliographic databases or in the society affiliations could be alternative possibilities.

#41 A second problem with this approach is the use of the first/corresponding author of papers as background. Honors are given to scientists with significant contributions, where significant can be highly quoted, or original, or correspond to a long trajectory. Aspects, that might not be necessarily correspond to publications in the mentioned three journals. Indeed, many keynote invitations are to authors in related fields not necessarily published in bioinformatics journals. The logic of the comparison is unclear to say the least.

#42 Additionally, the two main conclusions of the paper are so obvious that is difficult to understand the need of the paper. First, conferences and societies in the field are doing an effort to maintain a healthy gender balance, even if clearly far from perfect the interpretation of the results seems to show a positive trend. Instead of analysing the results he authors go into a very long discussion of the causes and consequences, a discussion that is potentially more appropriate for an opinion paper than for a scientific paper in the conference. The second aspect, is the geographical bias. Yes, it is obvious that most of the speakers are white, the important question to understand the origin of the bias is if there are other more influential authors in the literature that the ones that have been selected in these conferences . The paper does not provide the necessary data to assess is this is the case. To assume that the ideal situation will be to have a number of invitations/honors proportional to the number of papers by region does not make any sense from a scientific or from the conference organisation point of view.

#43 In summary, at the scientific level the problem of normalizing names in terms of gender and origin is interesting and even if this paper makes some progress it is still not sufficient for a publication. As an opinion paper, the topic is obviously relevant and the conclusions known (gender balance is difficult and other biases are even more challenging) but the selection of the background (published papers in three bioinformatics journals) is misleading respect to the goal, i.e. selection of speakers and awardees, that is based on other criteria.

----------------------- REVIEW 2 ---------------------
SUBMISSION: 189
TITLE: Analysis of ISCB honorees and keynotes reveals disparities
AUTHORS: Trang T. Le, Daniel S. Himmelstein, Ariel A. Hippen Anderson, Matthew R. Gazzara and Casey S. Greene

----------- Overall evaluation -----------
SCORE: -1 (weak reject)
----- TEXT:
The paper has an eye-catching title, partly because of the nature of the content and partly due to
its implication to ISMB as the flagship conference of ISCB.

Most of the technical content is around associating names to genders/geographic origin or curating publications.

#44 Figs 1 and 2 does not strongly support the title. The over-representation and under-representation has
not been quantified, with the appropriate statistical measures.

A few questions:

#45 1. Nationality/region definition: Mixing of religion and geographic regions is somewhat
confusing for the uninitiated.
I would have certainly benefited by understanding a basis for this classification.
Even in "Nameprism" that the authors cite, the basis is unclear.

#46 2. I was not able to parse the subtle implications of
"We suggest that considering equity may be more appropriate than strictly diversity" that the authors
offer in the Conclusion section.

#47 Any comments on LGBTQ, disability representation in STEM or the field of computational biology ?
Finally, while I think the paper is interesting addressing the social/demographics of the
ISCB community, it is outside the scope of the technical program of ISMB.

----------------------- REVIEW 3 ---------------------
SUBMISSION: 189
TITLE: Analysis of ISCB honorees and keynotes reveals disparities
AUTHORS: Trang T. Le, Daniel S. Himmelstein, Ariel A. Hippen Anderson, Matthew R. Gazzara and Casey S. Greene

----------- Overall evaluation -----------
SCORE: 1 (weak accept)
----- TEXT:
#48 This is an important reflective effort to undertake, and I am glad that it was performed. The analysis on the whole seems reasonable. It would be good to see more citations of this type of work in other disciplines and a comparison of the results here to what has been found in other scientific organizations. It also seems that the work could have been a bit more statistically rigorous in spots, and it would be nice to see more analysis of non-Asian ethnicities.

Some specific comments below.

#49 p. 1, last paragraph: I feel that this first paragraph of the introduction is not the strongest and doesn't add much to the paper overall. I would suggest removing or reworking.

#50 p. 3, paragraph 2: research advisors don't seem to be the best proxy for senior faculty who would be invited for keynotes or honored as fellows

#51 p. 3, penultimate paragraph: if the breakdown is approximately 40/60 on first/last authors being corresponding, why take a majority rule? Seems more rigorous to randomize the selection process according to this "weighted die"

#52 p. 4, paragraph 2: It would be great to see citations for Genderize, especially ones that provide insights on accuracy/reliability. It would also be nice to know how inclusive this app is in terms of names from across the world. Without that, the 1578 missing forenames could be biased by any bias in this app. This discussion is had with respect to wru lower down on the page, so it's natural to provide this standard here.

#53 p. 4, paragraph 4: The terms "race" and "ethnicity" are used in the paper, but there isn't really a definition of these terms in terms of categories. It would be nice to see that explicitly laid out, perhaps with a citation.

#54 p. 4, paragraph 4: Why is using the average demographic distribution a reasonable assumption to make?

#55 p. 7, Figure 2: It would be better to see statistics provided as part of this analysis. It strikes me that the problem at hand here is similar to sample-to-sample microbiome population analysis, which use rigorous statistics, and the same types of methods could be used for analysis in Fig. 2. This level of statistical rigor is present as part of Figure 3, which was good to see.

#56 p. 9, Figure 4: The same point can be made here as for Figure 2. It would be nice to see a statistical analysis and a p-value or two to justify claims beyond appealing to figures.

#57 p. 9, Figure 4: The term "other categories" is used but these are never really defined (see previous point). It would be good to do so, perhaps even in a supplement.

----------------------- REVIEW 4 ---------------------
SUBMISSION: 189
TITLE: Analysis of ISCB honorees and keynotes reveals disparities
AUTHORS: Trang T. Le, Daniel S. Himmelstein, Ariel A. Hippen Anderson, Matthew R. Gazzara and Casey S. Greene

----------- Overall evaluation -----------
SCORE: -3 (strong reject)
----- TEXT:
This manuscript looks at the representation issues when honoring scientists in the field of computational biology by a computational biology society (ISCB). It establishes a background distribution based on the authors of relevant journal publications and then studies subgroup representation of keynotes at three of the ISCB-affiliated conferences (ISMB, RECOMB, PSB) and ISCB fellows against that distribution. It analyzes predicted gender, nationalities, ethnicities, and race of honorees. There are several findings but the major and only stated conclusion in Abstract is that there is recognition overrepresentation of white scientists and underrepresentation of non-white scientists.

#58 This topic is out of scope for the proceedings of ISMB, at least by how I interpret the scope. It has its place in the published literature (I recommend with major revisions), but there is no methodology for or the analysis of molecular and biological data. Instead it is a study about the authors of such papers and broadly speaking it characterizes the recognition and reward system for their contributions to the field.

To give feedback to the ISMB senior committee and also the authors about their work, I offer a critique below assuming the manuscript is within scope of the ISMB proceedings.

Overall this is a clearly written manuscript on a clearly important topic. The work has merits and it presents simple conclusions, but it also has deficiencies in analysis and presentation of results that need to be addressed.

#59 Positives:

Social trends in science are important to track and corrective measures are necessary when disparities are identified. The manuscript brings attention to these important issues.

Societies should be kept in check and it is the responsibility of the members (and non-members) to appropriately expose problems and fix problems. It is difficult to strike the right tone and for most of the manuscript the authors did it well.

The analysis is solid for the most part. The focus on corresponding authors is appropriate. I see no better way to address the issues of gender and race/ethnicity but to employ prediction. I'd assume this prediction is roughly correct for gender. I am less sure about race, ethnicity, religion, etc because a few queries at NamePrism did not convince me of a very high accuracy or satisfy me with its groupings. The authors use their own tool, but I have not queried it. Overall though, I am willing to accept that this works well on average to not impact broad conclusions. The authors recognized that.

Although results might appear simple, substantial work has been completed.

Comments on the analysis:
#60 * The analysis focuses on ISCB, its fellows and its conferences. ISCB conferences are listed at: https://www.iscb.org/iscb-conferences however, justification was not provided as to why some conferences made it and others did not. PSB for example was assessed as international but this is hard to justify given its location patterns. GIW/ISCB-Asia is omitted, but no justification was given. This is of course of significant importance for the conclusions. Either justification or expanded analysis is needed.

#61 * The background in the study is created from the publications in well-selected journals (Bioinformatics, BMC Bioinformatics and PLOS Computational Biology). That background is appropriate and provides a useful characterization but is not necessarily the most appropriate. When one talks about a society, an alternative appropriate background would be the professional membership in that society (a society has responsibilities to recognize the work of its members well). The gender information in ISCB should be publicly available so the authors can check the trends and agreements with the current background. The authors could even be able to work out a solution with ISCB to apply the predictors to the data and more deeply test their hypotheses. The conclusions could hold or not and new patterns could emerge.

#62 * Related to above, the selected journals for the background are of global reach (especially Bioinformatics because of rankings and $0 closed-access publication charges) whereas selected ISCB conferences (with their honorees) are more localized in Europe and North America for financial reasons. Considering the field as a whole, there are conferences that are limited to Asia, such as APBC or ISCB-Asia/Genome Informatics Workshop that are both of good quality and well known. APBC 2020 web site also shows that BMC Bioinformatics is among journals for their special issue. This suggests that the journals indeed better represent the field and that conferences are more geographically fragmented, including potentially ISCB conferences. This in turn means that the selection of conferences is critical for the outcome and appropriateness of this analysis. I think it would be appropriate to see a recognition of this fact in the manuscript and adjust the analysis.

#63 * Figure 1. There is no statistical analysis of this data. By eyeballing the graphs it looks like that women are not underrepresented as honorees, and might even be overrepresented. This should be discussed more formally (e.g. statistical tests) and confidence intervals or statistical significance should be provided much like in Figure 2, whatever the conclusions are.

#64 * This may or may not be a problem in this study, but suppose 10 random papers from group A are published by 5 different corresponding authors and another 10 random papers from group B by a single member of that group. The chances that the top scientist is from group A is not 50%. How was this potential problem accounted for? If not, the authors could state how this might impact the outcomes.

#65 * Minor: the keynote speakers for ISMB can be found on ISCB's web site and AAAI web site prior to 2002.

Comments on the presentation:
#66 * The abstract lists only one conclusion in the last sentence: white scientists are overrepresented and non-white scientists are underrepresented. This strikes me as cherry picking of findings. The paper, if I understood correctly, also found that female scientists are not underrepresented. Given the current climate, wouldn't it also be a major conclusion that the honorees are reasonably distributed gender-wise? Instead, the conclusion is "but the proportion has not reached parity". This suddenly changes the focus from ISCB practices relative to the field to ISCB practices relative to the entire society, which is a different background distribution and one where ISCB has little influence on.

#67 * Introduction: "finding that minority scientists tend to apply for awards on topics with lower success rates [1] could be interpreted either as minority scientists select topics in more poorly funded areas or that majority scientists consider topics of particular interest to minority scientists as less worthy of funding." It might look inflammatory to use terminology "less worthy" here. Maybe this was meant in a narrow technical sense, but a reader might decide to go for a broad interpretation and assess that the whole system is discriminatory.

#68 * Figure 4 and the category selection for the analysis are highly problematic. I understand that the categories were selected according to NamePrism, but it is ultimately the responsibility of the authors to justify choices. What is compared is a continent (Europe, Africa), a religion (Muslim), a country (Israel), a part of the continent (East Asian, South Asian), a group of different races or ancestries that spans continents (Hispanic), and then there are Celtic English folk. This analysis needs a substantial revision to the broader and more parallel categories, even compared to previous Figures in the manuscript. The genetics community has been successful discussing continental and subcontinental populations though the problem here is to infer those from names that reflect many things. Singling out Israel as overrepresented sends a tricky message when listed right next to Muslim as underrepresented. It surprised me to see this insensitivity in a study that is meant to assess where we are as a community and to promote sensitivities.

#69 * I would like to see discussion on the topic of how ISCB should adjust to identify scientists that merit recognition but have fallen through the cracks so far? Can a data-driven help be of use?

----------------------- REVIEW 5 ---------------------
SUBMISSION: 189
TITLE: Analysis of ISCB honorees and keynotes reveals disparities
AUTHORS: Trang T. Le, Daniel S. Himmelstein, Ariel A. Hippen Anderson, Matthew R. Gazzara and Casey S. Greene

----------- Overall evaluation -----------
SCORE: 1 (weak accept)
----- TEXT:
This is an analysis of 411 ISCB honorees who were keynote speakers at ISCB-associated conferences (ISMB, 2002-2019, RECOMB 1997-2019, and PSB 1999-2020) as well as ISCB Fellows named (2009-2019).

The profile of awardees was compared to a corpus of articles in Bioinformatics, BMC Bioinformatics from (2002-), and PLOS Computational Biology (2005-). PMC corresponding author information when it was available (20,696 articles) and the PubMed last author as a fallback when corresponding author information was missing (9,053 articles). It is unknown how many of these articles were published by the name authors, or if the were attributed uniquely to authors. For example, it is custom in some institutes that the dept chair is a senior or corresponding author, and thus some authors might be over-represented by chance in the data. Recent Increases in Asian corresponding authors was primarily driven by recent publications in Bioinformatics and BMC Bioinformatics. The % of Asian authors estimated from these journals suggests a lack of Asian scientists among keynote speakers and honorees.

The authors created the pubmedpy Python package to parse names from articles and used https://genderize.io to predict gender. Race and ethnicity of honorees and authors was inferred using the R package wru and were associated with race/ethnicity category based on US census data

#70 Of 34,050 corresponding authors, over 25% (8,770) had a last name for which wru did not provide predictions. Do the authors know whether these were randomly or non-randomly distributed?

#71 There are assumptions and analysis issues. Did the authors check the association between predicted race/ethnicity and author affiliation? Was the Asian bias by country or ethnicity? For example, are American-Asians represented in ISCB honorees but Asians who are geographically located in Asia under-represented? Is this associated with geographic attendance at ISCB meetings? One might expect local awardees in the host country. For example, the authors note that ISMB keynotes had more probability attributable to Israel, while RECOMB had more attributable to East Asian countries.

#72 Given this small pool of 411, some names of which were duplicated (fellows and keynotes etc), were they manually checked for accuracy?

#73 There are non-random missing data. European Spanish/Portuguese/Italian are called Hispanic. Assignment of race (white/black/Asian) based on surname is difficult to extrapolate to a multi-cultural society where last name and skin color may be discordant.

I will break these out into component issues and edit the block quotes above with the GitHub identifiers.

R1: unclear why selected strategy is right

It is also unclear why this is the right strategy for the identification of the origin of the awardees/speakers. Direct searches for the names in bibliographic databases or in the society affiliations could be alternative possibilities.

R4: a reader might interpret the whole system as discriminatory

Introduction: "finding that minority scientists tend to apply for awards on topics with lower success rates [1] could be interpreted either as minority scientists select topics in more poorly funded areas or that majority scientists consider topics of particular interest to minority scientists as less worthy of funding." It might look inflammatory to use terminology "less worthy" here. Maybe this was meant in a narrow technical sense, but a reader might decide to go for a broad interpretation and assess that the whole system is discriminatory.

R3: interesting but more analysis of non-Asian ethnicities helpful

This is an important reflective effort to undertake, and I am glad that it was performed. The analysis on the whole seems reasonable. It would be good to see more citations of this type of work in other disciplines and a comparison of the results here to what has been found in other scientific organizations. It also seems that the work could have been a bit more statistically rigorous in spots, and it would be nice to see more analysis of non-Asian ethnicities.

Paper: The Diversity–Innovation Paradox in Science

https://www.pnas.org/content/117/17/9284

This paper is quite interesting. They model theses into topics and then examine when links between topics were first made, the distance between those topics in an embedding space, and then the eventual uptake of these links.

They find that underrepresentation is associated with introducing more distal links, but that for women and non-white scientists those links are more likely to receive less uptake (Fig 2F).

One piece of feedback that I have received over and over again from ISCB leadership is that the disparities might not be a problem, as they might reflect real academic value. We have also seen this in our reviews from anonymous reviewers. As a member of ISCB's Equity, Diversity, and Inclusion committee I saw this prompt conveyed to us based on our manuscript: "Are there recommendations from the committee that could guide ISCB in improving its diversity and still maintain fair science review?" This is a misunderstanding of our main point, which is actually that ISCB's disparities arise from failing to maintain fair scientific review.

I think we've done a nice job of pointing out that the perceived value that scientific advances have is conveyed by peers and the community, and if the selection process results in a group that fails to reflect the diversity of the field then that's a failure of the selection process. This paper aligns with that and suggests that discounting the contributions of minority scientists is causing us to miss promising advances. Their conclusions note:

We reveal a stratified system where underrepresented groups have to innovate at higher levels to have similar levels of career likelihoods. These results suggest that the scientific careers of underrepresented groups end prematurely despite their crucial role in generating novel conceptual discoveries and innovation.

We should make sure to discuss this when we have the opportunity to revise the work in response to the critiques.

Nationality Groupings Under Revision message

Should we go ahead and remove this? We've moved to lettered groupings, which I'm more comfortable with. A lot of the criticism centers on the groupings themselves, but I haven't seen an alternative data-driven grouping of names with separable structure that looks better.

R2: nationality/region groupings are unclear

Nationality/region definition: Mixing of religion and geographic regions is somewhat confusing for the uninitiated.
I would have certainly benefited by understanding a basis for this classification.
Even in "Nameprism" that the authors cite, the basis is unclear.

Confusing nationality with citizenship and religion

Seems like Figure 2 confounds citizenship with religion and nationality.
Citizenship is a pretty clear term: there is a fairly straightforward legal definition of what citizenship is in each country.

Nationality is more vague: in the US, it is often confused with citizenship. But actually in the US, a US national may not be a US citizen.

In other countries, there are legal or common-law definitions of nationality. They vary, and they may not be post-enlightenment textbook history definitions. Many people identify themselves with their nationality first, and their citizenship second. In countries where a nationality equals a minority or majority equity issue, you may be missing out on a lot of equity issues this paper is supposed to highlight.

Celtic English: an ancestry, at best.

European: regional definition, losing considerable nuances of ethnicity, race, and nationality.

Hispanic: In the US this has discriminated minority connotations, but this can include a variety of people, including Hispanic names that are common in former Spanish colonies in Africa?

East Asian again, like European, a grab-all bag that does not really
Muslim: a religion that overlaps with all of the above (and below).

South Asian: again: Muslim names from this region, that includes the largest Muslim population in the world, would go to “Muslim”.

African: subsaharan africa is probably the most diverse region on earth -- genetically as well as ethnically -- lumped in one category.

Israeli: names in the example are all of Israeli Jews mostly of certain diaspora origins. Israelis named Muhammad, Sergey, Adisu would go to the Muslim, European, and African categories, respectively.

Bottom line: not sure what to do, but don’t call it “nationality”. Perhaps “Rough historical name groupings”.

Limitations

I wanted to compile different special cases that may affect our race/ethnicity and gender predictions.

Names from Iberia (Spain and Portugal) vs. names from Spanish and Portuguese-speaking countries in Latin America.
Women scientists who changed their last name to their husband's. If they have different race/ethnicity, this would affect the race/ethnicity prediction.
Missing gender predictions: Hyphenated names are more unique and thus harder to gender match. Most of these names are predicted to be of Asian origin (87%). Of the 2673 author names that were not gender matched, approximately 25% have a hyphen and 50% have fewer than 4 characters, which are likely initials (AB, A.B, A B).

R4: a thought experiment about two groups

This may or may not be a problem in this study, but suppose 10 random papers from group A are published by 5 different corresponding authors and another 10 random papers from group B by a single member of that group. The chance that the top scientist is from group A is not 50%. How was this potential problem accounted for? If it was not, the authors could state how this might impact the outcomes.
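
To make the thought experiment concrete, here is a minimal sketch on toy data. The author labels (A0 to A4, B0) are hypothetical and only mirror the reviewer's numbers; this is not the manuscript's pipeline. It shows that counting papers and counting distinct corresponding authors give different group shares.

```python
from collections import Counter

# Hypothetical toy data: one (corresponding_author, group) pair per paper.
# Group A: 10 papers spread over 5 authors. Group B: 10 papers by one author.
papers = [(f"A{i % 5}", "A") for i in range(10)] + [("B0", "B")] * 10

# Paper-level share: every paper counts once.
paper_counts = Counter(group for _, group in papers)
paper_share = {g: n / len(papers) for g, n in paper_counts.items()}

# Author-level share: every distinct corresponding author counts once.
distinct_authors = set(papers)
author_counts = Counter(group for _, group in distinct_authors)
author_share = {g: n / len(distinct_authors) for g, n in author_counts.items()}

print(paper_share)
print(author_share)
```

At the paper level the two groups are split evenly, while at the author level group A holds 5 of the 6 distinct authors. Which of the two is the better background for honors is exactly the question the reviewer raises.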

R3: more statistics would be nice for race/ethnicity

p. 7, Figure 2: It would be better to see statistics provided as part of this analysis. It strikes me that the problem at hand here is similar to sample-to-sample microbiome population analysis, which uses rigorous statistics, and the same types of methods could be used for analysis in Fig. 2. This level of statistical rigor is present as part of Figure 3, which was good to see.

The figure in question is the wru-based study of the field vs. honorees.

R3: why use average demographic distribution?

p. 4, paragraph 4: Why is using the average demographic distribution a reasonable assumption to make?

The section in question is:

However, in the case of names that were not observed in the census, the function’s behavior was to use the average demographic distribution from the census. We modified the function to return a status denoting that results were inconclusive instead.
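
The difference between the two behaviors can be sketched as follows. This is an illustration only: it does not call wru, and the table, the group labels, and the average values are all made up for the example.

```python
# Hypothetical surname table standing in for census-derived predictions.
SURNAME_TABLE = {
    "example_a": {"group_1": 0.7, "group_2": 0.3},
    "example_b": {"group_1": 0.2, "group_2": 0.8},
}
# Made-up population-wide average used as the fallback distribution.
AVERAGE = {"group_1": 0.6, "group_2": 0.4}


def predict_with_fallback(surname):
    """Behavior described as the default: unseen names get the average."""
    return SURNAME_TABLE.get(surname.lower(), AVERAGE)


def predict_or_inconclusive(surname):
    """Behavior described as the modification: unseen names are flagged."""
    probs = SURNAME_TABLE.get(surname.lower())
    if probs is None:
        return {"status": "inconclusive", "probs": None}
    return {"status": "ok", "probs": probs}


seen = predict_or_inconclusive("Example_A")
unseen = predict_or_inconclusive("Not_In_Table")
```

With the fallback, every unseen surname pulls the aggregate toward the census average. Flagging such names keeps them out of the aggregate, at the cost of a smaller and possibly non-random sample.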

R3: for first/last corresponding, why take majority rule?

p. 3, penultimate paragraph: if the breakdown is approximately 40/60 on first/last authors being corresponding, why take a majority rule? Seems more rigorous to randomize the selection process according to this "weighted die"

The paragraph in question is:

We performed further analysis on PMC authors to learn more about corresponding author practices. First, we developed and evaluated a method to infer a corresponding author when the coded corresponding status was not available. For papers with multiple authors and at least one corresponding author, the first author was corresponding 43% of the time, whereas the last author was corresponding 62% of the time. Therefore, we assumed the last author was corresponding when coded corresponding author status was not available (120 articles from PMC and all articles from PubMed).
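
The two options can be sketched side by side. The rates are taken from the paragraph above; the function names and the normalization of the two rates into weights are our reading of the reviewer's "weighted die", not code from the manuscript.

```python
import random

# Rates reported in the quoted paragraph. They sum to more than 1 because
# a paper can have more than one corresponding author.
P_FIRST, P_LAST = 0.43, 0.62


def majority_rule(authors):
    """The manuscript's stated assumption: always pick the last author."""
    return authors[-1]


def weighted_die(authors, rng):
    """Pick the first or last author at random, weighted by the rates."""
    candidates = [authors[0], authors[-1]]
    return rng.choices(candidates, weights=[P_FIRST, P_LAST])[0]


rng = random.Random(0)  # fixed seed so the sketch is reproducible
authors = ["First Author", "Middle Author", "Last Author"]
picks = [weighted_die(authors, rng) for _ in range(10_000)]
share_last = picks.count("Last Author") / len(picks)
print(round(share_last, 2))
```

Under the weighted die the last author is chosen roughly 59% of the time (0.62 / 1.05), so first authors contribute to the background instead of being dropped entirely.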

R1: technology not mature

There has been quite a lot of discussion on the methodological weakness of this paper following its release on bioRxiv. The Race and Ethnicity chapter was particularly controversial; for example, mixing it with religious terms was clearly an important error. Therefore, even if interesting, the technology proposed in the paper has still to mature to produce reliable results.

R3: the paragraph noting that majority favoritism can be a pernicious bias did not add to the paper

p. 1, last paragraph: I feel that this first paragraph of the introduction is not the strongest and doesn't add much to the paper overall. I would suggest removing or reworking.

The paragraph in question is:

Scientists’ roles in society include identifying important topics of study, undertaking an investigation of those topics, and disseminating their findings broadly. The scientific enterprise is largely self-governing: scientists act as peer reviewers on papers and grants, comprise hiring committees in academia, make tenure decisions, and select which applicants will be admitted to doctoral programs. A lack of diversity in science could lead to pernicious biases that hamper the extent to which scientific findings are relevant to minority communities. For example, finding that minority scientists tend to apply for awards on topics with lower success rates [1] could be interpreted either as minority scientists select topics in more poorly funded areas or that majority scientists consider topics of particular interest to minority scientists as less worthy of funding. Consequently, it is important to examine peer recognition in different scientific fields.

R3: please provide a p-value instead of 95% confidence intervals

p. 9, Figure 4: The same point can be made here as for Figure 2. It would be nice to see a statistical analysis and a p-value or two to justify claims beyond appealing to figures.

The figure in question is the analysis of name origins.

R4: keynote speakers earlier to 2002 available

Minor: the keynote speakers for ISMB can be found on ISCB's web site and AAAI web site prior to 2002.

R4: positive elements

Overall this is a clearly written manuscript on a clearly important topic. The work has merits and it presents simple conclusions, but it also has deficiencies in analysis and presentation of results that need to be addressed.

Positives:

Social trends in science are important to track and corrective measures are necessary when disparities are identified. The manuscript brings attention to these important issues.

Societies should be kept in check and it is the responsibility of the members (and non-members) to appropriately expose problems and fix problems. It is difficult to strike the right tone and for most of the manuscript the authors did it well.

The analysis is solid for the most part. The focus on corresponding authors is appropriate. I see no better way to address the issues of gender and race/ethnicity but to employ prediction. I'd assume this prediction is roughly correct for gender. I am less sure about race, ethnicity, religion, etc because a few queries at NamePrism did not convince me of a very high accuracy or satisfy me with its groupings. The authors use their own tool, but I have not queried it. Overall though, I am willing to accept that this works well on average to not impact broad conclusions. The authors recognized that.

Although results might appear simple, substantial work has been completed.

R4: discuss how to better honor meritorious scientists

I would like to see discussion on the topic of how ISCB should adjust to identify scientists that merit recognition but have fallen through the cracks so far. Can a data-driven approach be of help?

R5: missing names are non-random & may not match skin color

There are non-random missing data. European Spanish/Portuguese/Italian are called Hispanic. Assignment of race (white/black/Asian) based on surname is difficult to extrapolate to a multi-cultural society where last name and skin color may be discordant.

R4: ISCB conferences are not worldwide meetings

Related to above, the selected journals for the background are of global reach (especially Bioinformatics because of rankings and $0 closed-access publication charges) whereas selected ISCB conferences (with their honorees) are more localized in Europe and North America for financial reasons. Considering the field as a whole, there are conferences that are limited to Asia, such as APBC or ISCB-Asia/Genome Informatics Workshop that are both of good quality and well known. APBC 2020 web site also shows that BMC Bioinformatics is among journals for their special issue. This suggests that the journals indeed better represent the field and that conferences are more geographically fragmented, including potentially ISCB conferences. This in turn means that the selection of conferences is critical for the outcome and appropriateness of this analysis. I think it would be appropriate to see a recognition of this fact in the manuscript and adjust the analysis.

R3: citations for genderize would be helpful

p. 4, paragraph 2: It would be great to see citations for Genderize, especially ones that provide insights on accuracy/reliability. It would also be nice to know how inclusive this app is in terms of names from across the world. Without that, the 1578 missing forenames could be biased by any bias in this app. This discussion is had with respect to wru lower down on the page, so it's natural to provide this standard here.

The paragraph in question is:

We predicted the gender of honorees and authors using the https://genderize.io API, which produces predictions trained on over 100 million name-gender pairings collected from the web. We used author and honoree first names to retrieve predictions from genderize.io. The predictions represent the probability of an honoree or author being male or female. We used the estimated probabilities and did not convert to a hard group assignment. For example, a query to https://genderize.io on January 26, 2020 for “Casey” returns a probability of male of 0.74 and a probability of female of 0.26, which we would add for an author with this first name. Because of the limitations of considering gender as a binary trait and inferring it from first names, we only consider predictions in aggregate and not as individual values for specific scientists.
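
As an illustration of adding probabilities instead of assigning hard labels, here is a minimal sketch. It does not query the genderize.io API: only the "Casey" values come from the paragraph above, and the other entry and all function names are hypothetical.

```python
# Hypothetical cached predictions keyed by first name.
predictions = {
    "Casey": {"male": 0.74, "female": 0.26},  # values quoted in the text
    "Example": {"male": 0.10, "female": 0.90},  # made up for illustration
}


def aggregate(first_names, predictions):
    """Sum per-person probabilities; names without a prediction are skipped."""
    totals = {"male": 0.0, "female": 0.0}
    for name in first_names:
        probs = predictions.get(name)
        if probs is None:
            continue
        for gender, p in probs.items():
            totals[gender] += p
    return totals


totals = aggregate(["Casey", "Casey", "Example", "Unknown"], predictions)
n_predicted = sum(totals.values())
proportions = {g: t / n_predicted for g, t in totals.items()}
print(proportions)
```

Each person with a prediction contributes a total weight of 1, split across the two categories, so the result is an estimated proportion for the group and says nothing about any individual.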

R5: is representation driven by geography?

There are assumptions and analysis issues. Did the authors check the association between predicted race/ethnicity and author affiliation? Was the Asian bias by country or ethnicity? For example, are American-Asians represented in ISCB honorees but Asians who are geographically located in Asia under-represented? Is this associated with geographic attendance at ISCB meetings? One might expect local awardees in the host country. For example, the authors note that ISMB keynotes had more probability attributable to Israel, while RECOMB had more attributable to East Asian countries.

Focus race analysis in the US

Because wru is trained on US census, we will analyze and discuss results of race analysis only on US-affiliated authors/honorees. Global trends can still be mentioned and linked to analysis notebook.

Exact method for computing confidence intervals of enrichment (RR)

Currently, the confidence intervals of enrichment are estimated via var(log(RR)) by the delta method (1/x1 - 1/n1 + 1/x2 - 1/n2) and a small continuity-like correction is applied to avoid dividing by 0. A potentially better estimate is to use an exact method (e.g., as proposed here) that calculates a lower bound even when the estimate is at infinity.
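
For reference, the delta-method interval described above can be sketched as below. The variance formula is the one given in the text; the `eps` correction and the argument names are assumptions, since the exact correction used in the analysis is not stated here.

```python
import math


def rr_confidence_interval(x1, n1, x2, n2, z=1.96, eps=0.5):
    """Approximate CI for the relative risk RR = (x1 / n1) / (x2 / n2).

    eps is a stand-in continuity-like correction (an assumption, not the
    value used in the analysis), applied only when a count is zero.
    """
    if x1 == 0 or x2 == 0:
        x1, n1, x2, n2 = x1 + eps, n1 + eps, x2 + eps, n2 + eps
    rr = (x1 / n1) / (x2 / n2)
    # Delta-method variance of log(RR), as given in the text.
    var_log_rr = 1 / x1 - 1 / n1 + 1 / x2 - 1 / n2
    half_width = z * math.sqrt(var_log_rr)
    return rr, rr * math.exp(-half_width), rr * math.exp(half_width)


# Hypothetical counts: 30 of 100 honorees vs. 150 of 1000 authors.
rr, lower, upper = rr_confidence_interval(x1=30, n1=100, x2=150, n2=1000)
print(rr, lower, upper)
```

The interval is symmetric on the log scale, so `lower * upper` equals `rr ** 2`. When a count is zero the result depends entirely on the chosen correction, which is the weakness an exact method would avoid.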

WIP.

R2: difference b/w equity & diversity unclear

I was not able to parse the subtle implications of "We suggest that considering equity may be more appropriate than strictly diversity" that the authors offer in the Conclusion section.

R2: figures 1/2 do not support the title (disparities)

Most of the technical content is around associating names to genders/geographic origin or curating publications.
Figs 1 and 2 do not strongly support the title. The over-representation and under-representation have not been quantified with the appropriate statistical measures.

Regarding the names of name origin groups, what do you think about these new names:

Celtic/English names
European names
East Asian names
Hispanic names
South Asian names
Arabic names
Hebrew names
African names
Greek names
Nordic names

@cgreene @dhimmel @arielah

Originally posted by @trang1618 in #56 (comment)

Manuscript Title

The manuscript needs a title. Key results thus far:

Gender has not reached parity.
Asian honorees are under-represented.
White honorees are over-represented.

The results thus far for other race/ethnic groups are hard to interpret. Our current analysis lumps the Iberian peninsula in with other Spanish and Portuguese speaking communities, which is likely to under-state the degree to which Latin American scientists are under-represented among honorees.

Our working title has been: "Analysis of ISCB Fellows and Keynotes Reveals Disparities"

R4: women may already be overrepresented / add conf intervals / discuss

Figure 1. There is no statistical analysis of these data. By eyeballing the graphs it looks like women are not underrepresented as honorees, and might even be overrepresented. This should be discussed more formally (e.g. statistical tests) and confidence intervals or statistical significance should be provided much like in Figure 2, whatever the conclusions are.

Editorial Feedback

We have received some elements of editorial feedback that we must address before this can be sent for review:

Thank you for sending your manuscript "Analysis of ISCB honorees and keynotes reveals disparities" to eLife. If you are able to address the points listed below, we would be happy to send your work to external referees for in-depth review as a Feature Article. (In our experience, addressing these points will help to ensure that the reviewers focus on the essential content of your manuscript, rather than being side-tracked by other issues.)

Abstract

a) Please reword the abstract as follows, replacing XXXX/YYYY/ZZZZ with the relevant figures
Delivering a keynote talk at a conference organized by a scientific society, or being named as a fellow by such a society, indicates that a scientist is held in high regard by their colleagues. To explore if the distribution of such indicators of esteem in the field of bioinformatics reflects the composition of this field, we compared the gender, country of affiliation, race/ethnicity and name-origin of 412 researchers who had been recognized by the International Society for Computational Biology (75 fellows and 337 keynote speakers) with XXXX researchers who had been the corresponding authors on papers in three leading bioinformatics journals between YYYY and ZZZZ. The proportion of female fellows and keynote speakers was similar to that of the field overall. However, fellows and keynote speakers with an affiliation in the United States were over-represented by a factor of 1.6; moreover, almost half of this excess was accounted for by a deficit of 41 fellows and keynote authors from China, France and Italy. Furthermore, within the US we found an excess of white fellows and keynote speakers, and a depletion of Asian fellows and keynote speakers. Globally, names of East Asian origin have been persistently underrepresented among fellows and keynote speakers.

Introduction

b) At present the last four sentences of the first paragraph are about bias against African-American/black scientists, which is followed by four sentences about gender bias, which is followed by two sentences about bias against Asian/black/African-American scientists. Please revise so that all of the discussion about bias against Asian/black/African-American scientists comes before or after the discussion about gender bias. (I would suggest after, to reflect the order of the discussion in the abstract and in the rest of the article.)

Materials and methods

c) eLife uses the introduction/results/discussion/methods format, and if your article is accepted for publication you will need to move the Materials and Methods section to the end of the article. You don't need to do this now, but please delete figure 1 as the figures should be reserved for the results of your study. (If your article is accepted, we can discuss reinstating this figure as a supplement later in the article.)

Results

d) Please delete the subsection heading "Curated honorees and . . ."

e) Please change the sub-section heading "Assessing gender diversity . . . " to a heading that summarizes the findings of this sub-section.

f) Figure 2: The caption for this figure could be a lot clearer if it described what is shown in the left panel, then described what is shown in the right panel, and then compared the two panels. Also, why is the value of the first column in the left panel zero?

g) The sub-section "Predicting name origin groups . . . ." belongs in the Materials and methods section

h) Please move the sub-section "Assessing the name origin diversity . . . " to later in the article so that your results are presented in the following order: i) gender; ii) country of affiliation; iii) biases within the US; iv) name of origin. Please also change the heading of this sub-section to a heading that summarizes the findings of the sub-section.

i) Figure 4: This caption could also be a lot clearer, so please revise along the lines suggested for figure 2.

j) Please change the sub-section heading "Affiliation analysis" to a heading that summarizes the findings of this sub-section.

k) Does table 2 include any information/data that are not already available in figure 5? If not, it might be best to make table 2 a source data file or an additional file.

l) Please change the sub-section heading "Assessing the racial and ethnic diversity. . . " to a heading that summarizes the findings of this sub-section.

m) Figure 6 would be easier to follow if panel C became panel B (as in figure 2 and figure 4). Also, how essential are panels B and D? If they are not essential they could become a figure supplement to this figure? Also, this caption could also be a lot clearer, so please revise along the lines suggested for figure 2.

n) Please delete the heading "Assessing the name origin diversity of US-affiliated authors and honorees" so that the analysis of biases within the US is in a single sub-section. If your article is accepted for publication we may need to consider the similarity of figure 4 and figure 7, but we don't need to do anything about this for now.

o) This final point is a generic point that is based on our experience with previous manuscripts based on surveys. Given the large number of figures and tables in your manuscript, please check that when the text says (see Figure X), Figure X is indeed the relevant figure (and please do the same for tables). Likewise, please ensure that numbers (like the number of respondents) are consistent throughout the manuscript. In our experience, referees respond badly when a manuscript refers the reader to the wrong figure or table, or when numbers are not consistent throughout the manuscript.

R2: Any comments on LGBTQ / disability representation in STEM/comp bio?

Any comments on LGBTQ or disability representation in STEM or the field of computational biology?
Finally, while I think the paper is interesting addressing the social/demographics of the ISCB community, it is outside the scope of the technical program of ISMB.

R4: background distribution != society membership

The background in the study is created from the publications in well-selected journals (Bioinformatics, BMC Bioinformatics and PLOS Computational Biology). That background is appropriate and provides a useful characterization but is not necessarily the most appropriate. When one talks about a society, an alternative appropriate background would be the professional membership in that society (a society has responsibilities to recognize the work of its members well). The gender information in ISCB should be publicly available so the authors can check the trends and agreements with the current background. The authors could even be able to work out a solution with ISCB to apply the predictors to the data and more deeply test their hypotheses. The conclusions could hold or not and new patterns could emerge.

R3: "other categories" is used but not defined

p. 9, Figure 4: The term "other categories" is used but these are never really defined (see previous point). It would be good to do so, perhaps even in a supplement.

R4: which conferences are included is poorly justified

The analysis focuses on ISCB, its fellows and its conferences. ISCB conferences are listed at: https://www.iscb.org/iscb-conferences however, justification was not provided as to why some conferences made it and others did not. PSB for example was assessed as international but this is hard to justify given its location patterns. GIW/ISCB-Asia is omitted, but no justification was given. This is of course of significant importance for the conclusions. Either justification or expanded analysis is needed.

Update Table 1 of example name for each name origin group

TODO after completing and merging #95.

R1: name origin != nationality

Furthermore, classifying names might reveal an origin but not a nationality and certainly not where the work was carried out. All these considerations seem to escape the analysis (e.g., it is very unclear whether scientists with a Greek name, which is easy to recognise as such, who recently received an ISCB award represent a minority when they have spent almost all of their professional career in the USA).

R3: the terms "race" and "ethnicity" are used in the paper but not defined

p. 4, paragraph 4: The terms "race" and "ethnicity" are used in the paper, but there isn't really a definition of these terms in terms of categories. It would be nice to see that explicitly laid out, perhaps with a citation.

The paragraph in question is:

We predicted the race and ethnicity of honorees and authors using the R package wru. wru implements methods described in Imai and Khanna [14] to predict race and ethnicity using surname and location information. The underlying data used for prediction are derived from the US Census. We used only the surname of author or honoree to make predictions via the predict_race() function. However, in the case of names that were not observed in the census, the function’s behavior was to use the average demographic distribution from the census. We modified the function to return a status denoting that results were inconclusive instead. This prediction represents the probability of an honoree or author selecting a certain race or ethnicity on a census form if they lived within the US.

R4: out of scope for conference

This topic is out of scope for the proceedings of ISMB, at least by how I interpret the scope. It has its place in the published literature (I recommend with major revisions), but there is no methodology for or the analysis of molecular and biological data. Instead it is a study about the authors of such papers and broadly speaking it characterizes the recognition and reward system for their contributions to the field.

R5: were names manually checked for accuracy?

Given this small pool of 411, some names of which were duplicated (fellows and keynotes etc), were they manually checked for accuracy?

R3: research advisers are not the best proxy for senior faculty

p. 3, paragraph 2: research advisors don't seem to be the best proxy for senior faculty who would be invited for keynotes or honored as fellows

The paragraph in question is:

We assumed that research advisors in the field would be those most likely to be invited for keynotes or to be honored as Fellows. Therefore, we collected corresponding author names to assess the composition of the field, weighted by the number of corresponding authors per publication.
