clauswilke / dataviz

A book covering the fundamentals of data visualization

Home Page: https://clauswilke.com/dataviz

License: Other

R 0.22% CSS 2.43% TeX 0.54% Shell 0.05% HTML 95.20% Python 0.17% JavaScript 1.38%

dataviz's Issues

Typo - "the Econommist" should be "The Economist"

In choosing_plotting_software, "the Econommist" should be "The Economist".

Typo in Preface

In "Good, bad, and ugly figures" there is a missing "to"
FROM:
Throughout this book, I show many different versions of the same figures, some as examples of how make a good visualization and some as examples of how not to.

TO:
Throughout this book, I show many different versions of the same figures, some as examples of how to make a good visualization and some as examples of how not to.

I.e. from "...how make a good visualization..." to "...how to make a good visualization..."

Replace volcano image in 3d chapter with something rayshaded

Typos/Grammar

Preface (P1, L6) - "a obscure" needs to be "an obscure."

Directory of visualizations: Plots of association (scatter, contour, etc.)

First, this looks fantastic! Can't wait to see the final product.

Second, I realize it is still a work in progress so I may be jumping the gun here, but I didn't notice anything in your directory of visualizations about plots of association (not sure that is the best term...). I am thinking of things like scatter plots, contour plots, etc. Essentially anything with one dist on x, one on y (and I suppose one on z).

Just wondering if you plan on adding these in and again, this looks great!

Rework data:ink ratio chapter

The chapter on data:ink ratio should be reworked to take into account the feedback by @steveharoz. See this Twitter thread: https://twitter.com/sharoz/status/1005868631023112192

Transcribed:

Why perpetuate the myth of the importance of a data-to-ink ratio? It's based entirely on Tufte's opinion books rather than empirical evidence. Debunked many times.

Bateman et al. CHI 2010
Borgo et al. TVCG 2012
Borkin et al. TVCG 2013
Haroz et al. CHI 2015
Skau et al. CGF 2015

Collectively, these articles refute the notion that “ink” or non-minimal graphical elements are predictive of performance:

1. Bateman et al 2010 and Haroz et al 2015 showed that some embellishments improve performance.
2. Bateman et al 2010, Haroz et al 2015, and Skau et al 2015 failed to find a measurable performance hit for some embellishments.
3. Haroz et al 2015 and Skau et al 2015 showed that some embellishments harm performance.

So non-minimal ink can either improve, reduce, or not affect performance. It’s an irrelevant dimension. Of course, not every form of ink (e.g. grids and backgrounds) was tested. That could become a bit of a no true scotsman issue. But they do show that ink quantity fails to predict much of anything. So why use the term at all? And what evidence is there that it’s worth the effort to minimize contrast of grids or outlines? It's fine to like and advocate the style. But no need for the pseudosciency term. As for Borkin et al 2013, it showed an improvement in recognizability, which I completely agree is not the same as memorability.

Plan for revisions:

Rename the chapter to "Balance the data:ink ratio".
I consider the data:ink ratio useful to think about extreme cases: all the way to one end or all the way to the other end figures become ugly. In the middle, though, there is a large range of options that can work well.
Cite some of the relevant research literature.
Add a version of Figure 18.2 with a frame around the plot panel, as proposed by @hadley.
Make it clearer that many of the recommendations in this chapter are design choices that are guided to some extent by personal taste. Different people may make different choices, and that's fine.

Section about visualization of intersecting sets

Currently, the book doesn't have a section dedicated to the representation of multiple intersecting sets. This subject may be within the scope of the book, and its inclusion would be interesting.

I would suggest a discussion about Venn diagrams and UpSet plots.

Discrepancy in the Okabe and Ito palette

In the book, #999999 is listed as a color of the Okabe and Ito (2008) palette. But this color is not listed on their site; they use #000000 instead.

Is there a reason to use #999999 in the book? As a deutan, I find #000000 much easier to see, as it contrasts better with the other colors of the palette.

Thank you for all the work you put into this book!
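To put a rough number on the contrast question raised above, here is a minimal sketch that compares #999999 and #000000 against the seven chromatic Okabe and Ito colors. It uses the WCAG relative-luminance and contrast-ratio formulas, which is an assumption of this sketch and not something the book or the palette authors prescribe. The hex values are the palette as it is commonly published, and the function names are hypothetical. Note that this measures luminance contrast only; it does not simulate any form of color-vision deficiency.

```python
def relative_luminance(hex_color):
    """WCAG 2.x relative luminance of an sRGB color given as '#RRGGBB'."""
    h = hex_color.lstrip("#")
    channels = [int(h[i:i + 2], 16) / 255 for i in (0, 2, 4)]
    # Undo the sRGB transfer curve to get linear-light channel values
    linear = [
        c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4
        for c in channels
    ]
    r, g, b = linear
    return 0.2126 * r + 0.7152 * g + 0.0722 * b


def contrast_ratio(color_a, color_b):
    """WCAG contrast ratio, from 1 (identical luminance) up to 21 (black on white)."""
    la, lb = relative_luminance(color_a), relative_luminance(color_b)
    lighter, darker = max(la, lb), min(la, lb)
    return (lighter + 0.05) / (darker + 0.05)


# The seven chromatic Okabe-Ito colors, as commonly published
OKABE_ITO_CHROMATIC = {
    "orange": "#E69F00",
    "sky blue": "#56B4E9",
    "bluish green": "#009E73",
    "yellow": "#F0E442",
    "blue": "#0072B2",
    "vermilion": "#D55E00",
    "reddish purple": "#CC79A7",
}

for candidate in ("#999999", "#000000"):
    # The worst case is the palette color that is hardest to tell apart by luminance
    worst = min(contrast_ratio(candidate, c) for c in OKABE_ITO_CHROMATIC.values())
    print(candidate, "lowest contrast against palette:", round(worst, 2))
```

A higher minimum ratio means the candidate is easier to separate from every other palette color by lightness alone, which is one plausible reading of the observation in this issue.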

Use of color and legend order in some Titanic figures (Chapter 5)

Thanks for sharing your work at this early stage. I'm looking forward to getting (and recommending!) the final book.

Two details to improve the overall coherence of figures using the Titanic dataset in chapter 5:

Figure 5.9, which is OK by itself, shows the distribution for females in blue, while all the other figures using this dataset use blue for males.

Also, you could consider rearranging the order of genders in the legend in figure 5.6 to match the ones in 5.7 and 5.8; or, as I assume that the reordering is due to the same reasons explained later in figures 14.5 and 14.6, it would suffice to add a reference to that explanation.

Small typo in chapter 1

Hi Claus,

I discovered your book today via social media and I am very much enjoying it (and learning a bunch of stuff along the way!), thank you.

FYI, in chapter 1 there is a small typo:
"Let’s put things into practice. We can take the dataset shown in Table 1.2, map tempterature onto the y axis,"
Obviously it should be "temperature".

Best regards,
Andrew

Typo in description of image 17.4

Figure 17.4: Density estimates of the sepal lengths of three different iris species. By using solid, colored lines we have solved the probme of Figure 17.3 that...

I figure this should be "problem".

CDF Typo in Chapter 7

Both the chapter and section titles, along with one instance in the first paragraph, refer to the ECDF as the empirical cumulative density function instead of the empirical cumulative distribution function. (BTW this book is great)
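For readers unsure why "distribution" is the right word: the ECDF at x is the fraction of observations less than or equal to x, so it accumulates probability mass and is not a density. A minimal sketch, with a hypothetical function name and made-up data:

```python
from bisect import bisect_right


def ecdf(sample):
    """Return a function F where F(x) is the fraction of sample values <= x."""
    ordered = sorted(sample)
    n = len(ordered)
    if n == 0:
        raise ValueError("sample must not be empty")

    def F(x):
        # bisect_right gives the number of sorted values that are <= x
        return bisect_right(ordered, x) / n

    return F


# Made-up data, for illustration only
F = ecdf([2.0, 3.5, 3.5, 5.0])
print(F(1.0), F(3.5), F(5.0))  # 0.0 0.75 1.0
```

The result is a step function that rises from 0 to 1, which is what distinguishes a cumulative distribution function from a density.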

Minor typo in Figure 12.8

draditional -> traditional

Book looks awesome! Looking forward to reading!

Add link to rendered book in the repo description?

Just a suggestion ☺ I see it is in the README but I think that it'd be even better to have it on top as well.

And in any case, thanks for writing this book! 👌👏

Redraw figures for image format chapter

The two manually drawn figures of the image format chapter (https://serialmentor.com/dataviz/image-file-formats.html) should be redrawn, as they currently don't match the style of all other figures in the book:

Use Myriad Pro font
Make sub-plot labels 14pt at 6 inches wide
For the raster vs. vector graphics figure, use a figure that will actually be in the final book. The figure currently shown is outdated and uses an old design.

Potential Typo in Color Pitfalls

https://serialmentor.com/dataviz/color-pitfalls.html

Instead, they will typically have difficulty to distinguish certain types of colors, for example red and green (red–green color-vision deficiency) or blue and green (blue–yellow color-vision deficiency).

Is it supposed to be blue and yellow instead?

Section about continuous color scales/colormaps

I think the book would benefit greatly from a section describing the importance of using perceptually-uniform colormaps.

This is a good resource about this matter: https://bids.github.io/colormap/

Revise chapter: Handling overlapping points

This chapter needs some revisions:

The 2d histogram section should link to the 1d histogram chapter (Visualizing Distributions I).
The contour lines section could do with better examples. The blue jay dataset will work better.
The discussion about trend lines should be linked to the yet-to-be-written chapter about visualizing trends.
Add one example of many contour lines in different colors showing different subsets, labeled "bad": When there are too many different subsets, the resulting figure becomes undecipherable.

Add subsection about memorable figure.

In the "telling a story" figure, it might make sense to add a brief section called "Make a memorable figure". Research has shown that embellished figures can be more memorable than plain figures (e.g., Bateman et al. 2010). I just need a good idea for an embellished figure.

9.2 The case for side-by-side bar charts (and maybe line plots?)

This is a great read through, thank you so much for your hard work and great communication!

I was reading through chapter 9, section 9.2, and I totally agree the side-by-side bar chart is the most logical choice in comparison to stacked bar charts and pie charts. I don't know if you cover this later, but I was thinking to myself I would have actually done a line plot where each line is a company, the x axis is the year, and the y axis is the share percent. My only grievance with this kind of plot is that the overlapping lines could obfuscate trends, whereas the side-by-side bar chart doesn't have that problem. However, it is a bit tougher to see yearly trends of the companies just by tracing the height of a bar across each group (but that's minor). I was hoping to hear what you think of the use of a line plot in this case? Thanks!

Tiny typo

From "where as" to "whereas" in "3 Color scales":

Both states are in the South, they are immediate neighbors, and yet one state (Texas) was the fifth-fastest growing state within the U.S. where as the other was the third slowest growing from 2000 to 2010.

Online version: List of contents cannot be hidden

When viewing the online version of the book from a mobile device like a tablet, the list of contents is relatively wide and cannot be hidden, see here.

Explain correlation examples

The figure showing examples of correlations needs some more explanatory text.
https://serialmentor.com/dataviz/visualizing-associations.html#fig:correlations

Typo s5.2: explicity

The final sentence of section 5.2 should end "explicit y axis" not "explicity axis".

Suggestion for 1:1 ratio line in daily temperature normals plots

I think it would be beneficial to add a translucent, perhaps dotted, 1:1 ratio line which can highlight that Houston is generally warmer than San Diego most of the time.

(Please excuse this poor edit done in Paint)

Cut off axis tick label

In the chapter about ECDFs, one figure has a cut-off axis tick label:
http://serialmentor.com/dataviz/ecdf-qq.html#fig:county-populations-tail-log-log

Types of bad

(continuing a discussion from twitter)

While I like that visualizations are labeled as bad or ugly, it'd be informative to make those designations more consistent and clear.

Here are possible categories:

Wrong - The wrong information is shown on the screen (e.g., log scaled axis where the label also says that it's log - making it double log)
Deceiving - The information may be misperceived unless you pay careful attention (e.g., small multiples with different y-axes)
Imprecise - Not necessarily the wrong information but may not be good for reading/comparing individual values (e.g. pie charts with many slices or stacked bar charts)
Not optimal - Some tasks may be difficult (e.g., difficult to find stuff with out of order bars)
Ugly - Claus doesn't like it (e.g., angled x-axis text)

You'll probably want to combine some of those categories for simplicity.

What's tough is that a lot of these depend on which information a person wants. Stacked bars are imprecise for individual comparison, but do well for comparing the total size of the stack to another stack.

Write chapter: Visualizing trends

This chapter will talk about linear and non-linear fits, moving averages, and detrending. Will also talk about common pitfalls, such as that many smoothers are unreliable or misleading at the edges of the data range.
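The edge pitfall mentioned above can be seen even with the simplest smoother. The following is a minimal sketch of a centered moving average that shrinks its window near the ends of the series; the function name and data are hypothetical, and other smoothers handle edges differently.

```python
def centered_moving_average(values, window):
    """Centered moving average; the window is truncated at the edges.

    Returns a list of (smoothed_value, points_used) pairs so that the
    reduced support near the ends of the series stays visible.
    """
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        chunk = values[lo:hi]
        smoothed.append((sum(chunk) / len(chunk), len(chunk)))
    return smoothed


# A perfectly linear made-up series: the smoother should ideally return it unchanged
trend = centered_moving_average([1, 2, 3, 4, 5], window=3)
print(trend)
# [(1.5, 2), (2.0, 3), (3.0, 3), (4.0, 3), (4.5, 2)]
```

In the interior the linear series is reproduced exactly, but the first and last values are pulled toward the middle (1.5 instead of 1, 4.5 instead of 5) because the window is one-sided there.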

Feature request: direct links to code

Hi Claus,

It would be a really good addition, I think, to see if either it's possible to make figures into links in Rmarkdown, or similarly have footnotes or captions throughout the book, directly linking each fig to its source code. That way, people reading the online version can immediately jump to the code they need to reproduce.

-Stephanie

LaTeX formula

The little formula for log scales in "2.2 Nonlinear Axes" should be (use curly brackets):

$10^{0.5} = \sqrt{10} \approx 3.16$

in order to render correctly.
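As a brief illustration of why the braces matter: in TeX math mode, `^` applies only to the next single token unless a group is given. A minimal comparison:

```latex
% Without braces, only the "0" is raised; ".5" stays on the baseline
$10^0.5$

% With braces, the whole exponent 0.5 is raised
$10^{0.5} = \sqrt{10} \approx 3.16$
```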

Issues to resolve before next full site rebuild

Issues to resolve before the next full site rebuild for public posting:

#39 Explain correlation examples
#40 Basic PCA examples
#44 Discrepancy in the Okabe and Ito palette
#47 Update iris figures in line drawings chapter
#49 Rework data:ink ratio chapter

Principles of figure design - choosing a font

First, piggybacking off of what Jeff said, I'm so very excited for this book. It looks awesome so far!

Realizing it's a work in progress, I was wondering if you had considered explaining how to change/control font on graphs in R/ggplot2. I use the 'extrafont' package to do this, but I would surely be willing to change my approach. This may be outside the scope of the book and apologies if I've missed it in your plan, but figured I'd mention it!

Figure 12.4 speaks about colored labels, but there aren't any.

Typo @ 21.2p2

A general, the program mangers told me, should be able to look at each figure and immediately see how what we were doing was improving upon or exceeding prior capabilities.

This should probably be managers or is it intentional...?

The need to see thought process

What an awesome book Claus!

I just want to take this opportunity to suggest a few things that I believe might add to the book.

See, with subjects like these, I think two things are of great value:

1. Thought process explanation

I think that in the end, you could add a few visualizations where you explain the thought process behind why that is a good choice (subjectively, of course) and what would be the wrong and alternative ways you could have constructed the visualization at hand.

A book that I think really nails this aspect (though not for data vis, but for statistical modelling) is Regression Modeling Strategies, in which the author explains quite nicely the decisions he took for modelling phenomena and what could have been some right choices and what would be some wrong ones. Even though data viz can be somewhat more subjective, I think you could emphasize that aspect and still provide value for people with this idea.

2. Some advice on reporting itself

I think that not enough time is spent on how reports should be developed around data visualizations.
Some principles and examples could help orchestrate a set of many visualizations. I'm thinking of how to coordinate fonts and colors, when to deviate from a chosen color scheme, how to mix different viz on the same type of data (i.e., many visualizations of proportions) in a report, and how to title and annotate viz coherently throughout a report.

So those are my two cents 😄

Congrats on the great book!

Data Viz Human Research and Typography

Hi,

I have finished reading the first two sections of Fundamentals of Data Visualization online and I am really enjoying it. At the moment, I'm using it to create an interactive data visualization to embed in digital scientific papers (https://datavis-demo.herokuapp.com/).

In the final version, I'd love to read about how data visualization research with human subjects supports the arguments you make in the text (as well as more general data viz human research). Experiments supporting certain practices and principles would be great to read about.

I have also been wondering about the role of typography in creating readable data visualizations. Are there any research-supported guidelines? Finally, I found one typo/awkward sentence in chapter 10:

"The archetypal such visualization is the pie chart"

The word "such" is not needed.

Create other language for visualization examples

I read through your preview chapters and I liked them. It would be nice if you used other languages for your examples, like Python.

Write chapter: Visualizing uncertainty

This chapter will discuss various approaches to visualizing uncertainty, such as error bars, confidence bands, credible intervals, posterior distributions, hypothetical outcomes, etc.
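As a concrete instance of the simplest item on that list, here is a minimal sketch of how the endpoints of an error bar around a sample mean might be computed. It assumes a normal approximation with z = 1.96 for a roughly 95% interval, which is only one of several conventions (for small samples a t quantile is usually more appropriate). The function name and data are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def mean_with_error_bar(sample, z=1.96):
    """Return (mean, lower, upper) using mean +/- z * standard error.

    Normal approximation only; this is an assumption of the sketch,
    not a recommendation from the book.
    """
    n = len(sample)
    if n < 2:
        raise ValueError("need at least two observations")
    m = mean(sample)
    standard_error = stdev(sample) / sqrt(n)
    return m, m - z * standard_error, m + z * standard_error


# Made-up measurements, for illustration only
center, lower, upper = mean_with_error_bar([1.0, 2.0, 3.0, 4.0, 5.0])
print(f"mean = {center:.2f}, interval = [{lower:.2f}, {upper:.2f}]")
```

Whatever convention is used, the figure caption should say what the bars represent (standard deviation, standard error, or a confidence interval), since they look identical on the page.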

Things to be completed for final draft

New chapters

#52 Visualizing trends
#53 Visualizing uncertainty
#54 Visualizing geospatial data

Substantially revised chapters

#63 Finalize directory of visualizations
#55 Revise "Handling overlapping points"
#48 Add section on memorable figures

Minor issues

#59 Swap first and second figure in figure captions chapter.
#56 Redraw figures in image format chapter
#61 Revisit Tufte-style bar graphs
#62 Replace volcano image
#64 Fix sina plots
#65 Fix margins in ridgeline plots
#66 Attribute data sources in all figure captions
#74 Add a beeswarm plot?

include the figure cited from The Economist 2011

In section 1 of chapter 19, "Figure titles and captions", the dataviz "Corrosive corruption" from The Economist is cited and a point is made for deviations from its design in figure 19.1.
IMHO not having the original available is a problem because it does not allow for a quick and easy comparison with the proposed changes.
It might be worth asking The Economist for permission to include the figure in the book... (the worst you can get is a no ;-)

Finalize directory of visualizations chapter

Write chapter: Visualizing geospatial data

This chapter will provide a basic intro to making maps. Topics to be addressed are projections and choropleths. In particular, will discuss how choropleths can be misleading when different geographic regions have different sizes, and how to work around this issue.

Review recommendation to avoid Tufte-style broken bar plots

The isotype paper by Haroz et al (http://steveharoz.com/research/isotype/ISOTYPE_Visualization_CHI2015_Haroz_Kosara_Franconeri.pdf) recommends Tufte-style gridlines because they help with visually assessing length. In my chapter on data and ink, I currently recommend against this style (Fig. 20.12). This section needs to be revisited and likely revised.

Enable HTTPS?

Hi Claus, I wonder if it is too much trouble for you to enable HTTPS for your website https://serialmentor.com/dataviz/ (you may consider Netlify if you have not used it). I'm asking because I wish to list this book on the homepage of bookdown.org. Thank you!

Section 2.2 Log Scale Explanation

The text above Figure 2.4 reads:

$100^{0.5} \approx 31.6$

However

$100^{0.5} = 10$
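For reference, the nearby powers work out as follows. The issue does not say which expression the book intended; the value 31.6 matches $10^{1.5}$, so that is one plausible reading, stated here as an assumption.

```latex
\begin{align*}
10^{0.5}  &= \sqrt{10} \approx 3.16 \\
10^{1.5}  &= 10 \cdot \sqrt{10} \approx 31.6 \\
100^{0.5} &= \sqrt{100} = 10
\end{align*}
```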

Update iris figures in line drawings chapter

The figures using the iris dataset in the chapter on line drawings should be updated to look like the figures in the chapter on redundant coding. This means species names should be spelled out fully ("Iris setosa" instead of "setosa") and put in italics.

Dimensional reduction for time series

The end of the time series chapter mentions dimensional reduction. Here's a possibly useful reference http://www.aviz.fr/~bbach/timecurves/

typos

Since you asked

Section 1: First sentence parallel structure, 'convert' -> 'converting'

Section 1.2: 'tempterature'

Section 6.2: 'useles'

Section 6.3: 'acutal'

Section 7.1: 'wisker' -> 'whisker'

Section 14.2: 'lifes'

Basic PCA examples

Add a low-dimensional PCA example, maybe using the blue jays dataset. Plot head-length vs. body mass and then draw PC1 & 2 into that plot, and then plot PC2 vs PC1.
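The two-variable case proposed here is small enough to compute by hand. Below is a minimal sketch of a 2D PCA that returns the center, the PC1 and PC2 directions (for drawing into the scatter plot), and the scores (for the PC2 vs. PC1 plot). The function name and the numbers are hypothetical, not the blue jays data, and in practice variables with different units such as head length and body mass would usually be standardized first.

```python
from math import atan2, cos, sin


def pca_2d(xs, ys):
    """PCA for two variables via the closed-form principal-axis angle.

    Returns (center, pc1, pc2, scores), where pc1 and pc2 are unit
    vectors and scores holds each point's (PC1, PC2) coordinates.
    """
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need two equally long samples with n >= 2")
    mx, my = sum(xs) / n, sum(ys) / n
    dx = [x - mx for x in xs]
    dy = [y - my for y in ys]
    # Entries of the 2x2 sample covariance matrix
    sxx = sum(a * a for a in dx) / (n - 1)
    syy = sum(b * b for b in dy) / (n - 1)
    sxy = sum(a * b for a, b in zip(dx, dy)) / (n - 1)
    # Angle of the major axis of a symmetric 2x2 matrix
    theta = 0.5 * atan2(2 * sxy, sxx - syy)
    pc1 = (cos(theta), sin(theta))
    pc2 = (-sin(theta), cos(theta))
    scores = [
        (a * pc1[0] + b * pc1[1], a * pc2[0] + b * pc2[1])
        for a, b in zip(dx, dy)
    ]
    return (mx, my), pc1, pc2, scores


# Made-up points lying exactly on y = 2x, so all variance falls on PC1
center, pc1, pc2, scores = pca_2d([1, 2, 3, 4], [2, 4, 6, 8])
print(pc1, pc2)
```

To draw the components into the original scatter plot, one would plot line segments starting at `center` in the directions `pc1` and `pc2`; the second plot is simply the `scores` pairs.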

typo in last sentence of second paragraph of section 9.2

"This is a general problem of stacked-bar plots, and the main reason why I normally not recommend this type of visualization."

I think there is a missing "do" in that sentence: "I normally do not recommend"