Resources
Weeks 10 and 11: Visualization for Big Data and Machine Learning
In class during Week 11 I mentioned two resources, and added a third via email after class, so I wanted to collect them all in one place here:
Number of Datapoints vs. Number of Dimensions
To “unify” the topics of Week 10 vs. Week 11 I talked about how, if \(n\) is our number of datapoints (e.g., sample size) and \(p\) is the dimensionality of our data (how many “pieces of information” we have about each of the \(n\) datapoints), then
- Classical statistical methods were developed to handle settings where \(n \gg p\)[1], so that we have a small number of parameters that we fit (estimate) using a large number of datapoints. Basic two-variable linear regression models, for example, can take in \(n \gg 2\) datapoints and estimate a line of best fit, where the line itself is characterized by its two parameters, slope and intercept (\(m\) and \(b\) in \(y = mx + b\)).
- In modern machine learning, however, scientists began to wrestle with an additional problem, that of extremely high-dimensional data, where the situation is in a sense the “opposite” of the classical-statistics scenario: now we have \(p \gg n\). In the modern field of biostatistics, for example, we talked in class about how the human genome contains about 6 billion nucleotide pairs, while researchers in the UK have mapped out genomes for 500,000 people, so that the number of dimensions we have for each person is \(\frac{\text{6 billion}}{500{,}000} = 12{,}000\) times the number of people we have…
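To make the two regimes a bit more concrete, here is a minimal Python sketch (standard library only). Everything in it is an illustrative assumption rather than something from class: the helper name `fit_line`, the seed, the "true" slope and intercept, and the noise level are all made up. It fits the two parameters of \(y = mx + b\) from \(n = 1000\) simulated datapoints, then shows what goes wrong when \(n = 1 < p = 2\), and finally redoes the genome arithmetic from above.

```python
import random


def fit_line(xs, ys):
    """Closed-form ordinary least squares for y = m*x + b."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sum of squared deviations of x, and sum of cross-deviations of x and y
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    m = sxy / sxx           # slope estimate
    b = y_bar - m * x_bar   # intercept estimate
    return m, b


# --- n >> p: many datapoints, only p = 2 parameters to estimate ---
rng = random.Random(5200)      # arbitrary seed, just for reproducibility
true_m, true_b = 3.0, -1.0     # made-up "true" slope and intercept
xs = [rng.uniform(0, 10) for _ in range(1000)]
ys = [true_m * x + true_b + rng.gauss(0, 1) for x in xs]
m_hat, b_hat = fit_line(xs, ys)
print(f"Estimated slope: {m_hat:.3f}, estimated intercept: {b_hat:.3f}")

# --- n < p: one datapoint cannot pin down two parameters ---
underdetermined = False
try:
    fit_line([2.0], [5.0])     # infinitely many lines pass through one point
except ZeroDivisionError:
    underdetermined = True
print("Single-point fit is underdetermined:", underdetermined)

# --- p >> n: the genome arithmetic from the bullet above ---
nucleotide_pairs = 6_000_000_000
people = 500_000
ratio = nucleotide_pairs // people
print("Dimensions per person / number of people:", ratio)
```

With a thousand datapoints the two estimates land close to the values used to simulate the data, whereas the single-point case fails outright because there is no variation in \(x\) to estimate a slope from. That failure is a tiny version of the \(p \gg n\) problem.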
For me, this notion of thinking about \(n \gg p\) versus \(p \gg n\) comes from learning statistics through an awesome textbook that is freely available in PDF format (legally!) here: https://hastie.su.domains/ElemStatLearn/
Week 6: Plotly
- Here is a direct link to the starter code for Lab 5 that I began writing at the end of class today!
Week 5: Observable JS and jQuery
For Week 4 I spent some time at the chalkboard and on demos in the browser, as “sidebars” to the main content in the slides, so I wanted to share some of this sidebar material here!
ObservableJS
- The ObservableJS demo I showed in class was originally made for a microeconomics course I TAed back in the day, but it is relevant to you as DSAN 5200 students if you reveal the JS code “behind” each of the UI elements (the sliders in this case) and look at how changing some aspect of the data in these JS cells automatically updates the visualization to reflect these changes.
- (For more of my Observable examples—as in, examples where I can therefore walk you through exactly how they work!—feel free to look at some of the other notebooks in my ObservableHQ account from that same course, like this one)
- More generally, though, ObservableHQ has a really rich set of tutorials covering a bunch of different use cases! At the bottom of that page, for example, you’ll see some which are tailored specifically to Python and Matplotlib users.
- One resource I remember helping me a lot was the official set of Observable resources on how inputs (sliders, checkboxes, etc.) work in OJS. To me this subtopic is especially helpful/fundamental since, once you see how data from the input elements (like sliders or checkboxes) flows into all other code cells which use this data as inputs, the notion of “topological data flow” that I mentioned in class today will hopefully feel less abstract.
- Lastly, though, if you’re like me and the weirdness around ordering of code cells in OJS feels very bizarre at first, I think this notebook is really really helpful for clarifying that in a way I didn’t have time to walk through in detail today. I basically keep that link open in a tab whenever I work in OJS, so I can go back and reference it when I get confused about why one cell ran before/after another.
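If the “topological data flow” idea still feels abstract, here is a rough Python analogy. It is not how Observable is actually implemented, just a sketch under my own assumptions. The `cells` dictionary and the `run` helper are hypothetical names invented for this example, and the `width` and `height` cells stand in for slider inputs. Cells are deliberately listed “out of order”, and the standard-library `graphlib.TopologicalSorter` works out an order in which every cell runs after the cells it depends on.

```python
from graphlib import TopologicalSorter

# A hypothetical "notebook": cells in the order they appear on the page.
# Each entry maps a cell name to (names it depends on, function computing it).
cells = {
    "area": (("width", "height"), lambda width, height: width * height),
    "label": (("area",), lambda area: f"Area = {area}"),
    "width": ((), lambda: 4),    # stands in for a slider input
    "height": ((), lambda: 3),   # stands in for a second slider input
}


def run(cells):
    """Evaluate every cell in dependency order rather than page order."""
    # TopologicalSorter expects a mapping of node -> set of predecessors
    graph = {name: set(deps) for name, (deps, _) in cells.items()}
    order = list(TopologicalSorter(graph).static_order())
    values = {}
    for name in order:
        deps, fn = cells[name]
        # All dependencies are guaranteed to have been computed already
        values[name] = fn(*(values[d] for d in deps))
    return order, values


order, values = run(cells)
print("Execution order:", order)
print(values["label"])

# "Dragging the slider": change the width input and re-run.
# (Simplification: this sketch recomputes every cell, not only the
# cells downstream of the one that changed.)
cells["width"] = ((), lambda: 10)
_, updated = run(cells)
print(updated["label"])
```

The point of the analogy is that `area` and `label` appear first on the “page” but run last, because the order is determined by which cells feed into which, not by where they sit in the document.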
jQuery
- The jQuery website has a “Learn” section here. It’s probably overkill, though, since jQuery is so popular and widely used (though falling out of favor to some extent, for various reasons) that you should be able to use Google and StackOverflow to find answers to most issues that arise while learning the basics and then using it.
Core Texts, Weeks 1-4
Wilkinson, The Grammar of Graphics (2006)
A key resource for the course, in terms of weaving many of the seemingly-disparate “strands” of the course together, is Leland Wilkinson’s book The Grammar of Graphics (Wilkinson 2006). Since it’s extremely hard to find in bookstores and extremely large and difficult to carry around even if you do find it, for educational purposes you can access a PDF of the book using this link.
Nathan Yau, Visualize This! (2011) and Data Points (2013)
Another pair of resources that we’ll be drawing on a lot throughout the semester! Way less technical than Wilkinson, and way easier/cheaper to find at bookstores or online, so I feel much less good posting these online… But if your situation is anything like mine was when I was a grad student, and you’re living off approx. $0 per month of disposable income, here are EPUB copies of both books (which I am ethically justifying by reference to grad-student-Jeff’s experience of professors assigning 5 books/week without a single PDF or EPUB in sight 😵, along with the fact that all 4 profs for the course are paid subscribers to Yau’s blog!)
Edward Tufte, The Visual Display of Quantitative Information (1983)
Tufte (1983) is (in my experience, at least) much easier to find than any of the books above, but in varying conditions of used-ness, so here is a PDF of this one as well.
(Tufte also has more books than the previous two authors, and all of them are interesting and informative, so if you like this one you can go out and buy/obtain the others as well!)
References
Footnotes
[1] Here the \(\gg\) symbol can be read as “much greater than” (by analogy to the standard \(>\) for “greater than”).