At the invitation of a friend of mine, I showed up to a classroom in Pomona's science building at 11 AM today and spent two hours creating a visualization for the week's #tidytuesday dataset. The visualization was straightforward:
But it's interesting to think about how much previous learning and experience went into these two straightforward hours: basic Python/Pandas learned in a Kaggle tutorial two summers ago, d3 in the spring before that, and tons of React/NextJS knowledge ultimately wrapping everything together. Plus -- new learning about setting up local Jupyter notebooks, exporting JSON files from Pandas, and plotting maps in d3!
Every week the R for Data Science community releases a dataset online, encouraging R learners to create a visualization based on the dataset and post it on Twitter.
This week, the dataset was a giant CSV of about 60,000 "alternative fuel" stations across the U.S., from the Department of Energy.
I'm primarily interested in web-based data visualizations, so instead of using R I use d3.js and React for visualizations and Python for data processing.
The first thing I did when I sat down at 11 AM was set up a local Jupyter notebook and import pandas. I don't do much Python development so I was initially a bit terrified of setting up any sort of Python dev environment, but it turned out to be straightforward: I ran pip install jupyter, added the Python scripts directory to my PATH, and one jupyter notebook command later I was up and running at localhost:8888. Turns out I already had Pandas installed.
There are a few dozen columns in the stations.csv file provided, but as this was my first Tidy Tuesday and I hadn't done real data visualization work in a while, I decided to keep it simple: a plot of stations on a map of the U.S., with React controls to filter them down and updating stats on the side.
I downloaded the stations.csv and used pandas to export a JSON file with only three columns: X, Y, and FUEL_TYPE_CODE. The first two are coordinates (X is longitude, Y is latitude), and the last is a code like ELEC or E85 corresponding to labels like Electric and Ethanol (E85). The Python code looked like this:
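import pandas as pd

# load the full dataset, keep only coordinates and fuel type, export as a list of records
df_raw = pd.read_csv("stations.csv")
df_locs_with_types = df_raw[["X", "Y", "FUEL_TYPE_CODE"]].copy()
df_locs_with_types.to_json("locs-with-types.json", orient="records")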
et voila:
[
{"X":-86.2670210002,"Y":32.3679160003,"FUEL_TYPE_CODE":"CNG"},
{"X":-84.3988370001,"Y":33.7458430003,"FUEL_TYPE_CODE":"CNG"},
{"X":-84.3674609996,"Y":33.8219110003,"FUEL_TYPE_CODE":"CNG"},
{"X":-84.5438220004,"Y":33.7602560001,"FUEL_TYPE_CODE":"CNG"},
...
]
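The "updating stats on the side" come down to tallying records like these. Here is a minimal sketch, assuming the records are loaded as a plain array. The Station type and countByFuelType helper are hypothetical names, not taken from the repo, and the third sample row is made up for illustration:

```typescript
// Shape of one record in locs-with-types.json (field names from the export above).
type Station = { X: number; Y: number; FUEL_TYPE_CODE: string };

// Hypothetical helper: tally how many stations there are per fuel type code.
function countByFuelType(stations: Station[]): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const s of stations) {
    counts[s.FUEL_TYPE_CODE] = (counts[s.FUEL_TYPE_CODE] || 0) + 1;
  }
  return counts;
}

// Two rows from the export above, plus one made-up ELEC row.
const sample: Station[] = [
  { X: -86.2670210002, Y: 32.3679160003, FUEL_TYPE_CODE: "CNG" },
  { X: -84.3988370001, Y: 33.7458430003, FUEL_TYPE_CODE: "CNG" },
  { X: -118.0, Y: 34.0, FUEL_TYPE_CODE: "ELEC" }, // made-up coordinates
];

const fuelCounts = countByFuelType(sample);
console.log(fuelCounts);
```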
I had never done map-based visualizations in d3 before, though I had seen some examples. At first I simply applied a linear scale to the X and Y coordinates, then remembered that longitude and latitude coordinates aren't Cartesian and need to be projected in a relatively complex way (compared to a linear scale, anyways) to fit on a 2D plane.
Thankfully I soon discovered d3's built-in geo tooling. In particular, d3.geoAlbersUsa() creates a projection function (takes in globe coordinates, spits out 2D coordinates) that even takes care of placing Alaska and Hawaii in insets. I combined this with Mike Bostock's TopoJSON US shapefile (derived from the Census Bureau's shapefiles) to get a US map up on the screen.
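To make "projection function" concrete, here is a rough sketch of the Albers equal-area conic formula, the kind of projection that d3.geoAlbersUsa() composes. This is not d3's implementation: the standard parallels and origin below are assumed values chosen to suit the contiguous U.S., the result is unscaled, and there are no Alaska or Hawaii insets.

```typescript
const toRad = (deg: number): number => (deg * Math.PI) / 180;

// Assumed parameters (illustrative, suited to the contiguous U.S.).
const PHI1 = toRad(29.5); // first standard parallel
const PHI2 = toRad(45.5); // second standard parallel
const LAMBDA0 = toRad(-96); // central meridian
const PHI0 = toRad(38.7); // latitude of the projection origin

// Albers equal-area conic projection on a unit sphere.
// Takes [longitude, latitude] in degrees and returns unscaled [x, y],
// with x pointing east and y pointing north.
function albers([lon, lat]: [number, number]): [number, number] {
  const n = (Math.sin(PHI1) + Math.sin(PHI2)) / 2;
  const c = Math.cos(PHI1) * Math.cos(PHI1) + 2 * n * Math.sin(PHI1);
  const rho0 = Math.sqrt(c - 2 * n * Math.sin(PHI0)) / n;
  const rho = Math.sqrt(c - 2 * n * Math.sin(toRad(lat))) / n;
  const theta = n * (toRad(lon) - LAMBDA0);
  return [rho * Math.sin(theta), rho0 - rho * Math.cos(theta)];
}

// First station from the export above: east of the central meridian, south of the origin.
const [px, py] = albers([-86.2670210002, 32.3679160003]);
console.log(px, py);
```

Unlike a linear scale, x here depends on latitude as well as longitude, which is why meridians fan out on the finished map. A screen-ready projection would additionally scale and translate the result and flip y so that it increases downward.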
With a projection function set up, it was then easy to put dots for each station on the map. All told, the basic d3 code looks something like this:
import * as topojson from "topojson-client";
import * as d3 from "d3";
import us from "../data/2022-03-01/counties-10m.json"; // bostock's us-atlas data
import data from "../data/2022-03-01/locs-with-types.json"; // my data from Python

const w = 800, h = 500;

// set up projection functions
const projection = d3.geoAlbersUsa()
    .translate([w/2, h/2])
    .scale(1000);
const path = d3.geoPath()
    .projection(projection);

// set up main SVG (a peek of React code here -- more on this later)
const svg = d3.select(svgRef.current);
svg.attr("width", w).attr("height", h);

// draw map
svg.append("path").attr("d", path(topojson.feature(us, us.objects.nation))).style("stroke", "black");

// draw stations
svg.selectAll(".point")
    .data(data)
    .enter()
    .append("circle")
    .attr("class", "point")
    .attr("r", 2)
    .attr("cx", d => {
        const projectedPoints = projection([d.X, d.Y]);
        return projectedPoints && projectedPoints[0];
    })
    .attr("cy", d => {
        const projectedPoints = projection([d.X, d.Y]);
        return projectedPoints && projectedPoints[1];
    })
    .attr("fill", "red");
The filtering part of React was pretty straightforward: I created a list of buttons controlled by a string useState, then hooked that state variable up to a useEffect that controlled d3. I won't include that part of the code here but you can find it in the GitHub repo.
Hooking d3 up to React was the slightly tricky part. d3 manipulates the DOM directly, which we accommodate by putting the d3 code in a useEffect hook that will only run once the page (and thus DOM) loads. But we only want the initial render code to run once, and different update code to run after that. To accomplish this, I created a ref called didMount storing a boolean that would be set to true on the first useEffect run:
const didMount = useRef<boolean>(false);

useEffect(() => {
    if (!didMount.current) {
        // run initial code
        didMount.current = true;
    } else {
        // run update code
    }
}, [stateVarThatTriggersUpdate]);
This is a common trick to prevent useEffect code from running on initial mount that I've adapted to run different code here.
The actual d3 code I put in the conditionals is rather crude, as I was running out of time. Instead of performing a proper update, I condensed my station-drawing code into a function, then removed all circles and re-ran the function with new data on update. Here's that same useEffect code with a bit of pseudocode:
const didMount = useRef<boolean>(false);

useEffect(() => {
    if (!didMount.current) {
        // initial render: set up the SVG and draw every station
        const svg = setUpSVG();
        appendCircles(svg, originalData);
        didMount.current = true;
    } else {
        // run update code: remove the old circles, then redraw the filtered stations
        const svg = d3.select(svgRef.current);
        svg.selectAll(".point").remove();
        const filteredData = originalData.filter(d => d.FUEL_TYPE_CODE === selectedFuel);
        appendCircles(svg, filteredData);
    }
}, [selectedFuel]);
And with that, the interactive visualization comes together. Again, you can see the real, full, messier code in the GitHub repo.
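For contrast with the remove-and-redraw approach, a "proper update" would work out which circles to add, keep, and remove rather than starting over. The sketch below shows that bookkeeping in plain TypeScript, roughly the idea behind a keyed data join. It is not d3 code and not from the repo: StationPoint, stationKey, and diffStations are hypothetical names, and keying on coordinates plus fuel type is an assumption, since the export has no ID column.

```typescript
type StationPoint = { X: number; Y: number; FUEL_TYPE_CODE: string };

// Assumed key: coordinates plus fuel type (the exported JSON has no ID column).
const stationKey = (s: StationPoint): string => `${s.X},${s.Y},${s.FUEL_TYPE_CODE}`;

// Split a change of dataset into stations to add (enter), keep (update), and remove (exit).
function diffStations(
  prev: StationPoint[],
  next: StationPoint[]
): { enter: StationPoint[]; update: StationPoint[]; exit: StationPoint[] } {
  const inPrev: Record<string, boolean> = {};
  const inNext: Record<string, boolean> = {};
  prev.forEach((s) => { inPrev[stationKey(s)] = true; });
  next.forEach((s) => { inNext[stationKey(s)] = true; });
  return {
    enter: next.filter((s) => !inPrev[stationKey(s)]),
    update: next.filter((s) => inPrev[stationKey(s)]),
    exit: prev.filter((s) => !inNext[stationKey(s)]),
  };
}

// Two rows from the export above, plus one made-up ELEC row.
const allStations: StationPoint[] = [
  { X: -86.2670210002, Y: 32.3679160003, FUEL_TYPE_CODE: "CNG" },
  { X: -84.3988370001, Y: 33.7458430003, FUEL_TYPE_CODE: "CNG" },
  { X: -118.0, Y: 34.0, FUEL_TYPE_CODE: "ELEC" }, // made-up coordinates
];

// Going from "everything" to "ELEC only": nothing to add, one circle stays, two go away.
const onlyElec = allStations.filter((s) => s.FUEL_TYPE_CODE === "ELEC");
const stationDiff = diffStations(allStations, onlyElec);
console.log(stationDiff.enter.length, stationDiff.update.length, stationDiff.exit.length);
```

With 60,000 points, touching only the circles that actually change should be cheaper than tearing everything down on every click, though I haven't measured it here.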
I quickly added some static React code, including a homepage, and deployed the site on Vercel. Such is the advantage of d3-based data visualization: a bit more setup for a much more streamlined web deployment.
Most of what I included above I either newly learned or re-learned today. None of it is completely foreign to me -- it's all a natural extension of my existing Python and JavaScript knowledge -- but nevertheless I was able to pick up some new, very practical skills while gaining the satisfaction of visualizing a real dataset, all in the span of two hours (and one more writing this blog post).
This visualization was extremely simple and arguably lacks real functional and aesthetic value, but I feel like I'm slowly collecting the building blocks I need to make bigger and better things. Hell, I spent an hour just setting up Python and re-figuring out React/d3 code from my old repos today -- even next week's visualization from me should be a lot better.
I also love the attitude of Prof. Hardin -- the Pomona professor running the classroom group sessions -- towards the project: "the goal is to create one plot while you're here." She worked alongside us and created an animated plot of the construction of each type of station over time in R.
With that I'll bring Tidy Tuesday #1 to a close!