After a two-week break due to being in Finland, I'm back for a second #tidytuesday visualization! This week's visualization, made in Python, d3 and React in two hours: the most popular US baby names since 1880.
From my first #tidytuesday post:
Every week the R for Data Science community releases a dataset online, encouraging R learners to create a visualization based on the dataset and post it on Twitter.
I ignore the "learn R" part of the instructions and constrain myself to two hours per visualization. This week's data contained a CSV of the top 1000 baby names per sex per year from 1880 to 2017 in the following format:
year,sex,name,n,prop
1880,F,Mary,7065,0.07238359
1880,F,Anna,2604,0.02667896
1880,F,Emma,2003,0.02052149
1880,F,Elizabeth,1939,0.01986579
...
I started by firing up a Jupyter notebook to turn the original 48 MB CSV into smaller JSON files I could actually load into a webpage.
I split the data by sex, then chained some pandas functions together to get a new table of the top names overall from 1880-2017:
import pandas as pd

# load the original 48 MB CSV (filename assumed here)
df = pd.read_csv("babynames.csv")

df_male = df[df.sex == "M"]
name_counts_male = df_male[["name", "n"]].groupby(["name"]).sum().sort_values("n", ascending=False)
# new df with index column "name" and value column "n" summed from 1880-2017
top_male_names = name_counts_male.head(20).reset_index()["name"].to_numpy()
# array(['James', 'John', 'Robert', 'Michael', ...])
I could then take the top 20 names and filter down the overall dataset into a 180 KB JSON file with just rows for those names.
df_male_top = df_male.query("name in @top_male_names")
df_male_top.to_json("./male-top.json", orient="records")
# male-top.json now looks like:
[
{"year":1880,"sex":"M","name":"John","n":9655,"prop":0.08154561},
{"year":1880,"sex":"M","name":"William","n":9532,"prop":0.08050676},
{"year":1880,"sex":"M","name":"James","n":5927,"prop":0.05005912},
...
]
I did the same for the female data.
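Since the two pipelines are identical apart from the sex filter, here's a small sketch of how both exports could share one helper (export_top_names is my hypothetical name, not something from the original notebook):

def export_top_names(df, sex, out_path, k=20):
    # filter to one sex, find the top k names overall, keep only those rows
    df_sex = df[df.sex == sex]
    counts = df_sex[["name", "n"]].groupby("name").sum().sort_values("n", ascending=False)
    top = counts.head(k).reset_index()["name"].to_numpy()
    df_sex.query("name in @top").to_json(out_path, orient="records")

export_top_names(df, "M", "./male-top.json")
export_top_names(df, "F", "./female-top.json")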
I was able to re-use a good amount of react/d3 boilerplate from the previous week's setup, so check out that post or the GitHub repo if you're curious.
Pretty quickly after seeing the data, I had an idea of the graph I wanted to make, inspired by ones I'd seen from previous #tidytuesday weeks online.
The actual visualization code essentially boiled down to two paths per name, one stroked and one filled:
nameGroups.append("path")
.datum(d => dataTop.filter(x => x.name === d))
.attr("fill", "none")
.attr("stroke", "white")
.attr("stroke-width", 4)
.attr("d", d3.line()
.x(d => padding + labelWidth + xScale(d.year) + xOffset)
.y(d => padding + titleHeight + (topNames.findIndex(x => x === d.name)) * labelHeight + yScale(d.n))
);
nameGroups.append("path")
.datum(d => dataTop.filter(x => x.name === d))
.attr("fill", "steelBlue")
.attr("stroke", "none")
.attr("d", d3.area()
.x(d => padding + labelWidth + xScale(d.year) + xOffset)
.y0(d => padding + titleHeight + (topNames.findIndex(x => x === d.name) + 1) * labelHeight)
.y1(d => padding + titleHeight + (topNames.findIndex(x => x === d.name)) * labelHeight + yScale(d.n))
);
...and that's as far as I got in two hours.
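For reference, the two paths above lean on a handful of scales and layout constants defined elsewhere in the component. A minimal sketch of what those could look like, with names taken from the snippet but domains and values that are my guesses:

// layout constants used by the path code above; the exact values here are illustrative
const padding = 20, titleHeight = 60, labelWidth = 80, labelHeight = 40, xOffset = 0;
const chartWidth = 800;

// horizontal position within a row: year -> x
const xScale = d3.scaleLinear()
    .domain([1880, 2017])
    .range([0, chartWidth - labelWidth - 2 * padding]);

// vertical position within a row: count -> offset from the row's top edge,
// so the biggest count touches the top and zero sits on the row's baseline
const yScale = d3.scaleLinear()
    .domain([0, d3.max(dataTop, d => d.n)])
    .range([labelHeight, 0]);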
I'm hoping to get faster at building things in d3, and consequently to make more interesting things, especially more interactive ones. The d3 part of this visualization took at most an hour of today's work, with the rest spent re-learning how to use pandas and attempting to set up a JS kernel for Jupyter notebooks (the kernel and notebook worked, but JS was way too inefficient for the amount of data involved).
For this visualization, the obvious interactive element to add would be letting the viewer specify the year range under consideration. The top 20 names displayed are the top 20 of all time (within the dataset); if you looked only at the last 20 years, or the 20 years after WWII, or any other slice, you'd get a much different list. The technical challenge there is dynamically filtering through 48 MB of data. The solution I was planning to implement but didn't get around to: pre-export a few different chunks of data (e.g. the full set, post-1900, post-1945, post-1990, and post-2010) and simply swap between them rather than filtering in real time.
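A minimal sketch of that pre-export step, reusing the hypothetical export_top_names helper from earlier (the cutoff years mirror the chunks listed above; the file naming is my own):

# pre-export one pair of JSON files per year range so the frontend can swap
# files on selection instead of filtering 48 MB client-side
cutoffs = {"all": 1880, "post-1900": 1900, "post-1945": 1945, "post-1990": 1990, "post-2010": 2010}

for label, start_year in cutoffs.items():
    df_chunk = df[df.year >= start_year]
    for sex, prefix in [("M", "male"), ("F", "female")]:
        export_top_names(df_chunk, sex, f"./{prefix}-top-{label}.json")

On the React side, changing the year range would then amount to fetching the matching pre-built file, sidestepping real-time filtering entirely.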
There are also plenty of follow-up questions to dive into: