After a two-week break due to being in Finland, I'm back for a second #tidytuesday visualization! This week's visualization, made in Python, d3 and React in two hours: the most popular US baby names since 1880.
From my first #tidytuesday post:
Every week the R for Data Science community releases a dataset online, encouraging R learners to create a visualization based on the dataset and post it on Twitter.
I ignore the "learn R" part of the instructions and constrain myself to two hours per visualization. This week's data contained a CSV of the top 1000 baby names per sex per year from 1880 to 2017 in the following format:
year,sex,name,n,prop
1880,F,Mary,7065,0.07238359
1880,F,Anna,2604,0.02667896
1880,F,Emma,2003,0.02052149
1880,F,Elizabeth,1939,0.01986579
...
I started by firing up a Jupyter notebook to turn the original 48 MB CSV into smaller JSON files I could actually load into a webpage.
I split the data by sex, then chained some pandas functions together to get a new table of the top names overall from 1880-2017:
import pandas as pd

# load the original 48 MB CSV (filename assumed here)
df = pd.read_csv("babynames.csv")

df_male = df[df.sex == "M"]
name_counts_male = df_male[["name", "n"]].groupby(["name"]).sum().sort_values("n", ascending=False)
# new df with index column "name" and value column "n" summed from 1880-2017
top_male_names = name_counts_male.head(20).reset_index()["name"].to_numpy()
# array(['James', 'John', 'Robert', 'Michael', ...])
I could then take the top 20 names and filter down the overall dataset into a 180 KB JSON file with just rows for those names.
df_male_top = df_male.query("name in @top_male_names")
df_male_top.to_json("./male-top.json", orient="records")
# male-top.json now looks like:
[
{"year":1880,"sex":"M","name":"John","n":9655,"prop":0.08154561},
{"year":1880,"sex":"M","name":"William","n":9532,"prop":0.08050676},
{"year":1880,"sex":"M","name":"James","n":5927,"prop":0.05005912},
...
]
I did the same for the female data.
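Since the two pipelines are identical apart from the sex filter, here's a small sketch of how both exports could share one helper (export_top_names is my hypothetical name, not something from the original notebook):

def export_top_names(df, sex, out_path, k=20):
    # filter to one sex, find the top k names overall, keep only those rows
    df_sex = df[df.sex == sex]
    counts = df_sex[["name", "n"]].groupby("name").sum().sort_values("n", ascending=False)
    top = counts.head(k).reset_index()["name"].to_numpy()
    df_sex.query("name in @top").to_json(out_path, orient="records")

export_top_names(df, "M", "./male-top.json")
export_top_names(df, "F", "./female-top.json")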
I was able to re-use a good amount of react/d3 boilerplate from the previous week's setup, so check out that post or the GitHub repo if you're curious.
Pretty quickly after seeing the data, I had an idea of the graph I wanted to make, inspired by ones I'd seen from previous #tidytuesday weeks online.
The actual visualization code essentially boiled down to two paths per name, one stroked and one filled:
nameGroups.append("path")
.datum(d => dataTop.filter(x => x.name === d))
.attr("fill", "none")
.attr("stroke", "white")
.attr("stroke-width", 4)
.attr("d", d3.line()
.x(d => padding + labelWidth + xScale(d.year) + xOffset)
.y(d => padding + titleHeight + (topNames.findIndex(x => x === d.name)) * labelHeight + yScale(d.n))
);
nameGroups.append("path")
.datum(d => dataTop.filter(x => x.name === d))
.attr("fill", "steelBlue")
.attr("stroke", "none")
.attr("d", d3.area()
.x(d => padding + labelWidth + xScale(d.year) + xOffset)
.y0(d => padding + titleHeight + (topNames.findIndex(x => x === d.name) + 1) * labelHeight)
.y1(d => padding + titleHeight + (topNames.findIndex(x => x === d.name)) * labelHeight + yScale(d.n))
);
...and that's as far as I got in two hours.
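For reference, the two paths above lean on a handful of scales and layout constants defined elsewhere in the component. A minimal sketch of what those could look like, with names taken from the snippet but domains and values that are my guesses:

// layout constants used by the path code above; the exact values here are illustrative
const padding = 20, titleHeight = 60, labelWidth = 80, labelHeight = 40, xOffset = 0;
const chartWidth = 800;

// horizontal position within a row: year -> x
const xScale = d3.scaleLinear()
    .domain([1880, 2017])
    .range([0, chartWidth - labelWidth - 2 * padding]);

// vertical position within a row: count -> offset from the row's top edge,
// so the biggest count touches the top and zero sits on the row's baseline
const yScale = d3.scaleLinear()
    .domain([0, d3.max(dataTop, d => d.n)])
    .range([labelHeight, 0]);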
I'm hoping to get faster at building things in d3, and consequently to make more interesting things, especially more interactive ones. The d3 part of this visualization took at most an hour of today's work, with the rest spent re-learning how to use pandas and attempting to set up a JS kernel for Jupyter notebooks (the kernel and notebook worked, but JS was way too inefficient for the amount of data involved).
For this visualization, the obvious interactive element to add would be letting the viewer specify the year range under consideration. The top 20 names displayed are the top 20 of all time (within the dataset); if you looked only at the last 20 years, or the 20 years after WWII, or any other slice, you'd get a much different list. The technical challenge there is dynamically filtering through 48 MB of data. The solution I was planning to implement but didn't get around to: pre-export a few different chunks of data (e.g. the full set, post-1900, post-1945, post-1990, and post-2010) and simply swap between them rather than filtering in real time.
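A minimal sketch of that pre-export step, reusing the hypothetical export_top_names helper from earlier (the cutoff years mirror the chunks listed above; the file naming is my own):

# pre-export one pair of JSON files per year range so the frontend can swap
# files on selection instead of filtering 48 MB client-side
cutoffs = {"all": 1880, "post-1900": 1900, "post-1945": 1945, "post-1990": 1990, "post-2010": 2010}

for label, start_year in cutoffs.items():
    df_chunk = df[df.year >= start_year]
    for sex, prefix in [("M", "male"), ("F", "female")]:
        export_top_names(df_chunk, sex, f"./{prefix}-top-{label}.json")

On the React side, changing the year range would then amount to fetching the matching pre-built file, sidestepping real-time filtering entirely.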
There are also plenty of follow-up questions to dive into: