One of the most common questions that newcomers to data [science/visualization/analysis] ask is: "What tools should I use to create data visualizations?" While I always recommend learning design principles before tools, I thought I'd take a stab at answering that question by analyzing what tools the /r/DataIsBeautiful community uses.
For the uninitiated, /r/DataIsBeautiful is an online community dedicated to data analysis and visualization, where people post and discuss various data visualizations from around the web. Sometimes /r/DataIsBeautiful community members create and share their own data visualizations -- called "OC," or Original Content -- which I have always found to be a great source of ideas and inspiration.
As part of the /r/DataIsBeautiful posting rules, every OC contributor must include a comment on their post describing the data source(s) and tool(s) they used to create their data visualization. Thus, analyzing their tool usage over the years came down to a fairly simple n-gram analysis of all comments made by OC contributors on /r/DataIsBeautiful that mention the word "tool." For this article, I analyzed thousands of comments made by OC contributors to /r/DataIsBeautiful between January 2014 and January 2016.
(Unfortunately, it was difficult to parse out mentions of the "R" language with the n-gram analysis, so we'll have to use ggplot2 as a proxy.)
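For readers who want a feel for what such an n-gram count looks like, here is a minimal Python sketch. It is not the exact script used for this article: the tokenizer, the function names (`ngrams`, `count_ngrams`), and the choice to count each n-gram at most once per comment are all assumptions made for illustration.

```python
import re
from collections import Counter


def ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def count_ngrams(comments, max_n=2):
    """Count in how many comments each 1..max_n-gram appears."""
    counts = Counter()
    for body in comments:
        # Keep letters, digits, and dots so names like "d3.js" survive,
        # then strip stray leading/trailing dots left by sentence endings.
        raw = re.findall(r"[a-z0-9.]+", body.lower())
        tokens = [t.strip(".") for t in raw]
        tokens = [t for t in tokens if t]
        # Use a set so a tool repeated within one comment counts once.
        seen = set()
        for n in range(1, max_n + 1):
            seen.update(ngrams(tokens, n))
        counts.update(seen)
    return counts


# Hypothetical comments, for illustration only.
sample = [
    "Data source: census. Tool: Excel",
    "Tools: Python and matplotlib",
    "Posted on /r/DataIsBeautiful",
]
counts = count_ngrams(sample)
```

Note that the last sample comment produces the token `r` even though it never mentions the R language, which illustrates why single-letter names are hard to pick out with this kind of analysis.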
The most popular tools on /r/DataIsBeautiful are:
Tool | Free? | Requires programming? | Typical uses |
Excel | Paid | No | Basic data analysis and visualization |
Python | Free | Yes | General-purpose scripting language that is typically used for data scraping, cleaning, and wrangling |
D3.js | Free | Yes | JavaScript-based library for interactive data visualization on the web |
Tableau | Paid, with limited free option | No | Advanced interactive data visualizations for the web |
ggplot2 | Free | Yes | Advanced data visualization library for the R scripting language |
R | Free | Yes | Scripting language designed for statistical analysis, modeling, and data visualization |
matplotlib | Free | Yes | Python-based visualization library for making basic data visualizations |
As expected, Excel dominates the list as the primary tool that most beginners use: In this case, there have been at least 643 OC data visualizations on /r/DataIsBeautiful that were made with Excel. Excel is a great tool to start with, but you should eventually move on to more advanced tools, such as matplotlib/Seaborn, D3.js, or ggplot2, that allow you to generate visualizations programmatically. If programming isn't your forte, Tableau is a much better option than Excel.
Here are descriptions of the rest of the tools:
Tool | Free? | Requires programming? | Typical uses |
JavaScript | Free | Yes | Scripting language for the web |
Highcharts | Free for non-commercial projects | Yes | JavaScript-based library for programmatically creating interactive data visualizations for the web; easier to use but less flexible than D3.js |
Datawrapper | Free | No | Basic online interactive visualizations |
Gephi | Free | No | Network visualization |
Plotly | Free | No | Web-based GUI for creating interactive data visualizations |
CartoDB | Free (limited) | No | Web-based tool for creating interactive online maps |
Seaborn | Free | Yes | Python-based visualization library for advanced statistical data visualization |
Matlab | Paid | Yes | Powerful analysis, modeling, and data visualization tool |
Google Charts | Free | Yes | Simple JavaScript-based visualization library for creating interactive online visualizations |
Leaflet.js | Free | Yes | Simple JavaScript-based visualization library for creating interactive online map visualizations |
LaTeX | Free | Yes | Document preparation system that is somehow used to create visualizations (???) |
Google Fusion | Free | No | Web-based tool for creating interactive online data and map visualizations |
Bokeh | Free | Yes | Python-based visualization library for creating interactive data and map visualizations |
I was also curious about temporal trends in library usage, so I grouped the tool mentions by year and plotted them below.
GUI-based visualization tools such as Tableau and Gephi are seeing steady growth, whereas Python and matplotlib (oddly) seem to be waning in relative popularity. D3.js and ggplot2 are similarly experiencing steady growth, although I should note that 2016's counts are only based on January 2016's comments and may change by the end of the year. We'll have to revisit these trends come 2017.
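The per-year grouping behind these trends can be sketched in a few lines of Python. This is an illustrative sketch rather than the original analysis code: it assumes each row is a `(body, created_utc)` pair with `created_utc` given as a Unix timestamp in seconds, it uses naive substring matching, and the `TOOLS` list and function names are hypothetical.

```python
from collections import Counter, defaultdict
from datetime import datetime, timezone

# Hypothetical subset of tool names to look for (lowercase).
TOOLS = ["excel", "python", "d3.js", "tableau", "ggplot2", "matplotlib"]


def mentions_by_year(rows, tools=TOOLS):
    """Count, per calendar year, how many comments mention each tool."""
    by_year = defaultdict(Counter)
    for body, created_utc in rows:
        # Assumption: created_utc is seconds since the Unix epoch.
        year = datetime.fromtimestamp(int(created_utc), tz=timezone.utc).year
        text = body.lower()
        for tool in tools:
            # Naive substring match; a tool counts once per comment.
            if tool in text:
                by_year[year][tool] += 1
    return by_year


def relative_share(by_year):
    """Convert raw counts to each tool's share of that year's mentions."""
    shares = {}
    for year, counts in by_year.items():
        total = sum(counts.values())
        shares[year] = {tool: n / total for tool, n in counts.items()}
    return shares
```

Comparing shares rather than raw counts matters here because the community grew over time and the 2016 data covers a single month, so raw counts alone would not be comparable across years.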
Hopefully that answers all of your data visualization tool-related questions! If you have any more questions or concerns, please leave them in the comments.
If you'd like to repeat this analysis yourself, run the following SQL statement against the Reddit comments tables on Google BigQuery.
-- BigQuery legacy SQL: a comma-separated list of tables is treated as a union.
SELECT body, created_utc
FROM
  [fh-bigquery:reddit_comments.2016_01],
  [fh-bigquery:reddit_comments.2015_12],
  [fh-bigquery:reddit_comments.2015_11],
  [fh-bigquery:reddit_comments.2015_10],
  [fh-bigquery:reddit_comments.2015_09],
  [fh-bigquery:reddit_comments.2015_08],
  [fh-bigquery:reddit_comments.2015_07],
  [fh-bigquery:reddit_comments.2015_06],
  [fh-bigquery:reddit_comments.2015_05],
  [fh-bigquery:reddit_comments.2015_04],
  [fh-bigquery:reddit_comments.2015_03],
  [fh-bigquery:reddit_comments.2015_02],
  [fh-bigquery:reddit_comments.2015_01],
  [fh-bigquery:reddit_comments.2014]
WHERE LOWER(body) LIKE "%tool%"
  AND subreddit == "dataisbeautiful"
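If you want to extend the date range, typing out every monthly table gets tedious. The following Python sketch builds the same query text from a range of months. The helper names (`monthly_tables`, `build_query`) are hypothetical, and the sketch assumes the table naming pattern shown in the query (one table for all of 2014, then one table per month); it only assembles a string and does not talk to BigQuery.

```python
def monthly_tables(start_year, start_month, end_year, end_month):
    """List monthly table names from the start month through the end month."""
    tables = []
    year, month = start_year, start_month
    while (year, month) <= (end_year, end_month):
        tables.append(f"[fh-bigquery:reddit_comments.{year}_{month:02d}]")
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return tables


def build_query(tables):
    """Assemble the legacy SQL statement over the given tables."""
    return (
        "SELECT body, created_utc FROM "
        + ", ".join(tables)
        + ' WHERE LOWER(body) LIKE "%tool%"'
        + ' AND subreddit == "dataisbeautiful"'
    )


# 2014 is a single yearly table; 2015 onward is split by month.
tables = ["[fh-bigquery:reddit_comments.2014]"] + monthly_tables(2015, 1, 2016, 1)
query = build_query(tables)
```

The resulting `query` string can be pasted into the BigQuery console with legacy SQL enabled.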