Data visualizations are the flashy result of data analysts’ work, and there is no shortage of beautiful and engaging examples these days. Still, unless an audience member is a data analyst him or herself, it’s difficult to see all the work that goes into creating data masterpieces. Before data can become a visualization, there is an enormous amount of effort and expertise that must be applied to prepare it. It’s much like the metaphorical iceberg; what is visible is but a small portion of the whole effort. The work underneath – the mining, organizing, cleaning, and refining – is what supports the data visualization or dashboard.
Why Data Preparation is Important (and Challenging)
Data preparation serves one ultimate purpose: to ensure that your data meets the needs of your plans for that data. Data preparation helps data analysts prep data in such a way that by the time it reaches the exploration process, it can provide clear and relevant insights for more intelligent decision making.
Businesses today have the opportunity to explore data from dozens of sources (including the web, smartphones, corporate databases, sensors, and more) for discoveries and insights. The visualization itself is only the tip of the iceberg. Everything that happens beneath it (the mining, organizing, cleaning, and refining) is the toil that allows that tip to exist.
According to a report written by Steve Lohr of The New York Times, data scientists spend 50-80% of their time prepping data. Broken down by task, analysts spend:
- 60% organizing and cleaning data;
- 19% collecting datasets;
- 9% mining the data to draw patterns;
- 3% building training sets;
- 4% refining algorithms; and
- 5% on other tasks.
While the formal term for these processes is typically “data preparation,” data scientists have their own monikers for their work: “data munging,” “data wrangling,” even “data janitor work.” Even though data science was declared the “sexiest job of the 21st Century,” it’s clear that the folks actually doing the work understand that it can be challenging, a headache, and downright dirty.
Though it’s impossible to take all of the dirty work out of data prep, you can remove a good chunk of it with the right data aggregation platform and the right information.
Demystifying Data Preparation
While the practice of data preparation is quite technical, the purpose is fairly simple: make the data useful. To start, let’s define what useful data is. Useful data is first and foremost data that can be accessed. It must be trustworthy. It must have context and clear relationships. It should be describable, with the labeling needed to understand it in the right context. Finally, useful data must be in a format that is easily used for future analysis, which means all inconsistencies in the data must be resolved.
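To make "resolving inconsistencies" concrete, here is a minimal sketch in plain Python. The records, the column names, and the list of accepted date formats are all made up for illustration; real sources will need their own rules.

```python
from datetime import datetime

# Hypothetical raw records: the same kind of data, formatted inconsistently.
raw_rows = [
    {"department": " Sales ", "date": "2023-09-01", "revenue": "1,200.50"},
    {"department": "sales", "date": "09/02/2023", "revenue": "980"},
    {"department": "MARKETING", "date": "2023-09-03", "revenue": "$450.00"},
]

# Assumption: these are the only date formats that appear in the sources.
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y")


def parse_date(text):
    """Try each known format and return a date, or raise if none match."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date()
        except ValueError:
            continue
    raise ValueError(f"Unrecognized date format: {text!r}")


def normalize(row):
    """Return a copy of the row with consistent labels, dates, and numbers."""
    return {
        "department": row["department"].strip().title(),
        "date": parse_date(row["date"]),
        "revenue": float(row["revenue"].replace("$", "").replace(",", "")),
    }


clean_rows = [normalize(row) for row in raw_rows]
```

After this step every row uses the same department spelling, a real date value, and a numeric revenue, so later steps do not have to guess at formats.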
Create a Data Lake
Part of the headache of wrangling data is just that: the wrangling. Batch data preparation only works when all of your data is in one place. iDashboards provides users with an in-product central data repository, known as the Data Hub. The Data Hub is where users can upload, sync, and blend data from various sources, including Excel, data warehouses, databases, and cloud applications. Uploading is easy and requires no coding or programming: simply drag and drop from connected applications and let the Hub do the painstaking work for you. Once everything is uploaded, you can begin performing your transformations.
Make it Flow
Once all of your data is in one place, you can begin to manipulate and prepare it for its eventual use in a dashboard, data visualization, webpage, or report. Creating a workflow that runs automatically is critical to the efficiency of data preparation.
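For readers who like to see the idea in code, a workflow can be thought of as an ordered list of steps that every new batch of data passes through. The sketch below is plain Python, not the iDashboards Data Hub itself; the step names and sample rows are hypothetical.

```python
def run_workflow(rows, steps):
    """Apply each step in order; every step takes rows and returns new rows."""
    for step in steps:
        rows = step(rows)
    return rows


def drop_blank_rows(rows):
    """Remove rows in which every value is empty."""
    return [
        row for row in rows
        if any(value not in ("", None) for value in row.values())
    ]


def trim_text(rows):
    """Strip stray whitespace from every text value."""
    return [
        {key: value.strip() if isinstance(value, str) else value
         for key, value in row.items()}
        for row in rows
    ]


# The workflow is defined once and reused for every new batch.
WORKFLOW = [drop_blank_rows, trim_text]

new_batch = [{"name": " Ada ", "team": "Ops"}, {"name": "", "team": ""}]
prepared = run_workflow(new_batch, WORKFLOW)
```

Because the steps live in one list, adding or reordering a transformation means editing that list rather than redoing the work by hand.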
Introductions to Your Transformations
Data transformation is probably the most involved portion of the ETL (i.e., Extract, Transform, Load) process. In this stage, data that was extracted from your various data sources is prepped for loading into the end target, which is your visual. Some data does not require any transformation at all, in which case a “direct move” or “pass through” is all that is necessary to complete the process. Most data, however, requires cleaning so that only relevant data makes its way to the target.
Transformation is challenging when you’re dealing with one data source, and even more so when you’re dealing with several. iDashboards takes the headache out of the transformation process by interfacing and communicating with relevant systems on your behalf to do some of the following (and many more):
- Select: Before any data transformation can take place, the columns of data that are needed to achieve the final result must be identified. Often the data needed for a particular job is contained in a dataset with a lot of unnecessary data. Choose only the columns of data needed, much like only the ingredients needed to make a dinner recipe are removed from the fridge. For many jobs, this must be done with more than one dataset. Once all the columns of data have been selected, the process of transforming the data can truly begin.
- Join: Once the needed columns have been selected from each dataset, a join combines two datasets into one by matching rows on a shared column, such as a department name or an ID. This is how data that lives in separate sources ends up side by side in a single dataset.
- Filter: At any stage in the data preparation process, there may be particular rows of data that need to be removed. For example, the job may be to make a dataset for only a specific date range or department. In this case, the columns with the date and department would be identified, and the Filter transformation would then be used to remove any rows that do not meet the conditions. A special type of filter, called (unsurprisingly) “de-duplicate,” can be used to remove any duplicate rows.
- Aggregate: Once your Data Hub contains only relevant, non-duplicate data, it’s time to aggregate it. iDashboards can provide summaries of each column, row, or combination of columns and rows.
- Date Difference: The Date Difference function allows you to view the difference across multiple datasets between a start and an end date, by far one of the most essential functions of data analysis. After all, if you cannot see how you’re progressing in various departments between, say, September of last year and now, how can you know whether you’re making sound business decisions?
- Validate: Part of data prepping involves removing bad data. You can do this by applying any form of validation. Any failed validation may result in full rejection of the data, partial rejection, or no rejection at all.
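The transformations above can be sketched in a few lines of plain Python. This is an illustration of the general techniques, not the iDashboards implementation: the function names, the sample datasets, and the validation rule are all hypothetical. The date difference is read here as the number of days between two date columns, and validation uses a simple "partial rejection" policy. Both are assumptions.

```python
from collections import defaultdict
from datetime import date

# Hypothetical datasets standing in for two connected sources.
orders = [
    {"order_id": 1, "dept_id": 10, "amount": 250.0,
     "opened": date(2023, 9, 1), "closed": date(2023, 9, 4), "note": "rush"},
    {"order_id": 2, "dept_id": 20, "amount": 100.0,
     "opened": date(2023, 9, 2), "closed": date(2023, 9, 3), "note": ""},
    {"order_id": 2, "dept_id": 20, "amount": 100.0,  # exact duplicate row
     "opened": date(2023, 9, 2), "closed": date(2023, 9, 3), "note": ""},
    {"order_id": 3, "dept_id": 10, "amount": 50.0,
     "opened": date(2023, 8, 15), "closed": date(2023, 8, 20), "note": ""},
]
departments = [{"dept_id": 10, "name": "Sales"}, {"dept_id": 20, "name": "Support"}]


def select(rows, columns):
    """Keep only the named columns."""
    return [{col: row[col] for col in columns} for row in rows]


def join(left, right, key):
    """Inner join; assumes `key` is unique in `right`."""
    lookup = {row[key]: row for row in right}
    return [{**row, **lookup[row[key]]} for row in left if row[key] in lookup]


def filter_rows(rows, predicate):
    """Keep only the rows for which the predicate is true."""
    return [row for row in rows if predicate(row)]


def deduplicate(rows):
    """Drop rows that are exact copies of an earlier row."""
    seen, unique = set(), []
    for row in rows:
        fingerprint = tuple(sorted(row.items()))
        if fingerprint not in seen:
            seen.add(fingerprint)
            unique.append(row)
    return unique


def add_date_difference(rows, start, end, name):
    """Add a column holding the number of days between two date columns."""
    return [{**row, name: (row[end] - row[start]).days} for row in rows]


def validate(rows, rule):
    """Partial rejection: split rows into those that pass and those that fail."""
    accepted = [row for row in rows if rule(row)]
    rejected = [row for row in rows if not rule(row)]
    return accepted, rejected


def aggregate_sum(rows, group_by, value):
    """Total one column for each distinct value of another."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[group_by]] += row[value]
    return dict(totals)


selected = select(orders, ["order_id", "dept_id", "amount", "opened", "closed"])
joined = join(selected, departments, "dept_id")
september = filter_rows(
    joined, lambda r: date(2023, 9, 1) <= r["opened"] <= date(2023, 9, 30)
)
unique = deduplicate(september)
with_days = add_date_difference(unique, "opened", "closed", "days_open")
accepted, rejected = validate(
    with_days, lambda r: r["amount"] > 0 and r["days_open"] >= 0
)
totals = aggregate_sum(accepted, "name", "amount")
```

Each function returns a new list instead of changing its input, so any stage can be inspected or rerun without disturbing the data that fed it.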
Make Sure It Makes Sense
Testing your workflow is critical to your final representation, as any new set of data dropped into the interface has the potential to change everything. You want to make sure you’ve filtered out the correct metrics and are getting the results that you expect to get. Test-run your workflow at every stage to see how new data affects the other data. If an error occurs, don’t worry: the original data source remains unchanged, so you can always go back and make adjustments.
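In code, this kind of check might look like the sketch below: run a step, compare the result with what you expect, and confirm the source was left alone. The step, the sample data, and the expected numbers are hypothetical.

```python
import copy


def keep_department(rows, department):
    """Hypothetical workflow step: keep only rows for one department."""
    return [row for row in rows if row["department"] == department]


def total(rows, column):
    """Sum one numeric column."""
    return sum(row[column] for row in rows)


source = [
    {"department": "Sales", "revenue": 1200.0},
    {"department": "Sales", "revenue": 800.0},
    {"department": "Support", "revenue": 300.0},
]
snapshot = copy.deepcopy(source)  # remember what the source looked like

sales = keep_department(source, "Sales")

# Check the stage against what we expect to see.
assert len(sales) == 2, "filter kept the wrong number of rows"
assert total(sales, "revenue") == 2000.0, "unexpected Sales total"

# The step builds a new list, so the source should be untouched.
assert source == snapshot, "workflow modified the original data"
```

If any assertion fails, the message points to the stage that needs adjusting, and the untouched source can simply be run through the workflow again.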
Finally, take a good look at your hard work via an engaging and insightful visualization. Use color, lines, charts, graphs, and other design elements to tell a data story your audience will not only understand but also be invested in. Ultimately, your data story should inspire thoughts and actions and drive informed decision making throughout your organization.
To get to the fun part of data analysis, the reporting and visualizing, you need to put in the legwork. Data preparation can be overwhelming, time-consuming, and all-around frustrating, but with the right tools, it can be a little less painstaking and a lot more fun (we promise!).
The amount of data we’re given is not decreasing either. If anything, it is multiplying at a rapid rate, and whereas the term “less is more” applies to many situations, it does not apply to data analysis. The more information we have, the better, as that means more accurate insights and better decision-making. The trick is to develop a plan to manage, prepare, and analyze that data effectively.