What Makes Data Science Work?
Anatomy of a Data Science Project
The following stages are organised in the sequence typical of an end-to-end data science project, though the boundaries between stages are often fuzzy, and the process is frequently iterative.
ETL, Data Integration & Wrangling
This is how we get data into an analytic platform. ETL (extract, transform, load) is the traditional process of moving data from databases, websites, files, and feeds into a local database or data warehouse, in a highly structured format. When the investment in that approach is unwarranted (or we need to make progress fast), it may be preferable to use the analytic platform itself to integrate data from a data lake, scrape websites, exploit APIs or database connectors, or use specialised tools.
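As a minimal sketch of the extract-transform-load pattern (using hypothetical order data and an in-memory SQLite database as the "warehouse"):

```python
import io
import sqlite3
import pandas as pd

# Extract: hypothetical raw feed, as it might arrive from a file or API.
raw = io.StringIO(
    "order_id,amount,currency\n"
    "1001,19.99,USD\n"
    "1002,5.50,USD\n"
)
orders = pd.read_csv(raw)

# Transform: enforce types and derive a column before loading.
orders["amount"] = orders["amount"].astype(float)
orders["amount_cents"] = (orders["amount"] * 100).round().astype(int)

# Load: write the structured result into a local database.
conn = sqlite3.connect(":memory:")
orders.to_sql("orders", conn, index=False)
loaded = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
```

In practice the "extract" step would point at a real database connector, API, or data lake rather than an inline string, but the shape of the pipeline is the same.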
Wrangling data is essentially lightweight data integration — getting what we need into a place where we can manage and analyse it without getting distracted by data engineering.
Clean datasets are like flying saucers: we’ve all heard of them, but… Even data warehouses can have issues that need to be dealt with before getting into the analysis. Incorrectly formatted dates, times, addresses, phone numbers, e-mail and website addresses, missing values, wrong data types, invalid entries, strange characters … the list goes on.
Many companies specialise in cleaning just one kind of data! It takes skill and experience to figure out how to clean a dataset and just how clean it needs to be before it can start delivering information.
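A small illustration of the kind of cleaning involved, using a hypothetical column of signup dates that mixes valid entries, an impossible date, a blank, and free text:

```python
import pandas as pd

# Hypothetical messy column: the kinds of issues listed above.
df = pd.DataFrame({"signup": ["2023-01-15", "2023-02-30", "", "not a date"]})

# errors="coerce" turns anything unparseable (including the
# impossible 30th of February) into NaT, i.e. missing, which we
# can then count and decide how to handle.
df["signup"] = pd.to_datetime(df["signup"], errors="coerce")
n_missing = df["signup"].isna().sum()
```

Deciding what to do with those three coerced values — drop the rows, impute, or chase down the source — is exactly the judgment call that takes skill and experience.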
Data Munging & Conforming
Even a clean dataset may not be ready for analysis. Tabular data can be organised in strange ways, and spreadsheets can feature multiple schemas in the same table. A dataset usually needs to be organised so that each row is an observation and each column is a unique dimension or measure — roughly what database designers call Codd's third normal form (see Hadley Wickham's paper on Tidy Data!)
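For example, a spreadsheet-style table often has one column per year, mixing a variable (the year) into the column headers. Reshaping it so each row is one observation is a one-liner with pandas (the data here is hypothetical):

```python
import pandas as pd

# Hypothetical "wide" table: the year variable is hiding in the headers.
wide = pd.DataFrame({
    "product": ["A", "B"],
    "2022": [100, 150],
    "2023": [120, 160],
})

# melt() reshapes it into tidy form: one row per observation,
# one column per variable (product, year, sales).
tidy = wide.melt(id_vars="product", var_name="year", value_name="sales")
```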
Datasets often come as a set of tables with logical connections between one or more columns or rows of two or more tables. These connections can be simple lookups (e.g. product codes that can be expanded to product names) or may represent hierarchies (parent-child relationships or categories) that carry additional meaning or enable analyses to be conducted at different levels or grains.
Data modelling is the process of interpreting the meaning of the columns and forming the connections between the various tables so that all the data can be analysed as a single, composite entity.
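A minimal sketch of such a connection, joining a hypothetical fact table of sales to a product lookup table and then analysing at the coarser category grain:

```python
import pandas as pd

# Hypothetical fact table with product codes, plus a lookup table
# that expands codes to names and categories.
sales = pd.DataFrame({"code": ["P1", "P2", "P1"], "units": [3, 5, 2]})
products = pd.DataFrame({
    "code": ["P1", "P2"],
    "name": ["Widget", "Gadget"],
    "category": ["Hardware", "Hardware"],
})

# A left join realises the logical connection between the tables,
# so the combined data can be analysed as a single entity.
combined = sales.merge(products, on="code", how="left")

# The category hierarchy lets us aggregate at a different grain.
by_category = combined.groupby("category")["units"].sum()
```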
Exploratory Data Analysis
This often under-appreciated process is where the basic statistical characteristics of the data are revealed, such as range, distribution, and quality (e.g. the frequency of missing, ambiguous, or impossible values). Early, basic insights are often gained from this stage, and a feel for the meaning contained within the data is acquired.
The insights gained in this stage will enrich the later stages of the project.
Statistical modelling begins in the data-exploration stage and extends into the predictive-analytics stage, so it is not really an analytic stage in its own right; still, important insights can sometimes be achieved by investing more time here. In particular, data that exhibit non-normal distributions may require special transformation before machine learning is applied, or may dictate the use of a specialised algorithm.
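As an illustration of that last point (using synthetic log-normally distributed data, which is common for quantities like transaction amounts and is strongly right-skewed):

```python
import numpy as np
import pandas as pd

# Synthetic skewed measurement: log-normal data is strongly right-skewed.
rng = np.random.default_rng(0)
amounts = pd.Series(rng.lognormal(mean=3.0, sigma=1.0, size=1000))

# Basic statistical characteristics revealed by EDA.
summary = amounts.describe()
skew_before = amounts.skew()

# A log transform often makes such data far more symmetric
# before it reaches a machine learning algorithm.
skew_after = np.log(amounts).skew()
```

Spotting this kind of shape during exploration — rather than after a model underperforms — is one of the payoffs of the stage.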
Predictive Analytics & Machine Learning
It is in this phase that the ‘sexiness’ of data science manifests itself. Here, the magic of high-dimensional model-fitting can turn a jumble of low-quality data layers into a surprisingly accurate predictive tool. Patterns too deeply buried or too chaotically concealed for the human mind to detect may suddenly emerge from the statistical magic of machine learning.
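A toy sketch of that "buried pattern" effect, using synthetic data: only two of ten features matter, and they matter only through an XOR-like interaction that no single feature reveals on its own, yet a random forest recovers it.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic data: a hidden nonlinear (XOR-like) interaction between
# two features, buried among eight pure-noise features.
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 10))
y = ((X[:, 0] > 0) ^ (X[:, 1] > 0)).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# High-dimensional model fitting: the forest discovers the interaction.
model = RandomForestClassifier(n_estimators=200, random_state=0)
model.fit(X_train, y_train)
accuracy = model.score(X_test, y_test)
```

No linear view of any single column would expose the pattern; the held-out accuracy shows the model found it anyway.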
Visualisation
Visualisation is the ROI point for most projects. Once the data has been made to tell its story using compelling — but uncomplicated — graphic aids, the fog of confusion lifts and the sponsors find themselves able to make decisions with confidence. (See, for example, Edward Tufte’s six fundamental principles of analytical design.)
Data Products
Data products are the tangible end results of a data science project. They should be declared at the very beginning (although it may not be possible to specify them completely until the later stages).
A data product could be a decision-support tool (e.g. a dashboard), a predictive tool (e.g. a trained model implemented as a production system), a visualisation (e.g. an infographic that documents a discovery process or explains the workings of a complex system), a social networking engine, or a web- or mobile-based app that seems “smart”.
A data product is really anything that provides powerful functionality while concealing the magnitude and complexity of the underlying data from the user.
We can also provide data warehouse design, cloud platform setup, user interface development, and all the nuts and bolts that take a prototype to an operational system.
Through our consulting partners, we can provide stand-alone, web, and mobile solutions to implement data products.
If your needs are really cutting-edge, we can even help with algorithm development.