A data analysis primer


Author: Bhanu Prakash Mallya

Date created: 20 Oct. 25

Introduction

A structured dataset is usually presented in tabular form, as rows and columns. This table can be interpreted in various ways, depending on the nature of the data and the kind of queries expected to be answerable upon further analysis of the dataset. For example:

a) A dataset can be thought of as the result of an experimental study. We measure any number of variables on the instances of the object under study; the measured results are called the values of the variables, and they may or may not be numeric. One observation consists of pairs of variables and values. This is equivalent to saying that an observation tells us the value the experiment recorded for each variable for a single instance of the object under study.

b) A dataset can be considered as describing various attributes (features) of the object under study.

There are some subtleties that are not immediately obvious across datasets.

a) While an experiment is usually performed on individual instances of the object under study, there may be a need to remove the identity of the individual instance.

b) The dataset could even be a further condensed version of the original data, where grouping has been used to eliminate any reference to individual members of the object under study. The dataset you then work with has the identified group as the object of study, and the corresponding measurements will already have been aggregated in the table. Survey responses are an example of this case: the survey must have been carried out on actual people, but the table provided may contain no information about the people surveyed. You therefore do not know how many people were sent the original survey, or how many responded, unless that is mentioned somewhere other than the main table.

Typical issues

For ease of data collection/entry, the dataset in its original form may be a condensed version that is difficult to use directly for analysis. It then becomes essential to identify what is actually being measured, and on what.

Typical issues in datasets that are designed more for ease of data collection than analysis:

- Column headers are values, not variable names.

Eg: Various numerical ranges as individual columns

- Multiple variables are stored in one column.

Eg: m1520 -> sex and age_range combined into a single column in the given dataset

- Variables are stored in both rows and columns.

Eg: tmin, tmax values in rows, dates of measurement in the columns

- Multiple types of observational units are stored in the same table.

Eg: If dealing with denormalized tables, it would be better to normalize the tables (split as per the various objects under study)

- A single observational unit is stored in multiple tables.

Eg: This can occur if the raw data was collected from different sources, each of which generated its own tables, and the provider of the dataset has not taken the time to merge these various files into a single table. (Note that sometimes the filename itself is data that needs to end up in the final merged table.)
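As an illustration of the last issue, here is a minimal pandas sketch (with invented filenames and data) that combines per-year tables into one, recovering the year from each filename:

```python
import pandas as pd

# Hypothetical example: the same observational unit (city temperatures)
# split across per-year files, with the year encoded only in the filename.
tables = {
    "temps_2020.csv": pd.DataFrame({"city": ["Oslo", "Pune"], "tmax": [22, 38]}),
    "temps_2021.csv": pd.DataFrame({"city": ["Oslo", "Pune"], "tmax": [24, 39]}),
}

# Recover the year from each filename and merge everything into one table.
merged = pd.concat(
    [df.assign(year=int(name.removeprefix("temps_").removesuffix(".csv")))
     for name, df in tables.items()],
    ignore_index=True,
)
```

In a real setting the dict would be built by reading the files (e.g. with pd.read_csv over a glob), but the key point is the same: the filename becomes a proper column before the tables are concatenated.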

Definitions of basic terminology used in this document

Data

Any collection of entities from which useful information can be gathered.

Examples

  1. Structured data: A table with rows and columns, where each column header identifies the content of the entries contained within that column. An individual row of this table contains entries related to each other. You need to figure out what that relationship is, as it may not be immediately obvious.

  2. Unstructured data: Any other set of entities from which useful information can be gathered, such as files spread across various filesystems, data that has not been saved in row-and-column format, or part of some incoming stream, etc.

  3. Metadata (data about data): Any data that gives more details about the entity in question. While the content of a media file is considered data, details like the creator, various dates, sharing permissions, etc. are generally considered metadata. If collected and tabulated, metadata becomes structured data; in the wild, it is usually available only in unstructured form.

Some examples of data:

a) A csv file containing multiple rows of data, with a column header to identify the structure of the dataset.

b) A set of images from a medical study

c) Any kind of data generated by the operation of a business

d) Time series data obtained from sensors / real time market data

e) Videos and other media files which are related to a particular study

f) Logs generated by software applications, typically on backend servers

g) Curated datasets available online for machine learning applications / contests

Dataset

A collection of data in various forms. A dataset, upon analysis, can be used to drive business decisions, make predictions, or infer qualities unique to the dataset that provide useful information for further study.

Experiment

An activity that is carried out to measure the values of attributes of the object under study. The outcome of an experiment should be a dataset.

Study

A planned set of experiments

Measurement

Using a device to physically measure a value, or collecting the relevant value from other sources

Observation

The process of looking at a dataset and noticing its structure.

Effect

The change in the value of a variable caused by running an experiment

Object under study

A study is usually carried out on a class / group / collection of related entities that can be precisely stated, the entire set being termed the object under study.

Instance

Any distinct member of the object under study

Variable

Any of a number of parameters which are measured during an experiment

Value

The specific value attached to a variable for a particular instance

Observing a dataset from the wild

You need to ask relevant questions of the given dataset such that:

a) A given query known in advance can be answered by transformation of the dataset and further analysis.

b) You may not know the queries you want answered when you come across a dataset, but by observation, you may be able to deduce that certain queries can be asked of the dataset upon further analysis.

c) You may need to figure out how to restructure the data so that relevant visualizations/plots can be generated using various tools.

The following may be helpful to start with.

A manual scan of the original dataset

The Experiment

1) What is the experiment/purpose of having collected this data? Does the owner of the dataset claim that the table provides all the information that is required for an understanding of the experiment in question? Or are there multiple tables being provided?

2) What is the experiment recording? Is it just one thing or multiple things?

3) Is it possible to immediately identify the object of study, the variables and corresponding values?

The column headers

1) Are there column headers which look like numerical values/ranges or are they all strings?

Note: Values in the column header could imply the existence of a variable that is not explicitly defined in the table structure. Do you have to infer this and give it a name yourself or is there a name provided in some documentation that accompanies the data? Is there some title/note/legend that describes the data and how to read it?

2) Are there column headers that share a common prefix?

Note: This could imply that the prefix denotes the variable that is actually being measured by the values in those columns. So, whatever comes after the prefix are actually values of the variable denoted by the prefix.
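As a sketch of how such prefixed headers can be handled in pandas (the table and names below are invented for illustration), the prefixed columns are melted into rows and the shared prefix stripped off, leaving just the values of the hidden variable:

```python
import pandas as pd

# Hypothetical table: the prefix "income_" hides a variable (income bracket)
# whose values are spread across the column headers.
df = pd.DataFrame({
    "religion": ["A", "B"],
    "income_<10k": [5, 7],
    "income_10-20k": [3, 9],
})

# Find the columns sharing the prefix and melt them into rows.
prefixed = [c for c in df.columns if c.startswith("income_")]
tidy = df.melt(id_vars="religion", value_vars=prefixed,
               var_name="income", value_name="count")

# Strip the prefix so the column holds only the values of the variable.
tidy["income"] = tidy["income"].str.replace("income_", "", regex=False)
```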

The object of the study

1) What is the object under study? What do instances of the object under study represent?

2) Is the object under study that you have identified present in the table as a column header?

3) What is being measured for an instance of the object being studied?

4) Does each row in the original table provide all the information about a single instance of the object under study? If not, look for repeated values in the column identifying the instance; there probably are some.
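A quick pandas check for the last question (on a made-up long-format table) is to test whether the identifying column contains duplicates:

```python
import pandas as pd

# Hypothetical long-format table: each patient appears in several rows.
df = pd.DataFrame({
    "patient_id": [1, 1, 2, 2],
    "variable": ["tmin", "tmax", "tmin", "tmax"],
    "value": [36.1, 37.2, 35.9, 38.0],
})

# If the identifying column has duplicates, a row is NOT a full instance.
one_row_per_instance = not df["patient_id"].duplicated().any()

# How many rows each instance occupies:
rows_per_instance = df["patient_id"].value_counts()
```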

The variables

1) What are the variables being measured and their values?

2) Where are the variables present in the table? Are they present as column headers, or as values in some of the columns?

Note: a) If the entries corresponding to one of the columns look like multiple values of some variable, does the header of that column accurately represent the variable in question?

b) If the column headers themselves are hiding a potential variable by using a common prefix to differentiate between what are really values of that variable, can the columns be combined into a single column? Refer to the note above in “The column headers”.

c) A column that represents a variable is generally referred to as a ‘colvar’

d) Sometimes, multiple colvars taken together may represent higher-level concepts. Eg: religion and income as colvars taken together represent some kind of demographic unit. This may be useful in further analysis of that particular table. It may also be possible to combine certain columns into a single one right away. Eg: some kind of initial value and an offset may be provided (like start_date, 1week_later, 2weeks_later, etc.)
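The start_date/offset case can be sketched in pandas (hypothetical data and column names): melt the offset columns, then combine the offset labels with the start date to compute actual measurement dates.

```python
import pandas as pd

# Hypothetical table: a start date plus measurements at fixed offsets.
# The offset column headers hide a time variable.
df = pd.DataFrame({
    "subject": ["s1", "s2"],
    "start_date": pd.to_datetime(["2024-01-01", "2024-02-01"]),
    "1week_later": [70.2, 81.5],
    "2weeks_later": [69.8, 80.9],
})

long = df.melt(id_vars=["subject", "start_date"],
               var_name="offset", value_name="weight")

# Turn offset labels like "2weeks_later" into actual measurement dates.
weeks = long["offset"].str.extract(r"(\d+)week")[0].astype(int)
long["measured_on"] = long["start_date"] + pd.to_timedelta(weeks * 7, unit="D")
```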

The values

1) What kind of values are present in the table? Numeric/Non-numeric (string, date, time, range)

2) Where are the values present in the table? Are they in the interior of the table or do some values appear in the column headers? Pick one of the values and verify that it appears to record a measurement of the variable the value belongs to for an instance of the object under study. Does additional information obtained by combining some of the columns add to the specific interpretation of this value?

3) Are there multiple columns that contain date/time information? Do they relate to the same instant of time or are they different instants?
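For the last question, a pandas sketch (invented data and date formats) of parsing two date-like columns and checking whether they refer to the same instant:

```python
import pandas as pd

# Hypothetical table with two date-like columns in different formats.
df = pd.DataFrame({
    "collected": ["2024-03-01", "2024-03-02"],
    "processed": ["01/03/2024", "05/03/2024"],
})

# Parse each column with its own explicit format, then compare.
df["collected"] = pd.to_datetime(df["collected"], format="%Y-%m-%d")
df["processed"] = pd.to_datetime(df["processed"], format="%d/%m/%Y")
same_instant = df["collected"].eq(df["processed"])
```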

Transforming the original dataset into a form suitable for analysis

Data cleaning may typically involve the following operations:

  • melting

Transpose some of the columns into the rows

  • string splitting

Separate out individual variables from concatenated column header names into independent columns

  • casting

Transpose the values of a particular column which contains variables as values into the columns (inverse of melting).

  • filtering

Creating a subset or removing observations based on some condition.

  • transforming variables

Adding or modifying variables. These modifications can involve either a single variable (e.g., log-transformation), or multiple variables (e.g., computing density from weight and volume).

  • aggregation

Collapsing multiple values into a single value (e.g., by summing or taking the mean).

  • sorting

Changing the order of observations.
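The operations above can be sketched in pandas on a small invented table (the column names m1520/f1520 follow the earlier sex/age_range example):

```python
import pandas as pd

# Hypothetical messy table: headers m1520/f1520 encode sex + age range.
messy = pd.DataFrame({
    "country": ["X", "Y"],
    "m1520": [3, 5],
    "f1520": [4, 6],
})

# melting: move the value-bearing headers into rows
long = messy.melt(id_vars="country", var_name="sexage", value_name="cases")

# string splitting: separate sex and age range out of the header
long["sex"] = long["sexage"].str[0]
long["age_range"] = long["sexage"].str[1:]
long = long.drop(columns="sexage")

# filtering: keep only rows with at least 4 cases
subset = long[long["cases"] >= 4]

# aggregation: total cases per country
totals = long.groupby("country", as_index=False)["cases"].sum()

# sorting: order observations by case count
ordered = long.sort_values("cases", ascending=False)

# casting (inverse of melting): spread sex back out into columns
wide = long.pivot_table(index=["country", "age_range"],
                        columns="sex", values="cases").reset_index()
```

transforming variables is not shown here; it would simply be an assignment such as a log-transformed copy of the cases column.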

Note:

  1. A more complicated situation occurs when the dataset structure changes over time. For example, the datasets may contain different variables, the same variables with different names, different file formats, or different conventions for missing values. This may require you to clean up each file individually and then combine them. An example of this type of tidying is illustrated in https://github.com/hadley/data-fuel-economy, which shows the cleaning of EPA fuel economy data for over 50,000 cars from 1978 to 2008. The raw data is available online, but each year is stored in a separate file and there are four major formats with many minor variations, making cleaning this dataset a considerable challenge.

  2. Some other typical tasks involved in cleaning data are: parsing dates and numbers, identifying missing values, correcting character encodings (for international data), matching similar but not identical values (created by typos) and filling in structural missing values, not to mention model-based data cleaning that identifies suspicious values.
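A small pandas sketch (invented values) of two of these tasks: parsing numbers and identifying missing values, including a sentinel like -99:

```python
import pandas as pd

# Hypothetical raw column mixing numbers, text sentinels, and blanks.
raw = pd.Series(["12", "n/a", "-99", "7.5", ""])

# Parse to numeric; anything unparseable becomes NaN.
nums = pd.to_numeric(raw, errors="coerce")

# Treat the numeric sentinel -99 as missing as well.
nums = nums.mask(nums.eq(-99))
```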

A note on melting

This involves transposing some of the column headers into the rows, as values of a variable. This variable must be identified initially as the variable that is actually being measured by the experiment. You can refer to this as the extracted variable, and note that it represents a concept that is common across all the columns that used the extracted variable as a header (typically as a prefix on the column header).

This process typically increases the number of rows in the table, as it needs to duplicate values across one or more columns. Thus, melting increases the number of observations, but maintains the idea that the table is a record of measurements of variables (some now representing extracted concepts) on instances of the object under study.

PS: This also implies that we cannot take a melted table and assume that each row corresponds to one instance of the object under study. Due to the repeated values in the columns, it will be necessary to use aggregation operations to extract all the information about a particular instance.
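A minimal pandas sketch (invented data) showing both points: melting doubles the row count here, and aggregation is then needed to gather everything about one instance:

```python
import pandas as pd

# Hypothetical wide table: one row per station (the object under study).
wide = pd.DataFrame({
    "station": ["A", "B"],
    "tmin": [2, 5],
    "tmax": [11, 16],
})

# Melting duplicates the station values, so rows no longer map 1:1
# to instances of the object under study.
long = wide.melt(id_vars="station", var_name="element", value_name="temp")

# Recovering everything about one instance now requires grouping.
per_station = long.groupby("station")["temp"].agg(["min", "max"])
```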

References:

  1. Wickham, Hadley. 2014. “Tidy Data.” Journal of Statistical Software 59 (10). https://doi.org/10.18637/jss.v059.i10.