The «Exploratory Data Analysis» tool presents the following information as charts in each of
the sections below:
Dataset overview
This section displays brief summary statistics for your training dataset: the problem type,
the dataset dimensions and the number of missing values recorded.
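As a rough illustration of the figures listed above, the dimensions and the missing value count
can be reproduced with pandas. This is a minimal sketch, not the tool's own code: the function
name dataset_overview and the toy data are hypothetical, and the problem type is left out
because the text does not say how the tool determines it.

```python
import numpy as np
import pandas as pd


def dataset_overview(df: pd.DataFrame) -> dict:
    """Return the dataset dimensions and the total number of missing values."""
    n_rows, n_cols = df.shape
    return {
        "rows": n_rows,
        "columns": n_cols,
        # isna() flags NaN and None; summing twice counts every missing cell.
        "missing_values": int(df.isna().sum().sum()),
    }


# Hypothetical toy data, for illustration only.
df = pd.DataFrame({"age": [25, np.nan, 40], "city": ["A", "B", None]})
overview = dataset_overview(df)
```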
Continuous data distribution and relation to the target variable
Visualization of each continuous variable yields two plots:
Variable density distribution chart
A density plot visualizing the distribution of the variable across all rows in the dataset.
This chart is a variation of a histogram that uses kernel smoothing to plot values, which
smooths out noise and produces a smoother distribution. The peaks of a density plot show
where values are concentrated over the interval.
Feature relation to the target variable (different for regression and classification task
types)
This chart is presented in one of two formats, depending on task type: a line chart showing
how the continuous variable changes as the continuous target variable changes (regression
task type) or a histogram showing the mean value of the continuous variable for each class
of the target variable (classification task type).
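The two charts for a continuous variable can be approximated as follows. This is a sketch under
stated assumptions: the kernel is Gaussian and the bandwidth uses the normal reference rule of
thumb, neither of which is confirmed as the tool's choice, and the column names feature and
target_class are hypothetical. Only the classification variant of the second chart is shown.

```python
import numpy as np
import pandas as pd


def gaussian_kde_curve(values, grid, bandwidth=None):
    """Evaluate a Gaussian kernel density estimate of `values` on `grid`."""
    x = np.asarray(values, dtype=float)
    x = x[~np.isnan(x)]  # ignore missing values
    n = x.size
    if bandwidth is None:
        # Normal reference rule of thumb (an assumption, not the tool's setting).
        bandwidth = 1.06 * x.std(ddof=1) * n ** (-1 / 5)
    # One Gaussian bump per observation, averaged over all observations.
    z = (np.asarray(grid, dtype=float)[:, None] - x[None, :]) / bandwidth
    return np.exp(-0.5 * z**2).sum(axis=1) / (n * bandwidth * np.sqrt(2 * np.pi))


# Hypothetical data: one continuous feature and a two-class target.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "feature": np.concatenate([rng.normal(0, 1, 200), rng.normal(5, 1, 200)]),
    "target_class": ["a"] * 200 + ["b"] * 200,
})

# Chart 1: density curve of the feature over an evenly spaced grid.
grid = np.linspace(-5, 10, 301)
density = gaussian_kde_curve(df["feature"], grid)

# Chart 2 (classification): mean feature value for each target class.
class_means = df.groupby("target_class")["feature"].mean()
```

Plotting density against grid gives the smoothed curve; its two peaks sit near the centres of
the two groups of values.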
Discrete data distribution and relation to the target variable
Visualization of each categorical variable yields two plots:
Histogram displaying feature categories count
Feature categories relation to the target variable (different for regression and
classification task types)
This chart is presented in one of two formats, depending on task type: a histogram
displaying the mean target variable value for each feature category (regression task
type) or a histogram displaying the count of each target class within each feature
category (classification task type).
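The quantities behind these charts can be sketched with pandas as shown here. The column names
color, price and sold are hypothetical, and the snippet computes only the numbers that would be
plotted, not the charts themselves.

```python
import pandas as pd

# Hypothetical toy data: one categorical feature and two possible targets.
df = pd.DataFrame({
    "color": ["red", "blue", "red", "green", "blue", "red"],
    "price": [10.0, 20.0, 14.0, 30.0, 22.0, 12.0],    # continuous target
    "sold": ["yes", "no", "yes", "no", "yes", "no"],  # class target
})

# Chart 1: number of rows in each feature category.
category_counts = df["color"].value_counts()

# Chart 2, regression: mean target value for each feature category.
mean_target_per_category = df.groupby("color")["price"].mean()

# Chart 2, classification: count of each target class within each category.
class_counts_per_category = pd.crosstab(df["color"], df["sold"])
```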
Feature correlations
Visualization of the correlations in the data yields two plots:
Heatmap displaying the pairwise correlation of the 10 most important variables, with each
other and with the target variable (the 10 most important features are selected based on
their correlation with the target variable).
Horizontal histogram displaying highly correlated pairs of independent variables. A pair is
included if the value of its mutual correlation exceeds 0.7.
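The selection logic described above might look like the following sketch. It assumes Pearson
correlation and compares absolute values against the 0.7 threshold. The text does not specify
either choice, and both function names are hypothetical.

```python
import numpy as np
import pandas as pd


def top_features_by_target_correlation(df, target, k=10):
    """Names of the k numeric features most correlated (in absolute value) with the target."""
    numeric = df.select_dtypes("number")
    corr_with_target = numeric.corr()[target].drop(target)
    return corr_with_target.abs().sort_values(ascending=False).head(k).index.tolist()


def highly_correlated_pairs(df, target, threshold=0.7):
    """Pairs of independent variables whose absolute correlation exceeds the threshold."""
    corr = df.select_dtypes("number").drop(columns=[target]).corr()
    cols = corr.columns
    pairs = []
    # Walk the upper triangle so each pair is reported once.
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            value = corr.iloc[i, j]
            if abs(value) > threshold:
                pairs.append((cols[i], cols[j], float(value)))
    return sorted(pairs, key=lambda p: abs(p[2]), reverse=True)


# Hypothetical data: "b" is nearly a scaled copy of "a", "c" is independent noise.
rng = np.random.default_rng(0)
x = rng.normal(size=500)
df = pd.DataFrame({
    "a": x,
    "b": 2 * x + rng.normal(scale=0.1, size=500),
    "c": rng.normal(size=500),
})
df["y"] = df["a"] + rng.normal(scale=0.5, size=500)

top_two = top_features_by_target_correlation(df, "y", k=2)
pairs = highly_correlated_pairs(df, "y")
```

The heatmap would then be drawn from the correlation matrix restricted to the selected features
plus the target, and the horizontal histogram from the values in pairs.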
Target variable distribution
Visualization of the target variable statistics is presented in one of two formats:
Violin plot displaying the distribution, median and outliers in the target
variable (regression task type).
Histogram/count plot displaying the number and percentage of each of the
target classes throughout the whole dataset (classification task type).
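For the classification case, the numbers and percentages shown in the count plot can be
computed as in this sketch. The class labels are hypothetical toy data.

```python
import pandas as pd

# Hypothetical class target.
target = pd.Series(["cat", "dog", "cat", "cat", "bird", "dog", "cat", "dog"])

# Number of rows per class, and the same figure as a share of the whole dataset.
counts = target.value_counts()
percentages = (target.value_counts(normalize=True) * 100).round(1)

summary = pd.DataFrame({"count": counts, "percent": percentages})
```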
Outliers Visualization
Visualization of the outliers in the data is presented in one of
two plots:
Scatter plot displaying the variable distribution in relation to the target
variable (regression task type);
Box plot displaying the variable distribution, quantiles and median, and the
outliers (classification task type). Outliers are marked in purple, as indicated in the plot
legends.
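The text does not say how the tool decides which points are outliers. One common convention,
used for box plot whiskers, is the 1.5 × IQR rule. The sketch below applies that rule purely as
an assumption, and the function name iqr_outlier_mask is hypothetical.

```python
import pandas as pd


def iqr_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Flag values lying more than k * IQR below the first or above the third quartile."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)


# Hypothetical feature values with one obvious extreme point.
values = pd.Series([10, 12, 11, 13, 12, 11, 95])
mask = iqr_outlier_mask(values)
outliers = values[mask].tolist()
```

Points where mask is True are the ones that would be drawn in the outlier colour.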
Time Dependencies
A time dependency plot is created if a date-time column is present in the data.
Visualization of time dependency yields three plots, each displaying a line chart of the
target variable changes over time. The difference between the charts is the level of data
aggregation:
Chart 1: No aggregation. Target variable value is plotted against each date
point in the data.
Charts 2 and 3: Data is dynamically aggregated into years, months, weeks, days, hours or
minutes.
Aggregation options are automatically selected based on the data timeframe.
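The three levels of aggregation can be illustrated with pandas resampling. In this sketch the
two aggregated levels (days and weeks) are fixed by hand and the mean is used as the aggregate.
Both are assumptions, since the text does not describe the tool's selection rule or aggregation
function. The data is hypothetical.

```python
import pandas as pd

# Hypothetical series: one reading every 6 hours for 28 days, starting on a Monday.
index = pd.date_range("2024-01-01", periods=28 * 4, freq=pd.Timedelta(hours=6))
target = pd.Series(range(len(index)), index=index, dtype=float)

# Chart 1: no aggregation, one point per timestamp.
raw = target

# Charts 2 and 3: the same target averaged per day and per week.
daily = target.resample("D").mean()
weekly = target.resample("W").mean()
```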
Missing Values Visualization
Missing Values Visualization yields two histograms of the missing values in
the data. Each feature is displayed as a bar of equal size, with missing values marked against
the corresponding data indexes.
The «Missing Values Map Overall Dataset» plot displays the bars for all
features, without feature names, together with a missing values percentage indicator. The
purpose of this plot is to give an overall visual representation of the missing values in
the data.
The «Missing Values Map and percentage» plot displays only the columns that contain missing
values, with feature names, missing values percentages and the corresponding locations of the
missing values in the dataset.
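The information carried by the two plots can be sketched with pandas as follows: a boolean map
of missing cells, the percentage of missing values per column, and the row positions of the
missing values. The toy data and variable names are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with gaps in two of the three columns.
df = pd.DataFrame({
    "age": [25, np.nan, 40, np.nan],
    "city": ["A", "B", None, "D"],
    "income": [1.0, 2.0, 3.0, 4.0],
})

# True wherever a value is missing; this is what the map draws against the index.
missing_map = df.isna()

# Percentage of missing values in every column (first plot).
missing_percent = missing_map.mean() * 100

# Only the columns that contain missing values, with their locations (second plot).
columns_with_missing = missing_percent[missing_percent > 0]
missing_rows = {
    col: missing_map.index[missing_map[col]].tolist()
    for col in columns_with_missing.index
}
```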