How to Better Understand Your Machine Learning Data in Weka
It is important to take your time to learn about your data when starting on a new machine learning problem.
There are key things that you can look at to very quickly learn more about your dataset, such as descriptive statistics and data visualizations.
In this post you will discover how you can learn more about your data in the Weka machine learning workbench by reviewing descriptive statistics and visualizations of your data.
After reading this post you will know about:
- The distribution of attributes from reviewing statistical summaries.
- The distribution of attributes from reviewing univariate plots.
- The relationship between attributes from reviewing multivariate plots.
Let’s get started.
Better Understand Your Data With Descriptive Statistics
The Weka Explorer will automatically calculate descriptive statistics for numerical attributes.
- Open the Weka GUI Chooser.
- Click “Explorer” to open the Weka Explorer.
- Load the Pima Indians dataset from data/diabetes.arff.
The Pima Indians dataset contains numeric input variables that we can use to demonstrate the calculation of descriptive statistics.
First, note the dataset summary in the “Current Relation” section. This panel summarizes the following details about the loaded dataset:
- Dataset name (relation).
- The number of rows (instances).
- The number of columns (attributes).
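The same three details can be read straight from an ARFF file. The sketch below is a minimal illustration, assuming a simple ARFF layout (no quoted attribute names, no sparse rows). The inline text and the function name summarize_arff are made up for this example and are not part of Weka.

```python
# A tiny, made-up ARFF document used only for illustration.
ARFF_TEXT = """\
@relation toy
@attribute preg numeric
@attribute class {negative,positive}
@data
6,positive
1,negative
8,positive
"""


def summarize_arff(text):
    """Return (relation name, number of instances, number of attributes)."""
    relation = None
    attributes = []
    instances = 0
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines and ARFF comments, which start with '%'.
        if not line or line.startswith("%"):
            continue
        lower = line.lower()
        if in_data:
            # Every non-empty line after @data is counted as one instance.
            instances += 1
        elif lower.startswith("@relation"):
            relation = line.split(None, 1)[1]
        elif lower.startswith("@attribute"):
            # Second token is the attribute name (assumes no spaces in names).
            attributes.append(line.split(None, 2)[1])
        elif lower.startswith("@data"):
            in_data = True
    return relation, instances, len(attributes)


relation, n_instances, n_attributes = summarize_arff(ARFF_TEXT)
print(relation, n_instances, n_attributes)
```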
Click on the first attribute in the dataset in the “Attributes” panel.
Take note of the details in the “Selected attribute” panel. It lists a lot of information about the selected attribute, such as:
- The name of the attribute.
- The number of missing values and the ratio of missing values across the whole dataset.
- The number of distinct values.
- The data type.
The table below lists a number of descriptive statistics and their values. For numeric attributes, a useful four-number summary is provided:
- Minimum value.
- Maximum value.
- Mean value.
- Standard deviation.
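To make the four numbers concrete, here is a small sketch that computes them for a handful of values. The values are hypothetical, not taken from the dataset, and the sketch assumes the sample standard deviation (n - 1 denominator), which may or may not match what Weka reports.

```python
import statistics

# Hypothetical values for one numeric attribute (not the real dataset).
values = [6, 1, 8, 1, 0, 5, 3, 10]


def four_number_summary(xs):
    """Return the minimum, maximum, mean and sample standard deviation."""
    return {
        "min": min(xs),
        "max": max(xs),
        "mean": statistics.mean(xs),
        # Sample standard deviation (n - 1 denominator); an assumption here.
        "stdev": statistics.stdev(xs),
    }


summary = four_number_summary(values)
for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```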
You can learn a lot from this information. For example:
- The presence and ratio of missing data can give you an indication of whether or not you need to remove or impute values.
- The mean and standard deviation give you a quantified idea of the center and spread of the data for each attribute.
- The number of distinct values can give you an idea of the granularity of the attribute distribution.
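The missing and distinct counts are simple to reproduce by hand. The sketch below is an illustration on hypothetical values, with None standing in for a missing value; the function name missing_and_distinct is made up for this example.

```python
# Hypothetical attribute values; None stands in for a missing value.
values = [72, 66, None, 64, 66, 40, None, 74]


def missing_and_distinct(xs):
    """Count missing values, their ratio, and the distinct observed values."""
    present = [x for x in xs if x is not None]
    missing = len(xs) - len(present)
    return {
        "missing": missing,
        "missing_ratio": missing / len(xs),
        "distinct": len(set(present)),
    }


report = missing_and_distinct(values)
print(report)
```

A high missing ratio suggests removing or imputing values; a distinct count close to the number of rows suggests a fine-grained, continuous-looking attribute.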
Click the class attribute. This attribute has a nominal type. Review the “Selected attribute” panel.
For nominal attributes, we are given a list of the categories and the count of instances that belong to each category. There is also mention of weightings, which we can ignore for now. Weights are used if we want to assign more or less weight to specific attribute values or instances in the dataset.
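The per-category counts shown for a nominal attribute amount to a frequency table. A minimal sketch, using hypothetical class labels rather than the real data:

```python
from collections import Counter

# Hypothetical class labels (not the real dataset).
labels = ["negative", "positive", "negative", "negative", "positive"]

# Count how many instances fall into each category.
counts = Counter(labels)
total = sum(counts.values())

for label, count in counts.most_common():
    print(f"{label}: {count} ({count / total:.0%})")
```

Comparing the counts is a quick way to spot an imbalanced class attribute.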
Univariate Attribute Distributions
The distribution of each attribute can be plotted to give a visual qualitative understanding of the distribution.
Weka provides these plots automatically when you select an attribute in the “Preprocess” tab.
We can follow on from the previous section where we already have the Pima Indians dataset loaded.
Click on the “preg” attribute in the “Attributes” panel and note the plot below the “Selected attribute” panel. You will see the distribution of preg values between 0 and 17 along the x-axis. The y-axis shows the count of instances with each preg value.
Note the red and blue colors referring to the positive and negative classes respectively. The colors are assigned automatically to each categorical value. If there were three categories for the class value, we would see the breakdown of the preg distribution by three colors rather than two.
This is useful to get a quick idea of whether the problem is easily separable for a given attribute, e.g. all the red and blue are cleanly separated for a single attribute. Clicking through each attribute in the list of Attributes and reviewing the plots, we can see that there is no such easy separation of the classes.
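The idea behind the colored plot can be sketched as a histogram broken down by class. The example below is a rough illustration on hypothetical (value, class) pairs. It counts per distinct value rather than per bin, which is a simplification of this sketch and not necessarily how Weka draws the plot.

```python
from collections import Counter, defaultdict

# Hypothetical (attribute value, class label) pairs, not the real dataset.
rows = [
    (0, "negative"), (1, "negative"), (1, "positive"),
    (3, "negative"), (8, "positive"), (8, "positive"),
]


def class_histogram(pairs):
    """Map each attribute value to a Counter of class labels seen there."""
    hist = defaultdict(Counter)
    for value, label in pairs:
        hist[value][label] += 1
    return hist


def mixed_values(hist):
    """Values where more than one class occurs, so no clean separation."""
    return sorted(v for v, c in hist.items() if len(c) > 1)


hist = class_histogram(rows)
for value in sorted(hist):
    print(value, dict(hist[value]))
print("mixed:", mixed_values(hist))
```

If the list of mixed values were empty, the attribute alone would separate the classes in this sample.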
We can quickly get an overview of the distribution of all attributes in the dataset and the breakdown of distributions by class by clicking the “Visualize All” button above the univariate plot.
Looking at these plots we can see a few interesting things about this dataset.
- It looks like the plas, pres and mass attributes have a nearly Gaussian distribution.
- It looks like pres, skin, insu and mass have values at 0 that look out of place.
Looking at plots like this and jotting down things that come to mind can give you an idea of further data preparation operations that could be applied (like marking 0 values as corrupt) and even techniques that might be useful (like linear discriminant analysis, which assumes a Gaussian distribution in the input variables, and logistic regression, which tends to work well with such inputs).
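Marking suspicious 0 values as missing can be sketched in a few lines. This is an illustration on hypothetical values, assuming 0 is not a plausible measurement for the attribute; the function names are made up for this example, and it does not show how the step would be done inside Weka.

```python
# Hypothetical values for an attribute where 0 is not a plausible reading.
column = [72, 0, 64, 0, 80]


def mark_zeros_missing(xs):
    """Replace every 0 with None so it is treated as a missing value."""
    return [None if x == 0 else x for x in xs]


def mean_ignoring_missing(xs):
    """Mean of the values that are present (None entries are skipped)."""
    present = [x for x in xs if x is not None]
    return sum(present) / len(present)


cleaned = mark_zeros_missing(column)
print(cleaned)
print("mean with zeros:", sum(column) / len(column))
print("mean without zeros:", mean_ignoring_missing(cleaned))
```

The gap between the two means shows how much the implausible zeros can distort a summary statistic.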
Visualize Attribute Interactions
So far we have only been looking at the properties of individual attributes. Next, we will look at patterns in combinations of attributes.
When attributes are numeric we can create a scatter plot of one attribute against another. This is useful as it can highlight any patterns in the relationship between the attributes, such as positive or negative correlations.
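A scatter plot shows correlation visually; the Pearson correlation coefficient puts a number on it. The sketch below computes it from scratch on hypothetical values. The function name pearson is chosen for this example.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and the two sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant attribute")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical attribute values, not the real dataset.
a = [1, 2, 3, 4, 5]
rising = [2, 4, 6, 8, 10]
falling = [10, 8, 6, 4, 2]

print(pearson(a, rising))   # close to +1: positive correlation
print(pearson(a, falling))  # close to -1: negative correlation
```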
We can create scatter plots for all pairs of input attributes. This is called a scatter plot matrix and reviewing it before modeling your data can shed more light on further preprocessing techniques that you could investigate.
Weka provides a scatter plot matrix for review by default in the “Visualize” tab.
Continuing on from the previous section with the Pima Indians dataset loaded, click the “Visualize” tab, and make the window large enough to review all of the individual scatter plots.
You can see that all combinations of attributes are plotted in a systematic way. You can also see that each plot appears twice, first in the top left triangle and again in the bottom right triangle with the axes flipped. You can also see a series of plots starting in the bottom left and continuing to the top right where each attribute is plotted against itself. These can be ignored.
Finally, notice that the dots in the scatter plots are colored by their class value. It is good to look for trends or patterns in the dots, such as clear separation of the colors.
Clicking on a plot will give you a new window with the plot that you can further play with.
Note the controls at the bottom of the screen. They let you increase the size of the plots, increase the size of the dots and add jitter.
This last point about jitter is useful when you have a lot of dots overlaying each other and it is hard to see what is going on. Jitter will add some random noise to the data in the plots, spread out the points a bit and help you see what is going on.
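The effect of jitter can be sketched as adding a small random offset to each coordinate. This is a generic illustration, assuming uniform noise within a fixed amount; the function name add_jitter and its parameters are made up for this example, and Weka's own jitter calculation may differ.

```python
import random


def add_jitter(points, amount, seed=0):
    """Shift each (x, y) point by uniform noise in [-amount, amount]."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    return [
        (x + rng.uniform(-amount, amount), y + rng.uniform(-amount, amount))
        for x, y in points
    ]


# Five hypothetical points stacked on the same spot: they draw as one dot.
points = [(1.0, 2.0)] * 5
jittered = add_jitter(points, amount=0.1)

print("distinct positions before:", len(set(points)))
print("distinct positions after:", len(set(jittered)))
```

The noise only changes where points are drawn; the underlying data is untouched.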
When you make a change to these controls, click the “Update” button to apply the changes.
For example, below are the same plots with a larger dot size that makes it easier to see any trends in the data.
Summary
In this post you discovered how you can learn more about your machine learning data by reviewing descriptive statistics and data visualizations.
Specifically, you learned:
- That Weka automatically calculates descriptive statistics for each attribute.
- That Weka allows you to review the distribution of each attribute easily.
- That Weka provides a scatter plot visualization to review the pairwise relationships between attributes.
Do you have any questions about descriptive statistics and data visualization in Weka or about this post? Ask your questions in the comments below and I will do my best to answer them.