
How To Work Through A Problem Like A Data Scientist

In a 2010 post Hilary Mason and Chris Wiggins described the OSEMN process as a taxonomy of tasks that a data scientist should feel comfortable working on.

The post, titled “A Taxonomy of Data Science”, appeared on the now-defunct dataists blog. The process has also been used as the structure of a recent book, “Data Science at the Command Line: Facing the Future with Time-Tested Tools” by Jeroen Janssens, published by O’Reilly.

In this post we take a closer look at the OSEMN process for working through a data problem.

Work Through A Problem Like A Data Scientist
Photo by U.S. Army RDECOM, some rights reserved

OSEMN Process

OSEMN is an acronym that rhymes with “possum” or “awesome” and stands for Obtain, Scrub, Explore, Model, and iNterpret.

It is a list of tasks that a data scientist should be familiar with and comfortable working on, although the authors point out that no data scientist will be an expert at all of them.

In addition to a list of tasks, OSEMN can be used as a blueprint for working on data problems using machine learning tools.

Within the process, the authors point out that data hacking fits into the “O” and “S” tasks and machine learning fits into the “E” and “M” tasks, and that data science requires a combination of all of them.

1. Obtain Data

The authors point out that manual processes of data collection do not scale and that you must learn how to automatically obtain the data you need for a given problem.

Their examples of manual processes include pointing and clicking with a mouse and copying and pasting data from documents.

The authors suggest that you adopt a range of tools and use the one most suitable for the job at hand. They point to Unix command line tools, SQL in databases, web scraping, and scripting using Python and shell scripts.

Finally, the authors point to the importance of using APIs to access data, where an API may be public or internal to your organization. Often data is presented in JSON and scripting languages like Python can make data retrieval a lot easier.
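
As a rough illustration of scripted retrieval, the sketch below decodes a JSON payload in Python. The response shape (a top-level "results" list), the function names and the sample payload are assumptions made for this example; they are not taken from the original post or from any particular API.

```python
import json
from urllib.request import urlopen


def parse_records(payload):
    """Decode a JSON payload and return its list of records."""
    data = json.loads(payload)
    # Assumed response shape: {"results": [{...}, {...}]}
    return data.get("results", [])


def fetch_records(url):
    """Download a JSON document from `url` and return its records."""
    with urlopen(url, timeout=10) as response:
        return parse_records(response.read().decode("utf-8"))


# Offline demonstration with a canned payload instead of a live request.
sample = '{"results": [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]}'
records = parse_records(sample)
print([record["id"] for record in records])
```

Keeping the parsing step separate from the download step means the same code can be exercised on saved responses before it is pointed at a real endpoint.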

2. Scrub Data

The data that you obtain will be messy.

Real data can have inconsistencies, missing values and various other forms of corruption. If it was scraped from a difficult data source, it may require trimming and cleaning up. Even clean data may require post-processing to make it uniform and consistent.

Data cleaning or scrubbing requires “command line fu” and simple scripting.

The authors point out that data cleaning is the least sexy part of working on data problems but good data cleaning may provide the most benefits in terms of the results that you can achieve.

A simple analysis of clean data can be more productive than a complex analysis of noisy and irregular data.

The authors point to simple command line tools such as sed, awk, grep and scripting languages like Python and Perl.
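
A minimal sketch of this kind of scrubbing in Python is shown below. The set of tokens treated as missing and the sample rows are assumptions for illustration; real data will need its own rules.

```python
# Assumed spellings of "missing" in the raw data (compared in lowercase).
MISSING_TOKENS = {"", "na", "n/a", "null", "?"}


def scrub_row(row):
    """Trim whitespace and map missing-value tokens to None."""
    cleaned = []
    for field in row:
        value = field.strip()
        cleaned.append(None if value.lower() in MISSING_TOKENS else value)
    return cleaned


def scrub(rows):
    """Clean every row and drop rows in which every field is missing."""
    cleaned_rows = [scrub_row(row) for row in rows]
    return [row for row in cleaned_rows if any(v is not None for v in row)]


raw = [
    [" 5.1", "3.5 ", "setosa"],
    ["N/A", "3.0", " versicolor "],
    ["", "?", ""],
]
clean = scrub(raw)
for row in clean:
    print(row)
```

The same steps could be written with sed or awk; the point is only that the rules are scripted and repeatable rather than applied by hand.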

For more information, take a look at the Data Preparation Process.

3. Explore Data

Explore in this case refers to exploratory data analysis.

At this stage, no hypothesis is being tested and no predictions are being evaluated.

Data exploration is useful for getting to know your data, for building an intuition for its form and for getting ideas for data transforms and even predictive models to use later on in the process.

The authors list a number of methods that may be helpful in this task:

  • Command Line Tools for inspecting the data, such as more, less, head and tail.
  • Histograms to summarize the distribution of individual data attributes.
  • Pairwise Histograms to plot attributes against each other and highlight relationships and outliers.
  • Dimensionality Reduction methods for creating lower dimensional plots and models of the data.
  • Clustering to expose natural groupings in the data.
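
To make the histogram idea concrete, here is a small sketch that bins a single numeric attribute and prints a text histogram. The sample values and the bin width are made up for the example, and integer values are used to keep the binning arithmetic exact.

```python
from collections import Counter


def histogram(values, bin_width):
    """Count how many values fall into each bin of width `bin_width`."""
    counts = Counter()
    for value in values:
        # Each value is assigned to the bin that starts at or below it.
        bin_start = (value // bin_width) * bin_width
        counts[bin_start] += 1
    return dict(sorted(counts.items()))


values = [1, 2, 2, 3, 7, 8, 12]
bins = histogram(values, bin_width=5)
for bin_start, count in bins.items():
    print(f"{bin_start:>3} to {bin_start + 4:>3} | {'#' * count}")
```

A plotting library would normally be used for this, but even a text summary like this one can reveal skew or outliers in an attribute.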

4. Model Data

Model accuracy is often the ultimate goal for a given data problem. This means that predictive accuracy is the criterion by which a model is chosen.

often the ‘best’ model is the most predictive model

Generally the goal is to use a model to predict and to interpret. Prediction can be evaluated quantitatively, whereas interpretation is softer and qualitative.

A model’s predictive accuracy can be evaluated by how well it performs on unseen data. It can be estimated using methods such as cross-validation.

The algorithms that you try reflect your biases and restrict the hypothesis space of possible models that can be constructed for the problem. Choose wisely.
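
The sketch below shows one way k-fold cross-validation can be written in plain Python. The majority-class baseline, the toy data and all function names are assumptions for illustration; the folds are contiguous and unshuffled to keep the example short, whereas in practice the data would usually be shuffled first.

```python
from collections import Counter


def k_fold_indices(n, k):
    """Split indices 0..n-1 into k contiguous folds of near-equal size."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def cross_val_accuracy(X, y, fit, predict, k=5):
    """Average accuracy over k folds, each held out once as unseen data."""
    scores = []
    for test_idx in k_fold_indices(len(X), k):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(X)) if i not in held_out]
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        correct = sum(predict(model, X[i]) == y[i] for i in test_idx)
        scores.append(correct / len(test_idx))
    return sum(scores) / len(scores)


def fit_majority(X_train, y_train):
    """Baseline 'model': remember the most common training label."""
    return Counter(y_train).most_common(1)[0][0]


def predict_majority(model, x):
    """Always predict the remembered label, ignoring the input."""
    return model


X = list(range(10))
y = ["a"] * 8 + ["b"] * 2
score = cross_val_accuracy(X, y, fit_majority, predict_majority, k=5)
print(f"estimated accuracy: {score:.2f}")
```

Because each fold is scored only by a model that never saw it, the average is an estimate of performance on unseen data rather than on the training set.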

5. Interpret Results

The purpose of computing is insight, not numbers

— Richard Hamming

The authors use the example of handwritten digit recognition. They point out that a model for this problem does not have a theory of each number, rather it is a mechanism to discriminate between numbers.

This example highlights that the concerns of prediction may not be the same as those of model interpretation. In fact, they may conflict. A complex model may be highly predictive, but the number of terms or data transforms performed may make it nearly impossible to understand, in the context of the domain, why specific predictions are made.

The predictive power of a model is determined by its ability to generalize. The authors suggest that the interpretative power of a model is its ability to suggest the most interesting experiments to perform next. It gives insights into the problem and the domain.

The authors point to three key concerns when choosing a model to balance predictive power and interpretability:

  • Choose a good representation: the form of the data that you obtain. Most data is messy.
  • Choose good features: the attributes of the data that you select to model.
  • Choose a good hypothesis space: this is constrained by the models and data transforms you select.

Summary

In this post you discovered the OSEMN process proposed by Hilary Mason and Chris Wiggins.

OSEMN stands for Obtain, Scrub, Explore, Model, and iNterpret.
