DataScience101

What does a day in the data science life look like?

You want to dive into the fascinating world of data and AI, and data science is the stepping stone to it, but it can be hard to decide where to start. Just looking at all the technologies you have to understand and the tools you’re supposed to master can be dizzying. Beginners in the field often start by applying fancy algorithms without any preprocessing, and then don’t get the results they expect. So which data science steps should you take first?

Having a standard workflow for data science projects keeps the various teams within an organization in sync, so that delays can be avoided. The end goal of any data science project is to produce an effective data product, meaning the usable results produced at the end of the project. A data product can be anything that facilitates business decision-making to solve a business problem, such as a dashboard or a recommendation engine. So, let’s go STEP BY STEP!

#1 Understanding the business problem

The first thing you have to do before you solve a problem is to define exactly what it is. You need to be able to translate data questions into something actionable, so ask relevant questions that help you understand the problem you are going to solve.

Ask multiple “why?” questions and get answers from the client, the stakeholder, or whoever asked you to start the project. For example:

Who are the customers?
Why are they buying our product?
How do we predict if a customer is going to buy our product?

#2 Data acquisition

Once you’ve understood the problem, you’ll need data to give you the insights needed to turn the problem around with a solution. This part of the process involves thinking through what data you’ll need and finding ways to get it, whether by querying internal databases or by purchasing external datasets. So, after deciding which features and metrics to use to solve the business problem, the next step is to gather the data. You can use sources such as databases, APIs, web scrapers, and online repositories.

For example, a company might store all of its sales data in a CRM (customer relationship management) platform. You can export the CRM data to a CSV file for further analysis.
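As a rough sketch of that CSV route, the snippet below reads an exported file into a list of rows using only Python’s standard `csv` module. The column names (`customer_id`, `age`, `region`, `purchased`) and the sample rows are made up for illustration; a real CRM export will look different, and you would open the exported file instead of the in-memory string used here.

```python
import csv
import io

# Hypothetical CRM export, kept in memory so the example runs anywhere.
# With a real export you would use: open("crm_export.csv", newline="")
CRM_EXPORT = """customer_id,age,region,purchased
1,34,north,yes
2,58,south,no
3,41,north,yes
"""


def load_crm_rows(source):
    """Read CSV text from a file-like object into a list of dicts.

    Every value comes back as a string; type conversion is left to
    the data preparation step.
    """
    return list(csv.DictReader(source))


rows = load_crm_rows(io.StringIO(CRM_EXPORT))
print(len(rows), "rows loaded")
print(rows[0])
```

Many people reach for a dataframe library such as pandas at this point; the standard library is used here only to keep the sketch free of dependencies.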

#3 Data preparation

Now that you have all of the raw data, you’ll need to process it before you can do any analysis. Oftentimes, data can be quite messy, especially if it hasn’t been well-maintained. You’ll see errors that will corrupt your analysis: values set to null though they really are zero, duplicate values, and missing values. It’s up to you to go through and check your data to make sure you’ll get accurate insights.

In a nutshell, this involves two important things: data cleaning and data transformation. Data cleaning means checking for problems such as missing values and inconsistent data types, while data transformation is the process of modifying the data based on predefined rules.
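Here is a minimal sketch of both halves on a handful of made-up rows. The cleaning choices (drop exact duplicates, treat an empty `orders` value as zero, drop rows with no age) and the transformation rule (an `age_group` label with a cut-off at 50) are assumptions picked for illustration; the right rules always depend on your data and your business problem.

```python
# Hypothetical raw rows, as they might come out of a CSV file (all strings).
RAW_ROWS = [
    {"customer_id": "1", "age": "34", "orders": "2"},
    {"customer_id": "2", "age": "58", "orders": ""},   # null that really means zero
    {"customer_id": "2", "age": "58", "orders": ""},   # exact duplicate
    {"customer_id": "3", "age": "", "orders": "5"},    # missing age
]


def clean(rows):
    """Data cleaning: remove duplicates, handle missing values, fix types."""
    seen = set()
    cleaned = []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key in seen:
            continue  # skip exact duplicate rows
        seen.add(key)
        if row["age"] == "":
            continue  # assumption: rows without an age are dropped
        cleaned.append({
            "customer_id": int(row["customer_id"]),
            "age": int(row["age"]),
            # assumption: an empty orders field means zero orders
            "orders": int(row["orders"]) if row["orders"] else 0,
        })
    return cleaned


def transform(rows):
    """Data transformation: apply a predefined rule to derive a new column."""
    return [
        dict(row, age_group="50+" if row["age"] >= 50 else "under 50")
        for row in rows
    ]


prepared = transform(clean(RAW_ROWS))
for row in prepared:
    print(row)
```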

#4 Exploratory Data Analysis

When your data is clean, you should start playing with it!

One very important question arises here: “What exactly do you want from the data?” This is the most important step, because it can show you which features carry the most weight for model building, along with other useful insights.

The difficulty here isn’t coming up with ideas to test, it’s coming up with ideas that are likely to turn into insights. You’ll have a fixed deadline for your data science project, so you’ll have to prioritize your questions. For example:

Suppose sales are lower for one particular group of customers. Look for the most interesting patterns that can help explain why sales are reduced for this group. You might notice that they don’t tend to be very active on social media, with few of them having Twitter or Facebook accounts. You might also notice that most of them are older than your general audience. From that you can begin to trace patterns you can analyze more deeply.
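A comparison like this can start out as a simple per-group summary. The sketch below uses invented customers and a hypothetical `segment` label (`low_sales` versus `other`) to compare average age and the share of customers with a social media account; it only illustrates the idea and is not a full exploratory analysis.

```python
from statistics import mean

# Hypothetical customers; "segment" marks the group with reduced sales.
CUSTOMERS = [
    {"segment": "low_sales", "age": 62, "has_social_media": False},
    {"segment": "low_sales", "age": 58, "has_social_media": False},
    {"segment": "low_sales", "age": 66, "has_social_media": True},
    {"segment": "other", "age": 29, "has_social_media": True},
    {"segment": "other", "age": 35, "has_social_media": True},
    {"segment": "other", "age": 41, "has_social_media": False},
]


def summarize(rows):
    """Return basic descriptive statistics for one group of customers."""
    return {
        "count": len(rows),
        "mean_age": mean(r["age"] for r in rows),
        # share of customers with at least one social media account
        "social_media_share": mean(int(r["has_social_media"]) for r in rows),
    }


summaries = {}
for segment in ("low_sales", "other"):
    group = [r for r in CUSTOMERS if r["segment"] == segment]
    summaries[segment] = summarize(group)
    print(segment, summaries[segment])
```

If the numbers for the two groups differ clearly, as they do in this toy data, that is a pattern worth digging into with more data and proper statistical checks.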

#5 Data modeling i.e. the best fit

In database design, data modeling is the process of producing a descriptive diagram of relationships between the various types of information that are to be stored in a database, with the goal of storing information as efficiently as possible while still providing complete access and reporting. In a data science project, modeling instead means building statistical or machine learning models from your prepared data, and that is the sense used in this step.

This step of the process is where you’re going to have to apply your statistical, mathematical, and technological knowledge and leverage all of the data science tools at your disposal to crunch the data and find every insight you can.

You can now combine all of those qualitative insights with data from your quantitative analysis to craft a story that moves people to action.

This is the most important part, where you find the model that best fits the business requirements. You might run multiple iterations on the training and test data to find the best-performing model.

For example, you might have to create a predictive model that compares your underperforming group with your average customer. You might find out that age and social media activity are significant factors in predicting who will buy the product.
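To make the train-and-test idea concrete, here is a deliberately tiny sketch. Instead of a real machine learning library, it fits a one-rule model (a “decision stump”: one feature, one threshold) by trying every candidate rule on the training rows, then checks the chosen rule on held-out test rows. The data, the fixed 8/4 split, and the stump itself are assumptions made to keep the example dependency-free; in a real project you would typically use a library such as scikit-learn and a randomized split.

```python
# Hypothetical rows: (age, active_on_social_media, purchased)
DATA = [
    (25, 1, 1), (31, 1, 1), (38, 0, 1), (45, 1, 1),
    (52, 0, 0), (58, 0, 0), (63, 1, 0), (67, 0, 0),
    (29, 1, 1), (61, 0, 0), (48, 1, 1), (56, 1, 0),
]
FEATURE_NAMES = ("age", "active_on_social_media")

# Fixed split for reproducibility: first 8 rows to train, last 4 to test.
TRAIN, TEST = DATA[:8], DATA[8:]


def predict(stump, row):
    """Predict 1 (buys) or 0 (does not buy) from a single-feature rule."""
    feature, threshold, below_means_buy = stump
    is_below = row[feature] <= threshold
    return int(is_below == below_means_buy)


def accuracy(stump, rows):
    """Fraction of rows whose label (last column) the stump gets right."""
    hits = sum(predict(stump, row) == row[-1] for row in rows)
    return hits / len(rows)


def fit_stump(rows):
    """Try every feature, threshold, and direction; keep the best on `rows`."""
    best, best_score = None, -1.0
    for feature in range(len(FEATURE_NAMES)):
        for threshold in sorted({row[feature] for row in rows}):
            for below_means_buy in (True, False):
                candidate = (feature, threshold, below_means_buy)
                score = accuracy(candidate, rows)
                if score > best_score:
                    best, best_score = candidate, score
    return best


best_stump = fit_stump(TRAIN)
print("chosen feature:", FEATURE_NAMES[best_stump[0]])
print("threshold:", best_stump[1])
print("train accuracy:", accuracy(best_stump, TRAIN))
print("test accuracy:", accuracy(best_stump, TEST))
```

The gap between training and test accuracy is the thing to watch: a model that looks perfect on the data it was fitted to can still make mistakes on customers it has never seen.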

REFERENCE: Analytics Vidhya, KDnuggets

Machine learning book