# Effective Strategies For Handling Missing Values In Data Analysis (Updated 2023)



If you are aiming for a job as a data scientist, you must know how to handle the problem of missing values, which is quite common in many real-life datasets. Incomplete data can bias the results of machine learning models and/or reduce their accuracy. This article describes missing data, how it is represented, and the different reasons data values get missed. Along with the different categories of missing data, it also details different ways of handling missing values with dataset examples.

Learning Objectives

In this tutorial, we will learn about missing values and why analyzing missing data matters in data science.

You will learn about the different types of missing data and how to handle them correctly.

You will also learn about the most widely used imputation methods to handle incomplete data.

What Is a Missing Value?

Missing data is defined as values that are not stored (or not present) for some variables in a given dataset. Below is a sample of missing data from the Titanic dataset. You can see that the columns 'Age' and 'Cabin' have some missing values.

[Image: Titanic dataset sample with missing values. Source: analyticsindiamag]

How Is a Missing Value Represented in a Dataset?

In the dataset, a blank cell indicates a missing value.

In Pandas, missing values are usually represented by NaN, which stands for Not a Number.

[Image: first few records of the Titanic dataset displayed with Pandas. Source: medium]

The above image shows the first few records of the Titanic dataset extracted and displayed using Pandas.
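As a quick illustration (on a tiny, hypothetical two-row sample rather than the full Titanic data), Pandas reads blank fields as NaN, and isnull() flags them:

```python
import numpy as np
import pandas as pd

# A small, made-up sample mimicking the Titanic columns
df = pd.DataFrame({
    'Name': ['Braund, Mr. Owen Harris', 'Cumings, Mrs. John Bradley'],
    'Age': [22.0, np.nan],     # Age missing for the second passenger
    'Cabin': [np.nan, 'C85'],  # Cabin missing for the first passenger
})

print(df.isnull())               # True marks a missing (NaN) entry
print(df['Age'].isnull().sum())  # number of missing values in 'Age'
```

The isnull() method returns a boolean mask of the same shape as the data, which is the building block for every missing-value check later in this article.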

Why Is Data Missing From the Dataset?

There can be multiple reasons why certain values are missing from the data. The reason the data is missing affects how it should be handled, so it is necessary to understand why it could be missing.

Some of the reasons are listed below:

Past data might get corrupted due to improper maintenance.

Observations are not recorded for certain fields for various reasons; for instance, values might fail to be recorded due to human error.

The user intentionally did not provide the values.

Item nonresponse: This means the participant refused to respond.

Types of Missing Values

Formally, missing values are categorized as follows:

[Image: diagram of the three categories of missing values. Source: theblogmedia]

Missing Completely At Random (MCAR)

In MCAR, the probability of data being missing is the same for all the observations. In this case, there is no relationship between the missing data and any other values observed or unobserved (the data which is not recorded) within the given dataset. That is, missing values are completely independent of other data. There is no pattern.

Missing At Random (MAR)

MAR data means that the reason for missing values can be explained by variables on which you have complete information, as there is some relationship between the missing data and other values/data. In this case, the data is not missing for all the observations. It is missing only within sub-samples of the data, and there is some pattern in the missing values.

For example, if you check the survey data, you may find that all the people have answered their ‘Gender,’ but ‘Age’ values are mostly missing for people who have answered their ‘Gender’ as ‘female.’ (The reason being most of the females don’t want to reveal their age.)

So, the probability of data being missing depends only on the observed value or data. In this case, the variables ‘Gender’ and ‘Age’ are related. The reason for missing values of the ‘Age’ variable can be explained by the ‘Gender’ variable, but you can not predict the missing value itself.

Suppose a poll about overdue books is taken in a library, asking for gender and the number of overdue books. Assume most women answer the poll, while men are less likely to. Why the data is missing can then be explained by another observed factor: gender. In this case, the statistical analysis might be biased; an unbiased estimate of the parameters can be obtained only by modeling the missing data.
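To make the MAR mechanism concrete, here is a small simulation on entirely hypothetical data, in which the probability that 'Age' is missing depends only on the observed 'Gender' column:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1000
gender = rng.choice(['male', 'female'], size=n)
age = rng.integers(18, 70, size=n).astype(float)

# MAR: the chance that Age is missing depends on the *observed* Gender,
# not on the (unobserved) Age value itself.
p_missing = np.where(gender == 'female', 0.6, 0.1)
age[rng.random(n) < p_missing] = np.nan

df = pd.DataFrame({'Gender': gender, 'Age': age})
# Missing-rate of Age differs sharply between the two observed groups
print(df.groupby('Gender')['Age'].apply(lambda s: s.isnull().mean()))
```

Grouping by 'Gender' reveals the pattern: the missing rate for 'Age' is much higher in one group, which is exactly the signature of MAR data.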

Missing Not At Random (MNAR)

Missing values depend on the unobserved data. If there is some structure/pattern in missing data and other observed data can not explain it, then it is considered to be Missing Not At Random (MNAR).

If the missing data does not fall under the MCAR or MAR, it can be categorized as MNAR. It can happen due to the reluctance of people to provide the required information. A specific group of respondents may not answer some questions in a survey.

For example, suppose the name and the number of overdue books are asked in a poll for a library. Most people with no overdue books are likely to answer the poll, while people with many overdue books are less likely to. So, in this case, the missingness of the number of overdue books depends on the value itself: those with more overdue books are less likely to report it.

Another example is that people having less income may refuse to share some information in a survey or questionnaire.

In the case of MNAR as well, the statistical analysis might result in bias.

Why Do We Need to Care About Handling Missing Data?

It is important to handle the missing values appropriately.

Many machine learning algorithms fail if the dataset contains missing values, although some implementations (for example, certain k-nearest neighbors and naive Bayes variants) can handle them.

You may end up building a biased machine learning model, leading to incorrect results if the missing values are not handled properly.

Missing data can lead to a lack of precision in the statistical analysis.

Practice Problem

Let’s take an example of the Loan Prediction Practice Problem from Analytics Vidhya. You can download the dataset from the following link.

Checking for Missing Values in Python

The first step in handling missing values is to carefully look at the complete data and find all the missing values. The following code shows the total number of missing values in each column. It also shows the total number of missing values in the entire data set.
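The per-column check is a single isnull().sum() call. It is sketched below on a small stand-in DataFrame (the article runs the same call on the loan training data, loaded as train_df from the downloaded CSV):

```python
import numpy as np
import pandas as pd

# Stand-in for the loan training data; in practice you would load it with
# train_df = pd.read_csv('train.csv')
train_df = pd.DataFrame({
    'Gender': ['Male', np.nan, 'Female'],
    'LoanAmount': [128.0, 66.0, np.nan],
    'Loan_Status': ['Y', 'N', 'Y'],
})

# Number of missing values in each column
print(train_df.isnull().sum())
```

On the real dataset, the same call produces the per-column counts described below.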

From the above output, we can see that there are 7 columns – Gender, Married, Dependents, Self_Employed, LoanAmount, Loan_Amount_Term, and Credit_History – having missing values.

```python
# Find the total number of missing values in the entire dataset
train_df.isnull().sum().sum()
# Output: 149
```

There are 149 missing values in total.

List of Methods to Handle Missing Values in a Dataset

Here is a list of popular strategies to handle missing values in a dataset:

Deleting the Missing Values

Imputing the Missing Values

Imputing the Missing Values for Categorical Features

Imputing the Missing Values Using the Scikit-learn Library

Using “Missingness” as a Feature

Handling Missing Values

Now that you have found the missing data, how do you handle the missing values?

Analyze each column with missing values carefully to understand the reasons behind the missing of those values, as this information is crucial to choose the strategy for handling the missing values.

There are 2 primary ways of handling missing values:

Deleting the Missing values

Imputing the Missing Values

Deleting the Missing Values

Generally, this approach is not recommended. It is one of the quick and dirty techniques one can use to deal with missing values. If the missing value is of the type Missing Not At Random (MNAR), then it should not be deleted.

If the missing value is of type Missing At Random (MAR) or Missing Completely At Random (MCAR), then it can be deleted. (In pairwise deletion, all cases with available data are used in each analysis, under the assumption that the missing observations are MCAR.)

There are 2 ways one can delete the missing data values:

Deleting the entire row (listwise deletion)

If a row has many missing values, you can drop the entire row. However, if every row has some column value missing, you might end up deleting the whole dataset. The code to drop the entire row is as follows:

```python
df = train_df.dropna(axis=0)
df.isnull().sum()
```

```
Loan_ID              0
Gender               0
Married              0
Dependents           0
Education            0
Self_Employed        0
ApplicantIncome      0
CoapplicantIncome    0
LoanAmount           0
Loan_Amount_Term     0
Credit_History       0
Property_Area        0
Loan_Status          0
dtype: int64
```

Deleting the entire column

If a certain column has many missing values, then you can choose to drop the entire column. The code to drop the entire column is as follows:

```python
df = train_df.drop(['Dependents'], axis=1)
df.isnull().sum()
```

```
Loan_ID              0
Gender              13
Married              3
Education            0
Self_Employed       32
ApplicantIncome      0
CoapplicantIncome    0
LoanAmount          22
Loan_Amount_Term    14
Credit_History      50
Property_Area        0
Loan_Status          0
dtype: int64
```

Imputing the Missing Values

There are many imputation methods for replacing missing values. You can use Python libraries such as Pandas and scikit-learn to do this. Let's go through some of the ways of replacing missing values.

Replacing with an arbitrary value

If you can make an educated guess about the missing value, then you can replace it with some arbitrary value using the following code. E.g., in the following code, we are replacing the missing values of the ‘Dependents’ column with ‘0’.

```python
# Replace the missing value with '0' using the 'fillna' method
train_df['Dependents'] = train_df['Dependents'].fillna(0)
train_df['Dependents'].isnull().sum()
# Output: 0
```

Replacing with the mean

This is the most common method of imputing missing values of numeric columns. If there are outliers, then the mean will not be appropriate. In such cases, outliers need to be treated first. You can use the ‘fillna’ method for imputing the columns ‘LoanAmount’ and ‘Credit_History’ with the mean of the respective column values.

```python
# Replace the missing values of numerical columns with the mean
train_df['LoanAmount'] = train_df['LoanAmount'].fillna(train_df['LoanAmount'].mean())
train_df['Credit_History'] = train_df['Credit_History'].fillna(train_df['Credit_History'].mean())
train_df.isnull().sum()
```

```
Loan_ID              0
Gender              13
Married              3
Dependents          15
Education            0
Self_Employed       32
ApplicantIncome      0
CoapplicantIncome    0
LoanAmount           0
Loan_Amount_Term     0
Credit_History       0
Property_Area        0
Loan_Status          0
dtype: int64
```

Replacing with the mode

Mode is the most frequently occurring value. It is used in the case of categorical features. You can use the ‘fillna’ method for imputing the categorical columns ‘Gender,’ ‘Married,’ and ‘Self_Employed.’

```python
# Replace the missing values of categorical columns with the mode
train_df['Gender'] = train_df['Gender'].fillna(train_df['Gender'].mode()[0])
train_df['Married'] = train_df['Married'].fillna(train_df['Married'].mode()[0])
train_df['Self_Employed'] = train_df['Self_Employed'].fillna(train_df['Self_Employed'].mode()[0])
train_df.isnull().sum()
```

```
Loan_ID              0
Gender               0
Married              0
Dependents           0
Education            0
Self_Employed        0
ApplicantIncome      0
CoapplicantIncome    0
LoanAmount           0
Loan_Amount_Term     0
Credit_History       0
Property_Area        0
Loan_Status          0
dtype: int64
```

Replacing with the median

The median is the middlemost value. It’s better to use the median value for imputation in the case of outliers. You can use the ‘fillna’ method for imputing the column ‘Loan_Amount_Term’ with the median value.

```python
train_df['Loan_Amount_Term'] = train_df['Loan_Amount_Term'].fillna(train_df['Loan_Amount_Term'].median())
```

Replacing with the previous value – forward fill

In some cases, imputing the values with the previous value instead of the mean, mode, or median is more appropriate. This is called forward fill. It is mostly used in time series data. You can use the 'fillna' function with the parameter method='ffill'.

```python
import pandas as pd
import numpy as np

test = pd.Series(range(6))
test.loc[2:4] = np.nan
test
```

```
0    0.0
1    1.0
2    NaN
3    NaN
4    NaN
5    5.0
dtype: float64
```

```python
# Forward-fill
test.fillna(method='ffill')
```

```
0    0.0
1    1.0
2    1.0
3    1.0
4    1.0
5    5.0
dtype: float64
```

Replacing with the next value – backward fill

In backward fill, the missing value is imputed using the next value.

```python
# Backward-fill
test.fillna(method='bfill')
```

```
0    0.0
1    1.0
2    5.0
3    5.0
4    5.0
5    5.0
dtype: float64
```
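Note that recent pandas versions deprecate the method= argument of fillna in favor of the dedicated ffill() and bfill() methods, which produce the same results:

```python
import numpy as np
import pandas as pd

test = pd.Series([0.0, 1.0, np.nan, np.nan, np.nan, 5.0])

print(test.ffill())  # same result as fillna(method='ffill')
print(test.bfill())  # same result as fillna(method='bfill')
```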


Interpolation

Missing values can also be imputed using interpolation. Pandas' interpolate method can be used to replace the missing values with different interpolation methods like 'polynomial', 'linear', and 'quadratic'. The default method is 'linear'.

```python
test.interpolate()
```

```
0    0.0
1    1.0
2    2.0
3    3.0
4    4.0
5    5.0
dtype: float64
```

How to Impute Missing Values for Categorical Features?

There are two ways to impute missing values for categorical features as follows:

Impute the Most Frequent Value

We will use 'SimpleImputer' in this case. As this is a non-numeric column, we can't use the mean or median, but we can use the most frequent value or a constant.

```python
import pandas as pd
import numpy as np

X = pd.DataFrame({'Shape': ['square', 'square', 'oval', 'circle', np.nan]})
X
```

```
    Shape
0  square
1  square
2    oval
3  circle
4     NaN
```

```python
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy='most_frequent')
imputer.fit_transform(X)
```

```
array([['square'],
       ['square'],
       ['oval'],
       ['circle'],
       ['square']], dtype=object)
```

As you can see, the missing value is imputed with the most frequent value, 'square'.

Impute the Value “Missing”

We can impute the value “missing,” which treats it as a separate category.

```python
imputer = SimpleImputer(strategy='constant', fill_value='missing')
imputer.fit_transform(X)
```

```
array([['square'],
       ['square'],
       ['oval'],
       ['circle'],
       ['missing']], dtype=object)
```

In either of the above approaches, you will still need to one-hot encode the data (or use another encoder of your choice). After one-hot encoding, in case 1, instead of the values 'square', 'oval', and 'circle', you will get three feature columns. In case 2, you will get four feature columns (the 4th one for the 'missing' category), which effectively adds a missing-indicator column to the data. There is another way to add a missing-indicator column, which we will discuss further below.

How to Impute Missing Values Using the Scikit-learn Library?

We can impute missing values using the scikit-learn library, either from a single column's own statistics or by building a model that predicts a variable from the other variables, which is known as regression imputation.

Univariate Approach

In a Univariate approach, only a single feature is taken into consideration. You can use the class SimpleImputer and replace the missing values with mean, mode, median, or some constant value.

Let’s see an example:

```python
import numpy as np
from sklearn.impute import SimpleImputer

imp = SimpleImputer(missing_values=np.nan, strategy='mean')
imp.fit([[1, 2], [np.nan, 3], [7, 6]])
```

```
SimpleImputer()
```

```python
X = [[np.nan, 2], [6, np.nan], [7, 6]]
print(imp.transform(X))
```

```
[[4.    2.      ]
 [6.    3.666...]
 [7.    6.      ]]
```

Multivariate Approach

In a multivariate approach, more than one feature is taken into consideration. There are two ways to impute missing values with this approach: using the KNNImputer class or the IterativeImputer class.

Let's take an example from the Titanic dataset.

Suppose the feature 'Age' is well correlated with the feature 'Fare' such that people with lower fares are younger and people with higher fares are older. In that case, it would make sense to impute low ages for low fare values and high ages for high fare values, so we take multiple features into account using a multivariate approach.

```python
import pandas as pd

cols = ['SibSp', 'Fare', 'Age']
X = df[cols]  # df is the Titanic DataFrame
X
```


```python
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

impute_it = IterativeImputer()
impute_it.fit_transform(X)
```

```
array([[ 1.        ,  7.25      , 22.        ],
       [ 1.        , 71.2833    , 38.        ],
       [ 0.        ,  7.925     , 26.        ],
       [ 1.        , 53.1       , 35.        ],
       [ 0.        ,  8.05      , 35.        ],
       [ 0.        ,  8.4583    , 28.50639495]])
```

Let's see how IterativeImputer works. For all rows in which 'Age' is not missing, scikit-learn runs a regression model using 'SibSp' and 'Fare' as the features and 'Age' as the target. Then, for all rows in which 'Age' is missing, it predicts 'Age' by passing 'SibSp' and 'Fare' to the trained model. So it builds a regression model with two features and one target, and makes predictions wherever there are missing values; those predictions are the imputed values.

Nearest Neighbors Imputations (KNNImputer)

Missing values are imputed using the k-nearest neighbors approach, where Euclidean distance is used to find the nearest neighbors. Let's take the above example from the Titanic dataset to see how it works.

```python
from sklearn.impute import KNNImputer

impute_knn = KNNImputer(n_neighbors=2)
impute_knn.fit_transform(X)
```

```
array([[ 1.    ,  7.25  , 22.    ],
       [ 1.    , 71.2833, 38.    ],
       [ 0.    ,  7.925 , 26.    ],
       [ 1.    , 53.1   , 35.    ],
       [ 0.    ,  8.05  , 35.    ],
       [ 0.    ,  8.4583, 30.5   ]])
```

In the above example, n_neighbors=2, so scikit-learn finds the two rows most similar to the row with the missing value, measured by how close their 'SibSp' and 'Fare' values are. Here the last row has a missing value, and the third and fifth rows have the closest values for the other two features, so the average of their 'Age' values is taken as the imputed value.
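This can be checked by hand. The snippet below (with the six example rows hard-coded so it runs on its own) confirms that the imputed value is the average of the 'Age' values from the third and fifth rows:

```python
import numpy as np
from sklearn.impute import KNNImputer

# Columns: SibSp, Fare, Age (the six rows from the example above)
X = np.array([
    [1, 7.25,    22.0],
    [1, 71.2833, 38.0],
    [0, 7.925,   26.0],
    [1, 53.1,    35.0],
    [0, 8.05,    35.0],
    [0, 8.4583,  np.nan],
])

filled = KNNImputer(n_neighbors=2).fit_transform(X)
print(filled[-1, 2])  # -> 30.5, i.e. (26 + 35) / 2
```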

How to Use “Missingness” as a Feature?

In some cases, while imputing missing values, you can preserve information about which values were missing and use that as a feature. This is because sometimes, there may be a relationship between the reason for missing values (also called the “missingness”) and the target variable you are trying to predict. In such cases, you can add a missing indicator to encode the “missingness” as a feature in the imputed data set.

Where can we use this?

Suppose you are predicting the presence of a disease. Now, imagine a scenario where a missing age is a good predictor of the disease because we don’t have records for people in poverty. The age values are not missing at random. They are missing for people in poverty, and poverty is a good predictor of disease. Thus, missing age or “missingness” is a good predictor of disease.

```python
import pandas as pd
import numpy as np

X = pd.DataFrame({'Age': [20, 30, 10, np.nan, 10]})
X
```


```python
from sklearn.impute import SimpleImputer

# Impute the mean
imputer = SimpleImputer()
imputer.fit_transform(X)
```

```
array([[20. ],
       [30. ],
       [10. ],
       [17.5],
       [10. ]])
```

```python
imputer = SimpleImputer(add_indicator=True)
imputer.fit_transform(X)
```

```
array([[20. ,  0. ],
       [30. ,  0. ],
       [10. ,  0. ],
       [17.5,  1. ],
       [10. ,  0. ]])
```

In the above example, the second column indicates whether the corresponding value in the first column was missing or not. ‘1’ indicates that the corresponding value was missing, and ‘0’ indicates that the corresponding value was not missing.

If you don't want to impute missing values but only want the indicator matrix, you can use the 'MissingIndicator' class from scikit-learn.
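A minimal sketch using the same 'Age' values as above (MissingIndicator returns only the boolean mask, with no imputation):

```python
import numpy as np
from sklearn.impute import MissingIndicator

X = np.array([[20.0], [30.0], [10.0], [np.nan], [10.0]])

indicator = MissingIndicator()
mask = indicator.fit_transform(X)
print(mask.ravel())  # -> [False False False  True False]
```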


Key Takeaways

Handling missing values is one of the key challenges of data analysis. It is critical to handle them well to reduce potential bias in machine learning models and obtain a precise statistical analysis of the data.

Understanding the different categories of missing data helps in deciding how to handle it. In this article, we explored those categories and the different ways of handling missing data.

Frequently Asked Questions

Q1. What are the types of missing values in data?

A. The three types of missing data are Missing Completely At Random (MCAR), Missing At Random (MAR), and Missing Not At Random (MNAR).

Q2. How do you handle missing values?

A. We can use different methods to handle missing data points, such as dropping missing values, imputing them using machine learning, or treating missing values as a separate category.

Q3. How does pairwise deletion handle missing data?

A. Pairwise deletion is a method of handling missing values where only the observations with complete data are used in each pairwise correlation or regression analysis. This method assumes that the missing data is MCAR, and it is appropriate when the missing data is not too large.

The media shown in this article is not owned by Analytics Vidhya and are used at the Author’s discretion.



Recommended For You

You're reading Effective Strategies For Handling Missing Values In Data Analysis (Updated 2023)

4 Effective Strategies For Supporting Newcomer English Learners

Each year, teachers across the country open their doors to new students at the start of the new academic year. Being a new student is overwhelming when it’s your first year, but can you imagine the added fear and anxiety that goes with being an English language (ELL) or multilingual learner (MLL)? On top of designing interactive bulletin boards and planning lessons, teachers who are new to working with ELLs/MLLs also wrestle with the concern of how to appropriately support their newcomer students both linguistically and culturally while also delivering content that’s meaningful.

Over the years, I’ve turned to simple strategies to encourage my students and help them to become better acclimated with our new school culture. For me, the best way to do this was to begin by acknowledging theirs first.

1. Make a Good First Impression

Introductions matter. Learn students’ names and nicknames. Learning how to pronounce your students’ names or using their preferred names shows them that you value their identity. Using icebreakers or name games is a fun way to learn students’ names. Be sure to model the instructions for the class before you begin. When I did this, I could visibly see my students becoming more comfortable with one another by making eye contact and smiling when they spoke and giving a thumbs-up or high five to encourage their peers.

A friendly greeting and a smile is a simple way to acknowledge newcomer students. Smiling is a universal language, which directly sends a message that you’re there for your students. It’s one way to give a warm welcome and invite them to feel relaxed and ready to learn.

2. Cultivate a Supportive and Caring Environment

As a teacher, I found that an easy way to support my high school students with classroom vocabulary was to use signage. Posters with sentence stems and labels of classroom supplies are one way in which students can internalize the English language through visuals and real objects.

Clear and consistent classroom routines also help newcomers to understand what is expected. Whether they are presented on a classroom agenda or introduced through modeling, predictable classroom routines help students feel safe. For example, introducing students to common classroom expressions accompanied by a visual can reinforce routines and give students the opportunity to practice the target language. Teaching students to ask and answer questions about classroom actions, such as finding the page number or asking for a writing utensil, supports newcomers in expressing their needs to others. When their questions are answered, their needs are met because they know they were heard.

Creating a caring climate means that care is used to prepare the physical space and contents within the classroom that make it easier for newcomers to transition into school. Preparing a welcome folder with access to important information such as Google Classroom codes or codes and passwords to websites that are used throughout the year says “We’ve been expecting you” to your new student and fosters a sense of belonging.

3. Use Effective Language Strategies for Newcomers

I frequently use the following strategies with my students to build their vocabulary, practice pronunciation, and strengthen the use of their receptive and expressive language skills.

Choral repetition: This is a technique whereby the teacher models reading fluency, pace, and pronunciation. Choral repetition creates a low affective filter or safe low-risk environment where newcomers can participate in whole-class reading.

Wait time: Newcomers need time to translate the question they’re given, think about a response, translate the answer into English, and then verbalize the response. I like to silently count to 15 seconds when I ask one-word-or-longer phrased questions based on vocabulary or to check my students’ understanding. If I notice that more time or clarification is needed, then I begin to repeat or sometimes rephrase the question to assist my students. Usually activities with text-based or inferential questions require the use of a timer for seven minutes or more, depending on the length or complexity of the activity.

Visual representations: These provide another way for students to express their thoughts and ideas. Accompanying these drawings with short sentences validates students’ perspectives and encourages them to share their stories. It’s vitally important that students feel seen and heard. Illustrations and artistic projects allow newcomers experiencing a silent period to express themselves without the fear of presenting to the whole class. Language researcher Stephen Krashen described the silent period as the first stage of language acquisition, where the language learner is silent for several weeks or more as they begin to adjust and acquire the new language.

4. Use Tech and Peer Support to Help Students Take Risks

This is where technology comes in handy. Newcomers are often uncomfortable with speaking or reading aloud. Our class has enjoyed the use of Flip, Classkick, and Book Creator to record their presentations or answer questions. While technological apps have made it easier for students to create and record their presentations, newcomers may need additional time to work on projects and presentations.

Some students may need an extra class day or an extra week. I use discretion based on my students’ needs as they strive to complete the assignment and on the demands of the activity itself. The truth is that a tech-friendly environment can help learners take language risks in a way that makes them feel comfortable.

To facilitate collaboration and communication, clock buddies or map buddies is a buddy system where students are paired up to support newcomers by translating for teachers or helping them to acclimate to their new school environment. Whether it’s a buddy or small group collaboration, these interpersonal activities equip newcomers to decipher language and content. Online translators, dictionaries, and visual dictionaries allow students to access the definition, image, and pronunciation of words in the target language. Students can also create their own glossaries using Quizlet.

Teachers can practice these simple strategies with the intention of constructing a safe and inclusive environment so that newcomer learners know that they are cared for and valued. The classroom becomes a space where all learners feel self-acceptance and a sense of belonging. It’s a place of purpose where all learners can fearlessly be who they are and confidently take on challenges together.

Data Analysis Using Python Pandas

In this tutorial, we are going to see the data analysis using Python pandas library. The library pandas are written in C. So, we don’t get any problem with speed. It is famous for data analysis. We have two types of data storage structures in pandas. They are Series and DataFrame. Let’s see one by one.


Series is a 1D array with customized index and values. We can create a Series object using the pandas.Series(data, index) class. Series will take integers, lists, dictionaries as data. Let’s see some examples.

Example # importing the pandas library import pandas as pd # data data = [1, 2, 3] # creating Series object # Series automatically takes the default index series = pd.Series(data) print(series) Output

If you run the above program, you will get the following result.

0 1 1 2 2 3 dtype: int64

How to have a customized index? See the example.

Example # importing the pandas library import pandas as pd # data data = [1, 2, 3] # index index = ['a', 'b', 'c'] # creating Series object series = pd.Series(data, index) print(series) Output

If you run the above program, you will get the following result.

a 1 b 2 c 3 dtype: int64

When we give the data as a dictionary to the Series class, then it takes keys as index and values as actual data. Let’s see one example.

Example # importing the pandas library import pandas as pd # data data = {'a':97, 'b':98, 'c':99} # creating Series object series = pd.Series(data) print(series) Output

If you run the above program, you will get the following results.

a 97 b 98 c 99 dtype: int64

We can access the data from the Series using an index. Let’s see the examples.

Example # importing the pandas library import pandas as pd # data data = {'a':97, 'b':98, 'c':99} # creating Series object series = pd.Series(data) # accessing the data from the Series using indexes print(series['a'], series['b'], series['c']) Output

If you run the above code, you will get the following results.

97 98 99 2.Pandas

We have how to use Series class in pandas. Let’s see how to use the DataFrame class. DataFrame data structure class in pandas that contains rows and columns.

We can create DataFrame objects using lists, dictionaries, Series, etc.., Let’s create the DataFrame using lists.

Example # importing the pandas library import pandas as pd # lists names = ['Tutorialspoint', 'Mohit', 'Sharma'] ages = [25, 32, 21] # creating a DataFrame data_frame = pd.DataFrame({'Name': names, 'Age': ages}) # printing the DataFrame print(data_frame) Output

If you run the above program, you will get the following results.

               Name    Age 0    Tutorialspoint    25 1             Mohit    32 2            Sharma    21

Let’s see how to create a data frame object using the Series.

Example # importing the pandas library import pandas as pd # Series _1 = pd.Series([1, 2, 3]) _2 = pd.Series([1, 4, 9]) _3 = pd.Series([1, 8, 27]) # creating a DataFrame data_frame = pd.DataFrame({"a":_1, "b":_2, "c":_3}) # printing the DataFrame print(data_frame) Output

If you run the above code, you will get the following results.

   a  b  c 0  1  1  1 1  2  4  8 2  3  9  27

We can access the data from the DataFrames using the column name. Let’s see one example.

Example # importing the pandas library import pandas as pd # Series _1 = pd.Series([1, 2, 3]) _2 = pd.Series([1, 4, 9]) _3 = pd.Series([1, 8, 27]) # creating a DataFrame data_frame = pd.DataFrame({"a":_1, "b":_2, "c":_3}) # accessing the entire column with name 'a' print(data_frame['a']) Output

If you run the above code, you will get the following results.

0 1 1 2 2 3

10+ Simple Yet Powerful Excel Tricks For Data Analysis


Microsoft Excel is one of the most widely used tools for data analysis

Learn the essential Excel functions used to analyze data for business analytics

Data Analysis with Excel serves as a precursor to Data Science with R or Python

*This article was originally published in 2023 and updated in April 2023.


I’ve always admired the immense power of Excel. This software is not only capable of doing basic data computations, but you can also perform data analysis using it. It is widely used for many purposes including the likes of financial modeling and business planning. It can become a good stepping stone for people who are new to the world of business analytics.

In fact, we have designed an entire comprehensive program on Business Analytics for you, with Excel as a key component! Make sure you check it out and give yourself the gift of a business analytics career.

I feel fortunate that my journey started with Excel. Over the years, I’ve learned many tricks to deal with data faster than ever. Excel has numerous functions, and it can be confusing at times to choose the best one.

In this article, I’ll share some tips and tricks for working in Excel that will save you time. It is best suited to people keen to upgrade their data analysis skills.

Commonly used functions

1. VLOOKUP() – This function searches for a value in the first column of a table and returns a value from the same row in a column you specify.

Syntax: =VLOOKUP(Key to lookup, Source_table, column of source table, are you ok with an approximate match?)

For the above problem, we can write the formula in cell “F4” as =VLOOKUP(B4, $H$4:$L$15, 5, 0). This will return the city name for Customer id 1; copy the formula down for all the remaining Customer ids.

Tip: Do not forget to lock the range of the second table using the “$” sign – forgetting to do so is a common error when copying this formula down. This is known as absolute referencing.

2. CONCATENATE() – This function combines the text from two or more cells into one.

Syntax: =Concatenate(Text1, Text2, ..., Textn)

The above problem can be solved using the formula =CONCATENATE(B3, C3), which can then be copied down.

Tip: I prefer using the “&” symbol because it is shorter than typing a full “concatenate” formula and does exactly the same thing. The formula can be written as “=B3&C3”.

3. LEN() – This function tells you the length of a cell, i.e., the number of characters, including spaces and special characters.

Syntax: =Len(Text)

Example: =Len(B3) = 23

4. LOWER(), UPPER() and PROPER() – These three functions change text to lower case, upper case, and proper case (first letter of each word capitalized), respectively.

Syntax: =Upper(Text)/ Lower(Text) / Proper(Text)

5. TRIM() – This function removes extra spaces from text, leaving only single spaces between words and none at the beginning or end.

Syntax: =Trim(Text)

6. IF() – I find it one of the most useful functions in Excel. It lets you write conditional formulas that calculate one way when a condition is true and another way when it is false. For example, suppose you want to mark each sale as “High” or “Low”: if a sale is greater than or equal to $5000, then “High”, else “Low”.

Syntax: =IF(condition, True Statement, False Statement)

Generating inference from Data

1. Pivot Table: Whenever you are working with company data, you seek answers for questions like “How much revenue is contributed by branches of North region?” or “What was the average number of customers for product A?” and many others.

Above, you can see that table on the left has sales detail against each customer with the region and product mapping. In the table to the right, we have summarized the information at region level which now helps us to generate an inference that the South region has the highest sales.

Above, you can see that we have placed “Region” in rows, “Product id” in columns, and the sum of “Premium” as the value. The pivot table is now ready, showing the region- and product-wise sum of premium. You can also use count, average, min, max, and other summary metrics.
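For readers coming from the pandas examples earlier, the same kind of region- and product-wise summary can be sketched with `pandas.pivot_table`. The sales figures below are hypothetical, chosen so that the South region comes out on top, as in the description above:

```python
import pandas as pd

# hypothetical sales data mirroring the layout described above
sales = pd.DataFrame({
    "Region":  ["North", "North", "South", "South", "East"],
    "Product": ["A", "B", "A", "B", "A"],
    "Premium": [100, 150, 300, 250, 120],
})

# region- and product-wise sum of premium, like the Excel pivot table
pivot = pd.pivot_table(sales, index="Region", columns="Product",
                       values="Premium", aggfunc="sum")
print(pivot)

# other summary metrics work the same way, e.g. aggfunc="mean" or "count"
region_totals = sales.groupby("Region")["Premium"].sum()
print(region_totals)
```

Swapping `aggfunc` is the pandas analogue of changing the summary metric in the pivot table's value field settings.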

2. Creating Charts: Building a chart/graph in Excel requires nothing more than selecting the range of data you wish to chart and pressing F11. This creates an Excel chart in the default chart style on a new sheet, but you can change it by selecting a different chart style. If you prefer the chart on the same worksheet as the data, press ALT + F1 instead of F11.

Of course, in either case, once you have created the chart, you can customize it to your particular needs to communicate your desired message.

Data Cleaning

Above, you can see that the values are separated by a semicolon (“;”). To split these values into different columns, I recommend using the “Text to Columns” feature in Excel. Follow the steps below:

Select the range A1:A6

Above, we have two options, “Delimited” and “Fixed width”. I selected Delimited because the values are separated by a delimiter (;). If we wanted to split the data based on width instead – say, the first four characters into the first column and the 5th to 10th characters into the second – we would choose Fixed width.
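For comparison, both the delimited and the fixed-width splits can be sketched in pandas; the semicolon-separated strings below are hypothetical sample data:

```python
import pandas as pd

# hypothetical column of semicolon-separated values, as in the example above
df = pd.DataFrame({"raw": ["John;Smith;NY", "Asha;Rao;DEL", "Li;Wei;PEK"]})

# split on the delimiter into separate columns (like "Delimited" mode)
parts = df["raw"].str.split(";", expand=True)
parts.columns = ["First", "Last", "City"]
print(parts)

# fixed-width splitting (like "Fixed width" mode) can be done with string slicing
first_four = df["raw"].str[:4]
print(first_four)
```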

Essential keyboard shortcuts

Keyboard shortcuts are the best way to navigate cells or enter formulas more quickly. We’ve listed our favorites below.

Ctrl + Shift + Down/Up Arrow: Selects all the cells below or above the current cell

Ctrl+Home: Navigates to cell A1

Ctrl+End: Navigates to the last cell that contains data

Alt+F1: Creates a chart based on the selected data set

Ctrl+Shift+L: Activates the auto filter on a data table

Alt+Down Arrow: Opens the drop-down menu of the auto filter

Alt+D+S: Sorts the data set

Ctrl+O: Opens an existing workbook

Ctrl+N: Creates a new workbook

F4: With a reference selected in a formula, pressing F4 cycles it through absolute, mixed, and relative referencing

End Notes

Looking to get started in the data field? We have curated the perfect multi-course program for you! Check out the Certified Business Analytics Program and launch your career!


The Top 10 Most Effective Business Analysis Techniques



Business analysis is a structured way of introducing and managing organizational change while providing value to all business stakeholders. It includes identifying new opportunities, optimizing costs, understanding required capabilities, and finding solutions to help businesses achieve their goals. If you are an aspiring business analyst seeking information on commonly used techniques or an existing professional in this field looking to upskill yourself, here are the top 10 business analysis techniques you must know to be effective in your role.

Top 10 Business Analysis Techniques SWOT Analysis

SWOT (Strengths, Weaknesses, Opportunities, Threats) analysis is a four-quadrant analysis where the business analyst groups information and data according to the strengths, weaknesses, opportunities, and threats related to a company. It gives a clear picture of a company’s standing, covering both internal and external factors, and allows for more informed decisions.

Pros and Cons

SWOT analysis helps improve planning because it gives a clear picture of various attributes of the business. However, it may sometimes be an oversimplified analysis. For example, the high price of a product may not necessarily be a threat to the company. It can, in fact, be a strength as it may create the perception of luxury in customers’ minds.

MOST Analysis

MOST, a short form for Mission, Objectives, Strategies, and Tactics, helps keep these four factors relevant and aligned with the business. It is a powerful tool used to assess the organization’s strategic plan and gives a clear vision to each organization member regarding the direction of their work. This ultimately helps bring all functions and levels in alignment with each other. 

Pros and Cons

It is a clean and simple way of communicating business strategy to all the stakeholders and helps them align in a common direction. However, this technique is not self-sufficient and needs the support of other tools and analyses to define business strategy fully. 

Business Process Modeling (BPM)

Business process modeling is a data-driven and illustrative representation of an organization’s business processes. It provides insights into the various functions of a business process, including events and activities, owners, decision points, and timelines. It uses an ‘as-is’ approach instead of a ‘to-be’ approach, thus allowing better visibility into existing processes and helping improve them.

Pros and Cons

Business process modeling helps align business operations with strategy, improves communication among various stakeholders, and helps achieve operational efficiencies. The only downside to this technique is the risk of overanalyzing the process, especially if the business problem is not very complex. Also, if not implemented properly, this technique may underutilize the resources spent on creating the model.

Use Case Modeling

Use case modeling depicts how users interact with a system to solve problems. It defines the user’s objective, various interactions between the system and the user, and the system’s behavior to fulfill the user’s objectives. It can be done using various tools such as Microsoft’s Visio, Lucidchart, and IBM Rational Rose.

Pros and Cons

As a user-centered technique, it helps develop a system from the user’s point of view. It also helps visualize complex projects more simply. A major drawback of use case modeling is that it is not object-oriented (i.e., not made up of data fields with unique attributes and behavior). It may also lead to miscommunication due to the use of non-technical language.


Brainstorming

Brainstorming is a creative group activity undertaken to develop an exhaustive list of ideas and identify multiple possible solutions for a problem at hand. It requires freewheeling thinking and discussion to ensure no possibility is left unexplored.

Pros and Cons

Non-Functional Requirement Analysis

Non-functional requirements are necessary for a system to perform well but do not directly contribute to the primary functions. For example, a functional requirement of a word editor can be the ability to write text. In contrast, a non-functional requirement can be software that automatically saves text if a user forgets to save it manually. Thus the Non-Functional Requirement technique analyzes various non-functional requirements such as security, reliability, performance, maintainability, scalability, and usability. It helps understand various operational capabilities and constraints that need to be considered in the design of systems.

Pros and Cons

Non-functional requirement analysis helps ensure that legal and other compliance requirements are met. It also creates ease of operations and a good experience for the user. One major con is that non-functional capabilities are difficult to alter once the design phase is complete.

PESTLE Analysis

This strategic tool analyzes the external environmental factors that may impact a business and its future performance. PESTLE analysis includes the following factors:

P – Political

E – Economic

S – Social

T – Technological

L – Legal

E – Environmental

Pros and Cons

Requirement Analysis

Requirement analysis is undertaken to capture the user expectations for a new product. It starts with identifying the relevant stakeholders, conducting interviews to capture requirements, categorizing requirements, interpreting and documenting requirements, and finally, signing off on the requirements that need to be worked upon.

Pros and Cons

User Stories

User stories convey what users want to achieve in a simple, non-technical way. They help provide context to the development team and help them understand why they are building what they’re building and how it impacts the end user. These stories are a core component of Agile programs.

Pros and Cons

The pros of user stories as a business analysis technique include user-centric outcomes and improved collaboration between the product team and the users. Major cons include the difficulty of fulfilling compliance needs and of ensuring that the stories are interpreted correctly.


CATWOE Analysis

CATWOE is a generic business analysis technique that helps define and analyze the perspectives of various business stakeholders. It is an acronym that stands for:

C – Customers

A – Actors

T – Transformation Process

W – World View

O – Owner

E – Environmental Constraints

Pros and Cons

It is one of those business analysis techniques that consider the perspectives of different stakeholders and give due weight to their various requirements. The only possible con is that it may result in confusion arising from conflicting views.

ALSO READ: What is Business Analytics? Why Should You Know More About it?

Fast-Track Your Business Analysis Career with Emeritus

According to a recent survey, 88% of executives reported that their companies had increased investments in data, analytics, and AI during 2023. The future is bright for professionals pursuing a career in these areas. If you are planning to pursue or enhance your skills in business analysis, we hope this article has given you a fair understanding of each business analysis technique. The specific technique you choose might differ based on the industry and goals of your organization.

ALSO READ: What Does a Business Analyst Do? Key Responsibilities, Skills Needed, Tools Used

If you want to learn more about business analytics concepts and applications, explore these online business analytics courses offered by world-class universities via Emeritus, and build your skills for a successful career in analytics.

Write to us at [email protected]

Data Observability: A New Frontier For Data Reliability In 2023

Figure 1. Interest in data observability.

In today’s data-driven world, data observability has emerged as a critical strategy for ensuring data reliability. With data becoming more important for decision-making and analysis, data quality and accuracy are more important than ever. The practice of measuring and monitoring data systems to ensure their dependability, integrity, and accuracy is known as data observability. 


This article covers what data observability is, its top use cases, and best practices for implementing it.

What is data observability?

Figure 2. Data observability components.

Data observability is the practice of continuously measuring, monitoring, and analyzing data systems to ensure their dependability, integrity, and accuracy.
Data observability encompasses the ability to: 

Trace data flow 

Understand data lineage 

Track data quality metrics throughout the data pipeline. 

Data observability can be considered an essential aspect of modern data management because it can:

Provide a comprehensive view of the data ecosystem 

Enable organizations to proactively address data quality issues.

5 Pillars of Data Observability

Figure 3. 5 Pillars of data observability.

Data observability can be crucial for maintaining reliable and high-quality data systems; the five key pillars that can contribute to achieving this goal are:

Distribution: analyzing data patterns and distributions to identify anomalies and maintain data quality.

Freshness: ensuring that data is regularly updated and accessible, to provide timely insights and maintain relevance.

Schema: monitoring schema consistency and structure to prevent data integrity issues and maintain a robust data model.

Lineage: tracking data provenance and lineage to enhance traceability and facilitate debugging of data-related issues.

Volume: managing data scale and storage to optimize performance and resource usage while handling large amounts of information.

Note that the order of the pillars above provides only a rough sequence, from data collection (distribution) to system performance (volume). The relative importance of the pillars varies depending on the requirements of a particular project or data system.
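As an illustration only, a few of these pillar checks can be sketched in pandas; the `events` table, the thresholds, and the reference timestamp below are all hypothetical:

```python
import pandas as pd

# hypothetical events table with a timestamp column
events = pd.DataFrame({
    "user_id": [1, 2, 3, 4, 5, 6],
    "amount":  [10.0, 12.5, 11.0, 10.5, 12.0, 250.0],
    "ts": pd.to_datetime(["2023-06-01 10:00", "2023-06-01 10:05",
                          "2023-06-01 10:10", "2023-06-01 10:15",
                          "2023-06-01 10:20", "2023-06-01 10:25"]),
})

# Freshness: how old is the newest record relative to a reference time?
lag = pd.Timestamp("2023-06-01 11:00") - events["ts"].max()

# Schema: do the columns match what downstream consumers expect?
expected = {"user_id", "amount", "ts"}
schema_ok = expected == set(events.columns)

# Volume: is the row count within the usual (hypothetical) range?
volume_ok = 1 <= len(events) <= 10_000

# Distribution: flag values far from the typical range (a crude z-score)
z = (events["amount"] - events["amount"].mean()) / events["amount"].std()
outliers = events[z.abs() > 2]

print(lag, schema_ok, volume_ok, len(outliers))
```

A real observability platform would run checks like these continuously against live pipelines rather than a static frame.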

Top 5 data observability use cases

Figure 5. Data Ops diagram.

Choosing a data observability tool or data platform depends on your business requirements, so below we share five common data observability use cases.

1. Anomaly detection

Using data observability, organizations can detect anomalies and errors in their data systems. By continuously monitoring data quality metrics and implementing automated processes, organizations can identify potential issues before they escalate, lowering the risk of inaccurate analyses and decision-making.

2. Data pipeline optimization

Data operations (DataOps) methodologies, in conjunction with data observability, can help organizations identify areas for improvement in data pipelines. As a result, data processing can be optimized, operational efficiency can be increased, and data warehouse management becomes more agile.

3. Data governance 

Data observability, when combined with DataOps practices, can help data governance initiatives by providing transparency into data lineage and metadata. This can: 

Enable better control over data assets

Ensure that data policies and standards are consistently followed through automation and collaboration

4. Regulatory compliance

Together, data observability and data operations help organizations meet regulatory requirements by ensuring the quality and traceability of their data.
By automating compliance checks and validation processes, the data observability platform can lower the risk of noncompliance penalties.

5. Root cause analysis

Data observability, when combined with DataOps, can enable organizations to conduct root cause analyses and data pipeline monitoring by tracing data issues back to their source. Root cause analysis allows for a better understanding of the underlying factors contributing to data quality issues. This can enable organizations to take corrective actions to prevent future occurrences through continuous improvement and iterative development.

6 data observability best practices

If you are considering data observability for your data engineering stack, the following six best practices can help you implement it successfully:

1. Define data quality metrics

Create clear metrics to assess data quality, such as completeness, accuracy, consistency, and timeliness. These data health metrics should be in line with business goals and data needs.

2. Implement data lineage tracking

Map the data flow from source to destination. This can allow organizations to trace data issues back to their source and understand the impact of changes on downstream systems.

3. Develop a data catalog

Create a centralized repository that documents all data assets, including information such as their sources, owners, and definitions. This can help data engineers collaborate better and improve understanding of data across the organization.

4. Monitor data in real-time

Set up real-time data monitoring systems to continuously track data quality metrics and detect anomalies. This allows organizations to identify and address issues, such as data corruption or loss, as they arise.

5. Establish data quality thresholds

Set thresholds for data quality metrics to trigger alerts when data quality falls below acceptable levels. This helps businesses take immediate action to resolve data downtime issues and maintain data reliability.
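As a rough illustration of practices 1 and 5 together, the sketch below computes a completeness metric and raises alerts against a threshold; the `records` data and the 90% threshold are hypothetical:

```python
import pandas as pd

# hypothetical customer records with some missing values
records = pd.DataFrame({
    "customer_id": [1, 2, 3, 4, 5],
    "email": ["a@x.com", None, "c@x.com", None, "e@x.com"],
})

# completeness metric: share of non-null values per column
completeness = records.notna().mean()

# a threshold chosen for illustration; real values depend on business needs
THRESHOLD = 0.9

# columns whose completeness falls below the threshold trigger an alert
alerts = completeness[completeness < THRESHOLD]
for column, score in alerts.items():
    print(f"ALERT: completeness of '{column}' is {score:.0%} (< {THRESHOLD:.0%})")
```

In practice, an alert like this would feed a notification channel or dashboard rather than a print statement.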

6. Foster a data quality culture

Encourage a culture of data quality awareness and responsibility by providing employees with training and other relevant resources, such as data quality guidelines.

This can empower individuals to take ownership of data quality and contribute to the organization’s data observability efforts.

If you have further questions on data observability, please contact us at:

Cem regularly speaks at international technology conferences. He graduated from Bogazici University as a computer engineer and holds an MBA from Columbia Business School.




