# 10+ Simple Yet Powerful Excel Tricks For Data Analysis


Overview

Microsoft Excel is one of the most widely used tools for data analysis

Learn the essential Excel functions used to analyze data for business analytics

Data Analysis with Excel serves as a precursor to Data Science with R or Python

*This article was originally published in 2023 and updated in April 2023.

Introduction

I’ve always admired the immense power of Excel. This software is not only capable of doing basic data computations, but you can also perform data analysis using it. It is widely used for many purposes including the likes of financial modeling and business planning. It can become a good stepping stone for people who are new to the world of business analytics.

In fact, we have designed an entire comprehensive program on Business Analytics for you, with Excel as a key component! Make sure you check it out and give yourself the gift of a business analytics career.

I feel fortunate that my journey started with Excel. Over the years, I’ve learned many tricks to deal with data faster than ever. Excel has numerous functions, and it can become confusing at times to choose the best one.

Commonly used functions

1. VLOOKUP() – This function looks up a key in the first column of a source table and returns a value from another column of the same row.

Syntax: =VLOOKUP(lookup_value, table_array, col_index_num, [range_lookup]) – set the last argument to 0 (FALSE) for an exact match.

For the above problem, we can write the formula in cell “F4” as =VLOOKUP(B4, \$H\$4:\$L\$15, 5, 0). This returns the city name for Customer id 1; then copy the formula down for the remaining Customer ids.

Tip: Do not forget to lock the range of the second table using a “\$” sign – a common error when copying this formula down. This is known as absolute referencing.
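For readers who later move this workflow to Python, an exact-match VLOOKUP corresponds to a left merge in pandas. This is a minimal sketch; the table contents and column names are hypothetical, not taken from the worksheet above:

```python
import pandas as pd

# the left table with customer ids (analogous to column B above)
orders = pd.DataFrame({"customer_id": [1, 2, 3]})

# the source table being looked up (analogous to $H$4:$L$15)
lookup = pd.DataFrame({"customer_id": [1, 2, 3],
                       "city": ["Delhi", "Mumbai", "Pune"]})

# an exact-match VLOOKUP is a left merge on the key column
result = orders.merge(lookup, on="customer_id", how="left")
print(result["city"].tolist())  # ['Delhi', 'Mumbai', 'Pune']
```

As in Excel, rows with no match in the source table simply get a missing value (NaN) instead of an error.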

2. CONCATENATE() – This function joins two or more text strings into one.

Syntax: =CONCATENATE(Text1, Text2, ..., TextN)

The above problem can be solved using the formula =CONCATENATE(B3, C3), which can then be copied down.

Tip: I prefer using the “&” symbol because it is shorter than typing the full “CONCATENATE” formula and does exactly the same thing. The formula can be written as “=B3&C3”.

3. LEN() – This function tells you the length of a cell, i.e., the number of characters, including spaces and special characters.

Syntax: =Len(Text)

Example: =Len(B3) = 23

4. LOWER(), UPPER() and PROPER() – These three functions change text to lower case, upper case, and proper case respectively (first letter of each word capitalized).

Syntax: =Upper(Text)/ Lower(Text) / Proper(Text)

5. TRIM() – This function removes leading, trailing, and extra spaces from text.

Syntax: =Trim(Text)
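For reference, the pandas string accessor offers the same case and trim transformations; this is a small sketch with a made-up value:

```python
import pandas as pd

s = pd.Series(["  hello world  "])

print(s.str.upper().iloc[0])  # '  HELLO WORLD  '  (UPPER analogue)
print(s.str.lower().iloc[0])  # '  hello world  '  (LOWER analogue)
print(s.str.title().iloc[0])  # '  Hello World  '  (PROPER analogue)
print(s.str.strip().iloc[0])  # 'hello world'      (TRIM analogue)
```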

6. IF(): I find it one of the most useful functions in Excel. It lets you write conditional formulas that calculate one way when a condition is true and another way when it is false. For example, suppose you want to mark each sale as “High” or “Low”: if sales are greater than or equal to \$5000, then “High”, else “Low”.

Syntax: =IF(condition, True Statement, False Statement)

Generating inference from Data
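The same High/Low rule can be sketched in Python with numpy.where, which plays the role of a vectorized IF; the sales figures here are hypothetical:

```python
import numpy as np
import pandas as pd

sales = pd.Series([7000, 3000, 5000])

# np.where(condition, value_if_true, value_if_false) mirrors Excel's IF
label = np.where(sales >= 5000, "High", "Low")
print(list(label))  # ['High', 'Low', 'High']
```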

1. Pivot Table: Whenever you are working with company data, you seek answers to questions like “How much revenue is contributed by branches in the North region?” or “What was the average number of customers for product A?” and many others.

Above, you can see that the table on the left has sales details against each customer, with region and product mapping. In the table on the right, we have summarized the information at the region level, which lets us infer that the South region has the highest sales.

Above, you can see that we have placed “Region” in rows, “Product id” in columns, and the sum of “Premium” as values. The pivot table now shows the region- and product-wise sum of premium. You can also use count, average, min, max, and other summary metrics.
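The same region-by-product summary can be sketched in pandas with pivot_table; the data and column names here are hypothetical stand-ins for the worksheet above:

```python
import pandas as pd

df = pd.DataFrame({
    "Region":  ["North", "North", "South", "South"],
    "Product": ["A", "B", "A", "B"],
    "Premium": [100, 200, 300, 400],
})

# rows = Region, columns = Product, values = sum of Premium
pivot = pd.pivot_table(df, index="Region", columns="Product",
                       values="Premium", aggfunc="sum")
print(pivot)
```

Swapping `aggfunc="sum"` for `"count"`, `"mean"`, `"min"`, or `"max"` gives the other summary metrics mentioned above.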

2. Creating Charts: Building a chart/graph in Excel requires nothing more than selecting the range of data you wish to chart and pressing F11. This creates an Excel chart in the default chart style, but you can change it by selecting a different chart style. If you prefer the chart on the same worksheet as the data, press ALT + F1 instead of F11.

Of course, in either case, once you have created the chart, you can customize it to your particular needs to communicate your desired message.

Data Cleaning

Above, you can see that values are separated by a semicolon (“;”). To split these values into separate columns, I recommend using the “Text to Columns” feature in Excel. Follow the steps below to convert the data into different columns:

Select the range A1:A6

Above, we have two options: “Delimited” and “Fixed width”. I selected Delimited because the values are separated by a delimiter (;). If we instead wanted to split the data by width – say, the first four characters to the first column and the 5th to 10th characters to the second – we would choose Fixed width.
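The delimited split has a direct pandas analogue in str.split; this sketch uses made-up semicolon-separated values:

```python
import pandas as pd

s = pd.Series(["a;b;c", "d;e;f"])

# expand=True spreads the delimited fields into separate columns,
# just like Excel's "Text to Columns" with a ";" delimiter
parts = s.str.split(";", expand=True)
print(parts)
```

A fixed-width split would instead use string slicing, e.g. `s.str[:4]` for the first column.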

Essential keyboard shortcuts

Keyboard shortcuts are the best way to navigate cells or enter formulas more quickly. We’ve listed our favorites below.

Ctrl + Shift + Down/Up Arrow: Selects all cells from the current cell to the last filled cell below or above

Ctrl + Home: Navigates to cell A1

Ctrl + End: Navigates to the last cell that contains data

Alt + F1: Creates a chart from the selected data set

Ctrl + Shift + L: Toggles the auto filter on a data table

Alt + Down Arrow: Opens the drop-down menu of the auto filter

Alt + D + S: Sorts the data set

Ctrl + O: Opens an existing workbook

Ctrl + N: Creates a new workbook

F4: With a reference selected in a formula, cycles it through absolute, mixed, and relative referencing

End Notes

Looking to get started in the data field? We have curated the perfect multi-course program for you! Check out the Certified Business Analytics Program and launch your career!

Related


## Data Analysis Using Python Pandas

In this tutorial, we are going to see data analysis using the Python pandas library. pandas is implemented largely in C/Cython, so speed is not a problem. It is well known for data analysis. pandas provides two data storage structures: Series and DataFrame. Let’s look at them one by one.

1. Series

Series is a 1D array with customized index and values. We can create a Series object using the pandas.Series(data, index) class. Series will take integers, lists, dictionaries as data. Let’s see some examples.

Example

```python
# importing the pandas library
import pandas as pd

# data
data = [1, 2, 3]

# creating a Series object
# Series automatically takes the default index
series = pd.Series(data)
print(series)
```

Output

If you run the above program, you will get the following result.

```
0    1
1    2
2    3
dtype: int64
```

How to have a customized index? See the example.

Example

```python
# importing the pandas library
import pandas as pd

# data
data = [1, 2, 3]

# index
index = ['a', 'b', 'c']

# creating a Series object with a custom index
series = pd.Series(data, index)
print(series)
```

Output

If you run the above program, you will get the following result.

```
a    1
b    2
c    3
dtype: int64
```

When we give the data as a dictionary to the Series class, then it takes keys as index and values as actual data. Let’s see one example.

Example

```python
# importing the pandas library
import pandas as pd

# data
data = {'a': 97, 'b': 98, 'c': 99}

# creating a Series object
series = pd.Series(data)
print(series)
```

Output

If you run the above program, you will get the following results.

```
a    97
b    98
c    99
dtype: int64
```

We can access the data from the Series using an index. Let’s see the examples.

Example

```python
# importing the pandas library
import pandas as pd

# data
data = {'a': 97, 'b': 98, 'c': 99}

# creating a Series object
series = pd.Series(data)

# accessing the data from the Series using indexes
print(series['a'], series['b'], series['c'])
```

Output

If you run the above code, you will get the following results.

```
97 98 99
```

2. DataFrame

We have seen how to use the Series class in pandas. Let’s see how to use the DataFrame class. DataFrame is a two-dimensional data structure in pandas that contains rows and columns.

We can create DataFrame objects using lists, dictionaries, Series, etc. Let’s create a DataFrame using lists.

Example

```python
# importing the pandas library
import pandas as pd

# lists
names = ['Tutorialspoint', 'Mohit', 'Sharma']
ages = [25, 32, 21]

# creating a DataFrame
data_frame = pd.DataFrame({'Name': names, 'Age': ages})

# printing the DataFrame
print(data_frame)
```

Output

If you run the above program, you will get the following results.

```
             Name  Age
0  Tutorialspoint   25
1           Mohit   32
2          Sharma   21
```

Let’s see how to create a data frame object using the Series.

Example

```python
# importing the pandas library
import pandas as pd

# Series
_1 = pd.Series([1, 2, 3])
_2 = pd.Series([1, 4, 9])
_3 = pd.Series([1, 8, 27])

# creating a DataFrame
data_frame = pd.DataFrame({"a": _1, "b": _2, "c": _3})

# printing the DataFrame
print(data_frame)
```

Output

If you run the above code, you will get the following results.

```
   a  b   c
0  1  1   1
1  2  4   8
2  3  9  27
```

We can access the data from the DataFrames using the column name. Let’s see one example.

Example

```python
# importing the pandas library
import pandas as pd

# Series
_1 = pd.Series([1, 2, 3])
_2 = pd.Series([1, 4, 9])
_3 = pd.Series([1, 8, 27])

# creating a DataFrame
data_frame = pd.DataFrame({"a": _1, "b": _2, "c": _3})

# accessing the entire column with name 'a'
print(data_frame['a'])
```

Output

If you run the above code, you will get the following results.

```
0    1
1    2
2    3
Name: a, dtype: int64
```

## Effective Strategies For Handling Missing Values In Data Analysis (Updated 2023)

Introduction

If you are aiming for a job as a data scientist, you must know how to handle the problem of missing values, which is quite common in many real-life datasets. Incomplete data can bias the results of machine learning models and/or reduce the accuracy of the model. This article describes missing data, how it is represented, and the different reasons data values get missed. Along with the different categories of missing data, it also details different ways of handling missing values, with dataset examples.

Learning Objectives

In this tutorial, we will learn about missing values and the benefits of missing data analysis in data science.

You will learn about the different types of missing data and how to handle them correctly.

You will also learn about the most widely used imputation methods to handle incomplete data.

What Is a Missing Value?

Missing data is defined as the values or data that is not stored (or not present) for some variable/s in the given dataset. Below is a sample of the missing data from the Titanic dataset. You can see the columns ‘Age’ and ‘Cabin’ have some missing values.


How Is a Missing Value Represented in a Dataset?

In the dataset, the blank shows the missing values.

In Pandas, usually, missing values are represented by NaN. It stands for Not a Number.


The above image shows the first few records of the Titanic dataset extracted and displayed using Pandas.

Why Is Data Missing From the Dataset?

There can be multiple reasons why certain values are missing from the data. The reason the data is missing affects the approach to handling it, so it’s necessary to understand why the data could be missing.

Some of the reasons are listed below:

Past data might get corrupted due to improper maintenance.

Observations are not recorded for certain fields due to some reasons. There might be a failure in recording the values due to human error.

The user has intentionally not provided the values

Item nonresponse: This means the participant refused to respond.

Types of Missing Values

Formally the missing values are categorized as follows:


Missing Completely At Random (MCAR)

In MCAR, the probability of data being missing is the same for all the observations. In this case, there is no relationship between the missing data and any other values observed or unobserved (the data which is not recorded) within the given dataset. That is, missing values are completely independent of other data. There is no pattern.

Missing At Random (MAR)

MAR data means that the reason for missing values can be explained by variables on which you have complete information, as there is some relationship between the missing data and other values/data. In this case, the data is not missing for all the observations. It is missing only within sub-samples of the data, and there is some pattern in the missing values.

For example, if you check the survey data, you may find that all the people have answered their ‘Gender,’ but ‘Age’ values are mostly missing for people who have answered their ‘Gender’ as ‘female.’ (The reason being most of the females don’t want to reveal their age.)

So, the probability of data being missing depends only on the observed value or data. In this case, the variables ‘Gender’ and ‘Age’ are related. The reason for missing values of the ‘Age’ variable can be explained by the ‘Gender’ variable, but you can not predict the missing value itself.

Suppose a poll is taken for overdue books in a library. Gender and the number of overdue books are asked in the poll. Assume that most of the females answer the poll and men are less likely to answer. So why the data is missing can be explained by another factor, that is gender. In this case, the statistical analysis might result in bias. Getting an unbiased estimate of the parameters can be done only by modeling the missing data.

Missing Not At Random (MNAR)

Missing values depend on the unobserved data. If there is some structure/pattern in missing data and other observed data can not explain it, then it is considered to be Missing Not At Random (MNAR).

If the missing data does not fall under the MCAR or MAR, it can be categorized as MNAR. It can happen due to the reluctance of people to provide the required information. A specific group of respondents may not answer some questions in a survey.

For example, suppose the name and the number of overdue books are asked in the poll for a library. So most of the people having no overdue books are likely to answer the poll. People having more overdue books are less likely to answer the poll. So, in this case, the missing value of the number of overdue books depends on the people who have more books overdue.

Another example is that people having less income may refuse to share some information in a survey or questionnaire.

In the case of MNAR as well, the statistical analysis might result in bias.
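A quick way to internalize the three mechanisms is to simulate them. This sketch (variable names and thresholds are hypothetical) masks an income column in three different ways:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({"age": rng.integers(18, 80, 1000),
                   "income": rng.normal(50_000, 10_000, 1000)})

# MCAR: every value has the same 10% chance of being missing,
# independent of everything in the data
mcar = df["income"].mask(rng.random(1000) < 0.10)

# MAR: missingness depends on another OBSERVED column (age),
# e.g. older respondents decline to answer half the time
mar = df["income"].mask((df["age"] > 60) & (rng.random(1000) < 0.50))

# MNAR: missingness depends on the unobserved value itself,
# e.g. high earners decline to report income
mnar = df["income"].mask(df["income"] > 60_000)

print(mcar.isna().sum(), mar.isna().sum(), mnar.isna().sum())
```

Note that from the masked column alone, MAR and MNAR data can look identical; only the MAR pattern can be explained by the other observed columns.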

Why Do We Need to Care About Handling Missing Data?

It is important to handle the missing values appropriately.

Many machine learning algorithms fail if the dataset contains missing values. However, algorithms like k-nearest neighbors and Naive Bayes can support data with missing values.

You may end up building a biased machine learning model, leading to incorrect results if the missing values are not handled properly.

Missing data can lead to a lack of precision in the statistical analysis.

Practice Problem

Let’s take an example of the Loan Prediction Practice Problem from Analytics Vidhya. You can download the dataset from the following link.

Checking for Missing Values in Python

The first step in handling missing values is to carefully look at the complete data and find all the missing values. The following code shows the total number of missing values in each column. It also shows the total number of missing values in the entire data set.
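The per-column check can be sketched as follows; the original showed this code only as an image, so `train_df` here is a tiny stand-in frame rather than the actual loan dataset:

```python
import numpy as np
import pandas as pd

# stand-in for train_df; in the article this is the loan prediction dataset
train_df = pd.DataFrame({"Gender": ["Male", None, "Female"],
                         "LoanAmount": [120.0, np.nan, 66.0]})

# number of missing values in each column
print(train_df.isnull().sum())

# total number of missing values in the entire dataset
print(train_df.isnull().sum().sum())  # 2
```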


From the above output, we can see that there are 7 columns – Gender, Married, Dependents, Self_Employed, LoanAmount, Loan_Amount_Term, and Credit_History – having missing values.

IN:

```python
# Find the total number of missing values in the entire dataset
train_df.isnull().sum().sum()
```

OUT:

```
149
```

There are 149 missing values in total.

List of Methods to handle missing values in a dataset

Here is a list of popular strategies to handle missing values in a dataset

Deleting the Missing Values

Imputing the Missing Values

Imputing the Missing Values for Categorical Features

Imputing the Missing Values using Sci-kit Learn Library

Using “Missingness” as a Feature

Handling Missing Values

Now that you have found the missing data, how do you handle the missing values?

Analyze each column with missing values carefully to understand the reasons behind the missing of those values, as this information is crucial to choose the strategy for handling the missing values.

There are 2 primary ways of handling missing values:

Deleting the Missing values

Imputing the Missing Values

Deleting the Missing value

Generally, this approach is not recommended. It is one of the quick and dirty techniques one can use to deal with missing values. If the missing value is of the type Missing Not At Random (MNAR), then it should not be deleted.

If the missing value is of type Missing At Random (MAR) or Missing Completely At Random (MCAR), then it can be deleted. (With pairwise deletion, each analysis uses all cases with available data, under the assumption that the missing observations are MCAR.)

There are 2 ways one can delete the missing data values:

Deleting the entire row (listwise deletion)

If a row has many missing values, you can drop the entire row. If every row has some (column) value missing, you might end up deleting the whole data. The code to drop the entire row is as follows:

IN:

```python
df = train_df.dropna(axis=0)
df.isnull().sum()
```

OUT:

```
Loan_ID              0
Gender               0
Married              0
Dependents           0
Education            0
Self_Employed        0
ApplicantIncome      0
CoapplicantIncome    0
LoanAmount           0
Loan_Amount_Term     0
Credit_History       0
Property_Area        0
Loan_Status          0
dtype: int64
```

Deleting the entire column

If a certain column has many missing values, then you can choose to drop the entire column. The code to drop the entire column is as follows:

IN:

```python
df = train_df.drop(['Dependents'], axis=1)
df.isnull().sum()
```

OUT:

```
Loan_ID              0
Gender              13
Married              3
Education            0
Self_Employed       32
ApplicantIncome      0
CoapplicantIncome    0
LoanAmount          22
Loan_Amount_Term    14
Credit_History      50
Property_Area        0
Loan_Status          0
dtype: int64
```

Imputing the Missing Value

There are many imputation methods for replacing the missing values. You can use different python libraries such as Pandas, and Sci-kit Learn to do this. Let’s go through some of the ways of replacing the missing values.

Replacing with an arbitrary value

If you can make an educated guess about the missing value, then you can replace it with some arbitrary value using the following code. E.g., in the following code, we are replacing the missing values of the ‘Dependents’ column with ‘0’.

IN:

```python
# Replace the missing values with '0' using the 'fillna' method
train_df['Dependents'] = train_df['Dependents'].fillna(0)
train_df['Dependents'].isnull().sum()
```

OUT:

```
0
```

Replacing with the mean

This is the most common method of imputing missing values of numeric columns. If there are outliers, then the mean will not be appropriate. In such cases, outliers need to be treated first. You can use the ‘fillna’ method for imputing the columns ‘LoanAmount’ and ‘Credit_History’ with the mean of the respective column values.

IN:

```python
# Replace the missing values of numerical columns with the mean
train_df['LoanAmount'] = train_df['LoanAmount'].fillna(train_df['LoanAmount'].mean())
train_df['Credit_History'] = train_df['Credit_History'].fillna(train_df['Credit_History'].mean())
train_df.isnull().sum()
```

OUT:

```
Loan_ID              0
Gender              13
Married              3
Dependents          15
Education            0
Self_Employed       32
ApplicantIncome      0
CoapplicantIncome    0
LoanAmount           0
Loan_Amount_Term     0
Credit_History       0
Property_Area        0
Loan_Status          0
dtype: int64
```

Replacing with the mode

Mode is the most frequently occurring value. It is used in the case of categorical features. You can use the ‘fillna’ method for imputing the categorical columns ‘Gender,’ ‘Married,’ and ‘Self_Employed.’

IN:

```python
# Replace the missing values of categorical columns with the mode
train_df['Gender'] = train_df['Gender'].fillna(train_df['Gender'].mode()[0])
train_df['Married'] = train_df['Married'].fillna(train_df['Married'].mode()[0])
train_df['Self_Employed'] = train_df['Self_Employed'].fillna(train_df['Self_Employed'].mode()[0])
train_df.isnull().sum()
```

OUT:

```
Loan_ID              0
Gender               0
Married              0
Dependents           0
Education            0
Self_Employed        0
ApplicantIncome      0
CoapplicantIncome    0
LoanAmount           0
Loan_Amount_Term     0
Credit_History       0
Property_Area        0
Loan_Status          0
dtype: int64
```

Replacing with the median

The median is the middlemost value. It’s better to use the median value for imputation in the case of outliers. You can use the ‘fillna’ method for imputing the column ‘Loan_Amount_Term’ with the median value.

```python
train_df['Loan_Amount_Term'] = train_df['Loan_Amount_Term'].fillna(train_df['Loan_Amount_Term'].median())
```

Replacing with the previous value – forward fill

In some cases, imputing the values with the previous value instead of the mean, mode, or median is more appropriate. This is called forward fill. It is mostly used in time series data. You can use the ‘fillna’ function with the parameter ‘method = ffill’

IN:

```python
import pandas as pd
import numpy as np

test = pd.Series(range(6))
test.loc[2:4] = np.nan
test
```

OUT:

```
0    0.0
1    1.0
2    NaN
3    NaN
4    NaN
5    5.0
dtype: float64
```

IN:

```python
# Forward-Fill
test.fillna(method='ffill')
```

OUT:

```
0    0.0
1    1.0
2    1.0
3    1.0
4    1.0
5    5.0
dtype: float64
```

Replacing with the next value – backward fill

In backward fill, the missing value is imputed using the next value.

IN:

```python
# Backward-Fill
test.fillna(method='bfill')
```

OUT:

```
0    0.0
1    1.0
2    5.0
3    5.0
4    5.0
5    5.0
dtype: float64
```

Interpolation

Missing values can also be imputed using interpolation. Pandas’ interpolate method can be used to replace the missing values with different interpolation methods like ‘polynomial,’ ‘linear,’ and ‘quadratic.’ The default method is ‘linear.’

IN:

```python
test.interpolate()
```

OUT:

```
0    0.0
1    1.0
2    2.0
3    3.0
4    4.0
5    5.0
dtype: float64
```

How to Impute Missing Values for Categorical Features?

There are two ways to impute missing values for categorical features as follows:

Impute the Most Frequent Value

We will use ‘SimpleImputer’ in this case, and as this is a non-numeric column, we can’t use mean or median, but we can use the most frequent value and constant.

IN:

```python
import pandas as pd
import numpy as np

X = pd.DataFrame({'Shape': ['square', 'square', 'oval', 'circle', np.nan]})
X
```

OUT:

```
    Shape
0  square
1  square
2    oval
3  circle
4     NaN
```

IN:

```python
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy='most_frequent')
imputer.fit_transform(X)
```

OUT:

```
array([['square'],
       ['square'],
       ['oval'],
       ['circle'],
       ['square']], dtype=object)
```

As you can see, the missing value is imputed with the most frequent value, ’square.’

Impute the Value “Missing”

We can impute the value “missing,” which treats it as a separate category.

IN:

```python
imputer = SimpleImputer(strategy='constant', fill_value='missing')
imputer.fit_transform(X)
```

OUT:

```
array([['square'],
       ['square'],
       ['oval'],
       ['circle'],
       ['missing']], dtype=object)
```

In any of the above approaches, you will still need to OneHotEncode the data (or you can also use another encoder of your choice). After One Hot Encoding, in case 1, instead of the values ‘square,’ ‘oval,’ and’ circle,’ you will get three feature columns. And in case 2, you will get four feature columns (4th one for the ‘missing’ category). So it’s like adding the missing indicator column in the data. There is another way to add a missing indicator column, which we will discuss further.

How to Impute Missing Values Using Sci-kit Learn Library?

We can impute missing values using the sci-kit library by creating a model to predict the observed value of a variable based on another variable which is known as regression imputation.

Univariate Approach

In a Univariate approach, only a single feature is taken into consideration. You can use the class SimpleImputer and replace the missing values with mean, mode, median, or some constant value.

Let’s see an example:

IN:

```python
import numpy as np
from sklearn.impute import SimpleImputer

imp = SimpleImputer(missing_values=np.nan, strategy='mean')
imp.fit([[1, 2], [np.nan, 3], [7, 6]])
```

OUT:

```
SimpleImputer()
```

IN:

```python
X = [[np.nan, 2], [6, np.nan], [7, 6]]
print(imp.transform(X))
```

OUT:

```
[[4.         2.       ]
 [6.         3.666...]
 [7.         6.       ]]
```

Multivariate Approach

In a multivariate approach, more than one feature is taken into consideration. There are two ways to impute missing values considering the multivariate approach. Using KNNImputer or IterativeImputer classes.

Let’s take an example of a titanic dataset.

Suppose the feature ‘age’ is well correlated with the feature ‘Fare’ such that people with lower fares are also younger and people with higher fares are also older. In that case, it would make sense to impute low age for low fare values and high age for high fare values. So here, we are taking multiple features into account by following a multivariate approach.

IN:

```python
import pandas as pd

# df is the Titanic dataset loaded earlier
cols = ['SibSp', 'Fare', 'Age']
X = df[cols]
X
```

OUT:

```
   SibSp     Fare   Age
0      1   7.2500  22.0
1      1  71.2833  38.0
2      0   7.9250  26.0
3      1  53.1000  35.0
4      0   8.0500  35.0
5      0   8.4583   NaN
```

IN:

```python
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

impute_it = IterativeImputer()
impute_it.fit_transform(X)
```

OUT:

```
array([[ 1.        ,  7.25      , 22.        ],
       [ 1.        , 71.2833    , 38.        ],
       [ 0.        ,  7.925     , 26.        ],
       [ 1.        , 53.1       , 35.        ],
       [ 0.        ,  8.05      , 35.        ],
       [ 0.        ,  8.4583    , 28.50639495]])
```

Let’s see how IterativeImputer works. For all rows in which ‘Age’ is not missing, scikit-learn runs a regression model using ‘SibSp’ and ‘Fare’ as the features and ‘Age’ as the target. Then, for all rows in which ‘Age’ is missing, it predicts ‘Age’ by passing ‘SibSp’ and ‘Fare’ to the trained model. So it builds a regression model with two features and one target, makes predictions wherever there are missing values, and those predictions are the imputed values.

Nearest Neighbors Imputations (KNNImputer)

Missing values are imputed using the k-Nearest Neighbors approach, where a Euclidean distance is used to find the nearest neighbors. Let’s take the above example of the titanic dataset to see how it works.

IN:

```python
from sklearn.impute import KNNImputer

impute_knn = KNNImputer(n_neighbors=2)
impute_knn.fit_transform(X)
```

OUT:

```
array([[ 1.    ,  7.25  , 22.    ],
       [ 1.    , 71.2833, 38.    ],
       [ 0.    ,  7.925 , 26.    ],
       [ 1.    , 53.1   , 35.    ],
       [ 0.    ,  8.05  , 35.    ],
       [ 0.    ,  8.4583, 30.5   ]])
```

In the above example, n_neighbors=2. So scikit-learn finds the two rows whose ‘SibSp’ and ‘Fare’ values are closest to those of the row with the missing value. In this case, the last row has a missing value. The third and fifth rows have the closest values for the other two features, so the average of the ‘Age’ feature from those two rows, (26 + 35) / 2 = 30.5, is taken as the imputed value.

How to Use “Missingness” as a Feature?

In some cases, while imputing missing values, you can preserve information about which values were missing and use that as a feature. This is because sometimes, there may be a relationship between the reason for missing values (also called the “missingness”) and the target variable you are trying to predict. In such cases, you can add a missing indicator to encode the “missingness” as a feature in the imputed data set.

Where can we use this?

Suppose you are predicting the presence of a disease. Now, imagine a scenario where a missing age is a good predictor of the disease because we don’t have records for people in poverty. The age values are not missing at random. They are missing for people in poverty, and poverty is a good predictor of disease. Thus, missing age or “missingness” is a good predictor of disease.

IN:

```python
import pandas as pd
import numpy as np

X = pd.DataFrame({'Age': [20, 30, 10, np.nan, 10]})
X
```

OUT:

```
    Age
0  20.0
1  30.0
2  10.0
3   NaN
4  10.0
```

IN:

```python
from sklearn.impute import SimpleImputer

# impute the mean
imputer = SimpleImputer()
imputer.fit_transform(X)
```

OUT:

```
array([[20. ],
       [30. ],
       [10. ],
       [17.5],
       [10. ]])
```

IN:

```python
imputer = SimpleImputer(add_indicator=True)
imputer.fit_transform(X)
```

OUT:

```
array([[20. ,  0. ],
       [30. ,  0. ],
       [10. ,  0. ],
       [17.5,  1. ],
       [10. ,  0. ]])
```

In the above example, the second column indicates whether the corresponding value in the first column was missing or not. ‘1’ indicates that the corresponding value was missing, and ‘0’ indicates that the corresponding value was not missing.

If you don’t want to impute missing values but only want to have the indicator matrix, then you can use the ‘MissingIndicator’ class from scikit learn.

Conclusion

Key Takeaways

It is critical to reduce the potential bias in the machine learning models and get a precise statistical analysis of the data. Handling missing values is one of the challenges of data analysis.

Understanding the different categories of missing data helps in deciding how to handle it. We explored the different categories of missing data and the various ways of handling it in this article.

Q1. What are the types of missing values in data?

A. The three types of missing data are Missing Completely At Random (MCAR), Missing At Random (MAR), and Missing Not At Random (MNAR).

Q2. How do you handle missing values?

A. We can use different methods to handle missing data points, such as dropping missing values, imputing them using machine learning, or treating missing values as a separate category.

Q3. How does pairwise deletion handle missing data?

A. Pairwise deletion is a method of handling missing values where only the observations with complete data are used in each pairwise correlation or regression analysis. This method assumes that the missing data is MCAR, and it is appropriate when the missing data is not too large.

Related



## Top 10 Python Libraries For Data Visualization In 2023

In this article, we have discussed the top 10 python libraries for data visualization in 2023

Python is one of the most widely used programming languages. It serves to be a blessing in the field of data science. When one boasts of possessing good Python skills, it is expected out of that person that he/she is well acquainted with libraries in Python. Here are the top 10 Python libraries for data visualization in 2023 which make programming and developing models a lot easier.

1. SciPy

This stands for Scientific Python, an open-source library that comes in handy for all kinds of high-level scientific and technical computations that once seemed hard to handle. It is widely regarded as user-friendly, and one of its remarkable features is its ability to solve differential equations. The library has applications in linear algebra, solving differential equations, and optimizing algorithms, to name a few.

2. Gradio

This library allows you to build and deploy web applications. Its best feature is that a task can be done in as few as three lines of code, and the whole process is fast and easy. With Gradio, it is possible to test different inputs, and model validation is easier than ever. Since public links can be generated, it is very easy to share and distribute web applications.

3. Keras

4. Matplotlib

5. Orbit

This is yet another Python framework designed for Bayesian time series forecasting and inference. Its framework is built on probabilistic programming packages like PyStan and Uber’s own Pyro.

6. Seaborn

It is one of those data visualization libraries that helps in drawing attractive and informative statistical graphics. Seaborn provides a high-level interface. People consider it to be an extension of Matplotlib. While Matplotlib provides a range of basic plotting features, Seaborn lets users enjoy a range of visualization patterns. Yet another feature of this library that grabs attention is that the syntax is simple and not that complex.

7. Pandas

Pandas stands for ‘Python Data Analysis Library’. It is an open-source Python package that delivers high performance and provides easy-to-use data structures and data analysis tools that are extremely useful while programming in Python. Some of the best features of this library are:

One can plot data with a histogram or box plot.

It is very easy to add, delete and update columns.

Renaming, sorting, indexing, merging, and manipulating data frames.
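The features listed above can be sketched in a few lines; the DataFrame and column names below are invented for illustration:

```python
# Add, rename, and sort columns in a pandas DataFrame.
import pandas as pd

df = pd.DataFrame({"name": ["Ana", "Ben", "Cai"],
                   "sales": [250, 180, 320]})
df["bonus"] = df["sales"] * 0.1                   # add a column
df = df.rename(columns={"sales": "revenue"})      # rename a column
df = df.sort_values("revenue", ascending=False)   # sort rows by a column
print(df.iloc[0]["name"])  # "Cai", the row with the highest revenue
```

Dropping a column is just as direct (`df.drop(columns=["bonus"])`), and `df["revenue"].plot(kind="hist")` gives the histogram plotting mentioned above.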

8. Sktime

This is an open-source Python library exclusively designed for time series analysis. It provides an extension to the scikit-learn API for time series and contains the algorithms and tools needed to effectively solve time series regression, forecasting, and classification problems.

9. Darts

Darts is yet another time series Python library that has made its way to the list of the top 10 Python libraries. Developed by Unit8, Darts is widely known for easy manipulation and forecasting of time series. It can handle large data quite well and supports both univariate and multivariate time series analysis and models.

10. Kats (Kit to Analyze Time Series)

## The Top 10 Most Effective Business Analysis Techniques


Business analysis is a structured way of introducing and managing organizational change while providing value to all business stakeholders. It includes identifying new opportunities, optimizing costs, understanding required capabilities, and finding solutions to help businesses achieve their goals. If you are an aspiring business analyst seeking information on commonly used techniques or an existing professional in this field looking to upskill yourself, here are the top 10 business analysis techniques you must know to be effective in your role.

Top 10 Business Analysis Techniques

SWOT Analysis

SWOT (Strengths, Weaknesses, Opportunities, Threats) analysis is a four-quadrant analysis where the business analyst groups information and data as per the strengths, weaknesses, opportunities, and threats related to a company. It helps to get a clear picture of a company’s standing, both internal as well as external factors, allowing for more informed decisions.

Pros and Cons

SWOT analysis helps improve planning because it gives a clear picture of various attributes of the business. However, it may sometimes be an oversimplified analysis. For example, the high price of a product may not necessarily be a threat to the company. It can, in fact, be a strength as it may create the perception of luxury in customers’ minds.

MOST Analysis

MOST, a short form for Mission, Objectives, Strategies, and Tactics, helps keep these four factors relevant and aligned with the business. It is a powerful tool used to assess the organization’s strategic plan and gives a clear vision to each organization member regarding the direction of their work. This ultimately helps bring all functions and levels in alignment with each other.

Pros and Cons

It is a clean and simple way of communicating business strategy to all the stakeholders and helps them align in a common direction. However, this technique is not self-sufficient and needs the support of other tools and analyses to define business strategy fully.

Business Process Modeling

Business process modeling is a data-driven and illustrative representation of an organization’s business processes. It provides insights into the various functions of a business process, including events and activities, owners, decision points, and timelines. It uses an ‘as-is’ approach instead of a ‘to-be’ approach, thus allowing better visibility into existing processes and helping improve them.

Pros and Cons

Business process modeling helps align business operations with strategy, improves communication among various stakeholders, and helps achieve operational efficiencies. The only downside to this technique is the risk of overanalyzing the process, especially if the business problem is not very complex. Also, if not implemented properly, this technique may underutilize the resources spent on creating the model.

Use Case Modeling

Use case modeling depicts how users interact with a system to solve problems. It defines the user’s objective, various interactions between the system and the user, and the system’s behavior to fulfill the user’s objectives. It can be done using various tools such as Microsoft’s Visio, Lucidchart, and IBM Rational Rose.

Pros and Cons

As a user-centered technique, it helps develop a system from the user’s point of view. It also helps visualize complex projects more simply. A major drawback of use case modeling is that it is not object-oriented (i.e., not made up of data fields with unique attributes and behavior). It may also lead to miscommunication due to the use of non-technical language.

Brainstorming

Brainstorming is a creative group activity undertaken to develop an exhaustive list of ideas and identify multiple possible solutions for a problem at hand. It requires freewheeling thinking and discussion to ensure no possibility is left unexplored.

Pros and Cons

Non-Functional Requirement Analysis

Non-functional requirements are necessary for a system to perform well but do not directly contribute to the primary functions. For example, a functional requirement of a word editor can be the ability to write text. In contrast, a non-functional requirement can be software that automatically saves text if a user forgets to save it manually. Thus the Non-Functional Requirement technique analyzes various non-functional requirements such as security, reliability, performance, maintainability, scalability, and usability. It helps understand various operational capabilities and constraints that need to be considered in the design of systems.

Pros and Cons

Non-functional requirement analysis ensures legal and other compliance requirements are met. It also creates ease of operations and a good experience for the user. One major con is that non-functional capabilities are difficult to alter once the design phase is complete.

PESTLE Analysis

This strategic tool analyzes the external environmental factors that may impact a business and its future performance.  PESTLE analysis includes the following factors:

P – Political

E – Economic

S – Social

T – Technological

L – Legal

E – Environmental

Pros and Cons

Requirement Analysis

Requirement analysis is undertaken to capture the user expectations for a new product. It starts with identifying the relevant stakeholders, conducting interviews to capture requirements, categorizing requirements, interpreting and documenting requirements, and finally, signing off on the requirements that need to be worked upon.

Pros and Cons

User Stories

User stories convey what users want to achieve in a simple, non-technical way. They help provide context to the development team and help them understand why they are building what they’re building and how it impacts the end user. These stories are a core component of Agile programs.

Pros and Cons

The pros of user stories as a business analysis technique include user-centric outcomes and improved collaboration between the product team and the users. Major cons include the difficulty of fulfilling compliance needs and of ensuring correct interpretation of the user stories.

CATWOE

CATWOE is a generic business analysis technique that helps define and analyze the perspectives of various business stakeholders. It is an acronym that stands for:

C – Customers

A – Actors

T – Transformation Process

W – Worldview

O – Owner

E – Environmental Constraints

Pros and Cons

It is one of those business analysis techniques that help consider the perspectives of different stakeholders and give due weight to various requirements. The only possible con is that it may result in confusion caused by conflicting views among stakeholders.

According to a recent survey, 88% of executives reported that their companies had increased investments in data, analytics, and AI during 2023. The future is bright for professionals pursuing a career in these areas. If you are planning to pursue a career in business analysis or to enhance your existing skills, we hope this article has given you a fair understanding of each business analysis technique. The specific techniques you choose, however, may differ based on the industry and goals of your organization.

ALSO READ: What Does a Business Analyst Do? Key Responsibilities, Skills Needed, Tools Used

If you want to learn more about business analytics concepts and applications, explore these online business analytics courses offered by world-class universities via Emeritus, and build your skills for a successful career in analytics.

Write to us at [email protected]

## 10 YouTube Tricks for iPhone App to Watch Videos Like a Pro – Webnots

YouTube is the largest video streaming service in the world, offered by Google. However, the iOS app is somewhat traditional, and Google downplays some features. For example, many video apps allow you to swipe left or right to rewind or fast-forward the playback; unfortunately, you can’t do that in the YouTube app. If you use the app on your iPhone every day, here are some YouTube tricks to watch videos without hassle.

Related: How to increase YouTube video views?

There are two types of tricks you can do with the YouTube iOS app: one at the app-level settings and the other at the video-level settings.

App Level Setting

When you are in the YouTube app, tap on the profile icon in the top-right corner.

Tap on Profile

Tap on “Settings” option.

Here you will have many options that you can configure as per your need.

Video Level Settings

When you are playing a video, tap on the video and then tap on the three vertical dots icon.

1. Setup Custom Break

iPhone has a Screen Time option to monitor and control your time on the phone. Watching videos for a long time strains your eyes and can create health problems. To avoid sitting in front of the YouTube app endlessly, you can set up custom breaks. This will help you change the focus of your eyes and take sufficient breaks from watching videos. Go to the app settings and enable the option “Remind me to take a break”.

On the pop-up, set your break frequency. YouTube allows you to set a gap of anywhere from 5 minutes to many hours between breaks.

Setup Custom Break Frequency

After the set time, the YouTube app will prompt you with a pop-up to take a break.

Warning to Take a Break

Similar to breaks, you can also set up a bedtime reminder to avoid watching videos during sleeping hours.

2. Change Theme

Unfortunately, the YouTube app does not follow the theme settings of your iPhone. By default, it opens in light mode. However, you can change the theme to dark by going to settings and enabling the “Dark theme” option. This helps reduce eye strain, especially at night or in a bright environment.

3. Disable Recommendations and History

YouTube shows a recommended video feed based on your watch history. If you do not want to record the history, go to the app settings and tap on the “Clear watch history” and “Clear search history” options. In addition, you can disable watch/search history by enabling the “Pause watch history” and “Pause search history” options.

4. Setup Siri Shortcuts

Go to the app settings and tap on the “Siri Shortcuts” option. Here you can add search and voice-search shortcuts to Siri. Tap on the + icon next to an option and then tap on the “Add to Siri” button. For example, you can add a Siri shortcut named “YouTube Voice” to open YouTube and start a voice search.

5. Restricted Mode

Sometimes it is annoying to see adult content in the video feed without your knowledge or permission. As with Google search results, you can restrict YouTube’s feed and search results from showing adult and mature content. Go to the app settings and enable the “Restricted Mode” option. This is useful if you share the iPhone with family members who also watch videos in the YouTube app.

6. Report and Block Ads

If you frequently see a particular inappropriate ad below videos, YouTube allows you to report and block that ad.

Tap on the three dots button that shows just below the image of the ad.

Choose “Stop seeing this ad” option.

Tap on “Repetitive” or “Irrelevant” or “Inappropriate” to block the ad.

7. Forward and Rewind

As mentioned, YouTube does not offer a swipe gesture; you must drag the time bar to fast-forward or rewind the video you are watching, which is cumbersome. However, you can double-tap on the right side of the video to jump 10 seconds forward, and double-tap on the left side to rewind 10 seconds.

By default, the app moves 10 seconds forward or back. If you want to change this, go to settings and tap on the “Skip forward and back” option. You can choose 5, 10, 15, 20, 30, or 60 seconds. However, it is not possible to set different times for forward and rewind.

Related: How to disable in-app purchase in iPhone?

8. Change Video Quality

By default, YouTube plays videos in 360p quality to save network bandwidth. Tap on the video settings and then tap on the “Quality” option. Here you can change the video quality to a higher or lower level as per your need.

9. Watch Video in Slow Motion

Sometimes a video may play so fast that you can’t follow the step-by-step instructions. Instead of moving the time bar or rewinding by double-tapping on the left of the video, you can play the video in slow motion. Tap on the three dots button on the video and choose the “Playback speed” option. Select 0.25x, 0.5x, or 0.75x for slow motion, and 1.25x, 1.5x, 1.75x, or 2x for faster playback.

Change Video Playing Speed