A Quick Tutorial On Clustering For Data Science Professionals


This article was published as part of the Data Science Blogathon.

Welcome to this wide-ranging article on clustering in data science! There’s a lot to unpack so let’s dive straight in.

In this article, we will discuss what clustering is, why it is required, various applications of clustering, a brief overview of the K-Means algorithm, and finally, detailed practical implementations of some clustering applications.

Table of Contents

What is Clustering?

Why is Clustering required?

Various applications of Clustering

A brief about the K-Means Clustering Algorithm

Practical implementation of Popular Clustering Applications

What is Clustering?

In simple terms, the goal is to group similar items together into clusters, just like this:

Let's understand this with an example. Suppose you are on a trip with your friends and all of you decide to hike in the mountains. There you come across a beautiful butterfly which you have never seen before. Further along, you encounter a few more. They are not exactly the same, but similar enough for you to understand that they belong to the same species. You would need a lepidopterist (someone who studies and collects butterflies) to tell you exactly what species they are, but you don't need an expert to identify a group of similar items. This way of identifying similar objects or items is known as clustering.

Why is Clustering required?

Clustering is an unsupervised task. Unsupervised means that we are not provided with any assigned labels or scores for training our data.

In the figure above, on the left, each instance is marked with a different marker, which means it is a labeled dataset. For labeled data we can use classification algorithms like SVM, Logistic Regression, Decision Trees, or Random Forests. On the right is the same dataset but without labels, so here the story for classification algorithms ends (i.e., we can't use them). This is where clustering algorithms come into the picture to save the day! In the picture above it is pretty obvious and quite easy to identify the three clusters with our eyes, but that will not be the case while working with real and complex datasets.

Various applications of Clustering

1. Search engines:

You may be familiar with the image search feature that Google provides. What this system does is first apply a clustering algorithm to all the images available in the database, after which similar images fall into the same cluster. When a user provides a reference image, the system applies the trained clustering model to that image to identify its cluster, and then simply returns all the images from that cluster.

2. Customer Segmentation:

We can also cluster our customers based on their purchase history and their activity on our website. This is really important and useful to understand who our customers are and what they require so that our system can adapt to their requirements and suggest products to each respective segment accordingly.

3. Semi-supervised Learning:

When you are working on a semi-supervised learning problem in which you are only provided with a few labels, you can run a clustering algorithm and propagate the labels to all instances falling in the same cluster. This technique is really good for increasing the number of labels, after which a supervised learning algorithm can be used and its performance usually improves. A minimal sketch of this idea follows.
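As a rough, self-contained sketch (the dataset, the cluster count of 50, and all variable names below are illustrative choices, not taken from this article), you can cluster the data with K-Means, hand-label only the one instance closest to each centroid, and propagate that label to the rest of its cluster:

import numpy as np
from sklearn.datasets import load_digits
from sklearn.cluster import KMeans

X, y = load_digits(return_X_y=True)

k = 50
kmeans = KMeans(n_clusters=k, random_state=42)
# Distance of every instance to every centroid
X_dist = kmeans.fit_transform(X)

# Index of the instance closest to each centroid (one "representative" per cluster)
representative_idx = np.argmin(X_dist, axis=0)

# Pretend we only labeled those k representatives by hand
representative_labels = y[representative_idx]

# Propagate each representative's label to all instances in its cluster
y_propagated = representative_labels[kmeans.labels_]

With just 50 hand-labeled representatives, every training instance ends up with a label, which a supervised model can then be trained on.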

4. Anomaly detection:

Any instance that has a low affinity (a measure of how well an instance fits into a particular cluster) is probably an anomaly. For example, if you have clustered users based on the number of requests per minute on your website, you can detect users with abnormal behavior. This technique is particularly useful for detecting manufacturing defects or for fraud detection, as sketched below.
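A minimal sketch of this idea, assuming synthetic data and an arbitrary 99th-percentile cutoff (both are illustrative assumptions, not from the article), flags the instances farthest from their closest centroid:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=1000, centers=5, random_state=42)
kmeans = KMeans(n_clusters=5, random_state=42).fit(X)

# Distance of each instance to its closest centroid (a simple proxy for "affinity")
distances = np.min(kmeans.transform(X), axis=1)

# Flag the 1% of instances farthest from any centroid as potential anomalies
threshold = np.percentile(distances, 99)
anomalies = X[distances > threshold]

In practice the threshold would be chosen from domain knowledge or validated against known anomalies.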

5. Image Segmentation:

If you cluster all the pixels of an image according to their colors, you can then replace each pixel with the mean color of its cluster. This is helpful whenever you need to reduce the number of distinct colors in an image. Image segmentation plays an important part in object detection and tracking systems.

We will look at how to implement this further.

A Brief About the K-Means Clustering Algorithm

Let’s go ahead and take a quick look at what the K-means algorithm really is.

Firstly, let’s generate some blobs for a better understanding of the unlabelled dataset.

import numpy as np
from sklearn.datasets import make_blobs

blob_centers = np.array(
    [[ 0.2, 2.3],
     [-1.5, 2.3],
     [-2.8, 1.8],
     [-2.8, 2.8],
     [-2.8, 1.3]])
blob_std = np.array([0.4, 0.3, 0.1, 0.1, 0.1])

X, y = make_blobs(n_samples=2000, centers=blob_centers,
                  cluster_std=blob_std, random_state=7)

Now let’s plot them

import matplotlib.pyplot as plt

plt.figure(figsize=(8, 4))
plt.scatter(X[:, 0], X[:, 1], c=None, s=1)
# save_fig("blobs_plot")  # helper from the original notebook; not needed to display the plot
plt.show()

So this is what an unlabeled dataset looks like; here we can clearly see that there are five blobs of instances. K-Means is a simple algorithm capable of clustering this kind of dataset efficiently and quickly.

Let’s go ahead and train a K-Means on this dataset. Now, this algorithm will try to find each blob’s center.

from sklearn.cluster import KMeans

k = 5
kmeans = KMeans(n_clusters=k, random_state=101)
y_pred = kmeans.fit_predict(X)

Keep in mind that we need to specify the number of clusters k that the algorithm has to find. In our example it is pretty straightforward, but in general it won't be that easy. After training, each instance will have been assigned to one of the five clusters. Remember that here an instance's label is the index of its cluster; don't confuse it with class labels in classification.

Let’s take a look at the five centroids the algorithm found:

kmeans.cluster_centers_

These are the centroids for the clusters with indexes 0, 1, 2, 3, and 4 respectively.

Now you can easily assign new instances, and the model will assign each one to the cluster whose centroid is closest to it.

new = np.array([[0, 2], [3, 2], [-3, 3], [-3, 2.5]])
kmeans.predict(new)

That is pretty much it for now; we will go through the detailed workings and the variants of K-Means some other day, in another blog. Stay tuned!

Implementation of Popular Clustering Applications

1. Image Segmentation using clustering

Image segmentation is the task of partitioning an image into multiple segments. For example, in a self-driving car's object detection system, all the pixels that are part of a traffic signal's image might be assigned to the "traffic-signal" segment. Today, state-of-the-art models based on CNNs (convolutional neural networks) with complex architectures are used for image processing. But we are going to do something much simpler: color segmentation. We will simply assign pixels to the same cluster if they have a similar color. This technique might be sufficient for some applications; for example, in the analysis of satellite images to measure the forest coverage of a region, color segmentation might just do the job.

Let's go ahead and load the image we are about to work on:

from matplotlib.image import imread

image = imread('lady_bug.png')
image.shape

Now Let’s go ahead and reshape the array to get a long list of RGB colors and then cluster them using K-Means:

X = image.reshape(-1, 3)

kmeans = KMeans(n_clusters=8, random_state=101).fit(X)
segmented_img = kmeans.cluster_centers_[kmeans.labels_]
segmented_img = segmented_img.reshape(image.shape)

What's happening here is that, for example, K-Means tries to identify a color cluster for all shades of green. Then, for each pixel, it looks up the mean color of that pixel's color cluster; in other words, it replaces all shades of green with one light green color, assuming the cluster mean is light green. Finally, it reshapes this long list of colors back to the original dimensions of the image.

Output with a different number of clusters:
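The outputs above were generated for several cluster counts. A sketch along these lines reproduces that comparison; the specific counts (10, 8, 6, 4 and 2) are assumptions for illustration, and the image file name is the same as in the snippet above:

from matplotlib.image import imread
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

image = imread('lady_bug.png')      # same (RGB) image as above
X = image.reshape(-1, 3)

segmented_imgs = []
n_colors = (10, 8, 6, 4, 2)         # illustrative cluster counts
for n_clusters in n_colors:
    kmeans = KMeans(n_clusters=n_clusters, random_state=101).fit(X)
    segmented_imgs.append(kmeans.cluster_centers_[kmeans.labels_].reshape(image.shape))

plt.figure(figsize=(10, 5))
plt.subplot(2, 3, 1)
plt.imshow(image)
plt.title("Original image")
plt.axis('off')
for idx, n_clusters in enumerate(n_colors):
    plt.subplot(2, 3, 2 + idx)
    plt.imshow(segmented_imgs[idx])
    plt.title("{} colors".format(n_clusters))
    plt.axis('off')
plt.show()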

2. Data preprocessing using Clustering

Clustering can be an effective approach to dimensionality reduction, in particular as a preprocessing step before a supervised learning algorithm is applied. Let's take a look at how we can reduce the dimensionality of the famous MNIST-style digits dataset using clustering, and how much of a performance difference we get after doing this.

The scikit-learn digits dataset (a scaled-down, MNIST-like dataset) consists of 1,797 grayscale (single-channel) 8 x 8 images representing the digits 0 to 9. Let's start by loading the dataset:

from sklearn.datasets import load_digits

X_digits, y_digits = load_digits(return_X_y=True)

Now let’s split them into training and test set:

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X_digits, y_digits, random_state=42)

Now let’s go ahead and train a logistic regression model and evaluate its performance on the test set:

from sklearn.linear_model import LogisticRegression

log_reg = LogisticRegression()  # increase max_iter if you see a convergence warning
log_reg.fit(X_train, y_train)

Now Let’s evaluate its accuracy on the test set:

log_reg_score = log_reg.score(X_test, y_test)
log_reg_score

OK, so now we have an accuracy of 96.88%. Let's see if we can do better by using K-Means as a preprocessing step. We will create a pipeline that first clusters the training set into 50 clusters and replaces each image with its distances to these 50 cluster centroids, and then applies the Logistic Regression model:

from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ("kmeans", KMeans(n_clusters=50)),
    ("log_reg", LogisticRegression()),
])

pipeline.fit(X_train, y_train)

Let's evaluate this pipeline on the test set:

pipeline_score = pipeline.score(X_test, y_test)
pipeline_score

Boom! We just increased the accuracy of the model. But here we chose the number of clusters k arbitrarily. Let's go ahead and apply grid search to find a better value of k:

from sklearn.model_selection import GridSearchCV

param_grid = dict(kmeans__n_clusters=range(2, 100))
grid_clf = GridSearchCV(pipeline, param_grid, cv=3, verbose=2)
grid_clf.fit(X_train, y_train)

Warning: the above step might be time-consuming!

Let's see the best number of clusters that we got and its accuracy:

grid_clf.best_params_

The accuracy now is:

grid_clf.score(X_test, y_test)

Here we got a significant boost in accuracy compared to earlier on the test set.

End Notes

To sum up, in this article we saw what clustering is, why it is required, various applications of clustering, a brief overview of the K-Means algorithm, and lastly, detailed practical implementations of some clustering applications. I hope you liked it!

Stay tuned!

Connect with me on LinkedIn

Thank You!

The media shown in this article is not owned by Analytics Vidhya and is used at the Author's discretion.


Pandas Cheat Sheet For Data Science In Python

What is Pandas Cheat Sheet?

The Pandas library has many functions, and some of them can be confusing at first. We have provided a helpful resource called the Python Pandas Cheat Sheet. It explains the basics of Pandas in a simple and concise manner.


Whether you are a newbie or experienced with Pandas, this cheat sheet can serve as a useful reference guide. It covers a variety of topics, including working with Series and DataFrame data structures, selecting and ordering data, and applying functions to your data.

In summary, this Pandas Python Cheat Sheet is a good resource for anyone looking to learn more about using Python for Data Science. It is a handy reference tool. It can help you improve your data analysis skills and work more efficiently with Pandas.

Explaining important functions in Pandas:

To start working with pandas functions, you need to install and import pandas. There are two commands to do this:

Step 1) # Install Pandas

pip install pandas

Step 2) # Import Pandas

import pandas as pd

Now you can start working with Pandas functions to manipulate, analyze, and clean data. Here are some important Pandas functions.

Pandas Data Structures

As we have already discussed, Pandas has two data structures called Series and DataFrame. Both are labeled arrays and can hold any data type. The only difference is that a Series is one-dimensional, while a DataFrame is two-dimensional.

1. Series

It is a one-dimensional labeled array. It can hold any data type.

s = pd.Series([2, -4, 6, 3, None], index=['A', 'B', 'C', 'D', 'E'])

2. DataFrame

It is a two-dimensional labeled array. It can hold any data type and different sizes of columns.

data = {'RollNo': [101, 102, 75, 99],
        'Name': ['Mithlesh', 'Ram', 'Rudra', 'Mithlesh'],
        'Course': ['Nodejs', None, 'Nodejs', 'JavaScript']}
df = pd.DataFrame(data, columns=['RollNo', 'Name', 'Course'])
df.head()

Importing Data

Pandas has the ability to import or read various types of files into your Notebook.

Here are some examples given below.

# Import a CSV file
pd.read_csv(filename)
# Import a TSV file
pd.read_table(filename)
# Import an Excel file
pd.read_excel(filename)
# Import a SQL table/database
pd.read_sql(query, connection_object)
# Import a JSON file
pd.read_json(json_string)
# Import an HTML file
pd.read_html(url)
# From clipboard to read_table()
pd.read_clipboard()
# From a dict
pd.DataFrame(dict)

Selection

You can select elements by their location or index. You can select rows, columns, and distinct values using these techniques.

1. Series

# Accessing one element from a Series
s['D']
# Accessing all elements between two given indices
s['A':'C']
# Accessing all elements from the start till a given index
s[:'C']
# Accessing all elements from a given index till the end
s['B':]

2. DataFrame

# Accessing one column
df['Name']
# Accessing rows after a given row
df[1:]
# Accessing rows before a given row
df[:1]
# Accessing rows between two given rows
df[1:2]

Selecting by Boolean Indexing and Setting

1. By Position

df.iloc[0, 1]
df.iat[0, 1]

2. By Label

df.loc[[0], ['Name']]

3. By Label/Position

# Both are the same here
df.loc[2]
df.iloc[2]

4. Boolean Indexing

# Use a boolean filter to adjust the DataFrame, e.g. keep rows where RollNo is greater than 100
df[df['RollNo'] > 100]

# Set the value at index 'D' of Series s to 10
s['D'] = 10
s.head()

Data Cleaning

For data cleaning purposes, you can perform the following operations:

Rename columns using the rename() method.

Update values using the at[] or iat[] method to access and modify specific elements.

Create a copy of a Series or data frame using the copy() method.

Check for NULL values using the isnull() method, and drop them using the dropna() method.

Check for duplicate values using the duplicated() method. Drop them using the drop_duplicates() method.

Replace NULL values using the fillna() method with a specified value.

Replace values using the replace() method.

Sort values using the sort_values() method.

Rank values using the rank() method.

# Renaming columns
df.columns = ['a', 'b', 'c']
df.head()
# Mass renaming of columns (returns a copy)
df = df.rename(columns={'RollNo': 'ID', 'Name': 'Student_Name'})
# Or edit the same DataFrame in place instead of a copy
df.rename(columns={'RollNo': 'ID', 'Name': 'Student_Name'}, inplace=True)
df.head()
# Flagging duplicates in a column
df.duplicated(subset='Name')
# Removing entire rows that have a duplicate in the given column
df.drop_duplicates(subset=['Name'])
# You can choose which one to keep - by default it is the first
df.drop_duplicates(subset=['Name'], keep='last')
# Checks for Null values
s.isnull()
# Checks for non-Null values - reverse of isnull()
s.notnull()
# Checks for Null values
df.isnull()
# Checks for non-Null values - reverse of isnull()
df.notnull()
# Drops all rows that contain null values
df.dropna()
# Drops all columns that contain null values
df.dropna(axis=1)
# Replaces all null values with 'Guru99'
df.fillna('Guru99')
# Replaces all null values with the mean
s.fillna(s.mean())
# Converts the datatype of the Series to float
s.astype(float)
# Replaces all values equal to 6 with 'Six'
s.replace(6, 'Six')
# Replaces all 2 with 'Two' and 6 with 'Six'
s.replace([2, 6], ['Two', 'Six'])
# Drop from rows (axis=0)
s.drop(['B', 'D'])
# Drop from columns (axis=1)
df.drop('Name', axis=1)
# Sort by labels along an axis
df.sort_index()
# Sort by values along an axis
df.sort_values(by='RollNo')
# Ranking entries
df.rank()
# s1 points to the same Series as s
s1 = s
# s_copy is a copy of s, not pointing to the same Series
s_copy = s.copy()
# df1 points to the same DataFrame as df
df1 = df
# df_copy is a copy of df, not pointing to the same DataFrame
df_copy = df.copy()

Retrieving Information

You can perform these operations to retrieve information:

Use shape attribute to get the number of rows and columns.

Use the head() or tail() method to obtain the first or last few rows as a sample.

Use the info(), describe(), or dtypes method to obtain information about the data type, count, mean, standard deviation, minimum, and maximum values.

Use the count(), min(), max(), sum(), mean(), and median() methods to obtain specific statistical information for values.

Use the loc[] method to obtain a row.

Use the groupby() method to apply the GROUP BY function to group similar values in a column of a DataFrame.

1. Basic information

# Counting all elements in a Series
len(s)
# Counting all elements in a DataFrame
len(df)
# Prints the number of rows and columns in a DataFrame
df.shape
# Prints the first 10 rows (first 5 by default if no value is given)
df.head(10)
# Prints the last 10 rows (last 5 by default if no value is given)
df.tail(10)
# Counts non-Null values column-wise
df.count()
# Range of the index
df.index
# Names of the attributes/columns
df.columns
# Index, data type and memory information
df.info()
# Datatypes of each column
df.dtypes
# Summary statistics for numerical columns
df.describe()

2. Summary

# Adds all values column-wise
df.sum()
# Minimum column-wise
df.min()
# Maximum column-wise
df.max()
# Mean value of numeric columns
df.mean()
# Median value of numeric columns
df.median()
# Count non-Null values in a Series
s.count()
# Count non-Null values in a DataFrame
df.count()
# Return the values of a given column as a list
df['Name'].tolist()
# Names of the columns as a list
df.columns.tolist()
# Creating a subset
df[['Name', 'Course']]
# Return the number of values in each group
df.groupby('Name').count()

Applying Functions

# Define a function
f = lambda x: x*5
# Apply this function on a given Series - for each value
s.apply(f)
# Apply this function on a given DataFrame - for each value
df.apply(f)

1. Internal Data Alignment

# NA values for indices that don't overlap
s2 = pd.Series([8, -1, 4], index=['A', 'C', 'D'])
s + s2

2. Arithmetic Operations with Fill Methods

# Fill values that don't overlap
s.add(s2, fill_value=0)

3. Filter, Sort and Group By

These following functions can be used for filtering, sorting, and grouping by Series and DataFrame.

# Filter rows where a column is greater than 100
df[df['RollNo'] > 100]
# Filter rows where 70 < column < 101
df[(df['RollNo'] > 70) & (df['RollNo'] < 101)]
# Sorts values in ascending order
s.sort_values()
# Sorts values in descending order
s.sort_values(ascending=False)
# Sorts values by RollNo in ascending order
df.sort_values('RollNo')
# Sorts values by RollNo in descending order
df.sort_values('RollNo', ascending=False)

Exporting Data

Pandas has the ability to export or write data in various formats. Here are some examples given below.

# Export as a CSV file
df.to_csv(filename)
# Export as an Excel file
df.to_excel(filename)
# Export as a SQL table
df.to_sql(table_name, connection_object)
# Export as a JSON file
df.to_json(filename)
# Export as an HTML table
df.to_html(filename)
# Write to the clipboard
df.to_clipboard()

Conclusion:

Pandas is an open-source Python library for working with data sets. Its ability to analyze, clean, explore, and manipulate data makes it an important tool for data scientists. Pandas is built on top of NumPy and is used alongside other libraries like Matplotlib and Scikit-learn. This Pandas Cheat Sheet is a helpful resource for beginners and experienced users alike. It covers topics such as data structures, data selection, importing data, Boolean indexing, dropping values, sorting, and data cleaning. We have also prepared a pandas cheat sheet PDF for this article. Pandas is widely used in data science for working with DataFrames and Series, and we have discussed its most important commands in this cheat sheet.

Colab of Cheat Sheet

My Colab Exercise file for Pandas – Pandas Cheat Sheet – Python for Data Science.ipynb

4Ddig Mac Data Recovery Review: A Quick And Cool Data Recovery Option

What you need to know:

4DDiG has impressive file type and device support.

Thanks to a three-step process, recovering data is no longer rocket science.

It’s definitely worth a dig, so don’t skip the review

Data recovery is no joking matter. It ensures that your efforts, time, and invested money do not go in vain due to a stupid mistake or an unfortunate system crash. Tenorshare’s 4DDiG data recovery software can help retrieve the data without a backup, a technical degree, excessive wait time, and tons of $$$. But how?

Well, that’s exactly what we are going to uncover in this review. And most importantly, why do I feel that 4DDiG data recovery for Mac is a cool recovery option.

Tenorshare 4DDiG: It’s all about that data, no trouble! 

Whether you go through the 4DDiG website or poke around its app on your Mac or Windows device, the emphasis is on helping you recover any and every kind of data. I present exhibit A, aka the next section, to prove the point.

An exhaustive support system 

First and foremost, 4DDiG promises to help you recover data in any scenario, whether it’s update failure, system crash, partition loss, accidental deletion, disk damage, or virus attack. In addition, it covers almost all file types, OS, and devices.

1,000+ file types and formats

Documents – DOC, XLSX, PPTX, CWK, HTML, etc.

Photos – JPEG, PNG, BMP, GIF, PSD, CRW, RAW, SWF, SVG, etc.

Video – AVI, MOV, MP4, M4V, WMV, MKV, FLV, etc.

Audio – MP3, M4A, WMA, AAC, WAV, etc.

Emails

Archives – ZIP, RAR, SIT, ISO, HTML, etc.

Supported devices in Apple Ecosystem

MacBook

iMac

Hard Drive

SSD

USB Drive

Memory Card

Camera

iPod

Supported macOS

Monterey

Big Sur

Catalina

Mojave

High Sierra

Sierra

El Capitan

Yosemite

Supported File System

APFS

HFS+

FAT32

exFAT

Smart features for smarter Mac data recovery

While exhibit A was impressive, I now present to you something more lucrative!

Two recovery modes – 4DDiG for Mac offers

Quick scan mode – Scans and finds most recent and easily available files.

Deep Scan – An in-depth scanner that’ll find deep-buried deleted files. It’s time-consuming but boasts a higher recovery success rate.

File filter – Quickly locate a lost file by its name, type, date, extension, and more.

Free scan – Want to test the app before investing? Scan and preview lost photos and documents before you can recover them.

User-centric UI – The app’s interface is super simple and straightforward. Everything is well-labeled, and all it takes is three steps.

Secure data even after recovery – If you want to delete or move the data you recovered via 4DDiG, you’ll have to enter Mac’s password. An interesting way to avoid any more accidents.

The Three-step recovery

Choose the scan location

Note: You might get a warning to disable SIP before you can proceed. Simply follow the onscreen instruction. You’ll need to restart the Mac, so save a copy of onscreen instructions for reference.

Locate the data you want

Recover your lost files

Note: The downloaded folder might have sub-folders that segregate the files. So before you can admire the found file, you might have to do a bit of rummaging.

Should you dig Tenorshare 4DDiG?

iGeekometer 

User interface: 85%

Features: 85%

Data recovery efficacy and speed: 93%

Compatibility: 95%

Value for money: 90%

Tenorshare 4DDiG bowled me over with the sheer amount of data and device compatibility it boasts. And it is almost as good as it sounds on paper: it's simple, effective, and definitely worth a try.

That said, the software could not perform as promised on my M1 Mac, which is a bit upsetting. Nevertheless, between its simple UI and three-step guide, I believe even a non-techy user can recover data with ease.

Pros

Simplifies the overall process

Supports a variety of file types and devices

Good data recovery speed, even during deep scan

Free scan to test before you invest

30-day money-back guarantee

Cons

Scans only one drive at a time

Doesn’t support M1 Macs

Monthly license is quite expensive

Price:

1 Month License – $55.95

1 Year License – $59.95

Lifetime License – $69.95


Author Profile

Arshmeet

A self-professed Geek who loves to explore all things Apple. I thoroughly enjoy discovering new hacks, troubleshooting issues, and finding and reviewing the best products and apps currently available. My expertise also includes curating opinionated and honest editorials. If not this, you might find me surfing the web or listening to audiobooks.

Comprehensive & Practical Inferential Statistics Guide For Data Science

Introduction

Statistics is one of the key fundamental skills required for data science. Any expert in data science would surely recommend learning / upskilling yourself in statistics.

However, if you go out and look for resources on statistics, you will see that a lot of them tend to focus on the mathematics: they dwell on deriving formulas rather than simplifying the concepts. I believe statistics can be understood in a very simple and practical manner, which is why I have created this guide.

In this guide, I will take you through Inferential Statistics, which is one of the most important concepts in statistics for data science. I will take you through all the related concepts of Inferential Statistics and their practical applications.

This guide would act as a comprehensive resource to learn Inferential Statistics. So, go through the guide, section by section. Work through the examples and develop your statistics skills for data science.

Read on!

Table of Contents

Why do we need Inferential Statistics?

Pre-requisites

Sampling Distribution and Central Limit Theorem

Hypothesis Testing

Types of Error in Hypothesis Testing

T-tests

Different types of t-test

ANOVA

Chi-Square Goodness of Fit

Regression and ANOVA

Coefficient of Determination (R-Squared)

1. Why do we need Inferential Statistics?

Suppose, you want to know the average salary of Data Science professionals in India. Which of the following methods can be used to calculate it?

Meet every Data Science professional in India. Note down their salaries and then calculate the total average?

Or hand pick a number of professionals in a city like Gurgaon. Note down their salaries and use it to calculate the Indian average.

Well, the first method is not impossible but it would require an enormous amount of resources and time. But today, companies want to make decisions swiftly and in a cost-effective way, so the first method doesn’t stand a chance.

On the other hand, the second method seems feasible. But there is a caveat: what if the population of Gurgaon is not reflective of the entire population of India? There is then a good chance of making a very wrong estimate of the salary of Indian Data Science professionals.

Now, what method can be used to estimate the average salary of all data scientists across India?

Enter Inferential Statistics

In simple language, Inferential Statistics is used to draw inferences beyond the immediate data available.

With the help of inferential statistics, we can answer the following questions:

Making inferences about the population from the sample.

Concluding whether a sample is significantly different from the population. For example, let’s say you collected the salary details of Data Science professionals in Bangalore. And you observed that the average salary of Bangalore’s data scientists is more than the average salary across India. Now, we can conclude if the difference is statistically significant.

If adding or removing a feature from a model will really help to improve the model.

If one model is significantly better than the other?

Hypothesis testing in general.

I am sure by now you must have got a gist of why inferential statistics is important. I will take you through the various techniques & concepts involved in Inferential statistics. But first, let’s discuss what are the prerequisites for understanding Inferential Statistics.

2. Pre-Requisites

To begin with Inferential Statistics, one must have a good grasp over the following concepts:

Probability

Basic knowledge of Probability Distributions

Descriptive Statistics

If you are not comfortable with either of the three concepts mentioned above, you must go through them before proceeding further.

Throughout the entire article, I will be using a few terminologies quite often. So, here is a brief description of them:

Statistic – A Single measure of some attribute of a sample. For eg: Mean/Median/Mode of a sample of Data Scientists in Bangalore.

Population Statistic – The statistic of the entire population in context. For eg: Population mean for the salary of the entire population of Data Scientists across India.

Sample Statistic – The statistic of a group taken from a population. For eg: Mean of salaries of all Data Scientists in Bangalore.

Standard Deviation – It is the amount of variation in the population data. It is given by σ.

Standard Error – It is the amount of variation in a sample statistic (such as the sample mean) across repeated samples. It is related to the Standard Deviation as σ/√n, where n is the sample size.

3. Sampling Distribution and Central Limit Theorem

Suppose you note down the salary of 100 random Data Science professionals in Gurgaon, calculate the mean, and repeat the procedure, say, 200 times (arbitrarily).

When you plot a frequency graph of these 200 means, you are likely to get a curve similar to the one below.

This looks very much similar to the normal curve that you studied in the Descriptive Statistics. This is called Sampling Distribution or the graph obtained by plotting sample means. Let us look at a more formal description of a Sampling Distribution.

A Sampling Distribution is a probability distribution of a statistic obtained through a large number of samples drawn from a specific population.

A Sampling Distribution behaves much like a normal curve and has some interesting properties like :

The shape of the Sampling Distribution does not reveal anything about the shape of the population. For example, for the above Sampling Distribution, the population distribution may look like the below graph.

Population Distribution

Sampling Distribution helps to estimate the population statistic.

But how ?

This will be explained using a very important theorem in statistics – The Central Limit Theorem.

3.1 Central Limit Theorem

It states that when plotting a sampling distribution of means, the mean of the sample means will be equal to the population mean, and the sampling distribution will approach a normal distribution with standard deviation equal to σ/√n, where σ is the standard deviation of the population and n is the sample size.

Points to note:

Central Limit Theorem holds true irrespective of the type of distribution of the population.

Now, we have a way to estimate the population mean by just making repeated observations of samples of a fixed size.

Greater the sample size, lower the standard error and greater the accuracy in determining the population mean from the sample mean.

This seemed too technical, didn't it? Let's break it down and understand it point by point.

The number of samples has to be sufficient (generally more than 50) to satisfactorily achieve a normal curve. Also, care has to be taken to keep the sample size fixed, since any change in sample size will change the shape of the sampling distribution and it will no longer be bell-shaped.

As we increase the sample size, the sampling distribution squeezes from both sides giving us a better estimate of the population statistic since it lies somewhere in the middle of the sampling distribution (generally). The below image will help you visualize the effect of sample size on the shape of distribution.
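A small simulation (purely illustrative, not part of the original guide) makes this concrete: draw many samples of a fixed size from a skewed population, plot the sample means, and watch the histogram become narrower and more bell-shaped as the sample size grows.

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)
population = rng.exponential(scale=2.0, size=100_000)   # heavily skewed population

for n in (10, 50, 200):                                 # fixed sample sizes to compare
    sample_means = [rng.choice(population, size=n).mean() for _ in range(1000)]
    plt.hist(sample_means, bins=40, alpha=0.5, label=f"n={n}")

plt.axvline(population.mean(), color="black", linestyle="--", label="population mean")
plt.legend()
plt.show()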

Now, since we have collected the samples and plotted their means, it is important to know where the population mean lies with respect to a particular sample mean and how confident can we be about it. This brings us to our next topic – Confidence Interval.

3.2 Confidence Interval

The confidence interval is a type of interval estimate from the sampling distribution which gives a range of values in which the population statistic may lie. Let us understand this with the help of an example.

We know that 95% of the values lie within 2 (1.96 to be more accurate) standard deviations of the mean of a normal distribution curve. So, for the above curve, the blue shaded portion represents the confidence interval for a sample mean of 0.

Formally, the Confidence Interval is defined as:

C.I. = x(bar) ± z*(σ/√n)

where, x(bar) = the sample mean

z = the z-value for the desired confidence level

σ = the population standard deviation

n = the sample size

For a 95% confidence level (i.e., α = 0.05), z = 1.96.

Now there is one more term with which you should be familiar: Margin of Error. It is given as (z*σ)/√n and is the sampling error allowed by the surveyor or the person who collected the samples. That means if a sample mean lies within the margin of error range, it might be possible that its actual value is equal to the population mean and the difference is occurring by chance. Anything outside the margin of error is considered statistically significant.

And it is easy to infer that the error can be both positive and negative side. The whole margin of error on both sides of the sample statistic constitutes the Confidence Interval. Numerically, C.I is twice of Margin of Error.

The below image will help you better visualize Margin of Error and Confidence Interval.

The shaded portion on horizontal axis represents the Confidence Interval and half of it is Margin of Error which can be in either direction of x (bar).

Interesting points to note about Confidence Intervals:

Confidence Intervals can be built with different degrees of confidence suitable to a user's needs, like 70%, 90%, etc.

Greater the sample size, smaller the Confidence Interval, i.e more accurate determination of population mean from the sample means.

There are different confidence intervals for different sample means. For example, a sample mean of 40 will have a different confidence interval from a sample mean of 45.

By 95% Confidence Interval, we do not mean that – The probability of a population mean to lie in an interval is 95%. Instead, 95% C.I means that 95% of the Interval estimates will contain the population statistic.

Many people do not have the right understanding of confidence intervals and often interpret them incorrectly. So, I would like you to take your time visualizing the 4th point above and let it sink in.

3.3 Practical example

Calculate the 95% confidence interval for a sample mean of 40 and sample standard deviation of 40 with sample size equal to 100.

Solution:

We know, z-value for 95% C.I is 1.96. Hence, Confidence Interval (C.I) is calculated as:

C.I. = [{x(bar) – (z*s/√n)}, {x(bar) + (z*s/√n)}]

C.I. = [{40 – (1.96*40/10)}, {40 + (1.96*40/10)}]

C.I = [32.16, 47.84]
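The same computation can be done in a few lines of Python, using scipy's normal distribution to fetch the z-value (a sketch of the worked example above):

import numpy as np
from scipy import stats

x_bar, s, n = 40, 40, 100
z = stats.norm.ppf(0.975)            # two-tailed z for 95% confidence, ~1.96
margin = z * s / np.sqrt(n)

ci = (x_bar - margin, x_bar + margin)
print(ci)                            # approximately (32.16, 47.84)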

4. Hypothesis Testing

Before I get into the theoretical explanation, let us understand Hypothesis Testing by using a simple example.

Example: Class 8th has a mean score of 40 marks out of 100. The principal of the school decided that extra classes are necessary in order to improve the performance of the class. The class scored an average of 45 marks out of 100 after taking extra classes. Can we be sure whether the increase in marks is a result of extra classes or is it just random?

Hypothesis testing lets us identify that. It lets a sample statistic be checked against a population statistic, or against the statistic of another sample, to study any intervention. Extra classes are the intervention in the above example.

Hypothesis testing is defined in two terms – Null Hypothesis and Alternate Hypothesis.

The Null Hypothesis states that the sample statistic is equal to the population statistic. For example, the Null Hypothesis for the above example would be that the average marks after the extra classes are the same as before the classes.

The Alternate Hypothesis for this example would be that the marks after the extra classes are significantly different from those before the classes.

Hypothesis Testing is done on different levels of confidence and makes use of z-score to calculate the probability. So for a 95% Confidence Interval, anything above the z-threshold for 95% would reject the null hypothesis.

Points to be noted:

We cannot accept the Null hypothesis, only reject it or fail to reject it.

As a practical tip, the Null Hypothesis is generally the statement we want to disprove. For example, suppose you want to prove that students performed better on their exam after taking extra classes. The Null Hypothesis, in this case, would be that the marks obtained after the classes are the same as before the classes.

5. Types of Errors in Hypothesis Testing

Now we have defined a basic Hypothesis Testing framework. It is important to look into some of the mistakes that are committed while performing Hypothesis Testing and try to classify those mistakes if possible.

Now, look at the Null Hypothesis definition above. What we notice at first glance is that it is a statement subjective to the tester, like you and me, and not a fact. That means there is a possibility that the Null Hypothesis can be true or false, and we may end up committing mistakes along the same lines.

There are two types of errors that are generally encountered while conducting Hypothesis Testing.

Type I error: Look at the following scenario – A male human tested positive for being pregnant. Is it even possible? This surely looks like a case of False Positive. More formally, it is defined as the incorrect rejection of a True Null Hypothesis. The Null Hypothesis, in this case, would be – Male Human is not pregnant.

Type II error: Look at another scenario where our Null Hypothesis is "A male human is pregnant" and the test supports the Null Hypothesis. This looks like a case of a False Negative. More formally, it is defined as the failure to reject a false Null Hypothesis.

The below image will summarize the types of error :

6. T-tests

T-tests are very similar to z-scores; the only difference is that instead of the Population Standard Deviation, we now use the Sample Standard Deviation. The rest is the same as before: calculating probabilities on the basis of t-values.

The Sample Standard Deviation is given as:

s = √[ Σ(x – x(bar))² / (n – 1) ]

where n – 1 is Bessel's correction for estimating the population parameter.

Another difference between z-scores and t-values is that t-values depend on the Degrees of Freedom of a sample. Let us define what degrees of freedom are for a sample.

The Degrees of Freedom – the number of values that have the choice of taking more than one arbitrary value. For example, in a sample of size 10 with mean 10, 9 values can be arbitrary but the 10th value is forced by the sample mean.

Points to note about the t-tests:

Greater the difference between the sample mean and the population mean, greater the chance of rejecting the Null Hypothesis. Why? (We discussed this above.)

Greater the sample size, greater the chance of rejection of Null Hypothesis.

7. Different types of t-tests

7.1 1-sample t-test

This is the same test as we described above. This test is used to:

Determine whether the mean of a group differs from the specified value.

Calculate a range of values that are likely to include the population mean.

t = (X(bar) – μ) / (s/√N)

where, X(bar) = sample mean

μ = population mean

s = sample standard deviation

N = sample size

7.2 Paired t-test

Paired t-test is performed to check whether there is a difference in mean after a treatment on a sample in comparison to before. It checks whether the Null hypothesis: The difference between the means is Zero, can be rejected or not.

The above example suggests that the Null Hypothesis should not be rejected and that there is no significant difference in means before and after the intervention, since the p-value is not less than the alpha value (0.05) and the t-statistic does not exceed the critical t-value. The Excel sheet for the above exercise is available here.

The paired t-statistic is given as:

t = d(bar) / (s_d/√n)

where, d(bar) = mean of the case-wise differences between before and after,

s_d = standard deviation of the differences, and

n = sample size.
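As a quick illustration, scipy's ttest_rel performs this test directly; the before/after scores below are made-up numbers, not data from the article:

import numpy as np
from scipy import stats

before = np.array([72, 68, 75, 80, 65, 70, 74, 69])   # hypothetical scores before the intervention
after  = np.array([75, 70, 74, 83, 66, 72, 78, 71])   # hypothetical scores after the intervention

t_stat, p_value = stats.ttest_rel(after, before)
print(t_stat, p_value)   # reject the Null Hypothesis if p_value < 0.05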

7.3 2-sample t-test

This test is used to:

Determine whether the means of two independent groups differ.

Calculate a range of values that is likely to include the difference between the population means.

The 2-sample t-test (its formula is given below) can be used in situations such as checking whether two machines are producing the same output. The points to be noted for this test are:

The groups to be tested should be independent.

The groups’ distribution should not be highly skewed.

t = (X1(bar) – X2(bar)) / √(s1²/n1 + s2²/n2)

where, X1(bar), X2(bar) = the sample means of the two groups,

s1, s2 = the sample standard deviations of the two groups, and

n1, n2 = the sample sizes of the two groups.

7.4 Practical example

We will understand how to identify which t-test to be used and then proceed on to solve it. The other t-tests will follow the same argument.

Example: A population has mean weight of 68 kg. A random sample of size 25 has a mean weight of 70 with standard deviation =4. Identify whether this sample is representative of the population?

Step 0: Identifying the type of t-test

Number of samples in question = 1

Number of times the sample is in study = 1

Any intervention on sample = No

Recommended t-test = 1- sample t-test.

Had there been 2 samples, we would have opted for the 2-sample t-test, and if there had been 2 observations on the same sample, we would have opted for the paired t-test.

Step 1: State the Null and Alternate Hypothesis

Null Hypothesis: The sample mean and population mean are same.

Alternate Hypothesis: The sample mean and population mean are different.

Step 2: Calculate the appropriate test statistic

df = 25-1 =24

t= (70-68)/(4/√25) = 2.5

Now, for a 95% confidence level, the t-critical value (two-tail) for rejecting the Null Hypothesis at 24 d.f. is 2.06. Since 2.5 > 2.06, we can reject the Null Hypothesis and conclude that the two means are different.
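The same calculation can be reproduced in code, comparing the t-statistic against the two-tailed critical value from scipy (a sketch of this worked example):

import numpy as np
from scipy import stats

pop_mean, x_bar, s, n = 68, 70, 4, 25
t_stat = (x_bar - pop_mean) / (s / np.sqrt(n))       # 2.5

df = n - 1
t_critical = stats.t.ppf(0.975, df)                  # ~2.06 for 24 d.f.

print(t_stat, t_critical, abs(t_stat) > t_critical)  # True -> reject the Null Hypothesis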

You can use the t-test calculator here.

8. ANOVA

ANOVA (Analysis of Variance) is used to check if at least one of two or more groups have statistically different means. Now, the question arises – Why do we need another test for checking the difference of means between independent groups? Why can we not use multiple t-tests to check for the difference in means?

The answer is simple. Multiple t-tests will have a compound effect on the error rate of the result. Performing t-test thrice will give an error rate of ~15% which is too high, whereas ANOVA keeps it at 5% for a 95% confidence interval.

To perform an ANOVA, you must have a continuous response variable and at least one categorical factor with two or more levels. ANOVA requires data from approximately normally distributed populations with equal variances between factor levels. However, ANOVA procedures work quite well even if the normality assumption has been violated unless one or more of the distributions are highly skewed or if the variances are quite different.

ANOVA is measured using a statistic known as F-Ratio. It is defined as the ratio of Mean Square (between groups) to the Mean Square (within group).

Mean Square (between groups) = Sum of Squares (between groups) / degree of freedom (between groups)

Mean Square (within group) = Sum of Squares (within group) / degree of freedom (within group)

Sum of Squares (between groups): SS(B) = Σ n*(X_i(bar) – X(bar))², summed over the p groups

Sum of Squares (within groups): SS(W) = Σ (x – X_i(bar))², summed over all observations within each group and then over the p groups

Here, p = the number of groups

n = the number of observations in a group

X_i(bar) = the mean of a particular group

X(bar) = the mean of all the observations

Now, let us understand the degree of freedom for within group and between groups respectively.

Between groups : If there are k groups in ANOVA model, then k-1 will be independent. Hence, k-1 degree of freedom.

Within groups : If N represents the total observations in ANOVA (∑n over all groups) and k are the number of groups then, there will be k fixed points. Hence, N-k degree of freedom.

8.1 Steps to perform ANOVA

Hypothesis Generation

Null Hypothesis : Means of all the groups are same

Alternate Hypothesis : Mean of at least one group is different

Calculate within group and between groups variability

Calculate F-Ratio

Calculate probability using F-table

Reject/fail to Reject Null Hypothesis

There are various other forms of ANOVA too, like Two-way ANOVA, MANOVA, ANCOVA, etc., but One-Way ANOVA suffices for the purposes of this guide.

Practical applications of ANOVA in modeling are:

Identifying whether a categorical variable is relevant to a continuous variable.

Identifying whether a treatment was effective to the model or not.

8.2 Practical Example

Suppose there are 3 chocolates in town and their sweetness is quantified by some metric (S). Data is collected on the three chocolates. You are given the task to identify whether the mean sweetness of the 3 chocolates are different. The data is given as below:

Type A: 643, 655, 702 (mean = 666.67)

Type B: 469, 427, 525 (mean = 473.67)

Type C: 484, 456, 402 (mean = 447.33)

Here, first we have calculated the sample mean and sample standard deviation for you.

Now we will proceed step-wise to calculate the F-Ratio (ANOVA statistic).

Step 1: Stating the Null and Alternate Hypothesis

Null Hypothesis: Mean sweetness of the three chocolates are same.

Alternate Hypothesis: Mean sweetness of at least one of the chocolates is different.

Step 2: Calculating the appropriate ANOVA statistic

In this part, we will be calculating SS(B), SS(W), SS(T) and then move on to calculate MS(B) and MS(W). The thing to note is that,

Total Sum of Squares [SS(t)] = Between Sum of Squares [SS(B)] + Within Sum of Squares [SS(W)].

So, we need to calculate any two of the three parameters using the data table and formulas given above.

As, per the formula above, we need one more statistic i.e Grand Mean denoted by X(bar) in the formula above.

X bar = (643+655+702+469+427+525+484+456+402)/9 = 529.22

SS(B)=[3*(666.67-529.22)^2]+ [3*(473.67-529.22)^2]+[3*(447.33-529.22)^2] = 86049.55

SS (W) = [(643-666.67)^2+(655-666.67)^2+(702-666.67)^2] + [(469-473.67)^2+(427-473.67)^2+(525-473.67)^2] + [(484-447.33)^2+(456-447.33)^2+(402-447.33)^2]= 10254

MS(B) = SS(B) / df(B) = 86049.55 / (3-1) = 43024.78

MS(W) = SS(W) / df(W) = 10254/(9-3) = 1709

F-Ratio = MS(B) / MS(W) = 25.17 .

Now, for a 95 % confidence level, F-critical to reject Null Hypothesis for degrees of freedom(2,6) is 5.14 but we have 25.17 as our F-Ratio.

So, we can confidently reject the Null Hypothesis and come to a conclusion that at least one of the chocolate has a mean sweetness different from the others.
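The same conclusion can be cross-checked with scipy's one-way ANOVA, using the chocolate sweetness values from the table above:

from scipy import stats

type_a = [643, 655, 702]
type_b = [469, 427, 525]
type_c = [484, 456, 402]

f_ratio, p_value = stats.f_oneway(type_a, type_b, type_c)
print(f_ratio, p_value)   # F is roughly 25.2 and p < 0.05, so reject the Null Hypothesis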

You can use the F-calculator here.

Note: ANOVA only tells us whether the means of the different groups are the same or not. It doesn't help us identify which mean is different. To know which group mean is different, we can use another test known as the Least Significant Difference test.

9. Chi-square Goodness of Fit Test

Sometimes, the variable under study is not a continuous variable but a categorical variable. Chi-square test is used when we have one single categorical variable from the population.

Let us understand this with help of an example. Suppose a company that manufactures chocolates, states that they manufacture 30% dairy milk, 60% temptation and 10% kit-kat. Now suppose a random sample of 100 chocolates has 50 dairy milk, 45 temptation and 5 kitkats. Does this support the claim made by the company?

Let us state our Hypothesis first.

Null Hypothesis: The claims are True

Alternate Hypothesis: The claims are False.

The Chi-Square statistic is given by:

χ² = Σ (O_i – E_i)² / E_i

where, O_i = sample or observed values

E_i = expected values based on the population proportions

The summation is taken over all the levels of a categorical variable.

E_i = [n * p_i], i.e., the Expected value of a level (i) is equal to the product of the sample size and the percentage of that level in the population.

Let us now calculate the Expected values of all the levels.

E (dairy milk)= 100 * 30% = 30

E (temptation) = 100 * 60% =60

E (kitkat) = 100 * 10% = 10

Calculating chi-square = [(50-30)²/30 + (45-60)²/60 + (5-10)²/10] = 19.58

For 2 degrees of freedom at the 95% confidence level, the chi-square critical value is 5.99, which is well below 19.58. So we reject the Null Hypothesis.
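scipy's chisquare function reproduces this calculation directly from the observed and expected counts:

from scipy import stats

observed = [50, 45, 5]
expected = [30, 60, 10]

chi2, p_value = stats.chisquare(f_obs=observed, f_exp=expected)
print(chi2, p_value)   # chi2 is about 19.58 and p < 0.05, so reject the Null Hypothesis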

10. Regression and ANOVA

If you have studied some basic Machine Learning algorithms, the first algorithm that you must have studied is Regression. If we recall those lessons on Regression, what we generally do is calculate the weights for the features present in the model to better predict the output variable. But finding the right set of feature weights, or features for that matter, is not always possible.

It is highly likely that the existing features in the model are not fit for explaining the trend in the dependent variable, or that the calculated feature weights fail at explaining that trend. What is important is knowing the degree to which our model is successful in explaining the trend (variance) in the dependent variable.

Enter ANOVA.

With the help of ANOVA techniques, we can analyse a model performance very much like we analyse samples for being statistically different or not.

But with regression things are not so easy. We do not have a mean of any kind to compare, or a sample as such, but we can find good alternatives in our regression model which can substitute for the mean and the sample.

Sample in case of regression is a regression model itself with pre-defined features and feature weights whereas mean is replaced by variance(of both dependent and independent variables).

Through our ANOVA test we would like to know the amount of variance explained by the Independent variables in Dependent Variable VS the amount of variance that was left unexplained.

It is intuitive to see that the larger the unexplained variance (trend) of the dependent variable, the smaller the ratio and the less effective our regression model is. On the other hand, if we have a large explained variance, then it is easy to see that our regression model was successful in explaining the variance in the dependent variable and the model is more effective. The ratio of Explained Variance to Unexplained Variance is called the F-Ratio.

Let us now define these explained and unexplained variances to find the effectiveness of our model.

1. Regression (Explained) Sum of Squares – It is defined as the amount of variation explained by the Regression model in the dependent variable.

Mathematically, it is calculated as:

Regression SS = Σ (y_hat – y(bar))²

where, y_hat = the predicted value and

y(bar) = mean of the actual y values.

Interpreting Regression sum of squares –

If our model is a good model for the problem at hand then it would produce an output which has distribution as same to the actual dependent variable. i.e it would be able to capture the inherent variation in the dependent variable.

2. Residual Sum of Squares – It is defined as the amount of variation in the dependent variable which is not explained by the Regression model.

Mathematically, it is calculated as:

Residual SS = Σ (y – f(x))²

where, y = actual 'y' value

f(x) = predicted value

Interpretation of Residual Sum of Squares –

It can be interpreted as the amount by which the predicted values deviated from the actual values. Large deviation would indicate that the model failed at predicting the correct values for the dependent variable.

Let us now  work out F-ratio step by step. We will be making using of the Hypothesis Testing framework described above to test the significance of the model.

While calculating the F-Ratio care has to be taken to incorporate the effect of degree of freedom. Mathematically, F-Ratio is the ratio of [Regression Sum of Squares/df(regression)] and [Residual Sum of Squares/df(residual)].

We will be understanding the entire concept using an example and this excel sheet.

Step 0: State the Null and Alternate Hypothesis

Null Hypothesis: The model is unable to explain the variance in the dependent variable (Y).

Alternate Hypothesis: The model is able to explain the variance in dependent variable (Y)

Step 1:

Calculate the regression equation for X and Y using Excel’s in-built tool.

Step 2:

Predict the values of y for each row of data.

Step 3:

Calculate y(mean) – mean of the actual y values which in this case turns out to be 0.4293548387.

Step 4:

Calculate the Regression Sum of Squares using the above-mentioned formula. It turned out to be 2.1103632473

The Degree of freedom for regression equation is 1, since we have only 1 independent variable.

Step 5:

Calculate the Residual Sum of Squares using the above-mentioned formula. It turned out to be 0.672210946.

Degree of Freedom for residual = Total degree of freedom – Degree of freedom(regression)

=(62-1) – 1 = 60

Step 6:

F-Ratio = (2.1103632473/1)/(0.672210946/60) = 188.366

Now, for 95% confidence, the F-critical value to reject the Null Hypothesis for (1, 60) degrees of freedom is 4. But we have an F-ratio of 188, so we can safely reject the Null Hypothesis and conclude that the model explains the variation to a large extent.
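The F-ratio and its critical value can be reproduced from the sums of squares computed above; the sketch below uses scipy's F distribution for the critical value:

from scipy import stats

reg_ss, res_ss = 2.1103632473, 0.672210946
df_reg, df_res = 1, 60

f_ratio = (reg_ss / df_reg) / (res_ss / df_res)      # ~188.37
f_critical = stats.f.ppf(0.95, df_reg, df_res)       # ~4.0

print(f_ratio, f_critical, f_ratio > f_critical)     # True -> reject the Null Hypothesis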

11. Coefficient of Determination (R-Square)

It is defined as the ratio of the amount of variance explained by the regression model to the total variation in the data. It represents how much of the variation in the dependent variable is captured by the model.

We already calculated the Regression SS and Residual SS. Total SS is the sum of Regression SS and Residual SS.

Total SS = 2.1103632473+ 0.672210946 = 2.78257419

Co-efficient of Determination = 2.1103632473/2.78257419 = 0.7588

12. Correlation Coefficient

This is another useful statistic which is used to determine the correlation between two variables. For simple linear regression it is the square root of the Coefficient of Determination, and it ranges from -1 to 1, where 0 represents no correlation, 1 represents a strong positive correlation, and -1 represents a strong negative correlation.

End Notes

So, this guide comes to an end, having explained the theory along with practical examples of various Inferential Statistics concepts. This guide has been built around the Hypothesis Testing framework, and I hope it will be a one-stop solution for a quick Inferential Statistics reference.


A Quick Glance On Cpanel Alternative


Some claim that, in reality, cPanel's pricing hits smaller companies and developers hardest, with single-license plans starting at $15 a month, more than most users pay for their servers every month. cPanel's pricing policy is now account-based, making it extremely costly, particularly for resellers.

Different cPanel Alternatives

Given below are the different cPanel Alternatives:

1. Moss.sh

2. SpinupWP

SpinupWP is a cloud server control panel designed especially for WordPress. Some may see that as a downside, though many would not see it as one at all: if you intend to use your servers to host WordPress, SpinupWP has been planned from the ground up to do exactly that. The platform was developed by Delicious Brains, a small WordPress development company with a solid reputation in the industry. As you would imagine, their support (Monday to Friday) is brilliant and they know WordPress inside out. The dashboard is very simple, with minimal setup to take care of for self-hosted WordPress.

3. ServerPilot

Next up, we've got ServerPilot. ServerPilot is a hosted server management dashboard, comparable to RunCloud, Moss and SpinupWP, making it simple to handle servers, whether you have one or over a hundred. It is specifically designed for people who want to use their servers to host PHP web applications and WordPress websites. A U.S.-based support team is committed to ensuring that, if you have concerns, you can get started without any problems. ServerPilot requires an Ubuntu 20.04 or 18.04 64-bit server and is installed via SSH using a special install command (which is provided for that server).


4. Interworx

5. DirectAdmin

DirectAdmin is much like cPanel, but it is built to be quicker and to need fewer server resources. It comes with a variety of webmail plugins, security add-ons, custom graphic skins and more. Thanks to its popularity, DirectAdmin also provides all the basics you would need for controlling resource use, DNS clustering and automated updates. If you are a cPanel user and not a fan of change, you could be very comfortable with this, as it also provides auto-installer integrations such as Softaculous (the same service behind cPanel's CMS installations).

6. Virtualmin

Virtualmin is built on top of Webmin, a popular system administration interface for Linux. Virtualmin provides a solidly built, usable, free open-source version, but paid versions are also available. It has plenty of customization options and services, enough to take on the competition. Beginners should only use it if they want to learn more and develop their skills; otherwise, the user interface might be a little overwhelming.

7. Ajenti

Ajenti is an extendable open-source control panel. It helps users access a remote Linux box easily and safely via the web, with terminals, text editors, file managers and other tools. The Ajenti admin panel provides remote terminal control and user management, and lets you set up firewalls, install packages and monitor resource use, among other features. There are many Ajenti plugins available, but the platform has been created with developers in mind, so if you know Python and JavaScript you can easily create more to extend its core functionality.

Recommended Articles

This is a guide to cPanel Alternative. Here we discuss the introduction and the different cPanel alternatives respectively. You may also have a look at the following articles to learn more –

Top 6 Data Science Jobs In The Data-Driven Industry

Data science careers are doing very well in the market. Data science is making remarkable progress in many areas of technology, the economy and commerce, and that is not an exaggeration. It is no surprise that data scientists will have many job opportunities.

It is true: multiple projections show that the demand for data scientists will rise significantly in the next five years. It is also clear that demand far exceeds supply. Data science is a highly specialized field that requires a passion for math and strong analytical skills, and the short supply of these skills perpetuates this gap.

Every organization in the world is now data-driven. The big five, Google, Amazon, Meta (Facebook), Apple and Microsoft, are data-driven organizations, but they aren't the only ones. Nearly every company in the market uses data-driven decision-making, and the data sets can be customized quickly.

Amazon keeps meticulous records of all our choices and preferences in the world of shopping. It customizes the data to only send information that is relevant to the search terms of specific customers. Both the client and the company benefit from this process. This increases the company’s profit and helps the customer by acquiring goods at lower prices than they expected.

Data sets have a wider impact than just their positive effects. Data sets have positive effects on the health sphere by making people aware about critical health issues and other health-related items. It can also have an impact on agriculture, providing valuable information to farmers about efficient production and delivery of food.

It is evident that data scientists are needed around the globe, which makes their job prospects bright. Let’s take a look at some of the most exciting data science jobs available to data scientists who want to be effective in data management within organizations.

Top 6 Data Science Jobs in the Data-driven Industry

1. Data scientists

Average Salary: US$100,000.


2. Data architects

Average Salary: US$95,000/annum

Roles and Responsibilities: This employee is responsible for developing organizational data strategies that convert business requirements into technical requirements.

3. Data engineers

Average Salary: US$110,000 per year


4. Data analysts

Average Salary: US$70,000 per year

Roles and Responsibilities: A data analyst must analyze real-time data using statistical techniques and tools in order to present reports to management. It is crucial to create and maintain databases and to analyze and interpret current trends and patterns within those databases.

5. Data storyteller

Average Salary: US$60,000 per year


6. Database administrators

Average Salary: US$80,000 per year

Roles and Responsibilities: The database administrator must be proficient in database software to manage data effectively and keep it up to date for data design and development. This employee also manages database access and prevents data loss and corruption.

These are only a few of the many data science jobs available to the world. In recent years, data science has been a thriving field in many industries around the globe. In this fast-paced world, data is increasingly valuable and there are many opportunities to fill data-centric roles within reputable organizations.
