You are reading the article **A Comprehensive Guide To Sharding In Data Engineering For Beginners** updated in December 2023 on the website Achiashop.com. We hope that the information we have shared is helpful to you. If you find the content interesting and meaningful, please share it with your friends and continue to follow and support us for the latest updates. *Suggested January 2024 A Comprehensive Guide To Sharding In Data Engineering For Beginners*

This article was published as a part of the Data Science Blogathon.

Big Data is a very commonly heard term these days. A reasonably large volume of data that cannot be handled on a small capacity configuration of servers can be called ‘Big Data’ in that particular context. In today’s competitive world, every business organization relies on decision-making based on the outcome of the analyzed data they have on hand. The data pipeline starting from the collection of raw data to the final deployment of machine learning models based on this data goes through the usual steps of cleaning, pre-processing, processing, storage, model building, and analysis. Efficient handling and accuracy depend on resources like software, hardware, technical workforce, and costs. Answering queries requires specific data probing in either static or dynamic mode with consistency, reliability, and availability. When data is large, inadequacy in handling queries due to the size of data and low capacity of machines in terms of speed, memory may prove problematic for the organization. This is where sharding steps in to address the above problems.

This guide explores the basics and various facets of data sharding, the need for sharding, and its pros, and cons.

What is Data Sharding?With the increasing use of IT technologies, data is accumulating at an overwhelmingly faster pace. Companies leverage this big data for data-driven decision-making. However, with the increased size of the data, system performance suffers due to queries becoming very slow if the dataset is entirely stored in a single database. This is why data sharding is required.

Image Source: Author

In simple terms, sharding is the process of dividing and storing a single logical dataset into databases that are distributed across multiple computers. This way, when a query is executed, a few computers in the network may be involved in processing the query, and the system performance is faster. With increased traffic, scaling the databases becomes non-negotiable to cope with the increased demand. Furthermore, several sharding solutions allow for the inclusion of additional computers. Sharding allows a database cluster to grow with the amount of data and traffic received.

Let’s look at some key terms used in the sharding of databases.

Scale-out and Scaling up: The process of creating or removing databases horizontally done to improve performance and increase capacity is called scale-out. Scaling up refers to the practice of adding physical resources to an existing database server, like memory, storage, and CPU, to improve performance.

Sharding: Sharding distributes similarly-formatted large data over several separate databases.

Chunk: A chunk is made up of sharded data subset and is bound by lower and higher ranges based on the shard key.

Shard: A shard is a horizontally distributed portion of data in a database. Data collections with the same partition keys are called logical shards, which are then distributed across separate database nodes.

Sharding Key: A sharding key is a column of the database to be sharded. This key is responsible for partitioning the data. It can be either a single indexed column or multiple columns denoted by a value that determines the data division between the shards. A primary key can be used as a sharding key. However, a sharding key cannot be a primary key. The choice of the sharding key depends on the application. For example, userID could be used as a sharding key in banking or social media applications.

Logical shard and Physical Shard: A chunk of the data with the same shard key is called a logical shard. When a single server holds one or more than one logical shard, it is called a physical shard.

Shard replicas: These are the copies of the shard and are allotted to different nodes.

Partition Key: It is a key that defines the pattern of data distribution in the database. Using this key, it is possible to direct the query to the concerned database for retrieving and manipulating the data. Data having the same partition key is stored in the same node.

Replication: It is a process of copying and storing data from a central database at more than one node.

Resharding: It is the process of redistributing the data across shards to adapt to the growing size of data.

Are Sharding and Partitioning the same?Both Sharding and Partitioning allow splitting and storing the databases into smaller datasets. However, they are not the same. Upon comparison, we can say that sharding distributes the data and is shared over several machines, but not with partitioning. Within a single unsharded database, partitioning is the process of grouping subsets of data. Hence, the phrases sharding and partitioning are used interchangeably when the terms “horizontal” and “vertical” are used before them. As a result, “horizontal sharding” and “horizontal partitioning” are interchangeable terms.

Vertical Sharding:

Entire columns are split and placed in new, different tables in a vertically partitioned table. The data in one vertical split is different from the data in the others, and each contains distinct rows and columns.

Horizontal Sharding:

Horizontal sharding or horizontal partitioning divides a table’s rows into multiple tables or partitions. Every partition has the same schema and columns but distinct rows. Here, the data stored in each partition is distinct and independent of the data stored in other partitions.

The image below shows how a table can be partitioned both horizontally and vertically.

The ProcessBefore sharding a database, it is essential to evaluate the requirements for selecting the type of sharding to be implemented.

At the start, we need to have a clear idea about the data and how the data will be distributed across shards. The answer is crucial as it will directly impact the performance of the sharded database and its maintenance strategy.

Next, the nature of queries that need to be routed through these shards should also be known. For read queries, replication is a better and more cost-effective option than sharding the database. On the other hand, workload involving writing queries or both read and write queries would require sharding of the database. And the final point to be considered is regarding shard maintenance. As the accumulated data increases, it needs to be distributed, and the number of shards keeps on growing over time. Hence, the distribution of data in shards requires a strategy that needs to be planned ahead to keep the sharding process efficient.

Types of Sharding ArchitecturesOnce you have decided to shard the existing database, the following step is to figure out how to achieve it. It is crucial that during query execution or distributing the incoming data to sharded tables/databases, it goes to the proper shard. Otherwise, there is a possibility of losing the data or experiencing noticeably slow searches. In this section, we will look at some commonly used sharding architectures, each of which has a distinct way of distributing data between shards. There are three main types of sharding architectures – Key or Hash-Based, Range Based, and Directory-Based sharding.

To understand these sharding strategies, say there is a company that handles databases for its client who sell their products in different countries. The handled database might look like this and can often extend to more than a million rows.

We will take a few rows from the above table to explain each sharding strategy.

So, to store and query these databases efficiently, we need to implement sharding on these databases for low latency, fault tolerance, and reliability.

Key Based Sharding

Key Based Sharding or Hash-Based Sharding, uses a value from the column data — like customer ID, customer IP address, a ZIP code, etc. to generate a hash value to shard the database. This selected table column is the shard key. Next, all row values in the shard key column are passed through the hash function.

This hash function is a mathematical function that converts any text input size (usually a combination of numbers and strings) and returns a unique output called a hash value. The hash value is based on the chosen algorithm (depending on the data and application) and the total number of available shards. This value indicates the data should be sent to which shard number.

It is important to remember that a shard key needs to be both unique and static, i.e., it should not change over a period of time. Otherwise, it would increase the amount of work required for update operations, thus slowing down performance.

The Key Based Sharding process looks like this:

Image Source: Author

Features of Key Based Sharding are-

It is easier to generate hash keys using algorithms. Hence, it is good at load balancing since data is equally distributed among the available numbers of shards.

As all shards share the same load, it helps to avoid database hotspots (when one shard contains excessive data as compared to the rest of the shards).

Additionally, in this type of sharding, there is no need to have any additional map or table to hold the information of where the data is stored.

However, it is not dynamic sharding, and it can be difficult to add or remove extra servers from a database depending on the application requirement. The adding or removing of servers requires recalculating the hash key. Since the hash key changes due to a change in the number of shards, all the data needs to be remapped and moved to the appropriate shard number. This is a tedious task and often challenging to implement in a production environment.

To address the above shortcoming of Key Based Sharding, a ‘Consistent Hashing’ strategy can be used.

Consistent Hashing-

In this strategy, hash values are generated both for the data input and the shard, based on the number generated for the data and the IP address of the shard machine, respectively. These two hash values are arranged around a ring or a circle utilizing the 360 degrees of the circle. The hash values that are close to each other are made into a pair, which can be done either clockwise or anti-clockwise.

The data is loaded according to this combination of hash values. Whenever the shards need to be reduced, the values from where the shard has been removed are attached to the nearest shard. A similar procedure is adopted when a shard is added. The possibility of mapping and reorganization problems in the Hash Key strategy is removed in this way as the mapping of the number of shards is reduced noticeably. For example, in Key Based Hashing, if you are required to shuffle the data to 3 out of 4 shards due to a change in the hash function, then in ‘consistent hashing,’ you will require shuffling on a lesser number of shards as compared to the previous one. Moreover, any overloading problem is taken care of by adding replicas of the shard.

Range Based Sharding

Range Based Sharding is the process of sharding data based on value ranges. Using our previous database example, we can make a few distinct shards using the Order value amount as a range (lower value and higher value) and divide customer information according to the price range of their order value, as seen below:

Image source: Author

Features of Range Based Sharding are-

Besides, there is no hashing function involved. Hence, it is possible to easily add more machines or reduce the number of machines. And there is no need to shuffle or reorganize the data.

On the other hand, this type of sharding does not ensure evenly distributed data. It can result in overloading a particular shard, commonly referred to as a database hotspot.

Directory-Based Sharding

This type of sharding relies on a lookup table (with the specific shard key) that keeps track of the stored data and the concerned allotted shards. It tells us exactly where the particular queried data is stored or located on a specific shard. This lookup table is maintained separately and does not exist on the shards or the application. The following image demonstrates a simple example of a Directory-Based Sharding.

Features of Directory-Based Sharding are –

The directory-Based Sharding strategy is highly adaptable. While Range Based Sharding is limited to providing ranges of values, Key Based Sharding is heavily dependent on a fixed hash function, making it challenging to alter later if application requirements change. Directory-Based Sharding enables us to use any method or technique to allocate data entries to shards, and it is convenient to add or reduce shards dynamically.

The only downside of this type of sharding architecture is that there is a need to connect to the lookup table before each query or write every time, which may increase the latency.

Furthermore, if the lookup table gets corrupted, it can cause a complete failure at that instant, known as a single point of failure. This can be overcome by ensuring the security of the lookup table and creating a backup of the table for such events.

Other than the three main sharding strategies discussed above, there can be many more sharding strategies that are usually a combination of these three.

After this detailed sharding architecture overview, we will now understand the pros and cons of sharding databases.

Benefits of ShardingHorizontal Scaling: For any non-distributed database on a single server, there will always be a limit to storage and processing capacity. The ability of sharding to extend horizontally makes the arrangement flexible to accommodate larger amounts of data.

Speed: Speed is one more reason why sharded database design is preferred is to improve query response times. Upon submitting a query to a non-sharded database, it likely has to search every row in the table before finding the result set, you’re searching for. Queries can become prohibitively slow in an application with an unsharded single database. However, by sharding a single table into multiple tables, queries pass through fewer rows, and their resulting values are delivered considerably faster.

Reliability: Sharding can help to improve application reliability by reducing the effect of system failures. If a program or website is dependent on an unsharded database, a failure might render the entire application inoperable. An outage in a sharded database, on the other hand, is likely to affect only one shard. Even if this causes certain users to be unable to access some areas of the program or website, the overall impact would be minimal.

Challenges in ShardingWhile sharding a database might facilitate growth and enhance speed, it can also impose certain constraints. We will go through some of them and why they could be strong reasons to avoid sharding entirely.

Increased complexity: Companies usually face a challenge of complexity when designing shared database architecture. There is a risk that the sharding operation will result in lost data or damaged tables if done incorrectly. Even if done correctly, shard maintenance and organization are likely to significantly influence the outcome.

Shard Imbalancing: Depending on the sharding architecture, distribution on different shards can get imbalanced due to incoming traffic. This results in remapping or reorganizing the data amongst different shards. Obviously, it is time-consuming and expensive.

Unsharding or restoring the database: Once a database has been sharded, it can be complicated to restore it to its earlier version. Backups of the database produced before it was sharded will not include data written after partitioning. As a result, reconstructing the original unsharded architecture would need either integrating the new partitioned data with the old backups or changing the partitioned database back into a single database, both of which would undesirable.

Not supported by all databases: It is to be noted that not every database engine natively supports sharding. There are several databases currently available. Some popular ones are MySQL, PostgreSQL, Cassandra, MongoDB, HBase, Redis, and more. Databases namely MySQL or MongoDB has an auto-sharding feature. As a result, we need to customize the strategy to suit the application requirements when using different databases for sharding.

Now that we have discussed the pros and cons of sharding databases, let us explore situations when one should select sharding.

When should one go for Sharding?

When the application data outgrows the storage capacity and can no longer be stored as a single database, sharding becomes essential.

When the volume of reading/writing exceeds the current database capacity, this results in a higher query response time or timeouts.

When a slow response is experienced while reading the read replicas, it indicates that the network bandwidth required by the application is higher than the available bandwidth.

Excluding the above situations, it is possible to optimize the database instead. These could include carrying out server upgrades, implementing caches, creating one or more read replicas, setting up remote databases, etc. Only when these options cannot solve the problem of increased data, sharding is always an option.

ConclusionWe have covered the fundamentals of sharding in Data Engineering and by now have developed a good understanding of this topic

With sharding, businesses can add horizontal scalability to the existing databases. However, it comes with a set of challenges that need to be addressed. These include considerable complexity, possible failure points, a requirement for additional resources, and more. Thus, sharding is essential only in certain situations.

I hope you enjoyed reading this guide! In the next part of this guide, we will cover how sharding is implemented step-by-step using MongoDB.

Author BioShe loves traveling, reading fiction, solving Sudoku puzzles, and participating in coding competitions in her leisure time.

You can follow her on LinkedIn, GitHub, Kaggle, Medium, Twitter.

The media shown in this article is not owned by Analytics Vidhya and are used at the Author’s discretion.

Related

You're reading __A Comprehensive Guide To Sharding In Data Engineering For Beginners__

## Comprehensive & Practical Inferential Statistics Guide For Data Science

Introduction

Statistics is one of the key fundamental skills required for data science. Any expert in data science would surely recommend learning / upskilling yourself in statistics.

However, if you go out and look for resources on statistics, you will see that a lot of them tend to focus on the mathematics. They will focus on derivation of formulas rather than simplifying the concept. I believe, statistics can be understood in very simple and practical manner. That is why I have created this guide.

In this guide, I will take you through Inferential Statistics, which is one of the most important concepts in statistics for data science. I will take you through all the related concepts of Inferential Statistics and their practical applications.

This guide would act as a comprehensive resource to learn Inferential Statistics. So, go through the guide, section by section. Work through the examples and develop your statistics skills for data science.

Read on!

Table of Contents

Why we need Inferential Statistics?

Pre-requisites

Sampling Distribution and Central Limit Theorem

Hypothesis Testing

Types of Error in Hypothesis Testing

T-tests

Different types of t-test

ANOVA

Chi-Square Goodness of Fit

Regression and ANOVA

Coefficient of Determination (R-Squared)

1. Why do we need Inferential Statistics?Suppose, you want to know the average salary of Data Science professionals in India. Which of the following methods can be used to calculate it?

Meet every Data Science professional in India. Note down their salaries and then calculate the total average?

Or hand pick a number of professionals in a city like Gurgaon. Note down their salaries and use it to calculate the Indian average.

Well, the first method is not impossible but it would require an enormous amount of resources and time. But today, companies want to make decisions swiftly and in a cost-effective way, so the first method doesn’t stand a chance.

On the other hand, second method seems feasible. But, there is a caveat. What if the population of Gurgaon is not reflective of the entire population of India? There are then good chances of you making a very wrong estimate of the salary of Indian Data Science professionals.

Now, what method can be used to estimate the average salary of all data scientists across India?

Enter Inferential StatisticsIn simple language, Inferential Statistics is used to draw inferences beyond the immediate data available.

With the help of inferential statistics, we can answer the following questions:

Making inferences about the population from the sample.

Concluding whether a sample is significantly different from the population. For example, let’s say you collected the salary details of Data Science professionals in Bangalore. And you observed that the average salary of Bangalore’s data scientists is more than the average salary across India. Now, we can conclude if the difference is statistically significant.

If adding or removing a feature from a model will really help to improve the model.

If one model is significantly better than the other?

Hypothesis testing in general.

I am sure by now you must have got a gist of why inferential statistics is important. I will take you through the various techniques & concepts involved in Inferential statistics. But first, let’s discuss what are the prerequisites for understanding Inferential Statistics.

2. Pre-RequisitesTo begin with Inferential Statistics, one must have a good grasp over the following concepts:

Probability

Basic knowledge of Probability Distributions

Descriptive Statistics

If you are not comfortable with either of the three concepts mentioned above, you must go through them before proceeding further.

Throughout the entire article, I will be using a few terminologies quite often. So, here is a brief description of them:

Statistic – A Single measure of some attribute of a sample. For eg: Mean/Median/Mode of a sample of Data Scientists in Bangalore.

Population Statistic – The statistic of the entire population in context. For eg: Population mean for the salary of the entire population of Data Scientists across India.

Sample Statistic – The statistic of a group taken from a population. For eg: Mean of salaries of all Data Scientists in Bangalore.

Standard Deviation – It is the amount of variation in the population data. It is given by σ.

Standard Error – It is the amount of variation in the sample data. It is related to Standard Deviation as σ/√n, where, n is the sample size.

3. Sampling Distribution and Central Limit TheoremSuppose, you note down the salary of any 100 random Data Science professionals in Gurgaon, calculate the mean and repeat the procedure for say like 200 times (arbitrarily).

When you plot a frequency graph of these 200 means, you are likely to get a curve similar to the one below.

This looks very much similar to the normal curve that you studied in the Descriptive Statistics. This is called Sampling Distribution or the graph obtained by plotting sample means. Let us look at a more formal description of a Sampling Distribution.

A Sampling Distribution is a probability distribution of a statistic obtained through a large number of samples drawn from a specific population.

A Sampling Distribution behaves much like a normal curve and has some interesting properties like :

The shape of the Sampling Distribution does not reveal anything about the shape of the population. For example, for the above Sampling Distribution, the population distribution may look like the below graph.

Population Distribution

Sampling Distribution helps to estimate the population statistic.

But how ?

This will be explained using a very important theorem in statistics – The Central Limit Theorem.

3.1 Central Limit TheoremIt states that when plotting a sampling distribution of means, the mean of sample means will be equal to the population mean. And the sampling distribution will approach a normal distribution with variance equal to σ/√n where σ is the standard deviation of population and n is the sample size.

Points to note:

Central Limit Theorem holds true irrespective of the type of distribution of the population.

Now, we have a way to estimate the population mean by just making repeated observations of samples of a fixed size.

Greater the sample size, lower the standard error and greater the accuracy in determining the population mean from the sample mean.

This seemed too technical isn’t it? Let’s break this down to understand this point by point.

The number of samples have to be sufficient (generally more than 50) to satisfactorily achieve a normal curve distribution. Also, care has to be taken to keep the sample size fixed since any change in sample size will change the shape of the sampling distribution and it will no longer be bell shaped.

As we increase the sample size, the sampling distribution squeezes from both sides giving us a better estimate of the population statistic since it lies somewhere in the middle of the sampling distribution (generally). The below image will help you visualize the effect of sample size on the shape of distribution.

Now, since we have collected the samples and plotted their means, it is important to know where the population mean lies with respect to a particular sample mean and how confident can we be about it. This brings us to our next topic – Confidence Interval.

3.2 Confidence IntervalThe confidence interval is a type of interval estimate from the sampling distribution which gives a range of values in which the population statistic may lie. Let us understand this with the help of an example.

We know that 95% of the values lie within 2 (1.96 to be more accurate) standard deviation of a normal distribution curve. So, for the above curve, the blue shaded portion represents the confidence interval for a sample mean of 0.

Formally, Confidence Interval is defined as,

whereas, = the sample mean

= Z value for desired confidence level α

σ = the population standard deviation

For an alpha value of 0.95 i.e 95% confidence interval, z=1.96.

Now there is one more term which you should be familiar with, Margin of Error. It is given as {(z.σ)/√n} and defined as the sampling error by the surveyor or the person who collected the samples. That means, if a sample mean lies in the margin of error range then, it might be possible that its actual value is equal to the population mean and the difference is occurring by chance. Anything outside margin of error is considered statistically significant.

And it is easy to infer that the error can be both positive and negative side. The whole margin of error on both sides of the sample statistic constitutes the Confidence Interval. Numerically, C.I is twice of Margin of Error.

The below image will help you better visualize Margin of Error and Confidence Interval.

The shaded portion on horizontal axis represents the Confidence Interval and half of it is Margin of Error which can be in either direction of x (bar).

Interesting points to note about Confidence Intervals:

Confidence Intervals can be built with difference degrees of confidence suitable to a user’s needs like 70 %, 90% etc.

Greater the sample size, smaller the Confidence Interval, i.e more accurate determination of population mean from the sample means.

There are different confidence intervals for different sample means. For example, a sample mean of 40 will have a difference confidence interval from a sample mean of 45.

By 95% Confidence Interval, we do not mean that – The probability of a population mean to lie in an interval is 95%. Instead, 95% C.I means that 95% of the Interval estimates will contain the population statistic.

Many people do not have right knowledge about confidence interval and often interpret it incorrectly. So, I would like you to take your time visualizing the 4th argument and let it sink in.

3.3 Practical exampleCalculate the 95% confidence interval for a sample mean of 40 and sample standard deviation of 40 with sample size equal to 100.

Solution:

We know, z-value for 95% C.I is 1.96. Hence, Confidence Interval (C.I) is calculated as:

C.I= [{x(bar) – (z*s/√n)},{x(bar) – (z*s/√n)}]

C.I = [{40-(1.96*40/10},{ 40+(1.96*40/10)}]

C.I = [32.16, 47.84]

4. Hypothesis TestingBefore I get into the theoretical explanation, let us understand Hypothesis Testing by using a simple example.

Example: Class 8th has a mean score of 40 marks out of 100. The principal of the school decided that extra classes are necessary in order to improve the performance of the class. The class scored an average of 45 marks out of 100 after taking extra classes. Can we be sure whether the increase in marks is a result of extra classes or is it just random?

Hypothesis testing lets us identify that. It lets a sample statistic to be checked against a population statistic or statistic of another sample to study any intervention etc. Extra classes being the intervention in the above example.

Hypothesis testing is defined in two terms – Null Hypothesis and Alternate Hypothesis.

Null Hypothesis being the sample statistic to be equal to the population statistic. For eg: The Null Hypothesis for the above example would be that the average marks after extra class are same as that before the classes.

Alternate Hypothesis for this example would be that the marks after extra class are significantly different from that before the class.

Hypothesis Testing is done on different levels of confidence and makes use of z-score to calculate the probability. So for a 95% Confidence Interval, anything above the z-threshold for 95% would reject the null hypothesis.

Points to be noted:

We cannot accept the Null hypothesis, only reject it or fail to reject it.

As a practical tip, Null hypothesis is generally kept which we want to disprove. For eg: You want to prove that students performed better after taking extra classes on their exam. The Null Hypothesis, in this case, would be that the marks obtained after the classes are same as before the classes.

5. Types of Errors in Hypothesis TestingNow we have defined a basic Hypothesis Testing framework. It is important to look into some of the mistakes that are committed while performing Hypothesis Testing and try to classify those mistakes if possible.

Now, look at the Null Hypothesis definition above. What we notice at the first look is that it is a statement subjective to the tester like you and me and not a fact. That means there is a possibility that the Null Hypothesis can be true or false and we may end up committing some mistakes on the same lines.

There are two types of errors that are generally encountered while conducting Hypothesis Testing.

Type I error: Look at the following scenario – A male human tested positive for being pregnant. Is it even possible? This surely looks like a case of False Positive. More formally, it is defined as the incorrect rejection of a True Null Hypothesis. The Null Hypothesis, in this case, would be – Male Human is not pregnant.

Type II error: Look at another scenario where our Null Hypothesis is – A male human is pregnant and the test supports the Null Hypothesis. This looks like a case of False Negative. More formally it is defined as the acceptance of a false Null Hypothesis.

The below image will summarize the types of error :

6. T-testsT-tests are very much similar to the z-scores, the only difference being that instead of the Population Standard Deviation, we now use the Sample Standard Deviation. The rest is same as before, calculating probabilities on basis of t-values.

The Sample Standard Deviation is given as:

where n-1 is the Bessel’s correction for estimating the population parameter.

Another difference between z-scores and t-values are that t-values are dependent on Degree of Freedom of a sample. Let us define what degree of freedom is for a sample.

The Degree of Freedom – It is the number of variables that have the choice of having more than one arbitrary value. For example, in a sample of size 10 with mean 10, 9 values can be arbitrary but the 1oth value is forced by the sample mean.

Points to note about the t-tests:

Greater the difference between the sample mean and the population mean, greater the chance of rejecting the Null Hypothesis. Why? (We discussed this above.)

Greater the sample size, greater the chance of rejection of Null Hypothesis.

7. Different types of t-tests 7.1 1-sample t-testThis is the same test as we described above. This test is used to:

Determine whether the mean of a group differs from the specified value.

Calculate a range of values that are likely to include the population mean.

where, X(bar) = sample mean

μ = population mean

s = sample standard deviation

N = sample size

7.2 Paired t-testPaired t-test is performed to check whether there is a difference in mean after a treatment on a sample in comparison to before. It checks whether the Null hypothesis: The difference between the means is Zero, can be rejected or not.

The above example suggests that the Null Hypothesis should not be rejected and that there is no significant difference in means before and after the intervention since p-value is not less than the alpha value (o.o5) and t stat is not less than t-critical. The excel sheet for the above exercise is available here.

where, d (bar) = mean of the case wise difference between before and after,

= standard deviation of the difference

n = sample size.

7.3 2-sample t-testThis test is used to determine:

Determine whether the means of two independent groups differ.

Calculate a range of values that is likely to include the difference between the population means.

The above formula represents the 2 sample t-test and can be used in situations like to check whether two machines are producing the same output. The points to be noted for this test are:

The groups to be tested should be independent.

The groups’ distribution should not be highly skewed.

where, X1 (bar) = mean of the first group

= represents 1st group sample standard deviation

= represents the 1st group sample size.

7.4 Practical exampleWe will understand how to identify which t-test to be used and then proceed on to solve it. The other t-tests will follow the same argument.

Example: A population has mean weight of 68 kg. A random sample of size 25 has a mean weight of 70 with standard deviation =4. Identify whether this sample is representative of the population?

Step 0: Identifying the type of t-testNumber of samples in question = 1

Number of times the sample is in study = 1

Any intervention on sample = No

Recommended t-test = 1- sample t-test.

Had there been 2 samples, we would have opted for 2-sample t-test and if there would have been 2 observations on the same sample, we would have opted for paired t-test.`

Step 1: State the Null and Alternate HypothesisNull Hypothesis: The sample mean and population mean are same.

Alternate Hypothesis: The sample mean and population mean are different.

Step 2: Calculate the appropriate test statisticdf = 25-1 =24

t= (70-68)/(4/√25) = 2.5

Now, for a 95% confidence level, t-critical (two-tail) for rejecting Null Hypothesis for 24 d.f is 2.06 . Hence, we can reject the Null Hypothesis and conclude that the two means are different.

You can use the t-test calculator here.

8. ANOVAANOVA (Analysis of Variance) is used to check if at least one of two or more groups have statistically different means. Now, the question arises – Why do we need another test for checking the difference of means between independent groups? Why can we not use multiple t-tests to check for the difference in means?

The answer is simple. Multiple t-tests will have a compound effect on the error rate of the result. Performing t-test thrice will give an error rate of ~15% which is too high, whereas ANOVA keeps it at 5% for a 95% confidence interval.

To perform an ANOVA, you must have a continuous response variable and at least one categorical factor with two or more levels. ANOVA requires data from approximately normally distributed populations with equal variances between factor levels. However, ANOVA procedures work quite well even if the normality assumption has been violated unless one or more of the distributions are highly skewed or if the variances are quite different.

ANOVA is measured using a statistic known as F-Ratio. It is defined as the ratio of Mean Square (between groups) to the Mean Square (within group).

Mean Square (between groups) = Sum of Squares (between groups) / degree of freedom (between groups)

Mean Square (within group) = Sum of Squares (within group) / degree of freedom (within group)

Here, p = represents the number of groups

n = represents the number of observations in a group

= represents the mean of a particular group

X (bar) = represents the mean of all the observations

Now, let us understand the degree of freedom for within group and between groups respectively.

Between groups : If there are k groups in ANOVA model, then k-1 will be independent. Hence, k-1 degree of freedom.

Within groups : If N represents the total observations in ANOVA (∑n over all groups) and k are the number of groups then, there will be k fixed points. Hence, N-k degree of freedom.

8.1 Steps to perform ANOVA

Hypothesis Generation

Null Hypothesis : Means of all the groups are same

Alternate Hypothesis : Mean of at least one group is different

Calculate within group and between groups variability

Calculate F-Ratio

Calculate probability using F-table

Reject/fail to Reject Null Hypothesis

There are various other forms of ANOVA too like Two-way ANOVA, MANOVA, ANCOVA etc. but One-Way ANOVA suffices the requirements of this course.

Practical applications of ANOVA in modeling are:

Identifying whether a categorical variable is relevant to a continuous variable.

Identifying whether a treatment was effective to the model or not.

8.2 Practical ExampleSuppose there are 3 chocolates in town and their sweetness is quantified by some metric (S). Data is collected on the three chocolates. You are given the task to identify whether the mean sweetness of the 3 chocolates are different. The data is given as below:

Type A Type B Type C

Here, first we have calculated the sample mean and sample standard deviation for you.

Now we will proceed step-wise to calculate the F-Ratio (ANOVA statistic).

Step 1: Stating the Null and Alternate HypothesisNull Hypothesis: Mean sweetness of the three chocolates are same.

Alternate Hypothesis: Mean sweetness of at least one of the chocolates is different.

Step 2: Calculating the appropriate ANOVA statisticIn this part, we will be calculating SS(B), SS(W), SS(T) and then move on to calculate MS(B) and MS(W). The thing to note is that,

Total Sum of Squares [SS(t)] = Between Sum of Squares [SS(B)] + Within Sum of Squares [SS(W)].

So, we need to calculate any two of the three parameters using the data table and formulas given above.

As, per the formula above, we need one more statistic i.e Grand Mean denoted by X(bar) in the formula above.

X bar = (643+655+702+469+427+525+484+456+402)/9 = 529.22

SS(B)=[3*(666.67-529.22)^2]+ [3*(473.67-529.22)^2]+[3*(447.33-529.22)^2] = 86049.55

SS (W) = [(643-666.67)^2+(655-666.67)^2+(702-666.67)^2] + [(469-473.67)^2+(427-473.67)^2+(525-473.67)^2] + [(484-447.33)^2+(456-447.33)^2+(402-447.33)^2]= 10254

MS(B) = SS(B) / df(B) = 86049.55 / (3-1) = 43024.78

MS(W) = SS(W) / df(W) = 10254/(9-3) = 1709

F-Ratio = MS(B) / MS(W) = 25.17 .

Now, for a 95 % confidence level, F-critical to reject Null Hypothesis for degrees of freedom(2,6) is 5.14 but we have 25.17 as our F-Ratio.

So, we can confidently reject the Null Hypothesis and come to a conclusion that at least one of the chocolate has a mean sweetness different from the others.

You can use the F-calculator here.

Note: ANOVA only lets us know the means for different groups are same or not. It doesn’t help us identify which mean is chúng tôi know which group mean is different, we can use another test know as Least Significant Difference Test.

9. Chi-square Goodness of Fit TestSometimes, the variable under study is not a continuous variable but a categorical variable. Chi-square test is used when we have one single categorical variable from the population.

Let us understand this with help of an example. Suppose a company that manufactures chocolates, states that they manufacture 30% dairy milk, 60% temptation and 10% kit-kat. Now suppose a random sample of 100 chocolates has 50 dairy milk, 45 temptation and 5 kitkats. Does this support the claim made by the company?

Let us state our Hypothesis first.

Null Hypothesis: The claims are True

Alternate Hypothesis: The claims are False.

Chi-Square Test is given by:

where, = sample or observed values

= population values

The summation is taken over all the levels of a categorical variable.

= [n * ] Expected value of a level (i) is equal to the product of sample size and percentage of it in the population.

Let us now calculate the Expected values of all the levels.

E (dairy milk)= 100 * 30% = 30

E (temptation) = 100 * 60% =60

E (kitkat) = 100 * 10% = 10

Calculating chi-square = [(50-30)^2/30+(45-60)^2/60+(5-10)^2/10] =19.58

So we reject the Null Hypothesis.

If you have studied some basic Machine Learning Algorithms, the first algorithm that you must have studied is Regression. If we recall those lessons of Regression, what we generally do is calculate the weights for features present in the model to better predict the output variable. But finding the right set of feature weights or features for that matter is not always possible.

It is highly likely that that the existing features in the model are not fit for explaining the trend in dependent variable or the feature weights calculated fail at explaining the trend in dependent variable. What is important is knowing the degree to which our model is successful in explaining the trend (variance) in dependent variable.

Enter ANOVA.

With the help of ANOVA techniques, we can analyse a model performance very much like we analyse samples for being statistically different or not.

But with regression things are not easy. We do not have mean of any kind to compare or sample as such but we can find good alternatives in our regression model which can substitute for mean and sample.

Sample in case of regression is a regression model itself with pre-defined features and feature weights whereas mean is replaced by variance(of both dependent and independent variables).

Through our ANOVA test we would like to know the amount of variance explained by the Independent variables in Dependent Variable VS the amount of variance that was left unexplained.

It is intuitive to see that larger the unexplained variance(trend) of the dependent variable smaller will be the ratio and less effective is our regression model. On the other hand, if we have a large explained variance then it is easy to see that our regression model was successful in explaining the variance in the dependent variable and more effective is our model. The ratio of Explained Variance uand Unexplained Variance is called F-Ratio.

Let us now define these explained and unexplained variances to find the effectiveness of our model.

1. Regression (Explained) Sum of Squares – It is defined as the amount of variation explained by the Regression model in the dependent variable.

Mathematically, it is calculated as:

where, [hat] = predicted value and

y(bar) = mean of the actual y values.

Interpreting Regression sum of squares –

If our model is a good model for the problem at hand then it would produce an output which has distribution as same to the actual dependent variable. i.e it would be able to capture the inherent variation in the dependent variable.

2. Residual Sum of Squares – It is defined as the amount of variation independent variable which is not explained by the Regression model.

Mathematically, it is calculated as:

where, = actual ‘y ‘ value

f(x) = predicted value

Interpretation of Residual Sum of Squares –

It can be interpreted as the amount by which the predicted values deviated from the actual values. Large deviation would indicate that the model failed at predicting the correct values for the dependent variable.

Let us now work out F-ratio step by step. We will be making using of the Hypothesis Testing framework described above to test the significance of the model.

While calculating the F-Ratio care has to be taken to incorporate the effect of degree of freedom. Mathematically, F-Ratio is the ratio of [Regression Sum of Squares/df(regression)] and [Residual Sum of Squares/df(residual)].

We will be understanding the entire concept using an example and this excel sheet.

Step 0: State the Null and Alternate HypothesisNull Hypothesis: The model is unable to explain the variance in the dependent variable (Y).

Alternate Hypothesis: The model is able to explain the variance in dependent variable (Y)

Step 1:Calculate the regression equation for X and Y using Excel’s in-built tool.

Step 2:Predict the values of y for each row of data.

Step 3:Calculate y(mean) – mean of the actual y values which in this case turns out to be 0.4293548387.

Step 4:Calculate the Regression Sum of Squares using the above-mentioned formula. It turned out to be 2.1103632473

The Degree of freedom for regression equation is 1, since we have only 1 independent variable.

Step 5:Calculate the Residual Sum of Squares using the above-mentioned formula. It turned out to be 0.672210946.

Degree of Freedom for residual = Total degree of freedom – Degree of freedom(regression)

=(62-1) – 1 = 60

Step 6:F-Ratio = (2.1103632473/1)/(0.672210946/60) = 188.366

Now, for 95% confidence, F-critical to reject Null Hypothesis for 1,60 degrees of freedom in 4. But we have F-ratio as 188, so we can safely reject the Null Hypothesis and conclude that model explains variation to a large extent.

11. Coefficient of Determination (R-Square)It is defined as the ratio of the amount of variance explained by the regression model to the total variation in the data. It represents the strength of correlation between two variables.

We already calculated the Regression SS and Residual SS. Total SS is the sum of Regression SS and Residual SS.

Total SS = 2.1103632473+ 0.672210946 = 2.78257419

Co-efficient of Determination = 2.1103632473/2.78257419 = 0.7588

12. Correlation CoefficientThis is another useful statistic which is used to determine the correlation between two variables. It is simply the square root of coefficient of Determination and ranges from -1 to 1 where 0 represents no correlation and 1 represents positive strong correlation while -1 represents negative strong correlation.

End NotesSo, this guide comes to an end with explaining all the theory along with practical implementations of various Inferential Statistics concepts. This guide has been created with a Hypothesis Testing framework and I hope this would be one stop solution for a quick Inferential Statistics guide.

Related

## Become A Data Visualization Whiz With This Comprehensive Guide To Seaborn In Python

Overview

Seaborn is a popular data visualization library for Python

Seaborn combines aesthetic appeal and technical insights – two crucial cogs in a data science project

Learn how it works and the different plots you can generate using seaborn

IntroductionThere is just something extraordinary about a well-designed visualization. The colors stand out, the layers blend nicely together, the contours flow throughout, and the overall package not only has a nice aesthetic quality, but it provides meaningful insights to us as well.

This is quite important in data science where we often work with a lot of messy data. Having the ability to visualize it is critical for a data scientist. Our stakeholders or clients will more often than not rely on visual cues rather than the intricacies of a machine learning model.

There are plenty of excellent Python visualization libraries available, including the built-in matplotlib. But seaborn stands out for me. It combines aesthetic appeal seamlessly with technical insights, as we’ll soon see.

In this article, we’ll learn what seaborn is and why you should use it ahead of matplotlib. We’ll then use seaborn to generate all sorts of different data visualizations in Python. So put your creative hats on and let’s get rolling!

Seaborn is part of the comprehensive and popular Applied Machine Learning course. It’s your one-stop-destination to learning all about machine learning and its different aspects.

Table of Contents

What is Seaborn?

Why should you use Seaborn versus matplotlib?

Setting up the Environment

Data Visualization using Seaborn

Visualizing Statistical Relationships

Plotting with Categorical Data

Visualizing the Distribution of a Dataset

What is Seaborn?Have you ever used the ggplot2 library in R? It’s one of the best visualization packages in any tool or language. Seaborn gives me the same overall feel.

Seaborn is an amazing Python visualization library built on top of matplotlib.

It gives us the capability to create amplified data visuals. This helps us understand the data by displaying it in a visual context to unearth any hidden correlations between variables or trends that might not be obvious initially. Seaborn has a high-level interface as compared to the low level of Matplotlib.

Why should you use Seaborn versus matplotlib?I’ve been talking about how awesome seaborn is so you might be wondering what all the fuss is about.

I’ll answer that question comprehensively in a practical manner when we generate plots using seaborn. For now, let’s quickly talk about how seaborn feels like it’s a step above matplotlib.

Seaborn makes our charts and plots look engaging and enables some of the common data visualization needs (like mapping color to a variable or using faceting). Basically, it makes the data visualization and exploration easy to conquer. And trust me, that is no easy task in data science.

“If Matplotlib “tries to make easy things easy and hard things possible”, seaborn tries to make a well-defined set of hard things easy too.” – Michael Waskom (Creator of Seaborn)

There are essentially a couple of (big) limitations in matplotlib that Seaborn fixes:

Seaborn comes with a large number of high-level interfaces and customized themes that matplotlib lacks as it’s not easy to figure out the settings that make plots attractive

Matplotlib functions don’t work well with dataframes, whereas seaborn does

Setting up the EnvironmentThe seaborn library has four mandatory dependencies you need to have:

To install Seaborn and use it effectively, first, we need to install the aforementioned dependencies. Once this step is done, we are all set to install Seaborn and enjoy its mesmerizing plots. To install Seaborn, you can use the following line of code-

To install the latest release of seaborn, you can use pip:

pip

install

seaborn

You can also use conda to install the latest version of seaborn:

conda

install

seaborn

To import the dependencies and seaborn itself in your code, you can use the following code-

View the code on Gist.

That’s it! We are all set to explore seaborn in detail.

Datasets Used for Data VisualizationWe’ll be working primarily with two datasets:

I’ve picked these two because they contain a multitude of variables so we have plenty of options to play around with. Both these datasets also mimic real-world scenarios so you’ll get an idea of how data visualization and exploration work in the industry.

You can check out this and other high-quality datasets and hackathons on the DataHack platform. So go ahead and download the above two datasets before you proceed. We’ll be using them in tandem.

Data Visualization using SeabornLet’s get started! I have divided this implementation section into two categories:

Visualizing statistical relationships

Plotting categorical data

We’ll look at multiple examples of each category and how to plot it using seaborn.

Visualizing statistical relationshipsA statistical relationship denotes a process of understanding relationships between different variables in a dataset and how that relationship affects or depends on other variables.

Here, we’ll be using seaborn to generate the below plots:

Scatter plot

SNS.relplot

Hue plot

I have picked the ‘Predict the number of upvotes‘ project for this. So, let’s start by importing the dataset in our working environment:

Scatterplot using SeabornA scatterplot is perhaps the most common example of visualizing relationships between two variables. Each point shows an observation in the dataset and these observations are represented by dot-like structures. The plot shows the joint distribution of two variables using a cloud of points.

To draw the scatter plot, we’ll be using the relplot() function of the seaborn library. It is a figure-level role for visualizing statistical relationships. By default, using a relplot produces a scatter plot:

Python Code:

SNS.relplot using SeabornSNS.relplot is the relplot function from SNS class, which is a seaborn class that we imported above with other dependencies.

The parameters – x, y, and data – represent the variables on X-axis, Y-axis and the data we are using to plot respectively. Here, we’ve found a relationship between the views and upvotes.

Next, if we want to see the tag associated with the data, we can use the below code:

View the code on Gist.

Hue PlotWe can add another dimension in our plot with the help of hue as it gives color to the points and each color has some meaning attached to it.

In the above plot, the hue semantic is categorical. That’s why it has a different color palette. If the hue semantic is numeric, then the coloring becomes sequential.

View the code on Gist.

We can also change the size of each point:

View the code on Gist.

We can also change the size manually by using another parameter sizes as sizes = (15, 200).

Plotting Categorical Data

Jitter

Hue

Boxplot

Voilin Plot

Pointplot

In the above section, we saw how we can use different visual representations to show the relationship between multiple variables. We drew the plots between two numeric variables. In this section, we’ll see the relationship between two variables of which one would be categorical (divided into different groups).

We’ll be using catplot() function of seaborn library to draw the plots of categorical data. Let’s dive in

Jitter PlotFor jitter plot we’ll be using another dataset from the problem HR analysis challenge, let’s import the dataset now.

View the code on Gist.

Now, we’ll see the plot between the columns education and avg_training_score by using catplot() function.

Since we can see that the plot is scattered, so to handle that, we can set the jitter to false. Jitter is the deviation from the true value. So, we’ll set the jitter to false by using another parameter.

Hue PlotNext, if we want to introduce another variable or another dimension in our plot, we can use the hue parameter just like we used in the above section. Let’s say we want to see the gender distribution in the plot of education and avg_training_score, to do that, we can use the following code

In the above plots, we can see that the points are overlapping each other, to eliminate this situation, we can set kind = “swarm”, swarm uses an algorithm that prevents the points from overlapping and adjusts the points along the categorical axis. Let’s see how it looks like-

Pretty amazing, right? What if we want to see the swarmed version of the plot as well as a third dimension? Let’s see how it goes if we introduce is_promoted as a new variable

Clearly people with higher scores got a promotion.

Boxplot using seabornAnother kind of plot we can draw is a boxplot which shows three quartile values of the distribution along with the end values. Each value in the boxplot corresponds to actual observation in the data. Let’s draw the boxplot now-

When we use hue semantic with boxplot, it is leveled along the categorical axis so they don’t overlap. The boxplot with hue would look like-

Violin Plot using seabornWe can also represent the above variables differently by using violin plots. Let’s try it out

The violin plots combine the boxplot and kernel density estimation procedure to provide richer description of the distribution of values. The quartile values are displayed inside the violin. We can also split the violin when the hue semantic parameter has only two levels, which could also be helpful in saving space on the plot. Let’s look at the violin plot with a split of levels.

These amazing plots are the reason why I started using seaborn. It gives you a lot of options to display the data. Another coming in the line is boxplot.

Boxplot using seabornBoxplot operates on the full dataset and obtains the mean value by default. Let’s face it now.

Pointplot using seabornAnother type of plot coming in is pointplot, and this plot points out the estimate value and confidence interval. Pointplot connects data from the same hue category. This helps in identifying how the relationship is changing in a particular hue category. You can check out how does a pointplot displays the information below.

As it is clear from the above plot, the one whose score is high has is more confident in getting a promotion.

This is not the end, seaborn is a huge library with a lot of plotting functions for different purposes. One such purpose is to introduce multiple dimensions. We can visualize higher dimension relationships as well. Let’s check it out using swarm plot.

Swarm plot using seabornIt becomes so easy to visualize the insights when we combine multiple concepts into one. Here swarm plot is promoted attribute as hue semantic and gender attribute as a faceting variable.

Visualizing the Distribution of a DatasetWhenever we are dealing with a dataset, we want to know how the data or the variables are being distributed. Distribution of data could tell us a lot about the nature of the data, so let’s dive into it.

Plotting Univariate Distributions

Histogram

One of the most common plots you’ll come across while examining the distribution of a variable is distplot. By default, distplot() function draws histogram and fits a Kernel Density Estimate. Let’s check out how age is distributed across the data.

This clearly shows that the majority of people are in their late twenties and early thirties.

Histogram using SeabornAnother kind of plot that we use for univariate distribution is a histogram.

A histogram represents the distribution of data in the form of bins and uses bars to show the number of observations falling under each bin. We can also add a rugplot in it instead of using KDE (Kernel Density Estimate), which means at every observation, it will draw a small vertical stick.

Plotting Bivariate Distributions

Hexplot

KDE plot

Boxen plot

Ridge plot (Joyplot)

Apart from visualizing the distribution of a single variable, we can see how two independent variables are distributed with respect to each other. Bivariate means joint, so to visualize it, we use jointplot() function of seaborn library. By default, jointplot draws a scatter plot. Let’s check out the bivariate distribution between age and avg_training_score.

There are multiple ways to visualize bivariate distribution. Let’s look at a couple of more.

Hexplot using SeabornHexplot is a bivariate analog of histogram as it shows the number of observations that falls within hexagonal bins. This is a plot which works with large dataset very easily. To draw a hexplot, we’ll set kind attribute to hex. Let’s check it out now.

View the code on Gist.

KDE Plot using SeabornThat’s not the end of this, next comes KDE plot. It’s another very awesome method to visualize the bivariate distribution. Let’s see how the above observations could also be achieved by using jointplot() function and setting the attribute kind to KDE.

View the code on Gist.

Heatmaps using SeabornNow let’s talk about my absolute favorite plot, the heatmap. Heatmaps are graphical representations in which each variable is represented as a color.

Let’s go ahead and generate one:

View the code on Gist.

Boxen Plot using SeabornAnother plot that we can use to show the bivariate distribution is boxen plot. Boxen plots were originally named letter value plot as it shows large number of values of a variable, also known as quantiles. These quantiles are also defined as letter values. By plotting a large number of quantiles, provides more insights about the shape of the distribution. These are similar to box plots, let’s see how they could be used.

View the code on Gist.

Ridge Plot using seabornThe next plot is quite fascinating. It’s called ridge plot. It is also called joyplot. Ridge plot helps in visualizing the distribution of a numeric value for several groups. These distributions could be represented by using KDE plots or histograms. Now, let’s try to plot a ridge plot for age with respect to gender.

View the code on Gist.

Visualizing Pairwise Relationships in a DatasetWe can also plot multiple bivariate distributions in a dataset by using pairplot() function of the seaborn library. This shows the relationship between each column of the database. It also draws the univariate distribution plot of each variable on the diagonal axis. Let’s see how it looks.

End NotesWe’ve covered a lot of plots here. We saw how the seaborn library can be so effective when it comes to visualizing and exploring data (especially large datasets). We also discussed how we can plot different functions of the seaborn library for different kinds of data.

Like I mentioned earlier, the best way to learn seaborn (or any concept or library) is by practicing it. The more you generate new visualizations on your own, the more confident you’ll become. Go ahead and try your hand at any practice problem on the DataHack platform and start becoming a data visualization master!

Related

## Using Tigervnc In Ubuntu: A Comprehensive Guide

What is TigerVNC?

TigerVNC is a high-performance, platform-neutral implementation of Virtual Network Computing (VNC), a protocol that allows you to view and control the desktop of another computer remotely. TigerVNC is free and open-source software, available under the GNU General Public License.

TigerVNC provides several benefits, including:

High performance: TigerVNC is designed for efficient remote desktop access over low-bandwidth networks.

Security: TigerVNC supports encryption and authentication, ensuring that your remote desktop connection is secure.

Cross-platform compatibility: TigerVNC can be used to connect to Ubuntu from Windows, macOS, and other operating systems.

Installing TigerVNC in UbuntuBefore we can use TigerVNC, we need to install it on our Ubuntu machine. Here are the steps to do so:

Open a terminal window by pressing Ctrl+Alt+T.

Install TigerVNC by running the following command:

sudo apt-get install tigervnc-standalone-server tigervnc-xorg-extension tigervnc-viewerThis command will install the TigerVNC server and viewer components.

Configuring TigerVNC in UbuntuAfter installing TigerVNC, we need to configure it to allow remote desktop access. Here are the steps to do so:

Open a terminal window by pressing Ctrl+Alt+T.

Run the following command to create a new VNC password:

vncpasswdThis command will prompt you to enter and confirm a new VNC password. This password will be used to authenticate remote desktop connections.

Edit the TigerVNC configuration file by running the following command:

sudo nano /etc/vnc.conf

Add the following lines to the end of the file:

Authentication=VncAuth These lines tell TigerVNC to use VNC authentication and to use the password file we created earlier.

Save and close the file by pressing Ctrl+X, then Y, then Enter.

Starting the TigerVNC Server Now that we have installed and configured TigerVNC, we can start the server and begin accepting remote desktop connections. Here are the steps to do so:

Open a terminal window by pressing Ctrl+Alt+T.

Start the TigerVNC server by running the following command:

vncserverThis command will start the TigerVNC server and generate a unique desktop environment for each new connection.

Note the display number that is output by the command. It should be in the format :1, :2, etc. We will need this display number to connect to the remote desktop later.

Connecting to the Remote Desktop with TigerVNC ViewerNow that the TigerVNC server is running, we can connect to the remote desktop using TigerVNC Viewer. Here are the steps to do so:

Download and install TigerVNC Viewer on the device you want to connect from. You can download it from the official website.

Open TigerVNC Viewer and enter the IP address or hostname of the Ubuntu machine in the "Remote Host" field.

Enter the display number we noted earlier in the "Display" field. For example, if the display number was :1, enter 1.

Enter the VNC password we created earlier in the "Password" field.

ConclusionTigerVNC is a powerful and flexible tool for remotely accessing Ubuntu desktops. By following the steps outlined in this article, you should now be able to install, configure, and use TigerVNC in Ubuntu. With TigerVNC, you can easily work on your Ubuntu machine from anywhere in the world, using any device that supports the VNC protocol.

If you are looking for a way to remotely access your Ubuntu desktop, TigerVNC is a great option. This open-source software allows you to connect to your Ubuntu machine from another device, such as a Windows or macOS computer. In this article, we will explore how to install and use TigerVNC in Ubuntu, with code examples and explanations of related concepts.TigerVNC is a high-performance, platform-neutral implementation of Virtual Network Computing (VNC), a protocol that allows you to view and control the desktop of another computer remotely. TigerVNC is free and open-source software, available under the GNU General Public License. TigerVNC provides several benefits, including:Before we can use TigerVNC, we need to install it on our Ubuntu machine. Here are the steps to do so:This command will install the TigerVNC server and viewer components.After installing TigerVNC, we need to configure it to allow remote desktop access. Here are the steps to do so:This command will prompt you to enter and confirm a new VNC password. This password will be used to authenticate remote desktop connections.These lines tell TigerVNC to use VNC authentication and to use the password file we created chúng tôi that we have installed and configured TigerVNC, we can start the server and begin accepting remote desktop connections. Here are the steps to do so:This command will start the TigerVNC server and generate a unique desktop environment for each new chúng tôi that the TigerVNC server is running, we can connect to the remote desktop using TigerVNC Viewer. Here are the steps to do so:TigerVNC is a powerful and flexible tool for remotely accessing Ubuntu desktops. By following the steps outlined in this article, you should now be able to install, configure, and use TigerVNC in Ubuntu. With TigerVNC, you can easily work on your Ubuntu machine from anywhere in the world, using any device that supports the VNC protocol.

## Comprehensive Beginner’s Guide To Jupyter Notebooks For Data Science & Machine Learning

One of the most common questions people ask is which IDE/environment/tool to use while working on your data science projects. Plenty of options are available – from language-specific IDEs like R Studio, PyCharm to editors like Sublime Text or Atom – the choice can be intimidating for a beginner. If there is one tool that every data scientist should use or must be comfortable with, it is Jupyter Notebooks (previously known as iPython notebooks). In this article, you will learn all about Jupyter Notebooks!

What is a Jupyter Notebook?Jupyter Notebook is an open-source web application that allows us to create and share codes and documents. It provides an environment, where you can document your code, run it, look at the outcome, visualize data and see the results without leaving the environment. This makes it a handy tool for performing end to end data science workflows – data cleaning, statistical modeling, building and training machine learning models, visualizing data, and many, many other uses.

Jupyter Notebooks really shine when you are still in the prototyping phase. This is because your code is written in indepedent cells, which are executed individually. This allows the user to test a specific block of code in a project without having to execute the code from the start of the script. Many other IDE enviornments (like RStudio) also do this in several ways, but I have personally found Jupyter’s individual cells structure to be the best of the lot.

These Notebooks are incredibly flexible, interactive and powerful tools in the hands of a data scientist. They even allow you to run other languages besides Python, like R, SQL, etc. Since they are more interactive than an IDE platform, they are widely used to display codes in a more pedagogical manner.

How to Install Jupyter Notebook?As you might have guessed by now, you need to have Python installed on your machine first. Either Python 2.7 or Python 3.3 (or greater) will do.

AnacondaFor new users, the general consensus is that you should use the Anaconda distribution to install both Python and the Jupyter notebook. Anaconda installs both these tools and includes quite a lot of packages commonly used in the data science and machine learning community. You can download the latest version of Anaconda from here.

The pip MethodIf, for some reason, you decide not to use Anaconda, then you need to ensure that your machine is running the latest pip version. How do you do that? If you have Python already installed, pip will already be there. To upgrade to the latest pip version, follow the below code:

#Linux and OSX pip install -U pip setuptools #Windows python -m pip install -U pip setuptoolsOnce pip is ready, you can go ahead and install Jupyter:

#For Python2 pip install jupyter #For Python3 pip3 install jupyterYou can view the official Jupyter installation documentation here.

How to Set Up Jupyter Notebook?We’ve now learned all about what these notebooks are and how to go about setting them up on our own machines. Time to get the party started!

To run your Jupyter notebook, simply type the below command and you’re good to go!

jupyter notebookOnce you do this, the Jupyter notebook will open up in your default web browser with the below URL:

In some cases, it might not open up automatically. A URL will be generated in the terminal/command prompt with the token key. You will need to copy paste this entire URL, including the token key, into your browser when you are opening a Notebook.

Once the Notebook is opened, you’ll see three tabs at the top: Files, Running and Clusters. Files basically lists all the files, Running shows you the terminals and notebooks you currently have open, and Clusters is provided by IPython parallel.

Python 3

Text File

Folder

Terminal

In a Text File, you are given a blank slate. Add whatever alphabets, words and numbers you wish. It basically works as a text editor (similar to the application on Ubuntu). You also get the option to choose a language (there are a plethora of them given to you) so you can write a script in that. You also have the ability to find and replace words in the file.

In the Folder option, it does what the name suggests. You can create a new folder to put your documents in, rename it and delete it, whatever your requirement.

The Terminal works exactly like the terminal on your Mac or Linux machine (cmd on Windows). It does a job of supporting terminal sessions within your web browser. Type python in this terminal and voila! Your python script is ready to be written.

But in this article, we are going to focus on the notebook so we will select the Python 3 option from the ‘New’ option. You will get the below screen:

You can then start things off by importing the most common Python libraries: pandas and numpy. In the menu just above the code, you have options to play around with the cells: add, edit, cut, move cells up and down, run the code in the cell, stop the code, save your work and restart the kernel.

In the drop-down menu (shown above), you even have four options:

Code – This is self-explanatory; it is where you type your code

Raw NBConvert – It’s a command line tool to convert your notebook into another format (like HTML)

Heading – This is where you add Headings to separate sections and make your notebook look tidy and neat. This has now been converted into the Markdown option itself. Add a ‘##’ to ensure that whatever you type after that will be taken as a heading

Using Jupyter Notebook’s Magic FunctionsThe developers have inserted pre-defined magic functions that make your life easier and your work far more interactive. You can run the below command to see a list of these functions (note: the “%” is not needed usually because Automagic is usually turned on):

%lsmagicYou’ll see a lot of options listed and you might even recognise a few! Functions like %clear, %autosave, %debug and %mkdir are some you must have seen previously. Now, magic commands run in two ways:

Line-wise

Cell-wise

As the name suggests, line-wise is when you want to execute a single command line while cell-wise is when you want to execute not just a line, but the entire block of code in the entire cell.

In line-wise, all given commands must started with the % character while in cell-wise, all commands must begin with %%. Let’s look at the below example to get a better understanding:

Line-wise:

%time a = range(10)Cell-wise:

%%timeit a = range (10) min(a)I suggest you run these commands and see the difference for yourself!

Not Just Limited to Python – Use R, Julia and JavaScript within NotebooksAnd the magic doesn’t stop there. You can even use other languages in your Notebook, like R, Julia, JavaScript, etc. I personally love the ‘ggplot2’ package in R so using this for exploratory data analysis is a huge, huge bonus.

To enable R in Jupyter, you will need the ‘IRKernel’ (dedicated kernel for R) which is available on GitHub. It’s a 8 step process and has been explained in detail, along with screenshots to guide you, here.

If you are a Julia user, you can use that within Jupyter Notebooks too! Check out this comprehensive article which is focused on learning data science for a Julia user and includes a section on how to leverage it within the Jupyter environment.

If you prefer working on JavaScript, I recommend using the ‘IJavascript’ kernel. Check out this GitHub repository which walks you through the steps required for installing this kernel on different OS. Note that you will need to have chúng tôi and npm installed before being able to use this.

Interactive Dashboards in Jupyter Notebooks – Why not?Before you go about adding widgets, you need to import the widgets package:

from ipywidgets import widgetsThe basic type of widgets are your typical text input, input-based, and buttons. See the below example, taken from Dominodatalab, on how an interactive widget looks like:

You can check out a comprehensive guide to widgets here.

Keyboard Shortcuts – Save time and become even more productive!Shortcuts are one of the best things about Jupyter Notebooks. When you want to run any code block, all you need to do is press Ctrl+Enter. There are a lot more keyboard shortcuts that Jupyter notebooks offer that save us a bunch of time.

Below are a few shortcuts we hand picked that will be of immense use to you, when starting out. I highly recommend trying these out as you read them one by one. You won’t know how you lived without them!

A Jupyter Notebook offers two different keyboard input modes – Command and Edit. Command mode binds the keyboard to notebook level commands and is indicated by a grey cell border with a blue left margin. Edit mode allows you to type text (or code) into the active cell and is indicated by a green cell border.

Jump between command and edit mode using Esc and Enter, respectively. Try it out right now!

Once you are in command mode (that is, you don’t have an active cell), you can try out the below shortcuts:

A will insert a new cell above the active cell, and B will insert one below the active cell

To delete a cell, press D twice in succession

To undo a deleted cell, press Z

Y turns the currently active cell into a code cell

Hold down Shift + the up or down arrow key to select multiple cells. While in multiple selection mode, pressing Shift + M will merge your selection

F will pop up the ‘Find and Replace’ menu

When in edit mode (press Enter when in command mode to get into Edit mode), you will find the below shortcuts handy:

Ctrl + Home to go the start of the cell

Ctrl + S will save your progress

As mentioned, Ctrl + Enter will run your entire cell block

Alt + Enter will not only run your cell block, it also adds a new cell below

Ctrl + Shift + F opens the command palette

Useful Jupyter Notebook ExtensionsExtensions are a very productive way of enhancing your productivity on Jupyter Notebooks. One of the best tools to install and use extensions I have found is ‘Nbextensions’. It takes two simple steps to install it on your machine (there are other methods as well but I found this the most convenient):

Step 1: Install it from pip:

pip install jupyter_contrib_nbextensionsStep 2: Install the associated JavaScript and CSS files:

jupyter contrib nbextension install --userOnce you’re done with this, you’ll see a ‘Nbextensions’ tab on the top of your Jupyter Notebook home. And voila! There are a collection of awesome extensions you can use for your projects.

Code prettify: It reformats and beautifies the contents of code blocks.

Printview: This extension adds a toolbar button to call jupyter nbconvert for the current the notebook and optionally display the converted file in a new browser tab.

Scratchpad: This adds a scratchpad cell, which enables you to run your code without having to modify your Notebook. It’s a really handy extension to have when you want to experiment with your code but don’t want to do it on your live Notebook.

Table of Contents (2): This awesome extension collects all the headers in your Notebook and displays them in a floating window.

These are just some of the extensions you have at your disposal. I highly recommend checking out their entire list and experimenting with them.

Saving and Sharing your NotebookGo to the ‘Files’ menu and you’ll see a ‘Download As’ option there:

You can save your Notebook in any of the 7 options provided. The most commonly used is either a .ipynb file so the other person can replicate your code on their machine or the .html one which opens as a web page (this comes in handy when you want to save the images embedded in the Notebook).

You can also use the nbconvert option to manually convert your notebook into a different format like HTML or PDF.

You can also use jupyterhub, which lets you host notebooks on it’s server and share it with multiple users. A lot of top notch research projects use this for collaboration.

JupyterLab – The Evolution of Jupyter NotebooksJupyterLab was launched in February this year and is considered the evolution of Jupyter Notebooks. It allows a more flexible and powerful way of working on projects, but with the same components that Jupyter notebooks have. The JupyterLab environment is exactly the same as a Jupyter Notebook, but with a more productive experience.

JupyterLab enables you to arrange your work area with notebooks, terminals, text files and outputs – all in one window! You just have to drag and drop the cells where you want them. You can also edit popular file formats like Markdown, CSV and JSON with a live preview to see the changes happening in real time in the actual file.

You can see the installation instructions here if you want to try it out on your machine. The long term aim of the developers is for JupyterLab to eventually replace Jupyter notebooks. But that point is still a bit further away right now.

Best PracticesWhile working alone on projects can be fun, most of the time you’ll find yourself working within a team. And in that situation, it’s very important to follow guidelines and best practices to ensure your code and Jupyter Notebooks are annotated properly so as to be consistent with your team members. Here I have listed down a few best practices pointers you should definitely follow while working on a Jupyter Notebook:

Make sure you have the required documentation for your code

Consider a naming scheme and stick to it throughout your code to ensure consistency. This makes it easier for others to follow along

Ensure proper line spacing in your code. You don’t want your loops and functions in the same line – that makes for a maddening experience when it has to be referenced later!

You’ll find sometimes that your file has become quite code heavy. Check out options on how to hide some of the code you deem not important for later reference. This can be invaluable to make your Notebook look tidier and cleaner

Check out this notebook on matplotlib to see how beautifully and neatly it can be represented

Another bonus tip! When you think of creating a presentation, the first tools to come to mind are PowerPoint and Google Slides. Nut your Jupyter Notebooks can create slides too! Remember when I said it’s super flexible? I wasn’t exaggerating.

End NotesDo note that this is not an exhaustive list of things you can do with your Jupyter notebook. There is so much more to it and you pick these things up the more you use it. The key, as with so many things, is experimenting with practice. Check out this GitHub repository which contains a collection of fascinating Jupyter Notebooks.

Frequently Asked QuestionQ1. What is the Jupyter Notebook used for?

A. Jupyter Notebook is an open-source web-based application for interactive data analysis, visualization, and code development. It provides a flexible environment where users can create and share documents containing live code, equations, visualizations, and narrative text.

Q2. Is Jupyter Notebook a Python IDE?

A. Jupyter Notebook is not a full-fledged Python Integrated Development Environment (IDE). While it can execute Python code and provides a coding environment, it is an interactive computational notebook that supports multiple programming languages, including Python.

Q3. Is Jupyter Notebook the same as Python?

A. No, Jupyter Notebook is not the same as Python. Python is a programming language, whereas Jupyter Notebook is an interactive computing environment that allows you to write and execute code, including Python code, and perform data analysis, visualization, and documentation.

Q4. What is the full form of Jupyter?

A. Jupyter is derived by combining three programming languages: Julia, Python, and R. The name “Jupyter” represents these three languages and serves as a platform supporting multiple programming languages.

Q5. What languages are used in Jupyter?

A. Jupyter Notebook supports a wide range of programming languages beyond Python. Some popular languages used in Jupyter include R, Julia, Scala, JavaScript, and many others. This versatility allows users to work with different languages in a single notebook environment, making it a powerful tool for data science and interactive computing.

Related

## Understanding Umask: A Comprehensive Guide

As a developer or system administrator, it’s essential to understand the concept of umask. Umask is a command-line utility that determines the default file permissions for newly created files and directories. In this article, we’ll take a closer look at what umask is, how it works, and how to use it in Linux and Unix systems.

What is Umask?In Unix and Linux systems, every file and directory has a set of permissions that determine who can read, write, and execute them. These permissions are represented by three digits, each representing the permissions for a specific group of users: the owner of the file, the group owner of the file, and everyone else.

For example, if a file has permissions set to 644, it means that the owner of the file can read and write to it, while the group owner and everyone else can only read it.

The umask command determines the default permissions that are assigned to newly created files and directories. It works by subtracting the specified umask value from the default permissions assigned to new files and directories.

Understanding Umask ValuesThe umask value is represented by a three-digit octal number. Each digit represents the permissions that are removed from the default permissions for the owner, group owner, and everyone else.

For example, if the umask value is set to 022, it means that the write permission is removed for the group owner and everyone else. The default permissions for newly created files will be 644 (owner can read and write, group owner and everyone else can read), and for directories, it will be 755 (owner can read, write, and execute, group owner and everyone else can read and execute).

Using Umask in Linux and Unix SystemsTo set the umask value, you can use the umask command followed by the desired value. For example, to set the umask value to 022, you can run the following command:

umask 022You can also set the umask value in the shell startup file (e.g., ~/.bashrc or ~/.bash_profile) to make it persistent across sessions.

Once you set the umask value, any new files or directories you create will have the default permissions calculated based on the umask value.

Umask ExamplesLet’s take a look at some examples to understand how umask works in practice.

Example 1: Setting the Umask ValueSuppose you want to set the umask value to 027. You can run the following command:

umask 027This will set the umask value to 027, which means that the write permission is removed for the owner, and the read and write permissions are removed for the group owner and everyone else.

Example 2: Creating a New FileSuppose you create a new file named example.txt after setting the umask value to 027. The default permissions for the file will be 640 (owner can read and write, group owner can read, and everyone else has no permissions).

touch example.txt ls -l example.txtOutput:

Example 3: Creating a New DirectorySuppose you create a new directory named example after setting the umask value to 027. The default permissions for the directory will be 750 (owner can read, write, and execute, group owner can read and execute, and everyone else has no permissions).

mkdir example ls -ld exampleOutput:

ConclusionIn summary, umask is a command-line utility that determines the default file permissions for newly created files and directories in Unix and Linux systems. Understanding how umask works is essential for developers and system administrators to ensure that the correct permissions are set for files and directories. By using umask, you can easily set the default permissions for newly created files and directories based on your specific requirements.

Update the detailed information about **A Comprehensive Guide To Sharding In Data Engineering For Beginners** on the Achiashop.com website. We hope the article's content will meet your needs, and we will regularly update the information to provide you with the fastest and most accurate information. Have a great day!