Understanding Umask: A Comprehensive Guide

Updated December 2023.

As a developer or system administrator, it’s essential to understand the concept of umask. The umask (user file-creation mode mask) is a per-process setting, managed with the shell’s built-in umask command, that determines the default permissions for newly created files and directories. In this article, we’ll take a closer look at what umask is, how it works, and how to use it in Linux and Unix systems.

What is Umask?

In Unix and Linux systems, every file and directory has a set of permissions that determine who can read, write, and execute it. These permissions are commonly written as three octal digits, one each for the owner of the file, the group owner of the file, and everyone else. Each digit is the sum of read (4), write (2), and execute (1).

For example, if a file has permissions set to 644, it means that the owner of the file can read and write to it, while the group owner and everyone else can only read it.
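To make the octal notation concrete, here is a minimal Python sketch that renders a permission value as the familiar rwx string. The helper name `describe` is hypothetical; `stat.filemode` is part of the Python standard library.

```python
import stat

def describe(mode: int) -> str:
    """Render an octal permission value such as 0o644 as an rwx string."""
    # S_IFREG marks the value as a regular file, so the string starts with '-'.
    return stat.filemode(stat.S_IFREG | mode)

print(describe(0o644))  # -rw-r--r--
print(describe(0o755))  # -rwxr-xr-x
```

Reading the output left to right after the first character gives the owner, group, and other permissions in turn.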

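The umask determines the default permissions assigned to newly created files and directories. Programs typically request a base mode of 666 for new files and 777 for new directories, and the system clears every permission bit that is set in the umask. For common values this looks like subtracting the umask from the base mode, but the operation is a bitwise mask rather than arithmetic subtraction.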

Understanding Umask Values

The umask value is represented by a three-digit octal number. Each digit represents the permissions that are removed from the default permissions for the owner, group owner, and everyone else.

For example, if the umask value is set to 022, it means that the write permission is removed for the group owner and everyone else. The default permissions for newly created files will be 644 (owner can read and write, group owner and everyone else can read), and for directories, it will be 755 (owner can read, write, and execute, group owner and everyone else can read and execute).
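The calculation can be sketched in a few lines of Python. This is an illustration only: the base modes 666 and 777 are the values programs conventionally request, and the function name `apply_umask` is made up for this example.

```python
FILE_BASE = 0o666  # base mode conventionally requested for new files
DIR_BASE = 0o777   # base mode conventionally requested for new directories

def apply_umask(base: int, umask: int) -> int:
    """Clear every permission bit that is set in the umask."""
    return base & ~umask

for mask in (0o022, 0o027, 0o077):
    print(
        f"umask {mask:03o}: files {apply_umask(FILE_BASE, mask):03o}, "
        f"directories {apply_umask(DIR_BASE, mask):03o}"
    )
# umask 022: files 644, directories 755
# umask 027: files 640, directories 750
# umask 077: files 600, directories 700
```

The masking is bitwise, which matters for values such as 033: the resulting file mode is 644, not the 633 that plain subtraction would suggest.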

Using Umask in Linux and Unix Systems

To set the umask value, you can use the umask command followed by the desired value. For example, to set the umask value to 022, you can run the following command:

umask 022

You can also set the umask value in the shell startup file (e.g., ~/.bashrc or ~/.bash_profile) to make it persistent across sessions.

Once you set the umask value, any new files or directories you create will have the default permissions calculated based on the umask value.

Umask Examples

Let’s take a look at some examples to understand how umask works in practice.

Example 1: Setting the Umask Value

Suppose you want to set the umask value to 027. You can run the following command:

umask 027

This sets the umask to 027: no permissions are removed for the owner, the write permission is removed for the group owner, and all permissions (read, write, and execute) are removed for everyone else.

Example 2: Creating a New File

Suppose you create a new file named example.txt after setting the umask value to 027. The default permissions for the file will be 640 (owner can read and write, group owner can read, and everyone else has no permissions).

touch example.txt
ls -l example.txt


Example 3: Creating a New Directory

Suppose you create a new directory named example after setting the umask value to 027. The default permissions for the directory will be 750 (owner can read, write, and execute, group owner can read and execute, and everyone else has no permissions).

mkdir example
ls -ld example
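The same behaviour can be observed from a program. The sketch below assumes Linux or another POSIX system; it sets the process umask with `os.umask`, creates a file and a directory in a temporary location, and reports their resulting modes. The function name `create_with_umask` is hypothetical.

```python
import os
import tempfile

def create_with_umask(mask: int) -> tuple[str, str]:
    """Create a file and a directory under `mask`; return their octal modes."""
    previous = os.umask(mask)  # os.umask returns the mask that was in effect
    try:
        with tempfile.TemporaryDirectory() as workdir:
            file_path = os.path.join(workdir, "example.txt")
            dir_path = os.path.join(workdir, "example")
            # Request the usual base modes; the system clears the umask bits.
            fd = os.open(file_path, os.O_CREAT | os.O_WRONLY, 0o666)
            os.close(fd)
            os.mkdir(dir_path, 0o777)
            # Keep only the nine rwx bits of each mode.
            file_mode = os.stat(file_path).st_mode & 0o777
            dir_mode = os.stat(dir_path).st_mode & 0o777
            return f"{file_mode:03o}", f"{dir_mode:03o}"
    finally:
        os.umask(previous)  # restore the caller's mask

print(create_with_umask(0o027))  # ('640', '750') on a POSIX system
```

Restoring the previous mask in the `finally` block keeps the change from leaking into the rest of the program.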



In summary, the umask is a per-process mask, set with the umask shell built-in, that determines the default permissions for newly created files and directories in Unix and Linux systems. Understanding how it works helps developers and system administrators ensure that new files and directories get the correct permissions. By setting the umask, you can control those defaults to match your specific requirements.


Understanding Neural Network: A Beginner’s Guide

The term “neural network” derives from the work of Warren S. McCulloch, a neuroscientist, and Walter Pitts, a logician, who developed the first conceptual model of an artificial neural network. In their work, they describe the concept of a neuron, a single cell living in a network of cells that receives inputs, processes those inputs, and generates an output. In the computing world, neural networks are organized in layers made up of interconnected nodes, each of which contains an activation function. Patterns are presented to the network through the input layer, which passes them to one or more hidden layers. The hidden layers perform the processing and pass the outcome to the output layer.

Neural networks are typically used to derive meaning from complex, non-linear data and to detect patterns that are hard for humans to notice. Some standard applications of neural networks today are:

Pattern, image, or object recognition

Time series forecasting and classification

Signal processing

Control in self-driving cars

Anomaly detection

These applications draw on different types of neural networks, such as convolutional neural networks, recurrent neural networks, and feed-forward neural networks. The first is mostly used in image recognition, as it uses a mathematical operation known as convolution to analyze images.

Let’s understand neural networks in R with a dataset. The dataset consists of 724 observations and 7 variables: “Companies.Changed”, “Experience.Score”, “Test.Score”, “Interview.Score”, “Qualification.Index”, “age”, and “Status”.

The following code fits a network that classifies Status as a function of the other six variables. Status refers to the recruitment outcome and has two levels: Selected and Rejected. To go ahead, we first need the “neuralnet” package installed; we then load it and read the data:

> library(neuralnet)
> HRAnalytics <- read.csv("filename.csv", stringsAsFactors = TRUE)   # keep Status as a factor on R 4.0 and later
> temp <- HRAnalytics

Now, remove rows containing NA from the data:

> temp <- na.omit(temp)
> dim(temp)   # 724 rows and 7 columns
[1] 724   7

Assign numeric levels to the Status column:

> y <- temp$Status
> levels(y) <- c(-1, +1)
> class(y)
[1] "factor"

Now convert the factor into a numeric column:

> y <- as.numeric(as.character(y))
> y <- as.data.frame(y)   # this line is truncated in the source; as.data.frame() is assumed so that names() sets the column name
> names(y) <- c("Status")

Remove the existing Status column and add the new one:

> temp$Status <- NULL
> temp <- cbind(temp, y)
> temp <- scale(temp)
> set.seed(100)
> n <- nrow(temp)

The dataset is split into a subset used for training the neural network and another used for testing. Here, 500 rows are sampled at random for training, and the remaining 224 rows are held out for testing.

> train <- sample(1:n, 500, FALSE)
> f <- Status ~ Companies.Changed + Experience.Score + Test.Score + Interview.Score + Qualification.Index + age

Now we’ll build a neural network with 3 hidden nodes and train it with backpropagation, that is, the backward propagation of error. The "rprop+" algorithm used here is resilient backpropagation with weight backtracking.

> fit <- neuralnet(f, data = temp[train, ], hidden = 3, algorithm = "rprop+")

Plot the neural network:

> plot(fit, intercept = FALSE, show.weights = TRUE)

The plot shows the six input nodes, the three hidden nodes, and the output node.

> z <- temp
> z <- z[, -7]

The compute function calculates the network outputs from the independent variables in the dataset. Now, let’s predict on the test data (-train):

> pred <- compute(fit, z[-train, ])
> sign(pred$net.result)

Now let’s create a simple confusion matrix:

> table(sign(pred$net.result), sign(temp[-train, 7]))
      -1   1
  -1 108  20
  1   36  60
> (108 + 60) / (108 + 20 + 36 + 60)
[1] 0.75

Here, the prediction accuracy is 75%.

I hope the above example helped you understand how neural networks tune themselves to find the right answer, increasing the accuracy of the predictions. As a rough guide, an accuracy above 80% is often considered acceptable, although the right threshold depends on the application.

Like any other technique, neural networks have certain limitations. One major limitation is that the data scientist or analyst has little role other than to feed the input, watch the network train, and collect the output. One article notes that “with backpropagation, you almost don’t know what you’re doing”. Setting the negatives aside, neural networks have a huge range of applications and are a promising and practical form of machine learning. In recent times, the best-performing artificial-intelligence systems in areas such as autonomous driving, speech recognition, computer vision, and automatic translation have all been aided by neural networks. Only time will tell how this field will develop and offer intelligent solutions to problems we have not yet thought of.
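The accuracy figure from the confusion matrix above can be double-checked with a short calculation. The sketch below is written in Python rather than R purely for illustration; the counts are copied from the table, and the function name `accuracy` is hypothetical.

```python
# Counts copied from the confusion matrix above.
# Keys are (predicted label, actual label).
confusion = {
    (-1, -1): 108,  # predicted -1, actual -1 (correct)
    (-1, 1): 20,    # predicted -1, actual  1
    (1, -1): 36,    # predicted  1, actual -1
    (1, 1): 60,     # predicted  1, actual  1 (correct)
}

def accuracy(matrix: dict[tuple[int, int], int]) -> float:
    """Share of test rows whose predicted label matches the actual label."""
    correct = sum(n for (pred, actual), n in matrix.items() if pred == actual)
    return correct / sum(matrix.values())

print(accuracy(confusion))  # 0.75
```

The four counts add up to 224, which matches the 724 rows minus the 500 used for training.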

Using Tigervnc In Ubuntu: A Comprehensive Guide

What is TigerVNC?

TigerVNC is a high-performance, platform-neutral implementation of Virtual Network Computing (VNC), a protocol that allows you to view and control the desktop of another computer remotely. TigerVNC is free and open-source software, available under the GNU General Public License.

TigerVNC provides several benefits, including:

High performance: TigerVNC is designed for efficient remote desktop access over low-bandwidth networks.

Security: TigerVNC supports encryption and authentication, ensuring that your remote desktop connection is secure.

Cross-platform compatibility: TigerVNC can be used to connect to Ubuntu from Windows, macOS, and other operating systems.

Installing TigerVNC in Ubuntu

Before we can use TigerVNC, we need to install it on our Ubuntu machine. Here are the steps to do so:

Open a terminal window by pressing Ctrl+Alt+T.

Install TigerVNC by running the following command:

sudo apt-get install tigervnc-standalone-server tigervnc-xorg-extension tigervnc-viewer

This command will install the TigerVNC server and viewer components.

Configuring TigerVNC in Ubuntu

After installing TigerVNC, we need to configure it to allow remote desktop access. Here are the steps to do so:

Open a terminal window by pressing Ctrl+Alt+T.

Run the following command to create a new VNC password:

vncpasswd

This command will prompt you to enter and confirm a new VNC password. This password will be used to authenticate remote desktop connections.

Edit the TigerVNC configuration file by running the following command:

sudo nano /etc/vnc.conf

Add the following lines to the end of the file:

Authentication=VncAuth

These lines tell TigerVNC to use VNC authentication and to use the password file we created earlier.

Save and close the file by pressing Ctrl+X, then Y, then Enter.

Starting the TigerVNC Server

Now that we have installed and configured TigerVNC, we can start the server and begin accepting remote desktop connections. Here are the steps to do so:

Open a terminal window by pressing Ctrl+Alt+T.

Start the TigerVNC server by running the following command:

vncserver

This command starts the TigerVNC server and creates a new virtual desktop session, identified by a display number.

Note the display number that is output by the command. It should be in the format :1, :2, etc. We will need this display number to connect to the remote desktop later.

Connecting to the Remote Desktop with TigerVNC Viewer

Now that the TigerVNC server is running, we can connect to the remote desktop using TigerVNC Viewer. Here are the steps to do so:

Download and install TigerVNC Viewer on the device you want to connect from. You can download it from the official website.

Open TigerVNC Viewer and enter the IP address or hostname of the Ubuntu machine in the "Remote Host" field.

Enter the display number we noted earlier in the "Display" field. For example, if the display number was :1, enter 1.

Enter the VNC password we created earlier in the "Password" field.
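When a firewall sits between the two machines, it helps to know which TCP port a display number corresponds to. By convention, a VNC server on display :N listens on port 5900 + N. The sketch below only illustrates that arithmetic; the function name `vnc_port` is hypothetical and not part of TigerVNC.

```python
VNC_BASE_PORT = 5900  # conventional base port for VNC displays

def vnc_port(display: str) -> int:
    """Convert a display such as ':1' (or '1') to its TCP port number."""
    number = int(display.lstrip(":"))
    if number < 0:
        raise ValueError("display number must be non-negative")
    return VNC_BASE_PORT + number

print(vnc_port(":1"))  # 5901
print(vnc_port(":2"))  # 5902
```

So for display :1, port 5901 on the Ubuntu machine needs to be reachable from the device running the viewer.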


TigerVNC is a powerful and flexible tool for remotely accessing Ubuntu desktops. By following the steps outlined in this article, you should now be able to install, configure, and use TigerVNC in Ubuntu. With TigerVNC, you can easily work on your Ubuntu machine from anywhere in the world, using any device that supports the VNC protocol.


Comprehensive Guide To Itil Lifecycle

Overview of ITIL Lifecycle

Information Technology Infrastructure Library (ITIL) is a framework whose main purpose is to improve the efficiency of a company’s IT department. Under ITIL, the department is no longer just back-office support; its IT staff act as service partners of the business. In this topic, we are going to learn about the ITIL lifecycle.

The ITIL is designed in such a way that the planning, selection, maintenance, and delivery of IT services of any business is systematized and standardized.


When a company decides to adopt ITIL, it requires trained and certified personnel to maintain it and to guide the company and its IT department. Microsoft, IBM, and Hewlett Packard Enterprise are companies that already use ITIL successfully.

Evolution of ITIL

In 1989 ITIL was introduced to standardize IT service management. This helped in streamlining services in organizations.

In 2001 ITIL v2 was introduced which included actual processes and a sound support system to benefit organizations.

In 2007, ITIL v3 arrived with guidelines for service design and operation. Feedback loops were also introduced to support continuous improvement.

In 2011, an update to ITIL v3 gave a broader perspective and added more focus on strategy.

In 2019, ITIL v4 was released with the aim of improving the role of IT management in a service economy. It offers practical guidance while drawing connections between ITIL and newer practices like DevOps.

Stages of ITIL Lifecycle

1. The Strategy of Service

This stage is of most importance as it is the crux of ITIL Lifecycle services. A uniform and rational strategy is the key to superior service management provided by any organization. The business goals of a company and the procedures followed by the IT department should be in sync. The objectives should be in alignment with the strategies.

So, the initial step here is to answer the following questions:

Who are the customers?

What services are required?

What sort of skill or qualifications are needed?

From where will the funds come and how will the delivery be done?

How will the monetary worth be determined?

Who will take the responsibility of the business relations?

What is the purpose of IT service management?

2. Design of the Service

In this stage, the strategies of stage 1 are converted into activity. Now the ideas are taken to the next step and planning and designing takes place. A time period is also pre-decided within which the service needs to be executed.

This process includes:

Understanding the strategy.

Creating a prospectus for the service

Ensuring that the policies and services are cost-efficient

Looking into the security system

3. Transition of Service

Once designed, the strategy is tested so that it is ready to be actually performed or we can say ready to be executed. This is the stage where the procedure is thoroughly checked so that there is no issue when it is finally presented to the customer.

This transition includes:

A new service to be implemented for every design.

Every design to be tested and displayed.

Any changes required for services to be managed.

Any risks to the services also are looked into.

Accomplishing the business target.

4. Service Operation

This is where the service is presented to the customer and is ready for operation. The service provider should ensure customer satisfaction here and is responsible for monitoring how the service performs. Any issues need to be reported. The goals of this stage are:

Services are delivered to sanctioned users.

The cost should be effective and quality enhanced.

The satisfaction of the user.

Business Enhancement.

5. Continued Improvement of Service

Though the planning, designing, and implementing of services are done meticulously, continuous monitoring is required so that all the strategic targets of the IT service are reached. Once these are reached, new targets can be set and the process can start again.

By ensuring the proper execution of each stage of the ITIL lifecycle, the company knows that its services and its business strategies are on the same page.

Guiding Principles of ITIL

These are a few rules of ITIL which might be common to other methods too.

Focusing on creating value directly or indirectly

Acknowledge what is good and work on the weaker aspects

Work on small projects, improvising while the job is being done and measuring the work done for future reference.

Transparency among team members, as well as with the shareholders and owners of the company, always proves beneficial to all concerned.

The undertaking of a project until its completion should have a holistic approach to it as this is the responsibility of the service value system (SVS)

The employment of resources, tools, procedures should be optimum and practical as time and finances both matter.

Human Resources should be involved only when necessary as it is easier with software.

ITIL looks at how the knowledge of the admins can be utilized for the benefit of the organization at a larger scale.

ITIL is beneficial as stated under:

The business and its IT department are aligned better as far as the goal is concerned.

Service is provided within the timeline and there are happy customers.

Resources are optimally utilized resulting in cost reduction.

There is more clarity on the cost and assets of IT.

The environment is more adjustable and open to changes which are very helpful.

ITIL proves to be a good infrastructure for businesses that don’t have a fixed service foundation but can pursue specialists to do the job in the best way possible.


Understanding The Photoshop Interface (Ultimate Guide)

Whether you have just started using Photoshop or are a seasoned user, understanding the Photoshop interface is essential for an efficient workflow. Opening the program is daunting for some, with all the panels, bars, and tools to learn. Knowing what each section does helps you understand the program better.

I will show you each element of the interface, including the Toolbar, Options bar, Layers panel, and more. You can use this as a starting point or a refresher on what you already know. I also include some valuable tricks to help you get the most out of Photoshop.

The Home Screen

Opening an image using Photoshop takes you directly to the workspace. However, when opening the program itself, you first see the Home Screen, which gives you options to start your project.

A. Create Or Open A Document

The panel on the left is where you can start a new project, open a saved project, or open an image to begin editing. 

Select New File to open the New Document window. This section is where you can set the canvas size, resolution, color mode, and more.

In the New Document window, select a default or saved preset from the top bar or input your own settings in the right-hand panel. Select Create when you are ready to open the document in the workspace.

If you have a file saved in a specific place, you can open it with the options at the bottom, such as cloud photos from Lightroom or images shared with you by your team members.

B. The Menu Bar

On the Home Screen, you can access some aspects of the Menu bar if you need them, such as using Automate to create an HDR image. I will explain the Menu bar in more detail later on.

C. Quick Access Panel

D. Photoshop Suggestions

The area at the top of the Home Screen offers tips to teach you how to carry out certain functions in Photoshop. You can hide these suggestions if you’d rather view more of your recent files.

The Photoshop Workspace Basics Explained

Once you have created a new document or selected an image, the Photoshop workspace is opened. In the workspace, you can see your canvas or artboards, select tools, make adjustments, and add effects to your project. Let’s find out the basics of this workspace.

1. The Document Window & Tabs

In the center of the Photoshop interface is the Document window, where your canvas is situated. The canvas is the area where the image and other elements are visible. The canvas size depends on the size of the image you opened or the document you created.

To create your project, you can add shapes, new layers, objects, and other images to the canvas.

Any enlarged elements that are too big to fit on the canvas will extend into the Pasteboard surrounding the image. The pasteboard’s purpose is to create a border around the image, hold the elements that extend beyond the canvas, and separate artboards when using the artboard feature.

You can also use the pasteboard to make creative edits to your image by enlarging a particular element, for example, adding a torn paper image to begin the process of making a photo look torn.

Around the pasteboard on the top and left-hand side is the Ruler, which you can choose to have visible or hidden. The ruler helps you align and measure elements and is a quick way to create a new guideline.

Then, at the bottom of the Document window is the Status bar, which gives you information about your document.

You can also change the zoom percentage by typing in a new number next to the status information.

Lastly, in the Document window are the Scrollbars, which let you move the document around in the window. To move the document, drag the bar on the right up or down and the bar at the bottom left or right.

2. The Toolbar & Expanded Panels

On the left-hand side of the Photoshop interface is the Toolbar, which holds all the tools you might need to access while working on your project.

Once the tool is activated, it has a darker gray around it, and you can start using the tool on your project.

The tools with an Expanded panel holding additional tools have a small triangle icon to indicate that more tools are available.

You can use the same icon to revert to the single-column Toolbar view.

Once the Customize Toolbar window opens, rearrange the tools as you’d like to.

Rich Tooltips

3. The Options Bar

The Options bar is near the top of the interface and changes based on which tool is selected. The purpose of the Options bar is to customize and provide settings for the activated tool.

Once you select a different tool, for example, the Quick Selection Tool (W), you will notice the Options bar change to provide the customization options for this tool.

The first option for each tool, situated next to the tool’s icon, is the Tool Preset picker, where you will find any default or saved presets for the selected tool.

4. The Menu Bar

The Menu bar — also known as the Application bar — is situated at the very top of the Photoshop interface. This bar holds multiple actions and commands grouped into various categories. You can alter the document settings, add effects to layers, change the image dimensions, and much more from the Menu bar. 

On the far right of the Menu bar (Win) or on the far left, just underneath the Menu bar (Mac), are the options to Close, Minimize, or Maximize the Photoshop window.

A helpful menu category in this bar is the View menu. This menu path provides you with several options to change how you view the document, such as zooming in and out, adding guidelines to the canvas, or changing the screen mode.

You can move through the different menu paths to see what each category holds and all the options the menus provide.

5. The Layers Panel

There are several types of panels in Photoshop, but the most important one is the Layers panel, which is situated on the right-hand side of the interface. The Layers panel contains information about all the layers in your document and is the key to making changes and organizing the layers in the document.

When you open an image in Photoshop, it automatically becomes a locked background layer. As you add different elements to your project, such as Text, Shapes, and Adjustments, they will all be added to the Layers panel above the image layer. 

Whenever you want to edit a specific layer, you need to ensure that it’s selected in this panel before you can make any changes to the layer. 

The Layers panel also has quick links for adding a layer mask, an adjustment layer, or a layer style to a particular layer. These icons are found at the bottom of the panel. From left to right, the icons are: 

Link Layers

Add a Layer Style

Add layer mask or vector mask 

Create new fill or adjustment layer

Create a new group

Create a new layer

Delete layer

For example, selecting Filter for type layers hides every other layer and leaves the one text layer I have on the document. These filters don’t affect what is visible on the canvas.

Below the filter icons are options to change the layer’s Blend Mode, Opacity, Fill, and options to lock elements or the entire layer.

As you add layers, they automatically appear above the layer you were previously on. The order of the layers in the panel directly relates to the order of the layers on the canvas. The layer order means that some layers may hide certain elements. 

6. Panels & Panel Tabs

Other than the Layers panel, several other panels offer more settings and options to modify different layers, tools, and effects for the project. The panels are situated on the right-hand side of the interface and may differ based on which panels are visible in your workspace.

Next to the panels are the Panel tabs, which contain more panel options hidden away. 

Customize And Organize The Photoshop Workspace

Photoshop organizes panels, tools, and windows in a default manner for all users. However, you can customize these elements in different ways. For example, you can dock, group, or stack panels. You can also hide or show different panels, the Toolbar, and other elements to create a customized workspace.

How To Move Panels In Photoshop

You can move panels and other windows around in Photoshop to better suit your workflow. By default, panels are either docked, stacked, or grouped in Photoshop.




The panel will now sit in the panel group or tab where you moved it.

How To Customize Windows And The Workspace In Photoshop

You can easily customize your workspace to match the projects on which you are working. For instance, you can have a photo editing workspace where you keep the tools you usually edit images with visible while hiding the rest of the tools. 

However, you may need multiple workspaces since you might not only work on one type of project the whole time. You can easily create and toggle between various workspaces.

The panels with a checkmark indicate that the panel is visible in the workspace. The panels without a checkmark are hidden, which includes panels that are stacked and grouped but not open. For example, the Color panel is checked, but the Gradients panel isn’t.

How To Customize Preferences In Photoshop

Another customization option in Photoshop is to customize various preferences, including the color of the pasteboard, hiding or revealing Rich Tooltips, and much more. There are so many customization options in Photoshop preferences that you can have a look through for yourself.

When the Preference window opens, you can select any tabs on the left-hand side and change the settings for tools, units, measurements, and much more.

For example, head to the Interface tab and change the color theme to a darker gray.

There are several other customization options in the Preferences window you can check out to create a workspace that works for you. With a solid understanding of the layout of Photoshop, it’s important to optimize the program to ensure it runs well before you start.

A Comprehensive Guide To Sharding In Data Engineering For Beginners

This article was published as a part of the Data Science Blogathon.

Big Data is a very commonly heard term these days. A reasonably large volume of data that cannot be handled on a small capacity configuration of servers can be called ‘Big Data’ in that particular context. In today’s competitive world, every business organization relies on decision-making based on the outcome of the analyzed data they have on hand. The data pipeline starting from the collection of raw data to the final deployment of machine learning models based on this data goes through the usual steps of cleaning, pre-processing, processing, storage, model building, and analysis. Efficient handling and accuracy depend on resources like software, hardware, technical workforce, and costs. Answering queries requires specific data probing in either static or dynamic mode with consistency, reliability, and availability. When data is large, inadequacy in handling queries due to the size of data and low capacity of machines in terms of speed, memory may prove problematic for the organization. This is where sharding steps in to address the above problems.

This guide explores the basics and various facets of data sharding, the need for sharding, and its pros, and cons.

What is Data Sharding?

With the increasing use of information technology, data is accumulating at an overwhelming pace. Companies leverage this big data for data-driven decision-making. However, as the data grows, system performance suffers because queries become very slow when the entire dataset is stored in a single database. This is why data sharding is required.

Image Source: Author

In simple terms, sharding is the process of dividing a single logical dataset and storing it in databases that are distributed across multiple computers. This way, when a query is executed, only a few computers in the network need to be involved in processing it, and the system responds faster. With increased traffic, scaling the databases becomes non-negotiable to cope with the increased demand. Furthermore, several sharding solutions allow for the inclusion of additional computers. Sharding allows a database cluster to grow with the amount of data and traffic received.

Let’s look at some key terms used in the sharding of databases.

Scale-out and Scaling up: Scale-out is the process of adding or removing databases horizontally (that is, adding or removing servers) to improve performance and increase capacity. Scaling up refers to the practice of adding physical resources, like memory, storage, and CPU, to an existing database server to improve performance.

Sharding: Sharding distributes similarly-formatted large data over several separate databases.

Chunk: A chunk is a subset of sharded data, bounded by a lower and an upper value of the shard key.

Shard: A shard is a horizontally distributed portion of data in a database. Data collections with the same partition keys are called logical shards, which are then distributed across separate database nodes.

Sharding Key: A sharding key is a column of the database to be sharded. This key is responsible for partitioning the data. It can be either a single indexed column or multiple columns whose combined value determines how the data is divided between the shards. A primary key can be used as a sharding key. However, a sharding key does not have to be a primary key. The choice of the sharding key depends on the application. For example, userID could be used as a sharding key in banking or social media applications.

Logical shard and Physical Shard: A chunk of the data with the same shard key is called a logical shard. When a single server holds one or more than one logical shard, it is called a physical shard.

Shard replicas: These are the copies of the shard and are allotted to different nodes.

Partition Key: It is a key that defines the pattern of data distribution in the database. Using this key, it is possible to direct the query to the concerned database for retrieving and manipulating the data. Data having the same partition key is stored in the same node.

Replication: It is a process of copying and storing data from a central database at more than one node.

Resharding: It is the process of redistributing the data across shards to adapt to the growing size of data.

Are Sharding and Partitioning the same?

Both Sharding and Partitioning split a database into smaller datasets. However, they are not the same. Sharding distributes the data across several machines, whereas partitioning does not: partitioning is the process of grouping subsets of data within a single unsharded database. That said, the two words are used interchangeably when the terms “horizontal” and “vertical” are placed before them. As a result, “horizontal sharding” and “horizontal partitioning” are interchangeable terms.

Vertical Sharding:

In a vertically partitioned table, entire columns are split off and placed in new, separate tables. The data in one vertical split is different from the data in the others: each split holds its own set of columns, and the splits are usually linked to each other through a shared key column.

Horizontal Sharding:

Horizontal sharding or horizontal partitioning divides a table’s rows into multiple tables or partitions. Every partition has the same schema and columns but distinct rows. Here, the data stored in each partition is distinct and independent of the data stored in other partitions.
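The difference between the two kinds of split can be sketched in a few lines of Python. This is only an illustration: the sample rows, the column names, and the helper functions horizontal_split and vertical_split are made up for this example and do not come from any database product.

```python
# Hypothetical sample rows; column names are invented for illustration.
rows = [
    {"customer_id": 1, "name": "Asha", "country": "India", "order_value": 120},
    {"customer_id": 2, "name": "Ben", "country": "USA", "order_value": 340},
    {"customer_id": 3, "name": "Chen", "country": "China", "order_value": 75},
    {"customer_id": 4, "name": "Dana", "country": "USA", "order_value": 510},
]


def horizontal_split(table, num_parts):
    """Split by rows: every partition keeps all columns but different rows."""
    parts = [[] for _ in range(num_parts)]
    for row in table:
        parts[row["customer_id"] % num_parts].append(row)
    return parts


def vertical_split(table, key, column_groups):
    """Split by columns: every partition keeps the key plus one column group."""
    return [
        [{col: row[col] for col in [key, *group]} for row in table]
        for group in column_groups
    ]


h_parts = horizontal_split(rows, 2)
v_parts = vertical_split(rows, "customer_id", [["name", "country"], ["order_value"]])

for part in h_parts:
    print([row["customer_id"] for row in part])
for part in v_parts:
    print(sorted(part[0].keys()))
```

Each horizontal partition has the full schema but only some of the rows, while each vertical partition has every row but only some of the columns, tied together by customer_id.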

The image below shows how a table can be partitioned both horizontally and vertically.

The Process

Before sharding a database, it is essential to evaluate the requirements for selecting the type of sharding to be implemented.

At the start, we need to have a clear idea about the data and how the data will be distributed across shards. The answer is crucial as it will directly impact the performance of the sharded database and its maintenance strategy.

Next, the nature of queries that need to be routed through these shards should also be known. For read queries, replication is a better and more cost-effective option than sharding the database. On the other hand, workload involving writing queries or both read and write queries would require sharding of the database. And the final point to be considered is regarding shard maintenance. As the accumulated data increases, it needs to be distributed, and the number of shards keeps on growing over time. Hence, the distribution of data in shards requires a strategy that needs to be planned ahead to keep the sharding process efficient.

Types of Sharding Architectures

Once you have decided to shard the existing database, the following step is to figure out how to achieve it. It is crucial that during query execution or distributing the incoming data to sharded tables/databases, it goes to the proper shard. Otherwise, there is a possibility of losing the data or experiencing noticeably slow searches. In this section, we will look at some commonly used sharding architectures, each of which has a distinct way of distributing data between shards. There are three main types of sharding architectures – Key or Hash-Based, Range Based, and Directory-Based sharding.

To understand these sharding strategies, say there is a company that handles databases for its client who sell their products in different countries. The handled database might look like this and can often extend to more than a million rows.

We will take a few rows from the above table to explain each sharding strategy.

So, to store and query these databases efficiently, we need to implement sharding on these databases for low latency, fault tolerance, and reliability.

Key Based Sharding

Key Based Sharding, also called Hash-Based Sharding, uses a value from a column of the data (such as a customer ID, a customer IP address, or a ZIP code) to generate a hash value that decides how the database is sharded. The selected table column is the shard key. All row values in the shard key column are passed through the hash function.

A hash function is a mathematical function that takes an input of any size (usually a combination of numbers and strings) and returns a fixed-size output called a hash value. The same input always produces the same hash value, although two different inputs can occasionally produce the same one. The shard number is derived from the hash value, which depends on the chosen algorithm (selected according to the data and application), and from the total number of available shards. The result indicates which shard the data should be sent to.

It is important to remember that a shard key needs to be both unique and static, i.e., it should not change over a period of time. Otherwise, it would increase the amount of work required for update operations, thus slowing down performance.
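A minimal sketch of this routing step in Python is shown below. It assumes MD5 as the hash algorithm and a simple modulo over the number of shards; both choices, the function name shard_for, and the sample customer IDs are illustrative assumptions, not something prescribed by a particular database.

```python
import hashlib


def shard_for(shard_key, num_shards):
    """Map a shard key to a shard number in the range [0, num_shards)."""
    # hashlib gives a hash that is stable across runs and machines,
    # unlike Python's built-in hash() for strings.
    digest = hashlib.md5(str(shard_key).encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


NUM_SHARDS = 4  # assumed cluster size for the example
customer_ids = ["C1001", "C1002", "C1003"]  # hypothetical shard key values
placement = {cid: shard_for(cid, NUM_SHARDS) for cid in customer_ids}
print(placement)
```

Because the function is deterministic, a read for a given customer ID is routed to the same shard that received the write, without consulting any lookup table.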

The Key Based Sharding process looks like this:

Image Source: Author

Features of Key Based Sharding are-

Hash keys are easy to generate with standard algorithms. A good hash function spreads the data roughly evenly among the available shards, so this approach is good at load balancing.

As all shards share the same load, it helps to avoid database hotspots (when one shard contains excessive data as compared to the rest of the shards).

Additionally, in this type of sharding, there is no need to have any additional map or table to hold the information of where the data is stored.

However, it is not dynamic sharding, and it can be difficult to add or remove extra servers from a database depending on the application requirement. The adding or removing of servers requires recalculating the hash key. Since the hash key changes due to a change in the number of shards, all the data needs to be remapped and moved to the appropriate shard number. This is a tedious task and often challenging to implement in a production environment.
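The cost of changing the number of shards can be estimated with a short experiment. The sketch below assumes the same kind of "hash modulo number of shards" routing described above, with MD5 as an arbitrary choice of hash; the key names and the helper moved_fraction are hypothetical.

```python
import hashlib


def shard_for(key, num_shards):
    """Hash-modulo routing: the shard number depends on the shard count."""
    digest = hashlib.md5(str(key).encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


def moved_fraction(keys, old_count, new_count):
    """Fraction of keys whose shard changes when the shard count changes."""
    moved = sum(
        1 for k in keys if shard_for(k, old_count) != shard_for(k, new_count)
    )
    return moved / len(keys)


keys = [f"customer-{i}" for i in range(10_000)]  # hypothetical shard keys
fraction = moved_fraction(keys, old_count=4, new_count=5)
print(f"{fraction:.0%} of keys move when going from 4 to 5 shards")
```

With this scheme, a key stays put only when its hash gives the same remainder for both shard counts, so for 4 and 5 shards roughly four out of five keys are expected to move. That is the remapping burden the paragraph above refers to.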

To address the above shortcoming of Key Based Sharding, a ‘Consistent Hashing’ strategy can be used.

Consistent Hashing-

In this strategy, hash values are generated both for the data and for the shards: the data hash is based on the number generated for the data, and the shard hash is based on the IP address of the shard machine. Both sets of hash values are placed on a ring, or circle, that represents the full range of possible hash values. Each data item is then paired with the shard whose hash value is nearest to it when moving around the ring in one agreed direction, either clockwise or anti-clockwise.

The data is loaded according to this pairing. Whenever a shard is removed, the data it held is reassigned to the nearest remaining shard on the ring. A similar procedure is followed when a shard is added: the new shard takes over only the data that now falls nearest to it. This greatly reduces the remapping and reorganization problem of the plain hash key strategy, because only a small part of the data has to move. For example, if a change in the number of shards forces Key Based Sharding to shuffle data on 3 out of 4 shards, consistent hashing would require shuffling on fewer shards. Moreover, overloading of a shard can be addressed by adding replicas of that shard.
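The ring can be sketched in Python with a sorted list of shard positions. This is a simplified illustration under stated assumptions: MD5 is used to place both keys and shards on the ring, keys are assigned clockwise to the next shard position, and there are no virtual nodes or replicas. The HashRing class, the IP addresses, and the key names are hypothetical.

```python
import bisect
import hashlib


def ring_position(value):
    """Place a string (data key or shard address) on the ring."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class HashRing:
    def __init__(self, shards):
        self._positions = []  # sorted ring positions of the shards
        self._owners = {}     # ring position -> shard address
        for shard in shards:
            self.add(shard)

    def add(self, shard):
        pos = ring_position(shard)
        bisect.insort(self._positions, pos)
        self._owners[pos] = shard

    def remove(self, shard):
        pos = ring_position(shard)
        self._positions.remove(pos)
        del self._owners[pos]

    def shard_for(self, key):
        """Walk clockwise from the key to the next shard, wrapping around."""
        idx = bisect.bisect_right(self._positions, ring_position(key))
        return self._owners[self._positions[idx % len(self._positions)]]


ring = HashRing(["10.0.0.1", "10.0.0.2", "10.0.0.3", "10.0.0.4"])
keys = [f"customer-{i}" for i in range(1000)]
before = {k: ring.shard_for(k) for k in keys}

ring.remove("10.0.0.3")
after = {k: ring.shard_for(k) for k in keys}
moved = [k for k in keys if before[k] != after[k]]
print(f"{len(moved)} of {len(keys)} keys moved")
```

Only the keys that lived on the removed shard change owner; every other key stays where it was. Production systems typically place each shard at many positions on the ring so that load is spread more evenly, which this sketch leaves out.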

Range Based Sharding

Range Based Sharding is the process of sharding data based on value ranges. Using our previous database example, we can make a few distinct shards using the Order value amount as a range (lower value and higher value) and divide customer information according to the price range of their order value, as seen below:

Image source: Author

Features of Range Based Sharding are-

There is no hashing function involved. Hence, it is possible to easily add more machines or reduce the number of machines, and there is no need to shuffle or reorganize the data.

On the other hand, this type of sharding does not ensure evenly distributed data. It can result in overloading a particular shard, commonly referred to as a database hotspot.
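Range routing can be sketched as a lookup in a sorted list of boundaries. The boundaries, shard names, and order values below are invented for illustration; a real system would choose its ranges from the actual data.

```python
import bisect
from collections import Counter

# Hypothetical exclusive upper bounds of each order-value range.
UPPER_BOUNDS = [100, 500, 1000]
# One more shard than bounds: the last shard takes everything above 1000.
SHARD_NAMES = ["shard_low", "shard_mid", "shard_high", "shard_top"]


def shard_for_order(order_value):
    """Pick the shard whose value range contains order_value."""
    return SHARD_NAMES[bisect.bisect_right(UPPER_BOUNDS, order_value)]


orders = [20, 35, 60, 80, 95, 250, 1200]  # hypothetical order values
load = Counter(shard_for_order(v) for v in orders)
print(load)
```

In this made-up sample most orders are small, so shard_low receives five of the seven rows while shard_high receives none. This is the kind of uneven distribution that leads to a hotspot.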

Directory-Based Sharding

This type of sharding relies on a lookup table (with the specific shard key) that keeps track of the stored data and the concerned allotted shards. It tells us exactly where the particular queried data is stored or located on a specific shard. This lookup table is maintained separately and does not exist on the shards or the application. The following image demonstrates a simple example of a Directory-Based Sharding.

Features of Directory-Based Sharding are –

The Directory-Based Sharding strategy is highly adaptable. Range Based Sharding is limited to ranges of values, and Key Based Sharding depends heavily on a fixed hash function, which makes it challenging to alter later if application requirements change. Directory-Based Sharding, in contrast, enables us to use any method or technique to allocate data entries to shards, and it is convenient to add or remove shards dynamically.

One downside of this type of sharding architecture is that the lookup table must be consulted before every query or write, which may increase latency.

Furthermore, if the lookup table gets corrupted or becomes unavailable, the whole system can fail at that instant, which makes the table a single point of failure. This risk can be reduced by securing the lookup table and keeping a backup of it for such events.
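A lookup table of this kind can be sketched as a small mapping from shard key to shard. The ShardDirectory class, the use of country as the shard key, and the shard names are assumptions made for the example; in practice the directory would live in its own highly available store rather than in application memory.

```python
class ShardDirectory:
    """A toy lookup table that maps each shard key value to a shard."""

    def __init__(self):
        self._lookup = {}

    def assign(self, shard_key, shard):
        """Register or change the shard that holds a given key."""
        self._lookup[shard_key] = shard

    def locate(self, shard_key):
        """Return the shard for a key; every read and write starts here."""
        try:
            return self._lookup[shard_key]
        except KeyError:
            raise LookupError(f"no shard registered for {shard_key!r}") from None


directory = ShardDirectory()
directory.assign("India", "shard_a")  # hypothetical key-to-shard mapping
directory.assign("USA", "shard_b")
directory.assign("China", "shard_a")

print(directory.locate("USA"))

# Rebalancing only needs a directory update (plus moving the rows themselves).
directory.assign("China", "shard_c")
print(directory.locate("China"))
```

The flexibility comes from the fact that any rule can be used to fill the table. The cost is the extra lookup on every operation and the dependence on the table being available and correct.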

Other than the three main sharding strategies discussed above, there can be many more sharding strategies that are usually a combination of these three.

After this detailed sharding architecture overview, we will now understand the pros and cons of sharding databases.

Benefits of Sharding

Horizontal Scaling: For any non-distributed database on a single server, there will always be a limit to storage and processing capacity. The ability of sharding to extend horizontally makes the arrangement flexible to accommodate larger amounts of data.

Speed: Another reason a sharded database design is preferred is that it improves query response times. When a query is submitted to a non-sharded database, it may have to search every row in the table before finding the result set you are looking for. Queries can become prohibitively slow in an application with a single unsharded database. By sharding a single table into multiple tables, however, queries pass through fewer rows, and their results are delivered considerably faster.

Reliability: Sharding can help to improve application reliability by reducing the effect of system failures. If a program or website is dependent on an unsharded database, a failure might render the entire application inoperable. An outage in a sharded database, on the other hand, is likely to affect only one shard. Even if this causes certain users to be unable to access some areas of the program or website, the overall impact would be minimal.

Challenges in Sharding

While sharding a database might facilitate growth and enhance speed, it can also impose certain constraints. We will go through some of them and why they could be strong reasons to avoid sharding entirely.

Increased complexity: Companies usually face the challenge of complexity when designing a sharded database architecture. There is a risk that the sharding operation will result in lost data or damaged tables if done incorrectly. Even if done correctly, shard maintenance and organization are likely to significantly influence the outcome.

Shard Imbalancing: Depending on the sharding architecture, the distribution of data across shards can become imbalanced as traffic comes in. The data then has to be remapped or reorganized among the shards, which is time-consuming and expensive.

Unsharding or restoring the database: Once a database has been sharded, it can be complicated to restore it to its earlier version. Backups of the database produced before it was sharded will not include data written after partitioning. As a result, reconstructing the original unsharded architecture would need either integrating the new partitioned data with the old backups or changing the partitioned database back into a single database, both of which would be undesirable.

Not supported by all databases: Not every database engine natively supports sharding. Popular databases currently available include MySQL, PostgreSQL, Cassandra, MongoDB, HBase, and Redis, among others. Some of them, such as MongoDB and MySQL (through MySQL Cluster), provide an auto-sharding feature, while others need sharding to be implemented at the application level or through additional tooling. As a result, the sharding strategy has to be customized to suit both the database in use and the application requirements.

Now that we have discussed the pros and cons of sharding databases, let us explore situations when one should select sharding.

When should one go for Sharding?

When the application data outgrows the storage capacity and can no longer be stored as a single database, sharding becomes essential.

When the volume of reading/writing exceeds the current database capacity, this results in a higher query response time or timeouts.

When a slow response is experienced while reading the read replicas, it indicates that the network bandwidth required by the application is higher than the available bandwidth.

Outside of the above situations, it may be possible to optimize the database instead. Options include carrying out server upgrades, implementing caches, creating one or more read replicas, setting up remote databases, etc. Sharding should be considered only when these options cannot solve the problem of increased data.


We have covered the fundamentals of sharding in Data Engineering and by now have developed a good understanding of this topic.

With sharding, businesses can add horizontal scalability to the existing databases. However, it comes with a set of challenges that need to be addressed. These include considerable complexity, possible failure points, a requirement for additional resources, and more. Thus, sharding is essential only in certain situations.

I hope you enjoyed reading this guide! In the next part of this guide, we will cover how sharding is implemented step-by-step using MongoDB.

Author Bio

She loves traveling, reading fiction, solving Sudoku puzzles, and participating in coding competitions in her leisure time.

You can follow her on LinkedIn, GitHub, Kaggle, Medium, Twitter.

The media shown in this article is not owned by Analytics Vidhya and are used at the Author’s discretion.

