Excel Vlookup Tutorial For Beginners: Step

Updated in December 2023.

What is VLOOKUP?

VLOOKUP (V stands for ‘Vertical’) is a built-in function in Excel that lets you establish a relationship between different columns of a worksheet. In other words, it allows you to find (look up) a value in one column of data and return its corresponding value from another column.


Usage of VLOOKUP:

Let’s take an example:

You start with the information that is already available:

(In this case, the Employee’s Name)

To find the information you don’t know:

(In this case, we want to look up the Employee’s Salary)


By applying VLOOKUP, the value (Employee’s Salary) corresponding to the Employee’s Code will be displayed.

How to use VLOOKUP function in Excel

Following is a step-by-step guide on how to apply the VLOOKUP function in Excel:

Enter the VLOOKUP function in the cell where you want the result to appear: start with an equal sign, which tells Excel that a formula is being entered, followed by the keyword VLOOKUP and a pair of parentheses: =VLOOKUP()

The parentheses will contain the set of arguments (arguments are the pieces of data that a function needs in order to execute).

VLOOKUP uses four arguments or pieces of data:

The first argument is the lookup value: the cell reference (as a placeholder) for the value that needs to be searched. The lookup value is the data you already know. (In this case, Employee Code is the lookup value, so the first argument will be H2; the value to be looked up will be entered in cell H2.)

The second argument is the block of values to be searched. In Excel, this block is known as the table array or the lookup table. In our instance, the lookup table runs from cell B2 to E25, i.e., the complete block in which the corresponding value will be searched.

NOTE: The lookup values, i.e., the data you know, have to be in the left-hand column of your lookup table (your cell range).

The third argument is the column reference. It tells VLOOKUP where you expect to find the data you want to view. (The column reference is the index, within the lookup table, of the column where the corresponding value is found.) In this case, the column reference is 4, as the Employee’s Salary column is the fourth column of the lookup table.

The last argument is the range lookup. It tells VLOOKUP whether we want an approximate match or an exact match to the lookup value. In this case, we want the exact match (the FALSE keyword).

FALSE: Refers to an exact match.

TRUE: Refers to an approximate match.

Putting the four arguments together, the formula is =VLOOKUP(H2, B2:E25, 4, FALSE).

Press ‘Enter’ to complete the function. At first you get an error (#N/A) because no value has been entered in cell H2, i.e., no Employee Code has been entered for VLOOKUP to look up.

However, as soon as you enter an Employee Code in H2, the formula returns the corresponding value, i.e., the Employee’s Salary.

In brief, the VLOOKUP formula tells Excel that the values we know are in the left-hand column of the data, i.e., the Employee’s Code column. Excel looks through the lookup table (the range of cells) for that code and returns the value from the fourth column on the same row, i.e., the Employee’s Salary for that Employee’s Code.

The above instance covered exact matches in VLOOKUP, i.e., the FALSE keyword as the last parameter.
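For readers who find code easier to follow than prose, the exact-match behaviour described above can be sketched in Python. This is only an illustration of the logic, not how Excel is implemented; the function name vlookup_exact and the employee rows are hypothetical stand-ins for the range B2:E25.

```python
def vlookup_exact(lookup_value, table, col_index):
    """Rough model of =VLOOKUP(lookup_value, table, col_index, FALSE)."""
    for row in table:
        # VLOOKUP only compares against the leftmost column of the table.
        if row[0] == lookup_value:
            # col_index is 1-based, as in Excel.
            return row[col_index - 1]
    # Excel shows #N/A when no exact match is found.
    return "#N/A"


# Hypothetical rows standing in for B2:E25:
# (Employee Code, Name, Designation, Salary)
employees = [
    (101, "Asha", "Analyst", 52000),
    (102, "Ben", "Manager", 78000),
    (103, "Chen", "Engineer", 66000),
]

salary = vlookup_exact(102, employees, 4)   # like typing 102 into H2
missing = vlookup_exact(999, employees, 4)  # a code that is not in the table
print(salary, missing)
```

The first matching row wins, and the column index counts from the left edge of the table, which is why the lookup values must sit in the left-hand column.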

VLOOKUP for Approximate Matches (TRUE Keyword as the last parameter)

Consider a scenario where a table lists discounts only for certain quantities, but customers do not always buy exactly tens or hundreds of items.

As shown below, a company offers discounts based on the quantity of items purchased, ranging from 1 to 10,000:


A customer will not always buy exactly tens, hundreds, or thousands of items. In this case, the discount is applied using VLOOKUP’s approximate match. In other words, we do not want to limit matches to just the values present in the column (1, 10, 100, 1000, 10000). Here are the steps:

Step 2) Enter ‘=VLOOKUP()’ in the cell. Inside the parentheses, enter the set of arguments for the above instance.

Step 3) Argument 1: Enter the reference of the cell whose value will be looked up in the lookup table.

Step 4) Argument 2: Choose the lookup table or the table array in which you want VLOOKUP to search for the corresponding value. (In this case, choose the columns Quantity and Discount.)

Step 5) Argument 3: The third argument is the index of the column in the lookup table from which the corresponding value is returned.

Step 6) Argument 4: The last argument is the condition for approximate or exact matches. In this instance, we are looking for approximate matches (the TRUE keyword).

Step 7) Press ‘Enter.’ The VLOOKUP formula is applied to the cell, and when you enter any number in the quantity field, it shows the discount based on the approximate match.

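NOTE: If you want to use TRUE as the last parameter, you can leave it out; VLOOKUP uses TRUE (approximate match) by default. For approximate matches to work correctly, the first column of the lookup table must be sorted in ascending order.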
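The approximate-match rule can be sketched in Python as "take the last row whose key is less than or equal to the lookup value". This is a model of the behaviour rather than Excel's own code. The quantity breakpoints come from the example above, but the discount percentages and the name vlookup_approx are assumptions made up for illustration.

```python
from bisect import bisect_right


def vlookup_approx(lookup_value, table, col_index):
    """Rough model of =VLOOKUP(lookup_value, table, col_index, TRUE).

    Assumes the first column of `table` is sorted in ascending order.
    """
    keys = [row[0] for row in table]
    # Number of keys that are <= lookup_value.
    pos = bisect_right(keys, lookup_value)
    if pos == 0:
        # The lookup value is smaller than every key in the table.
        return "#N/A"
    # Return the requested column (1-based) from the last qualifying row.
    return table[pos - 1][col_index - 1]


# Quantities are from the example; discount rates are hypothetical.
discounts = [
    (1, 0.00),
    (10, 0.05),
    (100, 0.10),
    (1000, 0.15),
    (10000, 0.20),
]

print(vlookup_approx(250, discounts, 2))  # falls in the 100 tier
print(vlookup_approx(10, discounts, 2))   # exactly on a breakpoint
```

A quantity of 250 falls back to the 100 row because 100 is the largest breakpoint that does not exceed it.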

Vlookup function applied between 2 different sheets placed in the same workbook

Let’s look at an instance similar to the above scenario. We are given one workbook containing two sheets. One sheet lists each Employee’s Code along with the Employee’s Name and Designation; the other sheet contains the Employee’s Code and the respective Employee’s Salary (as shown below).




Now the objective is to view all the data on one sheet, i.e., Sheet 1, as below:

VLOOKUP can help us aggregate all the data so that we can see the Employee’s Code, Name, and Salary in one place.

We start with Sheet 2, as that sheet gives us two of the VLOOKUP arguments: the Employee’s Salary, which VLOOKUP has to find, is listed in Sheet 2, and its column index is 2 (as per the lookup table).

Also, we know we want to find the Employee’s Salary corresponding to the Employee’s Code.

That data starts in A2 and ends in B25, so that range is our lookup table or table array argument.

Step 1) Navigate to Sheet 1 and enter the respective headings as shown.

Step 2) Enter the VLOOKUP function: =VLOOKUP().

Step 3) Argument 1: Enter the cell reference that contains the value to be searched in the lookup table. In this case, ‘F2’ is the cell that will contain the Employee’s Code to match against the lookup table.

Step 4) Argument 2: In the second argument, we enter the lookup table or the table array. In this instance, the lookup table is in another sheet of the same workbook, so we enter its address as Sheet2!A2:B25 (A2:B25 refers to the lookup table in Sheet 2).

Step 5) Argument 3: The third argument is the index of the column in the lookup table that holds the values to return (here, 2).

Step 6) Argument 4: The last argument selects exact matches (FALSE) or approximate matches (TRUE). In this instance, we want an exact match for the Employee’s Code.

Step 7) Press Enter. When you enter an Employee’s Code in the cell, the formula =VLOOKUP(F2, Sheet2!A2:B25, 2, FALSE) returns the corresponding Employee’s Salary.
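Looking up values from a second sheet amounts to joining two tables on the Employee’s Code. The Python sketch below shows that idea under stated assumptions: the lists sheet1 and sheet2 and all of their rows are hypothetical, and a dictionary stands in for the lookup table Sheet2!A2:B25.

```python
# Hypothetical Sheet 1 rows: (Employee Code, Name, Designation)
sheet1 = [
    (101, "Asha", "Analyst"),
    (102, "Ben", "Manager"),
    (104, "Dee", "Intern"),
]

# Hypothetical Sheet 2 rows: (Employee Code, Salary)
sheet2 = [
    (101, 52000),
    (102, 78000),
    (103, 66000),
]

# Index Sheet 2 by code. setdefault keeps the first occurrence of a code,
# mirroring the way an exact-match VLOOKUP returns the first matching row.
salary_by_code = {}
for code, salary in sheet2:
    salary_by_code.setdefault(code, salary)

# For every row in Sheet 1, pull the salary from Sheet 2 (column index 2).
combined = [
    (code, name, salary_by_code.get(code, "#N/A"))
    for code, name, _designation in sheet1
]

for row in combined:
    print(row)
```

Code 104 has no row in the second table, so it ends up with "#N/A", just as the formula would show for a code missing from Sheet 2.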


The three scenarios above show how the VLOOKUP function works. You can experiment with more instances of your own. VLOOKUP is an important feature of MS Excel that allows you to manage data more efficiently.


Step By Step Guide To Create Search Box In Excel

Search Box in Excel

Search Box in Excel is a customized feature that lets you easily locate and highlight specific data in a worksheet. It’s like searching for a book in a library. If you know the book’s title, you can search for it in the catalog instead of searching through every book on the shelves.

Similarly, the Search Box lets you quickly locate specific words or numbers in a large dataset. It helps you find what you need without manually searching through everything.

Search Box in Excel Syntax

=SEARCH(search_text, within_text, [start_num])


search_text (required argument): This is the text or substring you want to search for within the larger text string.

within_text (required argument): This is the text string in which to search for search_text.

start_num (optional argument): This is the starting position from which you want to begin the search. If omitted, Excel assumes it to be 1 (the beginning of the text string).

Please remember that the SEARCH function in Excel is case-insensitive, which means it will not distinguish between uppercase and lowercase letters. If a case-sensitive search is necessary, use the FIND function, which has a similar syntax but is case-sensitive.
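To make the three arguments concrete, here is a small Python sketch of the behaviour described above: a case-insensitive search that reports a 1-based position and honours start_num. The helper name excel_search is hypothetical, it returns None where Excel would show an error, and it leaves out the wildcard handling that the real SEARCH function also supports.

```python
def excel_search(search_text, within_text, start_num=1):
    """Rough model of =SEARCH(search_text, within_text, [start_num])."""
    if start_num < 1:
        return None  # Excel reports an error for an invalid start position
    # Lower-casing both sides makes the comparison case-insensitive.
    idx = within_text.lower().find(search_text.lower(), start_num - 1)
    # Excel positions are 1-based; Python's str.find is 0-based.
    return idx + 1 if idx != -1 else None


text = "Red Scooter 2019"
print(excel_search("scooter", text))  # case is ignored
print(excel_search("o", text))        # first "o"
print(excel_search("o", text, 8))     # search starts at position 8
print(excel_search("bike", text))     # not found
```

By contrast, a case-sensitive search in the style of FIND would not match "scooter" against "Scooter" here.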

How to Create a Search Box in Excel?

Now, let’s use some examples to understand how to create your own Search Box in Excel.

Examples of Search Box in Excel


Example #1

Consider that you have the data of a company that sells and purchases used vehicles. However, the enormous amount of data makes searching for a particular car model name difficult. To simplify this process, you want to create a search box in Excel that highlights all values related to “Scooter” instead of manually searching through every cell.


1. Open an Excel worksheet and go to the cell where you wish to create the search box. Here, we have selected G1 as the search box cell. You can highlight the selected cell to distinguish it from other cells.

The formula used in this example is explained below.

This simplifies the process of looking for any value. For example, after highlighting the fields related to Scooter, we can further refine our search results by applying a filter based on the color of those highlighted fields.

Now let’s understand the meaning of the parameters used in the Search formula and how it worked for us in Excel.

Explanation of Formula

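Let’s have a look at each parameter individually.

=SEARCH($G$1, $A2&$B2&$C2&$D2)

1. $G$1

This is the search box cell, which holds the value we want to search for. The $ signs make it an absolute reference, so every row is checked against the same cell.

2. $A2&$B2&$C2&$D2

This is the text string within which we want to search for the value specified in G1. The “&” symbol joins (concatenates) the values in cells A2, B2, C2, and D2 into one string.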

This is how it can help simplify the search process.
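The row-highlighting rule can be sketched in Python as follows. This is an illustrative model of the formula =SEARCH($G$1, $A2&$B2&$C2&$D2), not Excel's conditional formatting engine; the function row_matches and the vehicle rows are hypothetical.

```python
def row_matches(search_box, row):
    """Return True when the search text occurs anywhere in the joined row."""
    # Equivalent of $A2&$B2&$C2&$D2: join the cells into one string.
    joined = "".join(str(cell) for cell in row)
    # SEARCH ignores case, so compare lower-cased text.
    return search_box.lower() in joined.lower()


# Hypothetical rows standing in for columns A to D.
rows = [
    ("Scooter", "Model A", 2018, "Sold"),
    ("Car", "Toyota", 2015, "Purchased"),
    ("scooter", "Model B", 2020, "Sold"),
]

search_box = "Scooter"  # the value typed into G1
highlighted = [i for i, row in enumerate(rows) if row_matches(search_box, row)]
print(highlighted)
```

One consequence of joining the cells without a separator is that a search term can match across a cell boundary: "rTo" matches the second row because "Car" and "Toyota" become "CarToyota".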

We can even use filters to perform a search as we did above, but then we would need to apply multiple filters to look for multiple things. Moreover, the example that we saw here had limited data. There may be cases when the amount of data in a sheet is huge. A Search Box can help us in all such situations as it creates a search criterion for the entire sheet.

Example #2

You can also create a search box in Excel with the FILTER function (this is a worksheet function, not the Filter feature applied to a column), an easy and efficient way to extract data based on criteria. The formula used in this example is:

=FILTER(B3:D12, C3:C12=G2, "NO MATCH FOUND")

Here’s the role of each part of the formula:

B3:D12: This is the range of values that you want to filter.

C3:C12=G2: This is the criteria that you want to use for filtering. Adjust it based on your specific criteria. This example compares the values in the range C3:C12 with the value in cell G2 (the value entered in the search box).

“NO MATCH FOUND”: This value will get displayed in the result box if no entries meet the filtering criteria. You can customize it to your preference.

With the FILTER function, you can easily create a search box in Excel that dynamically filters data based on your criteria, making it a powerful tool for data analysis and manipulation.
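The three parts of the FILTER formula map onto a short Python sketch: the rows to filter, a list of True/False flags (the criteria), and a fallback value. The helper excel_filter and the sample rows are assumptions for illustration only.

```python
def excel_filter(rows, include, if_empty):
    """Rough model of =FILTER(array, include, if_empty)."""
    # Keep only the rows whose matching flag is True.
    kept = [row for row, keep in zip(rows, include) if keep]
    # Fall back to the if_empty value when nothing matches.
    return kept if kept else if_empty


# Hypothetical rows standing in for B3:D12 (the middle column plays column C).
data = [
    ("Model A", "Scooter", 2018),
    ("Model B", "Car", 2015),
    ("Model C", "Scooter", 2020),
]

search_box = "Scooter"  # the value typed into G2
# Equivalent of C3:C12=G2. Note that Python's == is case-sensitive,
# whereas Excel's = comparison on text ignores case.
include = [row[1] == search_box for row in data]

result = excel_filter(data, include, "NO MATCH FOUND")
no_result = excel_filter(data, [row[1] == "Truck" for row in data], "NO MATCH FOUND")
print(result)
print(no_result)
```

Changing the value of search_box and recomputing the flags is the counterpart of typing a new term into the search box cell.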

Things to Remember

Make sure that you enter the formula correctly in the conditional formatting window.

Use the $ sign as shown in Example 1 so that the reference to the search box cell stays fixed and the column references do not shift.

The & sign is useful for adding more columns to the formula. Do not leave an & sign at the end of the formula.

Though both a Search Box and Filters are useful for fetching outputs based on various conditions, we should not use them interchangeably, as they serve different purposes in different ways. A Search Box can also be used to enhance a filter.

Frequently Asked Questions (FAQs)

Q1. Where is the search bar on Excel?

Q2. Why is my search box not working in Excel?

Answer: There could be many reasons why the Microsoft Excel search box or “Find” tool isn’t working. Some possible explanations and solutions are as follows:

No text or value to search: Check that you have entered the correct search text or value in the search box’s “Find what” field. Excel may be unable to find matches if the search text is blank or contains a typo.

Active cell outside the search range: Excel looks for text or values within the current worksheet or range. Make certain that the active cell is within the search range. Excel may be unable to find matches if the active cell is outside the search range.

Incorrect search options: Excel’s “Find” tool provides several search options, including match case, search direction, and search by rows or columns. Check that you’ve selected the appropriate options based on your search criteria. Excel may be unable to find matches if the search options are not properly configured.

Protected worksheet: If the worksheet or workbook is password-protected or has restricted permissions, the “Find” tool may not function as expected. In such cases, you may need to unprotect the worksheet or workbook before using the “Find” tool.

Excel version or installation issues: In some cases, problems with Excel itself, such as software bugs or installation errors, can interfere with the “Find” tool’s functionality. In such situations, you may need to update or reinstall Excel or contact Microsoft or your IT department for assistance.

If you’ve checked all the above options and the search box still doesn’t work in Excel, it’s best to consult the Excel documentation or Help feature, or contact Microsoft support or your IT department for further troubleshooting.

Q3. What are the functions of a search bar?

Answer: A search bar is a tool that allows users to find specific content within a dataset. It can offer filtering and auto-suggestions, and can function as a navigation tool. It may also provide error handling, a search history, and personalized recommendations. Finally, it displays the search results for users to browse and select from.


How To Install Windows 7: Clean Installation Tutorial (Step

The best way to start fresh and avoid many possible issues is by installing Windows 7 in a new, empty partition. This is possible when you are using a brand-new hard drive, or when you wipe out the partition that contains any version of Windows.


Tip: If you’re installing to an older machine, make sure all USB devices are disconnected. You may even need to disconnect internal USB connections. There have been reports of stalled installations when USB devices are connected.

Process to install Windows 7

Now diving into the Windows 7 installation process, the safest way is by starting a clean installation using the Windows 7 DVD or a USB bootable drive.


If you are using a Windows 7 DVD or USB bootable drive, make sure that your computer’s BIOS is set to boot first from any of these installation media, because if your hard drive is set to boot first you’ll never get to the installation wizard. To change the boot order you should check your computer or motherboard manual, as there are many different ways to do this.

If you have the Windows 7 ISO file, or your computer doesn’t have a DVD drive, you can also create a bootable USB drive using the Windows 7 USB DVD Download Tool.

Troubleshooting Tips:

If the setup is not detecting your hard drive, there could be a variety of reasons. The most common are: the hard drive isn’t functional, it is disconnected, or, on an older system, the setup doesn’t recognize your disk controller. In that case, you will be prompted to supply the drivers. For Windows 7 32-bit (x86) you should be able to use Windows Vista or Windows 7 drivers stored on a USB drive, a floppy disk, or a CD or DVD. For the last option, you’ll have to remove the Windows 7 disc from the computer, insert the disc with the drivers and load them, then remove it and insert the Windows 7 DVD once again.

The setup will start installing Windows by copying and expanding files, installing features and updates (if available), and finally completing the installation. This could take a while, so for now just sit back and relax. Windows may restart several times during the installation. When this happens, do not press any key if you see the message “Press any key to continue”; leave Windows alone and let it continue with the installation process.

6.  Set a password for your user account in Windows 7. You can skip creating a password, but it is strongly recommended to have one. Also, don’t forget to choose a password hint: a word or phrase that will help you remember your account password if you forget it, but that is difficult for other people to figure out. Do not type your actual password here!

7.  Next, type your product key: You can enter the product key (serial number) that was included with the purchase of this copy, or you can skip this portion of the setup, but remember that you only have 30 days to supply the product key. Keep in mind that if you install a version of Windows that does not correspond to the product key, you’ll have to start the installation process from scratch, so pay attention.

8.  Select Automatic Update settings for Windows 7: If you are a regular user, the first option is the one you want to choose.

10.  Set up your network: If this is a clean installation of Windows 7 on a laptop that has a wireless network adapter enabled, you might be asked to enter a security passphrase to connect to the network and/or internet.

If you have chosen the Home Network option, the following dialog box will let you create or join a homegroup, but this is something that we’ll talk about in a later Windows How-to.

Now your computer has a clean installation of Windows 7. Log in for the first time and you are done!

If you made it this far, going through the three-part tutorial on the Windows pre-installation process, you may realize that the actual process of installing Windows 7 is not too difficult. The preparation before you start is what takes time, but it makes the process go smoothly and pays off in the end, because you are unlikely to run into many problems.

Essential Photoshop Preferences For Beginners

Learn how to improve Photoshop’s performance, customize the interface, save backups of your work, and more with the important options you need to know about in the Photoshop Preferences! Covers both Photoshop CC and CS6.

Written by Steve Patterson.

In this tutorial, we’ll look at some essential Preferences in Photoshop that every beginner should know about. The Preferences are where we find all sorts of options that control Photoshop’s appearance, behavior and performance. There are more options in the Photoshop Preferences than we could possibly cover in one tutorial, but that’s okay because most of the default settings are fine. Here, we’re just going to look at the options that are worthy of your attention right when you first start learning Photoshop. Some of the options allow you to customize Photoshop’s interface. Others will speed up your workflow. And some help to keep Photoshop and your computer running smoothly. There are other important Preferences as well, but we’ll save them for future lessons when it makes more sense to talk about them.

I’ll be using Photoshop CC but this tutorial is also compatible with Photoshop CS6. All but one of the Preferences we’ll look at are available in both versions. As we’ll learn, Photoshop’s Preferences are divided into categories. In some cases, an option will be located in a different category depending on which version of Photoshop you’re using. I’ll point out these differences as we go along.

This is lesson 7 of 8 in Chapter 1 – Getting Started with Photoshop.

Let’s get started!

How To Access The Photoshop Preferences

As I mentioned, Photoshop’s Preferences are divided into various categories. Let’s start with the General category. To access the Preferences, on a Windows PC, go up to the Edit menu in the Menu Bar along the top of the screen. From there, choose Preferences down near the bottom of the list, and then General. On a Mac (which is what I’m using here), go up to the Photoshop menu in the Menu Bar. Choose Preferences, and then choose General:

The Preferences Dialog Box

This opens the Photoshop Preferences dialog box. The categories we can choose from are listed in the column along the left. Options for the currently-selected category appear in the main area in the center. At the moment, the General category is selected. Note that in Photoshop CC, Adobe added several new categories to the Preferences, like Workspace, Tools and History Log. While the categories themselves are only available in Photoshop CC, most of the options within these new categories can be found in other categories in CS6:

The Preferences dialog box in Photoshop CC.

The General Preferences

Export Clipboard

The first option we’ll look at, found in the General preferences, is Export Clipboard. This option can affect the overall performance of your computer. When we copy and paste images or layers in Photoshop, the copied items are placed into Photoshop’s clipboard. The clipboard is the part of your computer’s memory (its RAM) that’s set aside for Photoshop to use. Your computer’s operating system also has its own clipboard (its own section of memory).

When “Export Clipboard” is enabled, any items stored in Photoshop’s clipboard are also exported to your operating system’s clipboard. This allows you to then paste the copied items into a different app, like Adobe Illustrator or InDesign. But Photoshop’s file sizes can be huge. Exporting huge files into your operating system’s memory can cause errors and performance problems.

By default, “Export Clipboard” is enabled (checked). To help keep your computer running smoothly, disable (uncheck) this option. If you do need to move files from Photoshop into another app, it’s better to just save the file in Photoshop. Then, open the saved file in the other program:

Disable “Export Clipboard” to improve performance.

Interface Preferences

Switching from General to the Interface category.

Color Theme

The first option we’ll look at is Color Theme. This option controls the overall color of Photoshop’s interface. In this case, “color” just means different shades of gray. Adobe gives us four different color themes to choose from. Each theme is represented by a swatch. The default color theme is the second swatch from the left:

The Color Theme swatches.

Adobe began using this darker theme in Photoshop CS6. Photoshop CC also uses this darker theme by default. Prior to CS6, the interface was much lighter (photo from Adobe Stock):

The default color theme in Photoshop CC (and CS6). Photo credit: Adobe Stock.

Choosing the lightest color theme.

And here we see that Photoshop’s interface is now much lighter. Adobe’s idea behind the darker theme was that it’s less intrusive, allowing us to focus more easily on our images. Personally, I agree, which is why I stick with the default theme. But some people prefer the lighter interface. Choose the theme you’re most comfortable with. You can change Photoshop’s color theme in the Preferences at any time:

The lightest of the four interface color themes.

Highlight Color (Photoshop CC)

In Photoshop CC, Adobe added a new Highlight Color option to the Interface preferences. This option is not available in CS6. “Highlight Color” refers to the color Photoshop uses to highlight the currently-selected layer in the Layers panel:

The Highlight Color option in the Interface preferences.

By default, the highlight color is a shade of gray which matches the overall color theme. Here, we see Photoshop’s Layers panel with the Background layer highlighted in the default gray. We’ll be learning all about layers in our Photoshop Layers section:

The Layers panel showing the gray highlight color.

The other highlight color we can choose is blue:

Changing the highlight color to blue.

And now, we see that my Background layer is highlighted in blue. I prefer the default gray because again, it’s less intrusive. Like the color theme, you can change the highlight color, along with any of Photoshop’s Preferences, at any time:

The Layers panel after changing the highlight color to blue.

UI Font Size

Another option worth looking at in the Interface preferences is UI Font Size. This option is available in both CC and CS6. “UI Font Size” controls the size of the text in Photoshop’s interface (“UI” stands for “User Interface”). Adobe sets the default font size to Small:

The UI Font Size option.

If you have trouble reading small print, you can increase the size. To make the text bigger, choose either Medium or Large. There’s also a Tiny option if you hate your eyes and want them to suffer. Personally, I set “UI Font Size” to Large to help minimize eye strain during long hours at the computer:

Changing the UI Font Size from Small to Large.

You’ll need to close and restart Photoshop for the change to take effect. For comparison, let’s look again at my Layers panel. On the left, we see the Layers panel using the default text size (Small). On the right is the same panel after changing the size to Large (and restarting Photoshop):

The default UI font size (left) and the Large size (right).

Tools Preferences (Photoshop CC)

Switching from Interface to the Tools category (in Photoshop CC).

Show Tool Tips

The first option to look at in the Tools preferences is Show Tool Tips (in CS6, “Show Tool Tips” is found in the Interface category). A “Tool Tip” is a helpful message that pops up when you hover your mouse cursor over a tool or option in Photoshop. Tool Tips offer a short description of what the tool or option is used for:

The “Show Tool Tips” option.

For example, if you hover your mouse cursor over the “Show Tool Tips” option, a Tool Tip will appear in yellow explaining that this option determines whether or not to show Tool Tips:

Tool Tips are great for learning about different options in Photoshop.

And here, we see that when I hover my cursor over a tool icon in Photoshop’s Toolbar, a Tool Tip lets me know which tool I’m selecting:

Tool Tips make it easier to learn the tools in the Toolbar.

Tool Tips are enabled by default. If you’re new to Photoshop, they’re a great way to help you learn. But once you know your way around Photoshop, Tool Tips can start getting in the way. When you feel you no longer need them, simply uncheck “Show Tool Tips” in the Preferences.

Use Shift Key for Tool Switch

Another option in the Tools category in Photoshop CC is Use Shift Key for Tool Switch. In Photoshop CS6, you’ll find it in the General preferences. This option affects how we select Photoshop’s tools when using keyboard shortcuts. By default, “Use Shift Key for Tool Switch” is enabled (checked):

The “Use Shift Key for Tool Switch” option.

Some tools, like the lasso tools, share the same keyboard shortcut.

With “Use Shift Key for Tool Switch” enabled, pressing “L” on your keyboard will select the Lasso Tool. But no matter how many times you press “L”, you will only select the Lasso Tool. To cycle through to the Polygonal or Magnetic Lasso Tool, you need to press and hold your Shift key and press “L”. This is true for any tools in the Toolbar that share the same keyboard shortcut. To save time and avoid needing to press and hold your Shift key, uncheck the “Use Shift Key for Tool Switch” option. With the option turned off, you can cycle through all tools that share the same keyboard shortcut just by pressing the letter itself.

File Handling Preferences

Next, let’s move on to the File Handling preferences. Choose the File Handling category on the left:

Opening the File Handling preferences.

Auto Save

The first option we’ll look at here is Auto Save. Auto Save was first introduced to Photoshop in CS6. This option tells Photoshop to automatically save a backup copy of your work at regular intervals. I can say from experience that Auto Save has saved my you-know-what on several occasions, especially on my aging laptop.

By default, Auto Save is set to back up your work every 10 minutes. That’s usually fine. But depending on how quickly you work, and the reliability of your computer, you may want to shorten the interval from 10 minutes to 5 minutes instead. You can also choose a longer interval if the backups are causing performance issues, but doing so increases the risk of losing your work:

By default, Auto Save will save a backup every 10 minutes.

Recent File List Contains

Another important option in the File Handling preferences is Recent File List Contains. This option determines how many of your previously-opened files Photoshop will keep track of. In Photoshop CC, your recent files appear as thumbnails on the Start screen each time you launch Photoshop. In CS6, you can access your recent files by going up to the File menu in the Menu Bar and choosing Open Recent (this also works in Photoshop CC).

By default, Photoshop will keep track of the last 20 files you worked on. You can increase the value all the way to 100. Or, if you don’t want anyone to know what you’ve been working on, set the value to 0 to disable this option:

The “Recent File List Contains” option.

Performance Preferences

Next, let’s look at some settings that have to do with Photoshop’s performance. Choose the Performance category on the left:

Opening the Performance preferences.

Memory Usage

The Memory Usage option in the Performance category controls how much of your computer’s memory is reserved for Photoshop. Photoshop loves memory and will generally run better the more memory it gets. By default, Adobe reserves 70% of your computer’s memory for Photoshop. If Photoshop is struggling when you’re working on large files, try increasing the memory usage value.

You can increase memory usage all the way to 100%. Keep in mind, though, that if you have other apps open as well, they each require memory. Whenever possible, close all other apps when you’re working in Photoshop. If you do need other apps to be open, try not to increase the memory usage value much beyond 90%. Lower it if you run into problems. You’ll need to restart Photoshop for the change to take effect:

The “Memory Usage” option.

History States

Another option that can directly impact Photoshop’s performance is History States. “History States” refers to the number of steps that Photoshop keeps track of as we work. The more steps it remembers, the more steps we can undo to get back to an earlier state. History states are stored in memory, so too many states can slow Photoshop down.

In Photoshop CS6, the default number of history states was 20. Back then, I recommended increasing the value to 30. In Photoshop CC, Adobe has increased the default value all the way to 50. I wouldn’t recommend increasing it much beyond 50 unless you really need that many undos. If you run into performance problems, try lowering the value. Again, you’ll need to restart Photoshop for the change to take effect:

The “History States” option.

Scratch Disks Preferences (Photoshop CC)

There’s one more performance option to look at. In Photoshop CC, choose the Scratch Disks category on the left. In Photoshop CS6, stay in the Performance category:

Choosing the “Scratch Disks” category in Photoshop CC.

Scratch Disks

A scratch disk is a section of your computer’s hard drive that Photoshop uses as additional memory if it runs out of system memory. As long as your computer has enough memory, Photoshop won’t need to use the scratch disk. If it does need the scratch disk, it will use whatever hard drive(s) you’ve selected in the Scratch Disks option.

The main hard drive in your computer is known as the Startup disk. This may be the only hard drive you have. If that’s the case, it will be selected by default and there’s really nothing more you need to do. But if you have two or more hard drives, choose a drive that is not your Startup disk. Your operating system uses your Startup disk a lot, so you’ll get better performance from Photoshop by choosing a different drive. Also, if you happen to know the speed of your hard drives, again you’ll get better performance by choosing the fastest drive.

Use SSDs For Best Performance

Lastly, if one of the hard drives in your computer is an SSD (Solid State Drive), choose the SSD as your scratch disk. SSDs are much faster than traditional hard drives and can greatly improve performance. Even if your SSD is also your Startup disk, it’s still the best choice. In my case, my Startup disk is an SSD so I’ve selected it as my primary scratch disk. I also have a fast secondary drive as a backup scratch disk. However, as I mentioned, Photoshop will only use your scratch disk if it runs out of system memory. If Photoshop is routinely running out of system memory, adding more memory (RAM) to your computer will give you the best results:

Select the drive(s) you want Photoshop to use if it runs out of system memory.

Closing The Preferences Dialog Box

Where to go next…

And there we have it! While knowing how to customize your Preferences is important, so is knowing how to restore them to their defaults. The most common cause of sudden performance issues with Photoshop is a corrupted Preferences file. In the next lesson in this chapter, we’ll learn how to troubleshoot Photoshop by resetting the Preferences.

Or check out any of the other lessons in this chapter:

For more chapters and for our latest tutorials, visit our Photoshop Basics section!

Gradient Boosting Algorithm: A Complete Guide For Beginners

This article was published as a part of the Data Science Blogathon


In this article, I am going to discuss the mathematical intuition behind the gradient boosting algorithm, more popularly known as the Gradient Boosting Machine or GBM. It is a boosting method, and I have talked more about boosting in this article.

Gradient boosting stands out for its prediction speed and accuracy, particularly with large and complex datasets. From Kaggle competitions to machine learning solutions for business, this algorithm has produced some of the best results. Errors play a major role in any machine learning algorithm, and there are mainly two types: bias error and variance error. Gradient boosting helps us minimize the bias error of the model.

Before getting into the details of this algorithm, we need some knowledge of the AdaBoost algorithm, which is also a boosting method. AdaBoost starts by assigning equal weights to all the data points and building a decision stump. It then increases the weights of the points that are misclassified and lowers the weights of those that are correctly classified. A new decision stump is built on these weighted data points, the idea being to improve on the predictions made by the first stump. I have talked more about this algorithm here. Read that article first to get a better understanding.

The main difference between these two algorithms is that gradient boosting has a fixed base estimator, i.e., decision trees, whereas in AdaBoost we can change the base estimator according to our needs.

Table of Contents

What is Boosting technique?

Gradient Boosting Algorithm

Gradient Boosting Regressor

Example of gradient boosting

Gradient Boosting Classifier

Implementation using Scikit-learn

Parameter Tuning in Gradient Boosting (GBM) in Python

End Notes

About the Author

What is boosting?

While studying machine learning you must have come across the term boosting. It is one of the most misinterpreted terms in the field of data science. The principle behind boosting algorithms is that we first build a model on the training dataset, and then build a second model to rectify the errors of the first. Let me try to explain what exactly this means and how it works.

Suppose you have n data points and 2 output classes (0 and 1). You want to create a model to predict the class of the test data. We randomly select observations from the training dataset and feed them to model 1 (M1). We also assume that initially all the observations have an equal weight, which means an equal probability of being selected.

Remember that in ensembling techniques the weak learners combine to make a strong model, so here M1, M2, M3, …, Mn are all weak learners.

Since M1 is a weak learner, it will surely misclassify some of the observations. Before feeding the observations to M2, we update the weights of the observations that were wrongly classified. You can think of it as a bag that initially contains 10 balls of different colors; after some time a kid takes out a ball of his favorite color and puts 4 red balls into the bag instead. Now, of course, the probability of selecting a red ball is higher. The same thing happens in boosting: when an observation is wrongly classified its weight is increased, and when it is correctly classified its weight is decreased. The probability of selecting a wrongly classified observation increases, so the next model is trained mostly on the observations that model 1 misclassified.

The same happens with M2: the weights of the wrongly classified observations are updated again and the data is fed to M3. This procedure continues until the errors are minimized and the dataset is predicted correctly. When a new data point (test data) comes in, it passes through all the models (weak learners), and the class that gets the most votes is the output for our test data.
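The reweighting idea described above can be sketched in a few lines of Python. This is only an illustration: the doubling factor and the misclassification flags are assumptions chosen for the example, not the exact update rule of any particular boosting algorithm.

```python
# Illustrative sketch: raise the sampling weight of misclassified observations.
# The factor of 2.0 and the `misclassified` flags are made-up values.

def reweight(weights, misclassified, factor=2.0):
    """Scale up the weights of misclassified points, then renormalise to sum to 1."""
    scaled = [w * factor if wrong else w for w, wrong in zip(weights, misclassified)]
    total = sum(scaled)
    return [w / total for w in scaled]

n = 5
weights = [1.0 / n] * n                            # every observation starts equally likely
misclassified = [False, True, False, False, True]  # hypothetical result of model M1

new_weights = reweight(weights, misclassified)
print(new_weights)
```

After renormalising, the misclassified observations carry more weight and the correctly classified ones carry less, so the next model is more likely to see the hard cases.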

What is a Gradient boosting Algorithm?

The main idea behind this algorithm is to build models sequentially and these subsequent models try to reduce the errors of the previous model. But how do we do that? How do we reduce the error? This is done by building a new model on the errors or residuals of the previous model.

When the target column is continuous, we use the Gradient Boosting Regressor, whereas for a classification problem we use the Gradient Boosting Classifier. The only difference between the two is the loss function. The objective is to minimize this loss function by adding weak learners using gradient descent. Because the method is built on a loss function, regression problems use losses such as mean squared error (MSE), while classification problems use different ones, e.g. log-likelihood.

Understand Gradient Boosting Algorithm with example

Let’s understand the intuition behind Gradient boosting with the help of an example. Here our target column is continuous hence we will use Gradient Boosting Regressor.

Following is a sample from a random dataset where we have to predict the car price based on various features. The target column is price and other features are independent features.

Image Source: Author

Step-1 The first step in gradient boosting is to build a base model to predict the observations in the training dataset. For simplicity, we take the average of the target column and assume that to be the predicted value, as shown below:

Image Source: Author

Why did I say we take the average of the target column? Well, there is math involved behind this. Mathematically the first step can be written as:
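In the standard formulation of gradient boosting, which matches the symbols explained below, this initialisation step reads:

```latex
F_0(x) = \underset{\gamma}{\arg\min} \sum_{i=1}^{n} L(y_i, \gamma)
```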

Looking at this may give you a headache, but don’t worry we will try to understand what is written here.

Here L is our loss function

Gamma is our predicted value

argmin means we have to find a predicted value/gamma for which the loss function is minimum.

Since the target column is continuous our loss function will be:
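A common choice for a continuous target, assumed for the rest of this example, is the squared error for each observation. The factor of 1/2 is a convention that only simplifies the derivative:

```latex
L(y_i, \gamma) = \frac{1}{2} (y_i - \gamma)^2
```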

Here yi is the observed value

And gamma is the predicted value

Now we need to find the value of gamma for which this loss function is minimum. We all studied how to find minima and maxima in 12th grade: we differentiate the function and set the derivative equal to 0. We will do the same here.

Let’s see how to do this with the help of our example. Remember that y_i is our observed value and gamma is our predicted value. Plugging the values into the above formula, we get:
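In general form, assuming a squared-error loss with the conventional factor of 1/2, differentiating the total loss with respect to gamma and setting the result to zero gives:

```latex
\frac{d}{d\gamma} \sum_{i=1}^{n} \frac{1}{2} (y_i - \gamma)^2
  = -\sum_{i=1}^{n} (y_i - \gamma) = 0
\quad\Longrightarrow\quad
\gamma = \frac{1}{n} \sum_{i=1}^{n} y_i
```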

We end up with the average of the observed car prices, and this is why I asked you to take the average of the target column and assume it to be your first prediction.

Hence for gamma=14500, the loss function will be minimum so this value will become our prediction for the base model.

Step-2 The next step is to calculate the pseudo residuals which are (observed value – predicted value)

Image Source: Author
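In the usual notation (an assumption about the exact symbols used), the pseudo residual of observation i for tree m is the negative gradient of the loss, evaluated at the previous model's prediction:

```latex
r_{im} = -\left[ \frac{\partial L\big(y_i, F(x_i)\big)}{\partial F(x_i)} \right]_{F(x) = F_{m-1}(x)}
\qquad \text{for } i = 1, \dots, n
```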

Here F(xi) is the prediction of the previous model and m is the number of decision trees (DT) built so far.

We are just taking the derivative of loss function w.r.t the predicted value and we have already calculated this derivative:
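Assuming the squared-error loss with the conventional factor of 1/2, that derivative for a single observation is:

```latex
\frac{\partial L(y_i, \gamma)}{\partial \gamma}
  = \frac{\partial}{\partial \gamma} \, \frac{1}{2} (y_i - \gamma)^2
  = -(y_i - \gamma)
```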

If you see the formula of residuals above, we see that the derivative of the loss function is multiplied by a negative sign, so now we get:

The predicted value here is the prediction made by the previous model. In our example the prediction made by the previous model (initial base model prediction) is 14500, to calculate the residuals our formula becomes:
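Written out, and again assuming the squared-error loss with the factor of 1/2, the two negative signs cancel and the residual for the first tree is simply observed minus predicted:

```latex
r_{i1} = -\big( -(y_i - F_0(x_i)) \big) = y_i - F_0(x_i) = y_i - 14500
```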

Step-3 In the next step, we will build a model on these pseudo residuals and make predictions. Why do we do this? Because we want to minimize these residuals, and minimizing the residuals will eventually improve our model’s accuracy and prediction power. So, using the residuals as the target and the original features (cylinder number, cylinder height, and engine location), we will generate new predictions. Note that the predictions in this case will be error values, not predicted car prices, since our target column is now the error.

Let’s say hm(x) is our DT made on these residuals.

Step-4 In this step we find the output values for each leaf of our decision tree. There might be a case where one leaf gets more than one residual, so we need to find the final output of every leaf. To find the output we can simply take the average of all the numbers in a leaf, whether there is only one number or more than one.

Let’s see why we take the average of all the numbers. Mathematically this step can be represented as:
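In the notation explained just below (the symbols are the standard ones and are an assumption about the original figure), the output value for tree m is chosen as follows. For a particular leaf, the sum runs only over the observations that fall into it:

```latex
\gamma_m = \underset{\gamma}{\arg\min} \sum_{i=1}^{n} L\big(y_i, \; F_{m-1}(x_i) + \gamma \, h_m(x_i)\big)
```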

Here hm(xi) is the DT made on residuals and m is the number of DT. When m=1 we are talking about the 1st DT and when it is “M” we are talking about the last DT.

The output value for the leaf is the value of gamma that minimizes the loss function. The left-hand side “gamma” is the output value of a particular leaf. On the right-hand side, [Fm-1(xi) + γhm(xi)] is similar to step 1, but the difference is that here we are using the previous predictions, whereas earlier there was no previous prediction.

Image Source

We see that the 1st residual goes into R1,1, the 2nd and 3rd residuals go into R2,1, and the 4th residual goes into R3,1.

Let’s calculate the output for the first leaf, that is R1,1:
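For a leaf that holds a single residual, and assuming the squared-error loss with the factor of 1/2, the quantity to minimise can be written as below. Here r_1 = y_1 − 14500 denotes the first residual:

```latex
L = \frac{1}{2} \big( y_1 - (F_0(x_1) + \gamma) \big)^2
  = \frac{1}{2} \big( y_1 - 14500 - \gamma \big)^2
  = \frac{1}{2} (r_1 - \gamma)^2
```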

Now we need to find the value for gamma for which this function is minimum. So we find the derivative of this equation w.r.t gamma and put it equal to 0.
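With r_1 denoting the single residual in the leaf (and the squared-error loss with the factor of 1/2 assumed), the derivative works out as follows, so the leaf's output is just that residual:

```latex
\frac{d}{d\gamma} \, \frac{1}{2} (r_1 - \gamma)^2 = -(r_1 - \gamma) = 0
\quad\Longrightarrow\quad
\gamma = r_1
```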

Hence the leaf R1,1 has an output value of -2500. Now let’s solve for the R2,1

Let’s take the derivative to get the minimum value of gamma for which this function is minimum:
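With r_2 and r_3 denoting the two residuals in leaf R2,1 (their numerical values are left symbolic here), and assuming the same squared-error loss:

```latex
\frac{d}{d\gamma} \left[ \frac{1}{2} (r_2 - \gamma)^2 + \frac{1}{2} (r_3 - \gamma)^2 \right]
  = -(r_2 - \gamma) - (r_3 - \gamma) = 0
\quad\Longrightarrow\quad
\gamma = \frac{r_2 + r_3}{2}
```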

We end up with the average of the residuals in the leaf R2,1 . Hence if we get any leaf with more than 1 residual, we can simply find the average of that leaf and that will be our final output.

Now after calculating the output of all the leaves, we get:

Image Source: Author

Step-5 This is finally the last step where we have to update the predictions of the previous model. It can be updated as:
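Using the symbols explained below (F for the model, nu for the learning rate, h for the tree fitted on the residuals), the standard update rule is:

```latex
F_m(x) = F_{m-1}(x) + \nu \, h_m(x)
```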

where m is the number of decision trees made.

Since we have just started building our model, m=1. To make a new DT, our new predictions will be:

Image Source: Author

Here Fm-1(x) is the prediction of the previous model. Since m=1, Fm-1 is F0, which is our base model, so the previous prediction is 14500.

nu is the learning rate, which is usually chosen between 0 and 1. It reduces the effect each tree has on the final prediction, and this improves accuracy in the long run. Let’s take nu=0.1 in this example.

hm(x) is the most recent DT built on the residuals.

Let’s calculate the new prediction now:

Image Source: Author

Suppose we want to find a prediction of our first data point which has a car height of 48.8. This data point will go through this decision tree and the output it gets will be multiplied with the learning rate and then added to the previous prediction.
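As a quick check of the arithmetic, the update can be computed using only numbers quoted in this example: the base prediction of 14500, the learning rate of 0.1, and the output of −2500 for leaf R1,1, which is the leaf holding the first residual.

```python
# Update F1(x) = F0(x) + nu * h1(x) for the first data point.
base_prediction = 14500   # F0: average of the target column
learning_rate = 0.1       # nu
leaf_output = -2500       # h1(x): output of leaf R1,1, where the first residual lands

new_prediction = base_prediction + learning_rate * leaf_output
print(round(new_prediction, 2))  # 14250.0
```

The prediction moves only a tenth of the way toward the observed value, which is exactly the damping effect the learning rate is meant to have.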

Now let’s say m=2 which means we have built 2 decision trees and now we want to have new predictions.

This time we will add the previous prediction that is F1(x) to the new DT made on residuals. We will iterate through these steps again and again till the loss is negligible.

Image Source: Author

If a new data point, say with height = 1.40, comes in, it will go through all the trees and then give the prediction. Here we have only 2 trees, so the data point will go through these 2 trees and the final output will be F2(x).
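The five steps can be put together in a short, self-contained sketch. It is a minimal illustration rather than a production implementation: the weak learner is a one-feature decision stump, the loss is squared error, and the toy data is made up (chosen so that its mean is 14500).

```python
# Minimal gradient boosting for regression with squared-error loss.
# Weak learner: a one-feature decision stump (an assumption made to keep this short).

def fit_stump(xs, residuals):
    """Find the threshold on x that minimises the squared error of the residuals."""
    best = None
    values = sorted(set(xs))
    for lo, hi in zip(values, values[1:]):
        threshold = (lo + hi) / 2
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    # Leaf outputs are the averages of the residuals in each leaf (Step 4).
    return lambda x: left_mean if x <= threshold else right_mean

def fit_gbm(xs, ys, n_trees=50, learning_rate=0.1):
    base = sum(ys) / len(ys)                  # Step 1: F0 is the mean of the target
    trees = []
    predictions = [base] * len(ys)
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, predictions)]   # Step 2
        tree = fit_stump(xs, residuals)                         # Steps 3 and 4
        trees.append(tree)
        predictions = [p + learning_rate * tree(x)              # Step 5
                       for x, p in zip(xs, predictions)]
    return lambda x: base + learning_rate * sum(t(x) for t in trees)

# Toy data (made up for illustration): one feature and a continuous target.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [12000.0, 13000.0, 13500.0, 15000.0, 16500.0, 17000.0]

model = fit_gbm(xs, ys)
mean_y = sum(ys) / len(ys)
mse_base = sum((y - mean_y) ** 2 for y in ys) / len(ys)
mse_model = sum((y - model(x)) ** 2 for x, y in zip(xs, ys)) / len(ys)
print(mse_base, mse_model)
```

Comparing the two printed values shows how far the boosted model's training error has dropped below that of the base model, which predicts the mean everywhere.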

What is Gradient Boosting Classifier?

A gradient boosting classifier is used when the target column is binary. All the steps explained in the Gradient boosting regressor are used here, the only difference is we change the loss function. Earlier we used Mean squared error when the target column was continuous but this time, we will use log-likelihood as our loss function.

Let’s see how this loss function works. To read more about log-likelihood, I recommend going through this article, where I have covered every detail you need to understand it.

The loss function for the classification problem is given below:
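The standard form of this loss, the negative log-likelihood for a binary target, is shown below, with p denoting the predicted probability of class 1:

```latex
L = -\sum_{i=1}^{n} \big[ y_i \log(p) + (1 - y_i) \log(1 - p) \big]
```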

Our first step in the gradient boosting algorithm was to initialize the model with some constant value, there we used the average of the target column but here we’ll use log(odds) to get that constant value. The question comes why log(odds)?

When we differentiate this loss function, we will get a function of log(odds) and then we need to find a value of log(odds) for which the loss function is minimum.

Confused right? Okay let’s see how it works:

Let’s first transform this loss function so that it is a function of log(odds), I’ll tell you later why we did this transformation.
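Using the identities log(p) − log(1 − p) = log(odds) and 1 − p = 1 / (1 + e^{log(odds)}), the loss for a single observation can be rewritten as a sketch of that transformation:

```latex
-\big[ y \log(p) + (1 - y) \log(1 - p) \big]
  = -y \big[ \log(p) - \log(1 - p) \big] - \log(1 - p)
  = -y \cdot \log(\text{odds}) + \log\!\big( 1 + e^{\log(\text{odds})} \big)
```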

Now this is our loss function, and we need to minimize it. For this, we take its derivative w.r.t. log(odds) and set it equal to 0:
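For a single observation, taking the loss as −y·log(odds) + log(1 + e^{log(odds)}), the derivative is:

```latex
\frac{\partial L}{\partial \log(\text{odds})}
  = -y + \frac{e^{\log(\text{odds})}}{1 + e^{\log(\text{odds})}}
  = -y + p
```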

Here y are the observed values

You must be wondering why we transformed the loss function into a function of log(odds). Sometimes it is easier to work with log(odds), and sometimes it is easier to work with the predicted probability “p”.

It is not compulsory to transform the loss function; we did this just to make the calculations easier.

Hence the value of log(odds) that minimizes this loss function will be our first prediction (base model prediction).

Now in the Gradient boosting regressor our next step was to calculate the pseudo residuals where we multiplied the derivative of the loss function with -1. We will do the same but now the loss function is different, and we are dealing with the probability of an outcome now.

After finding the residuals, we can build a decision tree using all the independent variables as features and the residuals as the target.

Once we have our first decision tree, we find the final output of the leaves, because there might be a case where a leaf gets more than one residual. The math behind this step is out of the scope of this article, so I will just state the formula for the output of a leaf:
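The commonly used formula, which comes from a second-order approximation of the loss, is given below. Here r_i is the residual and p_i is the previously predicted probability for observation i, and both sums run over the observations in the leaf:

```latex
\gamma = \frac{\sum_{i \in \text{leaf}} r_i}{\sum_{i \in \text{leaf}} p_i (1 - p_i)}
```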

Finally, we are ready to get new predictions by adding our base model with the new tree we made on residuals.
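The first iteration of the classifier can be sketched in plain Python. Everything numerical here is an assumption for illustration: the toy target, the choice of which observations share a leaf, and the learning rate of 0.1. The leaf output uses the commonly quoted formula (sum of residuals divided by the sum of p·(1 − p)).

```python
import math

# Toy binary target (made up for illustration): 4 positives, 2 negatives.
y = [1, 1, 0, 1, 0, 1]

# Base model: log(odds) of the positive class, converted to a probability.
positives = sum(y)
negatives = len(y) - positives
log_odds = math.log(positives / negatives)
p = math.exp(log_odds) / (1 + math.exp(log_odds))

# Pseudo residuals: observed value minus predicted probability.
residuals = [yi - p for yi in y]

# Output of a leaf: sum of residuals / sum of p * (1 - p).
# Here we pretend the first two observations fell into the same leaf.
leaf = residuals[:2]
leaf_output = sum(leaf) / sum(p * (1 - p) for _ in leaf)

# New log(odds) prediction for observations in that leaf, then back to a probability.
learning_rate = 0.1
new_log_odds = log_odds + learning_rate * leaf_output
new_p = 1 / (1 + math.exp(-new_log_odds))
print(round(p, 4), round(leaf_output, 4), round(new_p, 4))
```

Both observations in the hypothetical leaf are positives, so the update nudges their predicted probability upward from the base value.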

There are a few variations of gradient boosting, and a couple of them are briefly explained in the upcoming article.

Implementation Using scikit-learn

The task here is to classify the income of an individual, when given the required inputs about his personal life.

First, let’s import all required libraries.

# Import all relevant libraries
from sklearn.ensemble import GradientBoostingClassifier
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn import preprocessing
import warnings
warnings.filterwarnings("ignore")

Now let’s read the dataset and look at the columns to understand the information better.

df = pd.read_csv('income_evaluation.csv')
df.head()

I have already done the data preprocessing part, and you can look at the whole code here. My main aim is to show you how to implement this in Python. For training and testing our model, the data has to be divided into train and test sets.

We will also scale the data to lie between 0 and 1.

# Split dataset into test and train data
X_train, X_test, y_train, y_test = train_test_split(
    df.drop('income', axis=1), df['income'], test_size=0.2)

Now let’s go ahead with defining the Gradient Boosting Classifier along with its hyperparameters. Next, we will fit this model on the training data.

# Define Gradient Boosting Classifier with hyperparameters
gbc = GradientBoostingClassifier(n_estimators=500, learning_rate=0.05, random_state=100, max_features=5)

# Fit train data to GBC
gbc.fit(X_train, y_train)

The model has been trained and we can now observe the outputs as well.

Below, you can see the confusion matrix of the model, which gives a report of the number of classifications and misclassifications.

# Confusion matrix will give number of correct and incorrect classifications
print(confusion_matrix(y_test, gbc.predict(X_test)))

# Accuracy of model
print("GBC accuracy is %2.2f" % accuracy_score(y_test, gbc.predict(X_test)))

Let’s check the classification report also:

from sklearn.metrics import classification_report

pred = gbc.predict(X_test)
print(classification_report(y_test, pred))

Parameter Tuning in Gradient Boosting (GBM) in Python

Tuning n_estimators and Learning rate

n_estimators is the number of trees (weak learners) that we want to add to the model. There is no single optimum value for the learning rate; lower values generally work better, provided that we train a sufficient number of trees. A high number of trees can be computationally expensive, which is why I have used relatively few trees here.

from sklearn.model_selection import GridSearchCV

grid = {
    'learning_rate': [0.01, 0.05, 0.1],
    'n_estimators': np.arange(100, 500, 100),
}
gb = GradientBoostingClassifier()
gb_cv = GridSearchCV(gb, grid, cv=4)
gb_cv.fit(X_train, y_train)
print("Best Parameters:", gb_cv.best_params_)
print("Best CV Score:", gb_cv.best_score_)
print("Test Score:", gb_cv.score(X_test, y_test))

We see the accuracy increased from 86 to 89 after tuning n_estimators and learning rate. Also the “true positive” and the “true negative” rate improved.
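For readers who want to try the tuning workflow without the income dataset, here is a small self-contained sketch. The synthetic data from make_classification and the deliberately tiny grid are assumptions made so that it runs quickly; the scores it prints say nothing about the income data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the income data (an assumption for this sketch).
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

grid = {"learning_rate": [0.05, 0.1], "n_estimators": [20, 40]}
search = GridSearchCV(GradientBoostingClassifier(random_state=0), grid, cv=3)
search.fit(X_train, y_train)   # the search must be fitted before reading results

print("Best parameters:", search.best_params_)
print("Cross-validated score:", search.best_score_)
print("Test score:", search.score(X_test, y_test))
```

Note that best_score_ is the mean cross-validated score of the best parameter combination, not a score on the full training set.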

We can also tune the max_depth parameter, which you must have come across in decision trees and random forests.

grid = {'max_depth': [2, 3, 4, 5, 6, 7]}
gb = GradientBoostingClassifier(learning_rate=0.1, n_estimators=400)
gb_cv = GridSearchCV(gb, grid, cv=4)
gb_cv.fit(X_train, y_train)
print("Best Parameters:", gb_cv.best_params_)
print("Best CV Score:", gb_cv.best_score_)
print("Test Score:", gb_cv.score(X_test, y_test))

The accuracy has increased even more when we tuned the parameter “max_depth”.

End Notes

I hope you got an understanding of how the Gradient Boosting algorithm works under the hood. I have tried to show you the math behind it in the easiest way possible.

In the next article, I will explain Extreme Gradient Boosting (XGBoost), which is another technique for combining various models to improve our accuracy score. It is an extension of the gradient boosting algorithm.

About the Author

I am an undergraduate student currently in my last year majoring in Statistics (Bachelors of Statistics) and have a strong interest in the field of data science, machine learning, and artificial intelligence. I enjoy diving into data to discover trends and other valuable insights about the data. I am constantly learning and motivated to try new things.

I am open to collaboration and work.

For any doubt and queries, feel free to contact me on Email

Connect with me on LinkedIn and Twitter

The media shown in this article are not owned by Analytics Vidhya and are used at the Author’s discretion.


A Quick Tutorial On Clustering For Data Science Professionals

This article was published as a part of the Data Science Blogathon.

Welcome to this wide-ranging article on clustering in data science! There’s a lot to unpack so let’s dive straight in.

In this article, we will discuss what clustering is, why it is required, various applications of clustering, a brief overview of the K-Means algorithm, and finally detailed practical implementations of some of these applications.

Table of Contents

What is Clustering?

Why is Clustering required?

Various applications of Clustering

A brief about the K-Means Clustering Algorithm

Practical implementation of Popular Clustering Applications

What is Clustering?

In simple terms, the agenda is to group similar items together into clusters, just like this:

Let’s understand this with an example. Suppose you are on a trip with your friends and all of you decide to hike in the mountains, where you come across a beautiful butterfly that you have never seen before. Further along, you encounter a few more. They are not exactly the same, but similar enough for you to understand that they belong to the same species. You would need a lepidopterist (someone who studies and collects butterflies) to tell you exactly what species they are, but you don’t need an expert to identify a group of similar items. This way of identifying similar objects or items is known as clustering.

Why is Clustering required?

Clustering is an unsupervised task, meaning one in which we are not given any labels or scores to train on.

In the above figure, on the left, each instance is marked with a different marker, which means it is a labeled dataset for which we can use classification algorithms like SVM, Logistic Regression, Decision Trees, or Random Forests. On the right is the same dataset but without labels, so here the story for classification algorithms ends (i.e., we can’t use them here). This is where clustering algorithms come into the picture to save the day! In the above picture it is pretty obvious and quite easy to identify the three clusters with our eyes, but that will not be the case when working with real and complex datasets.

Various applications of Clustering

1. Search engines:

You may be familiar with the image search that Google provides. Such a system first applies a clustering algorithm to all the images available in the database, so that similar images fall under the same cluster. When a user provides a reference image, the system applies the trained clustering model to that image to identify its cluster and then simply returns all the images from this cluster.

2. Customer Segmentation:

We can also cluster our customers based on their purchase history and their activity on our website. This is really important and useful to understand who our customers are and what they require so that our system can adapt to their requirements and suggest products to each respective segment accordingly.

3. Semi-supervised Learning:

When you are working on semi-supervised learning, in which you are only provided with a few labels, you can run a clustering algorithm and propagate the known labels to all instances falling in the same cluster. This technique is really good for increasing the number of labels, after which a supervised learning algorithm can be used and its performance gets better.

4. Anomaly detection:

Any instance that has a low affinity (a measure of how well an instance fits into a particular cluster) is probably an anomaly. For example, if you have clustered the users based on their requests per minute on your website, you can detect users with abnormal behavior. This technique is particularly useful for detecting manufacturing defects or fraud.

5. Image Segmentation:

If you cluster all the pixels according to their colors, then after that we can replace each pixel with the mean color of its cluster, this might be helpful whenever we need to reduce the number of different colors in the image. Image segmentation plays an important part in object detection and tracking systems.

We will look at how to implement this further.

A Brief About the K-Means Clustering Algorithm

Let’s go ahead and take a quick look at what the K-means algorithm really is.

Firstly, let’s generate some blobs for a better understanding of the unlabelled dataset.

import numpy as np
from sklearn.datasets import make_blobs

blob_centers = np.array(
    [[ 0.2, 2.3],
     [-1.5, 2.3],
     [-2.8, 1.8],
     [-2.8, 2.8],
     [-2.8, 1.3]])
blob_std = np.array([0.4, 0.3, 0.1, 0.1, 0.1])
X, y = make_blobs(n_samples=2000, centers=blob_centers, cluster_std=blob_std, random_state=7)

Now let’s plot them

import matplotlib.pyplot as plt

plt.figure(figsize=(8, 4))
plt.scatter(X[:, 0], X[:, 1], c=None, s=1)
plt.savefig("blobs_plot.png")
plt.show()

This is what an unlabeled dataset looks like; here we can clearly see that there are five blobs of instances. K-Means is a simple algorithm capable of clustering this kind of dataset efficiently and quickly.

Let’s go ahead and train a K-Means on this dataset. Now, this algorithm will try to find each blob’s center.

from sklearn.cluster import KMeans

k = 5
kmeans = KMeans(n_clusters=k, random_state=101)
y_pred = kmeans.fit_predict(X)

Keep in mind that we need to specify the number of clusters k that the algorithm needs to find. In our example it is pretty straightforward, but in general it won’t be that easy. After training, each instance will have been assigned to one of the five clusters. Remember that here an instance’s label is the index of its cluster; don’t confuse it with class labels in classification.

Let’s take a look at the five centroids the algorithm found:

kmeans.cluster_centers_


These are the centroids for clusters with indexes of 0,1,2,3,4 respectively.

Now you can easily assign new instances, and the model will assign each one to the cluster whose centroid is closest to it.

new = np.array([[0, 2], [3, 2], [-3, 3], [-3, 2.5]])
kmeans.predict(new)
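The assignment rule itself is easy to sketch without scikit-learn. Note the assumption: the centroids below are simply the blob centres used to generate the data, standing in for the ones a fitted model would find, so the cluster indices of a real KMeans run may be ordered differently.

```python
import math

# Assumed centroids: the blob centres used to generate the data. The centroids a
# fitted KMeans model finds will be close to these but not identical, and their
# order (the cluster indices) may differ.
centroids = [(0.2, 2.3), (-1.5, 2.3), (-2.8, 1.8), (-2.8, 2.8), (-2.8, 1.3)]
new_points = [(0, 2), (3, 2), (-3, 3), (-3, 2.5)]

def nearest_centroid(point, centroids):
    """Return the index of the centroid with the smallest Euclidean distance."""
    distances = [math.dist(point, c) for c in centroids]
    return distances.index(min(distances))

labels = [nearest_centroid(p, centroids) for p in new_points]
print(labels)  # [0, 0, 3, 3]
```

The first two points are closest to the centre at (0.2, 2.3) and the last two to the one at (-2.8, 2.8), which is the same nearest-centroid decision that predict makes.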

That is pretty much it for now; we will look at the workings and variants of K-Means in detail in another blog. Stay tuned!

Implementation of Popular Clustering Applications

1. Image Segmentation using clustering

Image segmentation is the task of partitioning an image into multiple segments. For example, in a self-driving car’s object detection system, all the pixels that are part of a traffic signal’s image might be assigned to the “traffic-signal” segment. Today, state-of-the-art models based on CNNs (convolutional neural networks) with complex architectures are used for image processing. But we are going to do something much simpler: color segmentation. We will simply assign pixels to a particular cluster if they have a similar color. This technique might be sufficient for some applications, such as analyzing satellite images to measure the forest coverage in a region.

Let’s go ahead and load the image we are about to work on:

from matplotlib.image import imread

image = imread('lady_bug.png')
image.shape

Now let’s go ahead and reshape the array to get a long list of RGB colors and then cluster them using K-Means:

X = image.reshape(-1, 3)
kmeans = KMeans(n_clusters=8, random_state=101).fit(X)
segmented_img = kmeans.cluster_centers_[kmeans.labels_]
segmented_img = segmented_img.reshape(image.shape)

Now what’s happening here is, for example, it tries to identify a color cluster for all shades of green. After that, for each color, it looks for the mean color of the pixel’s color cluster. What I mean is it will replace all shades of green with a light green color assuming that the mean is light green. At last, it will reshape this long list of colors to the original dimension of the image.
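The flatten, replace, and reshape steps can be seen on a tiny example in plain Python. The 2x2 image, the two cluster centres, and the labels are all made-up values standing in for what K-Means would produce.

```python
# A tiny 2x2 "image" of RGB pixels (made-up values for illustration).
image = [
    [(0.10, 0.80, 0.10), (0.20, 0.70, 0.20)],
    [(0.90, 0.10, 0.10), (0.80, 0.20, 0.10)],
]
height, width = len(image), len(image[0])

# Flatten to a long list of pixels, like image.reshape(-1, 3).
pixels = [pixel for row in image for pixel in row]

# Pretend K-Means with k=2 produced these centres and labels (hypothetical values).
centers = [(0.15, 0.75, 0.15), (0.85, 0.15, 0.10)]
labels = [0, 0, 1, 1]

# Replace every pixel with the centre of its cluster, like cluster_centers_[labels_].
segmented = [centers[label] for label in labels]

# Restore the original height x width layout, like reshape(image.shape).
segmented_img = [segmented[r * width:(r + 1) * width] for r in range(height)]
print(segmented_img)
```

Both greenish pixels end up with the same mean green and both reddish pixels with the same mean red, so the image now contains only two distinct colors.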

Output with a different number of clusters:

2. Data preprocessing using Clustering

Clustering can be an effective approach to dimensionality reduction, as a preprocessing step before a supervised learning algorithm. Let’s take a look at how we can reduce the dimensionality of an MNIST-like digits dataset using clustering and how much performance difference we get after doing this.

The digits dataset that ships with scikit-learn consists of 1797 grayscale (one-channel) 8 x 8 images representing the digits 0 to 9. Let’s start by loading the dataset:

from sklearn.datasets import load_digits

X_digits, y_digits = load_digits(return_X_y=True)

Now let’s split them into training and test set:

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X_digits, y_digits, random_state=42)

Now let’s go ahead and train a logistic regression model and evaluate its performance on the test set:

from sklearn.linear_model import LogisticRegression

log_reg = LogisticRegression()
log_reg.fit(X_train, y_train)

Now let’s evaluate its accuracy on the test set:

log_reg_score = log_reg.score(X_test, y_test)
log_reg_score

Ok so now we have an accuracy of 96.88%. Let’s see if we can do better by using K-Means as a preprocessing step. We will be creating a pipeline that will first cluster the training set into 50 clusters and replace those images with their distances to these 50 clusters, then after that, we will apply the Logistic Regression model:

from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ("kmeans", KMeans(n_clusters=50)),
    ("log_reg", LogisticRegression()),
])
pipeline.fit(X_train, y_train)
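Inside the pipeline, the K-Means step passes each image on as a vector of its distances to the cluster centroids (this is what the transform method of scikit-learn's KMeans returns). The idea can be sketched in plain Python with hypothetical 2-D centroids and samples in place of the 50 centroids in 64-dimensional pixel space:

```python
import math

# Hypothetical centroids and samples in 2-D, chosen so the distances are easy to check.
centroids = [(0.0, 0.0), (3.0, 4.0), (6.0, 0.0)]
samples = [(0.0, 0.0), (3.0, 0.0)]

def to_distance_features(sample, centroids):
    """Represent a sample by its Euclidean distance to every centroid."""
    return [math.dist(sample, c) for c in centroids]

features = [to_distance_features(s, centroids) for s in samples]
print(features)  # [[0.0, 5.0, 6.0], [3.0, 4.0, 3.0]]
```

Each sample is now described by as many numbers as there are centroids, and these distance features are what the logistic regression step is trained on.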

Let’s evaluate this pipeline on test set:

pipeline_score = pipeline.score(X_test, y_test)


Boom! We just increased the accuracy of the model. But here we chose the number of clusters k arbitrarily. Let’s go ahead and apply grid search to find a better value of k:

from sklearn.model_selection import GridSearchCV

param_grid = dict(kmeans__n_clusters=range(2, 100))
grid_clf = GridSearchCV(pipeline, param_grid, cv=3, verbose=2)
grid_clf.fit(X_train, y_train)

Warning: the above step might be time-consuming!

Let’s see the best number of clusters that the grid search found:

grid_clf.best_params_


The accuracy now is:

grid_clf.score(X_test, y_test)

Here we got a significant boost in accuracy compared to earlier on the test set.

End Notes

To sum up, in this article we saw what clustering is, why it is required, various applications of clustering, a brief overview of the K-Means algorithm, and lastly detailed practical implementations of some of these applications. I hope you liked it!

Stay tuned!

Connect with me on LinkedIn

Thank You!

The media shown in this article are not owned by Analytics Vidhya and are used at the Author’s discretion.

