Performing Data Cleaning And Feature Engineering With R
This article was published as a part of the Data Science Blogathon.

Introduction
Feature engineering sounds complicated, but it really is not. So what is feature engineering?
For me, it’s moulding data features (columns) according to one’s needs. Example: Carron told me that he bought 10 hotdogs at 20 bucks each (i.e. 20 bucks for 1 hotdog). But what I really want is the total price of the 10 hotdogs. So I created total price = 200 bucks (10 times 20) in my mind. This total price works for me, and this is exactly what feature engineering is. I turned the given data into something that makes more sense for me and my needs by adding a new feature to it.
Feature Engineering: A field of data science that creates new features, or updates the existing features, of a dataset or datasets so that the data suits our and our team’s purpose. It makes data more informative and readable.
Now, you can’t do everything in your mind, so we use tools like Excel, SQL, Python, R, and whatnot. By the way, you can also use pen and paper.

Let’s go R!
R is a programming language mostly used for data science work. You can use R for everything from adding 1+2 to creating heavy data science algorithms.
If you are new to R, no worries, I got you. Just get a basic overview of R. First we need an IDE (integrated development environment) to work with it. RStudio is a good choice for your PC, but I recommend using a notebook on the Kaggle website. Why? Because Kaggle is a website with lots of datasets, a dedicated working environment for you, tutorials, competitions and more.
Now let’s get our hands dirty:
Go ahead and register on Kaggle.
After signing in, create your profile by adding an image and filling in the essentials.
Download the dataset (through the direct link or the indirect link), choose the file Divvy_Trips_2023_Q1.zip, then extract it.
After that, switch from Python to R in the notebook’s code dropdown.

Dataset
This dataset is from a cycle company that organizes events, each having some time duration. We need to understand why customers are not becoming subscribers.
Let’s see and understand our data:

# First, read the CSV (comma-separated values) file using read.csv(path).
# kaggle_path is the path in Kaggle where you uploaded your dataset.
# Note: do not enter my path; add yours, like kaggle_path <- " file link "
kaggle_path <- "../input/cycle-data/Divvy_Trips_2023_Q1.csv"
# R is case-sensitive, so the variable name must match exactly.
raw_data <- read.csv(kaggle_path)
# Look at the data using head(raw_data), or understand its structure using str().
str(raw_data)
# Or use glimpse(raw_data); glimpse() is from the dplyr package and gives a nice view.
Just by looking at the data, we can tell that start_time and end_time are not stored as dates (they are represented as chr), although they really are dates and times.
The trip duration is numeric data too, but it is also stored as characters.

Converting start_time and end_time into the timestamp
as.Date(column_name): This function converts values into dates. Example:
“2012-09-09” will be converted to the date 2012-09-09. By default as.Date() expects the year-month-day layout; for other layouts, pass a format argument. But what is the difference between chr and dates? The Date class is R’s own way of storing dates, so “2012-09-09” is not a date for R, just characters, whereas as.Date(c(“2012-09-09”)) is a Date object. [Note: internally, R counts dates as the number of days since 01 Jan 1970.]
as.Date() does not work with time; it won’t understand it. For example, as.Date(c(“2012-09-09 00:10:29”)) gives 2012-09-09 only and ignores 00:10:29.
as.POSIXlt(column_name)/as.POSIXct(column_name): These two functions convert both date and time, making a complete timestamp. If you type as.POSIXlt(“2012-09-09 00:10:29”) or as.POSIXct(“2012-09-09 00:10:29”), you will get 2012-09-09 00:10:29 as a timestamp object. Beware that the two functions differ in how they store the value: POSIXlt keeps it as a list of components (year, month, day, hour, and so on), while POSIXct keeps it as the number of seconds since 01 Jan 1970.
Let’s see an example:

# Converting start_time to date-time.
# $ is used to select a column of raw_data, e.g. start_time.
# Notice how I reassign the converted columns to the data columns using <-
tester <- raw_data$start_time   # for demonstrating as.Date()
tester <- as.Date(tester)
raw_data$start_time <- as.POSIXlt(raw_data$start_time)   # for start time
raw_data$end_time <- as.POSIXlt(raw_data$end_time)       # for end time
print("as.POSIXlt()")
str(raw_data$start_time)
print("as.Date()")
str(tester)

Output:

"as.POSIXlt()"
POSIXlt[1:365069], format: "2023-01-01 00:04:37" "2023-01-01 00:08:13" "2023-01-01 00:13:23" ...
"as.Date()"
Date[1:365069], format: "2023-01-01" "2023-01-01" "2023-01-01" "2023-01-01" "2023-01-01" ...

Now let’s check our data’s structure using str(raw_data).

Output:

'data.frame': 365069 obs. of 12 variables:
$ trip_id : int 21742443 21742444 21742445 21742446 21742447 21742448 21742449
$ start_time : POSIXlt, format: "2023-01-01 00:04:37" "2023-01-01 00:08:13" ...
$ end_time : POSIXlt, format: "2023-01-01 00:11:07" "2023-01-01 00:15:34" ...
$ bikeid : int 2167 4386 1524 252 1170 2437 2708 2796 6205 3939 ...
$ tripduration : chr "390.0" "441.0" "829.0" "1,783.0" ...
.... and so on

Converting character into integer and dealing with N/A
as.integer(value): This function converts the value into the integer data type so that we can do calculations. Let’s see our code.

raw_data$tripduration <- as.integer(raw_data$tripduration)   # we'll get a surprise here
str(raw_data$tripduration)

Did you get a surprise warning? Yeah, that was my intention. So what does it mean? It means that your code ran, but some values could not be converted (for example “1,783.0”, which contains a comma) and became NA (N/A). NAs are values that mean nothing to us or to R, so we need to take care of them.
Look there are various ways to deal with them like replacing them with some suitable value ( something that makes sense and does not spoil or play with our data). Doing that is easy but it’s not about the method but it’s about your knowledge + intuition + experience. I will show you only one way here. Perhaps, I will create a separate blog for imputing values.
Now, tripduration means the time taken during a trip. If you look at our data ( str(raw_data) ), you will see two columns, start_time and end_time. If we subtract the trip’s start time from the trip’s end time (end_time - start_time), we should get the trip duration. So let’s check whether we get the same trip durations; if yes, we will replace our tripduration with end_time - start_time.

# 60 is multiplied to convert minutes into seconds.
# This assumes R reports the difference in minutes, as it does here;
# difftime(end, start, units = "secs") would make the unit explicit.
tester <- 60*(raw_data$end_time - raw_data$start_time)

Ignore the “time difference in mins” label in the output, since we explicitly converted the values into seconds.
Woohoo! A perfect match…
Next, assign tester to our tripduration, replacing the old values.

raw_data$tripduration <- tester

Power of Date and Time
Now we can use our start_time (type: POSIXlt) to figure out weekdays (“Monday”, “Tuesday”… and so on), week_number (number of weeks from 1 Jan) and month_number (the number of the month). This is important when you need, say, only January data or weekly grouped data.

install.packages("lubridate")
library(lubridate)
raw_data$weekdays <- weekdays(raw_data$start_time)              # creates the new column weekdays
raw_data$week_number <- lubridate::week(raw_data$start_time)   # lubridate's week function, called with ::
raw_data$month_number <- month(raw_data$start_time)
str(raw_data)
Beautiful! you made it. Great job, we learnt about how to create columns, update them, clean them and convert them. Using this basic stuff you will surely do something remarkable. All the best !!!
Hints and Tips: During analysis and feature engineering we also use the summary or describe function to view statistics of our data. Like:

install.packages("psych")       # first install psych
library(psych)                  # load psych to use the describe function
describe(dataset_dataframe)
summary(dataset_dataframe)

Conclusion and what’s next!
These are just the basics, but our data already looks a lot cleaner and more useful. An analyst will definitely like it. Now I want you to correct one more thing in the data: the birth year column. Observe it; it has a lot of N/A values, so think about replacing them. (Hint: use the mean, median, mode or whatever suits your data!)
The media shown in this article are not owned by Analytics Vidhya and are used at the Author’s discretion.
This article was published as a part of the Data Science Blogathon.
Big Data is a very commonly heard term these days. A reasonably large volume of data that cannot be handled on a small capacity configuration of servers can be called ‘Big Data’ in that particular context. In today’s competitive world, every business organization relies on decision-making based on the outcome of the analyzed data they have on hand. The data pipeline starting from the collection of raw data to the final deployment of machine learning models based on this data goes through the usual steps of cleaning, pre-processing, processing, storage, model building, and analysis. Efficient handling and accuracy depend on resources like software, hardware, technical workforce, and costs. Answering queries requires specific data probing in either static or dynamic mode with consistency, reliability, and availability. When data is large, inadequacy in handling queries due to the size of data and low capacity of machines in terms of speed, memory may prove problematic for the organization. This is where sharding steps in to address the above problems.
This guide explores the basics and various facets of data sharding, the need for sharding, and its pros and cons.

What is Data Sharding?
With the increasing use of IT technologies, data is accumulating at an overwhelmingly fast pace. Companies leverage this big data for data-driven decision-making. However, as the data grows, system performance suffers because queries become very slow if the dataset is stored entirely in a single database. This is why data sharding is required.
Image Source: Author
In simple terms, sharding is the process of dividing and storing a single logical dataset into databases that are distributed across multiple computers. This way, when a query is executed, a few computers in the network may be involved in processing the query, and the system performance is faster. With increased traffic, scaling the databases becomes non-negotiable to cope with the increased demand. Furthermore, several sharding solutions allow for the inclusion of additional computers. Sharding allows a database cluster to grow with the amount of data and traffic received.
Let’s look at some key terms used in the sharding of databases.
Scale-out and Scaling up: The process of creating or removing databases horizontally done to improve performance and increase capacity is called scale-out. Scaling up refers to the practice of adding physical resources to an existing database server, like memory, storage, and CPU, to improve performance.
Sharding: Sharding distributes similarly-formatted large data over several separate databases.
Chunk: A chunk is a subset of sharded data, bounded by lower and upper range values of the shard key.
Shard: A shard is a horizontally distributed portion of data in a database. Data collections with the same partition keys are called logical shards, which are then distributed across separate database nodes.
Sharding Key: A sharding key is a column of the database to be sharded. This key is responsible for partitioning the data. It can be either a single indexed column or multiple columns whose combined value determines how the data is divided between the shards. A primary key can be used as a sharding key. However, a sharding key does not have to be a primary key. The choice of the sharding key depends on the application. For example, userID could be used as a sharding key in banking or social media applications.
Logical shard and Physical Shard: A chunk of the data with the same shard key is called a logical shard. When a single server holds one or more than one logical shard, it is called a physical shard.
Shard replicas: These are the copies of the shard and are allotted to different nodes.
Partition Key: It is a key that defines the pattern of data distribution in the database. Using this key, it is possible to direct the query to the concerned database for retrieving and manipulating the data. Data having the same partition key is stored in the same node.
Replication: It is a process of copying and storing data from a central database at more than one node.
Resharding: It is the process of redistributing the data across shards to adapt to the growing size of data.

Are Sharding and Partitioning the same?
Both sharding and partitioning split a database into smaller datasets. However, they are not the same. Sharding distributes the data across several machines, whereas partitioning groups subsets of data within a single unsharded database. The two words are used interchangeably only when they are qualified by “horizontal” or “vertical”. As a result, “horizontal sharding” and “horizontal partitioning” are interchangeable terms.
Entire columns are split and placed in new, different tables in a vertically partitioned table. The data in one vertical split is different from the data in the others, and each contains distinct rows and columns.
Horizontal sharding or horizontal partitioning divides a table’s rows into multiple tables or partitions. Every partition has the same schema and columns but distinct rows. Here, the data stored in each partition is distinct and independent of the data stored in other partitions.
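The difference between the two kinds of split can be sketched in a few lines of Python. This is only an illustration under assumptions: the table, its column names, and the choice to keep the id column in both vertical splits (so rows can be rejoined) are hypothetical and not taken from the original article.

```python
# Hypothetical table, represented as a list of rows (one dict per row).
table = [
    {"id": 1, "name": "Asha", "country": "IN", "order_value": 120},
    {"id": 2, "name": "Ben", "country": "US", "order_value": 560},
    {"id": 3, "name": "Chen", "country": "CN", "order_value": 80},
    {"id": 4, "name": "Dana", "country": "DE", "order_value": 990},
]

# Vertical partitioning: columns are split into separate tables.
# The id column is kept in both parts (an assumption) so rows can be rejoined.
customers = [{"id": r["id"], "name": r["name"], "country": r["country"]} for r in table]
orders = [{"id": r["id"], "order_value": r["order_value"]} for r in table]

# Horizontal partitioning: every part has the same columns but different rows.
first_half = [r for r in table if r["id"] <= 2]
second_half = [r for r in table if r["id"] > 2]

print(customers[0], orders[0])
print(len(first_half), len(second_half))
```

Placing first_half and second_half on different machines is what turns horizontal partitioning into sharding.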
The image below shows how a table can be partitioned both horizontally and vertically.

The Process
Before sharding a database, it is essential to evaluate the requirements for selecting the type of sharding to be implemented.
At the start, we need to have a clear idea about the data and how the data will be distributed across shards. The answer is crucial as it will directly impact the performance of the sharded database and its maintenance strategy.
Next, the nature of the queries that need to be routed through these shards should also be known. For read-heavy workloads, replication is a better and more cost-effective option than sharding the database. On the other hand, workloads involving write queries, or both read and write queries, would require sharding of the database. The final point to consider is shard maintenance. As the accumulated data increases, it needs to be redistributed, and the number of shards keeps growing over time. Hence, the distribution of data across shards requires a strategy that is planned ahead to keep the sharding process efficient.

Types of Sharding Architectures
Once you have decided to shard the existing database, the following step is to figure out how to achieve it. It is crucial that during query execution or distributing the incoming data to sharded tables/databases, it goes to the proper shard. Otherwise, there is a possibility of losing the data or experiencing noticeably slow searches. In this section, we will look at some commonly used sharding architectures, each of which has a distinct way of distributing data between shards. There are three main types of sharding architectures – Key or Hash-Based, Range Based, and Directory-Based sharding.
To understand these sharding strategies, say there is a company that handles databases for its clients, who sell their products in different countries. The handled database might look like this and can often extend to more than a million rows.
We will take a few rows from the above table to explain each sharding strategy.
So, to store and query these databases efficiently, we need to implement sharding on these databases for low latency, fault tolerance, and reliability.
Key Based Sharding
Key Based Sharding, or Hash-Based Sharding, uses a value from a column (such as a customer ID, customer IP address, or ZIP code) to generate a hash value that decides where each row is stored. The selected table column is the shard key. All row values in the shard key column are passed through the hash function.
This hash function is a mathematical function that takes an input of any size (usually a combination of numbers and strings) and returns a fixed-size output called a hash value. The same input always produces the same hash value. The shard number is derived from the hash value using the chosen algorithm (which depends on the data and application) and the total number of available shards, and it indicates which shard the data should be sent to.
It is important to remember that a shard key needs to be both unique and static, i.e., it should not change over a period of time. Otherwise, it would increase the amount of work required for update operations, thus slowing down performance.
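A minimal sketch of this idea in Python follows. It assumes MD5 as the hash function and a modulo over the number of shards as the mapping; both are common choices but are assumptions here, since the article does not name an algorithm. The customer IDs and the function name shard_for are hypothetical.

```python
import hashlib


def shard_for(key: str, num_shards: int) -> int:
    """Map a shard key to a shard number in the range [0, num_shards)."""
    # MD5 is used only because it is stable across runs and machines
    # (unlike Python's built-in hash() for strings); it is an assumption.
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


customer_ids = ["C001", "C002", "C003", "C004", "C005"]  # hypothetical shard keys
placement = {cid: shard_for(cid, 4) for cid in customer_ids}
print(placement)
```

Because the function is deterministic, any application server can compute the shard for a key on its own, without consulting a lookup table.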
The Key Based Sharding process looks like this:
Image Source: Author
Features of Key Based Sharding are-
It is easier to generate hash keys using algorithms. Hence, it is good at load balancing since data is equally distributed among the available numbers of shards.
As all shards share the same load, it helps to avoid database hotspots (when one shard contains excessive data as compared to the rest of the shards).
Additionally, in this type of sharding, there is no need to have any additional map or table to hold the information of where the data is stored.
However, it is not dynamic sharding, and it can be difficult to add or remove extra servers from a database depending on the application requirement. The adding or removing of servers requires recalculating the hash key. Since the hash key changes due to a change in the number of shards, all the data needs to be remapped and moved to the appropriate shard number. This is a tedious task and often challenging to implement in a production environment.
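The cost of changing the number of shards can be estimated with a small experiment. The sketch below assumes the shard number is computed as an MD5 hash modulo the shard count (an assumption, as the article does not fix an algorithm) and counts how many hypothetical keys land on a different shard when a fifth shard is added to four.

```python
import hashlib


def shard_for(key: str, num_shards: int) -> int:
    """Hash-modulo placement; MD5 and the modulo rule are assumptions."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16) % num_shards


keys = [f"user-{i}" for i in range(10_000)]  # hypothetical shard keys

# A key must be moved if its shard number differs between the two layouts.
moved = sum(1 for k in keys if shard_for(k, 4) != shard_for(k, 5))
fraction_moved = moved / len(keys)
print(f"{fraction_moved:.0%} of keys change shard when going from 4 to 5 shards")
```

With a well-mixed hash, a key keeps its shard only when the hash gives the same remainder for 4 and for 5, so one would expect roughly four out of five keys to move. That is why resizing a hash-sharded database is so disruptive.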
To address the above shortcoming of Key Based Sharding, a ‘Consistent Hashing’ strategy can be used.
In this strategy, hash values are generated both for the data and for the shards: the data hash is based on the shard key value, and the shard hash is based on something that identifies the shard machine, such as its IP address. Both sets of hash values are placed on a ring (a circle, using its full 360 degrees). Each data hash is then paired with the nearest shard hash in one fixed direction, either clockwise or anti-clockwise.
The data is loaded according to this pairing of hash values. Whenever a shard is removed, the values it held are attached to the nearest remaining shard. A similar procedure is adopted when a shard is added. The remapping and reorganization problems of the plain Hash Key strategy are largely avoided in this way, because only the data near the added or removed shard has to move. For example, if Key Based Sharding requires you to shuffle the data on 3 out of 4 shards after a change in the hash function, then ‘consistent hashing’ will require shuffling on fewer shards. Moreover, overloading problems can be handled by adding replicas of the shard.
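The ring can be sketched in Python as a sorted list of shard hashes that is searched with bisect. This is a simplified illustration under assumptions: MD5 as the hash, clockwise pairing, one ring position per shard, and hypothetical IP addresses and order keys. Production systems typically add many virtual positions per shard to even out the load.

```python
import bisect
import hashlib


def ring_hash(value: str) -> int:
    """Position of a key or a shard on the ring (MD5 is an assumption)."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class HashRing:
    def __init__(self, nodes):
        # Sort shards by their position on the ring.
        self._points = sorted((ring_hash(n), n) for n in nodes)
        self._hashes = [h for h, _ in self._points]

    def node_for(self, key: str) -> str:
        # Clockwise pairing: first shard position after the key's position,
        # wrapping around to the first shard at the end of the ring.
        idx = bisect.bisect_right(self._hashes, ring_hash(key)) % len(self._points)
        return self._points[idx][1]


keys = [f"order-{i}" for i in range(10_000)]  # hypothetical shard keys
before = HashRing(["10.0.0.1", "10.0.0.2", "10.0.0.3", "10.0.0.4"])
after = HashRing(["10.0.0.1", "10.0.0.2", "10.0.0.3", "10.0.0.4", "10.0.0.5"])

moved_keys = [k for k in keys if before.node_for(k) != after.node_for(k)]
print(f"{len(moved_keys)} of {len(keys)} keys moved after adding a shard")
```

Every key that moves goes to the newly added shard; keys owned by the other shards stay where they were, which is the property that makes adding and removing shards cheaper than with plain hash-modulo placement.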
Range Based Sharding
Range Based Sharding is the process of sharding data based on value ranges. Using our previous database example, we can make a few distinct shards using the Order value amount as a range (lower value and higher value) and divide customer information according to the price range of their order value, as seen below:
Image source: Author
Features of Range Based Sharding are-
Besides, there is no hashing function involved. Hence, it is possible to easily add more machines or reduce the number of machines. And there is no need to shuffle or reorganize the data.
On the other hand, this type of sharding does not ensure evenly distributed data. It can result in overloading a particular shard, commonly referred to as a database hotspot.
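A range lookup of this kind can be sketched with Python's bisect module. The boundaries, shard names, and order values below are hypothetical, chosen only to show both the routing and how a skewed workload produces a hotspot.

```python
import bisect
from collections import Counter

# Hypothetical exclusive upper bounds of the order-value ranges.
# shard_0: below 100, shard_1: 100 to 499, shard_2: 500 to 999, shard_3: 1000 and above
UPPER_BOUNDS = [100, 500, 1000]
SHARD_NAMES = ["shard_0", "shard_1", "shard_2", "shard_3"]


def shard_for_order_value(order_value: float) -> str:
    """Pick the shard whose range contains the given order value."""
    return SHARD_NAMES[bisect.bisect_right(UPPER_BOUNDS, order_value)]


# A skewed workload: most orders are small, so one shard takes most rows.
order_values = [20, 35, 60, 80, 90, 120, 1500]
load = Counter(shard_for_order_value(v) for v in order_values)
print(load)
```

No hash is involved, so the boundaries can be adjusted or extended directly, but nothing guarantees that the rows are spread evenly.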
Directory Based Sharding
This type of sharding relies on a lookup table (built on a specific shard key) that keeps track of which shard holds which data. It tells us exactly on which shard the queried data is located. This lookup table is maintained separately and does not live on the shards or in the application. The following image demonstrates a simple example of Directory-Based Sharding.
Features of Directory-Based Sharding are –
The directory-Based Sharding strategy is highly adaptable. While Range Based Sharding is limited to providing ranges of values, Key Based Sharding is heavily dependent on a fixed hash function, making it challenging to alter later if application requirements change. Directory-Based Sharding enables us to use any method or technique to allocate data entries to shards, and it is convenient to add or reduce shards dynamically.
The only downside of this type of sharding architecture is that there is a need to connect to the lookup table before each query or write every time, which may increase the latency.
Furthermore, if the lookup table gets corrupted, it can cause a complete failure at that instant, known as a single point of failure. This can be overcome by ensuring the security of the lookup table and creating a backup of the table for such events.
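The lookup-table idea can be sketched in Python with a dictionary standing in for the directory. The shard key (a delivery zone), the shard names, and the helper functions are hypothetical, and in a real system the directory would be a separate, backed-up service rather than an in-process dictionary.

```python
# Hypothetical lookup table: shard key value (delivery zone) -> shard name.
directory = {"Europe": "shard_a", "Asia": "shard_b"}

# In-memory lists stand in for the physical shards.
shards = {"shard_a": [], "shard_b": []}


def write_row(row: dict) -> str:
    """Consult the directory first, then write to the shard it names."""
    shard_name = directory[row["zone"]]
    shards[shard_name].append(row)
    return shard_name


def read_rows(zone: str) -> list:
    """Consult the directory first, then read only from the shard it names."""
    return [r for r in shards[directory[zone]] if r["zone"] == zone]


write_row({"customer": "C001", "zone": "Europe"})
write_row({"customer": "C002", "zone": "Asia"})

# Adding a shard is a directory update; existing rows do not move.
shards["shard_c"] = []
directory["Americas"] = "shard_c"
write_row({"customer": "C003", "zone": "Americas"})

print(read_rows("Americas"))
```

Every read and write passes through directory, which illustrates both the extra lookup latency and why losing the table would make the data unreachable.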
Other than the three main sharding strategies discussed above, there can be many more sharding strategies that are usually a combination of these three.
After this detailed sharding architecture overview, we will now understand the pros and cons of sharding databases.

Benefits of Sharding
Horizontal Scaling: For any non-distributed database on a single server, there will always be a limit to storage and processing capacity. The ability of sharding to extend horizontally makes the arrangement flexible to accommodate larger amounts of data.
Speed: Another reason why a sharded database design is preferred is that it improves query response times. When a query is submitted to a non-sharded database, it may have to search every row in the table before finding the result set you’re looking for. Queries can become prohibitively slow in an application with an unsharded single database. However, by sharding a single table into multiple tables, queries pass through fewer rows, and their results are delivered considerably faster.
Reliability: Sharding can help to improve application reliability by reducing the effect of system failures. If a program or website depends on an unsharded database, a failure might render the entire application inoperable. An outage in a sharded database, on the other hand, is likely to affect only one shard. Even if this leaves certain users unable to access some areas of the program or website, the overall impact would be smaller than an outage of the whole database.

Challenges in Sharding
While sharding a database might facilitate growth and enhance speed, it can also impose certain constraints. We will go through some of them and why they could be strong reasons to avoid sharding entirely.
Increased complexity: Companies usually face a challenge of complexity when designing a sharded database architecture. There is a risk that the sharding operation will result in lost data or damaged tables if done incorrectly. Even if done correctly, shard maintenance and organization are likely to significantly influence the outcome.
Shard Imbalancing: Depending on the sharding architecture, distribution on different shards can get imbalanced due to incoming traffic. This results in remapping or reorganizing the data amongst different shards. Obviously, it is time-consuming and expensive.
Unsharding or restoring the database: Once a database has been sharded, it can be complicated to restore it to its earlier version. Backups of the database produced before it was sharded will not include data written after partitioning. As a result, reconstructing the original unsharded architecture would need either integrating the new partitioned data with the old backups or changing the partitioned database back into a single database, both of which would be undesirable.
Not supported by all databases: Not every database engine natively supports sharding. There are several databases currently available; some popular ones are MySQL, PostgreSQL, Cassandra, MongoDB, HBase, and Redis. Some of them, such as MongoDB and MySQL (through MySQL Cluster), have an auto-sharding feature. For the others, we need to customize the sharding strategy to suit the application requirements.
Now that we have discussed the pros and cons of sharding databases, let us explore situations when one should select sharding.

When should one go for Sharding?
When the application data outgrows the storage capacity and can no longer be stored as a single database, sharding becomes essential.
When the volume of reading/writing exceeds the current database capacity, this results in a higher query response time or timeouts.
When a slow response is experienced while reading the read replicas, it indicates that the network bandwidth required by the application is higher than the available bandwidth.
Apart from the above situations, it is often possible to optimize the database instead. Options include carrying out server upgrades, implementing caches, creating one or more read replicas, setting up remote databases, etc. Sharding should be considered only when these options cannot solve the problem of increased data.

Conclusion
We have covered the fundamentals of sharding in Data Engineering and by now have developed a good understanding of this topic.
With sharding, businesses can add horizontal scalability to the existing databases. However, it comes with a set of challenges that need to be addressed. These include considerable complexity, possible failure points, a requirement for additional resources, and more. Thus, sharding is essential only in certain situations.
I hope you enjoyed reading this guide! In the next part of this guide, we will cover how sharding is implemented step-by-step using MongoDB.

Author Bio
She loves traveling, reading fiction, solving Sudoku puzzles, and participating in coding competitions in her leisure time.
You can follow her on LinkedIn, GitHub, Kaggle, Medium, Twitter.
We are enthralled to bring you a new episode of ‘The DataHour’. In this session, Shashank Mishra, Ex-Amazon Data Engineer, will take you on a learning ride, ‘To Master Data Engineering in 2023’. As we all know, data engineering plays a pivotal role when it comes to managing data. I feel it is an excellent opportunity for freshers keen to build a career in Data Engineering.
It is also for professionals who are already in the field and aspire to work with big brands like Amazon. Shashank will be happy to share his journey with you all: what it feels like to work with one of the biggest brands that everyone aspires to, and what it takes to reach such a stature. I think the most critical aspect is to have your fundamentals right. Along with this, upskill yourself with the latest trends in the industry every now and then.

About The DataHour
Data Engineering has gained so much popularity starting in 2023 and is one of the best career profiles for aspiring Data Professionals in recent times. BigData is a very vast field with multiple job profiles, and that’s why there is a lot of confusion about different job profiles, their role & responsibilities, tech stack and how to decide which one is relevant for you?
In this DataHour, Shashank will share his 5 years of Data Engineering experience and will provide the best of the best roadmap to become a sound Data Engineer in 2023.
Prerequisites: Enthusiasm for learning Data Engineering!
Who is this Webinar for?
Students & Freshers who want to build a career in Data Engineering
Working professionals who wish to transition to a Data Science career
Data science professionals who want to accelerate their career growth
Speaker on The DataHour
Ex-Amazon Data Engineer
Shashank is an experienced Data Engineer with a demonstrated history of working in service and product companies such as Amazon, Paytm and McKinsey. He has solved various data mysteries for different domains like Aviation, Pharmaceutical, FinTech, Telecom and Employee Services. He also has good experience in designing scalable & optimized data pipelines to handle PetaBytes of data, with Batch & Real Time-frequency.
Currently, Shashank is an active contributor to the Data Science and Data Engineering community through his podcasts and YouTube channel (E-Learning Bridge).

Conclusion
I hope you’re excited to attend this DataHour session with us. Shashank is a prolific speaker who brings enriching experience in Data Engineering that you wouldn’t want to miss! So, stay tuned. If you wish to read some articles on Data Engineering, head on to our blog!
Grab this fantastic opportunity by registering here for the DataHour Webinar. If you’re attending this session and have some preliminary questions about this topic, please send them to us at [email protected], or you could ask directly to the speaker during the session.
If you missed our previously conducted ‘The DataHour’ series, head to our YouTube Channel and check out the recordings.

Connect
If you’re facing any difficulty registering or wish to conduct a session with us, get in touch with us at [email protected]
Now, let’s consider it this way. Does being the most popular automatically make you the very best? Not always. I am not trying to say that Harmony OS is better than Android or iOS. But I also cannot assert that either of these two is better than Harmony OS. It is all based on individual preferences. While some may prefer Android, others may prefer iOS.
There is also another group of people who like both Android and iOS. This is because Android has some features that are lacking in iOS and vice versa. Speaking of features, we have a third force that is pushing this category to the very top. Huawei believes that more features will attract more users, and it is doing just that.
Harmony OS comes packed with a lot of features, just like Android and iOS. Some of these features are shared with the top two. However, it also comes with some exclusive features that you may not find in either Android or iOS.

Harmony OS Auto Layout Feature
Huawei’s Harmony OS has a feature called Auto Layout, and it is quite surprising that neither Android nor iOS ever thought of this. Have you ever tried rearranging the icons on your smartphone? I think you have; everyone has tried doing that at least once. If you have, then you are well aware of how annoying it can be.
A lot of smartphone users have very messy-looking home screens. They may not really like the look of their home screens, but there is a common laziness involved in rearranging them. This is where Harmony OS steps in. With the Auto Layout feature in Harmony OS, it takes only a single tap to rearrange the whole home screen.

How Auto Layout Feature in Harmony OS Works
This feature automatically generates combinations of layouts that include icons, large folders and widgets. These combinations make it easier for users to interact with their home screens without distractions. The user will have three different sorting options. They include:
Sort by color
Sort by category.

Images: Original Layout, Sort By Color Layout

The sort by color option is quite interesting, and it will probably be the most popular amongst the three. It arranges the icons based on the most common color on your home screen. With this, icons with similar colors, like red, will all fit into one single layout. It then pushes all the other screen elements with similar colors to the next pages. So, blue colors will be in one layout, orange in another, and so on.

Sort By Category
This option uses AI to arrange icons, large folders and widgets in categories. For example, the system automatically arranges all productivity tools on one page, social media apps on another page, banking and related apps on yet another page, and so on.

Conclusion
Huawei’s Harmony OS keeps growing in both popularity and features every year. There are other exclusive features that you cannot find on other mobile operating systems. Such features include swipe-up widgets, service widgets, super device and many more. The auto layout feature will make things a lot easier for users, and a Harmony OS home screen no longer needs to look disorganized with auto layout on board. I hope Android and iOS adopt such a feature in the near future.
Data visualization is probably the most important and typically the least talked about area of data science.
I say that because how you create data stories and visualization has a huge impact on how your customers look at your work. Ultimately, data science is not only about how complicated and sophisticated your models are. It is about solving problems using data based insights. And in order to implement these solutions, your stakeholders need to understand what you are proposing.
One of the challenges in creating effective visualizations is to create images which speak for themselves. This article will show one way to do so using animated GIF (Graphics Interchange Format) images. This is particularly helpful when you want to show time / flow based stories. Using animation in images, you can plot comparable data over time for a specific set of parameters. In other words, it becomes easy to see and understand the growth of a certain parameter over time.
Let me show this with an example.
Example – GDP vs. Life expectancy over time
Let us say you want to show how GDP and life expectancy have changed for various continents / countries over time. What do you think is the best way to represent this relationship?
You can think of multiple options like:
Creating a 3D plot with GDP, life expectancy and time on 3 axes and drawing lines for each continent / country. The problem is that the human eye is really bad at interpreting 3D visualizations in 2D, especially if there is too much data. So, this option would not work.
Creating 2 plots side by side, one showing GDP over time and the other life expectancy over time. While this is a 2-dimensional plot, we have left a lot for the user to interpret. The person needs to pick a country, see its movement on each plot and then correlate them. Again, I would not ask this of my stakeholders.
Now, let us look at this using an animated plot using .gif file:
The recent development of the gganimate package has made this possible and easier. By the end of this article, you will be able to make your own .gif file and create your own customised frame to compare different parameters on a global or local scale.
Pre-requisites
Please install the following packages:
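The package list itself did not survive in this copy of the article, but the R code later in the tutorial loads plyr, dplyr, ggmap, ggplot2, gganimate and animation. A minimal sketch, assuming those six are the intended set, checks which of them are missing before installing anything:

```r
# Packages loaded later in this tutorial (assumed to be the intended list).
pkgs <- c("plyr", "dplyr", "ggmap", "ggplot2", "gganimate", "animation")

# requireNamespace() returns FALSE for packages that are not installed.
is_installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- pkgs[!is_installed]
print(missing_pkgs)

# Uncomment to install whatever is missing from CRAN:
# if (length(missing_pkgs) > 0) install.packages(missing_pkgs)
```

Note that the `gganimate(p)` call used later in this article belongs to the early, pre-1.0 interface of gganimate. Current CRAN releases use `animate()` together with the `transition_*()` functions instead, so the code in this tutorial may need adapting on a fresh installation.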
In addition to the above libraries in R, you will also need the ImageMagick software on your system. You may download and install it from the ImageMagick website.
Get the Data
The data set contains data for global seismic activity from 1965 to 2023. Please visit the above link and scroll down to get the .csv file.
Earthquake magnitude of 7 points on Richter Scale from 1965-2023
The dataset has been modified so that only seismic events of magnitude 7 or higher on the Richter scale are considered for the study.
Data Manipulation
From the .csv file we have selected only a few parameters for the sake of simplicity.
Type is the type of seismic activity
Depth is the distance of the epicenter from sea level
Magnitude is the reading on the Richter scale
ID is the event ID of the seismic activity
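The selection of these four columns, together with the magnitude cut-off of 7 mentioned earlier, could be sketched as follows. The small data frame is a hypothetical stand-in for eq.csv: its values and the extra `Source` column are invented purely for illustration, and base R is used so the snippet runs without any packages.

```r
# Hypothetical stand-in for eq.csv; values and the Source column are invented.
EQ <- data.frame(
  ID        = c("EQ001", "EQ002", "EQ003", "EQ004"),
  Type      = c("Earthquake", "Earthquake", "Nuclear Explosion", "Earthquake"),
  Depth     = c(35.0, 10.0, 0.0, 600.5),
  Magnitude = c(6.1, 7.0, 5.6, 7.8),
  Source    = c("A", "B", "A", "C"),
  stringsAsFactors = FALSE
)

# Keep only the parameters described above.
keep_cols <- c("ID", "Type", "Depth", "Magnitude")

# Keep only events with magnitude greater than or equal to 7.
EQ_strong <- EQ[EQ$Magnitude >= 7, keep_cols]
print(EQ_strong)
```

With dplyr loaded, the same step could be written as `EQ %>% filter(Magnitude >= 7) %>% select(ID, Type, Depth, Magnitude)`.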
We are all set to start coding in R. I have used the RStudio environment. You are free to use any environment you prefer.
R Codes
## Read the dataset and load the necessary packages
library(plyr)
library(dplyr)
library(ggmap)
library(ggplot2)
library(gganimate)
EQ = read.csv("eq.csv", stringsAsFactors = FALSE)
names(EQ)
## Only select the data with magnitude greater than or equal to 7.
Speed up projection in .gif using animation package
As we can see, the plot covers too many years, from 1965 to 2023. Thus, in order to speed up the visualization, we can use the animation package to fast forward using ani.options():
library(animation)
ani.options(interval = 0.15)
gganimate(p)
Conclusion
This article was an introductory tutorial to the world of animated maps. Readers can try this and apply the same approach in other projects. Some examples are:
The same technique can be used to compare heat maps of weather data across a nation
Flood or other natural disaster in a particular location over a period of time.
Aritra Chatterjee is a professional in the field of Data Science and Operation Management having experience of more than 5 years. He aspires to develop skill in the field of Automation, Data Science and Machine Learning.
This post was received as part of our blogging competition – The Mightiest Pen.
Eraser is an enormous all-in-one toolbox devoted to secure file deletion. This free utility can securely delete files on your command, or according to a schedule.
There are lots of ways to obliterate sensitive data from your drive: blast furnaces, degaussers (magnetic field generators), sledgehammers, and secure-deletion software among them. These tools vary in effectiveness, especially as applied variously to hard drives, solid-state drives, and USB flash drives, and in the subsequent usability of the drive.
For the sake of argument (and a more interesting article), let’s assume you’d like to preserve your drive’s functionality. This rules out violence and degaussing, which, though wonderfully effective and perhaps therapeutic, will render a drive useless. Excluding those options leaves you with a choice between software and software-combined-with-firmware methods.
Free secure-erase utilities
You can easily erase an entire hard drive or SSD by using any of the free utilities listed below. All invoke the secure-erase (sometimes called quick-erase) functions integrated into nearly every ATA/SATA drive produced since 2001. By and large it’s a great feature, but using it on older drives has some potential pitfalls, such as buggy implementations, an out-of-date BIOS, or a drive controller that won’t pass along the commands. You might also need to fiddle with the ATA/IDE/AHCI settings in your BIOS, and in most cases the drive should be mounted internally.
Parted Magic’s DriveErase utility makes it a breeze to perform secure erases on your SSDs and HDDs.
I’ve never had a problem secure-erasing a hard drive, but about a year ago I did brick a Crucial M500 SSD. (A firmware problem was probably responsible for this disaster; Crucial accepted the drive for return but never told me why the hardware had gone belly-up.) An enhanced secure-erase operation overwrites a drive’s housekeeping data as well as its normal user-data areas, but at least one vendor (Kingston) told me that its normal secure-erase routine does both, too. In the bad old days, running a secure-erase on some SSDs sometimes left data behind.
Depending on the controller you use (notably SandForce), a secure-erase can be cryptographic or physical. If a drive is encrypted—and some are by nature—a secure-erase operation simply deletes the encryption keys, and then regenerates them. Without the original keys, the data is useless. A physical erase involves zapping the drive’s magnetic particles or NAND cells back to their default state.
To entirely avoid the danger of erasing the wrong drive in a multiple-drive system, you should power down, disconnect all of the drives except the one to be erased, and then boot from a CD or a flash drive with the utility that does the job. I learned that lesson the hard way.
Parted Magic is free to use, but it now costs $5 to download.
Linux-based boot disc Parted Magic (formerly donationware, now free to use but $5 to download) has many features, including a file manager and a partition manager. It’s handy for recovering data and operating systems, but it also has a link on its desktop to DiskEraser, a simple utility that will erase your drive or invoke the drive’s own secure-erase routine. Parted Magic is basic and lightweight, and it will work with any drive. In fact, several SSD vendors recommend it—though the recommendations date from when it was completely free.
Little, command-line-lovely HDDerase isn’t for inexperienced users: it’s a bit too geeky and can require multiple steps. Another drawback of the app is that it can’t bypass the frozen security state that most modern drives employ to avoid malware erasures. But otherwise it invokes the secure-erase function just fine. It also comes in .ISO form, so you can burn it to disc or create a bootable flash drive from it.
Note that the NSA sponsored HDDerase. Yes, the folks there like to secure as well as monitor data. Not to mention dip their hands into open-source security projects. Interpret that historical nugget as you will.
Hitachi’s Drive Fitness Test analyzes drive health and wipes unwanted data. Other vendors offer similar utilities.
Most drive vendors provide a utility that can run S.M.A.R.T. diagnostics to check drive health, update firmware, and invoke a drive’s secure-erase routine. Odds are you’ll have to sign an agreement accepting that the tool may brick your drive, but hey, that’s life in the big city. A short list of such utilities includes Data Lifeguard (from Western Digital), Drive Fitness Test (from Hitachi), OCZ Toolbox, Samsung Magician (SSD only), and SeaTools (from Seagate).
For hard drives only: Block-overwrite software
Block-overwrite software is more versatile than the secure-erase command because it lets you wipe data from a hard drive while leaving the operating system, program files, and other keepers intact. Unfortunately, this type of software is ineffective on SSDs or USB flash drives, and in many cases it can’t wipe a hard drive’s HPA (Host Protected Area), which contains data about the low-level organization of the drive. That said, with high-powered algorithms and multiple passes, it will effectively render your data unreadable even when subjected to all but the most expensive forensic techniques.
O&O Software’s versatile SafeErase offers full and partial wipes, and it can find and delete common types of sensitive data.
O&O SafeErase 7 ($30, free demo) is a jack-of-all-trades that can remove individual files and folders or erase entire partitions and disks. Like the previously reviewed PrivaZer, SafeErase scans your hard drive for possibly sensitive files, presents them to you for inspection (or you can elect to accept its assessment across the board), deletes them, and then wipes them. SafeErase did a good job of finding sensitive stuff while ignoring what I wanted to save, and it includes options on general types of files to look for.
SafeErase can also wipe free space (erasing the tracks left by deleted files) and your entire computer (all drives, everything), though those options aren’t available in the demo version. But the $30 that O&O charges for those extra features may be money well spent if you want to maintain a clean system. SafeErase is a nicely realized, versatile data-destruction program.
MediaTools Wipe 1.2 ($99, free demo) is all about erasing a lot of hard disks with minimal fuss. It’s designed for professionals who erase in bulk and will dedicate a (rather powerful) PC to the task. MediaTools Wipe 1.2 can handle up to 18 drives at once, all presented in a convenient console view. The program has its own wipe routines, but it can’t invoke a drive’s own secure-erase routines.
MediaTools Wipe lets you lock drives to prevent accidental erasures.
MediaTools Wipe 1.2 has so many handy features (user-definable erase patterns, smart handling of bad blocks, and so on) that I can’t mention them all here. Check out our review of the functionally equivalent version 1.1. You’ll likely dedicate a PC to it, so the $49, single-seat technician’s license will suffice for most situations. However, $500 single-site and $1000 multi-site licenses are available for the corporate crowd.
The handy and free Eraser 6 utility deletes files, folders, and free space on a schedule. It’s just the thing for users who want to maintain a minimal data presence on their PC. You must know what you need to erase, since Eraser 6 doesn’t have automatic selection of sensitive data, as O&O SafeErase and PrivaZer do. But Eraser 6 does have a large array of government-level algorithms to choose from, and it’s super-simple to use.
Active@ KillDisk presents a concise, information-laden view of the drives on your system. A DOS boot disc version is available as well.
Active@ KillDisk is very effective as far as it goes, but most users will be just as well off with the free Eraser 6, or better off by paying less for a program that automatically selects and deletes sensitive data and wipes free space. Then again, if you run Piriform’s CCleaner before KillDisk (or Eraser 6), you’ll have a very effective data-killing combo.
Wiping SSDs and USB Flash drives
USB flash drives are convenient for everybody, including anyone trying to get data off one that isn’t securely erased.
That said, SSDs that support the TRIM command and run under a TRIM-supported environment (Windows 7 and 8, OS X 10.6.8 or better, Linux 2.6.28 or better, plus a modern BIOS and drive controller that pass on the command) should wipe deleted data continually. Note that I said “should.”
Ideally (for security purposes) an SSD’s garbage collection routines, invoked by the TRIM command, would quickly erase the NAND blocks formerly occupied by your file. The whole reason for TRIM is that NAND must be erased before being rewritten. If a drive runs out of clean, unwritten blocks and must erase previously used blocks immediately prior to writing to them, performance suffers drastically.
Unfortunately, from what I could glean from data recovery experts such as strategic technical alliance manager Chris Bross of DriveSavers and SMB partner manager Leon Feldman of ACE Data Recovery, some disk vendors put off block erasures for long periods of time or until they’re forced to resort to them. Sad but true: You can’t rely on housekeeping to remove data. Even sadder, there seem to be no utilities that will force the garbage collection. That seemingly simple solution has so far been ignored.
USB flash drives don’t support standard ATA secure-erase or TRIM—so unless you’re using a secure, encrypted type, you’ll need to contact the vendor for an erase utility.
You could overwrite the entire drive or just free space with files. This will work to a point, but, especially on SSDs, some blocks used in over-provisioning and marked as bad can’t be copied over. They may retain data you want to erase.
Data recovery companies can sift through raw data, block by block.
In the end, the only sure way to remove all unwanted sensitive data from the free space on an SSD or USB flash drive while retaining the data you still want is to back it up (use imaging if an operating system is involved), secure-erase the drive, and then restore the desired data. Sigh.
And when that’s not enough…
All the methods and programs I’ve described will work great for the average user. That said, forensic data recovery technology has come a long way. Normal affordable methods won’t counteract anything I’ve discussed. But if you have a formula for cold fusion, or a trade secret that will topple the global economy overnight…go for the degausser, the hammer, and then the blast furnace. You can’t be too sure.