The review should be one page, double spaced. The review should be a ½ page summary of the article, and a ½ page application of the material to the real world.
Please see the attached article to review, along with a document containing two sample article reviews, so you can get a better idea of how I want the review written.
Thank you.
Journal of Intelligent & Fuzzy Systems 38 (2020) 6159–6173 6159 DOI:10.3233/JIFS-179698 IOS Press
The impact of big data market segmentation using data mining and clustering techniques
Fahed Yoseph (a,b,∗), Nurul Hashimah Ahamed Hassain Malim (b), Markku Heikkilä (c), Adrian Brezulianu (d), Oana Geman (e) and Nur Aqilah Paskhal Rostam (b)
(a) Faculty of Social Sciences, Business and Economics, Åbo Akademi University, Turku, Finland
(b) Department of School of Computer Sciences, Universiti Sains Malaysia, Penang, Malaysia
(c) Faculty of Social Sciences, Business and Economics, Åbo Akademi University, Turku, Finland
(d) Faculty of Electronics, Telecommunications and Information Technology, Gheorghe Asachi Technical University, Iaşi, Romania
(e) Department of Health and Human Development, Stefan cel Mare University, Suceava, Romania
Abstract. Targeted marketing strategy is a prominent topic that has received substantial attention from both industry and academia. Market segmentation is a widely used approach for investigating the heterogeneity of customer buying behavior and profitability. Conventional market segmentation models in the retail industry are predominantly descriptive, lack sufficient market insight, and often fail to identify sufficiently small segments. This study also takes advantage of the Hadoop distributed file system for its ability to process vast datasets. Three different market segmentation experiments using modified best-fit regression and the Expectation-Maximization (EM) and K-Means++ clustering algorithms were conducted and subsequently assessed using cluster quality assessment. The results of this research are twofold: i) the insight into customer purchase behavior revealed for each Customer Lifetime Value (CLTV) segment; ii) the performance of the clustering algorithms in producing accurate market segments. The analysis indicated that the average customer lifetime was only two years and that the churn rate was 52%. Consequently, a marketing strategy was devised based on these results and implemented for the department store's sales. The marketing records showed that the sales growth rate increased from 5% to 9%.
Keywords: Market segmentation, data mining, customer lifetime value (CLTV), RFM model (recency frequency monetary)
1. Introduction
The retail industry collects enormous volumes of POS data. However, this raw POS data is of minimal use unless it is properly processed to generate retail insights, optimize marketing efforts, and drive decisions. The retailer's relationship with customers is the key to brand loyalty, repeat store visits, and, ultimately, sales conversions. This relationship has been affected by recent economic and social changes. The retail industry is prompted to be more strategic in its planning and to develop a deep understanding of its consumers as well as its competitors. Understanding customers' behavior and establishing a loyal relationship with customers have become the central concern and strategic goal for most retailers [1] interested in tracking and managing their customer lifetime value on a systematic basis [44]. Market segmentation is the process of dividing the market base of potential customers into similar or homogeneous groups or segments that possess mutual characteristics, which helps marketers gather individuals with similar choices and interests [2]. This enables retailers to avoid selling unprofitable and irrelevant products with regard to their marketing purpose, which results in better management of the available resources through the selection of suitable market segments and a primary focus on specific promising segments [3–15].
∗Corresponding author. Fahed Yoseph, Faculty of Social Sciences, Business and Economics, Åbo Akademi University, Turku, Finland, and Department of School of Computer Sciences, Universiti Sains Malaysia, 11800, Penang, Malaysia. E-mail: [email protected]
ISSN 1064-1246/20/$35.00 © 2020 – IOS Press and the authors. All rights reserved
6160 F. Yoseph et al. / The impact of big data market segmentation using data mining and clustering techniques
Furthermore, as far as the research scope is concerned, a number of studies have examined customer purchase behavior and lifetime value among different products based on a variety of market segmentations with demographic variables and characteristics. Instead of addressing individual consumers based on their purchasing behavior, most market segmentation studies merely considered an overview of consumers' historical data to produce assumptions about what makes consumers similar to one another. This method hides critical facts about individual consumers.
Among customer lifetime value models, a highly regarded model cited by many experts is the Pareto/NBD "Counting Your Customers" model proposed by Schmittlein, Morrison, and Colombo (1987). The model investigates customer purchase behavior in settings where customer dropout is unobserved. Although the model is powerful for analyzing customer purchase behavior, it has proven empirically complex to implement due to its computational challenges, and only a handful of studies claim to have implemented it [44].
Based on previous studies of market segmentation in the retail domain, the Recency, Frequency, and Monetary (RFM) model has been extensively employed because it divides customers into groups, which enables retailers to decide how to use their limited resources to provide effective customer service. Nonetheless, RFM has its own limitations [4]: it focuses only on customers' best scores and provides less meaningful scoring on recency, frequency, and monetary value for most consumers (Wei, Lin, and Wu, 2010). Moreover, RFM analysis cannot prospect for new customers, as it mainly concerns the organization's current customers [6], and it is not considered a precise quantitative analysis model because the importance of each RFM measure differs among industries [16–20].
The current research proposes an enhanced, user-friendly market segmentation modeling method that is more advanced and effective than the conventional RFM method. The integration of Customer Lifetime Value and the newly proposed RFM variants (PQ) and (T) into a closed-loop model represents different variations in customer purchase behavior. The enhanced model can simultaneously analyze millions of raw POS records and identify groups of customers by criteria the retailer may never have considered. This knowledge is expected to help marketers avoid assumptions when performing customer deep-dive and trend analysis, and subsequently to devise targeted marketing campaigns that result in sales growth and higher ROI. The RFMPQ and RFMT datasets concentrate on identifying the purchasing power history of an individual customer or segment. The P variable represents the average purchasing power per customer across all transactions, the Q variable represents the average purchasing power per product, and T represents the change in consumer buying behavior, or trend, measured using the change rate. The enhanced RFM model also incorporates CLTV for predicting future cash flow attributed to the customer's shopping period with the retailer [8]. A modified best-fit regression technique and the K-means++ and Expectation-Maximization (EM) clustering algorithms are then applied to analyze customer buying behavior, and the performance of the clustering techniques is assessed using cluster quality assessment. The analysis can also identify marketers' areas of focus and ensure the highest quality of customer service.
2. Market segmentation
Market segmentation is the process of categorizing a large heterogeneous market into smaller, homogeneous groups whose members share characteristics such as income, shopping habits, lifestyle, age, and personality traits [9]. These segments are relevant to marketing and sales and can be used to optimize products, customer service, and advertising for different consumers [6]. Many companies across the retail industry have identified customer service as a key market differentiator and tend to segment their customers to achieve positive customer experience and service delivery [10]. There are three types of market segmentation bases, namely demographic, geographic, and behavioral. Demographic segmentation is the most commonly used basis when segmenting a market. It can provide the retailer with a clear vision for future advertising plans and a precise customer shopping profile, and it focuses on the measurable factors of consumers and their households. Furthermore, this segmentation is primarily descriptive in terms of gender, race, age, income, lifestyle, and family status [11]. In contrast, behavioral segmentation divides customers based on their attitude toward products. Many marketers consider behavioral variables such as occasions, benefits, usage rate, customer status, readiness, loyalty status, and attitude towards a product as the ideal starting points for creating market segments [11]. In behavioral segmentation, consumers are segmented based on their evaluation and buying activities, as well as their use and disposal of goods, in order to recognize consumer needs. These criteria can provide a thorough understanding of consumer behavior, as they draw on social psychology, anthropology, economics, sociology, and psychology, all of which influence consumers' purchasing decisions [21–25]. To get a sense of the overall customer lifetime value for the customer base, [45] proposed a framework that integrates customers' distribution with iso-value curves by grouping customers on the basis of RFM characteristics, in order to understand the factors that trigger consumers' defection. [48] proposed an analytical model for consumer engagement related to the subsequent stages of the consumer life-cycle, such as customer development, customer acquisition, and customer retention. The authors concluded that the availability of data is vital to the development of advanced analysis at each consumer stage. However, several organizational issues remain, which constitute barriers to implementing analytics for customer engagement.
To address consumer behavior that evolves over time, this research examines the behavioral and demographic segmentation models and identifies customer behavior using the Customer Lifetime Value (CLTV) model and the Recency, Frequency, and Monetary (RFM) model.
2.1. Customer life time value (CLTV) model
Customer Lifetime Value (CLTV) is an important metric that measures the total worth or profit a business obtains from a customer over the whole period of their relationship with the retailer [8]. The literature defines customer churn as the termination of the contract between the firm and the customer, whereas customer retention refers to the collection of activities organizations undertake to reduce the number of customer defections.
Churn rate and retention rate are critical metrics for any company and are considered primary components of the future CLTV, where CLTV is an estimate of the average profit a customer is expected to generate before he or she churns [48]. The concept of retention and churn is often correlated with the industry life-cycle. When the industry is in the growth phase of its life-cycle, sales increase exponentially. However, customer churn is the most challenging problem for the retail industry. From this perspective, more insight is needed into the reasons for customer churn in a dynamic industry.
The three main components of CLTV are customer acquisition, customer expansion, and customer retention [46]. Nevertheless, it is crucial to consider COGS (Cost of Goods Sold) and acquisition cost to arrive at the real CLTV. The basic model for calculating CLTV is presented in Equation (1).
$$\mathrm{CLTV} = \sum_{t=1}^{n} \frac{p_t}{(1 + d)^{t}}\,(r)^{t} \qquad (1)$$
The above CLTV formula is more of a proxy for an average customer who stays for X periods of time and pays Y total amount of money. Here, t represents a specific period of time: t = 1 represents the first year and t = 2 denotes the second year. n represents the total number of periods the customer stays with the retailer before churn occurs. r represents the month-over-month retention rate. p_t is the profit that the customer (or customers) will generate for the retailer in period t, and finally, d refers to the churn rate. Additionally, the customer's loyalty can be calculated using the retention rate formula, as illustrated in Equation (2). In this formula, CE denotes the number of customers at the end of the time period, CN is the total number of new customers acquired in the chosen time period, and CS denotes the number of customers at the start of the time period.
$$\text{Retention rate} = \left(\frac{CE - CN}{CS}\right) \times 100 \qquad (2)$$
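Equations (1) and (2) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' PL/SQL implementation; the function names and the sample figures are hypothetical, and d is used exactly as the rate in the denominator of Equation (1).

```python
def cltv(profits, retention_rate, d):
    """Equation (1): sum over t = 1..n of p_t * r**t / (1 + d)**t.

    profits        -- [p_1, ..., p_n], profit per period (hypothetical figures)
    retention_rate -- r, expressed as a fraction between 0 and 1
    d              -- the rate in the denominator of Equation (1)
    """
    return sum(
        p_t * retention_rate ** t / (1 + d) ** t
        for t, p_t in enumerate(profits, start=1)
    )


def retention_rate_percent(ce, cn, cs):
    """Equation (2): ((CE - CN) / CS) * 100."""
    if cs <= 0:
        raise ValueError("CS (customers at the start) must be positive")
    return (ce - cn) / cs * 100


# Hypothetical period: 200 customers at the start, 40 acquired, 180 at the end.
rate = retention_rate_percent(ce=180, cn=40, cs=200)
# Two periods of 100 profit each, with the retention rate converted to a fraction.
value = cltv([100.0, 100.0], retention_rate=rate / 100, d=0.1)
print(rate, round(value, 2))
```

Note that Equation (2) yields a percentage, so it has to be divided by 100 before it is used as r in Equation (1).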
Management of consumer retention requires tools that allow decision-makers to assess each consumer's risk of defection and to understand the factors that trigger consumers' defection [47]. Customer retention, whose rate is also known as the loyalty rate, is the collection of activities a retailer uses to maintain long-term relationships with existing customers and to increase profitability by increasing the number of repeat customers. CLTV represents an improvement over traditional RFM analysis because the frequency of customers' purchases and the customers' average purchase amount are used for segmenting the customer base. The CLTV matrix classifies customer purchase behavior into four segments, namely Best, Spender, Frequent, and Uncertain, as defined by Marcus (1998). Table 1 lists the criteria for each segment.
Table 1 Criteria of customers in each segment
Segment | Average purchase amount | Average purchase frequency
Best | > the total average purchase amount | > the total average frequency
Spender | > the total average purchase amount | < the total average frequency
Frequent | < the total average purchase amount | > the total average frequency
Uncertain | < the total average purchase amount | < the total average frequency
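The four-way rule of Table 1 can be written as a small classifier. The sketch below is illustrative only: the function name and sample customers are hypothetical, and because Table 1 uses strict inequalities, a value exactly equal to the overall average is treated here as "not above" it, which is an assumption.

```python
def classify(avg_amount, avg_frequency, total_avg_amount, total_avg_frequency):
    """Assign a CLTV-matrix segment following the criteria of Table 1."""
    high_amount = avg_amount > total_avg_amount
    high_frequency = avg_frequency > total_avg_frequency
    if high_amount and high_frequency:
        return "Best"
    if high_amount:
        return "Spender"
    if high_frequency:
        return "Frequent"
    return "Uncertain"


# Hypothetical customers: (average purchase amount, purchase frequency).
customers = {"c1": (120.0, 10), "c2": (150.0, 2), "c3": (30.0, 9), "c4": (20.0, 1)}
total_avg_amount = sum(a for a, _ in customers.values()) / len(customers)
total_avg_frequency = sum(f for _, f in customers.values()) / len(customers)

segments = {
    cid: classify(a, f, total_avg_amount, total_avg_frequency)
    for cid, (a, f) in customers.items()
}
print(segments)
```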
2.2. Recency, frequency, and monetary (RFM) model
RFM is a standard statistical marketing model for customer behavior segmentation and for assessing consumer lifetime value. The model is very popular in the retail industry because it groups customers based on their shopping history: how recently, how often, and how much the customer bought. The RFM model helps retailers group customers into segments or categories to identify customers who are more likely to respond to marketing promotions and future personalization services [17]. R (recency) refers to the time elapsed since the customer's last purchase. F (frequency) refers to the number of purchases in a time period, and M (monetary) refers to the amount of money spent in a period [18]. Quintile scoring is the most commonly used scoring in the RFM method: customers are arranged in ascending or descending order (best to worst) and divided into five equal groups, where the best group receives the highest score (5) and the worst receives the lowest score (1) [1]. The RFM score is the weighted average of its individual components and is calculated as shown in Equations (3) and (4) to derive a continuous RFM score. Finally, these scores can be rescaled to the 0–1 range [17].
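$$\text{RFM score} = (\text{recency score} \times \text{recency weight}) + (\text{frequency score} \times \text{frequency weight}) + (\text{monetary score} \times \text{monetary weight}) \qquad (3)$$
$$\text{Rescaled RFM score} = \frac{\text{RFM score} - \text{minimum RFM score}}{\text{maximum RFM score} - \text{minimum RFM score}} \qquad (4)$$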
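As a rough illustration of quintile scoring followed by Equations (3) and (4), the sketch below scores five hypothetical customers in plain Python. The equal weights, the function names, and the tie handling (ties keep input order) are assumptions made for the example, not values taken from the study.

```python
def quintile_scores(values, higher_is_better=True):
    """Rank the values and split them into five equal groups scored 5 (best) to 1."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i], reverse=higher_is_better)
    scores = [0] * n
    for rank, i in enumerate(order):
        scores[i] = 5 - (rank * 5) // n
    return scores


def rfm_scores(recency_days, frequency, monetary, weights=(1 / 3, 1 / 3, 1 / 3)):
    """Return (raw, rescaled) RFM scores per Equations (3) and (4)."""
    r = quintile_scores(recency_days, higher_is_better=False)  # recent is better
    f = quintile_scores(frequency)
    m = quintile_scores(monetary)
    w_r, w_f, w_m = weights
    raw = [w_r * a + w_f * b + w_m * c for a, b, c in zip(r, f, m)]
    lo, hi = min(raw), max(raw)
    if hi == lo:  # all customers identical: rescaling is undefined, return zeros
        return raw, [0.0] * len(raw)
    return raw, [(s - lo) / (hi - lo) for s in raw]


# Hypothetical customers: days since last purchase, purchase count, total spend.
recency_days = [3, 40, 10, 200, 90]
frequency = [12, 4, 9, 1, 2]
monetary = [900, 150, 600, 20, 80]
raw, rescaled = rfm_scores(recency_days, frequency, monetary)
print([round(s, 2) for s in raw], [round(s, 2) for s in rescaled])
```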
2.3. Market segmentation using data mining, RFM, CLTV models and clustering techniques
Market segmentation helps to differentiate and customize marketing strategies by segment. Market segmentation is a significant application of data mining, where data mining is used to interrogate segmentation data and create data-driven behavioral segments that are applied to detect meaningful patterns and rules underlying consumer behavior [19]. [26] and [27] were among the studies that performed market segmentation using data mining, RFM, CLTV, and clustering techniques to form a decision-making system. [28] proposed clustering and profiling of customers using customer relationship management (CRM) and RFM for recommendations. In addition, data mining was conducted on historical customer sales data using the RFM model with the K-Means algorithm, and the results outlined recommendations for customer relationship strategy. Also, f (2016) proposed a three-dimensional market segmentation model based on customer lifetime value, customer activity, and customer satisfaction. For greater accuracy, the author grouped customers into several different groups, with the corresponding variables obtained from the RFM, Kano, and BG/NBD models.
Furthermore, the market segmentation model helps enterprises to maximize their profits. In [29–35], customers were classified into various clusters using the RFM technique, and association rules were mined to identify high-profit customers. RFM statistical techniques and clustering methods for customer value analysis were combined in [26] for a company's online selling. As far as methodology is concerned, there is no standard convention for measuring customer purchase behavior, as each study examines it differently. Nevertheless, the K-Means algorithm is an extensively used clustering algorithm in previous research due to its simplicity and speed in working with large amounts of data [36]. Despite this strength, the K-Means method uses a random distribution for the seeding positions and does not produce the same result each time it is run [37]. An improved algorithm called K-means++ uses a more sophisticated seeding procedure for the initial choice of center positions and is often twice as fast as standard K-means. Neha, Kirti, and Kanika (2012) noted that K-means and Expectation-Maximization (EM) are the two most commonly used algorithms for the identification of growth patterns. Although EM is similar to the K-Means algorithm, it is based on two different steps that are iterated until there are no more changes in the current hypothesis [29]. Expectation (E) refers to computing the probability that each datum is a member of each class. Maximization (M) refers to altering the parameters of each class to maximize those probabilities. The iterations eventually converge, although not necessarily to the correct solution. Furthermore, a significant feature of the EM algorithm is that it can be applied to problems in which the observed data provide only "partial" information [30]. Based on several comparative studies of the EM and K-Means methods [31–34], it was observed that EM outperformed K-Means and that results improved when the two were hybridized.
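The seeding idea that distinguishes K-means++ from plain K-means can be sketched as follows. This is a generic illustration of the commonly described procedure (first center uniformly at random, each later center chosen with probability proportional to its squared distance from the nearest center already chosen); the function names and sample points are hypothetical and are not taken from the study's implementation.

```python
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_pp_seeds(points, k, rng=None):
    """Choose k initial centers using K-means++ style seeding."""
    rng = rng or random.Random(0)
    centers = [rng.choice(points)]
    while len(centers) < k:
        # Squared distance from each point to its nearest chosen center.
        weights = [min(squared_distance(p, c) for c in centers) for p in points]
        if sum(weights) == 0:
            # Every point coincides with a chosen center; fall back to uniform.
            centers.append(rng.choice(points))
            continue
        centers.append(rng.choices(points, weights=weights, k=1)[0])
    return centers


# Hypothetical (recency score, monetary score) points forming three loose groups.
points = [(1.0, 1.0), (1.2, 0.9), (5.0, 5.1), (5.2, 4.9), (9.0, 1.0), (9.1, 1.2)]
seeds = kmeans_pp_seeds(points, 3, random.Random(42))
print(seeds)
```

Because already chosen points have weight zero, the seeds are distinct whenever the input points are distinct, which avoids the duplicate starting centers that purely random seeding can produce.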
The current study integrates two dynamic models, namely the CLTV and RFM models, and adds new RFM variants, i.e., P, Q, and T, to address the weaknesses and inaccuracies in consumer modeling that are caused by the limitations of RFM. In addition, this study applies the K-means++ and Expectation-Maximization (EM) clustering algorithms to offer the retail industry an effective analysis of customer buying behavior through the combination of customer profitability and product profitability in creating a strategic marketing campaign, as explained in the introduction [38–40].
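To make the Expectation and Maximization steps concrete, the sketch below fits a one-dimensional mixture of Gaussians in plain Python. It is a textbook-style illustration under stated assumptions (Gaussian components, a fixed number of iterations, a small variance floor, hypothetical data and starting values), not the clustering code used in the study.

```python
import math


def gaussian_pdf(x, mean, var):
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


def em_gmm_1d(data, means, variances, weights, iterations=50):
    """Run EM for a 1-D Gaussian mixture; returns (means, variances, weights)."""
    means, variances, weights = list(means), list(variances), list(weights)
    k = len(means)
    for _ in range(iterations):
        # E-step: probability that each datum belongs to each component.
        resp = []
        for x in data:
            dens = [weights[j] * gaussian_pdf(x, means[j], variances[j]) for j in range(k)]
            total = sum(dens)
            resp.append([d / total for d in dens] if total > 0 else [1 / k] * k)
        # M-step: re-estimate each component to maximize those probabilities.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            if nj == 0:
                continue  # empty component: leave its parameters unchanged
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(var, 1e-6)  # floor avoids a collapsing variance
            weights[j] = nj / len(data)
    return means, variances, weights


# Hypothetical spend values with two visible groups, and rough starting guesses.
data = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]
fit_means, fit_vars, fit_weights = em_gmm_1d(data, [0.0, 6.0], [1.0, 1.0], [0.5, 0.5])
print([round(m, 2) for m in fit_means], [round(w, 2) for w in fit_weights])
```

Unlike K-means, each point keeps a soft membership in every component, which is the sense in which EM works with class probabilities.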
2.4. Mining big data
Data mining is the process of extracting information from large data sets and transforming it into an understandable form for further use. Data mining can be used where the database is large and the classification of such data is difficult [35]. The term big data is often used for very large databases whose size ranges from terabytes to many petabytes, beyond the ability of commonly used relational database management systems (RDBMS) to process within a tolerable elapsed time. Patel, Birla, & Nair (2012) conducted extensive experiments on the big data problem and arrived at the Hadoop Distributed File System (HDFS) for storage and the MapReduce method for parallel processing of large volumes of data. However, research in big data analysis using data mining, especially with clustering methods, is still considered young and therefore attracts many researchers to this area. [37, 38] proposed a fast parallel k-means clustering algorithm based on MapReduce, which has been widely embraced by both academia and industry. They used speedup, scaleup, and sizeup to evaluate the performance of their proposed algorithm. Their findings showed that the proposed model could process very large datasets effectively on commodity (low-cost) hardware.
Hadoop is becoming a commodity for every data-driven organization. Where data is larger and comes in many formats, mining and extracting intelligence from it has always been a challenge [39]. This new dynamic has brought new challenges to current analytical models and traditional databases and emphasizes the need for a paradigm shift in data extraction and data analysis. These challenges include the performance of data retrieval and the variety of data sources, for which the relational database format may no longer be the best option. [39] stated that traditional database systems fall short in scaling to boost performance efficiency and in dealing with big data effectively, and thus the adoption of systems such as Hadoop is increasing. Hadoop is an open-source framework for data-intensive distributed processing of large-scale data, based on the MapReduce programming model and a distributed file system called the Hadoop Distributed File System (HDFS).
The MapReduce programming model is a methodology for processing and generating large datasets, making Hadoop the preferred solution to the problems of traditional data mining [41].
The main components of Hadoop are the Hadoop Distributed File System (HDFS) and MapReduce. HDFS is a high-bandwidth clustered storage layer that allows applications to rapidly process massive data in parallel, which is vital for large files. MapReduce is the heart of Hadoop: it divides the input dataset into independent smaller pieces and distributes them among multiple machines, referred to as nodes, which process them in parallel [42].
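The split, map, shuffle, and reduce flow described above can be imitated in ordinary Python to show the shape of the computation. This sketch does not use the Hadoop API and runs in a single process; the function names and the sample POS records are hypothetical.

```python
from collections import defaultdict


def map_phase(record):
    """Emit (key, value) pairs for one input record: here, spend per customer."""
    customer_id, amount = record
    yield customer_id, amount


def shuffle(pairs):
    """Group all emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine the values of one key into a single result."""
    return key, sum(values)


def run_job(splits):
    # Each split stands in for a piece of the input handled by a separate node.
    mapped = [pair for split in splits for record in split for pair in map_phase(record)]
    return dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())


# Two hypothetical input splits of (customer id, transaction total) records.
splits = [[("c1", 10.0), ("c2", 5.0)], [("c1", 2.5), ("c3", 7.0)]]
totals = run_job(splits)
print(totals)
```

In a real cluster the map calls for different splits and the reduce calls for different keys would run on different nodes; the independence of those calls is what makes the model parallel.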
3. Research method
The methodology of this work, which involved five phases, is outlined in Fig. 1. The first phase focused on the implementation of our POS database on a single-node Hadoop distributed file system.
The experiments were performed on the following system. Single node using Java: hardware Intel Core i5, 2.4 GHz CPU, 8 GB RAM; software Java with JDK 1.8. Hadoop implementation: Ubuntu 16.10, Java with JDK 1.8, Hadoop 2.7.0. The second phase focused on Extract, Transform, and Load (ETL), involving the data preprocessing steps of data cleansing, feature selection, and data transformation. Since the dataset used in the current research differed from those in the existing literature, a controlled experiment was performed in which the work of [20] was replicated as the baseline of this research.
The hybrid approach (RFMPQ & CLTV) was included in the third and fourth phases, where different methods were employed stepwise. As the data was transformed into three different variants, i.e., RFM, PQ, and T, the first processing step differed for one of them.
Classification was used to categorize customer purchase behavior into the CLTV matrix based on the RFM and PQ datasets, while modified best-fit regression was performed on the T dataset to find the customer purchase trend (curve). Although [20] employed only the K-means technique, the present experiment was extended to include K-means++ and EM.
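The modification the authors apply to best-fit regression is not specified in this section, so the sketch below substitutes ordinary least squares as a stand-in to show how a purchase trend can be read from a fitted slope. The function name and the sample series are hypothetical.

```python
def trend_slope(amounts):
    """Least-squares slope of purchase amounts against period index 0, 1, 2, ..."""
    n = len(amounts)
    if n < 2:
        raise ValueError("at least two periods are needed to fit a trend")
    mean_x = (n - 1) / 2
    mean_y = sum(amounts) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(amounts))
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    return sxy / sxx


# Hypothetical spend per period for two customers.
growing = trend_slope([100, 120, 140, 160])   # positive slope: spend is rising
declining = trend_slope([90, 60, 30])         # negative slope: spend is falling
print(growing, declining)
```

The sign and size of the slope give a simple change-rate style signal per customer, which is the role the T variant plays in the text.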
Subsequently, the outputs were fed into the clustering algorithms, i.e., K-means++ and EM, in the fourth phase for further demographic segmentation. The accuracy of these clustering algorithms was measured during the final phase using the cluster quality assessment introduced by Draghici & Kuklin (2003). Additionally, the retention rate was calculated, and human judgment was included as a measure of the effectiveness of this method for a marketing campaign.
Fig. 1. Methodologies for the hybrid of classification and clustering of market segmentation.
The methodology for this research is illustrated in Fig. 1.
3.1. Proposed data transformation using RFM model, RFMPQ, RFMT
In this study, Apache Hadoop was run on a single-node Hadoop cluster. Ubuntu Linux 12.04 64-bit Server Edition was used as the operating system, and KVM (Kernel-based Virtual Machine) was selected as the virtualization environment. The Hadoop (HDFS) node was accessed via Secure Shell (SSH). No parameter or optimization adjustment was made to the operating system to improve performance. This type of Hadoop implementation is sufficient to provide a running Hadoop environment for our experiment. The market segmentation model uses retail POS data acquired from a medium-sized retailer in the State of Kuwait. The POS data contains three years (2012–2015) of initial and repeat purchases by customers who made their purchases at different geographical branches. Each transaction represents a product purchased, and each line consists of a cashier number, store code, item code, brand code, product quantity sold, product price, date and time of the transaction, sub-total, and grand total, as well as the customer's demographic information.
Since the data was held in separate compilations based on the stores' geographic locations, a common data format with consistent definitions for more descriptive keys and fields was developed to merge the information. In this phase, string variables were converted to numeric variables, and missing values were then checked and manually replaced with default or mean values. In this research, the PQ variants are used to describe customers' purchase power in different demographic and behavioral eras and customers' attraction to a specific product and service. For the T variant, customers from the Best segment are used to identify the customer purchase curve, and we calibrate the market segmentation model using repeated transactions for 3220 customers over a two-year period.
The age attribute was grouped into six categories, on the premise that a typical customer's needs change as they age. Each category was identified by a unique number: Category 1 = (1–17), Category 2 = (18–24), Category 3 = (25–34), Category 4 = (35–44), Category 5 = (45–54), and senior Category 6 = (55+). The Gender attribute was encoded as 1 for male customers, 2 for female customers, and 3 for companies. Furthermore, using the demographic concept hierarchy method, lower-level concepts such as city and country were replaced by the higher-level concept nationality. Citizens of Middle Eastern nationalities, Asian nationalities, the USA, and Canada were each assigned a unique numeric (binary) value. Other nationalities were grouped by continent, namely Europeans and Africans. One exception was made for British nationals due to their high volume of purchases. To ensure the maximum accuracy of RFM scores, the values of five dimension attributes from the POS data were necessary.
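The age and gender encodings described above amount to a small lookup. The sketch below follows the six age bands and the 1/2/3 gender codes given in the text; the function name, the dictionary name, and the lowercase label strings are hypothetical.

```python
def age_category(age):
    """Map an age in years to the six numeric categories described in the text."""
    if age < 1:
        raise ValueError("age must be at least 1")
    # (upper bound of the band, category number)
    bands = [(17, 1), (24, 2), (34, 3), (44, 4), (54, 5)]
    for upper, category in bands:
        if age <= upper:
            return category
    return 6  # senior category, 55 and above


# Gender codes from the text; the label strings are assumptions for the example.
GENDER_CODES = {"male": 1, "female": 2, "company": 3}

encoded = [(age_category(a), GENDER_CODES[g]) for a, g in [(16, "male"), (29, "female"), (61, "company")]]
print(encoded)
```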
The attributes are described in Table 2.
Table 2 Attributes of RFMPQ
Attribute | Description
CUSTOMER ID | Customer unique identifier used to capture customer's related information.
TRX DATE | Transaction date used to capture customer's Recency (R).
TXH COUNT | Number of transactions, used to capture the Frequency (F) of transactions made by a customer.
TRX TOTAL SALE | The total amount of each transaction, used to capture the Monetary (M) value made by the customer.
TRXUNIT PRICE | The average purchase power (monetary), used to capture Average Monetary (P) per customer.
TRX QTY | The average purchase power (quantity, Q), used to capture the average items purchased per customer.
The next step involved the calculation of the RFM scores as well as the newly proposed variants PQ and T. The implementations of CLTV, retention rate, RFMPQ, and T were developed using the PL/SQL programming language. The RFMPQ score refers to the weighted average of its individual components, and the scoring analysis typically involves grouping customers into buckets (quantiles) of equal size. In this study, the grouping procedure was applied independently to the five RFMPQ component measures: customers were grouped according to the respective measure into classes of equal sizes.
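One way to derive the five RFMPQ measures from raw transaction rows is sketched below in Python rather than PL/SQL. The row layout, the function name, and the reading of P as average spend per transaction and Q as average items per transaction are assumptions based on the descriptions in Table 2, not the authors' code.

```python
from datetime import date


def rfmpq(transactions, as_of):
    """Aggregate rows of (customer_id, trx_date, total_sale, quantity) per customer."""
    per_customer = {}
    for customer_id, trx_date, total_sale, quantity in transactions:
        entry = per_customer.setdefault(
            customer_id, {"last": trx_date, "F": 0, "M": 0.0, "items": 0}
        )
        entry["last"] = max(entry["last"], trx_date)
        entry["F"] += 1
        entry["M"] += total_sale
        entry["items"] += quantity
    return {
        customer_id: {
            "R": (as_of - e["last"]).days,  # days since the last purchase
            "F": e["F"],                    # number of transactions
            "M": e["M"],                    # total spend
            "P": e["M"] / e["F"],           # assumed: average spend per transaction
            "Q": e["items"] / e["F"],       # assumed: average items per transaction
        }
        for customer_id, e in per_customer.items()
    }


# Hypothetical POS rows.
rows = [
    ("c1", date(2015, 1, 10), 40.0, 2),
    ("c1", date(2015, 3, 1), 60.0, 4),
    ("c2", date(2014, 12, 1), 15.0, 1),
]
measures = rfmpq(rows, as_of=date(2015, 3, 31))
print(measures)
```

Each of the five measures can then be passed through the same quantile bucketing independently, as the paragraph above describes.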