The maturation of database management systems (DBMS) technology has coincided with significant developments in distributed computing and parallel processing technologies. The end result is the emergence of distributed database management systems (DDBMS) and parallel database management systems. These systems have started to become the dominant data-management tools for highly data-intensive applications. Also read the attached article and answer the questions:
- Discuss at least four reasons for distributed databases.
- Discuss, also, some advantages of distributed databases – what additional functions does DDBMS have over a centralized DBMS?
- Discuss the architecture of a DDBMS – what conditions must exist in order for the architecture to properly be called distributed?
- Discuss/Explain the main software modules of a DDBMS. What are the main functions of each of these modules in the context of the client-server architecture?
- Discuss fragmentation, replication, and allocation that are defined as DDBMS issues.
300 words
Journal of Software Engineering and Applications, 2014, 7, 891-905 Published Online October 2014 in SciRes. http://www.scirp.org/journal/jsea http://dx.doi.org/10.4236/jsea.2014.711080
How to cite this paper: Al-Sayyed, R.M.H., Al Zaghoul, F.A., Suleiman, D., Itriq, M. and Hababeh, I. (2014) A New Approach for Database Fragmentation and Allocation to Improve the Distributed Database Management System Performance. Journal of Software Engineering and Applications, 7, 891-905. http://dx.doi.org/10.4236/jsea.2014.711080
A New Approach for Database Fragmentation and Allocation to Improve the Distributed Database Management System Performance
Rizik M. H. Al-Sayyed¹, Fawaz A. Al Zaghoul¹, Dima Suleiman¹, Mariam Itriq¹, Ismail Hababeh²
¹King Abdullah II School for Information Technology, The University of Jordan, Amman, Jordan
²Faculty of Computer Engineering and Information Technology, German Jordanian University, Amman, Jordan
Email: [email protected], [email protected]@ju.edu.jo, [email protected], [email protected], [email protected]
Received 10 August 2014; revised 5 September 2014; accepted 1 October 2014
Copyright © 2014 by authors and Scientific Research Publishing Inc. This work is licensed under the Creative Commons Attribution International License (CC BY). http://creativecommons.org/licenses/by/4.0/
Abstract
The efficiency and performance of a Distributed Database Management System (DDBMS) is mainly determined by its design and by the network communication cost between sites. Fragmentation and distribution of data are the major design issues of the DDBMS. In this paper, we propose a new approach that integrates both fragmentation and data allocation in one strategy based on a high-performance clustering technique and transaction processing cost functions. This new approach efficiently and effectively achieves the objectives of data fragmentation, data allocation and network site clustering. The approach splits the data relations into pair-wise disjoint fragments and determines, for each fragment and each network site, whether the fragment should be allocated there, that is, whether the allocation benefit outweighs the cost, depending on the high-performance clustering technique. To show the performance of the proposed approach, we performed experimental studies on a real database application at different network connectivities. The results show that the approach achieves minimum total data transaction costs between different sites, reduces the amount of redundant data to be accessed between these sites and improves the overall DDBMS performance.
Keywords
Distributed Database Management System, Fragmentation, Allocation, Clustering, Network Sites
R. M. H. Al-Sayyed et al.
1. Introduction
Much research has been done in the last few years to improve the performance of distributed database management systems (DDBMS). A DDBMS manages a collection of logically interrelated data that are physically allocated at different locations over a computer network [1]. Most recent work aims to keep DDBMS performance high, and therefore focuses on designing the database so that the overall cost is minimal. Keeping the cost minimal is not an easy task, because the large volume of transaction processing increases the complexity of distributed databases. Several techniques have been proposed to improve database performance, each addressing at least one of the following database management issues: database fragmentation, data allocation and replication, and network site clustering.
The fragmentation process divides the database into portions, each of which is called a fragment. Fragmentation can be horizontal, vertical or mixed. The main advantage of fragmentation is that it improves the efficiency of a distributed database design, since data is stored only where it is needed. Fragments can be allocated at different network sites in a process called data allocation.
Fragment allocation is an NP-complete problem, so its computational complexity is high. To reduce this complexity, heuristic algorithms have been proposed. In the allocation process, each fragment is assigned to a network node, and sometimes to more than one node, to achieve data availability, system reliability and performance.
The clustering technique groups distributed database network sites into logical clusters. To reduce the communication time for data allocation, many algorithms have been used to search for the optimal grouping of network sites into disjoint clusters and for a better data distribution among them. Clustering aims at eliminating extra communication costs between the network sites and thereby enhancing DDBMS performance.
Many existing algorithms for data fragmentation and allocation in DDBMS assume restrictions on the number of network sites, which makes their results impractical and undermines the efficiency and validity of their outcomes. Moreover, constraints on network connectivity and transaction processing time limit the applicability of the proposed solutions to a small number of DDBMS cases [1].
One drawback of fragmentation and allocation solutions is the high computational complexity of their associated algorithms. In fact, when a database is distributed over a network with a large number of sites, finding an efficient, reliable and optimal solution for fragmentation and allocation is a difficult task.
This paper proposes a new technique that splits the database relations into disjoint fragments. In addition, it introduces a high-speed clustering technique that groups the distributed network sites into a set of clusters according to their communication cost. It also proposes a new intelligent technique for data allocation and redistribution based on transaction processing cost functions. Moreover, it implements a user-friendly simulation tool that performs fragmentation, clustering, allocation and replication of a database, and assists database administrators in measuring DDBMS performance.
The rest of this paper is organized as follows. Section 2 summarizes related work. A method for partitioning the database is presented in Section 3. Network site clustering is described in Section 4. Section 5 discusses the allocation of fragments to the clusters and then to their sites. Performance evaluation is presented in Section 6. Finally, the conclusions and future work are presented in Section 7.
2. Related Work
Many studies have been published on attempts to improve the performance of DDBMS. These studies have mostly investigated fragmentation, allocation and sometimes clustering problems. In this section, we present the main contributions related to these problems, discuss them and highlight the novelties of our proposed solutions with respect to fragmentation, allocation and clustering.
There are several types of fragmentation [2]: vertical, horizontal and mixed. In vertical fragmentation, each schema is divided into small fragments, and all fragments must contain a common candidate key. A new algorithm proposed in [3] employs vertical fragmentation for object-relational database systems; this fragmentation method depends on user input at different sites.
In [4], the authors provide a solution to the initial fragmentation problem of relational databases for DDBMS. It partitions the relational database properly at the initial stage, when no data statistics or query execution frequencies are available.
To minimize the communication time required for data allocation and query processing, much research has been done on grouping a large number of network sites into a small number of logical clusters, which improves the performance of a DDBMS by reducing transaction response time [5].
The authors of [6] present a new formulation of the problem of fragmenting data and allocating the fragments at minimum cost, for both structured and unstructured data, by grouping sites that are near each other, and hence have low communication cost, into one cluster. A dynamic clustering method is also adopted for both structured and unstructured databases to reduce the movement of data between sites.
In a DDBMS the cost of communication is high, so many researchers have tried to minimize this cost and to share load through load balancing, which can be achieved by analyzing resource sharing and the allocation of fragments and transactions in the DDBMS [7].
The complexity of a distributed database algorithm depends on the allocation method used. Some enhancements have been made to reallocation algorithms. The authors of [8] proposed an algorithm that reallocates fragments based on a cost calculated for each fragment individually. The reallocation depends on finding the maximum update cost value for each fragment. This technique takes into account the network topology and the query frequency values observed over the network.
Data allocation in a DDBMS can be done by means of mobile agents, and many algorithms have been implemented to optimize and solve allocation problems [9].
The authors of [10] present a biogeography-based optimization technique for non-replicated static allocation of data fragments during database design that minimizes the total data transmission cost during the execution of a set of queries.
3. Database Partitioning
Various methods already exist that describe data fragmentation in DDBMS. Naturally, all schemes have benefits and drawbacks. Some methods still need to incorporate performance evaluation, may not minimize transaction response time, and cannot guarantee the ability to process a given portion of a given transaction at all sites [2], [3] and [4].
In our proposed approach, the database is partitioned into pair-wise disjoint fragments by using a horizontal partitioning technique, in which the records of a relation are split into disjoint fragments; this strategy guarantees the ability to process all portions of a given transaction and distributes it precisely over the DDBMS sites.
Data fragments are generated by performing the following processes in order: defining transactions, creating segments and extracting disjoint fragments. Figure 1 below describes the architecture of the fragmentation method, which supports the use of knowledge extraction and helps to achieve the effective use of small data packets.
Figure 1. Generating disjoint fragments.
As shown in Figure 1, data requests are initiated from the DDBMS sites (Site1, Site2, and Site3) and define transactions as queries (Query1, Query2, and Query3); these transactions (queries) are then processed into disjoint fragments (Fragment1, Fragment2, and Fragment3). A database transaction may be associated with more than one relation; in this case, the transaction should be divided into a number of sub-transactions equal to the number of relations used by that transaction. The process of generating fragments is described as follows:
1) Database fragmentation starts with any two fragments that have records in common. If there is an intersection, three disjoint fragments are generated:
• The records common to the two intersecting fragments,
• The records in the first fragment but not in the second fragment, and
• The records in the second fragment but not in the first fragment.
2) The intersecting fragments are then removed from the fragment list. This process continues until all intersecting fragments have been removed.
3) The newly derived fragments, together with the fragments that do not intersect with any other fragment, form the new list of totally disjoint fragments.
We developed the fragmentation algorithm described in Listing 1 to generate disjoint database fragments. In the partitioning method, all transactions are processed, redundant transactions are eliminated, and application speed and efficiency are improved by minimizing the number of fragments to be accessed.
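The three fragment-generation steps of this section can be illustrated with a short Python sketch. It is a simplified worklist version written for illustration, not a transcription of the authors' Listing 1: each segment is modelled as a set of record identifiers, and the function name `disjoint_fragments` and the sample data are assumptions.

```python
from typing import FrozenSet, Iterable, List


def disjoint_fragments(segments: Iterable[Iterable[int]]) -> List[FrozenSet[int]]:
    """Split possibly overlapping segments into pairwise disjoint fragments."""
    # Start from the distinct, non-empty segments (hypothetical record ids).
    fragments = list({frozenset(s) for s in segments} - {frozenset()})
    changed = True
    while changed:
        changed = False
        for i in range(len(fragments)):
            for j in range(i + 1, len(fragments)):
                a, b = fragments[i], fragments[j]
                common = a & b
                if common:
                    # Step 1: derive the common part and the two remainders.
                    pieces = {common, a - common, b - common} - {frozenset()}
                    # Step 2: remove the two intersecting fragments from the list.
                    rest = [f for k, f in enumerate(fragments) if k not in (i, j)]
                    # Step 3: derived and untouched fragments form the new list.
                    fragments = list(set(rest) | pieces)
                    changed = True
                    break
            if changed:
                break
    # Sort only to make the result deterministic.
    return sorted(fragments, key=lambda f: sorted(f))


if __name__ == "__main__":
    for fragment in disjoint_fragments([{1, 2, 3}, {3, 4}, {4, 5, 6}]):
        print(sorted(fragment))
```

The loop terminates because every split strictly reduces the total number of records held across all fragments, and it stops only when no two fragments share a record.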
4. Network Sites Clustering
Generating disjoint database fragments pays off only if it enhances the performance of the distributed database system. As the number of database sites becomes very large, supporting high system performance under consistency and availability constraints becomes crucial. Different techniques could be developed for this purpose; one of them is clustering the distributed network sites. Clustering database sites is a technique in which sites that have a similar physical property (e.g., comparable communication costs) are logically grouped together to increase the performance of the distributed database system. However, grouping sites into clusters is still an open problem, and finding the optimal solution has been shown to be NP-complete, since the problem can be transformed to a cheapest path problem. A near-optimal grouping of database sites into clusters therefore helps to eliminate extra communication costs between the sites during data allocation and improves system performance. Grouping the sites after database fragmentation speeds up data allocation, because fragments are distributed over clusters of sites rather than site by site. Thus, communication costs are minimized and distributed database system performance is improved.
A high-speed clustering technique based on the least average communication cost between sites is introduced here. It is suitable for distributed databases where the communication costs between two sites are equal or near-equal, and where similar computers are used on the network. The clustering parameters that control the input/output computations for generating clusters and determining the set of site(s) in each cluster are described and computed as follows:
• The Clustering Decision Value (cdv), which determines whether or not a site can be grouped in a specific cluster:
$$cdv(S_i, S_j) = \begin{cases} 1, & \text{if } CC(S_i, S_j) \le CCR \wedge i \ne j \\ 0, & \text{if } CC(S_i, S_j) > CCR \vee i = j \end{cases} \tag{1}$$
Accordingly, if cdv(Si, Sj) is equal to 1, the sites Si and Sj are assigned to one cluster; otherwise they are assigned to different clusters. Each site should be included in only one cluster, and the final distribution of the network sites is the clustering that satisfies the least average communication cost between the sites.
• The Communication Cost between sites, CC(Si, Sj). This is the cost of loading and processing database fragment(s), in ms/byte, between any two sites in the distributed database system:
CC(Si, Sj) = cost of creating the data packet + cost of transmitting the data packet between Si and Sj
• The Communication Cost Range CCR (ms/byte) that a site should match to be grouped in a specific cluster. This value is determined by the network administrators and depends on how much time the sites are allowed to take to transmit or receive their data and still be considered part of the same cluster.
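As a minimal sketch, Equation (1) maps directly onto a small Python function. The function name, the argument order and the sample costs are assumptions made for illustration only.

```python
def cdv(cc: float, ccr: float, i: int, j: int) -> int:
    """Clustering decision value of Equation (1) for sites S_i and S_j.

    cc  -- communication cost CC(S_i, S_j) in ms/byte
    ccr -- communication cost range CCR in ms/byte
    """
    # 1 only when the sites are distinct and their cost is within the range.
    return 1 if (cc <= ccr and i != j) else 0


if __name__ == "__main__":
    # Hypothetical costs, with CCR = 5 ms/byte.
    print(cdv(4.0, 5.0, 1, 2))  # within range, distinct sites
    print(cdv(6.0, 5.0, 1, 2))  # cost exceeds the range
    print(cdv(0.0, 5.0, 3, 3))  # same site
```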
Based on the parameters, assumptions, and computations described above, we developed the clustering algorithm shown in Listing 2 to determine the network clusters and their respective sites.
Listing 1. Fragmentation algorithm.
Input:
K: Number of the last fragment
Rmax: Number of database relations
Nmax: Number of fragments in each relation
Step 1: Set 0 to K
Step 2: Set 1 to R
Step 3: Do steps (4 – 21) until R > Rmax
Step 4: Set 1 to I
Step 5: Do steps (6 – 20) until I > Nmax
Step 6: Set 1 to J
Step 7: Do steps (8-18) until J > Nmax
Step 8: If I ≠ J and ∃ Si, Sj ∈ SR go to step (9) Else Add 1 to J, go to step (18)
Step 9: If Si ∩ Sj ≠ Ø do steps (10)-(17) Else, Add 1 to J and go to step (19)
Step 10: Add 1 to K
Step 11: Create new fragment Fk = Si ∩ Sj and add it to F
Step 12: Create new fragment Fk+1 = Si – Fk and add it to F
Step 13: Create new fragment Fk+2 = Sj – Fk and add it to F
Step 14: Delete Si
Step 15: Delete Sj
Step 16: Set Nmax + 1 to J
Step 17: End IF
Step 18: End IF
Step 19: Loop
Step 20: Add 1 to I
Step 21: Loop
Step 22: Set 1 to I
Step 23: Do steps (24 – 35) until I > Nmax
Step 24: Set 1 to J
Step 25: Do steps (26 – 33) until J > Nmax
Step 26: If I ≠ J and ∃ Si, Sj ∈ SR go to step (27) Else Add 1 to J, go to step (33)
Step 27: If Si ∩ Sj = Ø do steps (28)-(33)
Step 28: Add 1 to K
Step 29: Create new fragment Fk = Rj – UF
Step 30: End IF
Step 31: If Fk ≠ Ø Add Fk to the set of F
Step 32: End IF
Step 33: Loop
Step 34: Add 1 to I
Step 35: Loop
Step 36: Set 1 to I
Step 37: Do steps (38 – 53) until I > F
Step 38: Set 1 to J
Step 39: Do steps (40 – 51) until J > F
Step 40: If I ≠ J and ∃ Fi, Fj ∈ FR go to step (41) Else, Add 1 to J and go to step (50)
Step 41: If Fi ∩ Fj ≠ Ø do steps (42)-(49) Else, Add 1 to J and go to step (49)
Step 42: Add 1 to K
Step 43: Create new fragment Fk = Fi ∩ Fj and add it to F
Step 44: Create new fragment Fk+1 = Fi – Fk and add it to F
Step 45: Create new fragment Fk+2 = Fj – Fk and add it to F
Step 46: Delete Fi
Step 47: Delete Fj
Step 48: Set F + 1 to J
Step 49: End IF
Step 50: End IF
Step 51: Loop
Step 52: Add 1 to I
Step 53: Loop
Step 54: Add 1 to R
Step 55: Loop
Listing 2. Clustering algorithm.
Input:
CC(Si, Sj): Matrix of communication cost between sites
CR: Clustering Range
NS: Number of sites in the distributed database system network
Output:
CSM: Clusters Set Matrix
Step 1: Set 1 to i
Step 2: Do steps (3 – 12) until i > NS
Step 3: Set 1 to j
  Set 0 to k
  Set 0 to Sum
  Set 0 to Average
  Set 0 to clusters matrix CM
Step 4: Do steps (5 – 10) until j > NS
Step 5: If i ≠ j AND CC(Si, Sj) <= CR, go to step (6) Else, go to step (7)
Step 6: Set 1 to the CM(Si, Sj) and CM(Sj, Si) in the clusters matrix
  Add CC(Si, Sj) to sum
  Add 1 to k
  Go to step 8
Step 7: Set 0 to the CM(Si, Sj) and CM(Sj, Si) in the clusters matrix
Step 8: End IF
Step 9: Add 1 to j
Step 10: Loop
Step 11: Average = Sum/k
  Average(i) = Average
  Add 1 to i
Step 12: Loop
Step 13: Set 1 to m
Step 14: Do steps (15 – 36) until m > NS
Step 15: Set 1 to q
  Set 0 to Minaverage
  Set 0 to Minrow
Step 16: Do steps (17 – 20) until q > NS or Minaverage > 0
Step 17: If Average(q) > 0 Then Minaverage = Average(q) Else Go to Step 18
Step 18: End If
Step 19: Add 1 to q
Step 20: Loop
Step 21: If Minaverage = 0 Then Set site number to a new cluster Else Go to Step 22
Step 22: End If
Step 23: Set 1 to p
Step 24: Do steps (25 – 28) until p > NS
Step 25: If Average(p) > 0 AND Average(p) < Minaverage Then
  Minaverage = Average(p)
  Minrow = p
Step 26: End IF
Step 27: Add 1 to p
Step 28: Loop
Step 29: Set 1 to a
Step 30: Do steps (31 – 34) until a > NS
Step 31: If CM(Sminrow, Sa) = 1 Then
  Set 1 to CSM(Sminrow, Sa)
  CM(Sminrow, Sa) = 0
Step 32: End IF
Step 33: Add 1 to a
Step 34: Loop
Step 35: Add 1 to m
Step 36: Loop
Step 37: Stop
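The idea behind Listing 2 can be sketched compactly in Python. This is a simplified greedy variant written for illustration, not a line-by-line port of the listing: it seeds each cluster with the unassigned site that has the least average in-range communication cost, so members are guaranteed to be within range of the seed but not necessarily of each other. The function name `cluster_sites`, the zero-based site indices and the cost matrix are assumptions.

```python
from typing import List


def cluster_sites(cc: List[List[float]], ccr: float) -> List[List[int]]:
    """Greedily group site indices around seeds with least average cost."""
    n = len(cc)
    # In-range neighbours of every site (pairs with cdv = 1 in Equation (1)).
    neighbours = [
        [j for j in range(n) if j != i and cc[i][j] <= ccr] for i in range(n)
    ]
    # Average communication cost from each site to its in-range neighbours.
    average = [
        sum(cc[i][j] for j in nb) / len(nb) if nb else 0.0
        for i, nb in enumerate(neighbours)
    ]
    unassigned = set(range(n))
    clusters: List[List[int]] = []
    while unassigned:
        # Sites that still have at least one unassigned in-range neighbour.
        candidates = [
            s for s in unassigned if any(j in unassigned for j in neighbours[s])
        ]
        if not candidates:
            # Every remaining site forms a cluster of its own.
            clusters.extend([s] for s in sorted(unassigned))
            break
        seed = min(candidates, key=lambda s: (average[s], s))
        members = [seed] + [j for j in neighbours[seed] if j in unassigned]
        clusters.append(sorted(members))
        unassigned -= set(members)
    return sorted(clusters)


if __name__ == "__main__":
    # Hypothetical symmetric cost matrix (ms/byte) for five sites.
    costs = [
        [0, 9, 9, 9, 9],
        [9, 0, 2, 8, 3],
        [9, 2, 0, 8, 2],
        [9, 8, 8, 0, 8],
        [9, 3, 2, 8, 0],
    ]
    print(cluster_sites(costs, ccr=5))
```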
Clustering Example
To demonstrate the applicability of clustering, the clustering algorithm is simulated on given communication costs between 12 sites with a clustering range CR of 5 (CR can be any communication cost that is a multiple of 0.125 ms/byte). The resulting cluster set matrix is shown in Table 1.
From the cluster set matrix, it is clear that only S1 can be grouped in the first cluster, because no other site matches the cluster range with S1. The sites S2, S5, S6, S7, and S8 are grouped in the second cluster because the communication costs between them match the cluster range. Since the communication costs between S3, S9 and S10 match the cluster range and the other sites are far from them, only these three sites are grouped in the third cluster. In the same way, the fourth cluster is constructed from sites S4, S11, and S12 only. Table 2 displays the generated clusters and their respective sites.
The communication costs within and between clusters have to be taken into consideration when computing the fragment allocation. The optimal way to find the communication cost between clusters, and between sites within each cluster, is to find the shortest path between the clusters/sites and compute the communication cost along it. However, this approach is considered to be NP-complete; therefore, the cluster average communication cost, with symmetric communication costs between clusters, is used instead. It is suitable for computing fragment allocation in many distributed database environments where similar computers with similar communication costs between sites are used.
Table 1. Cluster set matrix.
Site # S1 S2 S3 S4 S5 S6 S7 S8 S9 S10 S11 S12
S1 1 0 0 0 0 0 0 0 0 0 0 0
S2 0 1 0 0 1 1 1 1 0 0 0 0
S3 0 0 1 0 0 0 0 0 1 1 0 0
S4 0 0 0 1 0 0 0 0 0 0 1 1
S5 0 1 0 0 1 1 1 1 0 0 0 0
S6 0 1 0 0 1 1 1 1 0 0 0 0
S7 0 1 0 0 1 1 1 1 0 0 0 0
S8 0 1 0 0 1 1 1 1 0 0 0 0
S9 0 0 1 0 0 0 0 0 1 1 0 0
S10 0 0 1 0 0 0 0 0 1 1 0 0
S11 0 0 0 1 0 0 0 0 0 0 1 1
S12 0 0 0 1 0 0 0 0 0 0 1 1
Table 2. The generated clusters and their sites.
Cluster # S1 S2 S3 S4 S5 S6 S7 S8 S9 S10 S11 S12
C1 ✓ – – – – – – – – – – –
C2 – ✓ – – ✓ ✓ ✓ ✓ – – – –
C3 – – ✓ – – – – – ✓ ✓ – –
C4 – – – ✓ – – – – – – ✓ ✓
5. Fragments Allocation
The data allocation and replication technique places and distributes the database fragments on the clusters and their respective sites. Initially, fragments are allocated to the clusters that have transactions using those fragments. Allocating fragments to a small number of clusters instead of a large number of sites reduces the number of communications and therefore enhances system performance. Figure 2 illustrates the structure of the data allocation and replication technique.
A heuristic fragment allocation and replication technique is introduced to allocate fragments in the distributed database system. Initially, every fragment is a candidate for allocation to every cluster that has transactions using it at its sites. If the fragment shows a positive allocation decision value (i.e., an allocation benefit greater than or equal to zero) for a specific cluster, the fragment becomes a candidate for allocation at each site in this cluster; otherwise the fragment is not allocated to (is cancelled from) this cluster. This step is repeated for each cluster in the distributed database system.
As a result of the previous step, a fragment that shows a positive allocation decision value at a cluster is a candidate for allocation at all sites of this cluster. If the fragment also shows a positive allocation decision value at a site of that cluster, the fragment is allocated at this site; otherwise, it is not allocated to this site. This step is repeated for each site in the cluster.
To ensure data availability in the distributed database system, each fragment should be allocated to at least one cluster and one site. If a fragment shows a negative allocation decision value at all clusters, it is allocated to the cluster that has the least average communication cost and then to the site that has the least communication cost in this cluster.
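The two-level decision described above (clusters first, then sites, with a fallback that guarantees availability) can be sketched as follows. This is an illustrative sketch under assumptions: the function name, the dictionary-based inputs and all numeric values are hypothetical, and the fallback is also applied when a cluster qualifies but none of its sites does, a case the text does not spell out.

```python
from typing import Dict, List


def allocate_fragment(
    cluster_benefit: Dict[str, float],
    site_benefit: Dict[str, Dict[str, float]],
    cluster_avg_cost: Dict[str, float],
    site_cost: Dict[str, Dict[str, float]],
) -> List[str]:
    """Return the sites that receive one fragment."""
    sites: List[str] = []
    for cluster, benefit in cluster_benefit.items():
        # Cluster level: keep the fragment only where the benefit is >= 0.
        if benefit >= 0:
            # Site level: allocate at each site whose own benefit is >= 0.
            sites += [s for s, b in site_benefit[cluster].items() if b >= 0]
    if not sites:
        # Availability fallback: least average-cost cluster, then its
        # least-cost site (assumption: also used if no site qualified).
        best_cluster = min(cluster_avg_cost, key=cluster_avg_cost.get)
        costs = site_cost[best_cluster]
        sites = [min(costs, key=costs.get)]
    return sites


if __name__ == "__main__":
    avg_cost = {"C1": 4.0, "C2": 2.0}
    per_site_cost = {"C1": {"S1": 1.0}, "C2": {"S2": 3.0, "S5": 1.5}}
    per_site_benefit = {"C1": {"S1": 3.0}, "C2": {"S2": 1.0, "S5": -0.5}}
    print(allocate_fragment({"C1": -1.0, "C2": 2.0}, per_site_benefit,
                            avg_cost, per_site_cost))
    print(allocate_fragment({"C1": -1.0, "C2": -2.0}, per_site_benefit,
                            avg_cost, per_site_cost))
```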
5.1. Allocation Cost Functions
The allocation cost functions identify the allocation status, which is computed as a logical value by comparing the cost of remotely accessing the fragment from the cluster/site with the cost of allocating the fragment to the cluster/site. If the cost of remote access is greater than or equal to the cost of allocation, the allocation status is positive and the fragment is allocated to the cluster/site. If the cost of remote access is less than the cost of allocation, the allocation status is negative and the fragment is cancelled from the cluster/site.
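The allocation status is a single comparison, shown here as a small Python sketch. The function name and the cost values are hypothetical; how the two costs themselves are obtained is the subject of the cost functions in this section.

```python
def allocation_status(remote_access_cost: float, allocation_cost: float) -> bool:
    """Return True (positive status) when remote access costs at least as
    much as allocating the fragment to the cluster/site."""
    return remote_access_cost >= allocation_cost


if __name__ == "__main__":
    # Hypothetical costs in ms.
    print(allocation_status(12.0, 9.5))  # remote access is dearer: allocate
    print(allocation_status(3.0, 9.5))   # allocation is dearer: cancel
```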
5.1.1. Cost of Allocating Fragments
The cost of allocating the fragment Fi issued by the transaction Tk to the cluster Cj, denoted CA(Tk, Fi, Cj), is defined in terms of the following costs:
• Cost of Local Retrievals issued by the transaction Tk to the fragment Fi at cluster Cj. It is computed as the average cost of local retrievals over all sites (m) at cluster Cj multiplied by the average frequency of local retrievals issued by the transaction Tk to the fragment Fi over all sites at cluster Cj.
$$CLR(T_k, F_i, C_j) = \frac{\sum_{q=1}^{m} CLR(T_k, F_i, C_j, S_q)}{m} \ast \frac{\sum_{q=1}^{m} FREQLR(T_k, F_i, C_j, S_q)}{m} \tag{2}$$
• Cost of Local Updates issued by the transaction Tk to the fragment Fi at cluster Cj. This cost is computed as the average cost of local updates over all sites (m) at cluster Cj multiplied by the average frequency of local updates.
$$CLU(T_k, F_i, C_j) = \frac{\sum_{q=1}^{m} CLU(T_k, F_i, C_j, S_q)}{m} \ast \frac{\sum_{q=1}^{m} FREQLU(T_k, F_i, C_j, S_q)}{m} \tag{3}$$
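Equations (2) and (3) share one form: an average per-site cost multiplied by an average per-site frequency over the m sites of a cluster. A minimal Python sketch of that shared form follows; the function name and the sample figures are assumptions.

```python
from typing import Sequence


def cluster_cost(site_costs: Sequence[float], site_freqs: Sequence[float]) -> float:
    """Average per-site cost times average per-site frequency over m sites,
    the common form of Equations (2) and (3)."""
    m = len(site_costs)
    if m == 0 or len(site_freqs) != m:
        raise ValueError("need one cost and one frequency per site, m >= 1")
    return (sum(site_costs) / m) * (sum(site_freqs) / m)


if __name__ == "__main__":
    # Hypothetical cluster with m = 2 sites: retrieval costs of 2.0 and 4.0
    # and retrieval frequencies of 10 and 30.
    print(cluster_cost([2.0, 4.0], [10, 30]))
```

With these inputs the average cost is 3.0 and the average frequency is 20.0, so the result is 60.0. The same function covers local updates when given the update costs and update frequencies.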
Figure 2. Data allocation and replication technique.
• Cost