Sunday, March 31, 2019
Data storage in Big Data Context: A Survey
A. ELomari, A. MAIZATE*, L. Hassouni
RITM-ESTC / CED-ENSEM, University Hassan II

Abstract- As data volumes to be processed in all domains (scientific, professional, social, etc.) are increasing at a high speed, their management and storage raise more and more challenges. The emergence of highly scalable infrastructures has contributed to the evolution of storage management technologies. However, numerous problems have emerged, such as consistency and availability of data, scalability of environments, and concurrent access to data. The objective of this paper is to review, discuss and compare the main characteristics of some major technological orientations existing on the market, such as Google File System (GFS) and IBM General Parallel File System (GPFS), or among open source systems such as Hadoop Distributed File System (HDFS), Blobseer and Andrew File System (AFS), in order to understand the needs and constraints that led to these orientations. For each case, we will discuss a set of major problems of big data storage management, and how they were addressed in order to provide the best storage services.

Introduction

Today, the amount of data generated during a single day may exceed the amount of information contained in all printed materials all over the world. This quantity far exceeds what scientists imagined just a few decades ago.
International Data Corporation (IDC) estimated that between 2005 and 2020 the digital universe will be multiplied by a factor of 300, growing from 130 exabytes to 40,000 exabytes, the equivalent of more than 5,200 gigabytes for each person in 2020 [i].

Traditional systems, such as centralized network-based storage systems (client-server) or traditional distributed systems such as NFS, are no longer able to respond to new requirements in terms of data volume, high performance, and evolution capacities. Besides their cost, they raise a variety of technical constraints, such as data replication and continuity of service. In this paper, we discuss a set of technologies used in the market that we consider the most relevant and representative of the state of the art in the field of distributed storage systems.

What is a Distributed File System (DFS)

A distributed file system (DFS) is a system that allows multiple users to access, through the network, a file structure residing on one or more remote machines (file servers), using semantics similar to those used to access the local file system. This is a client/server architecture where data is distributed across multiple storage spaces, usually called nodes. These nodes consist of a single or a small number of physical storage disks, usually residing in basic equipment configured to provide only storage services. As such, the hardware can be relatively low cost.

As the hardware used is generally cheap and deployed in large quantities, failures become unavoidable. Nevertheless, these systems are designed to be tolerant to failure through data replication, which makes the loss of one node an event of minimal emergency because data is always recoverable, often automatically, without any performance degradation.

A. Andrew File System (AFS) architecture

AFS (or OpenAFS currently) is a standard distributed file system originally developed by Carnegie Mellon University. It is supported and developed as a product by Transarc Corporation (now IBM Pittsburgh Labs). It offers a client-server architecture for federated file sharing and distribution of replicated read-only content [ii].

AFS offers many improvements over traditional systems. In particular, it provides independence of storage from location, and guarantees system scalability and transparent migration capabilities.

As shown in Figure 1, the distribution of processes in AFS can be summarized as follows:

- A process called Vice is the backbone of information sharing in the system; it consists of a set of dedicated file servers and a complex LAN.
- A process called Venus runs on each client workstation; it mediates access to shared files [iii].

Figure 1 AFS Design

AFS logic assumes the following hypotheses [iv]:

- Shared files are rarely updated, and local user files will remain valid for long periods.
- A large enough local disk cache, for example 100 MB, can hold all of a user's files.

Using the client cache may actually be a good compromise for system performance, but it will only be effective if the assumptions adopted by the AFS designers are respected; otherwise it can cause serious data integrity issues.

B. Google File System (GFS) architecture

Another interesting approach is that proposed by GFS, which does not use a special cache at all.

GFS is a distributed file system developed by Google for its own applications. A GFS cluster consists of a single master and multiple chunkservers (nodes) and is accessed by multiple clients, as shown in Figure 2 [v]. Each of these nodes is typically a Linux machine running a user-level server process.

Figure 2 GFS Design

The files to be stored are divided into pieces of fixed size called chunks. The chunkservers store chunks on local disks as Linux files.
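
To make the idea of fixed-size chunks concrete, here is a minimal sketch in Python of how a byte range of a file could be mapped onto chunks. It is an illustration only and not GFS code: the function name `split_range` and its return format are hypothetical, and the 64 MB default is simply the chunk size this survey reports for GFS.

```python
from typing import List, Tuple

# Assumption: 64 MB, the chunk size the survey cites for GFS and Blobseer.
CHUNK_SIZE = 64 * 1024 * 1024


def split_range(offset: int, length: int,
                chunk_size: int = CHUNK_SIZE) -> List[Tuple[int, int, int]]:
    """Map a byte range of a file onto fixed-size chunks.

    Returns a list of (chunk_index, start_in_chunk, byte_count) tuples,
    one per chunk touched by the range. Hypothetical helper for illustration.
    """
    if offset < 0 or length < 0 or chunk_size <= 0:
        raise ValueError("offset and length must be >= 0, chunk_size > 0")

    pieces = []
    position = offset
    remaining = length
    while remaining > 0:
        chunk_index = position // chunk_size      # which chunk holds this byte
        start_in_chunk = position % chunk_size    # where the byte sits inside it
        # Take bytes up to the end of the current chunk, or fewer if done.
        take = min(remaining, chunk_size - start_in_chunk)
        pieces.append((chunk_index, start_in_chunk, take))
        position += take
        remaining -= take
    return pieces
```

With a toy chunk size of 4 bytes, `split_range(3, 3, 4)` yields one piece at the end of chunk 0 and one at the start of chunk 1. A client would then ask a metadata service which nodes hold those chunk indices and read the pieces from them directly.
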
The master maintains all metadata of the file system. The GFS client code uses an application programming interface (API) to interact with the master for metadata operations, but all communication relating to the data itself goes directly to the chunkservers.

Unlike AFS, neither the client nor the chunkserver uses a dedicated cache. Client caches, according to Google, offer little benefit because most applications use large files, which are too big to be cached. On the other hand, using a single master can lead to a bottleneck. Google has tried to reduce the impact of this weak point by replicating the master in multiple copies called shadows, which can be accessed in read-only mode even if the master is down.

C. Blobseer architecture

Blobseer is a project of the KerData team, INRIA Rennes, Brittany, France [vi]. The Blobseer system consists of distributed processes (Figure 3), which communicate through remote procedure calls (RPC). A physical node can run one or more processes and can play several roles at the same time.

Figure 3 Blobseer Design

Unlike Google GFS, Blobseer does not centralize access to metadata on a single machine, so the risk of a bottleneck at this type of node is eliminated. This feature also allows the workload to be balanced across multiple nodes in parallel.

D. Hadoop Distributed File System (HDFS)

The Hadoop Distributed File System (HDFS) is a component of the Apache Hadoop project [vii]. HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware.

As shown in Figure 4, HDFS stores file system metadata and application data separately. As in other distributed file systems, HDFS stores metadata on a dedicated server, called the NameNode.
Application data are stored on other servers called DataNodes [viii].

Figure 4 HDFS Design

There is one NameNode per cluster, and it makes all decisions regarding replication of blocks [ix].

Data storage as blob

The architecture of a distributed storage system must take into consideration how files are stored on disks. One smart way to do this is to organize the data as objects of considerable size. Such objects, called Binary Large Objects (BLOBs), consist of long sequences of bytes representing unstructured data and can provide the basis for transparent, large-scale data sharing. A BLOB can typically reach sizes of 1 terabyte (TB).

Using BLOBs offers two main advantages:

- Scalability: Maintaining a small set of huge BLOBs, each containing billions of small items, is much easier than directly managing billions of small files. Merely maintaining the mapping between application data and file names can be a big problem compared with the case where the data are stored in the same BLOB and only their offsets must be maintained.
- Transparency: A data management system based on shared BLOBs, uniquely identifiable through IDs, relieves application developers of the burden of explicitly managing and transferring data locations in their code. The system thus offers an intermediate layer that masks the complexity of access to data wherever it is physically stored [x].

Data striping

Data striping is a well-known technique for increasing data access performance. Each BLOB or file is divided into small pieces that are distributed across multiple machines in the storage system. Requests for access to data can thus be distributed over multiple machines in parallel, allowing high performance to be achieved.

Two factors must be considered in order to maximize the benefits of this technique:

- Configurable strategy of distribution of chunks: The distribution strategy specifies where to store the chunks in order to achieve a predefined goal.
For example, load balancing is one of the goals that such a strategy can serve.
- Dynamic configuration of the size of the chunks: If the chunk size is too small, applications have to fetch the data to be processed from several chunks. On the other hand, chunks that are too large complicate simultaneous access to data, because of the increasing probability that two applications require access to two different pieces of data that are stored on the same chunk.

Many systems that use this type of architecture, such as GFS and Blobseer, use 64 MB chunks, which seems to be the best compromise between those two criteria.

Concurrency

Handling concurrency depends heavily on the nature of the desired data processing and on the nature of data changes. For example, the Haystack system, which manages Facebook pictures that never change [xi], differs from Google GFS or IBM General Parallel File System (GPFS), which manage more dynamic data.

Locking is used by many DFS to manage concurrency, and IBM GPFS has developed a more effective mechanism that allows locking a byte range instead of whole files or blocks (byte-range locking) [xii].

GFS, meanwhile, offers a relaxed consistency model that supports Google's highly distributed applications but remains relatively simple to implement.

Blobseer developed a more sophisticated technique, which theoretically gives better results. The snapshot approach using versioning that Blobseer brings is an effective way to meet the main objective of maximizing concurrent access [xiii]. The disadvantage of such a snapshot-based mechanism is that it can easily inflate the required physical storage space.
However, although each write or append generates a new version of the blob snapshot, only the differential updates from previous versions are physically stored.

DFS benchmark

As detailed in this article, there are generally no better or worse technical or technological choices for making the best of a DFS, but rather compromises that have to be managed to meet very specific objectives.

In Table 2, we compare five distributed file systems: GFS, GPFS, HDFS, AFS and Blobseer. The choice to compare only those specific systems, despite the fact that the market includes dozens of technologies, is motivated by two points:

1. It is technically difficult to study all systems on the market in order to know their technical specifications, especially as several of them are proprietary and closed systems. Moreover, the techniques are similar in several cases and are comparable to those of the five we compared.
2. Those five systems give a clear idea of the DFS state of the art, thanks to the following particularities:
   - GFS is a system used internally by Google, which manages huge quantities of data because of its activities.
   - GPFS is a system developed and commercialized by IBM, a global leader in the field of Big Data.
   - HDFS is a subproject of Hadoop, a very popular Big Data system.
   - Blobseer is an open source initiative, particularly driven by research, as it is maintained by INRIA Rennes.
   - AFS is a system that can be considered a bridge between conventional systems such as NFS and advanced distributed storage systems.

In Table 2, we compare the implementation of some key technologies in those five systems.

Analysis of the results of Table 2 leads to the following conclusions:

- The five systems are scalable in data storage. Thus, they cover one of the principal issues that led to the emergence of distributed file systems.
- Only Blobseer and GPFS offer scalable metadata management to overcome the bottleneck problem of the master machine, which manages access to metadata.
- Except AFS, all studied systems are natively tolerant to crashes, relying essentially on multiple replications of data.
- To minimize the slowdown caused by locking the whole file, GPFS manages locks on specific areas of the file (byte-range locks). But the most innovative method is the use of versioning and snapshots by Blobseer to allow simultaneous changes without exclusivity.
- Except AFS, all systems use data striping. As discussed earlier, this technique provides higher input/output performance by striping blocks of data from individual files over multiple machines.
- Blobseer seems to be the only one among the systems studied that implements the storage-as-blobs technique, despite the apparent advantages of such a technique.
- To allow better scalability, a DFS must support as many operating systems as possible. But while AFS, HDFS and GPFS support multiple platforms, GFS and Blobseer run exclusively on Linux. This can be explained partly by the commercial background of AFS, HDFS and GPFS.
- Using a dedicated cache is also a point of disagreement between systems. GFS and Blobseer consider that the cache has no real benefits, but rather causes many consistency issues. AFS and GPFS use a dedicated cache on both client computers and servers. HDFS seems to use a dedicated cache only at the client level.

Conclusion

In this paper, we reviewed some specifications of distributed file storage systems. It is clear from this analysis that the major common concern of such systems is scalability. A DFS should be extendable with minimum cost and effort.

In addition, data availability and fault tolerance remain among the major concerns of DFS. Many systems tend to use inexpensive hardware for storage.
Such conditions expose those systems to frequent breakdowns. To the fault-tolerance mechanisms that address this, data striping and lock mechanisms are added to manage and optimize concurrent access to the data. Also, working on multiple operating systems can bring big advantages to any of those DFS.

None of these systems can be considered the best DFS on the market; rather, each of them is excellent in the scope it was designed for.

Table 2 Comparative table of most important characteristics of distributed file storage

| Characteristic | GFS (Google) | GPFS (IBM) | HDFS | Blobseer | AFS (OpenAFS) |
|---|---|---|---|---|---|
| Data scalability | YES | YES | YES | YES | YES |
| Metadata scalability | NO | YES | NO | YES | NO |
| Fault tolerance | Fast recovery. Chunk replication. Master replication. | Clustering features. Synchronous and asynchronous data replication. | Block replication. Secondary NameNode. | Chunk replication. Metadata replication. | NO |
| Data access concurrency | Optimized for concurrent appends | Distributed byte-range locking | Files have strictly one writer at any time | YES | Byte-range file locking |
| Metadata access concurrency | Master shadows, read-only | Centralized management | NO | YES | NO |
| Snapshots | YES | YES | YES | YES | NO |
| Versioning | YES | unknown | NO | YES | NO |
| Data striping | 64 MB chunks | YES | YES (data blocks of 64 MB) | 64 MB chunks | NO |
| Storage as blobs | NO | NO | NO | YES | NO |
| Supported OS | Linux | AIX, Red Hat, SUSE, Debian Linux distributions, Windows Server 2008 | Linux and Windows supported; BSD, Mac OS X, OpenSolaris known to work | Linux | AIX, Mac OS X, Darwin, HP-UX, Irix, Solaris, Linux, Windows, FreeBSD, NetBSD, OpenBSD |
| Dedicated cache | NO | YES, by AFM technology | YES (client) | NO | YES |

[i] John Gantz and David Reinsel. The Digital Universe in 2020: Big Data, Bigger Digital Shadows, and Biggest Growth in the Far East. Tech. rep., International Data Corporation (IDC), 2012.
[ii] OpenAFS, www.openafs.org/
[iii] Monali Mavani. Comparative Analysis of Andrew Files System and Hadoop Distributed File System, 2013.
[iv] Stefan Leue. Distributed Systems, Fall 2001.
[v] Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung (Google). The Google File System.
[vi] Blobseer, blobseer.gforge.inria.fr/
[vii] Hadoop, hadoop.apache.org/
[viii] Konstantin Shvachko, Hairong Kuang, Sanjay Radia, Robert Chansler (Yahoo!). The Hadoop Distributed File System, 2010.
[ix] Dhruba Borthakur. HDFS Architecture Guide, 2008.
[x] Bogdan Nicolae, Gabriel Antoniu, Luc Bougé, Diana Moise, Alexandra Carpen-Amarie. BlobSeer: Next Generation Data Management for Large Scale Infrastructures, 2010.
[xi] Doug Beaver, Sanjeev Kumar, Harry C. Li, Jason Sobel, Peter Vajgel (Facebook Inc.). Finding a Needle in Haystack: Facebook's Photo Storage.
[xii] Scott Fadden. An Introduction to GPFS Version 3.5: Technologies that enable the management of big data, 2012.
[xiii] Bogdan Nicolae, Diana Moise, Gabriel Antoniu, Luc Bougé, Matthieu Dorier. BlobSeer: Bringing High Throughput under Heavy Concurrency to Hadoop Map-Reduce Applications, 2010.