Investigation of Amazon and Google for Fault Tolerance Strategies in Cloud Computing Services

Shereen Al-raheym
Sinan Can Açan

doi:10.5824/1309-1581.2016.4.001.x

2016 Fall | Vol. 7, No. 25

Amazon ve Google Bulut Bilişim Servislerinin Hata Dayanıklılığı Stratejileri Açısından İncelenmesi

Pages: 7-22

EN Abstract

Cloud computing has recently become an attractive topic because it offers information technology solutions as on-demand services, delivered through virtual machines, for sharing and consuming resources over the Internet. With the rapid development of such services, fault tolerance in the cloud has become a major concern, together with reliability, availability and dependability, which are especially critical for this new service type. This work investigates techniques and means of making cloud services fault tolerant and of keeping cloud customers' systems and enterprises running on the cloud safe from failures. Failures in cloud-enabled services should be expected to occur; hence they should be handled. The essential benefits of implementing fault tolerance strategies are guaranteeing business continuity, avoiding financial loss, recovering systems from failures, and providing disaster recovery. The specific focus is on exploring scenarios for avoiding or recovering from failures through redundancy, checkpointing and replication. Commercial IaaS providers such as Amazon's AWS and Google's GCE are taken as examples of how providers protect their infrastructure from failures, so that a robust, fault-tolerant architecture can be built for a system or enterprise. Finally, general conceptual steps with fault tolerance considerations are proposed.

Keywords

cloud computing, fault tolerance, reliability, availability, dependability, redundancy, checkpoint, replication, AWS, GCE

TR Abstract

Cloud computing has recently become an attractive topic thanks to its ability to share and consume resources by delivering information technology solutions on demand over the Internet through virtual machines. As a result of the rapid development of these services, the need for fault tolerance, together with reliability and availability, which are more critical in cloud computing, is a significant concern for this type of service. This study examines fault tolerance methods and techniques for cloud services, as well as how the systems and/or enterprises of cloud customers can run on the cloud without being affected by failures. Failures should be expected in systems running on the cloud, and they should be remedied when they occur. The main purpose of establishing fault tolerance strategies is to ensure business continuity, avoid financial losses, repair system failures, and recover from disasters. The specific focus of the study is on scenarios for avoiding or recovering from failures using redundancy, checkpointing and replication. Amazon's AWS and Google's GCE, commercial providers of infrastructure as a service (IaaS), are taken as examples because their infrastructures are fault tolerant. In this way, a robust architecture with fault tolerance can be built for a system or enterprise. This study proposes the general conceptual steps required for fault tolerance.

TR Keywords

cloud computing, fault tolerance, reliability, availability, redundancy, checkpoint, replication, AWS, GCE

Article Information

Type: Research Article
