AWS Broke the Internet Again or, Better, a Typo

An AI-defined infrastructure can help to avoid service disruptions

Amazon Web Services (AWS) broke the Internet again or, better, a typo did. On February 28, 2017, an Amazon S3 service disruption in US-EAST-1, AWS' oldest region, took down several major websites and services, including Slack, Trello, Quora, Business Insider, Coursera and Time Inc. Other users reported that they could not control devices connected via the Internet of Things because IFTTT was down as well. Disruptions of this kind are becoming more and more business critical for today's digital economy. To prevent these situations, cloud users should always consider the shared responsibility model in the public cloud. However, there are also ways in which Artificial Intelligence (AI) can help. This article describes how an AI-defined Infrastructure, or an AI-powered IT management system, can help to avoid service disruptions of public cloud providers.

Amazon S3 Service Disruption - What Happened
After every service disruption, AWS publishes a summary of what went on during the incident. This is what happened on the morning of February 28:

"The Amazon Simple Storage Service (S3) team was debugging an issue causing the S3 billing system to progress more slowly than expected. At 9:37AM PST, an authorized S3 team member using an established playbook executed a command which was intended to remove a small number of servers for one of the S3 subsystems that is used by the S3 billing process. Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended. The servers that were inadvertently removed supported two other S3 subsystems.  One of these subsystems, the index subsystem, manages the metadata and location information of all S3 objects in the region. This subsystem is necessary to serve all GET, LIST, PUT, and DELETE requests. The second subsystem, the placement subsystem, manages allocation of new storage and requires the index subsystem to be functioning properly to correctly operate. The placement subsystem is used during PUT requests to allocate storage for new objects. Removing a significant portion of the capacity caused each of these systems to require a full restart. While these subsystems were being restarted, S3 was unable to service requests. Other AWS services in the US-EAST-1 Region that rely on S3 for storage, including the S3 console, Amazon Elastic Compute Cloud (EC2) new instance launches, Amazon Elastic Block Store (EBS) volumes (when data was needed from a S3 snapshot), and AWS Lambda were also impacted while the S3 APIs were unavailable."

Read more under "Summary of the Amazon S3 Service Disruption in the Northern Virginia (US-EAST-1) Region".
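
The failure mode in the summary above, a single mistyped input that removes far more servers than intended, is the kind of error a pre-flight check can catch. The following is a minimal sketch of such a guard, not AWS's actual tooling: the `Fleet` type, the `plan_removal` function, the 10 percent cap and the minimum-capacity floor are all hypothetical names and values chosen for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass


class UnsafeRemovalError(Exception):
    """Raised when a removal request looks like an operator mistake."""


@dataclass(frozen=True)
class Fleet:
    name: str
    servers: tuple[str, ...]
    min_servers: int  # hypothetical floor the subsystem needs to keep serving


def plan_removal(fleet: Fleet, requested: list[str], max_fraction: float = 0.1) -> list[str]:
    """Return the servers that may be removed, or raise if the request is unsafe."""
    targets = sorted(set(requested))

    # Reject names that do not belong to this fleet (a likely typo).
    unknown = [s for s in targets if s not in fleet.servers]
    if unknown:
        raise UnsafeRemovalError(f"not part of {fleet.name}: {unknown}")

    # Cap the share of the fleet that one command may remove (assumed policy).
    if len(targets) > max_fraction * len(fleet.servers):
        raise UnsafeRemovalError(
            f"refusing to remove {len(targets)} of {len(fleet.servers)} servers at once"
        )

    # Never drop below the capacity floor.
    remaining = len(fleet.servers) - len(targets)
    if remaining < fleet.min_servers:
        raise UnsafeRemovalError(
            f"{remaining} servers would remain, minimum is {fleet.min_servers}"
        )

    return targets
```

With a fleet of 20 servers and the default cap, a request for two servers passes, while a request for ten is rejected before anything is touched. The design choice is to validate the whole request up front instead of removing servers one by one and noticing the problem halfway through.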

Bottom line: a typo crashed the AWS-powered Internet! AWS outages already have a long history, and the more customers run their web infrastructure on the cloud giant, the more issues end customers will experience in the future. According to SimilarTech, Amazon S3 alone is already used by 152,123 websites and 124,577 unique domains.

However, following the philosophy of "Everything fails all the time" (Werner Vogels, CTO Amazon.com) means that if you are using AWS, you must "Design for Failure". This is something that Netflix, the cloud role model and video-on-demand provider, does to perfection. To this end, Netflix has developed its Simian Army, an open source toolset that everyone can use to run a highly available cloud infrastructure on AWS.

Netflix "simply" uses the two levels of redundancy AWS offers: multiple regions and multiple availability zones (AZs). Using multiple regions is the master class of AWS usage. It is very complex and sophisticated, since you must build and manage entirely separate infrastructure environments within AWS' worldwide distributed cloud infrastructure. Multiple AZs are the preferred and "easiest" way to achieve high availability (HA) on AWS. In this case, the infrastructure is built within more than one data center (AZ): a single-region HA architecture is deployed in at least two AZs, and a load balancer in front of it controls the data traffic.
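
The multi-AZ pattern described above can be sketched in a few lines. This is a toy model under stated assumptions, not an AWS API: `Zone` and `LoadBalancer` are hypothetical classes, health is a simple flag instead of a real health check, and routing is plain round-robin.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Zone:
    """One availability zone hosting a copy of the application (hypothetical model)."""
    name: str
    healthy: bool = True

    def handle(self, request: str) -> str:
        if not self.healthy:
            raise ConnectionError(f"{self.name} is unavailable")
        return f"{request} served by {self.name}"


class LoadBalancer:
    """Round-robin balancer that skips zones marked unhealthy."""

    def __init__(self, zones: List[Zone]) -> None:
        self.zones = list(zones)
        self._next = 0

    def route(self, request: str) -> str:
        # Try each zone at most once, starting where the last request left off.
        for _ in range(len(self.zones)):
            zone = self.zones[self._next]
            self._next = (self._next + 1) % len(self.zones)
            if zone.healthy:
                return zone.handle(request)
        raise RuntimeError("no healthy availability zone left")
```

As long as one zone stays healthy, `route` keeps answering; only when every zone is down does the caller see an error. Deliberately setting `healthy = False` on a zone in a test is, in miniature, the "Design for Failure" exercise: the failure is provoked on purpose to confirm that traffic moves elsewhere. A disruption that affects an entire region is outside what this single-region sketch covers; that case corresponds to the multi-region level of redundancy.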

However, even if "typos" shouldn't happen, the recent incident shows that human error is still the biggest issue in running IT systems. In addition, you can blame AWS only to a certain extent, since the public cloud is about shared responsibility.

Shared Responsibility in the Public Cloud
An important detail of the public cloud is self-service. Depending on their DNA, providers take responsibility only for specific areas. The customer is responsible for the rest. In the public cloud, it is about sharing responsibilities; this model is called Shared Responsibility. The provider and its customers divide the duties among themselves, and the customer's own responsibility plays a major role. In the context of IaaS utilization, the provider is responsible for the operations and security of the physical environment. It takes care of:

  • Set up and maintenance of the entire data center infrastructure.
  • Deployment of compute power, storage, network and managed services (like databases) and other micro services.
  • Provisioning the virtualization layer that customers use to request virtual resources at any time.
  • Deployment of services and tools customers can use to manage their areas of responsibility.

The customer is responsible for the operations and security of the logical environment. This includes:

  • Set up of the virtual infrastructure.
  • Installation of operating systems.
  • Configuration of networks and firewall settings.
  • Operations of own applications and self-developed (micro) services.

Thus, the customer is responsible for the operations and security of their own infrastructure environment and the systems, applications, services, as well as stored data on top of it. However, providers like Amazon Web Services or Microsoft Azure provide comprehensive tools and services that customers can use, e.g., to encrypt their data and to enforce identity and access controls. In addition, enablement services (micro services) exist that customers can adopt to develop their own applications more quickly and easily.

As a result, customers are on their own in their area of responsibility and must take ownership of it. However, this part of the shared responsibility can be handled by an AI-defined IT management system or an AI-defined Infrastructure.

An AI-defined Infrastructure can help to avoid Service Disruptions
An AI-defined Infrastructure can help to avoid service disruptions in the public cloud. The basis of this kind of infrastructure is a General AI that combines three major human abilities, which enable enterprises to tackle IT and business process challenges.

  • Understanding: By creating a semantic data map, the General AI understands the world of the company in which its IT and business exist.
  • Learning: By creating Knowledge Items, the General AI learns best practices and reasoning from experts. Knowledge is taught in atomic pieces of information (Knowledge Items) that represent separate steps of a process.
  • Solving: With machine reasoning, problems are solved in ambiguous and changing environments. The General AI dynamically reacts to the ever-changing context, selecting the best course of action. Based on machine learning, the results are optimized through experiments.

To put this into the context of an AWS service disruption:

  • Understanding: The General AI creates a semantic map of the AWS environment as part of the world in which the company exists.
  • Learning: IT experts create Knowledge Items while they are configuring and working with AWS, and the General AI learns best practices from these items. Thus, the experts teach the General AI contextual knowledge that includes what, when, where and why something needs to be done - for example when a specific AWS service is not responding.
  • Solving: The General AI dynamically reacts to incidents based on the learned knowledge. Thus, the AI (probably) knows what to do at this very moment - even if no high availability setup was considered from the beginning.

Frankly speaking, everything described above is no magic. Like every newborn organism, an AI-defined Infrastructure needs to be trained, but afterwards it can work autonomously, detecting anomalies and service disruptions in the public cloud and resolving them. For this, you need the knowledge of experts who have a deep understanding of AWS and how the cloud works in general. These experts need to teach the General AI their contextual knowledge, which includes not only what, when and where but also why. They have to teach the AI in atomic pieces (Knowledge Items, KIs) that can be indexed and prioritized by the AI. Context and indexing enable these KIs to be combined to form many solutions.
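
To make the idea of atomic, prioritized Knowledge Items more concrete, here is a minimal sketch of how such items could be represented and chained. It is an illustration only and not Arago's implementation: the `KnowledgeItem` fields, the `resolve` loop, the context keys and the two example items are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Context = Dict[str, object]


@dataclass(frozen=True)
class KnowledgeItem:
    """One atomic step: when it applies (context) and what it does (action)."""
    name: str
    priority: int
    applies: Callable[[Context], bool]
    act: Callable[[Context], Context]


def resolve(context: Context, items: List[KnowledgeItem], max_steps: int = 10) -> Tuple[Context, List[str]]:
    """Repeatedly apply the highest-priority item that matches the current context."""
    trace: List[str] = []
    for _ in range(max_steps):
        candidates = [ki for ki in items if ki.applies(context)]
        if not candidates:
            break  # nothing left to do in this context
        best = max(candidates, key=lambda ki: ki.priority)
        context = best.act(context)
        trace.append(best.name)
    return context, trace


# Hypothetical items an expert might teach for "a storage service is not responding".
EXAMPLE_ITEMS = [
    KnowledgeItem(
        name="switch reads to replica",
        priority=2,
        applies=lambda c: not c["s3_ok"] and c["reads_from"] == "us-east-1",
        act=lambda c: {**c, "reads_from": "replica-region"},
    ),
    KnowledgeItem(
        name="notify on-call",
        priority=1,
        applies=lambda c: not c["s3_ok"] and not c["notified"],
        act=lambda c: {**c, "notified": True},
    ),
]
```

Starting from a context such as `{"s3_ok": False, "reads_from": "us-east-1", "notified": False}`, the loop first switches reads to the replica and then notifies the on-call engineer, because each item changes the context and thereby determines which item applies next. No item encodes the full procedure; the solution emerges from combining independent steps.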

KIs created by various IT experts create pooled expertise that is further optimized by machine selection of best knowledge combinations for problem resolution. This type of collaborative learning improves process time task by task. However, the number of possible permutations grows exponentially with added knowledge. Connected to a knowledge core, the General AI continuously optimizes performance by eliminating unnecessary steps and even changing routes based on other contextual learning. And the bigger the semantic graph and knowledge core gets, the better and more dynamically the infrastructure can act in terms of service disruptions.
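
How quickly the space of possible combinations grows can be shown with a short calculation. The sketch below assumes that a "solution" is an ordered sequence of distinct KIs, which is one possible reading of the paragraph above; the function name is hypothetical, and the result is an upper bound, since context rules out most sequences in practice.

```python
from math import perm


def ordered_sequences(n: int) -> int:
    """Count ordered sequences of 1..n distinct items drawn from n Knowledge Items."""
    # perm(n, k) is the number of ways to arrange k items out of n.
    return sum(perm(n, k) for k in range(1, n + 1))
```

For 3, 5 and 10 KIs this yields 15, 325 and 9,864,100 candidate sequences respectively, which is why machine selection of the best combinations matters more as the knowledge pool grows.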

On a final note, do not underestimate the "power of we"! Our research at Arago revealed that, with an overlap of 33 percent in basic knowledge, this knowledge can be and is used outside a specific organizational environment, i.e. across different client environments. The reuse of knowledge within a client is up to 80 percent. Thus, exchanging basic knowledge within a community becomes imperative from an efficiency perspective and improves the abilities of the General AI.

More Stories By Rene Buest

Rene Buest is Director Market Research & Technology Evangelism at Arago. Prior to that, he was Senior Analyst and Cloud Practice Lead at Crisp Research, Principal Analyst at New Age Disruption and a member of the worldwide Gigaom Research Analyst Network. During this time, Rene was considered a top cloud computing analyst in Germany and one of the top analysts worldwide in this area. In addition, he was one of the world's top cloud computing influencers and is among the top 100 cloud computing experts on Twitter and Google+. Since the mid-90s, he has focused on the strategic use of information technology in businesses and the IT impact on our society as well as disruptive technologies.

Rene Buest is the author of numerous professional technology articles. He regularly writes for well-known IT publications like Computerwoche, CIO Magazin, LANline as well as Silicon.de and is cited in German and international media, including New York Times, Forbes Magazin, Handelsblatt, Frankfurter Allgemeine Zeitung, Wirtschaftswoche, Computerwoche, CIO, Manager Magazin and Harvard Business Manager. Furthermore, Rene Buest is a speaker and participant in expert panels. He is the founder of CloudUser.de and writes about cloud computing, IT infrastructure, technologies, management and strategies. He holds a diploma in computer engineering from the Hochschule Bremen (Dipl.-Informatiker (FH)) as well as an M.Sc. in IT-Management and Information Systems from the FHDW Paderborn.
