[Professor – Name]
[Month Day, Year]
Question 1: NIST Cloud Definition
NIST SP800-145 lists five essential characteristics of cloud computing.
(a) Argue Amazon Web Services (AWS) meets these characteristics. Address each characteristic, and discuss an AWS service feature that suggests it is met.
Answer
NIST SP800-145 names five essential characteristics of cloud computing: “on-demand self-service, broad network access, resource pooling, rapid elasticity or expansion, and measured service” (Pedro Hernandez, 2014). These sit alongside three service models (software, platform, and infrastructure) and four deployment models, which together categorize the ways cloud services are delivered. Here we discuss whether AWS meets these characteristics. The definition is intended to serve as a means for broad comparison of cloud services and deployment strategies, and to provide a baseline for discussion, from what AWS is to how best to use cloud computing. “When agencies or companies use this definition, they have a tool to determine the extent to which the information technology implementations they are considering meet the cloud characteristics and models. This is important because by adopting an authentic cloud, they are more likely to reap the promised benefits of cloud: cost savings, energy savings, rapid deployment and customer empowerment. And matching an implementation to the cloud definition can assist in evaluating the security properties of the cloud” (Pedro Hernandez, 2014).
(b) A medium-sized, geographically-localized agency has just removed all its workers’ desktops and replaced these with thin clients, backed by a virtual desktop infrastructure (VDI). All workers have a common work schedule (i.e., 0800-1700). Argue this does not meet NIST’s definition in terms of essential characteristics, noting which characteristics this deployment of VDI likely has, and which it likely does not.
Answer
As discussed above, the cloud model is composed of five essential characteristics: on-demand self-service, broad network access, resource pooling, rapid elasticity, and measured service.
From a developer perspective, cloud computing lets you flexibly deliver integrated content, applications, and services to any device, anywhere, anytime, in a seamlessly scalable model, using and paying for only the resources you need, when you need them. In the given scenario, the VDI deployment does not meet two characteristics, on-demand self-service and measured service, while broad network access, resource pooling, and rapid elasticity are met. From an IT perspective, cloud computing lets organizations free themselves from having to procure and allocate expensive hardware, software, and networking resources, or employ large teams to manage and support infrastructure. From a business perspective, cloud computing allows vendors to serve companies of every size, especially SMBs.
References: Pedro Hernandez. (2014). Backblaze Debuts Open Source Storage Pod 4.0. http://www.infostor.com/disk-arrays/raid/backblaze-debuts-open-source-storage-pod-4.0.html
Question 2: Metadata
“Separating data from metadata is an essential abstraction that facilitates
scalable, replicated storage.”
Defend this view, giving examples from systems that we’ve studied. Don’t just recognize and repeat this pattern as found in other systems; actually explain how it makes the design simpler (i.e., assume it wasn’t true and explain the complexity that would need to be added to the system).
Assume the separation did not hold, so that metadata were embedded in the data itself. The complexity that would then need to be added to the system includes:
• Maintaining and synchronizing embedded data with an external metadata source.
• Are tools available to easily run and validate batch updates?
• Different metadata standards, and separate metadata systems with different “levels” of description (collection, series, item, file).
• Multiple sources of metadata to draw from.
• Complete absence of item-level (or file-level) metadata.
• Unavailability of metadata at the time of creation.
• Inability to export metadata from the source.
Workflow problems: how often should updates happen? How do we know to update embedded metadata if the external metadata and the image files (and any embedded metadata they contain) are maintained by separate custodial units?
Can embedded metadata be updated without affecting the integrity of the image file itself?
If image files are migrated to a new file format, can migration of the textual metadata be assured? If images are not migrated to a new file format and the existing file can no longer be opened, then any embedded metadata will probably be inaccessible as well. Metadata is also dropped during derivative production or other maintenance activities.
Reference: Julian Birkinshaw and Jordan Cohen. (2013). Make Time for the Work That Matters. https://hbr.org/2013/09/make-time-for-the-work-that-matters
Question 4: Advanced RAID properties
In RAID, each controller is associated with a group of disks. We can use many controllers, each associated with their own group of disks. “Orthogonal RAID” organizes parity disks in such a way that they are spread across groups associated with orthogonal hardware.
(a) Why does one care about orthogonal RAID? The RAID survey from class discussed two benefits. Explain both.
Answer
RAID devices offer several important advantages as a storage medium:
• An affordable alternative to mass storage
• High throughput and reliability
The key features of a RAID system are:
• A set of disk drives (a disk array) is viewed by the user as one or more logical drives.
• Data can be distributed across the drives.
• Redundancy is added in order to allow for disk failure.
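The redundancy mentioned in the last point can be illustrated with XOR parity, a common RAID technique. This is only a minimal sketch and does not describe any particular controller; the block contents and the helper name `xor_blocks` are made up for illustration.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, column by column."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# Three hypothetical data disks, each holding one 4-byte block.
data = [b"\x01\x02\x03\x04", b"\x10\x20\x30\x40", b"\xaa\xbb\xcc\xdd"]
parity = xor_blocks(data)  # would be stored on a separate parity disk

# Simulate losing disk 1 and rebuilding its block from the survivors plus parity.
recovered = xor_blocks([data[0], data[2], parity])
```

In an orthogonal layout, the disks of one parity group would sit behind different controllers, so the loss of a single controller removes at most one block from each group and the rebuild above still applies.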
(b) I claim that the Backblaze storage pod is not designed to withstand the types of failures motivating orthogonal RAID. In fact, Backblaze’s storage business must not care if an entire storage pod is lost. Do you agree with this statement? Either defend it, or refute it based on facts from the case study.
Storage Pod 4.0 is a big departure from earlier versions. Backblaze switched from a 5-drive backplane design to “individual direct wire SATA/Power connectors” (Paul Krzyzanowski, 2010). Two 40-port SATA 3 HighPoint Rocket 750 cards replace the old setup of three 4-port SATA cards.
Newer, more efficient Intel Core i3-3240 processors (22nm “Ivy Bridge”) take the place of the previous generation’s Core i3 (32nm “Sandy Bridge”) CPUs. The company also cut the number of power supplies from two to a single unit capable of providing sufficient power for all of the system’s hard drives and other components.
References: Paul Krzyzanowski. (2010). A NoSQL massively parallel table. BigTable. https://www.cs.rutgers.edu/~pxk/417/notes/content/bigtable.html
Question 5: MapReduce Workers
Which of the following statements about the MapReduce system are true:
(a) A MapReduce job will complete even if one of the worker nodes fails.
o True o False Explain:
True: A Hadoop cluster contains a single master and several worker (slave) nodes. The master node runs the “JobTracker, TaskTracker, NameNode and DataNode” (Jason Cooper, 2009). A worker node acts as both a DataNode and a TaskTracker, although it is possible to have data-only and compute-only worker nodes. If a single worker node fails, the master detects the failure and reschedules that node’s tasks on another worker, so the MapReduce job still completes.
(b) A MapReduce job will complete even if the master node fails.
o True o False Explain:
False: For effective scheduling, every Hadoop-compatible file system should provide location awareness: the name of the rack (more precisely, of the network switch) where a worker node is. This bookkeeping, together with the assignment and progress of every task, lives on the master, so if the master node fails the MapReduce job will not complete.
(c) Map workers send their results directly to reduce workers.
o True o False Explain:
True: Our research indicates that knowledge workers spend a great deal of their time, an average of 41%, on discretionary activities that offer little personal satisfaction and could be handled competently by others. So why do they keep doing them? Because ridding oneself of work is easier said than done. We instinctively cling to tasks that make us feel busy and thus important, while our bosses, constantly striving to do more with less, pile on as many responsibilities as we’re willing to accept.
(d) If there are M map tasks, using more than M map workers may improve MapReduce performance.
o True o False Explain:
True: When the user initiates a MapReduce job, a single master is created; it is responsible for assigning tasks to workers and keeping track of their progress. The mappers take the split input and generate intermediate output to feed the reducers. The reducers write the final results to one or more files. In the case of a worker failure, the master simply reschedules that part of the work. However, the whole job is aborted if the master fails, and the user can then start it again. Finally, the paper mentions several useful refinements, including locality optimization, skipping bad records, and redundant execution. Redundant execution is what makes extra workers useful: workers beyond the first M can run backup copies of slow map tasks, and the job finishes as soon as either copy completes.
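The rescheduling behaviour described above can be sketched as a toy master loop. This is a simplified illustration, not the scheduler of Hadoop or of Google’s MapReduce; the function name `run_map_phase`, the worker names, and the failure probability are all assumptions made for the example.

```python
import random

def run_map_phase(splits, workers, fail_prob=0.3, seed=0):
    """Toy master loop: hand out map tasks and reschedule any that fail."""
    rng = random.Random(seed)
    pending = list(range(len(splits)))   # task ids still to finish
    results = {}                         # task id -> intermediate output
    log = []                             # (task id, worker, succeeded?)
    while pending:
        task = pending.pop(0)
        worker = workers[len(log) % len(workers)]  # round-robin assignment
        if rng.random() < fail_prob:
            # Simulated worker failure: the master re-queues the task.
            log.append((task, worker, False))
            pending.append(task)
            continue
        # The "map" work here is just a word count for the split.
        results[task] = len(splits[task].split())
        log.append((task, worker, True))
    return results, log

splits = ["a b", "c d e", "f"]
results, log = run_map_phase(splits, workers=["w1", "w2"])
```

Whatever failures the random generator produces, every split ends up with exactly one recorded result, which is the property the answer relies on. The sketch keeps all state in the master, so it also shows why losing the master loses the job.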
References: Jason Cooper. (2009). How Entities and Indexes are Stored. Google App Engine. https://cloud.google.com/appengine/articles/storage_breakdown
Question 6: Map Functions
In the MapReduce framework, Map functions are required to be side-effect free, deterministic and idempotent. What are the repercussions if a job has side effects? What are the repercussions if a job is non-deterministic? What benefits of the MR framework are undermined if jobs are non-idempotent?
Answer
If the function has side effects, it can count on being called first on all the elements numbered 0, then on all those numbered 1, and so on. The type of the result sequence is specified by the argument result-type (which must be a subtype of the type sequence), as for the function coerce. In addition, one may specify nil for the result type, meaning that no result sequence is to be produced; in this case the function is invoked only for effect, and map returns nil. This gives an effect similar to that of mapc. The benefits of the MR framework that are undermined if jobs are non-idempotent are:
• A clear articulation of the business activities a firm is willing to engage in and the level of risk it is prepared to assume
• An understanding of all material risks taken by the firm, both at the business unit level and in aggregate
References: UNIBLUE. (2013). Where is the best place to store data?. Tech Articles. http://www.uniblue.com/articles/windows-optimization/where-best-place-store-data/
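To make the question’s concern concrete, the sketch below shows what happens when a map function with a side effect is re-executed after a simulated worker failure. It is an illustration under stated assumptions: `impure_map`, `run_with_reexecution`, and the external `counter` are hypothetical names, and the “failure” is modelled simply by running the task twice and keeping only the second output.

```python
counter = {"records_seen": 0}  # external state, outside the framework's control

def impure_map(record):
    counter["records_seen"] += 1   # side effect on external state
    return (record, 1)

def run_with_reexecution(map_fn, records):
    """Run the task twice: the first attempt's output is lost, the second is kept."""
    for r in records:
        map_fn(r)                            # attempt on a worker that then "fails"
    return [map_fn(r) for r in records]      # re-executed attempt

output = run_with_reexecution(impure_map, ["a", "b"])
```

The kept output contains two pairs, but the external counter has been incremented four times. Transparent re-execution of failed or slow tasks is only safe when repeating a task leaves the outside world unchanged.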
Question 7: CAP
Consider the following hotel reservation systems:
• Cozy Hotels sometimes overbooks their hotels, but someone always answers the phone.
• Awesome Hotels sometimes gives you a busy signal when you call them, but when you can get through they are reliable.
• Pleasant Hotels always books your stay reliably, but you sit on hold for an indefinite period of time before you get a reservation confirmation.
Make up a story about how each system is implemented that would explain the above observations. Describe each hotel reservation strategy in terms of CAP tradeoffs.
Note: Make up whatever story you want, so long as it demonstrates your understanding of CAP and could explain the observations. State how many call centers each has, how many people work in each, etc. Assume only pen-and-pencil book-keeping and telephones for communication, in your stories. Make your explanation as simple as possible; no extra complexity for realism please.
In computing, we are witnessing a strong and growing desire to scale systems out when additional resources (compute, storage, etc.) are needed to complete workloads successfully in a reasonable time frame. This is accomplished by adding more commodity hardware to a system to handle the increased load. As a result of this scaling strategy, an additional penalty of complexity is incurred in the system. This is where the CAP theorem comes into play.
The CAP theorem states that, in a distributed system (a collection of interconnected nodes that share data), you can have only two of the following three guarantees across a write/read pair: consistency, availability, and partition tolerance. One of them must be given up. However, there are fewer real choices here than it might seem.
Consistency – A read is guaranteed to return the most recent write for a given client.
Availability – A non-failing node will return a reasonable response within a reasonable amount of time.
Partition Tolerance – The system will continue to function when network partitions occur.
One thing should be made clear: the assumptions we take for granted when building programs that share memory break down once nodes are spread across space and time. Reference: KODI. (2014). HOW-TO:Backup the video library. http://kodi.wiki/view/HOW-TO:Backup_the_video_library
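The tradeoff in the definitions above can be sketched with two call centers that each keep their own paper ledger for a one-room hotel. This is one possible reading of the hotel stories, not the only one; the class `Ledger`, the two booking functions, and the `link_up` flag (whether the centers can phone each other) are assumptions made for the example.

```python
class Ledger:
    """One call center's paper reservation book (hypothetical model)."""
    def __init__(self, rooms):
        self.rooms = rooms
        self.guests = []

def book_available(local, guest):
    """AP style: always answer, using only the local ledger."""
    if len(local.guests) < local.rooms:
        local.guests.append(guest)
        return "confirmed"
    return "full"

def book_consistent(local, remote, guest, link_up):
    """CP style: confirm only after checking the other center's ledger."""
    if not link_up:
        return "unavailable"   # caller gets a busy signal or stays on hold
    taken = len(local.guests) + len(remote.guests)
    if taken < local.rooms:
        local.guests.append(guest)
        return "confirmed"
    return "full"

# AP during a partition: both centers confirm the same single room.
a, b = Ledger(1), Ledger(1)
ap_answers = [book_available(a, "Ann"), book_available(b, "Bob")]
overbooked = len(a.guests) + len(b.guests) > 1

# CP: no answer while the link is down, and never more than one guest.
c, d = Ledger(1), Ledger(1)
cp_answers = [
    book_consistent(c, d, "Ann", link_up=False),
    book_consistent(c, d, "Ann", link_up=True),
    book_consistent(d, c, "Bob", link_up=True),
]
```

Under this reading, the first function behaves like a hotel that always answers but sometimes overbooks, while the second trades availability for reliable bookings.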
Question 8: BigTable Operations
You are using BigTable to store the following video rental inventory data:
(a) You execute rent(“Beaches:dvd”, 1). Your program crashes somewhere during execution. What are the possible states of the “Beaches:dvd” row in the table? Explain:
Answer
Bigtable supports several features that allow the user to manipulate data in more complex ways. First, Bigtable supports single-row transactions, which can be used to perform “atomic read-modify-write” (Harrison Hoffman, 2007) sequences on data stored under a single row key. Bigtable does not currently support general transactions across row keys, although it provides an interface for batching writes across row keys at the clients. Second, Bigtable allows cells, such as those in the “Beaches:dvd” row, to be used as integer counters. Finally, Bigtable supports the execution of client-supplied scripts in the address spaces of the servers.
(b) You execute rent (“Beaches:vhs”, 2). The program completes. Immediately after the program completes, the power goes out in the data center, and all the machines reboot. When BigTable recovers, what are the possible states of the “Beaches:vhs” row in the table? Explain:
Answer
To survive reboots, power failures, and program crashes, a system has to keep its data on durable media, such as disks and flash memory. Only data stored on durable media will survive a reboot. (A reboot is a wonderful remedy for many temporary crashes.) However, durable media cannot guard against permanent failures. That requires replication, where the system keeps several copies of the data: backups, essentially. If the data is stored multiple times, on several geographically distributed computers, then only a major catastrophe will cause data loss.
Reference: Harrison Hoffman. (2007). Six places to store your files online. Tech Culture. http://www.cnet.com/news/six-places-to-store-your-files-online/
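The single-row read-modify-write behaviour can be sketched as follows. This is a toy in-memory model and not Bigtable’s client API; because the inventory table is not reproduced here, the column names `in_stock` and `rented`, the `Row` class, and the body of `rent` are all hypothetical.

```python
import threading

class Row:
    """A single row with a lock, standing in for a row-level transaction."""
    def __init__(self, cells):
        self.cells = dict(cells)
        self.lock = threading.Lock()

def rent(row, count, crash_before_commit=False):
    """Atomically move `count` copies from in-stock to rented."""
    with row.lock:
        staged = dict(row.cells)        # work on a private copy
        staged["in_stock"] -= count
        staged["rented"] += count
        if crash_before_commit:
            raise RuntimeError("simulated crash")
        row.cells = staged              # single commit point

row = Row({"in_stock": 3, "rented": 0})
try:
    rent(row, 1, crash_before_commit=True)
except RuntimeError:
    pass
after_crash = dict(row.cells)

rent(row, 1)
after_success = dict(row.cells)
```

In this model a crash leaves the row in exactly one of two states, fully before or fully after the update, with no state in which only one of the two cells has changed.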
Question 9: GFS / BigTable Relationship
The CAP properties of BigTable are directly linked to those of GFS, because BigTable utilizes GFS to hold its state at each chunk server. In particular, BigTable operations have pretty strong consistency, because GFS has this property.
(a) Pretend that GFS only had read-your-own-writes consistency (a weaker consistency guarantee). For example, when node 1 appends to a chunk, the view of that chunk by node 2 may not immediately reflect the result of node 1’s operation. Claim: under this scenario, BigTable no longer has the same consistency properties described in its paper. Do you agree with this claim? If so, give examples of scenarios during correct operation or under failure explaining your conclusions. If you disagree, explain your reasoning. Use a clear diagram for your scenarios, if it helps.
GFS underpins Google’s search infrastructure and, like BigTable, is coordinated by a single master; its consistency model is relaxed rather than tunable. BigTable is designed with semi-structured data storage in mind. It is a large map that is indexed by a row key, a column key, and a timestamp. Each value within the map is an array of bytes that is interpreted by the application. Every read or write of data under a single row key is atomic, regardless of how many different columns are read or written in that row.
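The data model just described, a map from (row key, column key, timestamp) to uninterpreted bytes, can be sketched with a plain dictionary. This ignores that the real system is sorted, persistent, and distributed; the helper names `put` and `get_latest` and the column name `inventory:in_stock` are assumptions for illustration.

```python
def put(table, row, column, timestamp, value):
    """Store an uninterpreted byte string under (row, column, timestamp)."""
    table[(row, column, timestamp)] = value

def get_latest(table, row, column):
    """Return the value with the highest timestamp, or None if absent."""
    versions = [(ts, v) for (r, c, ts), v in table.items()
                if r == row and c == column]
    return max(versions, key=lambda tv: tv[0])[1] if versions else None

table = {}
put(table, "Beaches:dvd", "inventory:in_stock", 1, b"3")
put(table, "Beaches:dvd", "inventory:in_stock", 2, b"2")
latest = get_latest(table, "Beaches:dvd", "inventory:in_stock")
missing = get_latest(table, "Beaches:vhs", "inventory:in_stock")
```

Older versions remain in the map under their own timestamps, so a reader that sees a stale view of the underlying storage could return version 1 instead of version 2, which is the kind of anomaly part (a) asks about.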
(b) Again, pretend that GFS only had read-your-own-writes consistency. Propose controls you might add to the design of BigTable (but not GFS) that ensure the resulting system has the consistency properties described in the original BigTable paper (or stronger ones). Explain the full repercussions of your proposal. Use a clear diagram, if it helps.
BigTable is used by a number of Google applications, such as web indexing, MapReduce (which is often used for generating and modifying data stored in BigTable), “Google Maps, Google Book Search, “My Search History”, Google Earth, Blogger.com, Google Code hosting, YouTube, and Gmail” (KP Krishna Kumar, 2014). Google’s reasons for developing its own database include scalability and better control of performance characteristics. Reference: KP Krishna Kumar and G Geethakumari. (2014). Detecting misinformation in online social networks using cognitive psychology. http://www.hcis-journal.com/content/4/1/14
Question 10: BigTable Transactions
From the paper, BigTable does not support general transactions across rows.
Your script-savvy colleague suggests that using a script to chain together a series of individual row transactions using a bash script is as good as multi-row transactions with BigTable. Your design-savvy colleague tells you that this suggestion does not have the same properties as a multi-row transaction.
Who is right? What are the properties of BigTable’s row operations that would impact your conclusion? Explain your answer in terms of what could happen during a multi-row batch operation that makes it transactional or non-transactional.
The script-savvy colleague’s suggestion treats a chain of scripted row operations as equivalent to a multi-row transaction. Bigtable does support client-supplied scripts, written in Sawzall, a language developed at Google for processing data. At present, the Sawzall-based API does not allow client scripts to write back into Bigtable, but it does allow various forms of data transformation, filtering based on arbitrary expressions, and summarization via a variety of operators.
Reference: Google.org. (2014). Explore flu trends – United States. National. https://www.google.org/flutrends/us/#US
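The difference between chained single-row operations and a multi-row transaction can be sketched as below. The table is a plain dictionary, each `apply_row_update` call is assumed to be atomic on its own row, and the names `chained_batch` and `crash_after` are hypothetical.

```python
def apply_row_update(table, row, delta):
    """One single-row operation, assumed atomic on its own."""
    table[row] += delta

def chained_batch(table, updates, crash_after=None):
    """Apply single-row updates one after another, as a shell script would."""
    for i, (row, delta) in enumerate(updates):
        if crash_after is not None and i == crash_after:
            raise RuntimeError("script died mid-batch")
        apply_row_update(table, row, delta)

table = {"Beaches:dvd": 5, "Beaches:vhs": 0}
try:
    # Move one unit from the dvd row to the vhs row, crashing between the two steps.
    chained_batch(table, [("Beaches:dvd", -1), ("Beaches:vhs", +1)], crash_after=1)
except RuntimeError:
    pass
state_after_crash = dict(table)
```

After the simulated crash the first row has changed and the second has not, and nothing rolls the first change back. Other readers can also observe the table between the two steps, so the batch offers neither atomicity nor isolation across rows.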
Question 11: Google Infrastructure
Google issues a statement tomorrow claiming it cannot delete your data quickly, and that it might take more than a month to purge the data you’ve marked for deletion to really disappear. Using your knowledge of BigTable, GFS, and general information management processes in an enterprise, explain why this statement is likely true. Mention all the places your data and metadata could be cached, saved, backed up, etc using your understanding of those systems.
The places your data and metadata could be cached, saved, or backed up in those systems are:
• Conventional hard drives make use of a platter to store data magnetically. Hard-drive space is arranged in clusters, each of which can store a certain amount of data.
• Files stored on a hard drive tend to take up more space than is strictly necessary. Data repetition is commonplace within a file (think of how often the same words recur in any given text document), and its footprint can be safely reduced through file compression.
• Drawers have a habit of accumulating junk until it becomes a physical challenge to slide them open.
• Every cloud has a storage lining.
• Cloud storage is an emerging technology that solves many of the logistical problems surrounding data sharing and accessibility.
Reference: Deloitte. (2014). The Benefits of Implementing a Risk Appetite Framework. http://deloitte.wsj.com/riskandcompliance/2015/01/05/the-benefits-of-implementing-a-risk-appetite-framework/
Question 12: Google Flu Trends
Representatives from the CDC report that the current Google Flu model is performing well across the U.S. The head of a U.S.-based relief effort believes that, due to the model’s validation in experiments with the US, relief efforts will be designed around Google’s flu model’s predictions for data collected from queries originating in other countries. The list of target countries include both Mexico and the Philippines, for which no correlating CDC data exist. The relief efforts involve shipping and stockpiling flu treatment in response to early spikes in flu incidents reported by the model. The agency asks you for a recommendation on this policy.
(a) What factors do you believe might impact the soundness of the model in those countries?
The first motivation for the Google Flu model (GFM) is that being able to recognize flu activity early and respond quickly can reduce the impact of seasonal outbreaks. One report states that Google Flu Trends can predict regional outbreaks of flu up to 10 days before they are reported by the CDC.
(b) What risks may be associated with this effort, if the model is incorrect? What small-scale efforts (small, relative to a national relief effort) could be employed to ameliorate possible risks?
Google Flu Trends tries to avoid privacy violations by aggregating millions of anonymous search queries without identifying the individuals who performed the searches. Its search logs contain the IP address of the user, which could be used to trace a query back to the region where it was originally submitted. Google runs programs on computers to access and process the data, so no human is involved in the process. Google also applies a policy of anonymizing IP addresses in its search logs after 9 months. Reference: Daniel Abadi. (2010). Problems with CAP, and Yahoo’s little known NoSQL system. My problems with CAP. http://dbmsmusings.blogspot.com/2010/04/problems-with-cap-and-yahoos-little.html
Question 13: Machine Learning
A 2010 research paper from Yahoo! investigated data mining Twitter, comparing how correct and false information spread across the network. They analyzed tweets during the 2010 earthquake in Chile. They found that the way rumors and misinformation spread (like Ricardo Arjona is dead) was different from tweets about actual incidents that could be considered news or actionable data (small aircraft with six people crashes near Concepción). The research “seeks to contribute towards a deeper understanding of valid news and baseless rumors during a disaster.”
(a) In “The Unreasonable Effectiveness of Data” paper, the authors advocate leveraging “unsupervised learning on unlabeled data, which is so much more plentiful than labeled data.” Do you think the idea of detecting misinformation is an example of Norvig’s suggestion (i.e., using lots and lots of unlabeled data to differentiate misinformation from valid news)? Or does it undermine the suggestion (i.e., expressing that data lacking veracity is unsuitable for use)?
Answer
Detecting misinformation in large volumes of data is a challenging task. Methods that apply “machine learning and Natural Language Processing (NLP)” (Robert Greiner, 2014) techniques exist to automate the process to some extent. However, because of the semantic nature of the content, the accuracy of automated methods is limited and quite often requires manual intervention.
(b) Describe briefly the difference between supervised and unsupervised learning using the idea of “rumors vs. news” in Twitter. What are the ‘labels’ in this scenario?
Answer
Supervised: Labels are known and you try to learn how to predict or model them. Example (rumors): learn to predict, from a limited set of features, whether a tweet is a rumor. In this scenario the labels are “rumor” and “news”.
Unsupervised: Labels are not given and you try to extract general information from your data. Example (news): how many distinct topics are Twitter users posting about?
Reference: Robert Greiner. (2014). CAP Theorem: Revisited. http://robertgreiner.com/2014/08/cap-theorem-revisited/
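The supervised versus unsupervised distinction in part (b) can be sketched on a single made-up feature per tweet. Everything here is hypothetical: the feature (imagined as the share of replies that question the tweet), the numbers, and the function names. The supervised function uses the given labels to learn a cut-off, and the unsupervised one only groups the values into two clusters. The clustering assumes the values are not all identical.

```python
def fit_threshold(features, labels):
    """Supervised: labels ('rumor'/'news') are given; learn a cut-off between class means."""
    rumor = [x for x, y in zip(features, labels) if y == "rumor"]
    news = [x for x, y in zip(features, labels) if y == "news"]
    return (sum(rumor) / len(rumor) + sum(news) / len(news)) / 2

def two_means(features, iterations=10):
    """Unsupervised: no labels; split the values into two clusters (1-D k-means, k=2)."""
    lo, hi = min(features), max(features)
    for _ in range(iterations):
        near_lo = [x for x in features if abs(x - lo) <= abs(x - hi)]
        near_hi = [x for x in features if abs(x - lo) > abs(x - hi)]
        lo = sum(near_lo) / len(near_lo)
        hi = sum(near_hi) / len(near_hi)
    return lo, hi

# Made-up feature values for six tweets.
features = [0.1, 0.2, 0.15, 0.8, 0.9, 0.85]
labels = ["news", "news", "news", "rumor", "rumor", "rumor"]

threshold = fit_threshold(features, labels)   # needs the labels
centers = two_means(features)                 # never sees the labels
```

The clustering finds two groups but cannot say which one is “rumor”; attaching that name requires labels, which is the practical difference between the two settings.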