Reactions to story from DBMS2--Database management and analytic technologies in a changing world

Reactions / posts that link to this post

  • baint

    What is MapReduce?

    http://blog.tonybain.com/tony_bain/2008/09/what-is-mapreduce...
    67 days ago in Tony Bain · Authority: 9

    There has been a lot of recent talk about MapReduce, particularly in relation to its addition to several of the specialized data warehouse platforms.  In this post I will try to answer the question of what MapReduce is and how it relates to an RDBMS (at a high level).

    MapReduce

    Put simply, MapReduce is a framework that allows developers to write functions that process data.  There are two key types of function in the MapReduce framework: the Map function, which separates out the data to be processed, and the Reduce function, which performs analysis on that data.  MapReduce is a logical concept; it is not a technology owned by any one vendor, but it was made popular by Google, as it is part of their core search engine technology.

    A very simple, commonly used example of MapReduce is a process that counts the occurrences of a particular word in a document.  The “Map” function in this example would produce a set of data containing all occurrences of the desired word from the source data; the “Reduce” function would then count the number of items produced by the Map function and return a numeric value indicating the total number of word occurrences.

    While MapReduce seems very simple in its structure, it has been found that this type of two-stage processing can be used to answer a large number of varied data analysis questions.

    MapReduce Scalability

    The real benefits of MapReduce start to occur when the framework for the execution of its functions is implemented in a large-scale, shared-nothing data cluster.  The platform that implements a MapReduce framework can abstract the complexity of running distributed data processing functions across multiple nodes in the cluster.
    This allows a developer with no specific knowledge of distributed programming to create their own MapReduce functions and have the platform run those functions in parallel across multiple nodes in the cluster, and then also handle the gathering of the results from across the cluster to return a single result or result set.  The platform can also abstract important cluster functions such as dealing with nodes that fail during execution, or nodes that are slow to respond during execution.

    Again, Google is a common example of a large-scale implementation of a MapReduce framework.  Google is said to run clusters of 10,000+ shared-nothing nodes with petabytes of data, yet a developer can reportedly write a query that performs analysis across data in these clusters within a relatively short space of time (e.g. 30 mins), as they are concerned purely with the data analysis details, not the physical execution details.

    MapReduce and RDBMS

    MapReduce’s integration with traditional RDBMSs is now occurring, with some initial implementations being undertaken by specialist data warehouse platform vendors such as Greenplum.

    Initially the benefits of MapReduce within an RDBMS are less clear, as a distributed shared-nothing RDBMS already has a mechanism for abstracting the execution of requests across nodes from the developer, through the use of SQL and the underlying query processor.  There is a lot of debate in both academic and commercial fields as to whether the inclusion of MapReduce in an RDBMS is a positive measure or not.

    For SQL and the query processor to be effective, the data has to be structured and have a pre-existing, well-defined schema (tables, columns etc).  Once this is done, optimization techniques such as indexing can be implemented, making SQL an optimal method for accessing and processing that data.  However, when the data is unstructured (i.e. without a predefined schema), the ability to process that data using SQL becomes a lot more limited.
    MapReduce, on the other hand, assumes no predefined schema, which allows developers to create functions that make their own assumptions about the schema of the data they are accessing.  It is common for an RDBMS to store unstructured data in the form of a BLOB (binary large object); examples include documents, images, XML files and binary files.  A MapReduce function can be written to perform data processing across that unstructured data, for example counting words in a document or finding nodes in XML files.

    Much of the debate is centered on whether the data itself should be unstructured or not, i.e. with or without schema.  If the data does not have a predefined schema (unstructured), then processing on that data cannot be validated by an external mechanism, such as the RDBMS.  Therefore it is up to the individual functions that make use of that data to verify its integrity.  Also, without a predefined schema, optimization mechanisms such as indexes cannot be created, which means the MapReduce function uses a “brute-force” approach and scans all relevant data across all relevant nodes during the execution of the Map function.

    However, when the data is structured into traditional database tables and columns, the benefit of MapReduce over traditional SQL for processing that data is less clear.  Some of the discussion in this area focuses on the fact that MapReduce functions can be written in native programming languages (Perl, Java etc), which are more familiar to developers than the SQL language.  This argument seems a little weak: due to the universal implementation of SQL across all RDBMS platforms, any experienced developer will have considerable SQL experience.

    Summary

    MapReduce appears to be a simple framework for data analysis across shared-nothing clusters.  The clear benefit of MapReduce is when that analysis is taking place across unstructured data.
    When structured data is being analyzed, the use of SQL and the query processor seems preferable, as this makes use of many optimization techniques, such as indexing and join processing.  And finally, much of the debate against the use of MapReduce is less focused on MapReduce itself and more focused on whether the underlying data should in fact be structured into tables/columns within a traditional RDBMS.

    References

    http://en.wikipedia.org/wiki/MapReduce
    http://www.databasecolumn.com/2008/01/mapreduce-a-major-step-back.html
    http://www.dbms2.com/2008/08/26/why-mapreduce-matters-to-sql-data-warehousing/
    http://www.greenplum.com/resources/mapreduce/
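
The word-count example described in the post can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code of any particular MapReduce platform: the function names (map_occurrences, reduce_count, count_word) are hypothetical, a thread pool stands in for the cluster nodes, and the sample chunks are made up for the demonstration.

```python
import re
from concurrent.futures import ThreadPoolExecutor


def map_occurrences(chunk, word):
    """Map: emit one (word, 1) record per occurrence of `word` in a chunk."""
    tokens = re.findall(r"[a-z0-9']+", chunk.lower())
    return [(word, 1) for token in tokens if token == word]


def reduce_count(records):
    """Reduce: count the records produced by the map step."""
    return sum(count for _, count in records)


def count_word(chunks, word, workers=4):
    """Run the map step over every chunk concurrently, then reduce once.

    The thread pool is a stand-in (an assumption of this sketch) for the
    nodes of a shared-nothing cluster.
    """
    word = word.lower()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        mapped = pool.map(lambda chunk: map_occurrences(chunk, word), chunks)
        # Gather the per-chunk results into a single list of records.
        records = [record for part in mapped for record in part]
    return reduce_count(records)


# Hypothetical input: a document split into three chunks.
chunks = [
    "MapReduce is a framework.",
    "The map step feeds the reduce step.",
    "Google made MapReduce popular.",
]
total = count_word(chunks, "mapreduce")
print(total)
```

The caller only writes the two functions; how the chunks are distributed and how the results are gathered is hidden inside count_word, which is the kind of abstraction the post attributes to the platform.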

  • ZDNetBlogger

    Databases leverage MapReduce technology to radically juice data scale, performance, analytics

    http://blogs.zdnet.com/Gardner/?p=2718

    In what could best be termed a photo finish, Greenplum and Aster Data Systems have both announced that they have integrated MapReduce into their massively parallel processing (MPP) database engines. MapReduce, pioneered by Google for analyzing the Web, now becomes available to enterprises and service providers, giving them more access and visibility into more data from more origins. Originally created to analyze massive amounts of unstructured data, the approach has been updated to analyze structured data as well.

    Greenplum, San Mateo, Calif., says that MapReduce will be part of its Greenplum Database beginning in September. Aster Data, Redwood Shores, Calif., says that MapReduce will be included in its Aster nCluster. [Disclosure: Greenplum is a sponsor of BriefingsDirect podcasts.]

    Curt Monash, president of Monash Research, editor of DBMS2, and a leading authority on MapReduce, sees this as a major leap forward. He reports that both companies had completed adding MapReduce to their existing products and had been racing to the finish line to get their news out first. As it turned out, both made their announcements within hours of each other. Curt lists some points on his blog about what this new technology marriage means.

    Google’s internal use of MapReduce is impressive. So is Hadoop’s success. Now commercial implementations of MapReduce are getting their shots too.

    The hardest part of data analysis is often the recognition of entities or semantic equivalences. The rest is arithmetic, Boolean logic, sorting, and so forth. MapReduce is already proven in use cases encompassing all of those areas.

    MapReduce isn’t needed for tabular data management. That’s been efficiently parallelized in other ways. But, if you want to build non-tabular structures such as text indexes or graphs, MapReduce turns out to be a big help.

    In principle, any alphanumeric data at all can be stuffed into tables. But in high-dimensional scenarios, those tables are super-sparse.
    That’s when MapReduce can offer big advantages by bypassing relational databases. Examples of such scenarios are found in CRM and relationship analytics.

    Greenplum customers have been involved in an early-access program using Greenplum MapReduce for advanced analytics. For example, LinkedIn is using Greenplum Database for new, innovative social networking features such as “People You May Know” and sees it as a way to develop compelling analytics products faster. A primary benefit of the new capability is that customers can combine SQL queries and MapReduce programs into unified tasks that are executed in parallel across hundreds or thousands of cores.

    Part of the appeal of business intelligence (BI), and of its huge ramp-up over the past five years, is that IT assets play an ever larger role in providing unprecedented strategic guidance and insights to leaders of enterprises, governments, telcos and cloud providers. IT has gone from automating business functions to providing an essential crystal-ball service of the highest order. By consequently gaining access to larger data sets that, more than ever before, can be mined and analyzed for higher levels of process and business refinement, IT has become a member of the board.

    With better data reach and inclusion come better results. So BI allows leaders to establish early the trends that will determine their future success or failure. In a fast-paced, global, hypercompetitive business landscape, these insights are the currency of success for the future. The better you do BI, the better you do business: current, near-term and long-term. There’s no better way to know your customers, competitors, employees and the variables that buffet and stir markets than effective BI.

    Now, by expanding the role and reach of MapReduce technologies and methods, a powerful new tool is added to the BI arsenal.
    More data, more data types, more data sources: all rolled into an analytical framework that can be directly targeted by developers, scripters, business analysts, executives, and investors.

    These new MapReduce announcements mark a significant advancement that raises IT another notch in its utility and indispensability to business. And it comes at a time when more data, metadata, complex events, transactions and Internet-scale inferences demand tools that can do for enterprise BI what Google has done for Web search and indexing.

    Being comprehensive and deep with analytics on massive data sets offers a new mantra: The database is dead, long live the data. Structured data and the containers that hold it are simply not enough to organize and access the intelligence lurking on modern networks, at Internet scale and in Internet time.