The growing trend of sci-fi movies is not just a symptom of advanced green-screen and animation technologies. Thanks to the big data boom, cinematic fantasies are realities in the making, as the 2013 big data conference NoSQL Now! revealed.
Here is a glimpse: Imagine the “precogs” from Minority Report or the Oracle from The Matrix actually anticipating future crimes or user actions, down to the second. Picture a network database as complex and comprehensive as the Matrix or the Tree of Souls from Avatar, linking all living beings together and automatically deploying labor resources (or animal troops from Avatar) in critical moments of insufficient capacity.
The prequel to the “precogs” and the Oracle is already here. Target can detect, with accuracy, when a customer is pregnant before she or her doctor does, based on her purchase behavior. Women in the first trimester of pregnancy, as early as the initial week, have a heightened sense of smell that triggers not only nausea but also changes in brand loyalty for scented skincare products. Target is not the only prophet. “The best way to tell who can make their flights is whether they preordered a vegetarian meal,” says the founder of Kaggle, Anthony Goldbloom, “the psychology of making that trip personal, by knowing your meal is on it, makes you more likely to get on the plane on time.” Currently, businesses are not racing to implement predictive analytics, but rather to do it better. Leading market share and profits go to those who can do it best.
The complex data analytics toolset ranges from programs written in statistical data-mining languages such as R to virtualized distributed computing in a cloud space similar to the proverbial matrix of The Matrix. The databases have gathered intel from your social posts and reviews, purchase patterns, and even surveillance video feeds to forecast your next move. According to research from Duke University, our unconscious habits, rather than consciously made decisions, influence 45% of our choices. Hollywood calls this process artificial intelligence; the CIA calls it terrorist intel; and businesses call it business intelligence.
Adoption has been heightened by mobile devices, social networking and media, networked devices, and even sensors and surveillance video. Data stores are ballooning by 50% year over year, from terabytes to zettabytes (1 ZB is 1 billion terabytes). By the end of this year, a total of 2.5 ZB will be stored globally, not including the NSA’s new 5 ZB Utah Data Center.
Traditional “Structured Query Language” (SQL) databases, which organize information into neat relational tables (think Excel), just cannot scale for complex, less structured datasets. Over 80% of data today consists of unstructured files, such as SlideShare documents, that do not work well on typical relational databases. For example, LinkedIn’s graph networks, which capture complex first-, second-, and third-degree connectivity, are best stored in databases like Espresso DB or Neo Technology’s Neo4j. Twitter streams or Uber’s taxi-hailing geo data are best stored in column-family databases like Cassandra or HBase. These databases are non-SQL and complement traditional SQL stores.
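To make the connectivity example concrete, here is a minimal sketch of finding first-, second-, and third-degree connections by walking a graph. The member names and the `degrees_of_connection` helper are hypothetical, and a plain Python dictionary stands in for a graph store; this is not how LinkedIn or any particular graph database implements it. In a relational table, each extra degree would typically mean another self-join on the connections table, whereas a graph walk simply follows edges.

```python
from collections import deque

# Hypothetical member network stored as an adjacency list (illustrative names only).
connections = {
    "ana": {"ben", "cho"},
    "ben": {"ana", "dev"},
    "cho": {"ana"},
    "dev": {"ben", "eli"},
    "eli": {"dev"},
}

def degrees_of_connection(graph, start, max_degree=3):
    """Breadth-first search returning {member: degree} for everyone within max_degree hops."""
    degree = {start: 0}
    queue = deque([start])
    while queue:
        person = queue.popleft()
        if degree[person] == max_degree:
            continue  # do not expand beyond the requested degree
        for neighbor in graph.get(person, ()):
            if neighbor not in degree:
                degree[neighbor] = degree[person] + 1
                queue.append(neighbor)
    del degree[start]  # the starting member is not their own connection
    return degree

print(sorted(degrees_of_connection(connections, "ana").items()))
```

Each member is visited once, so the cost grows with the size of the neighborhood being explored rather than with the number of joins.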
It is no secret that NoSQL (Not Only “Structured Query Language”) databases combine the powers of both SQL and non-SQL software to seed the next generation of predictive models. But how does non-SQL compare to SQL?
Non-SQL, or non-relational, databases are still in a rapid growth stage compared with the already mature SQL stores. The advantage of non-SQL databases, thanks to the open-source innovations from Hadoop since 2005, is that they require no upfront investment and have flexible structures, allowing far more data that does not fit into a neat table to be captured.
However, as with any growth-stage technology, there is a lot still to be figured out. Ongoing operating costs are high because of the large throughput demanded by the sheer size of the datasets (up to zettabytes), compared to SQL’s terabytes or even megabytes in some cases. Because of the size, real-time analytics is challenging. Sure, storage and processing can be distributed across multiple servers for parallel processing. But a zettabyte is a billion terabytes, so zettabytes of data would require hundreds of millions of servers if each holds several terabytes at best. These are the problems startups in NoSQL analytics are tackling.
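A back-of-the-envelope check of the scale involved can be sketched as follows. The 4 TB per-server capacity is an assumption chosen for illustration, the `servers_needed` helper is hypothetical, and replication and processing headroom are ignored.

```python
import math

TB_PER_ZB = 10**9  # 1 zettabyte = 1 billion terabytes

def servers_needed(dataset_zb, tb_per_server):
    """Minimum number of servers to hold the dataset, ignoring replication."""
    return math.ceil(dataset_zb * TB_PER_ZB / tb_per_server)

# Assumed figures: the 2.5 ZB global total cited earlier, 4 TB per server (hypothetical).
print(f"{servers_needed(2.5, 4):,} servers")
```

No single organization holds the global total, but the arithmetic shows why distributing storage alone does not make zettabyte-scale analytics cheap.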
Three trends are reinforcing the sentiment of Box CEO Aaron Levie that CTOs are coming to the forefront of businesses today to play a more active role in strategy.
1. Bringing data closer to where the querying and processing are executed. Historically, data has been centralized remotely from its application, which increases latency and dampens accessibility. According to GigaSpaces founder Nati Shalom, “data can be stored in highly concurrent ‘in-memory’ indices and is brought closer to the application of the data to streamline data referencing. Large data sets are partitioned according to its application, making partitions smaller in size…to scale-out and scale-up.”
2. Cloud platforms for non-relational databases are being more widely adopted by enterprises as cloud giants like VMware and AWS (Amazon Web Services) innovate to virtualize non-SQL accessibility and processing. SMBs have smaller datasets and thus have traditionally operated their non-relational data stores in the cloud with Amazon EC2 or its partners, MongoDB and Couchbase. Enterprises have considerably larger datasets which, if put on a non-SQL database, become less stable and cannot be processed in real time.
3. Pre-packaged business intelligence solutions are gaining popularity with the growing trend in predictive analytics among SMBs. Shalom from GigaSpaces remarks that enterprises can afford “highly customized big data systems and would therefore build solutions on top of low level middleware components.” LinkedIn University Pages’ highly customized architecture, with its middleware stacks, is one example.
SMBs lack the resources to build custom systems and often rely on high-level solutions, skipping the middleware. Acunu is one example: it offers a platform for low-latency analytics with embedded dashboards to help monitor and control a NoSQL data environment.
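The first trend above, partitioning data into in-memory indices that sit close to the application, can be sketched roughly as follows. The `PartitionedIndex` class, the key names, and the flight records are all hypothetical, and everything lives in one Python process; this is a toy illustration of the idea, not GigaSpaces’ product or API.

```python
import zlib

class PartitionedIndex:
    """Toy in-memory index whose records are split across partitions by a partition key."""

    def __init__(self, num_partitions):
        self.partitions = [{} for _ in range(num_partitions)]

    def _partition_for(self, partition_key):
        # crc32 is stable across runs, unlike Python's built-in hash() for strings.
        return zlib.crc32(partition_key.encode("utf-8")) % len(self.partitions)

    def put(self, partition_key, record_id, record):
        partition = self.partitions[self._partition_for(partition_key)]
        partition.setdefault(partition_key, {})[record_id] = record

    def get_all(self, partition_key):
        """Look up one key; only the partition that owns the key is consulted."""
        partition = self.partitions[self._partition_for(partition_key)]
        return partition.get(partition_key, {})

# Hypothetical usage: group records by the flight they belong to.
index = PartitionedIndex(num_partitions=4)
index.put("flight-100", "crew", {"status": "ready"})
index.put("flight-100", "gate", {"status": "assigned"})
index.put("flight-200", "crew", {"status": "delayed"})
print(index.get_all("flight-100"))
```

Because every record for a given key lands in the same small partition, a query touches one partition instead of scanning the whole dataset, and partitions could in principle be placed on different machines to scale out.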
According to the Technical Director of MongoDB, Jared Rosoff, NoSQL has helped many businesses scale, such as Foursquare. “Rather than storing check-ins and tags (“has wifi”, “hotspot”) in tables and mapping them to form complex, inefficient interrelationships, MongoDB tags are embedded directly into the document representing a venue.”
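The contrast Rosoff describes can be sketched with plain Python data structures standing in for a relational schema and a document store. The venue name, IDs, and helper functions are hypothetical, and no MongoDB driver is used; the point is only the shape of the data.

```python
# Relational style: tags live in their own table and are mapped to venues
# through a third table, so reading a venue's tags requires a join.
venues = [{"venue_id": 1, "name": "Cafe Example"}]
tags = [{"tag_id": 10, "label": "has wifi"}, {"tag_id": 11, "label": "hotspot"}]
venue_tags = [{"venue_id": 1, "tag_id": 10}, {"venue_id": 1, "tag_id": 11}]

def tags_via_join(venue_id):
    """Emulate the join: mapping table first, then the tag table."""
    tag_ids = {row["tag_id"] for row in venue_tags if row["venue_id"] == venue_id}
    return sorted(tag["label"] for tag in tags if tag["tag_id"] in tag_ids)

# Document style: the tags are embedded in the single document for the venue.
venue_document = {"_id": 1, "name": "Cafe Example", "tags": ["has wifi", "hotspot"]}

def tags_via_document(document):
    """One read of one document; no mapping table involved."""
    return sorted(document["tags"])

print(tags_via_join(1))
print(tags_via_document(venue_document))
```

Both approaches return the same tags, but the document version retrieves them in a single lookup, which is the efficiency the quote points to.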
Tim Moreton, founder of Acunu, helps companies track particular keywords on Twitter to gauge the real-time sentiment around brands or trending topics. Geo data can also be put to use. “Every time you move your little man on Hailo to hail a cab, Acunu records the demand details [geo, time], personalized customer journeys, and cab availability. This control over supply and demand completely instruments the entire Hailo infrastructure.”
Shalom from GigaSpaces is working with one of the largest US airlines to reduce delays and cancellations. Thanks to the performance boost from in-memory processing, the system now computes 10 times more activities than the previous relational system. This is automating much of the flight control based on weather or route changes.
The sequel to the current development of NoSQL analytics has potential that goes far beyond business disruption and enterprise profitability. A revolution is in the works.
If decisions could be automated with real-time analytics of complex data, Asiana flight 214 might not have crashed at SFO this past July. The pilots had 1.5 seconds before impact to make a life-altering decision, one which real-time analytics can make in nanoseconds and one which human intuition cannot make accurately.
Thanks to the scalability of NoSQL databases, the sky (or rather, politics) is the limit. Imagine the real-time traffic GPS app Waze combining its data with a traffic management authority’s to reroute traffic and optimize traffic lights. Airlines are already doing this for air traffic, thanks to GigaSpaces.
What else is possible?
SQL took two decades to mature to its current low operating cost structure. Assuming NoSQL continues to boom with waves of startups innovating in analytics, it is likely we are on the brink of seeing Steven Spielberg’s A.I. or I, Robot become real within two decades.
The paradox raised by Neo from The Matrix asks whether we have a choice. Given that the “big data” oracle already knows what we will do next, it naturally pushes products and facilitates that decision for us. Maybe we have less freedom than we think and The Matrix has already come to life.