IBM researcher Kevin Nowka talks about the big, big data

Dr. Kevin Nowka at the AT&T Center at the Austin Forum.

Dr. Kevin Nowka is cute. He’s a little nervous to leave his laptop in the AT&T conference room just to go out for a photo shoot. But when he stands in front of those pretty red flowers and starts smiling into the sun for the Austin Forum photogs, he looks so cute I want to hand off all of my personal data to IBM.

Dr. Nowka, a grad of Stanford University and the director of IBM Research, Austin, specializes in high-performance and low-power circuits, processor design, and technology. He works with teams of scientists studying system models, creating faster and ever more efficient VLSI circuits (very-large-scale integrated circuits; see Wikipedia for more: http://en.wikipedia.org/wiki/Very-large-scale_integration). (The short version: they are packing thousands of transistors on a single chip. Go, IBM. Make my phone smarter. Or my Nest thermostat. And other intelligent whatnot.)

A VLSI integrated-circuit die

Big data, big opportunities

The dilemma of big data: we can capture it, but who will put it to effective use? Dr. Nowka discussed the new tech twists that will put the tools for data management into play.

So…big problems cause big data. But, to solve big problems, we need big data. They are interrelated.

Nowka listed some examples of big data, big problems, and big opportunities:

Highway congestion: underbuilt, congested urban roadways cost the U.S. roughly 5.5 billion hours of delay and 2.9 billion gallons of wasted fuel a year. (Statistics from Texas Transportation Institute.)
The U.S. could save $130 billion annually by deploying smart-grid technology to electrical delivery systems.
The goal of big data analysis is to draw value from data that has variety, velocity, volume, and veracity. The same approach applies to law enforcement, traffic control, telecom, manufacturing, and more.
Gross waste of resources in government systems could be addressed by clever applications of tech to big data, going after fraud, and reducing waste.

The volume of digital data is expected to double every two years. That goes for you, for me, for the US, for the Library of Congress, etc. Just think how much data you personally store; you are probably creating increasing amounts of personal data with no end in sight. By 2017, the total digital data will surpass the number of stars in the observable universe.
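To get a feel for what "doubling every two years" means, here is a minimal Python sketch of the arithmetic. The function name and the starting value of 1.0 are my own placeholders, not figures from the talk.

```python
def projected_volume(initial, years, doubling_period=2.0):
    """Return the data volume after `years`, assuming it doubles
    once every `doubling_period` years (the rule of thumb above)."""
    return initial * 2 ** (years / doubling_period)


# Growth factor over a decade: five doublings, so 2**5 times the start.
growth_after_decade = projected_volume(1.0, 10)
print(growth_after_decade)
```

In other words, whatever you are storing today, this rule of thumb has you storing 32 times as much in ten years.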

And the more access people have, the more data they create. About two-thirds of the world still does not have access to the Internet, so we can expect our data creation to grow exponentially as more of the world gets connected.

There were 5.9 trillion text messages sent in 2011. That represents five times more data than the voice data sent via phones. (Phone fact: there are more than 6.3 billion mobile phones out there.)

A picture of the IBM RAMAC disk storage from the 1950s. We can now store a thousand times more data on the average memory stick. (http://en.wikipedia.org/wiki/IBM_305_RAMAC)

Social interactions as well as mobile communications create almost unimaginable amounts of data. And the type of data is changing: 80 percent of the data being created is now unstructured. (Structured data is data in a relational database. Unstructured is…everything else.) And data is connecting to other data, as refrigerators hook up in a horrifying and obscene way with phones, toasters, tablets, and, ultimately, the HAL computer from 2001. (I added that last part, not Dr. Nowka.)

Something to think about the next time you take a ride on a plane: it takes a billion lines of code to run the software that runs an airplane. Each engine on a plane generates 10 TB every 30 minutes.

Also, 70 percent of data is multimedia. Don’t just think of images from your phone. More than a billion medical images were generated in 2012.

Velocity: data is in motion, coming at us at gigabit speed. It can be managed in “real-time” models and used to predict. We can take action based on what the data tells us. Homeland security requires 50 billion records a day; 320 terabytes of deep analysis.

A scary reality: one in three business leaders polled said they were making business decisions without a clear understanding of what their company data is indicating.

So, how do you make sense of unstructured text data? Since computers got us into this giant data situation, perhaps we can use them to help us make sense of it.

Currently, a tiny percentage of potentially useful data is tagged, and even less is analyzed. This makes me think of crowd-sourced data tagging, such as crowd corrections of facts in Wikipedia, Google Maps, Waze, and a hundred more such loose but effective collaborations.

Tagging data: word-based or topic-based tagging. Machine learning is being used to classify words into topics, which can then be mined to retrieve and analyze the data that is relevant to a specific topic or keyword. Think “bin Laden.” You probably should not say that phrase on your cell phone, in an email, or in a blog post. Whoops! JK. But, seriously. Watch what you say, type, and blog.
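To make the tagging idea concrete, here is a minimal Python sketch of word-based topic tagging. It is not the machine-learning approach Dr. Nowka described: the topic vocabulary is a hand-written, hypothetical stand-in for what a trained classifier would learn, and the function names are mine.

```python
import re
from collections import defaultdict

# Hypothetical topic vocabulary, invented for illustration only.
TOPIC_KEYWORDS = {
    "traffic": {"congestion", "highway", "commute"},
    "energy": {"grid", "electricity", "power"},
}


def tag_topics(text):
    """Return the sorted list of topics whose keywords appear in the text."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return sorted(t for t, keywords in TOPIC_KEYWORDS.items() if words & keywords)


def build_index(docs):
    """Map each topic to the positions of the documents tagged with it."""
    index = defaultdict(list)
    for doc_id, text in enumerate(docs):
        for topic in tag_topics(text):
            index[topic].append(doc_id)
    return dict(index)


posts = [
    "Highway congestion was awful today",
    "The smart grid saves power",
    "Nice flowers",
]
print(build_index(posts))
```

Once the posts are tagged, pulling out everything about one topic is a single lookup in the index, which is the "mining" step in miniature.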

Nowka showed us an IBM application that sifts through Facebook data to find selected topics. He can do the same thing with your tweets, and snag location information, too.

Image data.

Computers are making sense out of it. Consider medical image category recognition software: it combs through millions of images to locate images that correlate to a topic of interest on a specific disease. Consider ImageCLEF 2012: a computer attempt to classify images into categories that yielded about 88% correct image classification. (http://www.imageclef.org/2012 )

The next step is creating natural language access to big data. Watson is an open-domain question answering system that delivers precise answers to questions, with accuracy. IBM Watson finds, reads, scores, and combines information. It searches structured and unstructured data. It finds potential answers and compares the results in a scoring engine to determine the confidence level in the potential answer.
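The score-and-compare step can be pictured with a toy Python sketch. To be clear, this is not how Watson works internally: the averaging rule, the 0.5 threshold, and every name and number here are assumptions I made up to illustrate the idea of ranking candidate answers by confidence.

```python
def best_answer(candidates, threshold=0.5):
    """Pick the candidate with the highest average evidence score.

    `candidates` maps each candidate answer to a list of evidence
    scores between 0 and 1 (a made-up scoring scheme). Returns
    (answer, confidence), or (None, confidence) when even the best
    candidate falls below the threshold.
    """
    scored = {a: sum(s) / len(s) for a, s in candidates.items() if s}
    if not scored:
        return None, 0.0
    answer = max(scored, key=scored.get)
    confidence = scored[answer]
    if confidence < threshold:
        return None, confidence  # not confident enough to answer
    return answer, confidence


print(best_answer({"Austin": [1.0, 0.5], "Dallas": [0.25, 0.25]}))
print(best_answer({"Houston": [0.25, 0.25]}))
```

The second call declines to answer because the only candidate scores below the threshold, so the system reports that it does not know rather than guessing.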

It is important to know when you do not know. (“There are unknown unknowns – the ones we don’t know we don’t know.” – Donald Rumsfeld, U.S. Secretary of Defense at the time.) A system like Watson can help us avoid the “unknown unknowns.”

Dr. Nowka’s vision is big: data analytics taking a variety of high-volume, high-velocity data of all types, and using natural-language-accessible systems such as IBM Watson to mine that data for meaning and substance. There is no shortage of problems to which we can apply analytics.

So, questions. How can big data not become…evil? Nowka says, “Knowledge is power. But those in control of data should be making sure that privacy is protected for those whose data is being processed.”

What’s next? How close can we come to AI mimicking the robustness of human analysis? Nowka speaks of IBM Watson and what it can and cannot do at this time.

What sort of cool places will IBM go as they play with their big data? Currently, IBM is investing in Smarter Planet. IBM tech is going after big city issues, after safety, petroleum, traffic, after health issues. IBM wants to apply Watson to big health centers such as Sloan Kettering. So much more can be found on their website. (http://www.ibm.com/smarterplanet/us/en/?ca=v_smarterplanet ).

Links

Austin Forum event for Big Data http://www.austinforum.org/speakers/nowka.html