Cleaning Up Big Data

Wednesday, February 22, 2017 @ 01:02 PM gHale


Big Data can be such a messy proposition that end users just do not trust the information gathered from all the systems.

Malfunctioning computers, data entry errors and other hard-to-spot problems can skew datasets and mislead people trying to draw conclusions from raw data.


Vizier, a software tool under development by a University at Buffalo-led research team, aims to catch those errors.

The project, backed by a $2.7 million National Science Foundation grant, launched in January. Like Excel and other spreadsheet software, Vizier allows users to interactively work with datasets. For example, it will help people explore, clean, curate and visualize data in meaningful ways, as well as spot errors and offer solutions.

But unlike spreadsheet software, Vizier is built for much larger datasets; it can examine millions or billions of data points, as opposed to the hundreds or thousands typically plugged into a spreadsheet.

“We are creating a tool that’ll let you work with the data you have, and also unobtrusively make helpful observations like ‘Hmm… have you noticed that two out of a million records make a 10 percent difference in this average?’” said Oliver Kennedy, PhD, assistant professor of computer science and engineering at UB, and the grant’s principal investigator.
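The arithmetic behind Kennedy's example can be sketched in a few lines of Python. This is only an illustration of how two records out of a million can move an average by 10 percent; it is not Vizier's code or method, and the function name `mean_shift_without_extremes`, the choice of "distance from the median" as the test, and the synthetic data are all assumptions made for this example.

```python
import heapq
from statistics import fmean, median


def mean_shift_without_extremes(values, k=2):
    """Hypothetical helper: drop the k records furthest from the median and
    report how much the mean changes.

    Returns (full_mean, trimmed_mean, relative_change), where relative_change
    is measured against the trimmed mean (assumed to be nonzero).
    """
    center = median(values)
    # Indices of the k records that lie furthest from the median.
    extreme = set(
        heapq.nlargest(k, range(len(values)), key=lambda i: abs(values[i] - center))
    )
    kept = [v for i, v in enumerate(values) if i not in extreme]
    full_mean = fmean(values)
    trimmed_mean = fmean(kept)
    relative_change = (full_mean - trimmed_mean) / trimmed_mean
    return full_mean, trimmed_mean, relative_change


# Synthetic data, invented for illustration: 999,998 plausible readings of
# 100.0 plus two data entry errors.
records = [100.0] * 999_998 + [5_000_000.0, 5_000_200.0]

full, trimmed, change = mean_shift_without_extremes(records, k=2)
print(f"mean with all records:   {full}")
print(f"mean without 2 extremes: {trimmed}")
print(f"relative change:         {change:.0%}")
```

With this made-up data, the two bad records lift the average from 100 to 110. A real tool would need a more careful notion of which records count as suspicious; distance from the median is just the simplest stand-in.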

Co-principal investigators include Juliana Freire, professor of computer science and engineering at New York University, and Boris Glavic, assistant professor in the Department of Computer Science at the Illinois Institute of Technology. The award is from NSF’s Data Infrastructure Building Blocks (DIBBs) program.

For years, companies like Google, Microsoft and Apple have used Big Data to improve their products and services. That same power is now spreading to the masses as government agencies in the United States and elsewhere publish massive amounts of public data on the Internet.

For example, New York City and the federal government both operate open data portals that let anyone with an Internet connection download information and ask questions about their government. When properly used, these portals can shed light on issues relating to health code violations, discrimination, bias and other matters, Kennedy said.

Vizier will be released as free, open-source software.

“We want to make it easier for data scientists — and eventually data hobbyists — to discover and communicate not only what the data says, but why the data says that,” Kennedy said.


