Of Dark Data, Beware You Must
April 4, 2013
Big data there is. To master it you must learn, but of dark data, beware you must.
A Data Padawan, on his quest to become a Data Jedi, many dangers he will encounter. As big data slips from the peak of inflated expectations and into the trough of disillusionment at intergalactic speed, temptations to stray beyond the limits of the Trade Federation abound. Dark data that beyond these limits resides, if properly mastered, incredible opportunities for Data Jedis will create, for the Force to unleash and for their organization’s bottom line to levitate.
Dark data is usually defined as data that is kept “just in case” but hasn’t (so far) found a proper use, or data that can be harvested and leveraged beyond its primary (intended) use.
Examples abound, and include:
– Measurements collected by the hundreds of sensors built into a car (or the Millennium Falcon). These measurements are handy for the mechanic (or for Chewbacca) when the car/spacecraft is in the shop. But the manufacturer can also use them to diagnose patterns of failure, optimize performance, or even perform preventive maintenance.
– Access logs from facility doors (or from the shield of the Death Star). Beyond their primary use (to prevent unauthorized access by Rebel vessels), such logs make it possible to analyze visitor flow, optimize elevator traffic, better regulate HVAC, protect from total destruction, etc.
– Unstructured data, such as audio, video, 3D holograms, Death Star blueprints, etc. – stored on servers, in the Cloud or in R2 droids, that can be mined for information beyond the intended message they mean to convey.
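To make the access-log example a little more concrete, here is a minimal Python sketch of turning door logs into a visitor-flow picture. The log format (timestamp, door, badge, comma-separated), the sample lines and the function name `visits_per_hour` are all assumptions made for illustration; a real door system will have its own format.

```python
from collections import Counter
from datetime import datetime

# Hypothetical access-log lines in the form "timestamp,door_id,badge_id".
# Both the format and the values are made up for this illustration.
SAMPLE_LOG = """\
2013-04-04T08:02:11,lobby,B-1138
2013-04-04T08:15:40,lobby,B-2187
2013-04-04T08:47:03,hangar,B-1138
2013-04-04T12:01:59,lobby,B-0421
2013-04-04T12:30:12,lobby,B-2187
"""


def visits_per_hour(log_text, door=None):
    """Count door openings per hour of day, optionally for a single door."""
    counts = Counter()
    for line in log_text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        timestamp, door_id, _badge = line.split(",")
        if door is not None and door_id != door:
            continue  # keep only the requested door
        counts[datetime.fromisoformat(timestamp).hour] += 1
    return dict(counts)


print(visits_per_hour(SAMPLE_LOG))                # all doors
print(visits_per_hour(SAMPLE_LOG, door="lobby"))  # lobby only
```

The same hourly counts that were never needed to keep Rebels out could feed an elevator schedule or an HVAC timer, which is the secondary use the list above describes.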
The first challenge faced by the Data Padawan is to identify which data is available, and where. By definition, dark data is data that was not meant to be used in that particular way. It’s usually not stored in databases or systems managed by IT, and rarely inventoried in the enterprise’s metadata catalog (when such a catalog exists). Rather, logs are often kept as files stored on disk or in memory inside the system itself, or in an embedded database. Another obstacle is dark data collection. Connectivity to the systems can be difficult because of protocols, security and permissions, firewalls, or simply a lack of APIs.
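As a rough illustration of that first inventory step, the Python sketch below walks a directory tree and lists files that look like candidate dark data. The set of file extensions, the function name `inventory` and the sample files are assumptions chosen for the example; it only covers files reachable on a local disk, not data held in memory or behind a firewall.

```python
import tempfile
from pathlib import Path

# Assumed extensions for "candidate dark data"; adjust to the systems at hand.
DARK_SUFFIXES = {".log", ".csv", ".db"}


def inventory(root, suffixes=DARK_SUFFIXES):
    """List candidate dark-data files under `root` with their size in bytes."""
    found = []
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and path.suffix.lower() in suffixes:
            found.append({
                "path": path.relative_to(root).as_posix(),
                "bytes": path.stat().st_size,
            })
    return found


# Small demonstration on a throwaway directory with made-up files.
with tempfile.TemporaryDirectory() as demo_root:
    (Path(demo_root) / "shield").mkdir()
    (Path(demo_root) / "shield" / "access.log").write_text("entry\n")
    (Path(demo_root) / "readme.txt").write_text("not dark data\n")
    for entry in inventory(demo_root):
        print(entry["path"], entry["bytes"])
```

Even a list this crude is a starting point for the metadata catalog that, as noted above, often does not exist.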
The next step in the Data Padawan’s apprenticeship is to process this dark data, and to produce value – the kind of value that develops the Force of the organization. Thankfully, many tools and technologies are available. Hadoop and NoSQL databases, data integration and data quality tools generating native MapReduce code, and optimized SQL query systems for Hadoop such as Hive/Stinger or Impala all make the life of the Data Padawan easier. Because frankly, while a lightsaber may come in handy for slicing and dicing data, it is a bit crude for detailed analysis…
There remains one major obstacle on this quest: the dark data island. A dark data system is not, cannot be, an isolated system. Dark data must be used in conjunction with the rest of the information system. Dark data applications must be connected to, and must exchange data with, other databases, applications, analytical platforms, etc. Only then will dark data embrace the Force, and forgo its Dark Side. To become simply data.
And only then, a Data Jedi the Padawan will become.
May the Force of data be with you.
Master Yves de Montcheuil is a Data Jedi and the Vice President of Marketing at Talend, the recognized leader in open source integration. Yves holds a master’s degree in electrical engineering and computer science and has 20 years of experience in software product management, product marketing and corporate marketing. He is also a presenter, author, blogger, social media enthusiast, Star Wars fan, and can be followed on Twitter: @ydemontcheuil.