What solutions are there to free organisations from the suffocating grip of dark data?


Dark data: an octopus with exponential growth

The research firm Gartner helped popularise the term ‘dark data’. In an article dated September 2017, Sony Shetty gives the following definition: ‘the information assets that organisations collect, process and store in the course of their day-to-day work, but then no longer use.’

With the massive digital transformation of society, and the accompanying proliferation of digital exchanges, it is clear that the quantity of dark data has multiplied in recent years. But what exactly is the scale of the phenomenon?

The multifaceted nature of data production and storage within organisations makes the phenomenon hard to measure precisely. Several studies have nevertheless sought to quantify it, and estimate that dark data accounts for more than 50% of the world’s data assets: 52% according to Statista in 2020, and up to 65% according to the Digital Decarb website.

Dark data now accounts for a majority share of the data held in information systems, and that share is still growing. Why? Quite simply because the digitisation of businesses and government departments is not yet complete, and because the multiplication of sources is driving exponential growth in the number of interactions they have, and hence in the volume of data they generate.

By the way, what’s the problem with dark data?

In fact, the proliferation of what is also known as “cold data” poses not one problem, but several: security, financial and – last but not least – environmental. If “dark data” is the part of the data that the organisation that produced it has not been able to classify, then a significant proportion of this grey area very likely contains latent risks, which will surface in the event of an unforeseen exposure: hacking, external audit, internal malicious acts, etc.

These risks themselves are of various kinds, depending on what the data contains:

  • Regulatory risk in the event of non-compliance with an ever-increasing number of standards (the GDPR immediately comes to mind, but almost every economic sector adds its own layers of standards).
  • Competitive risk in the event of sensitive R&D or commercial information.
  • Reputational risk, particularly if the data concerns individuals, etc.

The materialisation of any of these risks inevitably results in a financial loss, directly in the event of a fine for non-compliance with the rules, but also indirectly through a deterioration in competitive position.

It should also be noted that the general attitude towards the responsibility of organisations in the event of a data leak has changed radically over time. Just a few years ago, a company or public authority that suffered an incident leading to data loss was seen as a victim. Today, it is seen as having failed to take adequate precautions, and expectations are growing, not only on the part of authorities such as ANSSI in France, but also on the part of customers and public opinion.

Even in the absence of problems such as data leaks, the mere storage of dark data generates costs for organisations which, while often not identified, are nonetheless considerable. Several layers of cost add up to an overall economic inefficiency: energy consumption of server rooms, construction and maintenance of the corresponding premises, oversized machines, billing for storage space by hosting providers, etc.

A figure? 2 billion euros per month! That’s the estimate made by the IDC analysis institute, for all businesses worldwide.

But in the context of the climate emergency to which all players in society must react, the most unacceptable cost is perhaps the environmental cost of dark data. The mobilisation of considerable resources to keep available data that will never be used is pure waste.

In an article in Computerworld in August 2019, Charlotte Trueman wrote that the electricity consumption of data centres had already exceeded 3% of global consumption (i.e. more than the total consumption of Great Britain!). Unfortunately, electricity is not the only resource being used unnecessarily: in a May 2021 paper published on IOPscience, researchers estimated that data centres are among the top 10 water-consuming industries in the USA, due to their cooling requirements.

And what about France? In its June 2020 report, GreenIT.fr, a group of independent experts, estimates that data centres account for 13% of overall electricity consumption in the digital sector. If we take a conservative estimate of 50% dark data in the data hosted on these servers, we arrive at more than one million tonnes of CO2 emitted ‘for nothing’ every year, or the equivalent of around 1.5 million return journeys from Paris to New York by plane (according to the calculator provided on the government’s civil aviation website).

So what should we do about dark data?

The first prerequisite for solving a problem is to have identified that it exists. Although awareness of this issue was still limited in the world of work a few years ago, it is now widespread, and many organisations have made digital sobriety one of the objectives of their IT architecture and operations. More and more have introduced indicators, and some are now publishing targets for improvement. This is the case of Société Générale, for example, which has committed to reducing its greenhouse gas emissions by 50% between 2019 and 2025. In the public sector, the ANCT recently decided to include data lifecycle management among the levers that local authorities are invited to consider when putting in place their responsible digital roadmap.

Under these conditions, can we afford to be optimistic and predict the short-term disappearance of the (almost) bottomless pit of dark data? In reality, it’s not quite so simple. The first factor of resistance comes from reflexes that are deeply rooted in organisations, at both management and employee level, and which could be called the “you never know” syndrome.

When in doubt, even data that is clearly of no interest for future use is kept and often duplicated as soon as it is produced. This de facto condemns the data to endless wandering in the cloud, as no one will take the time to go looking for it, and a new layer of useless data quickly covers up the previous one.

Behaviour can – and must – change, but that is not enough. The desire to tackle the issue of dark data head-on comes up against a number of difficulties. A study carried out by TRUE Global Intelligence on behalf of software publisher Splunk sought to identify them more precisely. Questioned in early 2019 about what they saw as the main obstacles, a panel of 1,300 IT decision-makers in 7 countries, including France, the US and the UK, produced the following ranking:

  • The amount of data involved: 39% of respondents
  • The lack of necessary skills: 34%
  • The lack of availability of resources: 32%
  • The difficulty of coordinating between departments: 28%

And yet, these managers cannot be accused of lacking motivation, since 77% of them consider that searching for and finding dark data in their organisation should be a top priority.

Even so, there is cause for a degree of optimism about a situation that, for the moment, is only getting worse: solutions are now available to help organisations tackle the problem effectively. The data discovery platforms offered by Blueway and other publishers make it easier to map the content of data assets, and to identify ‘cold data’.
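To make the idea of ‘cold data’ concrete, here is a minimal Python sketch that walks a directory tree and lists files that have not been modified for a given number of days. It is only an illustration, not a description of how Blueway's or any other publisher's platform works: the function names `find_cold_files` and `summarise`, the one-year threshold and the use of the last-modification date as the sole criterion are all assumptions made for the example.

```python
import os
import time
from pathlib import Path


def find_cold_files(root, max_age_days=365, now=None):
    """Return (path, size_bytes, age_days) for files not modified in max_age_days.

    Hypothetical helper: "cold" is assumed here to mean "old modification date".
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400  # 86400 seconds per day
    cold = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = Path(dirpath) / name
            try:
                info = path.stat()
            except OSError:
                continue  # file vanished or is unreadable: skip it
            if info.st_mtime < cutoff:
                age_days = int((now - info.st_mtime) // 86400)
                cold.append((str(path), info.st_size, age_days))
    # Largest files first, since they weigh most on storage costs
    cold.sort(key=lambda item: item[1], reverse=True)
    return cold


def summarise(cold):
    """Aggregate the result of find_cold_files into a file count and a byte total."""
    return {"files": len(cold), "bytes": sum(size for _path, size, _age in cold)}
```

Calling `summarise(find_cold_files("/some/shared/drive"))` (the path is a placeholder) gives a first order of magnitude for a clean-up campaign. A real inventory would also need to cover databases, mailboxes and cloud storage, which a file-system walk does not see.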

With our MyDataCatalogue module, we have sought to extend the scope and automation of the data cataloguing process as far as possible: programmable scans, field name recognition algorithms, application of criteria to metadata as well as to data, whether structured or unstructured, multilingual glossaries, consideration of synonyms, and so on.
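As an illustration of what field name recognition with a multilingual glossary and synonyms can look like, here is a deliberately simple Python sketch. It is not the MyDataCatalogue algorithm: the `GLOSSARY` content, the concept names and the functions `normalise`, `classify_field` and `classify_fields` are hypothetical, and matching is limited to exact lookup after normalisation.

```python
import re
import unicodedata

# Hypothetical glossary: business concept -> English and French synonyms
GLOSSARY = {
    "email": {"email", "e-mail", "mail", "courriel", "adresse mail"},
    "last_name": {"last name", "surname", "nom", "nom de famille"},
    "phone": {"phone", "telephone", "téléphone", "tel", "mobile"},
}


def normalise(name):
    """Lower-case a field name, strip accents and unify separators."""
    text = unicodedata.normalize("NFKD", name)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", text)  # split camelCase
    text = re.sub(r"[^A-Za-z0-9]+", " ", text)  # _, -, . become spaces
    return text.strip().lower()


# Reverse index: normalised synonym -> concept
_INDEX = {
    normalise(synonym): concept
    for concept, synonyms in GLOSSARY.items()
    for synonym in synonyms
}


def classify_field(name):
    """Return the glossary concept for a field name, or None if unknown."""
    return _INDEX.get(normalise(name))


def classify_fields(names):
    """Classify several field names at once."""
    return {name: classify_field(name) for name in names}
```

With this glossary, `classify_fields(["E-Mail", "NomDeFamille", "Téléphone", "created_at"])` maps the first three names to `email`, `last_name` and `phone`, and the last one to `None`. Fields that stay unclassified are the ones a cataloguing campaign would send for human review.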

As noted in the TRUE Global Intelligence survey cited above, one of the obstacles to implementing a data governance policy designed to reduce the volume of dark data is the difficulty of organising interaction between departments that often have different cultures and objectives. With the Collaborative Data Cleaning service, Dawizz has taken this reality into account by defining an intuitive interface for each department, enabling it to carry out its task efficiently and quickly: data management, business users and IT each simply contribute to the end result.

The Crédit Agricole Group is one of the users of Collaborative Data Cleaning, with several Regional Banks having already deployed the service or preparing to do so. Feedback from the Caisse Régionale de Normandie, given in a joint testimony by its CDO and its DPO, clearly illustrates the benefits of a properly equipped data cleaning campaign.

Like the Hydra of Lerna in Greek mythology, dark data has an unfortunate tendency to regenerate as it is destroyed. That’s why it’s important to launch regular clean-up campaigns to avoid replenishing excessive stocks, and to monitor the results so as to be able to identify any areas within the organisation that are potentially less effective than average in their clean-up efforts, and to support them with targeted awareness-raising initiatives.

Stéphane Le Lionnais
A passionate and versatile entrepreneur, Stéphane is the co-founder of Dawizz, the company behind MyDataCatalogue, the Data Catalog module integrated into Blueway's Phoenix platform. Drawing on his field expertise and his close attention to customer needs, he combines practical know-how with strategic vision.