One Big Risk With Big Data: Format Lock-In
By Rutrell Yasin and Tobias Naegele
October 30, 2017
Insider threat programs and other long-term Big Data projects demand users take a longer view than is necessary with most technologies.
If the rapid development of new technologies over the past three decades has taught us anything, it’s that each successive new technology will undoubtedly be replaced by another. Vinyl records gave way to cassettes and then compact discs and MP3 files; VHS tapes gave way to DVD and video streaming.
Saving and using large databases present similar challenges. As agencies retain security data to track behavior patterns over years and even decades, ensuring the information remains accessible for future audit and forensic investigations is critical. Today, agency requirements call for saving system logs for a minimum of five years. But there’s no magic to that timeframe, which is arguably not long enough.
The records of many notorious trusted insiders who later went rogue – from Aldrich Ames at the CIA to Robert Hanssen at the FBI to Harold T. Martin III at the NSA – suggest the first indications of trouble began a decade or longer before they were caught. It stands to reason, then, that longer-term tracking should make it harder for moles to remain undetected.
But how can agencies ensure data saved today will still be readable in 20 or 30 years? The answer is in normalizing data and standardizing the way data is saved.
“This is actually going on now where you have to convert your ArcSight databases into Elastic,” says David Sarmanian, an enterprise architect with General Dynamics Information Technology (GDIT). The company helps manage a variety of programs involving large, longitudinal databases for government customers. “There is a process here of taking all the old data – where we have data that is seven years old – and converting that into a new format for Elastic.”
JavaScript Object Notation (JSON) is an open standard for data interchange favored by many integrators and vendors. As a lightweight data-interchange format, it is easy for humans to read and write and easy for machines to parse and generate. Non-proprietary and widely used, it is common in web application development, Java programming and the popular Elasticsearch search engine.
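That machine-friendliness is easy to see in practice. The short Python sketch below round-trips a hypothetical security log record through JSON; the field names are illustrative only, not drawn from any agency schema.

```python
import json

# A hypothetical security log record. These field names are
# illustrative assumptions, not any specific agency's schema.
record = {
    "timestamp": "2017-10-30T14:05:00Z",
    "user": "jdoe",
    "event": "login_failure",
    "source_ip": "10.0.0.42",
}

# Generate: serialize the record to a JSON string.
text = json.dumps(record)

# Parse: read the JSON string back into a native structure.
restored = json.loads(text)

# The round trip is lossless for data like this.
assert restored == record
```

Because the format is text-based and self-describing, a record written this way stays readable even if the tool that produced it is long gone.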
To convert data to JSON for one customer, GDIT’s Sarmanian says, “We had to write a special script that did that conversion.” Converting to a common, widely used standard helps ensure data will be accessible in the future, but history suggests that any format used today is likely to change in the future – as will file storage. Whether in an on-premises data center or in the cloud, agencies need to be concerned about how best to ensure long-term access to the data years or decades from now.
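The article does not show Sarmanian's script, but a conversion of that general shape – legacy delimited log rows rewritten as JSON documents – might look like this minimal sketch. The pipe-delimited input format and the field names are assumptions made for illustration.

```python
import csv
import io
import json

# Hypothetical legacy export: pipe-delimited rows in a fixed column order.
# Both the delimiter and the columns are assumptions for this sketch.
LEGACY_EXPORT = """\
2010-06-01 08:12:03|jdoe|login|workstation-17
2010-06-01 08:12:09|jdoe|file_access|/share/hr/payroll.xls
"""

FIELDS = ["timestamp", "user", "event", "target"]

def convert(legacy_text):
    """Yield one JSON document per legacy log row."""
    reader = csv.reader(io.StringIO(legacy_text), delimiter="|")
    for row in reader:
        # Pair each value with its column name, then emit JSON.
        yield json.dumps(dict(zip(FIELDS, row)))

for doc in convert(LEGACY_EXPORT):
    print(doc)
```

The point of the exercise is the output, not the script: once the old rows are JSON documents, any future tool that speaks the open standard can ingest them, and the one-off conversion code can be thrown away.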
“If you put it in the cloud now, what happens in the future if you want to change? How do you get it out if you want to go from Amazon to Microsoft Azure – or the other way – 15 years from now? There could be something better than Hadoop or Google, but the cost could be prohibitive,” says Sarmanian.
JSON emerged as a favored standard, supported by a diverse range of vendors, from Amazon Web Services and Elastic to IBM and Oracle. In a world where technologies and businesses can come and go rapidly, its wide use is reassuring to government customers with a long-range outlook.
“Elasticsearch is open source,” says Michael Paquette, director of products for security markets with Elastic, developer of the Elasticsearch distributed search and analytics engine. “Therefore, you can have it forever. If Elasticsearch ever stopped being used, you can keep an old copy of it and access data forever. If you choose to use encryption, then you take on the obligation of managing the keys that go with that encryption and decryption.”
In time, some new format may be favored, necessitating a conversion similar to what Sarmanian is doing today to help his customer convert to JSON. Conversion itself will have a cost, of course. But by using an open standard today, it’s far less likely that you’ll need custom code to make that conversion tomorrow.