Posted on Wednesday 26 September 2012 by Ulster Business
It's one of those phrases that, like the word "Cloud", is overused without proper thought about what it actually means. So first things first, let's define what Big Data means.
Big Data is a catch-all term for data that is either too big or moving too quickly to be analysed and processed in a meaningful way. It can be too vast to store and may not fit within our commonly used data processing programs (database servers, spreadsheets and so on). While the data produced can far exceed our capacity to process it, falling storage and processing costs mean we can all start gaining knowledge from this stored data.
There is a plentiful supply of data from website log files, point of sale transactions, and social media interactions and comments (think about the amount of data Twitter and Facebook generate alone), and that's before we scratch the surface of the geolocation data our phones can produce. Let's not forget monitoring equipment within healthcare, banking transactions, flight data from a single flight, documents, digitised music; I could go on, but I think you're getting the picture. I've not even touched on the data being created within your own organisation, and it's worth sitting down and thinking about what is and can be created, and how Big Data tools could help you gain insight from it.
It's all very well having all this data, but if you can't learn anything from it then it's pretty much a waste of time having it. The volume, velocity and variety of the data available determine the technical considerations of how you are going to process it.
Even before the term Big Data existed, large volumes of data were around. One example that springs to mind is the Tesco Clubcard. On its launch in 1995 the volume of incoming data was so great that Dunnhumby (the marketing company behind the Clubcard) processed only 10% of the data and then applied the results to the remaining 90% of Clubcard holders. I'm pretty convinced that, with high-speed internet and vastly improved point of sale product capture, the Clubcard system could now analyse in real time. From a customer perspective, though, we're so used to receiving our coupons once a quarter that changing that might fundamentally change the whole concept.
The key question any organisation must ask before considering mining and learning from its data is "what question are we trying to answer?" It's all very well processing this huge volume of data with all this cheap and readily available processing power "because we can", but without an aim in mind it becomes a waste of resources, time and money. By defining the question you start to think about how the data can test your assumption, instead of the other way around.
As I've previously mentioned, the cost of processing and storage declines over time, but moving the data around still requires bandwidth. One reason the likes of Amazon Web Services are so popular for data processing is the very short distance the data needs to travel from a storage service (Amazon S3, for example) to the processing service. With data held in various clusters of machines, it makes sense to keep them as close as possible to the processing solution.
Another consideration is the quality of the data. Unless you are the originator of the data, and therefore know its layout and schema, it will usually require some form of cleaning or quality checking before knowledge can be gained from it. This becomes especially important if you are processing data from various sources: different systems may record date/time information in different formats, which need to be made uniform before any insight can be gained. This again takes time and effort to get right.
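To make the date/time point concrete, here is a minimal Python sketch of that kind of cleaning step. The input values and the list of expected formats are assumptions for illustration; real sources would need their own formats added.

```python
from datetime import datetime

# Hypothetical sample: the same date arrives in different formats
# from different sources, and must be normalised before analysis.
RAW_DATES = ["26/09/2012", "2012-09-26", "Sep 26 2012"]

# Formats we expect to encounter (an assumption for this example).
KNOWN_FORMATS = ["%d/%m/%Y", "%Y-%m-%d", "%b %d %Y"]

def normalise(value):
    """Try each known format; return an ISO 8601 date string, or None."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(value, fmt).date().isoformat()
        except ValueError:
            continue
    return None  # unrecognised value, flagged for manual cleaning

print([normalise(d) for d in RAW_DATES])
# → ['2012-09-26', '2012-09-26', '2012-09-26']
```

Returning None for anything unrecognised, rather than guessing, is the safer choice here: a wrongly parsed date quietly corrupts every insight built on top of it.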
One of the key decisions is whether to process the data in real time (also known as streaming) or to batch up blocks of data for processing later. You may have heard of technologies such as Hadoop, a name that comes up in most Big Data conversations. At its core, Hadoop provides a means of splitting data into sections for processing (the map phase) and then consolidating all the processed results into a single output (the reduce phase). While Hadoop technologies are great for batch processing, they're not really designed for real-time processing.
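The map and reduce phases described above can be sketched in a few lines of plain Python, without Hadoop itself. The point-of-sale records and field names are invented for illustration; Hadoop would run the same two phases across many machines.

```python
from collections import defaultdict

# Invented point-of-sale records for illustration.
records = [
    "checkout store=belfast amount=20",
    "checkout store=derry amount=35",
    "checkout store=belfast amount=15",
]

# Map phase: turn each record into (key, value) pairs.
def map_phase(record):
    fields = dict(f.split("=") for f in record.split()[1:])
    yield fields["store"], int(fields["amount"])

# Shuffle: group values by key (Hadoop does this between the phases).
grouped = defaultdict(list)
for record in records:
    for key, value in map_phase(record):
        grouped[key].append(value)

# Reduce phase: consolidate each key's values into one result.
totals = {store: sum(amounts) for store, amounts in grouped.items()}
print(totals)  # → {'belfast': 35, 'derry': 35}
```

Because each record is mapped independently, the map phase parallelises easily; it's the batching and grouping in between that makes this approach a poor fit for real-time streams.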
Every organisation's need for using big data is different. From the technologies and talent available to the knowledge acquisition, there is no one perfect fit and time must be taken to define the aim of any big data exercise. Once the aim is defined you can refine the feedback loop of information and make better informed decisions that will help the organisation and your customers.
Jason Bell is founder of Datasentiment Limited (www.datasentiment.com) based in Northern Ireland. He can be contacted via email at firstname.lastname@example.org
"Every day, we create 2.5 quintillion bytes of data – so much that 90 per cent of the data in the world today has been created in the last two years alone. This data comes from everywhere: sensors used to gather climate information, posts to social media sites, digital pictures and videos, purchase transaction records, and cell phone GPS signals, to name a few."
"'Big data' is not about size alone. This year's big data is next year's normal-sized data. Generally, volume quickly gives way to the more defining requirements of variety, velocity and complexity."
"By the end of 2012 more than 90 per cent of the Fortune 500 will likely have at least some Big Data initiatives under way."
"One in three business leaders don't trust the information they use to make decisions. How can you act upon information if you don't trust it? Establishing trust in big data presents a huge challenge as the variety and number of sources grows."
"The only reason for pursuing big data is to deliver against a business objective. For example, knowing how a large population responds to an event is useless unless a business organization can benefit from influencing that event or its outcomes."
"Not all industries are likely to benefit from big data projects equally. Not surprisingly, the first movers were Internet companies; in fact, the most popular big data tools are being built on top of software that was originally used to batch process data for search analysis. The fast follower sectors are likely to be public sector, financial services, retail and entertainment and media."