Big data: Buzzword or big deal?
You have already heard the buzz about big data again and again. If you do not collect and use the enormous amounts of data that your company is undoubtedly creating, you might as well close your business and go home before a competitor forces you to.
But what is big data? Is it when you analyse the traffic on your website? Does it become big data when you combine those numbers with data from your CRM system? And is it in fact big data when Spotify or Netflix suggests new music or movies you might like?
Let’s start by defining what big data is. We talked with two data specialists who have worked with big data for many years and have a qualified view of what lies behind the concept.
“Big data appears as a solution when the traditional approaches for data storing, processing and retrieving can no longer be applied because the amount of data is just too big. A big data case might be due to very frequent requests, a combination of several data sets or because the kind of stored values is more complex than just numeric values. It is hard to give a threshold to separate what big data is and is not. As a rule of thumb, if you are still wondering whether you have big data or not, the answer is probably not,” says Angel Diego Cuñado Alonso, Machine Learning Engineer at Tradeshift.
Johannes Scheibe, a Data Scientist at eBay who has a Master’s in Machine Learning and Computational Neuroscience, offers much the same definition. He points out four things that characterise big data: high volume, high variety, high variability and high veracity.
Big or small data - who cares?
Angel Diego Cuñado Alonso and Johannes Scheibe agree that it does not make sense to label every collection of data as big data. Big data should rather be seen as a challenge in collecting, storing and processing data. So trying to be a big data company just for the sake of it makes no sense, according to Johannes Scheibe.
“It doesn't matter if it’s small data or big data. What’s important is getting insights and finding a purpose for your business. If you have big data, it’s just a question about how to get that insight from the data, which is an entirely technical challenge. It makes no sense saying on a strategic level: We want to do big data. Either you have data and you can do something useful with it, or you don't,” explains Johannes Scheibe.
Because we all work in the cloud, and because more and more devices are connected to it through the Internet of Things, many companies find themselves collecting more data, whether they want to or not. At the same time, data has changed and become more complex.
“In old times, you had databases with columns and rows, where you stored numbers or words to perform simple queries usually in a local machine. Nowadays, we store information, images, videos, documents, music from different sources, and we want results in real time. Hence, the volume of data and its applications have increased drastically. In most cases, the old approaches cannot cope with this,” says Angel Diego Cuñado Alonso.
Start with defining the insights you need
If you are seriously considering big data, the trick is to turn the whole thing upside down and start by asking: What do we need to know? There is no reason to collect and structure data if it does not create value for the company.
“In today's world data is never perfect. I am a Machine Learning Engineer and I spend maybe 80 per cent of my time preparing data and putting it together. And maybe 20 per cent working with algorithms,” says Angel Diego Cuñado Alonso and gives an example.
You might have collected data for five years and then find out that you need an extra input before you can draw a conclusion from it. So what do you do with the data you already have? Do you throw away five years of data, or do you try to guess the input for previous entries? Either way, aligning data so that it is consistent is very time-consuming. The world moves fast and needs can change, but having clear goals from the start can reduce this burden to a large extent.
“Let's not find solutions for problems we don't have yet; this would just add unnecessary complexity. First, a company needs to understand where it is and where it wants to be. What is our vision? What do our customers want? What are our needs, threats and opportunities? And, what do we want to do about it? You look at what data is available now and find out what extra data can be collected, either internally or externally. I want this - how can I do this faster, cheaper, scalable, flexible - and then you look at the available technologies, and if big data can help you, go for it,” explains Angel Diego Cuñado Alonso.
“It's all about telling stories. Finding an insight and telling a story. This is what we see, and this is what it means. You need to be able to communicate this stuff,” adds Johannes Scheibe.
Large amounts of data are not big data
Therefore, for many companies it makes more sense to find a person who understands the relations between numbers and who is focused on the company’s overall goals. In most situations, simply analysing the company’s data is more than enough. It does not have to be labelled big data just because you have a lot of data from different sources.
For many companies, hiring people to program algorithms is like using a sledgehammer to crack a nut if the data the company creates can be handled just by sorting and filtering in an Excel spreadsheet. Excel is powerful, and in reality it is not hard to create lists based on, let’s say, playing history.
“You should not start hiring data scientists because you think you want to work with big data. In my opinion, you should start with writing down what technologies you are using and what you need or want to know. What are you not able to answer today with the people you have? If you need someone who is really good at communicating results and doing stakeholder management, then you may not want to hire a computer scientist with a machine learning background. They tend to be bad at explaining things. So start with the business needs,” says Johannes Scheibe.
Let’s say you want to create a simple calculator for house prices based on postal codes and number of rooms. You can also add tax levels, grade averages from the local schools and the price of previously sold houses on the street.
But if you need a really effective calculator, you also need to look at the way the house is laid out and the condition it is in. You can do that by analysing floor plans and pictures, and then you suddenly have data that cannot be categorised in a traditional Excel spreadsheet the way prices and quantities can. That is big data!
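To make the contrast concrete, here is a minimal sketch of the simple, tabular version of the calculator. The postal codes, the prices per room and the way street sales are blended in are all hypothetical assumptions chosen for illustration, not figures from the interviewees. Everything the function needs fits in rows and columns, whereas floor plans and photos have no natural place in it.

```python
# Hypothetical average price per room (in thousands) for each postal code.
PRICE_PER_ROOM = {"2100": 950, "8000": 600, "5000": 450}


def estimate_price(postal_code, rooms, recent_street_sales=None):
    """Estimate a house price from tabular inputs only."""
    if postal_code not in PRICE_PER_ROOM:
        raise ValueError(f"no price data for postal code {postal_code}")

    estimate = PRICE_PER_ROOM[postal_code] * rooms

    if recent_street_sales:
        # Assumed rule: average the estimate with previous sales on the street.
        street_average = sum(recent_street_sales) / len(recent_street_sales)
        estimate = (estimate + street_average) / 2

    return estimate


print(estimate_price("8000", 4))
print(estimate_price("8000", 4, recent_street_sales=[2600, 2800]))
```

Adding tax levels or school grade averages would simply mean more columns in the same kind of lookup.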
Big data needs machine learning
If the diagnosis in your company is big data, you may need help from people who know how to work at the intersection of mathematics, statistics and coding.
“In some ways we are reaching the limit of human capabilities. A human alone can only handle a certain amount of information before it's too much for him or her. So machine learning comes to solve this problem. In machine learning the algorithms look at the data and it finds patterns, it predicts things,” says Angel Diego Cuñado Alonso.
Machine learning, deep learning and artificial intelligence are concepts that are integral to corporations such as Google, Facebook and NVIDIA. To handle big data, you may need some form of artificial intelligence to find logic in data that cannot be put in columns.
A good example is a machine that sorts apples and oranges. With traditional coding you would need to write a lot of rules for what an orange and an apple look like: colour, size, shape and so on. But in most cases you would need too many rules for any one human being to grasp. For example, what if the orange is mouldy? The colour might be different, and it might have shrunk, but it is still an orange. That is why the answer would need to be machine learning.
“With machine learning you write an algorithm that will learn how to do the task from the data. We don't tell the algorithm about the colour or the size. No, we take an apple and say, this is an apple. We take an orange and say this is an orange. So with different apples in shapes and sizes the algorithm is going to learn how to recognize apples. It's going to create its own rules about how to perform this task,” explains Angel Diego Cuñado Alonso.
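The idea of learning from labelled examples rather than hand-written rules can be sketched with a very small classifier. This is an illustration only, not the approach either interviewee uses. A real sorting machine would most likely learn from raw images, whereas this sketch assumes each fruit has already been reduced to two hypothetical measurements (redness and diameter) to stay short. The point is that nobody writes a rule such as "apples are redder than 0.5"; the program derives its own decision boundary from the examples it is shown.

```python
from math import dist

# Hypothetical labelled examples: (redness from 0 to 1, diameter in cm).
training = [
    ((0.90, 7.0), "apple"),
    ((0.80, 7.5), "apple"),
    ((0.70, 6.5), "apple"),
    ((0.30, 8.0), "orange"),
    ((0.20, 8.5), "orange"),
    ((0.25, 7.5), "orange"),
]


def train(examples):
    """Learn one average point (centroid) per label from the examples."""
    sums = {}
    for features, label in examples:
        total, count = sums.get(label, ([0.0] * len(features), 0))
        sums[label] = ([t + f for t, f in zip(total, features)], count + 1)
    return {label: [t / n for t in total] for label, (total, n) in sums.items()}


def predict(model, features):
    """Pick the label whose centroid is closest to the new fruit."""
    return min(model, key=lambda label: dist(model[label], features))


model = train(training)
print(predict(model, (0.85, 6.8)))  # a typical red fruit
print(predict(model, (0.35, 7.6)))  # a slightly discoloured, shrunken fruit
```

With more varied examples, including mouldy or misshapen fruit, the same training step would adjust the centroids without anyone rewriting rules.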
To outsource or not to outsource?
If you are about to start your data journey, you also face choices such as outsourcing versus allocating in-house resources and decentralisation versus centralisation. Both Johannes Scheibe and Angel Diego Cuñado Alonso share the view that you generally need to be careful when hiring expensive consultancies to drive the big data work. If you want to work seriously with big data, you need dedicated in-house employees.
“For me outsourcing is quite dangerous. It can only work if it is a one-time analysis. And even then you might make a huge report, pay a lot of money and still end up putting the report in the drawer and never look at it again. If it is a recurring thing, the issue is that if you don't have the skills in the company and you don't change the processes, don't change the culture, your way of thinking – then nothing changes. But you have an alibi. The report was the result - now it's in the drawer,” says Johannes Scheibe.
Whether the big data work is best placed in a dedicated central unit or in decentralised units spread across the organisation depends on the company.
“There are pros and cons for each model. If you do it decentralised you are basically the only analyst in the team and you need to make sure the analysis you do is solid. You want to do some kind of quality assurance on analyses and ensure that you are using the same data as the analysts in the other teams,” explains Johannes Scheibe.
“So for someone else from a different business unit, who is only there to review your stuff, it can be quite difficult if you don't talk on a regular basis. So you want to make sure they come together and share insights and use the same methodology. You need to standardise your way of working.”
The danger of a centralised data team, on the other hand, is that the unit might lack an understanding of what goes on in the rest of the organisation. Therefore, professional stakeholder management is vital. In short: What do people want, and how can we give it to them?
Finding the possibilities
Many companies today work with data, big or small. Johannes Scheibe highlights biotech and medtech as areas where a lot of interesting things are happening these days.
“What's really rising at the moment is machine learning in pharma - drug testing. With the help of machine learning we can predict how humans will react to a given drug,” he says, pointing out that the Human Genome Project, which mapped the human genetic material, will give us an even deeper understanding of the human body.
Apart from that, he points to the financial sector as an obvious area for working with data to prevent fraud. Areas of legal practice could also be interesting, as they are usually built on a foundation of absolute rules, where a computer can use machine learning to read thousands of pages and find relevant relations.