A cluster analysis is a process for analysing and presenting data that involves grouping similar objects together and separating those that are dissimilar. This is commonly used in data analysis, and we experience clustering in many situations in daily life. For example, if you are eating a meal out, then you will probably be grouped together around a table. This is a ‘cluster’ of sorts.
Likewise, we naturally tend to cluster with people who are similar to us. Interestingly, we also become more similar to the people we spend the most time with and less similar to out-groups. These processes are called convergence and divergence.
We use cluster analysis in classification very often. For instance, in classifying diseases we will often cluster together different conditions that have similar symptoms, similar at-risk groups, similar points of origin etc. This can help to create useful taxonomies which may then be useful in helping to find potential cures. In data analysis, cluster analysis can be used in order to look for patterns and trends. For instance, if you group together the visitors to a website who all have certain qualities in common, what trends do you see? What if you group them in different ways?
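To make the grouping idea concrete, here is a minimal k-means sketch in Python. The visitor data and the two metrics (pages viewed, minutes on site) are entirely made up for illustration; real tools such as scikit-learn offer production-ready versions of this algorithm.

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Minimal k-means: group 2-D points into k clusters."""
    random.seed(seed)
    centroids = random.sample(points, k)
    for _ in range(iters):
        # Assign each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: (p[0] - centroids[i][0]) ** 2
                                                + (p[1] - centroids[i][1]) ** 2)
            clusters[nearest].append(p)
        # Move each centroid to the mean of its cluster.
        for i, c in enumerate(clusters):
            if c:
                centroids[i] = (sum(p[0] for p in c) / len(c),
                                sum(p[1] for p in c) / len(c))
    return clusters

# Hypothetical visitors: (pages viewed, minutes on site)
visitors = [(1, 2), (2, 1), (1, 1), (9, 30), (10, 28), (8, 33)]
for group in kmeans(visitors, 2):
    print(sorted(group))
```

Run on this toy data, the algorithm separates the three casual visitors from the three highly engaged ones, which is exactly the kind of grouping described above.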
Finally, cluster analysis is also very commonly used in order to display data in a meaningful way or in an entertaining way. For example, we often see tag clouds in use on web pages, which are groups of keywords and terms that are used commonly throughout the site or even within an individual article.
Big data is a buzzword that you currently hear a lot when reading about business and the web, but what precisely does it mean? In short, big data means data that is so large and complex that it becomes difficult to handle using traditional means.
How does such data come about? Well if you have an online element to your business, or use any software, then you will find that you handle a lot of data. These systems are able to gather huge amounts of data automatically and that means you have a lot of powerful information that you could potentially use to great effect. The problem is that this data then needs to be analyzed to be of any use.
Let's use Wikipedia as an example. At the time of writing Wikipedia has in excess of 30,000,000 pages and has had well over a billion edits. In other words, this is not only a huge amount of information, but also a huge amount of data.
Conceivably Wikipedia might be able to benefit from some of this data. They might like to know when a page is most likely to be edited, how many times the average user is likely to make an edit, and how many articles they have that meet certain criteria. Of course, working out even an average time for an edit when you have over a billion edits to take into account demands serious processing power as well as time. The sheer number of edits makes the data almost impossibly big and, as a result, less useful. What's more, Wikipedia could arguably use a lot of this data to improve its business and the service it provides to its audience.
For instance, by looking at which pages tend to be read next after a previous page, Wikipedia can better recommend more content to keep you on the page. This is where algorithms can then be applied in order to calculate this sort of thing. For instance, cluster analysis could be useful in order to group together ‘similar web pages’ and here you might even use the visitor metrics as a factor in that calculation.
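One simple version of "which pages tend to be read next" needs no clustering at all, just counting. The page titles and navigation log below are hypothetical, used only to show the shape of the calculation:

```python
from collections import Counter, defaultdict

# Hypothetical navigation log: (page read, page read next).
transitions = [
    ("Cluster analysis", "K-means"),
    ("Cluster analysis", "K-means"),
    ("Cluster analysis", "Dendrogram"),
    ("K-means", "Centroid"),
]

# Count how often each page follows each other page.
followers = defaultdict(Counter)
for page, nxt in transitions:
    followers[page][nxt] += 1

def recommend(page, n=1):
    """Suggest the page(s) most often read next after `page`."""
    return [p for p, _ in followers[page].most_common(n)]

print(recommend("Cluster analysis"))  # ['K-means']
```

A full recommender would weight many more visitor metrics, but the same idea scales: tally behaviour, then surface the strongest associations.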
Likewise, you could use a similar process to find which pages have the most engagement and are the most likely to encourage readers to donate money (Wikipedia’s primary source of income). By clustering these pages together, Wikipedia could learn what those high-value pages have in common and decide which content to promote.
From this point you could use a Venn diagram to visualize more than one cluster and to see where there is overlap. If a page falls both into the ‘likely useful for this user’ cluster and the ‘high chance of donation’ cluster, then we can use this information to choose the best page to show. Note that the Venn diagram doesn’t have to be actually drawn but can instead be used simply as part of an algorithm.
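In code, the "overlap" region of a Venn diagram is just a set intersection. The cluster memberships below are invented page titles, purely to illustrate the operation:

```python
# Hypothetical cluster memberships, expressed as sets of page titles.
likely_useful = {"K-means", "Dendrogram", "Centroid"}
high_donation = {"Dendrogram", "Featured article", "K-means"}

# The Venn diagram's overlap is a set intersection --
# no drawing required to use it inside an algorithm.
best_candidates = likely_useful & high_donation
print(sorted(best_candidates))  # ['Dendrogram', 'K-means']
```

Pages landing in the intersection would be the strongest candidates to show next.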
Of course, you are not Wikipedia, and chances are your business won't be generating this much data. However, even if you just have a website with over 1,000 visitors a day, that is already a lot of information. If your site has a login system, or even if you are not online at all and simply sell a great many products across a wide chain of retail stores, you are still going to have to deal with that data.
So how do you deal with this data? Well, there are a few options, one of which is simply to outsource your big data handling to a company that specializes in it. Such a company will have the time and the tools necessary to deal with that information, and will dedicate itself to giving you every useful visualization and insight that data can yield. This is a good move, as you might find the data reveals patterns you were not even looking for.
Another important consideration is how you collect that data and what you do with it at the time. Some data won't be useful for instance, and if you are collecting information that is only going to get in the way then you should set up a system so that useless information is ignored. Likewise, if you can calculate some things as you go then it will prevent you from having to do huge sums with gigantic amounts of data. This means creating the right software and refactoring in such a way that this software is quick and efficient - and that means using professional services once again.
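"Calculating some things as you go" can be as simple as keeping a running average instead of storing every record and re-summing later. Here is a minimal sketch; the sample values are invented:

```python
class RunningMean:
    """Update an average one record at a time, so the raw
    data never has to be stored or re-summed in bulk."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0

    def add(self, value):
        self.count += 1
        # Incremental mean update: new_mean = old_mean + (x - old_mean) / n
        self.mean += (value - self.mean) / self.count

stat = RunningMean()
for seconds_between_edits in [30, 90, 60]:  # made-up sample values
    stat.add(seconds_between_edits)
print(stat.mean)  # 60.0
```

The same incremental trick extends to variances and counts, which is how a system can answer "what's the average edit time?" without ever holding a billion edits in memory.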
The future for big data is bright and fascinating. For now, companies are unable to make full use of all the data they are collecting. Imagine all the potentially different clusters that could be created from the gigantic amount of information available on Wikipedia. All the articles under 100 words long. All the articles under 150 words long!
Machine learning is using this kind of data analysis to do everything from improving the ability of computers to navigate physical spaces (through computer vision) to improving voice recognition. When we marry these algorithms with the power of quantum or distributed cloud computing… the world will change beyond recognition.
As we have seen then, there is a lot of talk at the moment about how best to display data. Thanks to the huge amount of information that companies and individuals can now collect through the internet, finding elegant ways to convey trends and key points in data is an important challenge. In many cases, it is a challenge for sheer computational power apart from anything else!
But all of this talk tends to focus on a specific type of data: quantitative data. That is to say, it deals with numbers and how you can display values, scores and measurements. A set of numbers can easily be displayed as a chart or a graph; it can be conveyed through size or through colour, and single figures such as percentages can easily make a big impact when written in large fonts and bright colours.
But there is another type of data too: qualitative data. That's data that takes the form of words, phrases and sentences, and which is classically much more difficult to deal with. If for instance you were to conduct an interview, and you were to ask people what they thought of the current government, you would end up with a range of detailed answers that contained a huge amount of opinions, facts, ideas and more. This information might actually be much more detailed and useful than simple numbers (which you would get by asking people to 'rate' the current government instead), but conveying it as a chart would be much more difficult.
One solution though is to use something known as 'qualitative analysis'. This is the method that researchers use to assess qualitative data, and it can be used to make qualitative data much more manageable.
Essentially this involves looking through the text you've accumulated, and identifying key words and phrases that come up time and again (counting synonyms in the same way). In the case of our government interviews you might notice words cropping up like 'reliable', 'effective' and 'satisfactory', as well as words like 'distrust', 'ineffective' and 'stupid' or 'tax' and 'eco friendly'. You would then count every time one of these words or a synonym appeared and that way you could reliably see the overall feelings of your interviewees as well as which topics were important to them.
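That counting step can be sketched in a few lines of Python. The synonym table, keyword list and interview answers below are illustrative stand-ins for the government example:

```python
from collections import Counter

# Map synonyms onto one canonical keyword (illustrative lists only).
synonyms = {
    "dependable": "reliable",
    "trustworthy": "reliable",
    "useless": "ineffective",
}

keywords = {"reliable", "effective", "ineffective", "distrust", "tax"}

answers = [
    "The government seems reliable and effective overall.",
    "I find them dependable, though tax policy is ineffective.",
    "Distrust is growing; they have been useless on tax.",
]

counts = Counter()
for answer in answers:
    cleaned = answer.lower().replace(";", " ").replace(",", " ").replace(".", " ")
    for word in cleaned.split():
        word = synonyms.get(word, word)  # fold synonyms together
        if word in keywords:
            counts[word] += 1

print(counts.most_common())
```

Real qualitative analysis tools add stemming and phrase matching, but the core is the same tally of canonical terms.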
Then you could use a word cloud, with size and colour used to denote the words that were most prominent and that would be a great way to quickly put across some qualitative data (or colour could represent the emotional intent of the word). A large red 'Distrust' next to a small green 'Effective' could paint just as impactful a picture as a pie chart – if not more of one.
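Turning those counts into a word cloud mostly comes down to scaling frequency to font size. A simple linear mapping might look like this (the counts and pixel range are arbitrary choices):

```python
def font_sizes(counts, min_px=12, max_px=48):
    """Linearly scale word counts to font sizes for a word cloud."""
    lo, hi = min(counts.values()), max(counts.values())
    span = (hi - lo) or 1  # avoid dividing by zero when all counts match
    return {word: round(min_px + (n - lo) * (max_px - min_px) / span)
            for word, n in counts.items()}

sizes = font_sizes({"distrust": 9, "effective": 3, "tax": 6})
print(sizes)  # {'distrust': 48, 'effective': 12, 'tax': 30}
```

Colour could be assigned in the same pass, for example red for negative sentiment words and green for positive ones.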
There are of course many other methods you can use here too. From flow charts, to tables, to spider diagrams, to ergodic text. Ergodic text is a very interesting concept where the shape and structure of the formatting is changed in order to reflect the meaning of the text. For instance, if you are writing a description of a long, narrow hall, then your writing may itself start to begin long and thin and spindle its way up the page. Similarly, if you are writing about a house, then your writing may take on that shape. The book House of Leaves does this extraordinarily well.
We see a lesser example of ergodic text all the time – when important words are bolded or italicized for emphasis. It was once thought that the use of hyperlinks would completely transform the way that we read by allowing us to read in a ‘non-linear’ fashion. This has not yet come to fruition, but try to be inventive with the way you display your text. Remember, this is a form of ‘data’ and the web is a visual medium!
Thanks to the web we have a lot of stats, figures and numbers available. But what is the web made of? Words of course! There is a huge and unprecedented amount of qualitative information out there… so why not start tapping into it? Cluster analysis is just one way to group and display this information, and one tool you can use to make better sense of your data.