Nearly every application involves data, and e-commerce applications especially so. Such applications take data as input and produce more data as output, turning the application itself into a data product. The term "big data" refers to any collection of data sets that grows over time and cannot be managed by traditional databases.
Traditional database systems are no longer effective at today's scale of operation. The main aim of relational database architecture was to provide consistency for complex transactions, which can easily be rolled back if any part of them fails. But managing replication between the data servers involved in a transaction is difficult.
Such data sets can be stored effectively in "NoSQL" or non-relational databases; Google's Bigtable and Amazon's Dynamo use this model. Cassandra and HBase are two of the many products established in this space. Google also introduced MapReduce, an approach that uses a divide-and-conquer strategy to distribute a large problem across a large cluster so it can be resolved efficiently.
As the name implies, MapReduce has two phases: Map and Reduce. The first phase processes the input records one by one and transforms them into an intermediate set. In the second phase, the intermediate set is reduced into summarised sets, which are the desired end product. A classic illustration is counting the unique words in a document: in the map phase each word is emitted with a count of 1, and in the reduce phase the counts for each word are added together.
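The word-count example above can be sketched in plain Python. This is a single-process illustration of the two phases, not a distributed implementation; the function names are my own.

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit an intermediate (word, 1) pair for every word seen."""
    return [(word, 1) for word in document.lower().split()]

def reduce_phase(pairs):
    """Reduce: sum the counts per word into the summarised set."""
    counts = defaultdict(int)
    for word, count in pairs:
        counts[word] += count
    return dict(counts)

doc = "the quick brown fox jumps over the lazy dog the end"
intermediate = map_phase(doc)      # [('the', 1), ('quick', 1), ...]
result = reduce_phase(intermediate)
print(result["the"])  # 3
```

In a real cluster the map calls run in parallel on different machines, and the intermediate pairs are grouped by word before being handed to the reducers.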
There are three distinct operations in a MapReduce job:
1. Loading the data
2. Running map and reduce
3. Extracting the result
Hadoop is the open-source implementation of MapReduce. It is agile, follows agile practices, and is used effectively in data analysis. It is a batch system: jobs are submitted, then monitored and controlled while they run, and results appear when a job completes rather than the moment data arrives. Batch processing is nonetheless sufficient for publishing the trending data one sees on Twitter, since such features need only soft real-time reporting, not millisecond accuracy.
Another tool used by data scientists is machine learning. Mobile and web applications have started to incorporate artificial intelligence in ways such as face detection and picture recognition (for example, Google Goggles). There are many machine-learning libraries, such as PyBrain in Python and Weka in Java.
Mechanical Turk is an important tool used in machine learning. Data is handed to human workers, who label it to produce training sets classified in human-readable terms so that the machine can learn from them.
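A minimal sketch of what a human-labelled training set enables, assuming invented example texts and labels. Here a toy classifier labels new text by its word overlap with each labelled class; real systems would use a proper learning algorithm from a library such as PyBrain or Weka.

```python
# Human-labelled training set (examples and labels are invented).
training_set = [
    ("win a free prize now", "spam"),
    ("free money click now", "spam"),
    ("meeting agenda for monday", "ham"),
    ("lunch with the team", "ham"),
]

def classify(text):
    """Pick the label whose examples share the most words with the text."""
    words = set(text.lower().split())
    scores = {}
    for example, label in training_set:
        overlap = len(words & set(example.lower().split()))
        scores[label] = scores.get(label, 0) + overlap
    return max(scores, key=scores.get)

print(classify("claim your free prize"))  # spam
```

The point is the workflow, not the algorithm: humans supply the labels, and the machine generalises from them to unseen inputs.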
The main problems of the explosion of big data in companies are called the three Vs: volume, velocity, and variety. Data grows as time progresses, so the capacity needed to store the data sets increases (volume), which in turn stretches processing times and deliverables (velocity); the wide range of formats involved is the variety.
In-memory computing applies the big data concept of moving the data closer to the processors. In traditional data analytics the data streams are too slow to process, and in-memory computing addresses this problem; in other words, it tackles the volume and velocity problems of big data.
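The core idea can be sketched as paying the slow load cost once, then answering repeated analytic queries from RAM instead of re-streaming from disk. This is an illustrative toy, assuming an in-memory `StringIO` stream standing in for a slow file on disk.

```python
import io

def load_into_memory(stream):
    """One pass over the slow stream; keep the records in RAM."""
    return [int(line) for line in stream]

raw = io.StringIO("10\n20\n30\n40\n")  # stands in for a file on disk
records = load_into_memory(raw)

# Subsequent queries touch only memory -- no further I/O.
total = sum(records)
peak = max(records)
print(total, peak)  # 100 40
```

With the data resident in memory, each additional query costs only CPU time, which is why in-memory systems help with the velocity problem in particular.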