Wednesday, January 7, 2015

Big data in the context of the Enterprise Data Warehouse

The Internet is full of articles on what big data is. In this post, I will focus on big data in the context of the enterprise data warehouse.

Enterprise data warehouses have been around for a long time, and many companies have made huge investments in building them. They bring many benefits, such as integration and standardization of data from multiple sources, access to historical data, pre-aggregation, OLAP, and isolation of the analytics load from OLTP. Big data technologies bring benefits such as distributed storage and parallel processing of large volumes of unstructured data, which can be cheaply stored on HDFS and processed using the MapReduce and Spark frameworks.

Here are some areas where big data technologies can augment traditional data warehousing :
  1. Organizations are looking at gathering newer types of data such as social media feeds, public data, web logs, opinions, and reviews. These newer sources of data provide valuable insights about an organization's customers, products and service offerings.
  2. With the advent of the Internet of Things, massive volumes of data are being generated by connected wearables, sensors, automobiles, smart home devices, and so on. Organizations are looking to capture and process this data in real time to become more efficient and proactive. Some organizations are using real-time data feeds for timely fraud detection.
  3. BI has evolved beyond simple reporting and analytics. Organizations are making use of machine learning algorithms to better understand their customers and come up with product recommendations or service offerings. More data (coupled with a good approach) yields better predictions and recommendations.
There are also use cases where big data technologies are being considered for improving existing warehousing processes and performance :
  1. ETL : Most warehousing solutions employ ETL for loading data into the warehouse from a variety of operational data sources. ETL is also used for change data capture. With ETL, data is transformed before being loaded into the warehouse. Most ETL tools require separate hardware, which can be expensive. An alternate approach is to load the data directly first and then run the transformations in the database engine itself. Since Hadoop provides cheap storage and processing, raw data can be dumped directly into HDFS and transformations applied to it by running either MapReduce or Spark jobs.
  2. ODS : The volume of transactional data gathered by enterprises is increasing, putting pressure on the batch window available to process the data. Warehouse practitioners are used to providing operational data stores to give access to more recent data. An ODS increases cost and yet does not provide real-time insights. Hence, organizations are looking at distributed messaging frameworks (such as Flume and Kafka) to ingest large volumes of data in real time.
While MapReduce and Spark provide distributed processing frameworks, there are abstractions such as HiveQL, Spark SQL and Pig for users familiar with SQL and scripting. For real-time processing, systems such as Spark Streaming and Storm provide distributed, fault-tolerant processing of incoming streams.

Here is an architecture of an Enterprise data warehouse integrated with big data technologies :



The top portion of the architecture diagram shows a traditional BI system with a staging database, an ODS, an EDW and various components of a BI system. The middle portion of the diagram shows big data technologies to handle the large volume of unstructured data coming from social media, weblogs, blogs, etc. It contains storage components such as HDFS/HBase and processing components such as MapReduce/Spark. Processed data can be loaded into the EDW or accessed directly using low-latency systems such as Impala. The bottom portion of the diagram shows stream processing. It consists of messaging frameworks such as Kafka or Flume and real-time stream processing systems such as Storm or Spark Streaming.


Friday, July 11, 2014

Java 8 lambda expressions

What are lambda expressions :

Lambda expressions are one of the new features introduced in Java 8. Java is a strongly typed object oriented language, and functions were never first class citizens of the language. The introduction of lambda expressions in Java 8 enables treating functionality as a method argument. Lambdas do not actually allow functions to be passed around; rather, they provide a shortcut for creating an anonymous inner class containing a single method. So, lambda expressions are about writing concise and clear code. They also encourage functional style thinking.

How to use Java8 lambdas :

    1. Let’s look at an example of collection sorting: Collections.sort(List list, Comparator c)
    2. In Java 7, you could either create a separate class which implements the Comparator interface or create an anonymous inner class. Both of these solutions are verbose and look cumbersome. Here is an example of using an anonymous inner class to sort cityList by name length :
Collections.sort(cityList, new Comparator<String>() {
    @Override
    public int compare(String s1, String s2) {
        return s1.length() - s2.length();
    }
});
    3. The above code can be rewritten in Java 8 using a lambda expression as follows :
Collections.sort(cityList, (s1, s2) -> s1.length() - s2.length());


Writing lambda expressions :

    1. A lambda expression does not contain a function declaration, name or return type.
    2. It looks like a function. Its input arguments need not have a type (but you can supply one if you want).
    3. The type is inferred from the context.
    4. The body of the function starts after the “->” sign.
    5. If it is a simple, single-expression function, then the keyword ‘return’ is not needed.
    6. If the body has a single statement (as in the example above), then curly braces are not needed.
    7. If there is only a single input argument to the function, then parentheses are not needed.
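The rules above can be sketched with a few self-contained examples using the java.util.function interfaces (the class and variable names here are my own, invented for illustration):

```java
import java.util.function.BiFunction;
import java.util.function.Function;

public class LambdaSyntaxDemo {

    // Single argument: parentheses are optional and the type is inferred.
    public static final Function<String, Integer> LEN = s -> s.length();

    // Multiple arguments: parentheses required, types still inferred.
    public static final BiFunction<Integer, Integer, Integer> ADD = (a, b) -> a + b;

    // Explicit argument types are allowed if you prefer them.
    public static final BiFunction<Integer, Integer, Integer> MUL =
            (Integer a, Integer b) -> a * b;

    // Multi-statement body: curly braces and 'return' are required.
    public static final Function<String, String> SHOUT = s -> {
        String upper = s.toUpperCase();
        return upper + "!";
    };

    public static void main(String[] args) {
        System.out.println(LEN.apply("Chicago")); // 7
        System.out.println(ADD.apply(2, 3));      // 5
        System.out.println(MUL.apply(4, 5));      // 20
        System.out.println(SHOUT.apply("hi"));    // HI!
    }
}
```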

Saturday, March 29, 2014

Test driven development, Automated unit testing and Continuous integration

Test driven development

Test driven development (TDD) is a development methodology where test cases drive development. You start by writing a test case, which will fail initially. You then define an interface, provide a minimal implementation for the test to pass, and iterate over the implementation. By writing test cases first, you get a better understanding of the functional requirements and hence create a better design.
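As a minimal sketch of that cycle (the class, method and test values below are invented for illustration, using plain assertions instead of a test framework): suppose we need a method that formats a price in cents as a dollar string. The checks in main() are written first and fail until format() is implemented.

```java
public class PriceFormatter {

    // Step 2: the minimal implementation, written only after the tests below existed.
    static String format(long cents) {
        return String.format("$%d.%02d", cents / 100, cents % 100);
    }

    // Step 1: the tests -- written first, they fail until format() is implemented.
    public static void main(String[] args) {
        check(format(0).equals("$0.00"));
        check(format(5).equals("$0.05"));
        check(format(1999).equals("$19.99"));
        System.out.println("all tests pass");
    }

    static void check(boolean ok) {
        if (!ok) throw new AssertionError("test failed");
    }
}
```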

Unit testing 

The cost of fixing a bug goes up exponentially based on when it is discovered. If a developer finds a bug while coding, it is very cheap to fix. If the bug is found by QE, it is more expensive to fix. If a customer finds it, it is an order of magnitude more expensive to fix. Hence, even though there is an upfront cost to creating unit tests, it is easily recouped in a short period of time. Apart from cost savings, there are many other benefits to unit testing. It makes you create modular and cleaner interfaces. It increases your confidence in the product. It is especially helpful in ensuring functional integrity if your code base is changing frequently.

Automated unit testing and Continuous integration

Automated unit testing is complementary to test driven development and can be carried out through a continuous integration process. It consists of a server which runs the build and then all unit tests, either periodically or after every check-in. In addition to automated unit testing, the continuous integration process can be set up to run a variety of static code analysis tools for finding bugs, checking code style, etc. It helps in monitoring and improving the quality of the code. Continuous integration also brings the confidence that your code is ready to ship.

Sunday, October 6, 2013

Useful git commands

Here are some useful git commands :

File and directories related commands :
git add filename.txt  -> adds file to staging area index
git status -> shows the status of files in staging area
git reset HEAD filename.txt  -> removes file from current staging area index
git add . -> Looks at the working tree and adds all those paths to the staged changes that are either changed or new (and not ignored). It does not stage any ‘rm’ actions.
git add -u -> looks at all the currently tracked files and stages changes to those files if they are different or if they have been removed. It does not add any new files, it only stages changes to already tracked files.
git add -A -> equivalent to git add . and git add -u
git add * -> adds all files under the current folder
git rm file1.txt -> Removes a file


Branch related commands :
git branch -a  -> lists all local and remote branches
git branch -r  -> lists only remote branches
git checkout mjk_branch1  -> Switch to mjk_branch1
git pull origin develop -> Pulls the ‘develop’ branch from origin. If it was already pulled earlier, then it gets the latest content

Changing branches : Let’s say you are working on branch1 and have made some changes. Now, without checking your code in, you want to switch to another branch. Here are two ways of doing it :
If you want to discard the changes :
git reset --hard HEAD
git checkout newbranch
If you want to keep the changes:
git stash save
git checkout new_branch
//do code changes
git checkout old_branch
git stash pop

Check-in corrections :
Removing a mistakenly added file from staging : git reset
Moving mistakenly committed files in the local repo back to staging : git reset --soft HEAD^. Next, you can also remove them from staging using git reset HEAD

Cleaning up a branch and removing all untracked files :
git reset --hard HEAD (restores the branch back to the last commit)
git clean -f -d (deletes all untracked files and directories in the current workspace)
git pull origin  

Sunday, January 6, 2013

Multiple Output files from mapper

    Typically, in a map reduce job, you generate one output file per job. However, if you want to write to multiple files, you can use MultipleOutputs in your mapper.

    Here are the four things you need to do : 

    Declare MultipleOutputs (mos) variable
    Override setup and instantiate mos
    Override cleanup and close mos
    In the map method, call mos.write() instead of context.write().
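Putting the four steps together, a mapper might look like the following sketch (this uses the org.apache.hadoop.mapreduce API; the named outputs, key/value types and the ERROR-prefix logic are my own assumptions for illustration):

```java
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

public class SplitterMapper extends Mapper<LongWritable, Text, Text, Text> {

    // 1. Declare the MultipleOutputs (mos) variable.
    private MultipleOutputs<Text, Text> mos;

    // 2. Override setup() and instantiate mos.
    @Override
    protected void setup(Context context) {
        mos = new MultipleOutputs<Text, Text>(context);
    }

    // 4. In map(), write to a named output via mos instead of context.write().
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        if (value.toString().startsWith("ERROR")) {
            mos.write("errors", new Text("error"), value);   // lands in errors-m-*
        } else {
            mos.write("normal", new Text("normal"), value);  // lands in normal-m-*
        }
    }

    // 3. Override cleanup() and close mos, or buffered output may be lost.
    @Override
    protected void cleanup(Context context)
            throws IOException, InterruptedException {
        mos.close();
    }
}
```

Note that each named output ("errors" and "normal" above) must also be registered on the job with MultipleOutputs.addNamedOutput() before the job is submitted.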

Thursday, November 22, 2012

Hadoop distributed file system (HDFS)

HDFS is a distributed file system, loosely based on the Google File System, and is one of the core components of the Hadoop system. A Hadoop cluster consists of a master name node and data nodes. The master name node stores metadata information about the blocks, and the data nodes store the actual data. In the world of distributed computing, hardware and network failures can easily occur. Hence, it is essential that the system has built-in fault tolerance. HDFS provides fault tolerance by distributing data across multiple nodes, so even if one of the nodes goes down, the data is not lost. As data gets loaded into HDFS, it is split into blocks, typically 64 MB or 128 MB, and these blocks are replicated across multiple nodes (the default is three). From a user perspective, HDFS abstracts the networking aspect away from end users. So, even though the data is spread across multiple nodes, users do not have to worry about networking and other low-level infrastructure code. The data is local to the nodes and the nodes themselves do not talk much with each other. Essentially, it is a shared-nothing architecture.
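As a back-of-the-envelope illustration of the block layout (the file size here is just an example, not Hadoop API output): a 1 GB file with a 128 MB block size is split into 8 blocks, and with the default replication factor of 3 the cluster stores 24 block replicas in total.

```java
public class HdfsBlockMath {

    // Number of blocks needed to hold a file of the given size (ceiling division).
    static long numBlocks(long fileSizeBytes, long blockSizeBytes) {
        return (fileSizeBytes + blockSizeBytes - 1) / blockSizeBytes;
    }

    public static void main(String[] args) {
        long oneGb = 1024L * 1024 * 1024;     // 1 GB file
        long blockSize = 128L * 1024 * 1024;  // 128 MB block size
        int replication = 3;                  // HDFS default replication factor

        long blocks = numBlocks(oneGb, blockSize);
        System.out.println("blocks:   " + blocks);                // 8
        System.out.println("replicas: " + blocks * replication);  // 24
    }
}
```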

Key characteristics of HDFS :

Write once.
Does not support random writes.
Optimized for streaming reads.
Works well for a small number of large files.

Here are some examples of HDFS commands :

hadoop fs -ls
hadoop fs -mkdir hdfsTest
hadoop fs -copyFromLocal SherlockHolmes.txt hdfsTest/SH.txt
hadoop fs -cat hdfsTest/SH.txt | tail -n 2
hadoop fs -rm hdfsTest/SH.txt
hadoop fs -rmr hdfsTest

Monday, November 12, 2012

Pig Latin

Pig is a data flow system for Hadoop. It provides a way to execute MapReduce jobs without writing code in Java. Pig comes with a scripting language called Pig Latin and provides an abstraction on top of MapReduce. The Pig interpreter, which runs outside of the Hadoop system, decomposes a Pig Latin script into MapReduce jobs and submits them to the Hadoop cluster. Pig Latin is suitable for people familiar with scripting languages. While both Hive and Pig provide an abstraction on top of Hadoop MapReduce, one big difference between the two is that Pig does not have any concept of metadata. Pig loads datasets that can be manipulated using Pig Latin scripts. Pig Latin scripts can be used to perform complex processing such as joins, group by, order by, etc., using simple constructs. Users can also create custom user defined functions and use them in Pig Latin scripts. There is an open source Pig function library called Piggybank that can be downloaded freely.