So much talk – but so far not much action. When it comes to big data, everybody agrees on its business potential. However, many projects have yet to get past the pilot stage and still look set to remain in the ‘blue-sky thinking’ slot at the end of the CIO’s annual presentation. Why? Because, despite all the discussions, round tables and seminars, managing and analysing large volumes of data in different forms from multiple sources remains a complex challenge.
Implementing and configuring a big data environment can take months. Yet the window of opportunity won’t stay open forever and now’s a good time to learn from the early adopters to avoid some of the pitfalls. So here are a few tips on how to get started – and do it effectively:
- Big data can be small: We’ve all got hung up on data volumes, but often it’s the variety of data that is the real challenge. You may have smaller data sets to exploit, but a broader range of sources and formats to deal with. Be sure to identify all relevant sources, however small, and don’t assume you have to scale your data computing cluster to hundreds of nodes straight away.
- All data is valuable: Transactional data used or generated by business applications is the obvious type to use. However, don’t forget the data hidden on servers, desktops or manufacturing systems, or in log files – often referred to as ‘dark data’. There is also another, even more obscure, type: data generated by sensors and logs and usually purged after a certain time, the ‘exhaust fumes’ of your operations. Deploy collection mechanisms for both these data types so that they too contribute value.
- Some data can stay put: Hadoop is a great storage resource for large data volumes (and it is itself distributed across clusters). But think ‘distribution beyond Hadoop’. You don’t always need to duplicate and replicate everything. Some data already sits in the enterprise data warehouse and can be accessed quickly there. Some of it might be better off staying where it was produced. The ‘logical data warehouse’ concept applies in the big data world just as it does in traditional data management.
- Explore new processing resources: Hadoop is not only a repository. It is also an engine that gives businesses the ability to process data and extract meaningful information. A broad ecosystem of tools and programming paradigms exists to cover all common data manipulation use cases. From MapReduce to Spark, and from Pig to SQL-on-Hadoop, there are processing resources available that eliminate the need to move data off the platform.
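To make the MapReduce paradigm mentioned above concrete, here is a minimal word-count sketch in plain Python. It is illustrative only: the sample log lines and function names are invented for this example, and a real job would run distributed across a Hadoop cluster (via the MapReduce API, Spark or similar) rather than in a single process.

```python
from collections import defaultdict

def map_phase(records):
    # Mapper: emit a (word, 1) pair for every word in every record,
    # mirroring what a Hadoop mapper would emit for each input split.
    for record in records:
        for word in record.split():
            yield (word, 1)

def reduce_phase(pairs):
    # Reducer: group the emitted pairs by key and sum the counts,
    # as the reduce step would do after the shuffle/sort.
    counts = defaultdict(int)
    for key, value in pairs:
        counts[key] += value
    return dict(counts)

# Illustrative input standing in for log data stored on the platform
log_lines = ["error timeout", "error retry", "ok"]
word_counts = reduce_phase(map_phase(log_lines))
print(word_counts)
```

The point of the paradigm is that the same map and reduce functions can run where the data lives, so only small aggregated results – not the raw data – ever leave the cluster.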
- Just get started: The best way to learn big data is to experience it. There are now sandbox platforms available that provide out-of-the-box virtual big data environments with all the tools needed to start integrating big data straight away. These can include video tutorials, pre-built connectors for building prototypes and an open online community, all helping users to start scoping tasks and generating code with graphical tools that are far faster than hand coding.
As the value of big data becomes more widely exploited, the market is responding. These offerings look set to play a valuable role in moving projects from the sandbox into production – and in doing it rapidly, so users can start to reap the benefits straight away. It could be time for big data implementations to finally see the light of day.
Article written by Yves de Montcheuil, Vice President Marketing at Talend
Join us at Big Data Analytics 2014, 13th November, Hotel Russell London where you can meet with Talend and discuss ongoing data analytics strategies within your organisation. Contact us to find out how you can attend the event.