
Leveraging Data Lakes for Content Creation | Guest Blog

What is a Data Lake?

In the early days of data, the computers of the 1950s–70s collected data from physical media (punched cards, keyboard input, and tapes) and handled it in memory, a virtual environment.

In the late 1970s, computers started to cluster data for processing within worksheet-like structures, the grandfathers of databases.

Next-generation databases followed. The logic behind them was for data to be entered in a logical manner according to prior human classification: although raw data (in the sense that it hadn’t been previously treated) went into the database tables, the decision of what would fit where, and why, was based on human criteria.

Databases are in fact a limited tool, since they require specific contexts. A database for accounting data must be a distinct entity from a database for logistics data (although they may need to connect and have mutual referencing structures).

So most of the systems and tools in operation today still have their own database structures, which interconnect and exchange data to create added-value information.

The database evolved into larger clusters of like data, the data warehouses. Although the process may sometimes differ, data is grouped in the same manner while pertaining to a common context.

With the advent of Big Data (for data tends to grow exponentially) and innovation in data handling and processing tools, we have reached a stage where large flows of several types of data can be channeled into extremely large repositories (Data Lakes) in an ad-hoc manner, which nevertheless allows correlations to be established under pre-defined “schemas” from which meaningful information is extracted.

The main leverage of Data Lakes stems from the following characteristics (a short sketch follows the list):

  • While Data Warehouses store structured, already-processed data, Data Lakes also store semi-structured and unstructured data, as well as completely raw and (apparently) unrelated data. And no, it’s not a waste of time and space, since most of the time relationships are only found after processing.
  • Data Warehouse architectures make large data volumes expensive, whereas Data Lakes are specifically designed for low-cost storage.
  • Data Warehouses obey a pre-defined structure in which data is stored (pre-defined data classes and families), while Data Lakes have great flexibility for dynamic structural configuration. That is closer to the manner in which humans gather and process information.
  • Data which goes into Data Warehouses is mature (it has a clearly defined context), while Data Lakes store data that is still maturing (still finding its place in the overall context).
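The last two points are what practitioners often call “schema-on-read”: structure is imposed when the data is used, not when it is stored. Below is a minimal, hypothetical Python sketch of the idea (the sources, field names, and records are invented for illustration):

```python
import json

# Ingest: raw, heterogeneous records land in the lake exactly as they arrive,
# with no upfront schema imposed on them.
raw_records = [
    '{"source": "crm", "customer": "ACME", "interest": "widgets"}',
    '{"source": "weblog", "ip": "10.0.0.1", "path": "/new-product"}',
    '{"source": "crm", "customer": "Globex"}',  # missing fields are fine
]

def apply_schema(record, wanted_fields):
    """Read-time 'schema': project a raw record onto the fields one analysis needs."""
    return {field: record.get(field) for field in wanted_fields}

# Read: impose structure only now, for one specific question about CRM data.
crm_view = [
    apply_schema(parsed, ["customer", "interest"])
    for parsed in map(json.loads, raw_records)
    if parsed.get("source") == "crm"
]
print(crm_view)  # [{'customer': 'ACME', ...}, {'customer': 'Globex', 'interest': None}]
```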

By now you should be realizing that, without Data Lakes, IoT or even AI would be heavy, slow, or even impossible technologies to implement cost-effectively.

Think of it as an actual lake, into which data flows from several streams (M2M, log files, real-time data collection, CTI … you name it), and on which you can then run the appropriate analytics, thereby extracting added-value information.

Accessing data from different sources, such as DB2-based systems, SAP, or any other IT system, requires both specific tools and licensing, which represents time and money.

Extracting meaningful, added-value information out of those distinct IT environments means creating a detailed map of WHAT data is relevant to WHICH requirements, and then developing code that can gather such data in the appropriate sequence to produce new, meaningful content.

Let’s imagine an example where your company needs to conduct a market analysis on the trends and tendencies of a specific group of potential clients, to whom you wish to promote a new item in your portfolio.

Such an exercise, if done over your corporate systems’ existing databases or Data Warehouses, implies getting accurate answers to the following questions:

  • Clearly identifying WHAT characterizes valid data, meaning data pertaining to prospects with high potential of becoming clients of the new product or service entering the corporate portfolio.
  • WHERE in the different corporate systems’ Data Warehouses does such data exist?
  • WHICH types or families of data have the most potential for assertively qualifying the prospects with the highest potential?
  • HOW shall the code that enables such accurate data collection and processing be developed (programming languages, processing logic leading to the stepwise creation of added-value content, and so on)?
  • WHAT computing and storage resources are required to enable the fastest possible processing of all the data that may be found relevant?
  • WHEN could we expect the first results?
  • WHAT percentage of “false positives”, deviations, and errors can be expected?
  • HOW much will it cost? Is the wait worthwhile, or will the optimum time to go to market have passed by the time we finally get the information we need?

Imagine there were a cost-effective way to do it: cross-referencing the data you hold about those prospects in your corporate systems (including saved phone conversations) with a “photographic profile” of each of them extracted from, say, social media, pouring everything into one single, dynamically scalable repository (with no need to constantly move data around between existing systems), within just a few hours or even minutes. What would that represent for the potential of successfully launching your new product or service?

A Data Lake needs to embody the ISASA concept, which specifically represents the ability to do the following (a minimal sketch follows the list):

  • Ingest data from several distinct flow streams through appropriate APIs or batch processes.
  • Store dynamically growing, unpredictably sized amounts of data in scalable repositories (the Lake) through all necessary protocols (NFS, CIFS, FTP, HDFS, and others).
  • Analyze the data by finding the relevant correlations according to your needs and expectations.
  • Surface relevant information in a user-friendly manner that conveys what you need to see in the most straightforward and effective way.
  • Act in the most efficient and cost-effective manner, leading you to reach the intended goals.
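As a rough illustration, here is the ISASA flow as a minimal Python skeleton. The function names and the toy in-memory “streams” are hypothetical; a real implementation would sit on top of the protocols and tools discussed below:

```python
def ingest(streams):
    """Ingest: pull records from several distinct flow streams (APIs, batches)."""
    for stream in streams:
        yield from stream

def store(records, lake):
    """Store: append raw records to the scalable repository."""
    lake.extend(records)
    return lake

def analyze(lake, predicate):
    """Analyze: keep only the records relevant to the current question."""
    return [record for record in lake if predicate(record)]

def surface(findings):
    """Surface: present results in a user-friendly form."""
    for finding in findings:
        print(f"-> {finding}")

def act(findings, handler):
    """Act: trigger the follow-up that reaches the intended goal."""
    for finding in findings:
        handler(finding)

# Toy usage with two in-memory "streams":
lake = store(ingest([[{"type": "log", "ok": True}],
                     [{"type": "crm", "ok": False}]]), [])
surface(analyze(lake, lambda record: record["ok"]))
```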

The Process

A Data Lake is not the solution to your problem; rather, it is the IT landscape that will allow you to get there fast and accurately. That means, as in most cases in life, that you need to begin by clearly understanding the intended outcome and defining concrete, tangible goals, so:

  1. Start by determining your Business Objectives
  2. Define and collect the data that will enable you to reach your Business Objectives
  3. Identify what Success will look like for you

In the above-mentioned case of your company going after clients for its new product or service, these would translate to:

  1. Assertively leveraging the value proposition towards the prospects.
  2. Targeting the right prospects (the ones most likely to buy the product or service and therefore become clients).
  3. Increasing sales and revenue.

Data Lake Architecture

Still considering our case, a company launching a new portfolio item, the structure that needs to be set in place resembles the layers described below.

Data Sources

The Data Sources Layer provides data streams from corporate systems or other sources of both structured and raw data, via APIs or other plug-ins, regardless of the origin’s storage format.

Processing and Storage Layer

The processing layer starts with security over the data streams, which means:

  • Visibility classification and control (Authorizations and Permissions)
  • Multi-Level stratification: Clustering, DataSets, and Attributes
  • Labeling towards highly sensitive classes of data
  • Authorization Views

An appropriate processing structure needs to be defined, one which allows adequate processing of the required data and the extraction of valuable information. Here, the following steps need to be defined (a sketch of a few of them follows the list):

  • Processes and rules for establishing joins, aggregations, and correlations
  • Which flows will be streaming-based and which will be batch-based
  • Error tracking and tracing, registry, and resolution (data exceptions, error logs, correction tables)
  • False-positive detection (wrong data types, duplicated data, masking)
  • The specific transformation algorithms required to produce business-focused new data
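To make a couple of these steps concrete, here is a small, hypothetical pandas sketch: it deduplicates records, coerces a wrongly typed field (sending failures to an error log), and then performs a join. All column names and values are invented for illustration:

```python
import pandas as pd

crm = pd.DataFrame({
    "client_id": [1, 2, 2, 3],                 # client 2 is duplicated
    "interest_score": [0.9, 0.7, 0.7, "n/a"],  # client 3 has a bad data type
})
billing = pd.DataFrame({
    "client_id": [1, 2, 3],
    "open_payments": [0, 1, 0],
})

# False-positive detection: drop duplicates, then coerce and flag wrong types.
crm = crm.drop_duplicates(subset="client_id")
crm["interest_score"] = pd.to_numeric(crm["interest_score"], errors="coerce")
errors = crm[crm["interest_score"].isna()]      # would go to the error log
clean = crm.dropna(subset=["interest_score"])

# Join + correlation: combine interest with billing status.
joined = clean.merge(billing, on="client_id")
print(joined[joined["open_payments"] == 0])     # candidates with no open payments
```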

Integration Layer

Once valuable information has been extracted from processing the Data Lake’s content, it must be integrated back into corporate systems and/or made available to several types of user profiles, who, according to their roles and responsibilities, will use it to gain efficiencies; this is the role of Integration and Governance.

User Interfacing Layer

Distinct user profiles will need to visualize data and information in different manners: some more holistic, others more granular, exposing details and correlations.

Support Tools

Although Data Lake content processing could be achieved via entirely “homemade” software tools, developing the entire required suite would be a huge project in itself, every time an analysis requiring a Data Lake is in order.

Fortunately, there are several “off-the-shelf” tools that can simply be configured for the stages and layers of the Data Lake workflow mentioned above, thereby minimizing the internal effort of developing code for the required functionality.

A bit of advice: tools are sprouting like flowers in springtime. Choose no more of a toolkit than your specific needs require, and never two tools that perform the same role.

Best Practice Towards Implementing a Data Lake

Here are the key best practices when setting up the endeavor of creating a Data Lake:

  • Focus on the architecture that best addresses the available data:
  • The Data Lake needs to be created according to the available data sources and types, not according to what is required. Although this may seem counterintuitive, the final shape of your data structure is not fully foreseeable when you start the process.
  • The Data Lake design must be guided by the types of data you will have at your disposal during processing, not by what you expect to be useful.
  • Just as the only way to eat an elephant is to split it into small portions, you should do the same here: Discovery & Ingestion, Storage & Administration, Quality & Transformation, and Visualization & Interaction must be addressed separately.
  • Focus on native data types:
  • All architecture components must be able to deal with data in its native format.
  • All architecture components must be set in place fully aligned with the processes and procedures inherent to the specific market vertical the Data Lake will address.
  • Service and governance:
  • You must ensure that the Data Lake’s structure and processing are capable of fast data ingestion regardless of source, so as not to create a bottleneck at the very beginning of the entire process.
  • Data profiling, security, and policies must be defined in advance, as must tagging, correlation rules, and workflows.
  • Querying:
  • Strong, agile querying processes must be defined and set in place, or the Data Lake will not be responsive enough for your time-to-market requirements.
  • Proper “guidelines” (algorithms and coding) that allow swift discovery of correlations and unification criteria must be clearly defined in advance.
  • Analytics that mirror human expectations and goals are the most relevant, as they allow the gathering, clustering, and correlation of data that meets those human-driven goals.
  • Metadata:
  • A proper metadata catalog needs to be defined in advance, so the Data Lake has the capacity to properly allocate data as it is being ingested (see the sketch after this list).
  • The process must be fully automated in order not to cause errors or processing blockages.
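As a rough illustration of that automated allocation, here is a tiny, hypothetical metadata catalog in Python. The dataset names, zones, and sensitivity labels are all invented; the point is simply that ingestion consults the catalog with no human in the loop:

```python
# Hypothetical metadata catalog: each known dataset is described in advance,
# so ingestion can allocate it automatically.
CATALOG = {
    "crm_export": {"zone": "customer", "sensitivity": "high", "format": "csv"},
    "web_logs":   {"zone": "behavior", "sensitivity": "low",  "format": "jsonl"},
    "call_audio": {"zone": "customer", "sensitivity": "high", "format": "wav"},
}

def ingest(dataset_name, payload, lake):
    """Route a dataset into the lake using catalog metadata, fully automated."""
    meta = CATALOG.get(dataset_name)
    if meta is None:
        # Unknown data still lands in the lake, quarantined for later tagging,
        # rather than blocking the whole ingestion process.
        lake.setdefault("quarantine", []).append((dataset_name, payload))
        return
    lake.setdefault(meta["zone"], []).append({"name": dataset_name, **meta, "data": payload})

lake = {}
ingest("crm_export", b"...", lake)
ingest("mystery_feed", b"...", lake)
print(list(lake))  # ['customer', 'quarantine']
```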

Bottom line, you must “teach” the Data Lake to “think”.

Data Lake “Core Structures”

How is it possible for Google or Facebook to deal with such colossal volumes of information simultaneously?

Do you realize that data will increase exponentially as the IoT expands? We are now gathering data from cell phones, social media, cars, media, coffee machines (!), and so many other sources that were unthinkable just five years ago.

How will we be able to correlate data from such a huge number of streams in a manner that allows us to “see the big picture” in a way never seen before?

Hadoop

At the beginning of our 21st century, Google was becoming unable to scale its “traditional” database engines and processing capabilities in any manner that could cope with the exponentially growing amount of new data flowing in every second. So a new approach was developed: an algorithm called MapReduce, which started by splitting and channelling data as it arrived towards separate processing capacities according to its classification. XML, for instance, would be forwarded to dedicated processing clusters, while text, SQL, logs, objects, and so on would each flow to their own dedicated clusters.
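To give a feel for the idea, here is a minimal single-process Python illustration of the MapReduce pattern (a sketch of the concept, not Google’s or Hadoop’s actual implementation): map emits key/value pairs, a shuffle groups them by key, and reduce aggregates each group independently, which is what makes the work trivially parallel.

```python
from collections import defaultdict

def map_phase(record):
    # Map: emit a (key, value) pair for every word in the record.
    for word in record.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    # Shuffle: group all values by key, as the framework does between phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate each group independently (hence trivially parallel).
    return {key: sum(values) for key, values in groups.items()}

records = ["data lakes store raw data", "raw data flows into the lake"]
pairs = (pair for record in records for pair in map_phase(record))
print(reduce_phase(shuffle(pairs)))  # e.g. {'data': 3, 'raw': 2, ...}
```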

This approach was later used to create an open-source initiative that could foster the evolution of this new “tech” fast enough to keep up with the growing volume of data streams, and Hadoop was born.

So Hadoop allows data to be processed in parallel, in extremely large volumes, while maintaining existing correlations and establishing new, meaningful ones.

Hadoop, however, was heavily dependent on coding, in Java, to adapt it to each specific new context. To overcome that time-consuming endeavor, the market saw the rise of the previously mentioned tools (Kafka, Hive, Spark …).

Cassandra

If we said that Cassandra is Facebook’s SQL-based Hadoop, we wouldn’t be far from being accurate.

Having had its development based on a database engine architecture, Cassandra works by default in multi-node clustering mode. The data is therefore not “merely” spread across processing nodes to split the workload; the spreading is done at the database engine level, so the processing is not burdened with that additional task.

This allows processing capacity to be automatically scaled just by adding new nodes to the DB cluster.

The cluster nodes are all primary nodes: there are no master and slave instances. All have the same degree of priority and of command and control over the data handling process, which assures full fault tolerance in case of node downtime or communication interruptions.
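As a hedged sketch of what this looks like in practice, here is how one might connect to such a cluster and create a replicated keyspace with the DataStax Python driver (pip install cassandra-driver); the host names, keyspace, and table are hypothetical:

```python
from cassandra.cluster import Cluster

# Contact any peer nodes; there is no master to single out.
cluster = Cluster(["node1.example.com", "node2.example.com"])
session = cluster.connect()

# Replication factor 3: every row is stored on three peer nodes, which is
# what provides the fault tolerance described above.
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS lake
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3}
""")
session.execute("""
    CREATE TABLE IF NOT EXISTS lake.prospects (
        client_id int PRIMARY KEY,
        interest_score float
    )
""")
session.execute(
    "INSERT INTO lake.prospects (client_id, interest_score) VALUES (%s, %s)",
    (42, 0.9),
)
```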

What Sets Them Apart

Hadoop was designed to deal with large amounts of unstructured data, with the capacity to process it in parallel and deliver results in a fast, effective manner. For that purpose it is usually implemented within a single IT landscape, which further enhances its operational approach to data processing and management.

Cassandra, on the other hand, was designed for redundant multi-node, multi-geography, simultaneous fault-tolerant parallel data processing; nevertheless, it requires a structured data model before implementation.

This means that Hadoop is very useful for data ingestion, analytics, and initial processing, while Cassandra can enter the workflow later, assuring further detailed processing, security, storage, and consultation anytime and anywhere, once the data has been cataloged in a structured manner.

Leveraging Business

Coming back to our company launching its new product, and considering an example that comprehends a complete range of available data about existing clients: how can a Data Lake, supported on this technological landscape, support the several teams in their roles?

The Data Lake will collect a full data catalog about the clients: invoicing terms, overdue payments, preferences (via recorded phone conversations, the CRM system, or social media), product handling capacity (from the SCM corporate system), and so on.

The data will then be processed to identify high-potential prospects for the new product or service. For example: who has no open payments towards the company, has demonstrated an interest in features that the new product will deliver, and has the SCM capacity to deal with it?
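As a toy illustration of that qualification rule, assuming the three signals have already been surfaced from the lake into one record per client (the field names and companies are invented):

```python
clients = [
    {"name": "ACME",    "open_payments": 0, "showed_interest": True,  "scm_capacity": True},
    {"name": "Globex",  "open_payments": 2, "showed_interest": True,  "scm_capacity": True},
    {"name": "Initech", "open_payments": 0, "showed_interest": False, "scm_capacity": True},
]

def is_high_potential(client):
    # No open payments + demonstrated interest + SCM capacity to handle it.
    return (client["open_payments"] == 0
            and client["showed_interest"]
            and client["scm_capacity"])

prospects = [c["name"] for c in clients if is_high_potential(c)]
print(prospects)  # ['ACME']
```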

This information can then be delivered in different formats: either CxO/management-style reporting, or on demand, at a click, whenever a client phones the Customer Support hotline or the company’s Sales and Marketing department contacts the prospect, immediately identifying that current client as a high-potential prospect for acquiring the new product or service.

How valuable would it be if, any time you contact a client or a client contacts you, your installed IT intelligence automatically and accurately informed you of the additional business potential of that client?

Where are we Heading?

This is a new field of computing which most large corporations do not yet use, since their entire IT landscape (which has evolved over decades) is necessarily constituted by systems in “silos” that interchange data via APIs or other interconnecting channels. Several expectations therefore arise:

  • Can a Data Lake ease the burden of the IT landscape, by making it either less time-consuming or less complex?
  • Will we be able to apply multiple “schemas” over one given Data Lake simultaneously, and by doing so leverage multiple “efficiencies” over existing data?
  • Will we finally be able to reach “cross-organizational” data? Instead of Accounting accessing one set of data about a given client or partner, Logistics accessing other data about the same third party, and so on, everyone would access a fully integrated data profile (according to profile permissions). Since the Data Warehouse did not do the trick, isn’t it possible that a Data Lake is just a new name for Data Mart version 2.0?

The entire concept of capturing raw data and applying distinct schemas to it (logical, circumstantial processing) is a very powerful one. In fact, it is precisely what our brain does: we gather raw data and, based on our knowledge (rules gathered through past experience and learning), we are able to “issue” a new set of data that adds value to what we have collected and stored as “knowledge”.

And this … has everything to do with Artificial Intelligence.

 

This guest IT blog is from TenFold.com. We are not compensated by TenFold; we are sharing this company as a tool for our readers when they require an AI solution for customer management. We believe harnessing collective data and using it to create content is a useful “factory seconds” outcome of such software.

Earth Day Data Dashboards

An Opportunity to Learn the Impact of Data Dashboards on Educational Marketing

Do you know how a Data Dashboard can push your environmental educational marketing efforts to the next level? We do! In honor of Earth Day, we are sharing some of our favorite earth-centric Data Dashboards that you can take notes from, model after, and use to create your own, to do things like promote species conservation, educate on sustainable development, or build content for travelers who buy ecotourism packages.

What are data dashboards?

Data dashboards are a way to visualize key data that numbers and statistics alone cannot convey to a general audience. Commonly, data dashboards are used for making business decisions: they model data in a way that lets decision makers visualize the information they need in order to make data-based, educated decisions. Often these business dashboards display KPIs (key performance indicators) and key data points across company departments that drive business decisions. Taking this same methodology and applying it to important worldwide topics and issues, such as the health of the planet, of particular species and ecosystems, and of environmental and sustainable development, can help educate and market important efforts.
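As a minimal illustration, here is a hypothetical sketch that turns a couple of environmental KPIs into a simple two-panel visual with matplotlib. The numbers are invented placeholders; a real dashboard tool would add the interactivity shown in the examples below:

```python
import matplotlib.pyplot as plt

# Invented placeholder KPIs for an urban reforestation program.
months = ["Jan", "Feb", "Mar", "Apr"]
trees_planted = [120, 150, 90, 200]
co2_tons = [3.1, 3.4, 2.8, 4.0]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.bar(months, trees_planted, color="forestgreen")
ax1.set_title("Trees planted")
ax2.plot(months, co2_tons, marker="o", color="steelblue")
ax2.set_title("CO2 sequestered (tons)")
fig.suptitle("Urban reforestation KPIs")
fig.tight_layout()
plt.show()
```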

In honor of Earth Day, I would like to spotlight some great earth-centric data dashboards created by different designers who showcase their creativity in inspirational ways. Learn from them and begin developing your own interactive online tools for your audience.

Raise Awareness, Raise Funding

Raising awareness and education for a species, cause, or non-profit effort can start with a dashboard. Educational facilities, aquariums, animal rescue centers, or any related entity can model efforts on this beautifully built dashboard by Jonni Walker exploring hawk migration. In this data dashboard of the Swainson’s Hawk you can see the use of map interactivity, video, and well-displayed visuals and data. The chart is clickable, highlighting the migration journey of the Swainson’s Hawk through the Americas. By creating tools like this, you can raise the interactivity factor of your content while raising awareness of a particular cause or species. You can even add an interactive component to your dashboard, soliciting donations or data-collection participation to continue the research and support efforts of your program.

Environmental and Sustainable Development

Urban forests are of key value to city dwellers, and large green spaces promote the sequestration and natural breakdown of common hazardous urban pollutants like ground-level ozone, sulfur dioxide, nitrogen dioxide, and carbon monoxide. Displaying these key-to-health urban forests and developing an educational program for citizens can allow city planners and eco-development businesses alike to garner the support of residents and investors to opt in to sustainable development. Using these interactive tools, companies can expand their innovations to inspire a wider audience to live better, be wiser, and invest intelligently in our cities, energy usage, and homes. It would be interesting to see innovative prefab sustainable home companies like Hive Modular, Living Homes, and IdeaBox drive educational efforts like urban reforestation, combining education and sales to develop healthier cities and homes.

In this example, Urban Forest of NYC, authors Ann Jackson and Josh Jackson have developed a hover-over map that shows tree species locations, as well as a bubble chart that updates carbon sequestration data.

Digitizing Your Ecotourism Content & Promoting Conservation Efforts

Ecotourism is a huge and growing sector of the travel industry. With growth in this more educationally driven travel cohort, more and more ecotourism companies are upping their educational game and turning to data dashboards to teach ecotourists in more interactive and clever ways. This data dashboard, Endangered Safari by RJ Andrews, interactively shows the population stability and endangered status of African mammals. It is a perfect example of a digital educational tool that travelers can pull up on their smart devices during and after a tour, learning as they journey along. It is also content that will be highly shared among educational institutions, museums, and special interest groups related to your travel agency’s conservation efforts.

We hope you have learned a little bit more about how Data Dashboards can be used to help educate and market your environmentally focused efforts. If you need help Managing your Data Dashboard Projects, contact us. We’re always happy to geek out with your content and grow your efforts!

 

Give us a shout: info@armarketinghouse.com


Factory Seconds: Easy Data, Unique Stories

Waste not, want not — Big Marketing Plans with Leftover Factory Seconds Data.

Dredging up unique, shareable content to keep your audience engaged is a constant struggle for many companies and non-profits. Using factory seconds can solve your content crisis and boost your uniqueness factor.

What are Factory Seconds? Factory Seconds is a retail term for imperfect products that cannot be sold at full price to the primary buyer, so a secondary stream is created to sell these imperfect items at a lower cost through a secondary channel (think Marshalls, Ross, or any other discount store).

What are Factory Seconds Data? Taking the idea of Factory Seconds and applying it to data means any data that you collect as a by-product of doing business. You can repurpose this data and develop it into unique, highly shareable stories that improve your reach and ROI. For example, let’s look at an ice cream shop. As a by-product of the data collected in sales receipts, the shop knows which ice cream flavors are popular in which seasons and on special holidays. Yes, you can use this information to make better business decisions (Factory First Data); however, you can also turn this data into Factory Seconds by creating interesting content for marketing your ice cream shop. Do this by answering questions like: is there anything interesting about people’s choices? Is there a spike in particular flavors, or a pattern on particular days, that you can relate to weather, politics, or something else funny or interesting? Turn this into your unique content, and you will be known for the compelling stories you develop around your own Factory Seconds Data!
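To make the ice cream example concrete, here is a small, hypothetical pandas sketch that looks for seasonal flavor spikes in a point-of-sale export. The file name, columns, and the 2x-average “spike” threshold are all invented for illustration:

```python
import pandas as pd

# Hypothetical export with columns: date, flavor, qty.
sales = pd.read_csv("receipts.csv", parse_dates=["date"])

# Total quantity sold per flavor per month.
sales["month"] = sales["date"].dt.month_name()
by_month = sales.pivot_table(index="month", columns="flavor",
                             values="qty", aggfunc="sum", fill_value=0)

# A crude "spike" signal: months where a flavor sells over twice its own average.
spikes = by_month[by_month > 2 * by_month.mean()].stack().dropna()
print(spikes)  # e.g. pumpkin spice peaking in October: a story to tell
```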

Where to find your factory seconds – In order to use these nuggets of awesome content, you will need to start a catalog of the various types of data you collect, knowingly and unknowingly, and see what other types of data you have access to (Facebook page demographics, other social media where you are followed, almanacs, weather). Go beyond the demographics of your audience to see what trends are occurring. Here is a list of some places to start:

  • Patient or Client Database, CRM (Customer Relationship Management), etc.
  • Geographical Location Trends
  • Sales Receipts and Invoices
  • Events and Tradeshows in your Industry
  • Product Sales Trends
  • Services Sales Trends

Become the scientist of your industry. Start to hypothesize about occurrences you have witnessed and always had a hunch about, see whether you have the data to answer some of those questions, and then SHARE IT! People want to know trends. If they are already in the realm of wanting to know your business, they will eat this information up, and they will appreciate your special insight into your small or huge piece of the world and what it means to them. Make your information relatable; always relate it back to your audience in some way.

Example: Holistic Healthcare. Here is an industry-specific example of where you can start to harvest your Factory Seconds Data:

Your Patient Database. Whoa, whoa now. I am absolutely not implying that you break any kind of privacy laws. There are many ways to “de-identify” your patient list prior to any kind of data analysis. Here’s a link to the HIPAA-compliant de-identification process before you get started.
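As a purely illustrative sketch (and emphatically not a substitute for the full HIPAA-compliant process linked above), this is the general shape of de-identification in pandas: drop direct identifiers and generalize quasi-identifiers. The column names are hypothetical:

```python
import pandas as pd

patients = pd.read_csv("patients.csv")  # hypothetical export

# Drop direct identifiers entirely.
patients = patients.drop(columns=["name", "email", "phone", "ssn"],
                         errors="ignore")

# Generalize quasi-identifiers: full ZIP -> 3-digit prefix, age -> 10-year band.
patients["zip3"] = patients["zip"].astype(str).str[:3]
patients["age_band"] = (patients["age"] // 10) * 10
patients = patients.drop(columns=["zip", "age"])

patients.to_csv("patients_deidentified.csv", index=False)
```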

Okay, now that you feel safe again: after you de-identify your patient database, sift through some measurements and see if you can find trends or any story in your data set based on local geography, a specific demographic, a particular ailment, specific treatments, etc. Once you have listed the various stories your data can tell, develop that data into content (articles, stories, infographics, etc.). You can share this information with local or national news sources and develop yourself as a true industry leader.

The holistic healthcare industry is ripe for these types of Factory Seconds. Mainstream healthcare has widely shared studies and research facilities validating its health processes and procedures, but what about holistic practitioners? There is a huge lack of evidence-based information available to the droves of people who wish to learn more about natural paths to health and healing.

Factory Seconds are our favorite types of content to develop!  You never know what kind of amazingly unique stories you will discover from your data. Inspired to geek out now? Click the link below to start the process of developing your Factory Seconds Data into content. Or, you can always enlist our help!

 

Download Your Free Data to Content Checklist: http://eepurl.com/cEOWt1

 

We help businesses and organizations leverage their data into content. Tell your story!

 

Work with us: https://armarketinghouse.com/contact-us/request-a-quote/

 
