Posts by Masha Finkelstein:


My Summer Internship at Hortonworks

Hortonworks Summer Internship 2012

As a first-time intern, I can say without hesitation that Hortonworks was the perfect place for me to gain real-world work experience and to team up with many incredibly talented, driven people. Of course, I didn’t get to interact with everyone in the company in the three months that I was here, but even after such a short time it is clear to me that the welcoming atmosphere and the determined team are what have allowed Hortonworks to achieve so many goals in just over a year.

During this summer, I was given the opportunity to be part of something big, something that is gaining impressive momentum in the world of technology and will not be slowing down any time soon. I have learned a great deal from people who are overflowing with innovative ideas for how to use the big data of today’s world, and this has given me knowledge that I did not expect to gain from a big data company.

Throughout the course of my internship, I was able to learn about and work in various areas of marketing, with my main focus being on creating blog content. John Kreisa, the VP of marketing, was very helpful in explaining how important this was for bringing attention to Hortonworks and the best ways to do this in the social media sphere. Writing blog posts alone was a very educational experience because I was able to explore and research real-world use cases in which big data was present.

What I found most intriguing was that big data is everywhere now (and if it’s not somewhere, then it will be in a few years). Cancer research becomes efficient and provides more and more substantial results. Retail shoppers receive a personal shopping experience catered to them, and only to them. A hospital patient is provided the most up-to-date medical care and information possible. Police departments learn to harness extensive amounts of information and crime is decreased significantly. The list goes on. It isn’t at all just about big tech companies coming out with the next cool gadget (although this is important too).

So much positive change can happen with the emergence of big data. Three months ago I would have considered this a bit of a far-fetched or even cheesy statement… But now that I know about so many companies and organizations that really are learning how to harness big data in order to benefit the health and safety of people and the environment, I am truly excited for the developments that will arise in the near future, especially with the help of Apache Hadoop. Global change really is attainable.

Of course, the most impressive part of my internship, the one that really showed me just how big big data is, was Hadoop Summit 2012. I cannot stress enough how lucky I am to have been part of this event. Over 2,200 people gathered on June 13th and 14th to discuss the most recent and innovative developments in big data that were being made possible by Hadoop. I was impressed by how many people were there to educate themselves on the workings of Hadoop and its power in today’s business world. Lecture rooms were overflowing, people were exchanging ideas, connections were being made… The world was climbing the Summit of innovative progress.

Along with this, I was able to see the amount of work that went into planning such an impressive event. Sponsors, press releases, the venue itself, registration, catering, displays, the awesome party at the Tech Museum, and so many other equally important things had to be taken care of. Denise Maudru, the marketing director in charge of the event, gave me some small tasks to do before the Summit, like preparing dinner name cards and totaling up registrants and payments, but these were minuscule parts of a much bigger project. I was really impressed by how perfectly and thoughtfully Denise and the rest of the Summit planning group organized everything.

After the Summit, I got to work with Masha Finkelstein, the interactive marketing manager, on analytical and inbound marketing. She gave me a sense of how marketing flows work and I helped her create some e-mail templates and landing pages in Marketo. I learned that inbound marketing is actually a pretty sensitive aspect of marketing and has to be both perfect and efficient.

Thousands of leads have to receive the correct e-mail with the correct links. If they click on one of those links, they have to get the correct landing page, and this landing page has to take them to the next correct landing page. In my eyes, it is all really a plethora of errors waiting to happen… However, Masha showed me how to keep the errors at bay and how to simultaneously filter out unnecessary leads to move the marketing process along more quickly. After this, Steven further explained how strict sales has to be with incoming leads in order to not waste time with people who may not become Hortonworks customers in the future.

Additionally, I helped the HR team to come up with some revamping ideas for the careers page with the goal of bringing a more pronounced sense of community and culture to our website. We stemmed off the Hortonworks Dr. Seuss theme and jointly came up with some great, creative concepts which could engage potential applicants and employees and show them the fun side of Hortonworks. This project is still in the works but hopefully these ideas will come to life on the careers page very soon.

After three months, I am still far from being familiar with all of the workings of the company. Although I was a marketing intern, I still believe it is necessary to be aware of all of the nuts and bolts that make up the bigger picture. Even so, I have learned more and gained more real-world experience in these past three months than in all of my other summers combined. Sometimes it is difficult to notice change within oneself over a long period of time. Yet I realize that I have become more outgoing, less afraid, more driven by my future, and generally more aware of the world around me ever since I nervously walked through the Hortonworks doors to a smiling and welcoming Rachele on my first day as a marketing intern.

It has been a liberating, educational, and inspiring experience and I owe it to the people that have been part of it and have made it so wonderful for me.

Thanks Hortonworks!

Hadoop: Your Partner in Crime

Pre-crime? Pretty close…

If you have seen the futuristic movie Minority Report, you most likely have an idea of how many factors and decisions go into crime prevention. Yes, Pre-crime is an aspect of the future but even today it is clear that many social, economic, psychological, racial, and geographical circumstances must be thoroughly considered in order to make crime prediction even partially possible and accurate. The predictive analytics made possible with Apache Hadoop can significantly benefit this area of government security.

The essence of crime prevention is to understand and narrow down thousands of “what if” cases to a manageable and plausible handful of scenarios. Crime can happen anywhere and can be categorized as anything from cyber fraud to kidnapping, which provides a lot of combinations for possible misdemeanors or felonies. With the help of big data analytics, government agencies can zero in on certain areas, demographics, and age groups to pick out specific types of crimes and move towards decreasing the one-trillion-dollar annual cost of crime in the United States.

Zach Friend, a crime analyst for the Santa Cruz Police Department, explained that there aren’t enough cops on the streets due to insufficient funds. Not only that, but many police departments are still technologically behind in the crime-monitoring field, so big data analytics tools could be a huge step forward for police all over the country. Evidence and information about cases could be stored much more efficiently, police action could be more proactive, and crime awareness could be much more prevalent.

Who’s on the case?

The Crime and Corruption Observatory (created by the European company, FuturICT) is pushing for this kind of development and aims to predict the dynamics of criminal phenomena by running massive data mining and large-scale computer simulations. The Observatory is structured as a network that involves scientists from varying fields – “from cognitive and social science to criminology, from artificial intelligence to complexity science, from statistics to economics and psychology”.

This Observatory will operate within the framework of the Living Earth Simulator project, which is still in development – “a big data and supercomputing project that will attempt to uncover the underlying sociological and psychological laws that underpin human civilization.” The project, funded by the European Union, is an impressive advancement in technology, which will not only aid in pinpointing crime but will also effectively utilize the big data of today’s world.

PredPol has made predictive crime analytics available to police departments so that “pre-crime”, in a sense, could be put into action. Zach Friend explains, “We’re facing a situation where we have 30 percent more calls for service but 20 percent less staff than in the year 2000, and that is going to continue to be our reality. So we have to deploy our resources in a more effective way. This model does that.” PredPol allows law enforcement agencies to collect and organize data about crimes that have already happened and to use this data to predict future incidents within blocks measuring 500 feet by 500 feet. It may not be the same as knowing the exact perpetrator, victim, and cause of the crime ahead of time, as was possible in Minority Report, but it is an impressive step towards perfecting crime prediction.

The Santa Cruz Police Department, which is using PredPol’s software, has already seen significant improvements in police work. SCPD began by locating areas of possible burglaries, battery, and assault and handing out maps of these areas to officers so they could patrol them. Since then, the department has seen a 19% decrease in these types of crimes.

PredPol software is able to make calculations about crimes based on previous times and locations of other incidents while cross-referencing these with criminal behavior and patterns. Here is an example of how large-scale this could get: George Mohler, a UCLA mathematician who was testing the effectiveness of PredPol, looked at 5,000 crimes which required 5,000! comparisons (i.e. 5,000 x 4,999 x 4,998…). With impressive results already materializing from calculations like these, it is exciting to think how much more accurate predictive crime analytics could become.
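To get a feel for the scale mentioned above, here is a rough Python sketch that estimates how many digits 5,000! has. For contrast, it also counts the comparisons needed if each pair of crimes were compared exactly once; that pairwise figure is an assumption added purely for illustration, not something the analysis described above is reported to have used. The function names are hypothetical.

```python
import math

def factorial_digit_count(n: int) -> int:
    """Number of decimal digits in n!, computed via logarithms
    so the huge integer never has to be built or printed."""
    # lgamma(n + 1) equals ln(n!)
    log10_factorial = math.lgamma(n + 1) / math.log(10)
    return math.floor(log10_factorial) + 1

def pairwise_comparisons(n: int) -> int:
    """Comparisons needed if every pair of items is compared once
    (an illustrative assumption, not the method described above)."""
    return math.comb(n, 2)

crimes = 5000
print(factorial_digit_count(crimes))  # number of digits in 5,000!
print(pairwise_comparisons(crimes))   # 12497500
```

Either way, the workload grows far faster than the number of crimes, which is why this kind of analysis benefits from large-scale computing.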

Hadoop lays down the law

With Apache Hadoop, perfecting crime prevention becomes an attainable goal. CTOlabs presented some very important points in a recent white paper about big data and law enforcement, showing how Hadoop could be beneficial to smaller police departments that don’t have very much financial leeway. The LAPD, for example, is very well funded and can afford to work with companies such as IBM to develop crime-predicting techniques.

Smaller or less advanced departments, however, do not have the financial means to use supercomputers or extensive command centers and will use less efficient techniques (such as simple spreadsheets and homegrown databases) to keep track of all of the information involved in law enforcement. “Nationwide, agencies and departments have to reduce their resources and even their manpower but are expected to continue the trend of a decreasing crime rate. To do so requires better service with fewer resources.” Open source presents an extremely effective and less expensive option – Apache Hadoop is the superhero that can save the day, one cluster at a time.

With Hadoop’s capability to store and organize data, police departments can filter out unnecessary information in order to focus on the aspects of crime that are more important. By applying advanced analytics to historical crime patterns, weather trends, traffic sensor data, and a wealth of other sources, police can place patrol cops in areas with higher crime probability instead of distributing manpower evenly across quiet and dangerous neighborhoods alike. This conserves money, effort, and time. Hadoop can also help organize a number of other factors such as police backup, calls for service, or screening for biases and confounding variables. Phone calls, videos, historical records, suspect profiles, or any other important information that law agencies need to keep for a long time can be systematized and referenced whenever need be.
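As a rough illustration of the idea above, the sketch below counts historical incidents per map grid cell in a map-and-reduce style and ranks the busiest cells. It is a toy, in-memory Python example, not PredPol’s model and not an actual Hadoop job; the incident coordinates, the cell size, and all function names are made up for illustration.

```python
from collections import Counter
from typing import Iterable

Cell = tuple[int, int]

def map_incident(x_feet: float, y_feet: float, cell_size: float = 500.0) -> Cell:
    """Map step: assign one incident location to a square grid cell."""
    return (int(x_feet // cell_size), int(y_feet // cell_size))

def reduce_counts(cells: Iterable[Cell]) -> Counter:
    """Reduce step: total the incidents that landed in each cell."""
    return Counter(cells)

def top_hotspots(incidents, k=2, cell_size=500.0):
    """Rank the k grid cells with the most past incidents."""
    counts = reduce_counts(map_incident(x, y, cell_size) for x, y in incidents)
    return counts.most_common(k)

# Hypothetical incident locations, in feet from an arbitrary origin.
incidents = [(120, 80), (450, 300), (510, 20), (530, 90), (560, 10), (1900, 1900)]
print(top_hotspots(incidents))  # [((1, 0), 3), ((0, 0), 2)]
```

On a real cluster the map and reduce steps would run in parallel over far larger incident logs, and the ranking would be only one input among many into patrol planning.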

Increasing public safety through effective use of technology is not a panacea but it is here and is an effective tool in combating crime. Apache Hadoop serves as a foundation for this new approach and, most importantly, it is accessible to a wider range of police departments all over the country and the world. Yes, predictive policing and crime prevention still have a lot of room for development and have yet to tackle issues like specific crimes that depend on interpersonal relationships or random events. However, it is all very possible, especially with the use of Hadoop as a predictive analytics platform. Crime can be stopped. No PreCogs necessary.

Healthcare Goes Big

Earlier, in the “Big Data in Genomics and Cancer Treatment” blog post, I explored how the extensive amount of information in DNA analysis mostly comes from the vast array of characteristics associated with people’s DNA makeup and with different cancer variations. The case with today’s healthcare is very similar. Each patient is unique and has thorough medical history records that allow doctors to make evaluations and recommendations for future treatments. These records also list the various drugs, therapies, diets, and regimens that must coincide with the patient’s condition and which, if not followed correctly, could endanger the patient’s life.

“Doctor, can I have some of that Big Data?”

Currently, the medical field is overflowing with big data and there is huge potential for improvement in treatment quality and overall patient experience. With the use of big data analytics, health care and pharmaceutical companies could significantly advance the services that they offer their patients.

Through big data analytics, there could be much more control over hospital operations. According to the Top Ten Innovations for 2012 article for the 2012 Medical Innovation Summit, data could “track outcomes for clinical and surgical procedures, including length of stay, readmission rates, infection rates, mortality, and comorbidity prevention”. The article also stated that, “Healthcare big data requires advanced technologies to efficiently process it with tolerable elapsed time, so organizations can create, collect, search, and share data, while still ensuring privacy.” This brings up a very important point: how can healthcare organizations take advantage of the benefits that many big data technologies provide while also ensuring privacy?

Healthcare companies must be able to balance the privacy of their individual patients with the overall health of the population. To meet that need, companies can still analyze patient records through the HIPAA-compliant privacy framework – a security framework that eliminates patient identification and still allows data analysis. This framework complies with federal law and, most importantly, brings about exciting improvements at various hospitals in their goals to improve today’s healthcare. Aside from that, there is also the Nationwide Health Information Network (NHIN) Exchange – a way for healthcare professionals to securely exchange information while following specific standards, services, and policies. The NHIN Exchange is helping to achieve the goals of the Health Information Technology for Economic and Clinical Health (HITECH) Act of 2009.

Big Data Role Models in Healthcare

With the help of big data companies, hospitals and pharmacies have the ability to understand a wide range of patient data, which decreases the chances of missing any warning signs or making medical miscalculations. So far, for example, New York Presbyterian Hospital has decreased potentially fatal blood clot cases by approximately 25% and the Seton Healthcare Family hospital has been able to predict (and prepare for) probable congestive heart failure cases. A clinical analytics company called Humedica is focusing specifically on congestive heart failure and has developed a predictive analytic model that also allows doctors to be aware of high-risk CHF patients before they are admitted into a hospital. The Sax Institute has also launched a project called The Secure Unified Research Environment (SURE), which allows health researchers to access patients’ medical information (identities are protected) through a data center. While still in its infancy, this project will compile a lot of research about the consistency of care with respect to the age, wealth, and overall living condition of various patients, so that doctors will be able to analyze all the factors contributing to a patient’s medical case.

Another very impressive project is IBM’s Watson (yes, the one that became popular after a game of “Jeopardy!”) – a computing system that can be used as a tool for doctors and researchers in the medical field. Watson is capable of analyzing the meaning and context of human language, allowing doctors to have an evidence-providing adviser on patient conditions in near real time. To give you a sense of its power: Watson is able to examine about one million books and analyze the information in them all in about three seconds. This kind of speed and precision can prove very helpful for doctors when they are faced with difficult medical cases, especially ones that require quick treatment. This big data system can positively change doctor-patient communication and help to facilitate efficient health care.

Explorys, a healthcare data company, has already developed a secure software platform with the help of Apache Hadoop and offers doctors the ability to aggregate, analyze, manage, and research all of the information they need to make the right decisions every day. Through its platform, Explorys has compiled an extensive healthcare database, which is already being used by 11 other major healthcare companies.

Apixio, another medical search company, uses Hadoop to analyze structured and unstructured data to provide meaningful results when healthcare professionals search specific issues in Apixio’s Medical Information Navigation Engine (MINE). Any kind of data can be put through MINE (forms, CT scans, emails) and doctors can then extract the information they need based on specific symptoms. Vishnu Vyas, a natural language scientist at Apixio, explained MINE as “Google for doctors, only better, because it’s patient-centric and determines how data relate to one another.”

Big Potential

Bill Schmarzo, chief technology officer at EMC, shared a helpful list of Big Data Business Opportunities in health care. Here are a few:

  • Ability to access any data source, no matter where it is located, using new federated query, data discovery and semantic management technologies. This allows health care providers to gain a more timely, more complete understanding of the patient’s current situation so that they can prescribe the appropriate and most effective treatments.
  • New instrumentation opportunities to increase the amount and real-time nature of data being captured about patients’ health care (blood monitoring, smart toothbrushes, etc.)
  • In-memory capabilities to facilitate real-time, life-saving decisions at the point of care, especially in high stress, immediate need areas like the emergency room.
  • Real-time monitoring of key patient health care metrics that leverages in-memory computing to more rapidly evaluate incoming patient data streams (from the multitude of new health metrics capturing sensors), flag areas of concern, and score potential health-related issues.

By harnessing big data, healthcare industries can see significant benefits and accelerate development, particularly by using the power of Apache Hadoop. Healthcare in the United States is costly and, as an open source platform, Hadoop makes big data analytics affordable. Professionals could revolutionize their medical businesses and provide the best care possible to their patients. Most importantly, if healthcare companies learned to manage big data efficiently, there could be a wider availability of data and, consequently, a much more global knowledge of patient treatments, therapies, and drugs. For the healthcare world, in this case, Apache Hadoop may be just what the doctor ordered.

Hadoop: A Powerful Weapon for Retailers

Big Data Shopping Bag

With big data basking in the limelight, it is no surprise that large retailers have been closely watching its development… and more power to them! By learning to effectively utilize big data, retailers can significantly mold the market to their advantage, making themselves more competitive and increasing the likelihood that they will come out on top as a successful retailer. Now that there are open source analytical platforms like Hadoop, which allow for unstructured data to be transformed and organized, large retailers are able to make smart business decisions using the information they collect about customers’ habits, preferences, and needs.

As IT industry analyst Jeff Kelly explained on Wikibon, “Big Data combined with sophisticated business analytics have the potential to give enterprises unprecedented insights into customer behavior and volatile market conditions, allowing them to make data-driven business decisions faster and more effectively than the competition.” Predicting what customers want to buy, without a doubt, affects how many products they want to buy (especially if retailers add on a few of those wonderful customer discounts). Not only will big data analytics prove financially beneficial, it will also present the opportunity for customers to have a more individualized shopping experience.

This all sounds very promising, but the difficulty lies in the fact that there are many channels in the consumer business now, such as online, in-store, call centers, mobile, and social, each with its own target-marketing advantage. In order for retailers to thrive in the market, they must learn to manage and home in on all (or at least most) of these facets of business, which can be difficult if you keep in mind the amount of data that each channel generates. Sam Sliman, president at Optimal Solutions Integration, summarizes it perfectly: “Transparency rules the day. Inconsistency turns customers away. Retailer missteps can be glaring and costly.” By making fast market decisions, retailers can increase sales, win and maintain customers, improve margins, and boost market share, but this can really only be done with the right business analytics tools.

Who’s doing it right?

One impressive example of analytics usage is @WalmartLabs, which deals with the social and mobile aspects of retail to redefine commerce for Walmart and help its customers have a more positive shopping experience. Through its Social Genome knowledge base, @WalmartLabs zeroes in on entities, relationships, and events in the social world (for instance, a tweet about a specific movie title) in order to send out appropriate suggestions to customers. “We do this using public data on the Web, proprietary data, and a lot of social media. From such data we identify interesting entities and relationships, extract them, augment them with as much information as we can find, then add them to the Social Genome.” @WalmartLabs uses its own in-house data platform, called Muppet, which is meant to process data at lightning speed.

Sears is another retailer that is focused on the advantages of big data and is using Hadoop to develop its business. If you were able to make it to Hadoop Summit 2012, you had the chance to see Phil Shelley speak about the company’s use of Hadoop and provide some interesting insight about the benefits of the open source platform (if you couldn’t make it, you can find the session slides here). Through Hadoop, Sears is able to compare and organize information about product availability, competitors’ prices, local economic conditions, etc. Before Hadoop, Sears was only using 10% of the information it had in store and was spending most of its money and resources on running price elasticity algorithms. Rachael King of the CIO Journal explains, “The company now offloads data from its mainframe computers onto servers using Hadoop to run algorithms that analyze the data and feeds the results back into the mainframe. The retailer is able to use 100% of the data it collects.”

Big Insights

Eric Williams, CIO at Catalina Marketing, offered some helpful information in an interview by Alison Bolen of SAS. According to Williams, retailers can use big data and business analytics to answer questions like, “what products are selling, what’s the association of one product to another, what do my consumers look like, what is the marketplace doing?” With 20,000 new products being introduced in the United States annually, it is essential for companies to sort the information about all of these products in order to gauge which ones worked well and which ones were a total flop.

The sorting of information through platforms like Hadoop will allow for extensive feedback in finance, marketing, operations, sales, and other areas of a business, which, in turn, will offer a more “per-customer profitability” approach. Sales associates will be able to access information on the spot (through a mobile device, for example) about which products are up-and-coming or which items a customer may be interested in based on the questions they might ask in the store. So, not only will online shopping continue to become more and more personalized but in-store experiences will also be highly sensitive to what each customer is looking for.

A white paper by Keplar LLP goes through the process of using Hadoop for retail business analytics and offers a list of ways to use the information that is collected through different channels:

  • Learn more about the customer, including who she is, how she engages with the product, company or brand, how she feels about the product and what role she plays in evangelizing it to others
  • Identify ways to better tailor the product and service to that customer or customer segment, improving customer loyalty
  • Identify ways to improve the product for all users, by comparing the way that this customer used it with other customers. Are there particular workflows that customers struggled with or abandoned?
  • Identify new products / services to offer that customer, or a segment of customers made up of people like her
  • Grow customer lifetime value, and hence profit

The white paper explains that both consumer and product analytics are significantly affected by the presence of big data, and Hadoop is a great solution for managing both. Since it uses a parallel structure, Hadoop can run various analyses on smaller data sets, which makes it easy for retailers to compare and contrast various products, customer feedback, and the mass of social information that is generated every minute of every day. The possibilities for a thriving retail business are endless.

Without a platform like Hadoop, retailers have to spend big bucks on designing the appropriate data warehouses for the information they collect. Hadoop doesn’t require a pre-defined schema, so storing and interpreting unstructured data like product descriptions or social media conversations between users becomes considerably easier.
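To make the schema point concrete, here is a small Python sketch of the “store raw, interpret on read” idea. It is only an analogy run in memory, not Hadoop code; the records, field names, and helper function are hypothetical.

```python
import json

# Raw records kept exactly as they arrived; no table schema was defined up front.
raw_lines = [
    '{"type": "review", "product": "kettle", "text": "The kettle heats fast"}',
    '{"type": "tweet", "user": "@shopper", "text": "Is the kettle on sale?"}',
    '{"type": "review", "product": "toaster", "stars": 2}',
]

def mentions(raw, keyword):
    """Apply structure only at read time: parse each record and keep
    the ones whose free text mentions the keyword."""
    hits = []
    for line in raw:
        record = json.loads(line)
        text = record.get("text", "")  # fields may be missing; that is fine
        if keyword.lower() in text.lower():
            hits.append(record)
    return hits

for record in mentions(raw_lines, "kettle"):
    print(record["type"], "->", record["text"])
```

Because nothing forces the three records into one shape, a new kind of record can be stored right away and a question about it can be written later.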

Ready to Check Out?

In such a consumer-driven society it seems almost necessary to establish a system of organization that could help make sense of consumer behaviors and trends; Apache Hadoop is a smart (and affordable) way to do this. With the social and technological worlds advancing at such an incredible speed, online, mobile, and social consumerism is becoming more of a norm than an option. Retail companies can truly get the most from their business (and provide a positive experience for customers) if they open their arms to the big data coming their way and simultaneously understand how to transform this data into a positive business model.

Big Data in Genomics and Cancer Treatment

 

Why genomics?

Big data. These are two words the world has been hearing a lot lately, in connection with a wide array of use cases in social media, government regulation, auto insurance, retail targeting, etc. The list goes on. However, a very important concept that should receive the same (if not more) recognition is the presence of big data in human genome research.

Three billion base pairs make up the DNA present in humans. It’s probably safe to say that such a massive amount of data should be organized in a useful way, especially if it presents the possibility of eliminating cancer. Cancer treatment dates back to the first documented case in Egypt (1500 BC), when humans began distinguishing between malignant and benign tumors by learning how to surgically remove them. It is intriguing and scientifically helpful to take a look at how far the world’s knowledge of cancer has progressed since that time and what kind of role big data (and its management and analysis) plays in the search for a cure.

The most concerning issue with cancer, and the ultimate reason it still hasn’t been completely cured, is that it mutates differently for every individual and reacts in unexpected ways with people’s genetic makeup. Professionals and researchers in the field of oncology have to accept the fact that each patient requires personalized treatment and medication in order to manage the specific type of cancer that they have. Elaine Mardis, PhD, co-director of the Genome Institute at the School of Medicine, believes that it is essential to identify mutations at the root of each tumor and to map their genetic evolution in order to make progress in the battle against cancer. “Genome analysis can play a role at multiple time points during a patient’s treatment, to identify ‘driver’ mutations in the tumor genome and to determine whether cells carrying those mutations have been eliminated by treatment.”
