Hortonworks Customer
Yahoo Japan Corporation

Key Highlights

Nearly 6,000 nodes

Managing 75 petabytes

Nearly 400 internal users

Building a Stable Platform for Big Data in Business

Yahoo Japan Corporation (Yahoo Japan), which operates Yahoo! JAPAN, the largest portal site in Japan, originally adopted Apache Hadoop as a platform for storing and processing ever-increasing amounts of data, such as user activity history. To cope with the rapid proliferation of data and to ensure stable operation of large clusters, it adopted Hortonworks Data Platform (HDP) as its mission-critical platform going forward. This has allowed Yahoo Japan to build a platform that can store and analyze ever-increasing amounts of data.

Building an Analytical Platform to Scale

The portal site Yahoo! JAPAN collects diverse data sets. Huge amounts of data are stored every day, such as access logs, search keywords, product information, purchase histories, and bidding information for auctions. Currently, Yahoo Japan operates three clusters. The largest cluster contains 3,800 nodes, and the total number of nodes is close to 6,000. Yahoo Japan stores a total of 75PB of data, with a maximum of 37PB in a single cluster.

Business Challenges:
• Rapid proliferation of data
• Unstable operation of large clusters
• Need to improve technology level

Deployment Results:
• Improved performance
• Stable operation of large clusters
• Improved skill levels of internal engineers

Yahoo Japan has been actively using big data since the early days. By 2008, it had already adopted Apache™ Hadoop® for data storage and analysis. Tetsuya Hibino, manager of the Data Platform Group, Data & Science Solutions Department of Yahoo Japan Corporation, describes the company’s data analysis efforts with Hadoop as follows.

“We installed Hadoop when it first appeared as an open source technology. It was initially used in the tabulation of search logs for each department but because there are more internal Hadoop users and the number of clusters has also increased, we decided to consolidate the clusters in 2011.”

However, the initial consolidated cluster of 500 nodes soon reached its resource limits. When the cluster grew to 1,000 nodes, cross-use of data became possible and the number of internal users increased further. In February 2014, Yahoo Japan’s clusters exceeded 3,000 nodes. Currently, about 300 to 400 internal users use Hadoop for data analysis on a daily basis at Yahoo Japan, and this number is expected to increase further.

“When we exceeded 3,000 nodes, various problems occurred during the operation. The technical difficulties vary when the system gets larger in size. We would like to stabilize the system together with partners with expertise on big data utilization,” adds Hibino.

Partnership to Increase Internal Technical Skills

Yahoo Japan needed to build a stable data analysis platform to support operations on huge amounts of data. It therefore decided to partner with Hortonworks and adopt HDP.

According to Hibino, “We discussed the deployment of a specific distribution for stable operation but in order to cope with ever-increasing data, we thought it would be necessary to build up our internal technical skills as well. Our partnership with Hortonworks allows us to tap into the high level of technical skills of the Hortonworks team, and together with our team, we are able to come up with solutions for the building and operation of data analysis platforms in line with our business targets. We also receive appropriate advice from committers and increased the technical skills of our management layer.”

Kenji Fujimoto, Leader of Development No. 2 Grid, Data Platform Group, Data & Science Solutions Department of Yahoo Japan Corporation, has the following comments about the partnership with Hortonworks. “In deploying HDP, we discussed whether to set up and migrate new clusters or to update existing ones. In the end, we decided to set up new HDP clusters and close existing clusters after migration. Firstly, we started with the migration of the cluster for the advertising business and subsequently extended this method to other businesses. This time, we had to set up large clusters too but the advice from the visiting Hortonworks architects was very helpful with regard to efficient use of hardware.”

Hibino adds, “As a result, we were able to increase the throughput per cost to 2.4 times.”

Improved Performance of Large Clusters

The Hadoop ecosystem is deployed across the board at Yahoo Japan to store and analyze various data in large clusters. In particular, Apache Hive on Apache Tez contributes the most to performance improvement. Compared to the environment before deployment, throughput increased about 30-fold.

“This may not be completely due to the effects of Hive on Tez as application improvements also play a part but there has been an obvious improvement in performance,” says Hibino. Apache Ambari, an open source management tool, is also used effectively.

According to Fujimoto, “We were using Ambari right from the start. Recently, we have also automated our systems using Ambari APIs. We hope to use Ambari more extensively for cluster maintenance in the future.”

Operations have also improved with the installation of HDP. Fujimoto elaborates, “Basically, the Hortonworks distribution provides completed verification of the version compatibility for each component. I think it is important that the operations are guaranteed.”

Hibino also feels positive about this partnership. “One of the objectives of this partnership was the improvement of internal technical skills. We were able to gain effective know-how through the advice of Hortonworks architects, exchanges on the community service (HCC), and discussions with committers familiar with the OSS community.”

Future Outlook

By consolidating clusters and analyzing a variety of data in ever increasing volumes, Yahoo Japan was able to develop Yahoo! DMP (Data Management Platform) as a service, which provides the most appropriate approach for its audiences through more precise targeting.

As the use of data analysis platforms progresses further, such platforms may need to handle real-time processing in addition to batch processing, depending on how internal users work with them. Hibino comments on building a data analysis environment that can flexibly support the business.

“We will be considering frameworks in which the processing engines will change according to the type and purpose of data, such as increasing processing speed through in-memory processing. We also want to use such platforms for data processing such as for the backend of machine learning or deep learning,” says Hibino.

Figure: Information Analysis Platform

About Yahoo Japan Corporation

Yahoo! JAPAN is committed to solving issues faced by the Japanese people and society through the power of the Internet, and has created various services to serve these needs. It will continue to challenge itself to create new hopes for the future.