The network and security teams at your company do not allow internet access from the machines where you plan to install Hadoop. What do you do? How do you install your Hadoop cluster without having access to the public software packages? Apache Ambari supports local repositories and in this post we’ll look at the configuration needed for that support.
When installing Hadoop with Ambari, there are three repositories at play: one for Ambari (which primarily hosts the Ambari Server and Ambari Agent packages) and two for the Hortonworks Data Platform (which host the HDP Hadoop Stack packages and other related utilities).
Whether it’s the Ambari repository or the HDP repositories, below we summarize two options for building a local repository. For more background, you can review this Hortonworks document that covers installing Hadoop in data centers with network restrictions. The document goes into detail on building local repositories and explains where to get the Ambari and HDP repository tarballs (if you choose Option 2 below).
Option 1: If you can get temporary internet access, you can use the public repository to build the local repository via “reposync”. Basically, you sync all of the software packages from the public repository to your local host, use Linux tools to create the necessary repodata, and then host all of those packages from your Apache web server as a local repo.
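As a rough illustration of Option 1, here is a minimal sketch. It assumes the yum-utils package (which provides reposync) and the createrepo tool are installed, and that the public .repo file is already in place on the host. The function name, the repo id, and the web root are placeholders for illustration, not values from the Hortonworks documentation.

```shell
# Sketch only: mirror one public yum repository into the web server directory.
# Assumes reposync (yum-utils) and createrepo are installed on this host.
sync_repo() {
    repo_id="$1"    # placeholder: the repo id found in the public .repo file
    web_root="$2"   # assumption: the directory your Apache web server serves

    # Copy every package of the public repository into web_root/repo_id.
    reposync --repoid="$repo_id" -p "$web_root" || return 1

    # Create the repodata directory so yum clients can use the copy.
    createrepo "$web_root/$repo_id" || return 1

    echo "Local repository built at $web_root/$repo_id"
}

# Example call (placeholder values):
#   sync_repo EXAMPLE-REPO-ID /var/www/html
```

You would run this once per repository (Ambari, HDP, and the HDP utilities) while the temporary internet access is available.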
Option 2: If you cannot get temporary internet access, you can download a repository tarball, which contains all of the software packages in tarball form, extract it into the directory served by your Apache web server and voila, you have a local repo.
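A minimal sketch of Option 2 might look like the following. It assumes a gzip-compressed tarball and a web root of your choosing. The function name and paths are hypothetical. After extracting, it prints every directory that contains a repodata directory, since those are the candidates for a Base URL.

```shell
# Sketch only: unpack a repository tarball under the web server directory.
# Assumes the tarball is gzip-compressed; adjust the tar flags if it is not.
extract_repo_tarball() {
    tarball="$1"    # path to the downloaded repository tarball
    web_root="$2"   # assumption: the directory your Apache web server serves

    mkdir -p "$web_root" || return 1
    tar -xzf "$tarball" -C "$web_root" || return 1

    # List each directory that holds a repodata directory.
    find "$web_root" -type d -name repodata | sed 's|/repodata$||'
}

# Example call (placeholder paths):
#   extract_repo_tarball /tmp/repo-tarball.tar.gz /var/www/html
```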
Regardless of your choice above, the end result is having a local repository inside of your network that is addressable by a Base URL – a URL to the directory where the repodata directory of the repository is located.
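To make the Base URL idea concrete, here is a small sketch that renders a yum .repo definition pointing at a local repository, for example to install the Ambari packages themselves from your local Ambari repo. The function name, repo id, and hostname are made up for illustration. It sets gpgcheck=0 only to keep the sketch short; in practice you would likely keep GPG checking enabled and add a gpgkey line.

```shell
# Sketch only: print a yum .repo definition for a local repository.
render_repo_file() {
    repo_id="$1"    # placeholder repo id
    base_url="$2"   # URL of the directory that contains repodata/

    cat <<EOF
[$repo_id]
name=$repo_id (local mirror)
baseurl=$base_url
enabled=1
gpgcheck=0
EOF
}

# Example call (placeholder values), writing the file yum will read:
#   render_repo_file local-ambari http://repo.example.internal/ambari/centos6 \
#       > /etc/yum.repos.d/local-ambari.repo
```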
During Ambari Server setup, Ambari will optionally download and install the JDK. The JDK is hosted publicly, but if you do not have internet access, you need to download the JDK yourself and install it on your hosts. Then, when you run Ambari Server setup, specify the -j option to indicate the location of your JDK.
ambari-server setup -j /path/to/your/installed/jdk
Note: This is the JDK install scenario we typically see. Hosts already have a JDK installed and by using the -j option, you instruct Ambari to use that already-installed JDK instead of trying to download and install the JDK from the internet.
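Before running setup, it can be worth confirming that the path you plan to pass to -j really contains a JDK. The helper below is a hypothetical convenience and not part of Ambari. It only checks that an executable bin/java exists under the given directory.

```shell
# Sketch only: succeed if the directory looks like an installed JDK.
jdk_looks_usable() {
    jdk_home="$1"
    # An installed JDK has an executable java binary under bin/.
    [ -x "$jdk_home/bin/java" ]
}

# Example use with the setup command shown above:
#   jdk_looks_usable /path/to/your/installed/jdk \
#       && ambari-server setup -j /path/to/your/installed/jdk
```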
For Ambari to install the Hortonworks Data Platform (HDP) Stack, you need the HDP repository available. So with the HDP Stack local repository Base URL in hand (that you created earlier), and with the Ambari Server installed and set up, start the Ambari Cluster Install Wizard.
Log in, and on the Select Stack wizard screen you will find an area for Advanced Repository Options.
Expand the Options area and you’ll see (by default) the Base URLs for the HDP Stack public repositories. Since HDP supports multiple operating systems (OS), and each set of OS packages is in its own repository, there is a Base URL per OS.
Based on what OS (or OSes) you plan to use in your cluster, replace the public Base URL with your local repository Base URL (that you created earlier). You can uncheck the OSes you do not plan to use in your cluster. Click Next and continue along with the cluster install process.
Ambari will validate that your Ambari Server host can reach this repository and that the Base URL points to a valid repository (just in case you mistyped the URL or misconfigured your local repository). And during host registration, if any of the hosts you plan to include in the cluster use a different OS than the one(s) you specified in Advanced Repository Options, you will see a warning.
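If you want to check a Base URL yourself before typing it into the wizard, one approach is to request the repomd.xml file that sits inside the repodata directory. The sketch below assumes curl is installed on the Ambari Server host. The function name is hypothetical, and this is not how Ambari performs its own validation.

```shell
# Sketch only: confirm that a Base URL serves repodata/repomd.xml.
check_base_url() {
    base_url="${1%/}"                         # drop one trailing slash, if any
    repomd="$base_url/repodata/repomd.xml"    # metadata file of a yum repository

    # -f makes curl return a non-zero status on HTTP errors such as 404.
    if curl -fsS -o /dev/null "$repomd"; then
        echo "reachable: $repomd"
    else
        echo "NOT reachable: $repomd" >&2
        return 1
    fi
}

# Example call (placeholder URL):
#   check_base_url http://repo.example.internal/hdp/centos6
```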
After validation, click Next and continue with your install. After you click Deploy and Ambari installs the Hadoop packages on your hosts, each host will access the local repository to obtain the packages and not go out to the internet.
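To double-check on a host that nothing still points at the internet, you could scan the yum repository definitions for Base URLs outside your local repository server. This is a sketch with a hypothetical function name. It assumes the definitions live in files ending in .repo with baseurl= lines. Note that it will also list your OS base repositories unless those are mirrored locally too.

```shell
# Sketch only: print every baseurl that does not start with the local prefix.
list_external_baseurls() {
    repo_dir="$1"       # usually /etc/yum.repos.d
    local_prefix="$2"   # placeholder: URL prefix of your local repository server

    grep -h '^baseurl=' "$repo_dir"/*.repo 2>/dev/null | while IFS= read -r line; do
        url="${line#baseurl=}"
        case "$url" in
            "$local_prefix"*) ;;      # served from the local repository server
            *) echo "$url" ;;         # points somewhere else
        esac
    done
}

# Example call (placeholder prefix):
#   list_external_baseurls /etc/yum.repos.d http://repo.example.internal/
```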
That’s about it. I should point out that local repositories are not only for installing Hadoop without internet access. Local repositories can also minimize internet bandwidth usage when downloading software packages, which helps make cluster installs faster. Also, by having a local repository available for a specific Stack version, you can rest assured that the software packages for that Hadoop Stack will be available for future installs. I think you’ll agree that local repositories are critical when you do not have internet access and will come in handy to help speed up package installs.
Get started today using the latest Ambari release. And as always, to find out more about Ambari, please visit the Apache Ambari Project page. You can also join the Ambari User Group and attend Meetup events.