Last week I tried to import some Microsoft SQL Server data into the Sandbox. As a Microsoft-based developer I had no experience with Linux or Hadoop, so it was difficult to get working. The steps below are what finally worked for me; hopefully other MS developers can use them to get enthusiastic about Hadoop.
We can use Sqoop to import the data; it is already installed on the Sandbox. Sqoop commands can be executed via Bash in the Hue shell in the web browser, or directly on the VM (log on as root/hadoop).
The basic Sqoop command to import data into HDFS is (note: each option takes two dashes!):
sqoop import --connect <jdbc-connection-string> --table <table> --username <user> --password <password>
A JDBC connection string is required to connect to the MS SQL Server, which means a JDBC driver for SQL Server needs to be installed in the Hadoop environment. The Sandbox runs on CentOS, which is a Linux distribution (basic commands). These are the steps I followed:
1. Find the JDBC driver for MS SQL on Microsoft's download site; at the time of writing this is sqljdbc_4.0.2206.100_enu.tar.gz.
2. Downloading it to the VM requires specific rights, so we need to switch accounts and create a new session as the root user. Type su - root and enter the password hadoop.
3. Move to the sqoop directory: cd /usr/lib/sqoop
4. Download the driver with this command: wget http://download.microsoft.com/download/0/2/A/02AAE597-3865-456C-AE7F-613F99F850A8/sqljdbc_4.0.2206.100_enu.tar.gz
5. Unzip: gunzip sqljdbc_4.0.2206.100_enu.tar.gz
6. Extract the tar: tar -xvf sqljdbc_4.0.2206.100_enu.tar. This doesn't seem to work in the Hue shell, so you need to log in to the VM itself for this step.
7. Copy the jar files to Sqoop's lib directory (the tarball extracts relative to /usr/lib/sqoop, where we are): cp sqljdbc_4.0/enu/*.jar /usr/lib/sqoop/lib
Now the JDBC SQL Server driver is ready to be used. Because Sqoop connects to your SQL Server from 'the outside', you need to adjust your local firewall settings: make sure port 1433 is open for the 'public' profile.
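On a Windows development machine you can also add that rule from an elevated command prompt instead of clicking through the firewall UI. A sketch; the rule name is just an arbitrary label:

```
:: Allow inbound TCP 1433 on the Public profile (run as Administrator).
:: "Allow SQL Server for Sqoop" is an arbitrary rule name.
netsh advfirewall firewall add rule name="Allow SQL Server for Sqoop" dir=in action=allow protocol=TCP localport=1433 profile=public
```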
The Sqoop command to import can then look like this:
sqoop import --connect "jdbc:sqlserver://192.168.25.1:1433;databaseName=mssqldatabase;" --table mytable --username sqluseraccount --password mypwd --hive-import
The IP address is the IP of my local machine. I added the optional --hive-import flag, which means that after the import into HDFS, Sqoop will also load the data into a Hive table (and yes, passing the password on the command line like this is insecure).
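If the plain-text password bothers you, Sqoop also supports -P, which prompts for the password interactively instead of reading it from the command line (and your shell history). Same example as above otherwise:

```shell
# Same import as above, but Sqoop asks for the password at a prompt
# rather than taking it as a command-line argument.
sqoop import \
  --connect "jdbc:sqlserver://192.168.25.1:1433;databaseName=mssqldatabase;" \
  --table mytable \
  --username sqluseraccount \
  -P \
  --hive-import
```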
The import can take a while, good luck!