Import from Microsoft SQL Server into the Hortonworks Sandbox using Sqoop
This tutorial is from the Community part of tutorial for Hortonworks Sandbox - a single-node Hadoop cluster running in a virtual machine. Download to run this and other tutorials in the series. This community tutorial submitted by flacrosse with source available at Github. Feel free to contribute edits or your own tutorial and help the community learn Hadoop.If the code returns Shared Memory as the net_transport (as shown below), you’ll need to go into the SQL Server Configuration Manager, disable Shared Memory, make sure TCP/IP is enabled, and reboot the server. Once you have Shared Memory disabled the code should return an IP address. Make note of the IP address and the Local TCP Port number as we’ll be using them later in the Sqoop connection string. NOTE: It’s probably a good idea at this point to make sure the user is able to connect to the server and access some data. Log in through Management Studio with the credentials and do a SELECT against a table you want to import. If you can’t get to the data this way, Sqoop isn’t going to be able either.Apache Sqoop project page.
SELECT CONNECTIONPROPERTY('net_transport') AS net_transport, CONNECTIONPROPERTY('protocol_type') AS protocol_type, CONNECTIONPROPERTY('auth_scheme') AS auth_scheme, CONNECTIONPROPERTY('local_net_address') AS local_net_address, CONNECTIONPROPERTY('local_tcp_port') AS local_tcp_port, CONNECTIONPROPERTY('client_net_address') AS client_net_address
Download and unzip the driver
You should now see a folder in the directory called sqljdbc_4.0 when you run the ls command.
curl -L 'http://download.microsoft.com/download/0/2/A/02AAE597-3865-456C-AE7F-613F99F850A8/sqljdbc_4.0.2206.100_enu.tar.gz' | tar xz
Once this runs you should be able to see the driver in the list of available Sqoop drivers
cp sqljdbc_4.0/enu/sqljdbc4.jar /usr/lib/sqoop/lib
Now that the driver has been added I recommend restarting the Sandbox completely. I’m sure there are ways to restart certain services so that the new driver becomes available but I had a much easier time just rebooting before continuing on. AdventureWorks database.
So for my connection I run the following code, using the IP Address and Port Number I saw earlier along with the user name and password I set up
sqoop list-databases --connect jdbc:sqlserver://<IP Address>:<Port Number> --username <Username> --password <Password>
With that command, Sqoop should return a listing of all the available databases to connect to.
sqoop list-databases --connect jdbc:sqlserver://192.168.56.1:1433 --username hadoop --password hadoop1
So for my import I end up with the following connection string
sqoop import --connect "jdbc:sqlserver://<IP Address>;database=<Database Name>;username=<Username>;password=<Password>" --table <Table Name> --hive-import -- --schema <Schema Name>
Note: the schema argument is specific to the driver and must be passed at the end of the command after an empty -- (dash dash) argument. If you are importing from the default schema of the database (like dbo) you do not need to specify the schema with this additional argument. After you execute the command Hadoop is going to go to work and kick off a MapReduce job. You’ll see a lot of information flying past your eyes, but eventually you be left with the following results indicating your table has been successfully imported. (Use Shift+PageUp/PageDown if you want navigate through the command prompt information that flew past which is especially useful in debugging).
sqoop import --connect "jdbc:sqlserver://192.168.56.1:1433;database=AdentureWorks2008R2;username=hadoop;password=hadoop1" --table SalesOrderDetail --hive-import -- --schema Sales
SELECT * FROM SalesOrderDetail