Python: Connect To Hadoop

We can connect to Hadoop from Python using the pywebhdfs package, which talks to HDFS over the WebHDFS REST API. For the purposes of this post we will use version 0.4.1. You can see all of the APIs here.

To build a connection to Hadoop you first need to import the client class.

from pywebhdfs.webhdfs import PyWebHdfsClient

Then you build the connection like this. Port 50070 is the default NameNode web port in Hadoop 2.x; Hadoop 3.x uses 9870 by default.

HDFS_CONNECTION = PyWebHdfsClient(host=##HOST##, port='50070', user_name=##USER##)

To list the contents of a directory you do this.

HDFS_CONNECTION.list_dir(##HADOOP_DIR##)
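As a rough sketch of what you can do with the result: WebHDFS answers a directory listing with JSON shaped like {"FileStatuses": {"FileStatus": [...]}}, and this example assumes list_dir hands that back as a Python dict. The helper name entry_names and the sample data are made up for illustration.

```python
def entry_names(listing, entry_type=None):
    """Return the pathSuffix of each entry in a LISTSTATUS-style dict.

    entry_type may be "FILE" or "DIRECTORY" to filter; None keeps everything.
    """
    statuses = listing.get("FileStatuses", {}).get("FileStatus", [])
    return [
        status["pathSuffix"]
        for status in statuses
        if entry_type is None or status.get("type") == entry_type
    ]


# Hypothetical sample shaped like a WebHDFS LISTSTATUS response.
sample_listing = {
    "FileStatuses": {
        "FileStatus": [
            {"pathSuffix": "logs", "type": "DIRECTORY", "length": 0},
            {"pathSuffix": "data.csv", "type": "FILE", "length": 1024},
        ]
    }
}

print(entry_names(sample_listing))
print(entry_names(sample_listing, "FILE"))
```

With a real connection you would pass in the value returned by HDFS_CONNECTION.list_dir instead of sample_listing.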

Pulling a single file down from Hadoop is straightforward. Notice that we import “FileNotFound”. The import is not strictly required, but “read_file” raises that exception when the file does not exist, so it is good practice to always catch it explicitly.

from pywebhdfs.errors import FileNotFound

try:
    file_data = HDFS_CONNECTION.read_file(##FILENAME##)
except FileNotFound as e:
    # The path does not exist in HDFS.
    print(e)
except Exception as e:
    # Any other failure, such as a connection or permission error.
    print(e)
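
If the goal is to save the file locally, something like the following could work. It is only a sketch and assumes read_file returns the file contents as bytes. download_file and FakeClient are hypothetical names, not part of pywebhdfs, and error handling is left to the caller.

```python
import os
import tempfile


def download_file(client, hdfs_path, local_path):
    """Read hdfs_path through client and write it to local_path.

    client is anything with a read_file(path) method, for example the
    HDFS_CONNECTION object. Returns the number of bytes written.
    """
    data = client.read_file(hdfs_path)
    if isinstance(data, str):
        # Assumption: contents normally arrive as bytes; encode if we got text.
        data = data.encode("utf-8")
    with open(local_path, "wb") as handle:
        handle.write(data)
    return len(data)


class FakeClient:
    """Stand-in for a real connection so the sketch runs without a cluster."""

    def read_file(self, path):
        return b"id,name\n1,example\n"


target = os.path.join(tempfile.mkdtemp(), "example.csv")
bytes_written = download_file(FakeClient(), "user/example/example.csv", target)
print(bytes_written)
```

Against a real cluster you would pass HDFS_CONNECTION in place of FakeClient() and wrap the call in the same FileNotFound handling used for read_file.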