Javascript: Map

You can use the map function to transform each element of an array and return a new array with the new data.

For example, if you want to build an array of controls you could do the following (note that returning markup like this assumes a JSX/React context).

  var newControls = myDataArray.map(function(rec, index) {
    return <div></div>;
  });
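
Outside of a JSX/React context, a plain-JavaScript sketch with made-up data looks like this; map returns a brand new array and leaves the original untouched:

  var myDataArray = [{ name: 'a' }, { name: 'b' }];
  var labels = myDataArray.map(function(rec, index) {
    return index + ': ' + rec.name;
  });
  // labels is now ['0: a', '1: b']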

 

AWS: Send Simple Email Service


If you want to send an email using AWS’ Simple Email Service (SES) then you need to do the following. This is a very basic example.

Import the following:

  import com.amazonaws.auth.AWSStaticCredentialsProvider;
  import com.amazonaws.auth.BasicAWSCredentials;
  import com.amazonaws.services.simpleemail.AmazonSimpleEmailService;
  import com.amazonaws.services.simpleemail.AmazonSimpleEmailServiceClientBuilder;
  import com.amazonaws.services.simpleemail.model.Body;
  import com.amazonaws.services.simpleemail.model.Content;
  import com.amazonaws.services.simpleemail.model.Destination;
  import com.amazonaws.services.simpleemail.model.Message;
  import com.amazonaws.services.simpleemail.model.SendEmailRequest;

Setup Connection to AWS Simple Email Service

  final AmazonSimpleEmailService simpleEmailService = AmazonSimpleEmailServiceClientBuilder.standard()
          .withRegion(myRegion)
          .withCredentials(new AWSStaticCredentialsProvider(new BasicAWSCredentials(accessKeyId, secretKey)))
          .build();

Setup Email:

  final SendEmailRequest request = new SendEmailRequest()
          .withDestination(new Destination().withToAddresses(TO))
          .withSource(FROM)
          .withMessage(new Message()
                  .withSubject(new Content().withCharset("UTF-8").withData(SUBJECT))
                  .withBody(new Body().withText(new Content().withCharset("UTF-8").withData(BODY))));

Send Email:

  simpleEmailService.sendEmail(request);
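
If you want to confirm the send, sendEmail returns a SendEmailResult (this also needs the com.amazonaws.services.simpleemail.model.SendEmailResult import). A minimal sketch, with the logging approach left up to you:

  final SendEmailResult result = simpleEmailService.sendEmail(request);
  // The returned message ID is useful for tracing the email later.
  System.out.println("SES message ID: " + result.getMessageId());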

AWS: Java Post to Kinesis Queue


Posting to an AWS Kinesis queue is rather simple and straightforward. As always, you should refer to the AWS documentation.

Put Multiple Records On Queue

Import the following

  import com.amazonaws.auth.AWSStaticCredentialsProvider;
  import com.amazonaws.auth.BasicAWSCredentials;
  import com.amazonaws.services.kinesis.AmazonKinesis;
  import com.amazonaws.services.kinesis.AmazonKinesisClientBuilder;
  import com.amazonaws.services.kinesis.model.PutRecordsRequest;
  import com.amazonaws.services.kinesis.model.PutRecordsRequestEntry;
  import com.amazonaws.services.kinesis.model.PutRecordsResult;

  import java.nio.ByteBuffer;
  import java.util.ArrayList;
  import java.util.List;

Put Records

  AmazonKinesisClientBuilder clientBuilder = AmazonKinesisClientBuilder.standard()
          .withRegion(myRegion)
          .withCredentials(new AWSStaticCredentialsProvider(new BasicAWSCredentials(myAccessKeyId, mySecretKey)));
  AmazonKinesis kinesisClient = clientBuilder.build();

  PutRecordsRequest putRecordsRequest = new PutRecordsRequest();
  putRecordsRequest.setStreamName(myQueue);
  List<PutRecordsRequestEntry> putRecordsRequestEntryList = new ArrayList<>();

  // You can add multiple entries at once if you want to
  PutRecordsRequestEntry putRecordsRequestEntry = new PutRecordsRequestEntry();
  putRecordsRequestEntry.setData(ByteBuffer.wrap(myData));
  putRecordsRequestEntry.setPartitionKey(myKey);
  putRecordsRequestEntryList.add(putRecordsRequestEntry);

  putRecordsRequest.setRecords(putRecordsRequestEntryList);
  PutRecordsResult putResult = kinesisClient.putRecords(putRecordsRequest);
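
Note that putRecords is not all-or-nothing; individual entries can be rejected. A hedged sketch of checking the result from above for partial failures (this also assumes the com.amazonaws.services.kinesis.model.PutRecordsResultEntry import):

  // Entries that failed have a non-null error code and can be collected and re-sent.
  if (putResult.getFailedRecordCount() > 0) {
      for (PutRecordsResultEntry entry : putResult.getRecords()) {
          if (entry.getErrorCode() != null) {
              // Log entry.getErrorMessage() and retry this record as needed.
          }
      }
  }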

Put Single Record On Queue

Import the following

  import com.amazonaws.auth.AWSStaticCredentialsProvider;
  import com.amazonaws.auth.BasicAWSCredentials;
  import com.amazonaws.services.kinesis.AmazonKinesis;
  import com.amazonaws.services.kinesis.AmazonKinesisClientBuilder;
  import com.amazonaws.services.kinesis.model.PutRecordRequest;
  import com.amazonaws.services.kinesis.model.PutRecordResult;

  import java.nio.ByteBuffer;

Put Record

  AmazonKinesisClientBuilder clientBuilder = AmazonKinesisClientBuilder.standard()
          .withRegion(myRegion)
          .withCredentials(new AWSStaticCredentialsProvider(new BasicAWSCredentials(myAccessKeyId, mySecretKey)));
  AmazonKinesis kinesisClient = clientBuilder.build();

  PutRecordRequest putRecordRequest = new PutRecordRequest();
  putRecordRequest.setStreamName(myQueue);
  putRecordRequest.setData(ByteBuffer.wrap(data.getBytes("UTF-8")));
  putRecordRequest.setPartitionKey(myKey);

  PutRecordResult putResult = kinesisClient.putRecord(putRecordRequest);
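
If you want to confirm where the record landed, the result exposes the shard and sequence number; a minimal sketch:

  // Both values are assigned by Kinesis and are handy for logging/debugging.
  String shardId = putResult.getShardId();
  String sequenceNumber = putResult.getSequenceNumber();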

You have now put your record(s) onto the queue. Congratulations!

HortonWorks: Kerberize Ambari Server


You may want to integrate Kerberos authentication into your Ambari Server implementation. If you do, follow the next few steps. It’s that easy.

Step 1: Stop Ambari Server

  sudo ambari-server stop

Step 2: Create keytab file

  ktutil
  addent -password -p ##USER##@##DOMAIN##.COM -k 1 -e RC4-HMAC
  # Enter the password when prompted
  wkt ##USER##.keytab
  q
  sudo mkdir /etc/security/keytabs
  sudo mv ##USER##.keytab /etc/security/keytabs

Step 3: Test the keytab. You should see the ticket when you run klist.

  kinit -kt /etc/security/keytabs/##USER##.keytab ##USER##@##DOMAIN##.COM
  klist

Step 4: Run Ambari Server Kerberos Setup

  sudo ambari-server setup-kerberos

Follow the prompts. Answer true to enabling Kerberos. The keytab file will be the /etc/security/keytabs/##USER##.keytab file you created above. You should be able to leave the rest as defaults. Save the settings and you are done.

Step 5: Remove the kinit ticket you created so that you can make sure your Kerberos authentication is working correctly.

  kdestroy

Step 6: Start Ambari Server

  sudo ambari-server start

Step 7: Validate Kerberos. You should see your ticket get created and you should now be able to login with no issues.

  klist

HortonWorks: Install YARN/MR


This tutorial guides you through installing YARN/MapReduce on Hortonworks using a multi-node cluster setup with Ubuntu OS.

Step 1: Go to “Stack and Version”. Then click “Add Service” on YARN. You will notice that “MapReduce2” comes with it.

Step 2: Assign Masters. I usually put the ResourceManager, History Server and App Timeline Server all on the secondary namenode. But it is totally up to you how you set up your environment.

Step 3: Assign Slaves and Clients. I put NodeManagers on all the datanodes and clients on all servers. Up to you though; this is what worked for me and my requirements.

Step 4: During Customize Services you may get a warning that the Ambari Metrics “hbase_master_heapsize” needs to be increased. I recommend making this change, but it’s up to you and what makes sense in your environment.

Step 5: Follow the remaining steps and the installation should complete with no issues. Should an issue arise, review the error; if it was just a transient connection error while services were starting, you may not have a real problem and it may just need all services to be stopped and started again. Please note Ambari Metrics may report errors, but they should clear in around 15 minutes.

 

HortonWorks: Install Hadoop


This tutorial guides you through installing Hadoop on Hortonworks using a multi-node cluster setup with Ubuntu OS.

Hosts File:

Ensure every server’s hosts file has the FQDN of all the servers that will be in the cluster.

  sudo nano /etc/hosts

SSH (Ambari Server)

Generate the key on the Ambari server and copy it to all the servers in the Hadoop cluster.

  ssh-keygen -t rsa -P ""
  cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
  ssh-copy-id -i ~/.ssh/id_rsa.pub ##USER##@##FQDN##
  ssh ##USER##@##FQDN##

Prerequisites (on the cluster nodes, not the Ambari Server):

Java 8:

  sudo apt-get install openjdk-8-jdk

Chrony:

  sudo apt-get install chrony

Disable Transparent HugePages:

  sudo su
  echo never > /sys/kernel/mm/transparent_hugepage/enabled
  exit

Install HDFS Service:

You will need to log in to the Ambari Server and click “Launch Install Wizard”. For the most part you just follow the prompts. The major hurdle is that in the “Install Options” section you must enter the FQDN of each host (e.g. host.domain.com). You will also need the SSH private key you generated on the Ambari Server during the prerequisites, located at /home/##USER##/.ssh/id_rsa. Make sure you also set the SSH user account to the one you used during SSH creation. If for any reason it fails you can click the status to find out what failed and rectify the problem. As long as you did the prerequisites you should be fine.

ZooKeeper / Ambari Metrics

As you install HDFS you will notice that Ambari Metrics and ZooKeeper get installed automatically. This is a good thing and you want it. ZooKeeper keeps all configs in sync and Ambari Metrics lets you easily monitor the system.

Assign Masters

You will need to setup how you want your masters to look. I usually have three zookeepers. Your secondary name node should go on a separate server. But it is totally up to you how you design your cluster. Have fun!

Assign Slaves / Clients

For slaves (aka DataNodes), I don’t put any on my namenode, secondary namenode or ZooKeeper servers; I leave the dedicated datanodes to perform that role alone. I also install clients on the namenode, secondary namenode and all datanodes. Up to you how you configure it; just have fun while doing it!

Key Config Optional Changes

Once you get to the Customize Services section, you can for the most part leave things as they are and just fill in the password fields. But I do recommend reviewing the following and updating as needed.

  1. HDFS: NameNode/DataNode/Secondary NameNode directories
  2. ZooKeeper: ZooKeeper directory

Deploy

Deploy should work with no issues. If there are issues, sometimes you don’t need to worry about them, such as a transient connection issue: as long as a component installed, even if it didn’t start right away because of that connection issue, it may start once the wizard completes. You should also note that Ambari Metrics shows errors directly after starting. That is expected and no need to worry; it will clear itself.

HortonWorks: SSL Setup


If you want to use SSL with Ambari Server (note this is not with Hadoop yet) then follow the steps below. Please note this does not cover the creation of an SSL cert, as there are many tutorials available on how to create self-signed certs, etc.

Step 1: Stop the Ambari Server

  sudo ambari-server stop

Step 2: Run Ambari Server Security Setup Command

  sudo ambari-server setup-security

Select option 1 during the prompts, and note that you cannot use port 443 for HTTPS as that is reserved in Ambari. The default is 8443 and that is what they recommend. Enter the path to your cert (e.g. /etc/ssl/certs/hostname.cer) and the path to your encrypted key (e.g. /etc/ssl/private/hostname.key). Follow the rest of the prompts.

Step 3: Start Ambari Server

  sudo ambari-server start

Step 4: Log in to Ambari Server, now available at https://hostname:8443

HortonWorks: Ambari LDAP Integration


If you want to use LDAP with your Ambari Server then follow the below steps.

Step 1: Stop the Ambari Server to set up LDAP integration

  sudo ambari-server stop

Step 2: Run Ambari Server LDAP Setup Command. This will require a bunch of settings to be set. Consult your IT department for your specific settings.

  sudo ambari-server setup-ldap

Step 3: Create the groups and users text files. Add the users you want, comma separated, to the users file and the groups, comma separated, to the groups file (see the example after the commands below).

  nano ~/groups.txt
  nano ~/users.txt
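
For example, the two files might look like the following (the group and user names here are made up for illustration; use your own LDAP groups and users):

  # groups.txt
  hadoop-admins,hadoop-users

  # users.txt
  jsmith,jdoe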

Step 4 (Optional): You may need to adjust the “SERVER_API_HOST” value to your Ambari server hostname. The default is 127.0.0.1, which is technically your host, but sometimes it complains and you need to make this modification.

  sudo nano /usr/lib/python2.6/site-packages/ambari_server/serverUtils.py

Step 5: Import Groups/Users from the text files created in step 3. You will need to start the ambari server first.

  sudo ambari-server start
  # Import groups
  sudo ambari-server sync-ldap --groups groups.txt
  # Import users
  sudo ambari-server sync-ldap --users users.txt

Step 6: Log in to Ambari and go to Manage Ambari, and you will see your new users and groups.

HortonWorks: Ambari Server Installation


I have been playing around with HortonWorks sandbox and thought it about time I attempt installation on a multi node cluster. Feel free to reach out to me for further support or information. I will be documenting more in the coming weeks.

Add Repo:

  sudo wget -O /etc/apt/sources.list.d/ambari.list http://public-repo-1.hortonworks.com/ambari/ubuntu16/2.x/updates/2.6.0.0/ambari.list
  sudo apt-key adv --recv-keys --keyserver keyserver.ubuntu.com B9733A7A07513CAD
  sudo apt-get update

Java 8

  sudo apt-get install openjdk-8-jdk

Ambari Install

  sudo apt-get install ambari-server

Configure Ambari

I recommend installing as non-root. In fact, do not accept any of the defaults for users and passwords. But it is totally up to you. I will document more as I learn more.

  sudo ambari-server setup

Ambari Start / Stop / Restart

  sudo ambari-server restart
  sudo ambari-server start
  sudo ambari-server stop

Ambari Server Log Directory

I suggest changing the ambari log directory as well.

  sudo vi /etc/ambari-server/conf/log4j.properties

  # Look for the property "ambari.log.dir" and change it.
  # Don't forget to create the folder you point to and chown it to the user that is running Ambari.

Ambari Agent Log Directory

Change the Ambari agent log directory as well.

  sudo vi /etc/ambari-agent/conf/ambari-agent.ini

  # Look for "logdir" and change it to a directory that exists and has the correct permissions.

Non-Root Install

When running as a non-root user you will have to change the run directory, because /var/run/ gets deleted and rebuilt on each system reboot, so the folders Ambari needs may fail to be created. Do the following:

  sudo mkdir -p /home/##NONROOTUSER##/run/ambari-server
  sudo chown ##NONROOTUSER##:root /home/##NONROOTUSER##/run/ambari-server

Then you have to edit the ambari.properties file.

  sudo vi /etc/ambari-server/conf/ambari.properties

  # Change the following properties to point at the folder you created above.
  bootstrap.dir
  pid.dir
  recommendations.dir

Postgres: dblink

Sometimes we need to connect to an external database from within a query. There is an easy way to do this; however, before doing so make sure it meets your business requirements.

  SELECT *
  FROM dblink('dbname=##DBNAME## port=##PORT## host=##DBSERVER## user=##DBUSER## password=##PWD##',
              'SELECT my, fields FROM ##TABLE##') AS remote_data(my integer, fields varchar);
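
Note that dblink ships as a PostgreSQL extension, so if it is not already enabled you will likely need to create it once per database (with sufficient privileges) before the query above will work:

  CREATE EXTENSION IF NOT EXISTS dblink;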

Hadoop: Secondary NameNode

By default a secondary namenode runs on the main namenode server. This is not ideal. A secondary namenode should be on its own server.

First bring up a new server that has the exact same configuration as the primary namenode.

Secondary NameNode:

  nano /usr/local/hadoop/etc/hadoop/hdfs-site.xml

Remove the properties “dfs.namenode.secondary.http-address” and “dfs.namenode.name.dir”, as they are not needed on the secondary namenode.

Then add the following property, making sure to change the value to the path where you will store your checkpoints.

  <property>
      <name>dfs.namenode.checkpoint.dir</name>
      <value>file:/usr/local/hadoop_store/data/checkpoint</value>
  </property>

NameNode:

  nano /usr/local/hadoop/etc/hadoop/hdfs-site.xml

Then add the following property, making sure to change ##SECONDARYNAMENODE## to your secondary namenode’s hostname.

  <property>
      <name>dfs.namenode.secondary.http-address</name>
      <value>##SECONDARYNAMENODE##:50090</value>
      <description>Your Secondary NameNode hostname for http access.</description>
  </property>

Now when you stop and start the cluster you will see the secondary namenode start on the secondary server and not on the primary namenode server. This is what you want.

 

Hadoop: Rack Awareness

If you want your multi-node cluster to be rack aware you need to do a few things. The following is to be done on the master (namenode) only.

  nano /home/myuser/rack.sh

With the following contents

  #!/bin/bash

  # Adjust/Add the property "net.topology.script.file.name"
  # to core-site.xml with the "absolute" path to this
  # file. ENSURE the file is "executable".

  # Supply appropriate rack prefix
  RACK_PREFIX=myrackprefix

  # To test, supply a hostname as script input:
  if [ $# -gt 0 ]; then

    CTL_FILE=${CTL_FILE:-"rack.data"}

    HADOOP_CONF=${HADOOP_CONF:-"/home/myuser"}

    if [ ! -f ${HADOOP_CONF}/${CTL_FILE} ]; then
      echo -n "/$RACK_PREFIX/rack "
      exit 0
    fi

    while [ $# -gt 0 ] ; do
      nodeArg=$1
      exec< ${HADOOP_CONF}/${CTL_FILE}
      result=""
      while read line ; do
        ar=( $line )
        if [ "${ar[0]}" = "$nodeArg" ] ; then
          result="${ar[1]}"
        fi
      done
      shift
      if [ -z "$result" ] ; then
        echo -n "/$RACK_PREFIX/rack "
      else
        echo -n "/$RACK_PREFIX/rack_$result "
      fi
    done

  else
    echo -n "/$RACK_PREFIX/rack "
  fi

Set execute permissions

  sudo chmod 755 rack.sh

Create the data file (rack.data, in the HADOOP_CONF directory referenced by the script) that has your rack information. You must be very careful not to have extra spaces between the host and the rack.

  namenode_ip 1
  secondarynode_ip 2
  datanode1_ip 1
  datanode2_ip 2

The last step is to update the core-site.xml file located in your Hadoop configuration directory.

  nano /usr/local/hadoop/etc/hadoop/core-site.xml

Add the following property, pointing at where your rack.sh file is located.

  <property>
      <name>net.topology.script.file.name</name>
      <value>/home/myuser/rack.sh</value>
  </property>
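
After restarting HDFS you can spot-check the mapping; assuming the hdfs command is on your path, printing the topology the namenode sees should show your rack prefixes:

  hdfs dfsadmin -printTopology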

Python: Run Process

If you want to run a jar from Python, or really any process, you do so by leveraging the subprocess package.

  from subprocess import Popen, PIPE

Then you need to call Popen. If you want to set the Java memory you can do so using -Xms and -Xmx between java and -jar, as in the sketch following the example below.

  # bufsize of 1 means line buffered (this only applies in text mode)
  # stdout and stderr are set to PIPE so you can read the process output
  # with shell=False the command must be passed as a list of arguments
  result = Popen(['java', '-jar', 'myapp.jar'], stdout=PIPE, stderr=PIPE, shell=False, bufsize=1)
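
As mentioned above, heap settings go between java and -jar; a sketch with hypothetical sizes (512 MB initial, 2 GB max):

  # The -Xms/-Xmx values here are placeholders; tune them for your application
  result = Popen(['java', '-Xms512m', '-Xmx2g', '-jar', 'myapp.jar'], stdout=PIPE, stderr=PIPE, shell=False, bufsize=1)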

If you want your process to wait until finished you will need to call wait.

  result.wait()

If you piped stderr and stdout then you can check the output.

  if result.stdout is not None:
      for line in result.stdout:
          print(line)

  if result.stderr is not None:
      for line in result.stderr:
          print(line)

Python: Logging

If you want to do some basic logging to a file, etc., you can use the logging package that comes with Python. Here are some of the basic ways to log.

You first have to import the package.

  import logging

You can setup your own logging configuration but for this we will just use the basic setup and log to a file.

  # If you are going to have multiple handlers you should reset the root handlers first
  logging.root.handlers = []

  # The file to log to (the file name here is just a placeholder)
  log_file = '/mnt/log/my_awesome_log.log'

  # Set up the config with the level to log up to
  logging.basicConfig(filename=log_file, level=logging.INFO)

Then you set up your logger.

  logger = logging.getLogger('my_awesome_log')

If you want your log to roll over after a certain size then you must add a RotatingFileHandler, which truncates the log and keeps backups. If you do not use the RotatingFileHandler then the log will grow until your drive runs out of space.

  from logging.handlers import RotatingFileHandler

  handler = RotatingFileHandler(log_file, maxBytes=1024, backupCount=1)
  logger.addHandler(handler)

If you also want to log to the console you will need to add an additional handler for the console, setting the level to log at.

  console = logging.StreamHandler()
  console.setLevel(logging.INFO)
  logger.addHandler(console)
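
With the handlers in place, writing to the log is just a matter of calling the level methods on your logger, for example:

  logger.info('Application started')
  logger.warning('Something looks off')
  logger.error('Something went wrong')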

That’s it: a basic example of how to use the logging package.

 

Python: Multiprocessing Pool

Sometimes we want to run a method using multiple processes because of a costly function. Below is an example of how you could do it. There are other APIs you could use, like ‘map’ (see the sketch after the example), but here is just one of them.

  from multiprocessing import Pool

  # Sets the pool to utilize 4 processes
  pool = Pool(processes=4)
  # Submits the function to the pool asynchronously
  result = pool.apply_async(func=my_method, args=("some_info",))
  # get() waits for the async call to finish and returns its result
  data = result.get()
  pool.close()
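
As noted above, map is another option; a minimal sketch that applies the same my_method to several inputs across the pool (the argument strings are placeholders):

  with Pool(processes=4) as pool:
      # Blocks until all items have been processed and returns the results in order
      results = pool.map(my_method, ['info_1', 'info_2', 'info_3'])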

Python: Selenium Tests

Selenium is a great way to test your UI. It is compatible with different browsers. I will show you two.

Gecko Driver Installation:

Make sure you are using the latest version. At the time of this writing it is 0.19.0.

  wget https://github.com/mozilla/geckodriver/releases/download/v0.19.0/geckodriver-v0.19.0-linux64.tar.gz
  sudo tar -xvzf geckodriver-v0.19.0-linux64.tar.gz
  sudo chmod +x geckodriver
  sudo cp geckodriver /usr/local/bin/

You can use phantomjs, firefox, chrome, etc.

PhantomJS Installation:

  # This assumes the phantomjs-2.1.1-linux-x86_64.tar.bz2 tarball has already been downloaded to the current directory
  sudo mv phantomjs-2.1.1-linux-x86_64.tar.bz2 /usr/local/share/.
  cd /usr/local/share/
  sudo tar xjf phantomjs-2.1.1-linux-x86_64.tar.bz2
  sudo ln -s /usr/local/share/phantomjs-2.1.1-linux-x86_64 /usr/local/share/phantomjs
  sudo ln -s /usr/local/share/phantomjs/bin/phantomjs /usr/local/bin/phantomjs

Firefox Installation:

  sudo apt-get update
  wget https://ftp.mozilla.org/pub/firefox/releases/50.0/linux-x86_64/en-US/firefox-50.0.tar.bz2
  sudo tar -xjf firefox-50.0.tar.bz2
  sudo rm -rf /opt/firefox
  sudo mv firefox /opt/firefox
  sudo mv /usr/bin/firefox /usr/bin/firefoxold
  sudo ln -s /opt/firefox/firefox /usr/bin/firefox

Firefox Headless Installation:

  sudo apt-get install xvfb
  pip3 install pyvirtualdisplay==0.2.1

Selenium Installation:

  pip3 install selenium==3.6.0

PyUnit Selenium Test Examples:

Setup:

  # If you are using headless Firefox
  from pyvirtualdisplay import Display
  # The selenium imports
  from selenium import webdriver
  from selenium.webdriver.common.by import By
  from selenium.webdriver.support import expected_conditions as EC
  from selenium.webdriver.support.ui import WebDriverWait
  import unittest, os, time


  class MySeleniumTests(unittest.TestCase):
      @classmethod
      def setUpClass(cls):
          cls.server_url = "http://" + os.getenv("WEBSITE_URL", 'localhost:5000')

      def setUp(self):
          # If you are using the headless Firefox browser
          display = Display(visible=0, size=(1080, 720))
          display.start()
          # Use ONE of the following two drivers:
          # Firefox selenium driver
          self.driver = webdriver.Firefox()
          # PhantomJS selenium driver
          #self.driver = webdriver.PhantomJS()
          self.driver.implicitly_wait(60)
          self.driver.set_page_load_timeout(60)
          self.driver.set_window_size(1080, 720)
          self.base_url = self.server_url

          self.driver.get(self.base_url + "/")
          # If your site has a login then you need to set the username and password first.
          self.driver.find_element_by_id("user").clear()
          self.driver.find_element_by_id("user").send_keys(USERNAME)
          self.driver.find_element_by_id("password").clear()
          self.driver.find_element_by_id("password").send_keys(PWD)
          self.driver.find_element_by_id("submit").click()
          time.sleep(1)

      def tearDown(self):
          self.driver.quit()


  if __name__ == "__main__":
      unittest.main()

Test Title:

  self.driver.get(self.server_url)
  self.assertIn("MySite", self.driver.title)

Find Class:

  self.assertTrue(WebDriverWait(self.driver, 10).until(EC.visibility_of_element_located((By.CLASS_NAME, "my-awesome-class"))))

Find ID:

  self.assertTrue(WebDriverWait(self.driver, 10).until(EC.visibility_of_element_located((By.ID, "myId"))))

Find Partial Text:

  self.driver.find_element_by_partial_link_text("My Text On Page")

Find Element Contains Text:

  self.assertTrue('MyText' in self.driver.find_element_by_id('container').text)

Click Element:

  self.driver.find_element_by_id('myId').click()

Wait Element To Show:

  self.assertTrue(WebDriverWait(self.driver, 10).until(EC.text_to_be_present_in_element((By.ID, 'MyID'), "Text To See")))

xPath Click Second Element:

  # XPath positions are 1-based, so [2] selects the second matching element
  self.driver.find_element_by_xpath("(//div[@class='my-awesome-class'])[2]").click()

Clear Input:

  1. self.driver.find_element_by_id("myId").clear()

Send Data To Input:

  1. self.driver.find_element_by_id("myId").send_keys('My New Data')

 

 

AWS: Java S3 Upload


If you want to push data to AWS S3 there are a few different ways of doing this. I will show you two ways I have used.

Option 1: putObject

  import com.amazonaws.AmazonClientException;
  import com.amazonaws.ClientConfiguration;
  import com.amazonaws.auth.AWSCredentialsProvider;
  import com.amazonaws.services.s3.AmazonS3;
  import com.amazonaws.services.s3.AmazonS3ClientBuilder;
  import com.amazonaws.services.s3.model.ObjectMetadata;

  import java.io.InputStream;

  ClientConfiguration config = new ClientConfiguration();
  config.setSocketTimeout(SOCKET_TIMEOUT);
  config.setMaxErrorRetry(RETRY_COUNT);
  config.setClientExecutionTimeout(CLIENT_EXECUTION_TIMEOUT);
  config.setRequestTimeout(REQUEST_TIMEOUT);
  config.setConnectionTimeout(CONNECTION_TIMEOUT);

  AWSCredentialsProvider credProvider = ...;
  String region = ...;

  AmazonS3 s3Client = AmazonS3ClientBuilder.standard().withCredentials(credProvider).withRegion(region).withClientConfiguration(config).build();

  InputStream stream = ...;
  String bucketName = .....;
  String keyName = ...;
  String mimeType = ...;

  // You use metadata to describe the data.
  final ObjectMetadata metaData = new ObjectMetadata();
  metaData.setContentType(mimeType);

  // There are overrides available. Find the one that suits what you need.
  try {
      s3Client.putObject(bucketName, keyName, stream, metaData);
  } catch (final AmazonClientException ex) {
      // Log the exception
  }

Option 2: MultiPart Upload

  import com.amazonaws.AmazonClientException;
  import com.amazonaws.ClientConfiguration;
  import com.amazonaws.auth.AWSCredentialsProvider;
  import com.amazonaws.event.ProgressEvent;
  import com.amazonaws.event.ProgressEventType;
  import com.amazonaws.event.ProgressListener;
  import com.amazonaws.services.s3.AmazonS3;
  import com.amazonaws.services.s3.AmazonS3ClientBuilder;
  import com.amazonaws.services.s3.model.ObjectMetadata;
  import com.amazonaws.services.s3.transfer.TransferManager;
  import com.amazonaws.services.s3.transfer.TransferManagerBuilder;
  import com.amazonaws.services.s3.transfer.Upload;

  import java.io.InputStream;

  ClientConfiguration config = new ClientConfiguration();
  config.setSocketTimeout(SOCKET_TIMEOUT);
  config.setMaxErrorRetry(RETRY_COUNT);
  config.setClientExecutionTimeout(CLIENT_EXECUTION_TIMEOUT);
  config.setRequestTimeout(REQUEST_TIMEOUT);
  config.setConnectionTimeout(CONNECTION_TIMEOUT);

  AWSCredentialsProvider credProvider = ...;
  String region = ...;

  AmazonS3 s3Client = AmazonS3ClientBuilder.standard().withCredentials(credProvider).withRegion(region).withClientConfiguration(config).build();

  InputStream stream = ...;
  String bucketName = .....;
  String keyName = ...;
  long contentLength = ...;
  String mimeType = ...;

  // You use metadata to describe the data. You need the content length so the multipart upload knows how big it is.
  final ObjectMetadata metaData = new ObjectMetadata();
  metaData.setContentLength(contentLength);
  metaData.setContentType(mimeType);

  TransferManager tf = TransferManagerBuilder.standard().withS3Client(s3Client).build();
  tf.getConfiguration().setMinimumUploadPartSize(UPLOAD_PART_SIZE);
  tf.getConfiguration().setMultipartUploadThreshold(UPLOAD_THRESHOLD);
  Upload xfer = tf.upload(bucketName, keyName, stream, metaData);

  ProgressListener progressListener = new ProgressListener() {
      public void progressChanged(ProgressEvent progressEvent) {
          if (xfer == null)
              return;
          if (progressEvent.getEventType() == ProgressEventType.TRANSFER_FAILED_EVENT || progressEvent.getEventType() == ProgressEventType.TRANSFER_PART_FAILED_EVENT) {
              // Log the message
          }
      }
  };

  xfer.addProgressListener(progressListener);
  xfer.waitForCompletion();

PIG: Testing

Apache PIG analyzes large data sets. There are a variety of ways of processing data using it. As I learn more about it I will put use cases below.

JSON:

  REGISTER 'hdfs:///elephant-bird-core-4.15.jar';
  REGISTER 'hdfs:///elephant-bird-hadoop-compat-4.15.jar';
  REGISTER 'hdfs:///elephant-bird-pig-4.15.jar';
  REGISTER 'hdfs:///json-simple-1.1.1.jar';

  loadedJson = LOAD '/hdfs_dir/MyFile.json' USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad') AS (json:map[]);
  rec = FOREACH loadedJson GENERATE json#'my_key' as (m:chararray);
  DESCRIBE rec;
  DUMP rec;

  --Store the results in a hdfs dir. You can have HIVE query that directory
  STORE rec INTO '/hdfs_dir' USING PigStorage();

Python: MRJob

If you use Hadoop and you want to run a MapReduce-type job using Python, you can use MRJob.

Installation:

  pip install mrjob

Here is an example that runs just the mapper step and loads a JSON file. yield writes the data out.

  from mrjob.job import MRJob
  from mrjob.step import MRStep
  import json


  class MRTest(MRJob):
      def steps(self):
          return [
              MRStep(mapper=self.mapper_test)
          ]

      def mapper_test(self, _, line):
          result = {}
          doc = json.loads(line)
          # Populate result from the parsed doc as needed; the key here is just a placeholder.
          key = None
          yield key, result


  if __name__ == '__main__':
      MRTest.run()
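
To run it, assuming the job above is saved as mr_test.py (a hypothetical file name) and your input is a file of JSON lines:

  # Run locally for a quick test
  python mr_test.py my_file.json

  # Run on the Hadoop cluster against data in HDFS
  python mr_test.py -r hadoop hdfs:///hdfs_dir/my_file.json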